Philip May - Data Science and IT#
I’m Philip May, a data science expert and open source enthusiast with an NLP focus. I am based in Germany and work for Deutsche Telekom.
This website is a mixture of documentation, blog and personal notes.
Blog#
18 November - Pandas apply
I often use Pandas to process NLP data. In many cases I want to create a new column from the information in an existing column, for example the number of characters or tokens.
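A minimal sketch of this pattern (the column names `text`, `num_chars`, and `num_tokens` are just illustrative):

```python
import pandas as pd

df = pd.DataFrame({"text": ["Hello world", "NLP with Pandas"]})

# Character count via the vectorized string accessor
df["num_chars"] = df["text"].str.len()

# Token count via apply with a simple whitespace split
df["num_tokens"] = df["text"].apply(lambda t: len(t.split()))
```

The vectorized `.str` accessor is usually preferable to `apply` when it covers the operation; `apply` remains the fallback for arbitrary per-row logic.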
12 October - Options for Date Encoding
Some data, such as strings, must be encoded to be used in machine learning models. Here we explore the different options for encoding date fields.
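One common option (not necessarily the one discussed in the article) is cyclical encoding, which maps a periodic field such as month or weekday onto the unit circle so that December and January end up close together:

```python
import math
from datetime import date

def cyclical_encode(value: float, period: float) -> tuple[float, float]:
    """Encode a cyclic value (e.g. month 1-12) as (sin, cos) on the unit circle."""
    angle = 2 * math.pi * value / period
    return math.sin(angle), math.cos(angle)

d = date(2021, 12, 31)
month_sin, month_cos = cyclical_encode(d.month, 12)   # month in 1..12
dow_sin, dow_cos = cyclical_encode(d.weekday(), 7)    # weekday in 0..6
```

Two output features per field are needed because sine alone is ambiguous (two angles share each sine value).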
23 July - Python Installation and Package Management with conda and pip
This article is about installing Python and package management. It is a subjective article and represents my own opinion and experience. The article is structured by several recommendations.
23 February - Anomalies in the MLSUM Dataset
While evaluating the ml6team/mt5-small-german-finetune-mlsum summarization model, my colleague Michal Harakal and I noticed that in many cases the model simply reproduces the first sentence of the input text instead of generating an independent summary of the whole text.
22 February - Clean German Wikipedia Text Corpus released
Today I published a new Wikipedia-based German text corpus. It is intended for NLP machine learning tasks.
20 February - LightGBM with Optuna: Demo released
This week I published a project that shows how to combine LightGBM and Optuna efficiently to train good models. It is meant to be reused as a template for new projects.
10 April - German colossal, cleaned Common Crawl corpus (GC4) released
Philipp Reißel (ambeRoad) and I published the largest German text corpus within the German NLP Group: the German colossal, cleaned Common Crawl corpus.
01 December - Training and Evaluation of our German Electra Language Model Talk
Together with Philipp Reißel from ambeRoad I gave a talk about the training and evaluation of our open-source German Electra NLP language model.
My Open Source Contributions#
This is an overview of my open source models, datasets, projects and contributions.
Models#
German Electra NLP model, joint work with Philipp Reißel (ambeRoad)
Talk about this model:
BEYOND BERT – Challenges and Potentials in the Training of German Language Models
These models are trained on our GC4 corpus.
Joint work with Stefan Schweter (schweter.ml) and Philipp Schmid (Hugging Face).
This model is intended to compute sentence (text) embeddings for English and German text. These embeddings can then be compared with cosine-similarity to find sentences with a similar semantic meaning.
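The comparison step mentioned above boils down to cosine similarity between embedding vectors. A minimal stdlib sketch (real pipelines would use the model's embeddings and a vectorized library instead of plain lists):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two embedding vectors: 1.0 = same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)
```

Sentences with similar meaning should yield embeddings whose cosine similarity is close to 1, while unrelated sentences score near 0.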
A bilingual summarization model for English and German. It is based on the multilingual T5 model google/mt5-small.
Datasets#
This is a German text corpus based on Common Crawl. Packed, the corpus is 454 GB; unpacked it is more than 1 TB. It has been cleaned up and preprocessed and can be used for various tasks in the NLP field. The dataset is joint work with Philipp Reißel (ambeRoad).
Machine-translated multilingual translations and the English original of the STSbenchmark dataset. The translation was done with deepl.com. This dataset is available on GitHub and as a Hugging Face Dataset.
This is a dataset of more than 21 million German paraphrases. These are text pairs that have the same meaning but are expressed with different words. This dataset can be used, for example, to train semantic text embeddings, e.g. with SentenceTransformers and the MultipleNegativesRankingLoss.
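The idea behind MultipleNegativesRankingLoss can be illustrated without the SentenceTransformers library: for a batch of (anchor, paraphrase) pairs, each anchor must rank its own paraphrase above all other paraphrases in the batch, which serve as in-batch negatives. A stdlib sketch of that loss (the `scale` value and the pure-Python style are illustrative, not the library's implementation):

```python
import math

def mnr_loss(anchors: list[list[float]],
             positives: list[list[float]],
             scale: float = 20.0) -> float:
    """Cross-entropy over in-batch cosine similarities; target for anchor i is positive i."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

    losses = []
    for i, a in enumerate(anchors):
        scores = [scale * cos(a, p) for p in positives]     # row of similarity matrix
        log_sum = math.log(sum(math.exp(s) for s in scores))
        losses.append(log_sum - scores[i])                  # -log softmax at target i
    return sum(losses) / len(losses)
```

When every anchor already matches its own paraphrase much better than the others, the loss approaches zero; training pushes the embeddings toward that state.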
Tools to extract and clean Wikipedia texts and transform them into a text corpus for self-supervised NLP model training. Also includes a prepared corpus for English and German.
This repository contains two datasets:
A labeled multi-domain (21 domains) German and English dataset with 25K user utterances for human-robot interaction. It is also available as a Hugging Face dataset: deutsche-telekom/NLU-Evaluation-Data-en-de
A dataset with 1,127 German sentence pairs with a similarity score. The sentences originate from the first dataset.
This is a few-shot training dataset from the domain of human-robot interaction. It contains texts in German and English with 64 different utterances (classes). Each utterance (class) has exactly 20 samples in the training set, for a total of 1,280 training samples.
The dataset is intended to benchmark the intent classifiers of chatbots in English and especially in German. It builds on our deutsche-telekom/NLU-Evaluation-Data-en-de dataset.
Projects#
Models and training code for cross-lingual sentence representations like T-Systems-onsite/cross-en-de-roberta-sentence-transformer
This Python package implements tools for LightGBM. In the current version lightgbm-tools focuses on binary classification metrics.
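As an illustration of what such binary classification metrics compute (this is a generic sketch, not the lightgbm-tools API):

```python
def binary_metrics(y_true: list[int], y_pred: list[int]) -> dict[str, float]:
    """Precision, recall and F1 for binary labels (1 = positive class)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}
```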
Tools for machine learning in cloud environments. At the moment it only provides a tool to easily handle Amazon S3.
This project uses the census income data and fits LightGBM models on it. It is not intended to produce outstanding results, but rather serves as a demo of the interaction between LightGBM, Optuna and HPOflow. The usage of HPOflow is optional and can be removed if desired. We also calculate the feature importances with SHAP (SHapley Additive exPlanations).
smart-prom-next is a Prometheus metric exporter for S.M.A.R.T. values of hard disks.
The MLflow Docker image.
MLflow does not provide an official Docker image. This project fills that gap.
Python tool to support lazy imports
This is Black for Python docstrings and reStructuredText (rst). It can be used to format docstrings (Google docstring format) in Python files or reStructuredText.
conda-forge release of Hyperopt
Pull Requests#
google-research/electra: add toggle to turn off strip_accents #88
opensearch-project/opensearch-py: add Sphinx to generate code documentation #112 - also see API Reference
hyperopt/hyperopt: add progressbar with tqdm #455
mlflow/mlflow: add possibility to use client cert. with tracking API #2843
Archived Projects#
Tools for Hugging Face / Transformers