This is an overview of my open source projects, contributions, models and datasets.
Projects
HPOflow
Tools for Optuna, MLflow and the integration of both
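As a hedged sketch of what this combination looks like in practice (using Optuna's built-in MLflowCallback rather than HPOflow's own API, which is not shown here; depending on the Optuna version, the callback may live in the separate optuna-integration package):

```python
import optuna
from optuna.integration import MLflowCallback

# Log every Optuna trial to MLflow; the tracking URI is a placeholder,
# adjust it to your MLflow server.
mlflc = MLflowCallback(tracking_uri="http://localhost:5000", metric_name="value")

def objective(trial):
    x = trial.suggest_float("x", -10.0, 10.0)
    return (x - 2.0) ** 2

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=20, callbacks=[mlflc])
```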
XLSR – Cross-Lingual Sentence Representations
Models and training code for cross-lingual sentence representations like
T-Systems-onsite/cross-en-de-roberta-sentence-transformer
Transformer-Tools
Tools for Hugging Face / Transformers
LightGBM Tools
This Python package implements tools for LightGBM.
In the current version, lightgbm-tools focuses on binary classification metrics.
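lightgbm-tools' own helpers are not reproduced here; as an illustration of the kind of metric handling it targets, this is a minimal custom F1 metric written against plain LightGBM's feval interface:

```python
import lightgbm as lgb
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score

def binary_f1(preds, eval_data):
    # For the built-in binary objective, LightGBM passes class-1 probabilities.
    labels = eval_data.get_label()
    pred_labels = (preds > 0.5).astype(int)
    return "f1", f1_score(labels, pred_labels), True  # higher is better

X, y = make_classification(n_samples=1000, random_state=42)
dtrain = lgb.Dataset(X, label=y)
lgb.train({"objective": "binary"}, dtrain, valid_sets=[dtrain], feval=binary_f1)
```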
ML-Cloud-Tools
Tools for machine learning in cloud environments.
At the moment it only provides a tool to easily handle Amazon S3.
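The package's API is not shown here; as a sketch of the underlying task, plain boto3 handles S3 like this (bucket and file names are placeholders):

```python
import boto3

s3 = boto3.client("s3")  # credentials come from the usual AWS config/env

# Upload a local file and download it again; names are placeholders.
s3.upload_file("model.pkl", "my-bucket", "models/model.pkl")
s3.download_file("my-bucket", "models/model.pkl", "model_copy.pkl")

# List everything under a prefix.
for obj in s3.list_objects_v2(Bucket="my-bucket", Prefix="models/").get("Contents", []):
    print(obj["Key"], obj["Size"])
```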
Census-Income with LightGBM and Optuna
This project uses the census income data and fits LightGBM models on it.
It is not intended to produce top-tier results, but rather serves as a demo of the interaction between LightGBM, Optuna and HPOflow. The usage of HPOflow is optional and can be removed if wanted.
We also calculate the feature importances with SHAP (SHapley Additive exPlanations).
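A minimal sketch of that interaction (not the project's actual code; the data loading via OpenML, the numeric-only feature selection and the parameter ranges are my assumptions):

```python
import lightgbm as lgb
import optuna
import shap
from sklearn.datasets import fetch_openml
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Census income data ("adult") from OpenML; the project's own loading may differ.
X, y = fetch_openml("adult", version=2, return_X_y=True, as_frame=True)
y = (y == ">50K").astype(int)
X = X.select_dtypes(exclude="category")  # numeric features only, to keep it short
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=42)

def objective(trial):
    params = {
        "objective": "binary",
        "verbosity": -1,
        "num_leaves": trial.suggest_int("num_leaves", 8, 128),
        "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.3, log=True),
    }
    booster = lgb.train(params, lgb.Dataset(X_train, label=y_train))
    return roc_auc_score(y_val, booster.predict(X_val))

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=20)

# SHAP feature importances for a model refit with the best parameters.
best = lgb.train({"objective": "binary", **study.best_params},
                 lgb.Dataset(X_train, label=y_train))
shap_values = shap.TreeExplainer(best).shap_values(X_val)
```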
S.M.A.R.T. Prometheus Metrics Exporter
smart-prom-next is a Prometheus metric exporter for
S.M.A.R.T. values of hard disks.
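smart-prom-next's internals are not shown here; this is a minimal sketch of how such an exporter can publish a gauge with the prometheus_client library (the temperature read-out is a placeholder; real values would come from e.g. smartctl):

```python
import time
from prometheus_client import Gauge, start_http_server

# Hypothetical metric; label the gauge per device.
disk_temp = Gauge("smart_disk_temperature_celsius",
                  "Disk temperature from S.M.A.R.T.", ["device"])

start_http_server(9100)  # metrics exposed at http://localhost:9100/metrics
while True:
    disk_temp.labels(device="/dev/sda").set(38.0)  # placeholder read-out
    time.sleep(30)
```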
MLflow Image
The MLflow Docker image.
MLflow does not provide an official Docker image. This project fills that gap.
Lazy-Imports
Python tool to support lazy imports
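The package's own API is not reproduced here; as a generic illustration of the lazy-import idea, Python's module-level __getattr__ (PEP 562) can defer a heavy import until first use:

```python
# mypackage/__init__.py - a generic lazy-import pattern (PEP 562),
# not necessarily how lazy-imports implements it.
import importlib

_LAZY_MODULES = {"heavy": "mypackage.heavy"}  # hypothetical submodule

def __getattr__(name):
    # Called only when normal attribute lookup fails, so the heavy
    # module is imported on first access instead of at package import.
    if name in _LAZY_MODULES:
        module = importlib.import_module(_LAZY_MODULES[name])
        globals()[name] = module  # cache so __getattr__ isn't hit again
        return module
    raise AttributeError(f"module {__name__!r} has no attribute {name!r}")
```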
Style-Doc
This is Black for Python docstrings and reStructuredText (rst). It can be used to format docstrings (Google docstring format) in Python files or reStructuredText.
PyCharm Community Edition IDE for Python with bundled JRE
An Arch Linux package (AUR)
conda-forge/hyperopt-feedstock
conda-forge release of Hyperopt
Models
T-Systems-onsite/cross-en-de-roberta-sentence-transformer
This model is intended to compute sentence (text) embeddings for English and German text. These embeddings can then be compared with cosine similarity to find sentences with similar semantic meaning.
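A minimal usage sketch with the sentence-transformers library (the snippet is mine, not taken from the model card):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("T-Systems-onsite/cross-en-de-roberta-sentence-transformer")

sentences = ["This is an example sentence.", "Das ist ein Beispielsatz."]
embeddings = model.encode(sentences)

# Cosine similarity between the English and the German sentence.
print(util.cos_sim(embeddings[0], embeddings[1]))
```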
german-nlp-group/electra-base-german-uncased
German Electra NLP model, joint work with Philipp Reißel (ambeRoad)
GermanT5/t5-efficient-gc4-german-base-nl36
The first pure German T5 model.
Joint work with Stefan Schweter (Bayerische Staatsbibliothek / Open Source @ DBMDZ) and Philipp Schmid (Hugging Face).
T-Systems-onsite/mt5-small-sum-de-en-v2
This is a bilingual summarization model for English and German.
It is based on the multilingual T5 model google/mt5-small.
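A minimal usage sketch with the Transformers pipeline API (generation parameters are illustrative):

```python
from transformers import pipeline

summarizer = pipeline("summarization", model="T-Systems-onsite/mt5-small-sum-de-en-v2")

text = "Put a longer English or German article here ..."
print(summarizer(text, max_length=80)[0]["summary_text"])
```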
Pull Requests
Hugging Face Transformers
- add classifier_dropout to classification heads: #12794
- add option for subword regularization in sentencepiece tokenizer: #11149, #11417
- add strip_accents to basic BertTokenizer: #6280
- refactor slow sentencepiece tokenizers and add tests: #11716, #11737
- more fixes and improvements
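Two of these additions surface as tokenizer arguments; a hedged usage sketch (model names and sampling values are illustrative):

```python
from transformers import AutoTokenizer, BertTokenizer

# strip_accents toggle on the basic BertTokenizer (added in #6280).
tok = BertTokenizer.from_pretrained("bert-base-german-cased", strip_accents=False)

# Subword regularization for sentencepiece tokenizers (added in #11149);
# the sampling parameters here are illustrative.
tok_sp = AutoTokenizer.from_pretrained(
    "xlm-roberta-base",
    sp_model_kwargs={"enable_sampling": True, "alpha": 0.1, "nbest_size": -1},
    use_fast=False,  # subword regularization needs the slow tokenizer
)
```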
Optuna
- add MLflow integration callback: #1028
- trial level suggest for same variable with different parameters give warning: #908
- more fixes and improvements
Sentence Transformers
- add callback so we can do pruning and check for nan values: #327
- add option to pass params to tokenizer: #342
- always store best_score: #439
- fix for OOM problems on GPU with large datasets: #525
SetFit - Efficient Few-shot Learning with Sentence Transformers
- add option to normalize embeddings #177
- add option to set samples_per_label #196
- add warmup_proportion param - make warmup_steps configurable #140
- add option to use amp / FP16 #134
- add num_epochs to train_step calculation #139
- add more loss function options #159
Other Fixes and Improvements
- google-research/electra: add toggle to turn off strip_accents #88
- opensearch-project/opensearch-py: add Sphinx to generate Code Documentation #112 - also see API Reference
- deepset-ai/FARM: various fixes and improvements
- hyperopt/hyperopt: add progressbar with tqdm #455
- mlflow/mlflow: add possibility to use client certificates with the tracking API #2843
Datasets
The German colossal, cleaned Common Crawl corpus (GC4 corpus)
This is a German text corpus which is based on Common Crawl.
The packed text corpus has a size of 454 GB; unpacked, it is more than 1 TB.
It has been cleaned up and preprocessed and can be used for various tasks in the NLP field.
The dataset is joint work with Philipp Reißel (ambeRoad).
STSb Multi MT
Machine-translated multilingual versions and the English original of the STSbenchmark dataset.
Translation has been done with deepl.com.
This dataset is available on GitHub and
as a Hugging Face Dataset.
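A minimal loading sketch with the datasets library ("de" is one of the language configs; the exact dataset id may vary with hub changes):

```python
from datasets import load_dataset

# Load the German portion; other language configs are available.
dataset = load_dataset("stsb_multi_mt", name="de", split="train")
print(dataset[0])  # a sentence pair plus its similarity score
```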
German Backtranslated Paraphrase Dataset
This is a dataset of more than 21 million German paraphrases.
These are text pairs that have the same meaning but are expressed with different words.
This dataset can be used, for example, to train semantic text embeddings. SentenceTransformers with the MultipleNegativesRankingLoss is one way to do this, as sketched below.
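A minimal training sketch using the classic sentence-transformers fit API (the base model and the example pairs are my assumptions; newer releases also offer a Trainer-based API):

```python
from sentence_transformers import InputExample, SentenceTransformer, losses
from torch.utils.data import DataLoader

model = SentenceTransformer("deepset/gbert-base")  # base model is an assumption

# Paraphrase pairs standing in for rows of the dataset.
train_examples = [
    InputExample(texts=["Das ist ein Beispiel.", "Dies ist ein Beispiel."]),
    InputExample(texts=["Ich mag Katzen.", "Katzen gefallen mir."]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)

# Each pair is a positive; other in-batch examples act as negatives.
train_loss = losses.MultipleNegativesRankingLoss(model)
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1)
```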
Wikipedia 2 Corpus
Tools to extract and clean the Wikipedia texts to transform them into a text corpus for self-supervised NLP model training.
A prepared corpus for English and German is also included.
NLU Evaluation Data - German and English + Similarity
This repository contains two datasets:
NLU-Data-Home-Domain-Annotated-All-de-en.csv: a labeled multi-domain (21 domains) German and English dataset with 25K user utterances for human-robot interaction. It is also available as a Hugging Face dataset: deutsche-telekom/NLU-Evaluation-Data-en-de
NLU-Data-Home-Domain-similarity-de.csv: German sentence pairs from NLU-Data-Home-Domain-Annotated-All-de-en.csv with semantic similarity scores.
deutsche-telekom/NLU-few-shot-benchmark-en-de
This is a few-shot training dataset from the domain of human-robot interaction.
It contains German and English texts with 64 different utterances (classes).
Each utterance (class) has exactly 20 samples in the training set.
This leads to a total of 1280 different training samples.
The dataset is intended to benchmark the intent classifiers of chatbots in English and especially in German. It builds on our deutsche-telekom/NLU-Evaluation-Data-en-de dataset.