Posts in 2022
Options for Date Encoding
Wednesday, October 12, 2022 in Blog
Some data, such as strings, must be encoded to be used in machine learning models. Here we explore the different options for encoding date fields. The general options to encode the time dimension like the birth …
Python Installation and Package Management with conda and pip
Saturday, July 23, 2022 in Blog
This article is about installing Python and managing packages. It is subjective and reflects my own opinion and experience. The article is structured as a series of recommendations. Recommendation 1: Never install Python. This sounds a bit …
Migration from Sphinx to Hugo
Sunday, July 17, 2022 in Blog
After a long period of dissatisfaction, I finally found a successor for my Sphinx-based website. The primary reason for my dissatisfaction with Sphinx was that it is not possible to integrate proper blog functionality. Now you could say that of …
Anomalies in the MLSUM Dataset
Wednesday, February 23, 2022 in Blog
While evaluating the ml6team/mt5-small-german-finetune-mlsum summarization model, my colleague Michal Harakal and I noticed that in many cases the model simply reproduces the first sentence of the input text. Instead, it should …
Clean German Wikipedia Text Corpus released
Tuesday, February 22, 2022 in Blog
Today I published a new Wikipedia-based German text corpus. It is intended for NLP machine learning tasks. The corpus is based on a database dump, which was unpacked with WikiExtractor. A script is then provided to split the texts into sentences. …
LightGBM with Optuna: Demo released
Sunday, February 20, 2022 in Blog
This week I published a project that shows how to combine LightGBM and Optuna efficiently to train good models. The project is meant to be reused as a template for new projects. LightGBM is a library that uses gradient-boosted …
Posts in 2021
German colossal, cleaned Common Crawl corpus (GC4) released
Saturday, April 10, 2021 in Blog
Philipp Reißel (ambeRoad) and I published the largest German text corpus within the German NLP Group: The German colossal, cleaned Common Crawl corpus (GC4) is a German text corpus based on Common Crawl. It has been cleaned and preprocessed and can be …
Posts in 2020
Talk: Training and Evaluation of our German Electra Language Model
Tuesday, December 01, 2020 in Blog
Together with Philipp Reissel from ambeRoad, I gave a talk about the training and evaluation of our open-source German Electra NLP language model. Here we explain how to train and use language models in general. We describe what is done differently with …