Posts by Philip May

Pandas apply

I often use Pandas to process NLP data. In many cases I want to create a new column from the information in an existing column, for example the number of characters or tokens in each text.
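A minimal sketch of that idea, assuming a small made-up DataFrame with a `text` column and whitespace splitting as a stand-in for a real tokenizer (both are illustrative assumptions, not taken from the article):

```python
import pandas as pd

# Hypothetical example data; the column name "text" is an assumption.
df = pd.DataFrame({"text": ["Hello world", "Pandas is handy for NLP data"]})

# Derive new columns from the existing one with Series.apply.
df["num_chars"] = df["text"].apply(len)

# Whitespace splitting is only a rough approximation of tokenization.
df["num_tokens"] = df["text"].apply(lambda t: len(t.split()))

print(df)
```

For simple string lengths, `df["text"].str.len()` would likely do the same job without `apply`; `apply` becomes more useful once the per-row function is a real tokenizer.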

Read more ...


Options for Date Encoding

Some data, such as strings, must be encoded to be used in machine learning models. Here we explore the different options for encoding date fields.
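To give a rough idea of what such an encoding can look like, here is a sketch of one common option: keeping the year as a plain number and mapping the month onto a circle with sine and cosine. The function name `encode_date` and the choice of features are assumptions for illustration and may differ from the options the article discusses.

```python
import math
from datetime import date


def encode_date(d: date) -> dict:
    """Encode a date as numeric features (hypothetical helper)."""
    # Map months 1..12 onto a circle so December and January end up close.
    angle = 2 * math.pi * (d.month - 1) / 12
    return {
        "year": d.year,
        "month_sin": math.sin(angle),
        "month_cos": math.cos(angle),
        "weekday": d.weekday(),  # Monday is 0, Sunday is 6
    }


features = encode_date(date(2021, 1, 15))
print(features)
```

The cyclical encoding avoids the artificial jump between month 12 and month 1 that a plain integer encoding would introduce.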

Read more ...


Python Installation and Package Management with conda and pip

This article is about installing Python and managing packages. It is subjective and reflects my own opinion and experience. It is structured as a series of recommendations.

Read more ...


Anomalies in the MLSUM Dataset

While evaluating the ml6team/mt5-small-german-finetune-mlsum summarization model, my colleague Michal Harakal and I noticed that in many cases the model simply reproduces the first sentence of the input text. Instead, it should generate an independent summary of the whole text.

Read more ...


Clean German Wikipedia Text Corpus released

Today I published a new Wikipedia-based German text corpus. It is intended for NLP machine learning tasks.

Read more ...


LightGBM with Optuna: Demo released

This week I published a project that shows how to combine LightGBM and Optuna efficiently to train good models. The project is meant to be reused as a template for new projects.

Read more ...


German colossal, cleaned Common Crawl corpus (GC4) released

Philipp Reißel (ambeRoad) and I published the largest German text corpus within the German NLP Group: the German colossal, cleaned Common Crawl corpus (GC4).

Read more ...


Training and Evaluation of our German Electra Language Model Talk

Together with Philipp Reissel from ambeRoad, I gave a talk about the training and evaluation of our open-source German Electra NLP language model.

Read more ...