Options for Date Encoding

Some data, such as strings, must be encoded to be used in machine learning models. Here we explore the different options for encoding date fields.

Photo by Behnam Norouzi on Unsplash

The general options for encoding a time dimension, such as the birth date of a customer or the production time of a product, are:

  1. separate encoding of year, month and maybe also day and weekday
  2. relative to a certain point in time in the past - e.g. number of days after January 1st 1900
  3. relative to “today” - e.g. number of days before today
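The three options can be sketched in a few lines of Python. This is only a minimal illustration using the standard library; the function names and the example dates are assumptions made for this sketch and not part of any particular library.

```python
from datetime import date


def encode_separate(d: date) -> dict:
    """Option 1: one feature per date component."""
    return {
        "year": d.year,
        "month": d.month,
        "day": d.day,
        "weekday": d.weekday(),  # Monday is 0, Sunday is 6
    }


def encode_since_epoch(d: date, epoch: date = date(1900, 1, 1)) -> int:
    """Option 2: number of days after a fixed point in the past."""
    return (d - epoch).days


def encode_before_today(d: date, today: date) -> int:
    """Option 3: number of days before "today" (the age in days)."""
    return (today - d).days


# Hypothetical production date and reference day, for illustration only.
production_date = date(2022, 1, 1)
print(encode_separate(production_date))
print(encode_since_epoch(production_date))
print(encode_before_today(production_date, today=date(2022, 1, 18)))
```

Note that option 3 takes "today" as an explicit argument instead of calling date.today() internally. This keeps the choice of the reference day visible, which matters for the training data as discussed further down.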

Pros and cons: separate encoding of year, month and maybe also day and weekday

If you believe in astrology this might be your favorite to encode a birth date since the month is preserved. If you want to encode a production date it also might be useful to encode the weekday. That is because there might be a relation between product quality in production and weekday. Parts manufactured on Mondays may have the most severe quality variations.

The disadvantage is that you need multiple columns to encode the date. Furthermore, this approach also suffers from a concept drift problem (see below).

Pros and cons: relative to a certain point in time in the past

This is easy to calculate because the “point in time in the past” (January 1st 1900 for example) is a fixed point in time. This contrasts with the encoding which is relative to “today”. But the problem with this encoding is the following:

Some outcomes are in reality not related to the date itself but to the age. The remaining service life of a technical device depends much more directly on its age than on its production date. Whether a customer is interested in an airplane trip or a train ticket also depends on age rather than on the date of birth. So, if you represent the date of birth relative to a fixed point in the past, the resulting model has a built-in concept drift.

For example, two predictions are made for the same person using the same date of birth: one in January 2022 and one in January 2023. At the second prediction the person is obviously one year older. But this is not visible in the encoding of the date of birth if it is encoded relative to a fixed point in the past. The model would therefore experience concept drift and would have to be retrained.
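The effect can be made visible with a small sketch. The birth date and the two prediction days are hypothetical values chosen for illustration. The feature relative to the fixed point in the past stays constant, while the age moves on by one year.

```python
from datetime import date

EPOCH = date(1900, 1, 1)          # fixed point in time in the past
birth_date = date(1990, 6, 15)    # hypothetical date of birth


def days_since_epoch(d: date) -> int:
    """Encoding relative to a fixed point in the past."""
    return (d - EPOCH).days


def age_in_days(d: date, today: date) -> int:
    """Encoding relative to "today"."""
    return (today - d).days


# Two predictions for the same person, one year apart.
for prediction_day in (date(2022, 1, 15), date(2023, 1, 15)):
    print(
        prediction_day,
        "since epoch:", days_since_epoch(birth_date),
        "age in days:", age_in_days(birth_date, prediction_day),
    )
```

The "since epoch" value is identical in both lines of output, so a model trained on it cannot see that the person has aged.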

Pros and cons: relative to “today”

This is the encoding of choice if there is a relation between age and prediction. It prevents the concept drift described above. The disadvantage of this encoding is that the reference day “today” is not fixed but changes constantly. So you have to be very careful how you set “today”.

A distinction must be made between generating the training data (including validation and test data) and making predictions in production. The production case is easy to understand: “today” is simply the day on which the prediction is made. Generating the training data is a bit more difficult. “Today” must not be the day on which the training data was generated. Instead, “today” is defined relative to the day on which the label was “created”. The easiest way to explain this is with an example:

We assume that we have parts (for example hard disks) that were produced on a certain date. We want to predict the point at which a hard disk has one week left to live. For this we know the times at which the hard disks failed. Let’s take a very simplified example:

The hard disk was manufactured on January 1, 2022 and failed on January 25, 2022. We want to create a training record that represents the day 7 days before the failure. The “today” in this case is January 18, 2022, so the production date of the hard disk, represented relative to “today”, is encoded as 17. This is exactly the age (in days) of the hard disk on the day “today”.
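The calculation for this example can be sketched as follows. The function name training_row and the dictionary keys are hypothetical and only serve the illustration. The point is that "today" is derived from the failure date (the label), not from the day on which the data set is built.

```python
from datetime import date, timedelta


def training_row(production_date: date, failure_date: date,
                 days_before_failure: int) -> dict:
    """Build one training record for a given distance to the failure."""
    # "Today" is anchored to the label, not to the wall-clock date.
    today = failure_date - timedelta(days=days_before_failure)
    return {
        "today": today,
        "age_days": (today - production_date).days,
        "days_to_failure": days_before_failure,
    }


row = training_row(
    production_date=date(2022, 1, 1),
    failure_date=date(2022, 1, 25),
    days_before_failure=7,
)
print(row["today"], row["age_days"])
```

For the hard disk above this yields January 18, 2022 as "today" and an age of 17 days.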


If your prediction is correlated with the age of something, choose the encoding we call “relative to today”. This removes the concept drift but is a bit tricky to calculate. If there is an actual relation to the birth or production date, month or the day of the week then encode this information as described above.

Python Installation and Package Management with conda and pip

This article is about installing Python and package management. It is a subjective article and represents my own opinion and experience. The article is structured as a series of recommendations.

Recommendation 1: Never install Python

This sounds a bit strange but the first recommendation is to never install Python itself. The reason is that otherwise you would commit to a single very concrete Python version. However, you don’t want that in principle, because there are different packages that have different version requirements.

But how do you install Python without installing it?

Recommendation 2: Use conda to install and manage Python

You should use conda to install and manage Python:

Conda is an open source package management system and environment management system that runs on Windows, macOS, Linux and z/OS.

To install conda you use Miniconda.

After installation, conda can now be used to create various so-called environments. In each of these environments you can install a different Python version. In addition, you can install other Python packages in the environments. You can switch between the environments with a single command and you can also delete them easily if necessary.

You can find more details about the installation and use of conda on my conda page.

Recommendation 3: Disable conda automatic base Activation

After the conda installation, the so-called base environment is automatically activated in every shell. If you now install a package - without explicitly activating another environment before - then the package will be installed into this base environment. This clutters up the base environment and is annoying. So to force an explicit environment activation you can disable conda automatic base activation. This is done with the following command: conda config --set auto_activate_base false

Recommendation 4: Never install Anaconda

Anaconda also includes conda. During the installation, however, numerous other packages are installed completely unnecessarily. This is the reason why Anaconda is just an unnecessary and completely bloated software that I cannot recommend to anyone. Nothing more needs to be said about this.

Recommendation 5: Do not use conda to install Packages

Conda can be used not only to manage environments and different Python versions, but also to install Python packages like NumPy or pandas.

Very soon after I started with Python and Data Science, I wrote my first bug reports for different Python packages. When you write such a bug report, the maintainers also ask for the version used. When I wrote “conda package version x.y” I always got the following knee-jerk answer: “Please install the pip version x.y of the package and try again.”

The reason is that the conda packages are potentially completely different from the pip packages. Many maintainers release a conda version of their software only unofficially or not at all. In that case the conda package is maintained by someone completely different.

Recommendation 6: Use pip to install Packages

To avoid the problem described above, I always use pip for package installation. Conda is then only used to create and manage the environments and to install Python. In some places on the internet you can read that there might be problems if you combine pip and conda. I cannot confirm this.

Migration from Sphinx to Hugo

After a long period of dissatisfaction, I finally found a successor for my Sphinx-based website. The primary reason for my dissatisfaction with Sphinx was that it is not possible to integrate proper blog functionality. Now you could say that there is ABlog for Sphinx, which provides blog functionality. However, ABlog seems to me to be poorly maintained. The GitHub site says:

This version is maintained with the aim to keep it working for SunPy’s website and thus new features or bugfixes are highly unlikely from the SunPy maintainers. […]

For some time now, I have been considering Hugo, especially because it is ranked very high on Jamstack.org, but also because it offers the possibility to integrate documentation and blog with each other. The only disadvantage of Hugo is that, unlike Sphinx, it cannot be used to document Python code.

After I had decided on Hugo, a much more complicated decision had to be made: which theme to use. The theme should not be a pure blog theme. It should make documentation and blog work together.

After some searching and trying I came across google/docsy. The functionality seems to be comprehensive and it is sufficiently well maintained. It is also used to document some other important projects. Such as:

Overall, I am very happy with the new solution. Perhaps the color scheme could be improved a bit. But that will all come later.

Anomalies in the MLSUM Dataset

While evaluating the ml6team/mt5-small-german-finetune-mlsum summarization model, my colleague Michal Harakal and I noticed that in many cases this model for summarization simply reproduces the first sentence of the input text. Instead, it should generate an independent summary of the whole text.

Photo by Sandy Millar

This extractive behavior should not happen with this model type. This is because it is an abstractive summarization model. Abstractive summarization methods attempt to create a summary by interpreting the input text using NLP techniques to generate a new, shorter text that contains the most important information from the original text.

We tried to get to the bottom of this behavior and looked at the training data set. The data set is called MLSUM. It is available in different languages. However, we have focused on the German language. Our analysis revealed the following:

The German training data contains 220,887 pairs of text and summary. In 126,270 of them (more than half) the summary is completely included in the text. This is very bad for training abstractive summarization models. The model does not learn to generate a summary but to extract a sentence. Since this sentence is usually even the first sentence, when using MLSUM one does not train a summarization model, but a “first sentence extractor”.
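Such a check can be sketched in plain Python. This is not the code from our evaluation; it is a minimal sketch that assumes "completely included" means an exact substring match after trimming whitespace. The example pairs are made up for illustration.

```python
from typing import Iterable, Tuple


def summary_is_contained(text: str, summary: str) -> bool:
    """True if the summary appears verbatim somewhere in the text."""
    return summary.strip() in text


def count_contained(pairs: Iterable[Tuple[str, str]]) -> int:
    """Count the (text, summary) pairs whose summary is part of the text."""
    return sum(summary_is_contained(text, summary) for text, summary in pairs)


# Hypothetical (text, summary) pairs, not taken from MLSUM.
pairs = [
    ("The mayor resigned on Monday. She gave no reason.",
     "The mayor resigned on Monday."),
    ("The team won the final after extra time. Fans celebrated all night.",
     "A late goal decided the final."),
]

print(count_contained(pairs), "of", len(pairs), "summaries are contained in the text")
```

A stricter or looser matching rule (for example ignoring case or punctuation) would change the count, so the exact figures depend on this choice.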

This is a good example of how dangerous it is to use supervised learning data in a careless way. Our solution was to simply remove the summary sentences from the text. So the text can be used for training after all.
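The removal step might look like the following sketch. It assumes that the summary occurs verbatim in the text and that collapsing whitespace afterwards is acceptable. The function name is hypothetical and the actual preprocessing code may differ in its details.

```python
def remove_summary_from_text(text: str, summary: str) -> str:
    """Delete verbatim occurrences of the summary from the text."""
    summary = summary.strip()
    if not summary:
        # Nothing to remove; avoid calling replace() with an empty pattern.
        return text
    cleaned = text.replace(summary, " ")
    # Collapse the whitespace left behind by the removal.
    return " ".join(cleaned.split())


text = "The mayor resigned on Monday. She gave no reason."
summary = "The mayor resigned on Monday."
print(remove_summary_from_text(text, summary))
```

Texts whose summary is not contained verbatim pass through with only their whitespace normalized.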

For better reproducibility, the code of this evaluation is published as a Colab Notebook. We have also published the resulting model, called deutsche-telekom/mt5-small-sum-de-en-v1, under an open-source license.

Clean German Wikipedia Text Corpus released

Today I published a new Wikipedia-based German text corpus. It is to be used for NLP machine learning tasks.

The corpus is based on a database dump. This was unpacked with WikiExtractor. A script is then provided to split the texts into sentences. This is done by using SoMaJo. Each line of the text corpus contains one single sentence. Wikipedia articles are separated by a blank line.
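Given this layout, the corpus can be read back into articles with a few lines of Python. This is a minimal sketch and not part of the published scripts. The function name iter_articles and the sample sentences are made up for illustration. With a real corpus file you would pass an open file object instead of the list.

```python
from typing import Iterable, Iterator, List


def iter_articles(lines: Iterable[str]) -> Iterator[List[str]]:
    """Yield one list of sentences per article (articles are blank-line separated)."""
    sentences: List[str] = []
    for line in lines:
        sentence = line.strip()
        if sentence:
            sentences.append(sentence)
        elif sentences:
            # A blank line closes the current article.
            yield sentences
            sentences = []
    if sentences:
        # The last article may not be followed by a blank line.
        yield sentences


# Hypothetical sample in the corpus layout: one sentence per line.
sample = [
    "Berlin ist die Hauptstadt von Deutschland.\n",
    "Die Stadt liegt an der Spree.\n",
    "\n",
    "Hof ist eine Stadt in Bayern.\n",
]

articles = list(iter_articles(sample))
print(len(articles), "articles,", sum(len(a) for a in articles), "sentences")
```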

For splitting sentences we have tested SoMaJo extensively and it produces better results than other much more popular classic NLP tools like spaCy.

Both the code for preprocessing and the corpus itself can be downloaded from GitHub here: https://github.com/GermanT5/wikipedia2corpus

LightGBM with Optuna: Demo released

This week I published a project that shows how to combine LightGBM and Optuna efficiently to train good models. It is intended to be reused as a template for new projects.

LightGBM is a library that uses gradient-boosted tree-based learning algorithms to train models on flat data. Compared to neural networks, it is very fast and, for many use cases, no worse. It is my preferred supervised learning tool for tabular data, which is basically all data that are not images, time series or texts.

Optuna is a framework for automatic hyperparameter optimization. Optuna is my favorite tool for hyperparameter optimization. The dataset used is the census income data. The task is to predict whether income exceeds $50K/yr.

On top of that we apply the SignificanceRepeatedTrainingPruner. This is a tool to detect bad hyperparameter sets as early as possible during cross validation. Thus, the cross validation can be aborted at an early stage and does not have to be run through completely. This procedure is commonly referred to as pruning. It saves time, money and CO2 and ultimately also allows a deeper search in the hyperparameter space. The use of the SignificanceRepeatedTrainingPruner is optional and can easily be omitted.
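The general idea of pruning during cross validation can be illustrated with a deliberately simplified sketch. This is not the SignificanceRepeatedTrainingPruner and it does not use Optuna. Instead of a significance test it applies a plain margin rule, and all names, scores and thresholds are assumptions made for this illustration. Higher scores are assumed to be better.

```python
from statistics import mean
from typing import Iterable, List, Optional, Tuple


def run_folds_with_pruning(
    fold_scores: Iterable[float],
    best_mean: Optional[float],
    margin: float = 0.02,
    min_folds: int = 2,
) -> Tuple[Optional[float], int]:
    """Consume fold scores one by one and stop early on clearly bad trials.

    Returns (mean score, folds evaluated); the mean is None if the trial was pruned.
    """
    seen: List[float] = []
    for score in fold_scores:
        seen.append(score)
        clearly_worse = (
            best_mean is not None
            and len(seen) >= min_folds
            and mean(seen) < best_mean - margin
        )
        if clearly_worse:
            return None, len(seen)  # abort the remaining folds
    return mean(seen), len(seen)


# Hypothetical per-fold scores of two hyperparameter sets.
good_trial = [0.86, 0.87, 0.85, 0.86, 0.86]
bad_trial = [0.70, 0.72, 0.71, 0.69, 0.70]

best, folds_good = run_folds_with_pruning(iter(good_trial), best_mean=None)
pruned, folds_bad = run_folds_with_pruning(iter(bad_trial), best_mean=best)
print(folds_good, folds_bad, pruned)
```

In this toy run the first trial uses all five folds, while the second one is stopped after two folds. The skipped folds are where the saved time comes from.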

So if you are curious now: The project is available on GitHub under the MIT license as part of the Deutsche Telekom open source effort: https://github.com/telekom/census-income-lightgbm

German colossal, cleaned Common Crawl corpus (GC4) released

Philipp Reißel (ambeRoad) and I published the largest German text corpus within the German NLP Group: the German colossal, cleaned Common Crawl corpus.

GC4 is a German text corpus based on Common Crawl. It has been cleaned and preprocessed and can be used for various tasks in NLP, for example for self-supervised training of language models.

The text corpus has a size of 454 GB packed. Unpacked it is more than 1 TB. This makes it the largest German language corpus. For comparison, the complete German Wikipedia pages are about 6.5 GB of text. The preprocessing took more than 50,000 CPU hours and about 400 TB of network traffic to the Common Crawl S3 bucket.

Many thanks to iisys (the Institute of Information Systems of Hof University) for hosting this dataset.

Talk: Training and Evaluation of our German Electra Language Model

Together with Philipp Reissel from ambeRoad I gave a talk about the training and evaluation of our open-source German Electra NLP language model.

Here we explain how to train and use language models in general. We describe what our model and its tokenizer do differently from other models, explain the composition of our text corpus and illustrate the evaluation and comparison of the language model.

Slides and video are available: