Python Installation and Package Management with conda and pip

This article is about installing Python and managing packages. It is a subjective article that reflects my own opinions and experience, and it is structured as a series of recommendations.

Recommendation 1: Never install Python

This may sound a bit strange, but the first recommendation is to never install Python itself. The reason is that doing so commits you to one very specific Python version. In principle, you do not want that, because different packages have different version requirements.

But how do you install Python without installing it?

Recommendation 2: Use conda to install and manage Python

You should use conda to install and manage Python:

Conda is an open source package management system and environment management system that runs on Windows, macOS, Linux and z/OS.

To install conda, use Miniconda.

After installation, conda can be used to create so-called environments. In each of these environments you can install a different Python version, as well as additional Python packages. You can switch between environments with a single command, and you can easily delete them if necessary.
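A typical workflow looks roughly like this (the environment name and Python version are just placeholders):

    conda create --name my-project python=3.11   # create an environment with its own Python
    conda activate my-project                     # switch into the environment
    conda deactivate                              # leave it again
    conda env remove --name my-project            # delete it when it is no longer needed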

You can find more details about installing and using conda on my conda page.

Recommendation 3: Disable conda automatic base Activation

After the conda installation, the so-called base environment is automatically activated in every shell. If you then install a package without explicitly activating another environment first, the package ends up in this base environment. That clutters the base environment and is annoying. To force an explicit environment activation, you can disable conda's automatic base activation with the following command: conda config --set auto_activate_base false

Recommendation 4: Never install Anaconda

Anaconda also includes conda. During its installation, however, numerous other packages are installed that are completely unnecessary. This is why Anaconda is unnecessary, completely bloated software that I cannot recommend to anyone. Nothing more needs to be said about it.

Recommendation 5: Do not use conda to install Packages

Conda can be used not only to manage environments and different Python versions, but also to install Python packages like NumPy or pandas.

Very soon after I started with Python and data science, I wrote my first bug reports for various Python packages. When you file such a bug report, the maintainers ask which version you used. Whenever I answered "conda package version x.y", I got the same knee-jerk reply: "Please install the pip version x.y of the package and try again."

The reason is that conda packages can be completely different from the pip packages. Many maintainers release a conda version of their software only unofficially or not at all; in that case, the conda package is maintained by someone else entirely.

Recommendation 6: Use pip to install Packages

To avoid the problem described above, I always use pip for package installation. Conda is then used only to create and manage the environments and to install Python. In some places on the internet you can read that combining pip and conda might cause problems. I cannot confirm this.
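In practice, the combination looks roughly like this (the environment name and the packages are just placeholders):

    conda create --name my-project python=3.11   # conda provides the environment and Python
    conda activate my-project
    pip install numpy pandas                      # the packages themselves are installed with pip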

Migration from Sphinx to Hugo

After a long period of dissatisfaction, I have finally found a successor for my Sphinx-based website. The primary reason for my dissatisfaction with Sphinx was that it is not possible to integrate proper blog functionality. You could object that there is ABlog for Sphinx, which provides blog functionality. However, ABlog seems poorly maintained to me. As its GitHub page states:

This version is maintained with the aim to keep it working for SunPy’s website and thus new features or bugfixes are highly unlikely from the SunPy maintainers. […]

For some time now, I have been considering Hugo, especially because it ranks very high on Jamstack.org, but also because it offers the possibility to integrate documentation and blog with each other. The only disadvantage of Hugo is that, unlike Sphinx, it cannot be used to document Python code.

After deciding on Hugo, however, a much more complicated decision had to be made: which theme to use. It should not be a pure blog theme; it should let documentation and blog work together.

After some searching and experimenting, I came across google/docsy. Its functionality seems comprehensive, it is sufficiently well maintained, and it is also used to document a number of other important projects.

Overall, I am very happy with the new solution. Perhaps the color scheme could be improved a bit. But that will all come later.

Anomalies in the MLSUM Dataset

While evaluating the ml6team/mt5-small-german-finetune-mlsum summarization model, my colleague Michal Harakal and I noticed that in many cases the model simply reproduces the first sentence of the input text instead of generating an independent summary of the whole text.

Photo by Sandy Millar

This extractive behavior should not happen with this type of model, because it is an abstractive summarization model. Abstractive summarization methods interpret the input text with NLP techniques and generate a new, shorter text that contains the most important information from the original.
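The behavior is easy to observe yourself. Here is a minimal sketch using the Hugging Face transformers pipeline; the input text and the generation parameters are only examples:

    from transformers import pipeline

    # Load the fine-tuned German summarization model from the Hugging Face Hub.
    summarizer = pipeline(
        "summarization",
        model="ml6team/mt5-small-german-finetune-mlsum",
    )

    text = "Hier steht ein längerer deutscher Nachrichtentext ..."  # placeholder input

    # In many cases the generated "summary" is just the first sentence of the input.
    summary = summarizer(text, max_length=80, min_length=10)[0]["summary_text"]
    print(summary)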

To get to the bottom of this behavior, we looked at the training dataset, MLSUM. It is available in several languages, but we focused on German. Our analysis revealed the following:

The German training data contains 220,887 pairs of text and summary. In 126,270 of them (more than half), the summary is contained verbatim in the text. This is very bad for training abstractive summarization models: the model does not learn to generate a summary but to extract a sentence. And since this sentence is usually even the first sentence, training on MLSUM does not produce a summarization model but a "first-sentence extractor".

This is a good example of how dangerous it is to use supervised learning data carelessly. Our solution was simply to remove the summary sentences from the text, so that the data can still be used for training.
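A simplified sketch of both the check and the fix, assuming the Hugging Face datasets version of MLSUM with its German configuration and its "text" and "summary" fields:

    from datasets import load_dataset

    # German portion of MLSUM; each example has a "text" and a "summary" field.
    train = load_dataset("mlsum", "de", split="train")

    # Count how many summaries are contained verbatim in their source text.
    contained = sum(1 for ex in train if ex["summary"] in ex["text"])
    print(f"{contained} of {len(train)} summaries are contained in the text")

    # Simplified version of the fix: strip the summary from the text
    # so the pair becomes genuinely abstractive.
    def remove_summary(example):
        example["text"] = example["text"].replace(example["summary"], " ").strip()
        return example

    cleaned = train.map(remove_summary)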

For better reproducibility, the code of this evaluation is published as a Colab notebook. We have also published the resulting model, deutsche-telekom/mt5-small-sum-de-en-v1, under an open-source license.

Clean German Wikipedia Text Corpus released

Today I published a new Wikipedia-based German text corpus. It is intended for NLP machine learning tasks.

The corpus is based on a Wikipedia database dump, which was extracted with WikiExtractor. A script is provided to split the texts into sentences using SoMaJo. Each line of the text corpus contains a single sentence, and the articles are separated by blank lines.

For sentence splitting we tested SoMaJo extensively; it produces better results than much more popular classic NLP tools such as spaCy.
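Sentence splitting with SoMaJo looks roughly like this; a minimal sketch based on SoMaJo's documented API, with a placeholder input text:

    from somajo import SoMaJo

    # German tokenizer and sentence splitter ("de_CMC" is the German model).
    tokenizer = SoMaJo("de_CMC")

    paragraphs = [
        "Das ist der erste Satz. Hier folgt ein zweiter Satz aus demselben Absatz."
    ]

    # tokenize_text() yields one list of tokens per detected sentence.
    for sentence in tokenizer.tokenize_text(paragraphs):
        print(" ".join(token.text for token in sentence))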

Both the code for preprocessing and the corpus itself can be downloaded from GitHub here: https://github.com/GermanT5/wikipedia2corpus

LightGBM with Optuna: Demo released

This week I published a project that shows how to combine LightGBM and Optuna efficiently to train good models. It is intended to serve as a reusable template for new projects.

LightGBM is a library that uses gradient-boosted, tree-based learning algorithms to train models on flat data. Compared to neural networks it is very fast and, for many use cases, no worse. It is my preferred supervised learning tool for tabular data, i.e. basically all data that are not images, time series, or text.
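For illustration, a minimal sketch using LightGBM's scikit-learn style interface on synthetic tabular data (not the project's actual code):

    import lightgbm as lgb
    from sklearn.datasets import make_classification
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    # Synthetic tabular ("flat") data for illustration only.
    X, y = make_classification(n_samples=5000, n_features=20, random_state=42)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

    # Gradient-boosted decision trees via the scikit-learn style API.
    model = lgb.LGBMClassifier(n_estimators=300, learning_rate=0.05)
    model.fit(X_train, y_train)

    print("accuracy:", accuracy_score(y_test, model.predict(X_test)))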

Optuna is a framework for automatic hyperparameter optimization and my favorite tool for this job. The dataset used is the census income data; the task is to predict whether a person's income exceeds $50K per year.

On top of that, we apply the SignificanceRepeatedTrainingPruner. It is a tool that detects bad hyperparameter sets as early as possible during cross-validation, so the cross-validation can be aborted early and does not have to run through completely. This procedure is commonly referred to as pruning. It saves time, money, and CO2, and ultimately also allows a deeper search of the hyperparameter space. The use of the SignificanceRepeatedTrainingPruner is optional and can easily be omitted.
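The overall structure looks roughly like the sketch below. It is only an illustration: it uses synthetic data instead of the census income data and one of Optuna's built-in pruners instead of the SignificanceRepeatedTrainingPruner, but the idea of reporting per-fold scores and aborting bad trials early is the same.

    import lightgbm as lgb
    import numpy as np
    import optuna
    from sklearn.datasets import make_classification
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import StratifiedKFold

    # Synthetic stand-in for the census income data.
    X, y = make_classification(n_samples=5000, n_features=20, random_state=42)

    def objective(trial):
        params = {
            "n_estimators": trial.suggest_int("n_estimators", 100, 1000),
            "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.3, log=True),
            "num_leaves": trial.suggest_int("num_leaves", 15, 255),
            "min_child_samples": trial.suggest_int("min_child_samples", 5, 100),
        }
        scores = []
        cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
        for fold, (train_idx, val_idx) in enumerate(cv.split(X, y)):
            model = lgb.LGBMClassifier(**params)
            model.fit(X[train_idx], y[train_idx])
            scores.append(accuracy_score(y[val_idx], model.predict(X[val_idx])))

            # Report the running mean after each fold; bad trials can be pruned
            # before the cross-validation has run through completely.
            trial.report(float(np.mean(scores)), step=fold)
            if trial.should_prune():
                raise optuna.TrialPruned()

        return float(np.mean(scores))

    study = optuna.create_study(direction="maximize", pruner=optuna.pruners.MedianPruner())
    study.optimize(objective, n_trials=50)
    print(study.best_params)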

If you are curious: the project is available on GitHub under the MIT license as part of the Deutsche Telekom open source effort: https://github.com/telekom/census-income-lightgbm

German colossal, cleaned Common Crawl corpus (GC4) released

Philipp Reißel (ambeRoad) and I published the largest German text corpus within the German NLP Group: the German colossal, cleaned Common Crawl corpus (GC4).

GC4 is a German text corpus based on Common Crawl. It has been cleaned and preprocessed and can be used for various NLP tasks, for example the self-supervised training of language models.

The compressed text corpus is 454 GB in size; uncompressed it is more than 1 TB. This makes it the largest German-language corpus. For comparison, the complete German Wikipedia amounts to about 6.5 GB of text. The preprocessing took more than 50,000 CPU hours and about 400 TB of network traffic to the Common Crawl S3 bucket.

Many thanks to iisys (the Institute of Information Systems at Hof University) for hosting this dataset.

Talk: Training and Evaluation of our German Electra Language Model

Together with Philipp Reissel from ambeRoad I gave a talk about the training and evaluation of our open-source German Electra NLP language model.

In it, we explain how to train and use language models in general, describe what our model and tokenizer do differently from other models, explain the composition of our text corpus, and illustrate how the language model was evaluated and compared with others.

Slides and a video recording are available.