German colossal, cleaned Common Crawl corpus (GC4) released

Philipp Reißel (ambeRoad) and I have published the largest German text corpus within the German NLP Group: the German colossal, cleaned Common Crawl corpus (GC4).

GC4 is a German text corpus based on Common Crawl. It has been cleaned and preprocessed and can be used for various NLP tasks, for example the self-supervised training of language models.

The corpus is 454 GB packed and more than 1 TB unpacked, which makes it the largest German-language corpus. For comparison, the complete German Wikipedia amounts to about 6.5 GB of text. Preprocessing took more than 50,000 CPU hours and generated about 400 TB of network traffic to the Common Crawl S3 bucket.

Many thanks to iisys (the Institute of Information Systems at Hof University) for hosting this dataset.