This is Philip May’s page. It is about the following topics:
News about My Work
April 18, 2021: The German NLP Group released the German colossal, cleaned Common Crawl corpus (GC4 corpus), a preprocessed German text corpus (450 GB zipped) based on Common Crawl. Many thanks to iisys (the Institute of Information Systems at Hof University) for hosting this dataset.
March 29, 2021: Added the machine-translated multilingual STS benchmark (STSb) dataset, called stsb_multi_mt, as a HuggingFace dataset: https://huggingface.co/datasets/stsb_multi_mt
March 21, 2021: Started website for German NLP Group as a GitHub Page: https://german-nlp-group.github.io/
March 20, 2021: Published the machine-translated multilingual STS benchmark (STSb) dataset under an open-source license: PhilipMay/stsb-multi-mt
February 22, 2021: Together with Philipp Reissel from ambeRoad, I released the second version of our German Electra NLP language model. Training was continued for an additional 734,000 steps, to a total of 1,500,000 steps. An extensive evaluation on the GermEval18 Coarse dataset shows that this is the best-performing German language model of its size: german-nlp-group/electra-base-german-uncased
December 1, 2020: Together with Philipp Reissel from ambeRoad I gave a talk about the training and evaluation of our open-source German Electra NLP language model. Here is the recording on YouTube: https://www.youtube.com/watch?v=cxgrTd2AQis and the
October 24, 2020: Published a cross-lingual model for English and German semantic sentence embeddings under an open-source license: T-Systems-onsite/cross-en-de-roberta-sentence-transformer
September 22, 2019: On arXiv.org I published my first paper: Improved Image Augmentation for Convolutional Neural Networks by Copyout and CopyPairing
Feedback and Questions
If you have questions or comments about this content, please feel free to open a GitHub issue.