NLP Datasets
Links
- XGLUE: https://microsoft.github.io/XGLUE/
- XNLI: https://github.com/facebookresearch/XNLI
  - XNLI 1.0
    - multilingual
    - 2,490 dev and 5,010 test examples per language
  - XNLI-15way: 10,000 multilingual parallel sentences without labels
- Stanford Natural Language Inference (SNLI) Corpus: https://nlp.stanford.edu/projects/snli/
- The Multi-Genre NLI Corpus: https://cims.nyu.edu/~sbowman/multinli/
- PAWS: https://github.com/google-research-datasets/paws
- PAWS-X: https://github.com/google-research-datasets/paws/tree/master/pawsx
  - languages: de, en, es, fr, ja, ko, zh
  - 49,401 train, 2,000 dev, 2,000 test; sizes can differ slightly between languages
  - columns: sentence1, sentence2, label
  - dev and test sets overlap!
  - label: binary
- STS benchmark
  - Original dataset: https://ixa2.si.ehu.eus/stswiki/index.php/STSbenchmark
  - STSb Multi MT: https://github.com/PhilipMay/stsb-multi-mt
  - Paper: https://arxiv.org/abs/1708.00055
- CORD19STS
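
The PAWS-X note above says that dev and test overlap. A minimal sketch of how such an overlap could be checked is shown here, assuming the splits are TSV files with a header row that contains the `sentence1`, `sentence2` and `label` columns. The helper names (`read_pairs`, `normalize`, `overlapping_pairs`) and the sample rows are made up for illustration; they are not part of PAWS-X or its tooling.

```python
import csv
import io


def read_pairs(tsv_text):
    """Parse TSV text whose header row includes sentence1, sentence2, label."""
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    return [(row["sentence1"], row["sentence2"], row["label"]) for row in reader]


def normalize(sentence):
    """Lowercase and collapse whitespace so trivial differences do not hide duplicates."""
    return " ".join(sentence.lower().split())


def overlapping_pairs(dev_rows, test_rows):
    """Return the test rows whose (sentence1, sentence2) pair also occurs in dev."""
    dev_keys = {(normalize(s1), normalize(s2)) for s1, s2, _ in dev_rows}
    return [
        row
        for row in test_rows
        if (normalize(row[0]), normalize(row[1])) in dev_keys
    ]


# Tiny made-up sample, NOT real PAWS-X data.
DEV_TSV = (
    "sentence1\tsentence2\tlabel\n"
    "A cat sat.\tA cat was sitting.\t1\n"
    "He left early.\tShe left early.\t0\n"
)
TEST_TSV = (
    "sentence1\tsentence2\tlabel\n"
    "a cat  sat.\tA cat was sitting.\t1\n"
    "It rained.\tIt snowed.\t0\n"
)

dev_rows = read_pairs(DEV_TSV)
test_rows = read_pairs(TEST_TSV)
shared = overlapping_pairs(dev_rows, test_rows)
print(len(shared), "of", len(test_rows), "test pairs also appear in dev")
```

To run this on real data, the two in-memory strings would be replaced by the contents of the dev and test files for one language. Whether to normalize case and whitespace before comparing is a design choice; exact string matching is stricter and may report fewer overlaps.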
German Sentiment
- Model: https://huggingface.co/oliverguhr/german-sentiment-bert
- Paper: http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.202.pdf
- GitHub: https://github.com/oliverguhr/german-sentiment
- Data: https://github.com/oliverguhr/german-sentiment/issues/3#issuecomment-700942718
- More Data: https://github.com/WladimirSidorenko/CGSA
More
Last modified July 16, 2022: fix headlines in ML doc (a533e7c)