NLP Datasets
Links
- XGLUE: https://microsoft.github.io/XGLUE/
  - License: non-commercial research purposes only
- XNLI: https://github.com/facebookresearch/XNLI
  - XNLI 1.0
    - multilingual (15 languages)
    - 2,490 dev and 5,010 test examples per language
  - XNLI-15way: 10,000 multilingual parallel sentences, without labels
  - License: non-commercial
- Stanford Natural Language Inference (SNLI) Corpus: https://nlp.stanford.edu/projects/snli/
  - English language
  - 3 classes: neutral, contradiction, entailment
- The Multi-Genre NLI Corpus: https://cims.nyu.edu/~sbowman/multinli/
  - English language
  - 3 classes: neutral, contradiction, entailment
- PAWS: https://github.com/google-research-datasets/paws
- PAWS-X: https://github.com/google-research-datasets/paws/tree/master/pawsx
  - Languages: de, en, es, fr, ja, ko, zh
  - 49,401 train, 2,000 dev, 2,000 test examples per language; sizes can differ slightly between languages
  - Columns: sentence1, sentence2, label
  - Note: the dev and test sets overlap!
  - Label: binary (paraphrase or not paraphrase)
- STS benchmark
  - Original dataset: https://ixa2.si.ehu.eus/stswiki/index.php/STSbenchmark
  - STSb Multi MT: https://github.com/PhilipMay/stsb-multi-mt
  - Paper: https://arxiv.org/abs/1708.00055
- CORD19STS
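
The PAWS-X note above about overlapping dev and test sets can be checked with a few lines of Python. This is a minimal sketch, not an official tool: it assumes the splits are tab-separated text with a header row containing `sentence1` and `sentence2` columns (as listed above), and the helper names `read_pairs` and `split_overlap` are made up for illustration. The sample data is invented and only demonstrates the mechanics.

```python
import csv
import io


def read_pairs(tsv_text):
    """Parse PAWS-X style TSV text into a set of (sentence1, sentence2) pairs.

    Assumption: the first row is a header that names the columns
    `sentence1` and `sentence2`.
    """
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    return {(row["sentence1"], row["sentence2"]) for row in reader}


def split_overlap(dev_text, test_text):
    """Return the sentence pairs and the single sentences found in both splits."""
    dev_pairs = read_pairs(dev_text)
    test_pairs = read_pairs(test_text)
    shared_pairs = dev_pairs & test_pairs
    # Also compare individual sentences, since a sentence can reappear
    # in a different pairing.
    dev_sentences = {s for pair in dev_pairs for s in pair}
    test_sentences = {s for pair in test_pairs for s in pair}
    return shared_pairs, dev_sentences & test_sentences


# Tiny invented sample, not real PAWS-X data.
DEV_SAMPLE = (
    "id\tsentence1\tsentence2\tlabel\n"
    "1\tA met B.\tB met A.\t1\n"
    "2\tC left D.\tD left C.\t0\n"
)
TEST_SAMPLE = (
    "id\tsentence1\tsentence2\tlabel\n"
    "1\tA met B.\tB met A.\t1\n"
    "2\tE saw F.\tF saw E.\t0\n"
)

pairs, sentences = split_overlap(DEV_SAMPLE, TEST_SAMPLE)
print(len(pairs), len(sentences))  # prints "1 2"
```

To run this on the real data, read each split file as UTF-8 text and pass the contents to `split_overlap`. If the overlap matters for an evaluation, the shared pairs can be dropped from one of the splits before scoring.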
German Sentiment
- Model: https://huggingface.co/oliverguhr/german-sentiment-bert
- Paper: http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.202.pdf
- GitHub: https://github.com/oliverguhr/german-sentiment
- Data: https://github.com/oliverguhr/german-sentiment/issues/3#issuecomment-700942718
- More Data: https://github.com/WladimirSidorenko/CGSA
More
Last modified April 9, 2023: Update nlp-datasets.md (cbc7daf)