Awesome Resources for IndicNLP
Common Resources
OPUS - the open parallel corpus
A Dravidian Etymological Dictionary
Byte Pair Encoding - Pretrained for 275 languages
FastText word vectors for 157 languages
Indian Language Technology Proliferation and Deployment Center
Center For Indian Language Technology - CFILT FB page
Indian Institute of Language Studies (IILS)
Central Institute of Indian Languages
Research Papers
Survey: Natural Language Parsing For Indian Languages
Language Specific
Malayalam
mlmorph - Malayalam Morphological Analyzer using Finite State Transducer
Tamil
Datasets
Other projects
Open Tamil - Suite of tools for operating on Tamil text.
Tokenizer, Language model and Classifier for Tamil language by Ravi Annaswamy
Scrapers
ML models
Text Classification model in PyTorch: can be easily applied to other datasets; in fact, the linked repository also contains a dataset of film reviews in Tamil.
Bengali
- Contains Wikipedia Articles Dataset (72,374 articles) and scripts which were used to scrape Wikipedia and clean that dataset
- Contains Language Model with Perplexity ~41
- Contains Bengali News Classification Model with 94% accuracy
Scrapers
Telugu
Telugu-NLP - Contains NLP tools developed for Telugu
Research Papers and Data
Research Papers in Bengali NLP
Collection of Repositories
Language | Repository | Perplexity of Language model | Wikipedia Articles Dataset | Classification accuracy | Classification Kappa score |
---|---|---|---|---|---|
Hindi | NLP for Hindi | ~36 | 55,000 articles | ~79 (News Classification) | ~30 (Movie Review Classification) |
Punjabi | NLP for Punjabi | ~13 | 44,000 articles | ~89 (News Classification) | ~60 (News Classification) |
Sanskrit | NLP for Sanskrit | ~6 | 22,273 articles | ~70 (Shloka Classification) | ~56 (Shloka Classification) |
Gujarati | NLP for Gujarati | ~34 | 31,913 articles | ~91 (News Classification) | ~85 (News Classification) |
Kannada | NLP for Kannada | ~70 | 32,997 articles | ~94 (News Classification) | ~90 (News Classification) |
Malayalam | NLP for Malyalam | ~26 | 12,388 articles | ~94 (News Classification) | ~91 (News Classification) |
Nepali | NLP for Nepali | ~32 | 38,757 articles | ~97 (News Classification) | ~96 (News Classification) |
Odia | NLP for Odia | ~27 | 17,781 articles | ~95 (News Classification) | ~92 (News Classification) |
Marathi | NLP for Marathi | ~18 | 85,537 articles | ~91 (News Classification) | ~84 (News Classification) |
Bengali | NLP for Bengali | ~41 | 72,374 articles | ~94 (News Classification) | ~92 (News Classification) |
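
For readers unfamiliar with the metrics in the table above, here is a minimal sketch of how perplexity, accuracy, and Cohen's kappa are commonly computed. It is not taken from the linked repositories: the function names and the toy data are hypothetical, perplexity is assumed to be the exponential of the mean negative log-likelihood per token, and the accuracy and kappa values in the table are assumed to be scaled by 100.

```python
import math
from collections import Counter


def perplexity(token_log_probs):
    """Perplexity as exp of the mean negative natural-log probability per token."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the reference labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def cohen_kappa(y_true, y_pred):
    """Agreement corrected for chance: (p_o - p_e) / (1 - p_e).

    Undefined (division by zero) when chance agreement p_e equals 1,
    i.e. when both label sequences use a single identical class.
    """
    n = len(y_true)
    p_o = accuracy(y_true, y_pred)
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    # Chance agreement: product of the marginal class frequencies, summed over classes.
    p_e = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    return (p_o - p_e) / (1 - p_e)


# Hypothetical toy data, for illustration only.
log_probs = [math.log(0.25)] * 4           # a model that assigns 0.25 to every token
labels = ["news", "news", "sports", "sports"]
preds = ["news", "news", "sports", "news"]

print(perplexity(log_probs))
print(accuracy(labels, preds))
print(cohen_kappa(labels, preds))
```

Kappa is usually lower than accuracy because it discounts the agreement a classifier would reach by chance given the class distribution, which may explain the gap between the two columns for datasets with few or imbalanced classes.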