
vanangamudi / awesome-resources-for-indic-nlp

Licence: GPL-3.0 license

Awesome Resources for IndicNLP

Common Resources

OPUS, the open parallel corpus

A Dravidian Etymological Dictionary

Byte Pair Encoding - Pretrained for 275 languages

FastText word vectors for 157 languages

Indian Language Technology Proliferation and Deployment Center

Center For Indian Language Technology - CFILT FB page

Indian Institute of Language Studies (IILS)

Central Institute of Indian Languages


OpenSLR Speech datasets

Research Papers

Survey: Natural Language Parsing for Indian Languages

Language Specific

Malayalam

mlmorph - Malayalam Morphological Analyzer using Finite State Transducer

Tamil

Datasets

Datasets in Tamil text

Other projects

Open Tamil - Suite of tools for operating on Tamil text

Tokenizer, Language model and Classifier for Tamil language by Ravi Annaswamy

Scrapers

  1. Tamil Etymological Dictionary
  2. Newspaper Crawlers

ML models

Text classification model in PyTorch: can be easily applied to other datasets; in fact, the linked repository also contains a dataset of film reviews in Tamil.

Bengali

Bangla2Vec

Bengali News Classification

NLP for Bengali

  • Contains Wikipedia Articles Dataset (72,374 articles) and scripts which were used to scrape Wikipedia and clean that dataset
  • Contains Language Model with Perplexity ~41
  • Contains Bengali News Classification Model with 94% accuracy

Scrapers

Bengali News Channel Scraper

Telugu

Telugu-NLP - Contains NLP tools developed for Telugu

Research Papers and Data

Research Papers in Bengali NLP

Collection of Repositories

Language | Repository | Perplexity of language model | Wikipedia articles dataset | Classification accuracy (%) | Classification Kappa score
Hindi | NLP for Hindi | ~36 | 55,000 articles | ~79 (News Classification) | ~30 (Movie Review Classification)
Punjabi | NLP for Punjabi | ~13 | 44,000 articles | ~89 (News Classification) | ~60 (News Classification)
Sanskrit | NLP for Sanskrit | ~6 | 22,273 articles | ~70 (Shloka Classification) | ~56 (Shloka Classification)
Gujarati | NLP for Gujarati | ~34 | 31,913 articles | ~91 (News Classification) | ~85 (News Classification)
Kannada | NLP for Kannada | ~70 | 32,997 articles | ~94 (News Classification) | ~90 (News Classification)
Malayalam | NLP for Malayalam | ~26 | 12,388 articles | ~94 (News Classification) | ~91 (News Classification)
Nepali | NLP for Nepali | ~32 | 38,757 articles | ~97 (News Classification) | ~96 (News Classification)
Odia | NLP for Odia | ~27 | 17,781 articles | ~95 (News Classification) | ~92 (News Classification)
Marathi | NLP for Marathi | ~18 | 85,537 articles | ~91 (News Classification) | ~84 (News Classification)
Bengali | NLP for Bengali | ~41 | 72,374 articles | ~94 (News Classification) | ~92 (News Classification)