
Kartikaggarwal98 / Indian_ParallelCorpus

Licence: other
Curated list of publicly available parallel corpora for Indian languages


Parallel Corpus for Indian Languages

Available parallel data for training machine translation models in Indic languages: Assamese, Bengali, Gujarati, Gondi, Hindi, Kannada, Manipuri, Marathi, Malayalam, Oriya, Punjabi, Sanskrit, Tamil, Telugu.

Assamese-X

  1. Samaantar Corpus
  2. As-En PMIndia Corpus
  3. As-En Backtranslated Tatoeba Challenge: Parallel data obtained by backtranslation on monolingual data. Row asm-eng.

Bengali-X

  1. Samaantar Corpus
  2. Bn-En BEUT Parallel corpus: 2.75 million Bengali-English sentence pairs (EMNLP 2020)
  3. Bn-En Project Anuvaad
  4. Bn-En Indian Parallel Corpora
  5. CVIT-IIITH PIB Multilingual Corpus: en, gu, hi, ml, mr, or, pa, ta, te, ur. [Source-code, pretrained models and other resources also available.]
  6. CVIT-IIITH Mann ki Baat Corpus: en, gu, hi, ml, mr, or, pa, ta, te, ur. [Source-code, pretrained models and other resources also available.]
  7. Bn-En Indian-Language Dataset
  8. Bn-En Asian Language Treebank (ALT) Parallel Corpus
  9. Bn-En PMIndia Corpus
  10. Bn-En OPUS: Set source as en and target as bn
  11. Bn-En SUPARA 0.8M: Requires an IEEE DataPort Subscription
  12. Bn-En Backtranslated Tatoeba Challenge: Parallel data obtained by backtranslation on monolingual data. Row ben-eng.

Gujarati-X

  1. Samaantar Corpus
  2. Gu-En WikiTitles Parallel Corpus: wikititles-v1.gu-en.tsv.gz
  3. Gu-En Project Anuvaad
  4. Gu-En Tsardia
  5. CVIT-IIITH PIB Multilingual Corpus: en, bn, hi, ml, mr, or, pa, ta, te, ur. [Source-code, pretrained models and other resources also available.]
  6. CVIT-IIITH Mann ki Baat Corpus: en, bn, hi, ml, mr, or, pa, ta, te, ur. [Source-code, pretrained models and other resources also available.]
  7. Gu-En Shahparth123
  8. Gu-En PMIndia Corpus
  9. Gu-En Bible Corpus
  10. Gu-En OPUS: Set source as en and target as gu
  11. Gu-En Backtranslated Tatoeba Challenge: Parallel data obtained by backtranslation on monolingual data. Row guj-eng.
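A file such as wikititles-v1.gu-en.tsv.gz can be read without unpacking it first. The sketch below is a minimal reader, assuming each line holds exactly two tab-separated fields with Gujarati first and English second; that layout is inferred from the file name and should be checked against the actual download. The function name `read_parallel_tsv` is hypothetical.

```python
import gzip


def read_parallel_tsv(path):
    """Return (source, target) pairs from a gzipped two-column TSV file.

    Assumption: one sentence pair per line, source and target separated
    by a single tab. Lines that do not fit this shape are skipped.
    """
    pairs = []
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if len(fields) != 2:
                continue  # blank or malformed line
            source, target = fields[0].strip(), fields[1].strip()
            if source and target:
                pairs.append((source, target))
    return pairs
```

Calling `read_parallel_tsv("wikititles-v1.gu-en.tsv.gz")` on a downloaded copy would return a list of tuples that can be split into training and validation sets. Skipping malformed lines silently keeps the reader short; counting them instead may be preferable when auditing corpus quality.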

Gondi-X

  1. Gondi-Hindi Parallel Corpus

Hindi-X

  1. Samaantar Corpus
  2. Hi-En IITB Parallel Corpus: v3.0 released
  3. Hi-En Project Anuvaad
  4. Hi-En Indian Parallel Corpora
  5. CVIT-IIITH PIB Multilingual Corpus: en, bn, gu, ml, mr, or, pa, ta, te, ur. [Source-code, pretrained models and other resources also available.]
  6. CVIT-IIITH Mann ki Baat Corpus: en, bn, gu, ml, mr, or, pa, ta, te, ur. [Source-code, pretrained models and other resources also available.]
  7. Hi-En Asian Language Treebank (ALT) Parallel Corpus
  8. Hi-En PMIndia Corpus
  9. Hi-En Bible Corpus
  10. Hi-En Wiki Matrix Comparable Corpus
  11. Hi-En OPUS: Set source as en and target as hi. [Some of these corpora are part of the IITB Parallel Corpus.]
  12. Hi-En Backtranslated Tatoeba Challenge: Parallel data obtained by backtranslation on monolingual data. Row hin-eng.
  13. IIITH Code-Mix Hi-En Corpus
  14. Hi-En Flickr 8k: Multimodal Dataset
  15. Hi-San parallel corpus: Hindi-Sanskrit monolingual and parallel data from Ramayana, Rigveda, Bhagavad Gita, etc.

Kannada-X

  1. Samaantar Corpus
  2. Kn-En Project Anuvaad
  3. Kn-En PMIndia Corpus
  4. Kn-En Bible Corpus
  5. Kn-En OPUS: Set source as en and target as kn
  6. Kn-En Backtranslated Tatoeba Challenge: Parallel data obtained by backtranslation on monolingual data. Row kan-eng.

Manipuri-X

  1. Mn-En PMIndia Corpus

Marathi-X

  1. Samaantar Corpus
  2. Mr-En Project Anuvaad
  3. CVIT-IIITH PIB Multilingual Corpus: en, bn, gu, hi, ml, or, pa, ta, te, ur. [Source-code, pretrained models and other resources also available.]
  4. CVIT-IIITH Mann ki Baat Corpus: en, bn, gu, hi, ml, or, pa, ta, te, ur. [Source-code, pretrained models and other resources also available.]
  5. Mr-En PMIndia Corpus
  6. Mr-En Bible Corpus
  7. Mr-En OPUS: Set source as en and target as mr
  8. Mr-En Backtranslated Tatoeba Challenge: Parallel data obtained by backtranslation on monolingual data. Row mar-eng.

Malayalam-X

  1. Samaantar Corpus
  2. Ml-En Project Anuvaad
  3. Indian Parallel Corpora
  4. CVIT-IIITH PIB Multilingual Corpus: en, bn, gu, hi, mr, or, pa, ta, te, ur. [Source-code, pretrained models and other resources also available.]
  5. CVIT-IIITH Mann ki Baat Corpus: en, bn, gu, hi, mr, or, pa, ta, te, ur. [Source-code, pretrained models and other resources also available.]
  6. Ml-En Indian-Language Dataset
  7. Ml-En English_Malayalam_ParallelCorpora
  8. Ml-En PMIndia Corpus
  9. Ml-En Bible Corpus
  10. Ml-En OPUS: Set source as en and target as ml
  11. Ml-En Backtranslated Tatoeba Challenge: Parallel data obtained by backtranslation on monolingual data. Row mal-eng.

Oriya-X

  1. Samaantar Corpus
  2. Or-En MTEnglish2Odia
  3. Or-En OdiEnCorp 2.0
  4. Or-En OdiEnCorp 1.0
  5. Or-En IndoWordnet Parallel Corpus
  6. CVIT-IIITH PIB Multilingual Corpus: en, bn, gu, hi, ml, mr, pa, ta, te, ur. [Source-code, pretrained models and other resources also available.]
  7. CVIT-IIITH Mann ki Baat Corpus: en, bn, gu, hi, ml, mr, pa, ta, te, ur. [Source-code, pretrained models and other resources also available.]
  8. Or-En PMIndia Corpus
  9. Or-En OPUS: Set source as en and target as or
  10. Or-En Backtranslated Tatoeba Challenge: Parallel data obtained by backtranslation on monolingual data. Row ori-eng.

Punjabi-X

  1. Samaantar Corpus
  2. Pu-En Project Anuvaad
  3. Pu-En Punjabi-English Corpus
  4. Pu-En PMIndia Corpus
  5. Pu-En OPUS: Set source as en and target as pa
  6. Pu-En Backtranslated Tatoeba Challenge: Parallel data obtained by backtranslation on monolingual data. Row pan-eng.

Sanskrit-X

  1. San-Hi parallel corpus: Sanskrit-Hindi monolingual and parallel data from Ramayana, Rigveda, Bhagavad Gita, etc.

Tamil-X

  1. Samaantar Corpus
  2. Ta-En Project Anuvaad
  3. Ta-En Indian Parallel Corpora
  4. Ta-En National Language Process Center
  5. Ta-En EnTam
  6. CVIT-IIITH PIB Multilingual Corpus: en, bn, gu, hi, ml, mr, or, pa, te, ur. [Source-code, pretrained models and other resources also available.]
  7. CVIT-IIITH Mann ki Baat Corpus: en, bn, gu, hi, ml, mr, or, pa, te, ur. [Source-code, pretrained models and other resources also available.]
  8. Ta-En Indian-Language Dataset
  9. Ta-En Multiple Dataset Links
  10. Ta-En PMIndia Corpus
  11. Ta-En Parallel Corpus
  12. Ta-En OPUS: Set source as en and target as ta
  13. Ta-En Backtranslated Tatoeba Challenge: Parallel data obtained by backtranslation on monolingual data. Row tam-eng.

Telugu-X

  1. Samaantar Corpus
  2. Te-En Project Anuvaad
  3. Te-En Indian Parallel Corpora
  4. CVIT-IIITH PIB Multilingual Corpus: en, bn, gu, hi, ml, mr, or, pa, ta, ur. [Source-code, pretrained models and other resources also available.]
  5. CVIT-IIITH Mann ki Baat Corpus: en, bn, gu, hi, ml, mr, or, pa, ta, ur. [Source-code, pretrained models and other resources also available.]
  6. Te-En Indian-Language Dataset
  7. Te-En PMIndia Corpus
  8. Te-En Bible Corpus
  9. Te-En OPUS: Set source as en and target as te
  10. Te-En Backtranslated Tatoeba Challenge: Parallel data obtained by backtranslation on monolingual data. Row tel-eng.
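The Tatoeba Challenge entries in the sections above identify their rows with three-letter codes (asm-eng, ben-eng, and so on), while the OPUS entries use two-letter codes. The sketch below collects the pairs exactly as they appear in this list so a script can switch between the two; the names `TATOEBA_CODES` and `tatoeba_row` are hypothetical helpers, not part of either project.

```python
# Two-letter codes used in this list, mapped to the three-letter codes
# that appear in its "Row xxx-eng" notes.
TATOEBA_CODES = {
    "as": "asm",  # Assamese
    "bn": "ben",  # Bengali
    "gu": "guj",  # Gujarati
    "hi": "hin",  # Hindi
    "kn": "kan",  # Kannada
    "mr": "mar",  # Marathi
    "ml": "mal",  # Malayalam
    "or": "ori",  # Oriya
    "pa": "pan",  # Punjabi
    "ta": "tam",  # Tamil
    "te": "tel",  # Telugu
}


def tatoeba_row(lang, english="eng"):
    """Build the row label (e.g. 'ben-eng') for a two-letter language code."""
    code = TATOEBA_CODES.get(lang.lower())
    if code is None:
        raise KeyError(f"no Tatoeba row listed for language code {lang!r}")
    return f"{code}-{english}"
```

For example, `tatoeba_row("bn")` gives the label to look for in the Bengali entry. Languages without a Tatoeba entry in this list (Gondi, Manipuri, Sanskrit) are left out on purpose and raise a `KeyError`.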

Other Resources

  1. PMIndia Parallel Corpus Creation: Code for creating a parallel corpus from pmindia.gov.in. [Paper Link]