kbatsuren / CogNet

Licence: other
CogNet: a large-scale, high-quality cognate database for 338 languages, 1.07M words, and 8.1 million cognates

CogNet: a large-scale, high-quality cognate database for 338 languages

CogNet is a large-scale database of cognate pairs. CogNet v2 contains 8.1 million cognates in 338 languages, 38 writing systems, and 91,285 concepts. Manual evaluation puts its precision at 94%. It was constructed automatically from the wordnets and dictionaries contained in the UKC resource, as described in our paper. For 37 different orthographies of the 338 languages, we used the Wiktionary transliteration tool, WikTra, developed by Wiktionary linguists and developers.

The UKC resource is at http://ukc.disi.unitn.it, and more details on CogNet are at http://cognet.ukc.disi.unitn.it/

Why are cognates important?

In computational linguistics, cognates improve cross-lingual NLP tasks such as word translation, bilingual lexicon induction, and cross-lingual knowledge transfer.

How can I explore cognate data?

Besides downloading the entire CogNet as a structured text file, you can also use the Linguarena website to display and browse (currently an older version of) cognate data interactively on a world map, as also shown in the figure above. The fish example is on this link

CogNet Format

Each line represents one instance of a pair of cognate words. Columns are separated by TAB.

| Column | Description |
| --- | --- |
| concept id | A code used by Princeton WordNet 3.0 to represent a meaning (called a synset) |
| language 1 | The 3-letter ISO code of the first language |
| word 1 | A word in language 1 |
| language 2 | The 3-letter ISO code of the second language |
| word 2 | A word in language 2 |
| transliteration 1 | The romanized form of word 1 |
| transliteration 2 | The romanized form of word 2 |

For example,

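| concept id | lang 1 | word 1 | lang 2 | word 2 | translit 1 | translit 2 |
| --- | --- | --- | --- | --- | --- | --- |
| n14996158 | glg | polipropileno | jpn | ポリプロピレン | - | poripuropiren |
| n06566077 | nep | सफ्टवेर | kas | سافٹویٚیَر | saphtawera | saftoeyar |
| n07062058 | eng | song | jpn | ソング | - | songu |
| n02506148 | cmn | | mon | заан | xiang | zaan |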
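
As an illustration of the format above, here is a minimal Python sketch that reads CogNet-style lines into records. It is not part of the CogNet distribution: the names `CognatePair`, `read_cognates`, and `pairs_between` are hypothetical, and the code assumes that every data line has exactly seven TAB-separated columns and that `-` in a transliteration column means no romanized form is given (as in the Latin-script rows above). Lines with any other number of columns, such as a header, are skipped.

```python
from typing import Iterable, Iterator, List, NamedTuple, Optional


class CognatePair(NamedTuple):
    """One line of the cognate file (hypothetical record type)."""
    concept_id: str
    lang1: str
    word1: str
    lang2: str
    word2: str
    translit1: Optional[str]  # None when the column holds "-"
    translit2: Optional[str]


def read_cognates(lines: Iterable[str]) -> Iterator[CognatePair]:
    """Yield one CognatePair per well-formed TAB-separated line."""
    for line in lines:
        cols = line.rstrip("\r\n").split("\t")
        if len(cols) != 7:
            continue  # assumption: skip blank, header, or malformed lines
        concept_id, lang1, word1, lang2, word2, t1, t2 = cols
        yield CognatePair(
            concept_id, lang1, word1, lang2, word2,
            None if t1 == "-" else t1,  # assumption: "-" means absent
            None if t2 == "-" else t2,
        )


def pairs_between(pairs: Iterable[CognatePair], a: str, b: str) -> List[CognatePair]:
    """Return the pairs that link languages a and b, in either column order."""
    return [p for p in pairs if {p.lang1, p.lang2} == {a, b}]


# Three rows copied from the example table above.
SAMPLE = (
    "n14996158\tglg\tpolipropileno\tjpn\tポリプロピレン\t-\tporipuropiren\n"
    "n06566077\tnep\tसफ्टवेर\tkas\tسافٹویٚیَر\tsaphtawera\tsaftoeyar\n"
    "n07062058\teng\tsong\tjpn\tソング\t-\tsongu\n"
)

if __name__ == "__main__":
    records = list(read_cognates(SAMPLE.splitlines()))
    for p in pairs_between(records, "eng", "jpn"):
        print(p.concept_id, p.word1, p.word2, p.translit2)
```

To read the full download instead of the sample, the same function can be given an open file handle, for example `read_cognates(open(path, encoding="utf-8"))`, where `path` is wherever you saved the file.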

License

This tool is available under the Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) License. Read more about this license at https://creativecommons.org/licenses/by-nc-sa/4.0/.

References

Please cite the following articles if you find this resource useful:

Khuyagbaatar Batsuren, Gábor Bella, and Fausto Giunchiglia. CogNet: A large-scale cognate database. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL), 2019. https://aclweb.org/anthology/papers/P/P19/P19-1302/

Khuyagbaatar Batsuren, Gábor Bella, and Fausto Giunchiglia. A large and evolving cognate database. Language Resources and Evaluation, 2021. https://doi.org/10.1007/s10579-021-09544-6

Acknowledgements

Thanks to the Global Wordnet Association (GWA) and all the wordnet developers: http://globalwordnet.org/resources/wordnets-in-the-world/
