kbatsuren / CogNet

Licence: other

CogNet: a large-scale, high-quality cognate database for 338 languages, 1.07M words, and 8.1 million cognates

Projects that are alternatives to or similar to CogNet

xl-sum

This repository contains the code, data, and models of the paper titled "XL-Sum: Large-Scale Multilingual Abstractive Summarization for 44 Languages" published in Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021.

Stars: ✭ 160 (+515.38%)

Mutual labels: low-resource-languages, multilinguality

wordnet

Stand-alone WordNet API

Stars: ✭ 39 (+50%)

Mutual labels: wordnet

kanji-frequency

Kanji usage frequency data collected from various sources

Stars: ✭ 92 (+253.85%)

Mutual labels: corpus-linguistics

LMMS

Language Modelling Makes Sense - WSD (and more) with Contextual Embeddings

Stars: ✭ 79 (+203.85%)

Mutual labels: wordnet

kontext

An advanced, extensible web front-end for the Manatee-open corpus search engine

Stars: ✭ 50 (+92.31%)

Mutual labels: corpus-linguistics

TurkishWordNet

Turkish WordNet KeNet

Stars: ✭ 32 (+23.08%)

Mutual labels: wordnet

Mecab Ipadic Neologd

Neologism dictionary based on the language resources on the Web for mecab-ipadic

Stars: ✭ 2,408 (+9161.54%)

Mutual labels: language-resources

wn

A modern, interlingual wordnet interface for Python

Stars: ✭ 119 (+357.69%)

Mutual labels: wordnet

NLIDB

Natural Language Interface to DataBases

Stars: ✭ 100 (+284.62%)

Mutual labels: wordnet

Hierarchical-Word-Sense-Disambiguation-using-WordNet-Senses

Word Sense Disambiguation using Word Specific models, All word models and Hierarchical models in Tensorflow

Stars: ✭ 33 (+26.92%)

Mutual labels: wordnet

m3gm

Max-Margin Markov Graph Models for WordNet (EMNLP 2018)

Stars: ✭ 40 (+53.85%)

Mutual labels: wordnet

Wordbook

Wordbook is a dictionary application built for GNOME.

Stars: ✭ 56 (+115.38%)

Mutual labels: wordnet

Pattern

Web mining module for Python, with tools for scraping, natural language processing, machine learning, network analysis and visualization.

Stars: ✭ 8,112 (+31100%)

Mutual labels: wordnet

goclassy

An asynchronous concurrent pipeline for classifying Common Crawl based on fastText's pipeline.

Stars: ✭ 81 (+211.54%)

Mutual labels: corpus-linguistics

corpusexplorer2.0

Corpus linguistics has never been so easy...

Stars: ✭ 16 (-38.46%)

Mutual labels: corpus-linguistics

nerus

Large silver-standard Russian corpus with NER, morphology and syntax markup

Stars: ✭ 47 (+80.77%)

Mutual labels: corpus-linguistics

NatLang

NatLang is an English parser with an extensible grammar

Stars: ✭ 20 (-23.08%)

Mutual labels: wordnet

gf-wordnet

A WordNet in GF

Stars: ✭ 15 (-42.31%)

Mutual labels: wordnet

ungoliant

🕷️ The pipeline for the OSCAR corpus

Stars: ✭ 69 (+165.38%)

Mutual labels: corpus-linguistics

ws4j

WordNet Similarity for Java provides an API for several Semantic Relatedness/Similarity algorithms

Stars: ✭ 41 (+57.69%)

Mutual labels: wordnet


CogNet: a large-scale, high-quality cognate database for 338 languages

CogNet is a large-scale database of cognate pairs. CogNet v2 contains 8.1 million cognates in 338 languages, 38 writing systems, and 91,285 concepts. Its quality was manually evaluated at 94% precision. It was constructed automatically from the wordnets and dictionaries contained in the UKC resource, as described in our paper. For 37 different orthographies of the 338 languages, we used the Wiktionary transliteration tool WikTra, developed by Wiktionary linguists and developers.

The UKC resource is at http://ukc.disi.unitn.it, and more details about CogNet are at http://cognet.ukc.disi.unitn.it/

Why are cognates important?

In computational linguistics, cognates help improve cross-lingual NLP tasks such as word translation, bilingual lexicon induction, and cross-lingual knowledge transfer.

How can I explore cognate data?

Besides downloading the entire CogNet as a structured text file, you can use the Linguarena website to browse cognate data (currently an older version) interactively on a world map, as shown in the figure above. The fish example is on this link.

CogNet Format

Each line represents one instance of a pair of cognate words. Columns are separated by tab characters.

Column	Description
concept id	a code used by Princeton WordNet 3.0 to represent a meaning (called a synset)
language 1	the 3-letter ISO code of the first language
word 1	a word in language 1
language 2	the 3-letter ISO code of the second language
word 2	a word in language 2
transliteration 1	the romanized form of word 1
transliteration 2	the romanized form of word 2
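The column layout above can be turned into a small typed record. The following is a minimal sketch, not part of CogNet itself: the names `CognatePair` and `parse_line` are hypothetical, and it assumes that a `-` in a transliteration column means no romanization is given (as the examples in this section suggest for words already in Latin script).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CognatePair:
    """One CogNet line: a pair of cognate words sharing a concept."""
    concept_id: str
    lang1: str
    word1: str
    lang2: str
    word2: str
    translit1: Optional[str]
    translit2: Optional[str]


def parse_line(line: str) -> CognatePair:
    """Split one tab-separated line into its seven columns."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 7:
        raise ValueError(f"expected 7 tab-separated columns, got {len(fields)}")
    concept_id, lang1, word1, lang2, word2, t1, t2 = fields
    # Assumption: "-" marks a column with no transliteration.
    return CognatePair(
        concept_id, lang1, word1, lang2, word2,
        None if t1 == "-" else t1,
        None if t2 == "-" else t2,
    )


pair = parse_line("n07062058\teng\tsong\tjpn\tソング\t-\tsongu\n")
print(pair.lang1, pair.word1, "~", pair.lang2, pair.word2, pair.translit2)
```

Mapping `-` to `None` keeps "no transliteration" distinct from an empty string, which may matter if downstream code compares romanized forms.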

For example,

concept id	lang 1	word 1	lang 2	word 2	translit 1	translit 2
n14996158	glg	polipropileno	jpn	ポリプロピレン	-	poripuropiren
n06566077	nep	सफ्टवेर	kas	سافٹویٚیَر	saphtawera	saftoeyar
n07062058	eng	song	jpn	ソング	-	songu
n02506148	cmn	象	mon	заан	xiang	zaan
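As an illustration of how rows like these might be queried, here is a hedged sketch that indexes cognate pairs by (language, word) so the cognates of a given word can be looked up in either direction. The function name `index_by_word` is hypothetical, the sample text copies three of the rows above, and skipping a header row is an assumption, since it is not stated whether the downloaded file includes one.

```python
import csv
import io
from collections import defaultdict

# Three of the example rows above, preceded by the header row.
SAMPLE = (
    "concept id\tlang 1\tword 1\tlang 2\tword 2\ttranslit 1\ttranslit 2\n"
    "n14996158\tglg\tpolipropileno\tjpn\tポリプロピレン\t-\tporipuropiren\n"
    "n07062058\teng\tsong\tjpn\tソング\t-\tsongu\n"
    "n02506148\tcmn\t象\tmon\tзаан\txiang\tzaan\n"
)


def index_by_word(rows):
    """Map (language, word) to the set of (concept id, language, word) cognates."""
    index = defaultdict(set)
    for row in rows:
        # Skip malformed rows and the header row (assumed to be optional).
        if len(row) != 7 or row[0] == "concept id":
            continue
        concept, lang1, word1, lang2, word2 = row[:5]
        # Cognacy is symmetric, so store the pair in both directions.
        index[(lang1, word1)].add((concept, lang2, word2))
        index[(lang2, word2)].add((concept, lang1, word1))
    return index


# QUOTE_NONE keeps any quote characters inside words as literal text.
reader = csv.reader(io.StringIO(SAMPLE), delimiter="\t", quoting=csv.QUOTE_NONE)
index = index_by_word(reader)
print(index[("jpn", "ソング")])  # {('n07062058', 'eng', 'song')}
```

For the full file, the same function could be fed a `csv.reader` over an opened file (UTF-8 encoding assumed) instead of the in-memory sample.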

License

This tool is available under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) License. Read more about this license at https://creativecommons.org/licenses/by-nc-sa/4.0/.

References

Please cite the following articles if you find this resource useful:

Khuyagbaatar Batsuren, Gábor Bella, and Fausto Giunchiglia. CogNet: A large-scale cognate database. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL), 2019. https://aclweb.org/anthology/papers/P/P19/P19-1302/

Khuyagbaatar Batsuren, Gábor Bella, and Fausto Giunchiglia. A large and evolving cognate database. Language Resources and Evaluation, 2021. https://doi.org/10.1007/s10579-021-09544-6

Acknowledgements

Thanks to the Global WordNet Association (GWA) and all the wordnet developers: http://globalwordnet.org/resources/wordnets-in-the-world/
