
droher / etymology-db

Licence: Apache-2.0 license
An open etymology dataset created using Wiktionary data. Contains 3.8M entries, 1.8M terms, 2900 languages, and 31 unique relationship types.

etymology-db

Downloads (last generated 2021-11-14):

  • Gzipped CSV
  • Parquet
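
As a rough sketch of how the gzipped CSV could be read without extra dependencies, the snippet below streams rows using Python's standard `gzip` and `csv` modules. It assumes the file has a header row whose names match the schema described in this README; the function names and any file path you pass in are illustrative, not part of the project.

```python
import csv
import gzip
from typing import Dict, Iterator


def iter_rows(path: str) -> Iterator[Dict[str, str]]:
    """Yield one dict per CSV row, keyed by the header names."""
    # "rt" opens the gzip stream as text; newline="" is what the csv module expects.
    with gzip.open(path, "rt", encoding="utf-8", newline="") as handle:
        yield from csv.DictReader(handle)


def count_by_reltype(path: str) -> Dict[str, int]:
    """Count rows per relation type without loading the whole file into memory."""
    counts: Dict[str, int] = {}
    for row in iter_rows(path):
        counts[row["reltype"]] = counts.get(row["reltype"], 0) + 1
    return counts
```

Calling `count_by_reltype` with the path of the downloaded file would give a quick overview of how the relation types are distributed. Streaming row by row keeps memory use flat, which matters for a file with millions of rows.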

A structured, comprehensive, and multilingual etymology dataset created by parsing Wiktionary's etymology sections. Key features:

  • 3.8+ million etymological relationships between 1.8+ million terms in 2900+ languages/dialects
  • 31 different types of etymological relations, distinguishing between inheritance, borrowing, etc.
  • Hierarchical data that preserves relationship structures, such as the evolution of a term across languages

Caveat for people interested in using this for research: all information is pulled directly from Wiktionary via semi-structured text parsing, and I've made no effort yet to validate any particular result. That said, I would love for this to be useful, so please raise an issue if you have questions.

Here is a description of the table schema:

| Column Name | Description |
|---|---|
| term_id | A hash of the term and its language. |
| lang | The language/dialect of the term. |
| term | The term itself. Usually a word, but can also be a prefix or a multi-word expression, hence "term" instead of word. |
| reltype | The kind of etymological relation being specified (see below for details on each possible value). |
| related_term_id | A hash of the related term and its language (useful for assembling relationships across multiple terms). |
| related_lang | The language/dialect of the related term. NULL for parent root nodes. |
| related_term | The term that is etymologically related to the original entry. NULL for parent root nodes. |
| position | Zero-indexed position of the term when the relation is made up of multiple terms (e.g. a compound). |
| group_tag | Randomly generated ID, populated only for the root nodes of nested relationships. |
| parent_tag | If this relation is inside of a nested structure, this will be populated with the group_tag of its immediate parent. NULL otherwise. |
| parent_position | Zero-indexed position of the relation inside of its nested structure. NULL if not nested. |
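
To illustrate how these columns fit together, here is a minimal sketch that works on rows already loaded as dictionaries keyed by the column names above. `components` orders the parts of a multi-term relation by `position`, and `follow` walks `related_term_id` links from one term to the next. Both function names are hypothetical, and the sample rows are invented for illustration: IDs such as `t1` stand in for the real hashed values, and the language labels are assumed to be plain names.

```python
from typing import Dict, List

Row = Dict[str, str]

# Invented sample rows; "t1", "t2", ... stand in for real hashed term_id values.
ROWS: List[Row] = [
    {"term_id": "t1", "lang": "English", "term": "doghouse", "reltype": "compound_of",
     "related_term_id": "t3", "related_lang": "English", "related_term": "house",
     "position": "1"},
    {"term_id": "t1", "lang": "English", "term": "doghouse", "reltype": "compound_of",
     "related_term_id": "t2", "related_lang": "English", "related_term": "dog",
     "position": "0"},
    {"term_id": "t2", "lang": "English", "term": "dog", "reltype": "inherited_from",
     "related_term_id": "t4", "related_lang": "Middle English", "related_term": "dogge",
     "position": "0"},
    {"term_id": "t4", "lang": "Middle English", "term": "dogge", "reltype": "inherited_from",
     "related_term_id": "t5", "related_lang": "Old English", "related_term": "docga",
     "position": "0"},
]


def components(rows: List[Row], term_id: str, reltype: str) -> List[str]:
    """Return the related terms of one multi-term relation, ordered by position."""
    parts = [r for r in rows if r["term_id"] == term_id and r["reltype"] == reltype]
    parts.sort(key=lambda r: int(r["position"] or 0))
    return [r["related_term"] for r in parts]


def follow(rows: List[Row], term_id: str, reltype: str = "inherited_from") -> List[str]:
    """Follow related_term_id links of one relation type, listing the terms visited."""
    # Simplification: keeps a single row per term; real data may hold several.
    by_id = {r["term_id"]: r for r in rows if r["reltype"] == reltype}
    chain: List[str] = []
    seen = set()  # guards against cycles in the data
    while term_id in by_id and term_id not in seen:
        seen.add(term_id)
        row = by_id[term_id]
        chain.append(f'{row["related_lang"]}: {row["related_term"]}')
        term_id = row["related_term_id"]
    return chain
```

With the sample rows, `components(ROWS, "t1", "compound_of")` returns the parts in compound order, and `follow(ROWS, "t2")` lists each ancestor reached through `related_term_id`.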

And here is a description of each relation type. Note that these are all derived directly from Wiktionary's etymology templates: every row is classified according to the name of the template from which the info was extracted, and no further inferences are made. The only exception is the group relations, which are based on formulaic, recurring patterns in natural-language sections.

| Relation Type | Description |
|---|---|
| inherited_from | Indicates that term has an unbroken chain of inheritance from related_term. |
| borrowed_from | Indicates that term is a loanword borrowed during the time the borrowing language was spoken. |
| derived_from | A catch-all for a derivation relationship that is not specifically inherited/borrowed. |
| learned_borrowing_from | Borrowed from the original language via atypical ("inorganic") means of language contact. |
| semi_learned_borrowing_from | Borrowed words that have been partly reshaped by later sound change or analogy with inherited terms. |
| orthographic_borrowing_from | Borrows the spelling of related_term but not the pronunciation. |
| unadapted_borrowing_from | Borrowed words that have not been conformed to the morpho-syntactic, phonological and/or phonotactical rules of the target language. |
| root | Constructed root(s) of term in a theoretical ancestor language, e.g. Proto-Indo-European. |
| has_prefix | Indicates that term is partially based on the prefix related_term. |
| has_prefix_with_root | related_term is the term attached to a prefix (not necessarily a suffix, e.g. "normal" in "abnormal"). |
| has_suffix | Same as has_prefix, but for suffixes. |
| has_suffix_with_root | Same as has_prefix_with_root, but for suffixes. |
| has_confix | A confix is a term whose first element is a prefix and whose last is a suffix. Position 0 of a confix is the prefix, and the last position is the suffix. |
| has_affix | Affix is the general form of prefix/suffix/confix and indicates some kind of compound structure without further detail. |
| compound_of | Indicates that related_term is the position-indexed term of a compound that makes up term. Used interchangeably with has_affix above. |
| back-formation_from | Indicates that term was formed from related_term by removing a prefix/suffix. |
| doublet_with | Indicates that term and related_term have the same etymological origin, especially when the relationship is unintuitive. |
| is_onomatopoeic | Indicates that term is an onomatopoeia (a word formed from a sound associated with its meaning). |
| calque_of | Indicates that term is borrowed from related_term via a direct word-for-word or root-for-root translation. |
| semantic_loan_of | Special case of calques in which the word already existed but a new meaning was added. |
| named_after | Indicates that term is based on the name of the person related_term (an eponym). |
| phono-semantic_matching_of | Indicates that term and related_term have very similar sounds and meanings in both languages. |
| etymologically_related_to | A catch-all indicating term and related_term are etymologically related without any further context provided. |
| blend_of | Indicates that term is made up of a blend of related_term and the related terms in other positions. This differs from a compound in that the beginning of one word is combined with the ending of another. |
| clipping_of | Indicates that term is a spoken shortened version of related_term without any semantic difference. |
| abbreviation_of | Indicates that term is a written shortened version of related_term without any semantic difference. |
| initialism_of | Indicates that term is based on the initials of related_term. |
| cognate_of | Indicates that term and related_term sound/mean similar things, but no direct ancestral relationship exists. |
| group_affix_root | A node that groups together rows that, when combined, form an affix. |
| group_related_root | A node that groups together rows in which related_terms are not just related to the term, but to each other as well. |
| group_derived_root | A node that groups together rows that, when combined, form an unbroken chain of inheritance (in reverse chronological order). |
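
The group relations can be read back into a tree by matching each row's `parent_tag` against the `group_tag` of a root node and ordering the members by `parent_position`. The following is a minimal sketch under stated assumptions: the rows are invented for illustration, the tag `g1` stands in for a randomly generated ID, NULL values are assumed to arrive as `None` or an empty string, and the helper names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Optional

Row = Dict[str, Optional[str]]

# Invented rows: one group_derived_root node with two nested members.
ROWS: List[Row] = [
    {"term": "dog", "reltype": "group_derived_root", "related_term": None,
     "group_tag": "g1", "parent_tag": None, "parent_position": None},
    {"term": "dog", "reltype": "inherited_from", "related_term": "docga",
     "group_tag": None, "parent_tag": "g1", "parent_position": "1"},
    {"term": "dog", "reltype": "inherited_from", "related_term": "dogge",
     "group_tag": None, "parent_tag": "g1", "parent_position": "0"},
]


def children_by_parent(rows: List[Row]) -> Dict[str, List[Row]]:
    """Index nested rows by parent_tag, each list sorted by parent_position."""
    index: Dict[str, List[Row]] = defaultdict(list)
    for row in rows:
        if row["parent_tag"]:  # skips both None and empty strings
            index[row["parent_tag"]].append(row)
    for members in index.values():
        members.sort(key=lambda r: int(r["parent_position"] or 0))
    return index


def render(row: Row, index: Dict[str, List[Row]], depth: int = 0) -> List[str]:
    """Return indented lines for a row and, if it is a group root, its members."""
    label = str(row["reltype"])
    if row["related_term"]:
        label += f' -> {row["related_term"]}'
    lines = ["  " * depth + label]
    # A row without a group_tag has no members, so the lookup yields an empty list.
    for child in index.get(row["group_tag"] or "", []):
        lines.extend(render(child, index, depth + 1))
    return lines
```

Top-level roots are the rows that have a `group_tag` but no `parent_tag`; calling `render` on each of them yields one indented outline per nested structure. Because a member can itself carry a `group_tag`, the recursion also handles groups nested inside groups.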

The wiktionary_codes.csv file is manually combined from these two pages:
https://en.wiktionary.org/wiki/Wiktionary:List_of_languages
https://en.wiktionary.org/wiki/Module:etymology_languages/data

All data is licensed under the Creative Commons ShareAlike 3.0 License. All code is licensed under the Apache 2.0 license.
