Official releases of the PROIEL treebank of ancient Indo-European languages

The PROIEL Treebank

The PROIEL Treebank is a dependency treebank with morphosyntactic and information-structure annotation. It includes texts in several ancient Indo-European languages and is freely available under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 License.

Please cite as

Dag T. T. Haug and Marius L. Jøhndal. 2008. 'Creating a Parallel Treebank of the Old Indo-European Bible Translations'. In Caroline Sporleder and Kiril Ribarov (eds.). Proceedings of the Second Workshop on Language Technology for Cultural Heritage Data (LaTeCH 2008) (2008), pp. 27-34.

Releases of the PROIEL Treebank are hosted on GitHub.

Contents

The following texts are included in this release of the treebank:

| Text | Language | Filename | Size |
|------|----------|----------|------|
| The Greek New Testament (ed. Tischendorf 1869) | Ancient Greek | greek-nt | 140,763 tokens |
| The Armenian New Testament (ed. Künzle 1984) | Classical Armenian | armenian-nt | 23,513 tokens |
| The Gothic Bible (ed. Streitberg 1919) | Gothic | gothic-nt | 57,211 tokens |
| Codex Marianus (ed. Jagić 1883) | Old Church Slavonic | marianus | 58,269 tokens |
| Jerome's Vulgate | Latin | latin-nt | 112,454 tokens |
| Caesar, Commentarii belli Gallici (ed. Holmes 1914) | Latin | caes-gal | 28,607 tokens |
| Cicero, De officiis (ed. Miller 1913) | Latin | cic-off | 10,644 tokens |
| Cicero, Epistulae ad Atticum (ed. Purser 1901) | Latin | cic-att | 42,855 tokens |
| Palladius, Opus agriculturae (ed. Schmitt 1898) | Latin | pal-agr | 12,148 tokens |
| Peregrinatio Aetheriae (ed. Heraeus 1908) | Latin | per-aeth | 18,356 tokens |
| Herodotus, Histories (ed. Godley 1920) | Ancient Greek | hdt | 85,080 tokens |
| Sphrantzes, Chronicles (post-1453) (ed. Grecu 1966) | Ancient Greek | chron | 24,612 tokens |

(The 'size' column in the table above shows the number of annotated tokens in a text. The number of tokens will be slightly larger than the number of words in the original printed edition as some words have been split into multiple tokens and some tokens have been inserted during annotation.)
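As a rough illustration of how a token count like this could be reproduced from a PROIEL XML file, here is a minimal Python sketch. It assumes that tokens are stored as `token` elements and that tokens inserted during annotation lack a `form` attribute. Both are assumptions that should be checked against proiel.xsd. The embedded sample document is hypothetical and not taken from the treebank.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature document. The element and attribute names
# (token, form, empty-token-sort) are assumptions; verify them against
# proiel.xsd before relying on this.
SAMPLE = """<proiel>
  <source id="sample">
    <div>
      <sentence id="1">
        <token id="1" form="Gallia"/>
        <token id="2" form="est"/>
        <token id="3" empty-token-sort="V"/>
      </sentence>
    </div>
  </source>
</proiel>"""


def count_tokens(xml_text):
    """Return (number of all tokens, number of tokens without a form)."""
    root = ET.fromstring(xml_text)
    tokens = list(root.iter("token"))
    # Tokens with no surface form are treated here as inserted tokens.
    inserted = [t for t in tokens if t.get("form") is None]
    return len(tokens), len(inserted)


total, inserted = count_tokens(SAMPLE)
print(f"{total} tokens, of which {inserted} have no surface form")
```

For a real file, the same function could be applied to the file's contents, for example `count_tokens(open("caes-gal.xml", encoding="utf-8").read())`, where the `.xml` file name is assumed from the filename column above.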

Please see the XML files for detailed metadata and a full list of contributors.

Some sentences have not yet been annotated. This is an overview of where in the texts unannotated sentences occur:

Sections in which more than half of the sentences have not yet been annotated:

  • armenian-nt: JOHN 1-21, MATT 1-28, MARK 1-16
  • caes-gal: 5.8-5.58, book 7, book 8
  • cic-att: 6.2-6.9, 7.2-7.9, 7.11-7.26, 8.1-8.16
  • cic-off: 1.114-1.161, book 2, book 3
  • greek-nt: HEB 13, 1PET 3-5, 2PET 1-3, 1JOHN 1-5, 2JOHN 1, 3JOHN 1, JUDE 1
  • hdt: 1.70, 1.127-1.130, 1.200, book 2, book 3, 4.1-4.156, 5.94-5.101, 6.82, 6.86, 7.1, 7.31, 8.8-8.144, book 9
  • latin-nt: COL 3-4, 1TIM 1-6, 2TIM 1-3, HEB 1-13, JAS 1-5, 1PET 1-5, 2PET 1-3, 1JOHN 1-5, 2JOHN 1, 3JOHN 1
  • pal-agr: 2.12, 3.13-3.34, books 4-14

Sections or section ranges in which there are gaps:

  • armenian-nt: LUKE 3
  • caes-gal: 6.36
  • cic-att: 1.17-1.20, 2.3-2.24, 3.20-3.23, 4.2-4.19, 5.2-5.21, 6.1, 7.1
  • cic-off: 1.7-1.10, 1.38, 1.48, 1.61, 1.100, 1.106, 1.112, 1.133
  • hdt: 1.45-1.69, 1.126, 1.141-1.216, 4.157-4.198, 5.1-5.109, 6.12-6.138, 7.2-7.198, 7.220-7.234, 8.3-8.7
  • latin-nt: ACTS 21-28, ROM 11, ROM 13, GAL 1-6, EPH 3-5, PHIL 1, PHIL 3, COL 1-2, 2THESS 3, 2TIM 4, JUDE 1
  • marianus: MATT 5, MARK 16, LUKE 2, LUKE 24, JOHN 1-2, JOHN 18, JOHN 20
  • pal-agr: 1.4-1.12, 1.35-1.40, 2.3, 2.9-2.23, 3.9-3.10

These gaps will be filled in future releases.

Data formats

The texts are available in two formats:

  1. PROIEL XML: These files are the authoritative source files and the only ones that contain all available annotation. They contain the complete morphological, syntactic and information-structure annotation, as well as the complete text, including punctuation, section headers etc. The schema is defined in proiel.xsd.

  2. CoNLL-X format
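
For readers who want to work with the CoNLL-X files, the following Python sketch shows one way they might be read. It relies only on the general CoNLL-X layout: one token per line, ten tab-separated columns, and a blank line between sentences. The sample rows, including the tag and relation values, are illustrative placeholders and are not taken from the treebank.

```python
from collections import namedtuple

# The ten tab-separated CoNLL-X columns, in order.
FIELDS = ["id", "form", "lemma", "cpostag", "postag",
          "feats", "head", "deprel", "phead", "pdeprel"]
Token = namedtuple("Token", FIELDS)

# Hypothetical sample: two short sentences with placeholder annotation.
SAMPLE = (
    "1\tin\tin\tR\tR-\t_\t2\tadv\t_\t_\n"
    "2\tprincipio\tprincipium\tN\tNb\t_\t0\tpred\t_\t_\n"
    "\n"
    "1\terat\tsum\tV\tV-\t_\t0\tpred\t_\t_\n"
)


def read_conllx(text):
    """Split CoNLL-X text into sentences, each a list of Token tuples."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line ends the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        columns = line.split("\t")
        if len(columns) != len(FIELDS):
            raise ValueError(f"expected {len(FIELDS)} columns: {line!r}")
        current.append(Token(*columns))
    if current:
        sentences.append(current)
    return sentences


sentences = read_conllx(SAMPLE)
for sentence in sentences:
    print(" ".join(token.form for token in sentence))
```

All column values are kept as strings, so `head` needs an explicit `int()` conversion before it is used to index into a sentence.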
