GUM

Repository for the Georgetown University Multilayer Corpus (GUM)

This repository contains release versions of the Georgetown University Multilayer Corpus (GUM), a corpus of English texts from twelve written and spoken text types:

  • interviews
  • news
  • travel guides
  • how-to guides
  • academic writing
  • biographies
  • fiction
  • online forum discussions
  • spontaneous face to face conversations
  • political speeches
  • textbooks
  • vlogs

The corpus is created as part of the course LING-367 (Computational Corpus Linguistics) at Georgetown University. For more details see: https://gucorpling.org/gum.

A note about reddit data

For one of the twelve text types in this corpus, reddit forum discussions, plain text data is not supplied. To obtain this data, please run _build/process_reddit.py, then run _build/build_gum.py. This, and all data, is provided with absolutely no warranty; users agree to use the data under the license with which it is provided, and reddit data is subject to reddit's terms and conditions. See README_reddit.md for more details.
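
The two-step order above can be sketched in Python as follows. This is only an illustration: it assumes that each script is started from inside _build/ and takes no arguments, neither of which this README states. The helper name `restore_reddit_text` and its `run` parameter are hypothetical; README_reddit.md is the authoritative guide.

```python
import subprocess
import sys
from pathlib import Path

# Script paths as named in this README, relative to the repository root.
REDDIT_STEPS = ("_build/process_reddit.py", "_build/build_gum.py")


def restore_reddit_text(repo_root=".", run=subprocess.run):
    """Run the two scripts in order and stop at the first failure.

    `run` defaults to subprocess.run; it is a parameter only so that the
    sequence can be checked without executing anything.
    """
    root = Path(repo_root)
    for script in REDDIT_STEPS:
        path = root / script
        # Assumption: each script is run from its own folder, without arguments.
        # check=True raises CalledProcessError if a script exits with an error,
        # so build_gum.py never runs after a failed process_reddit.py.
        run([sys.executable, path.name], cwd=path.parent, check=True)
```

Calling `restore_reddit_text("path/to/gum")` would then run both steps in sequence, where the path is a placeholder for a local clone.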

Train / dev / test splits

Two documents from each genre are reserved for testing and two for development (24 test documents, 24 dev documents). See splits.md for the official training, development and testing partitions.
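
As a rough sketch of how the partitions can be applied, the Python snippet below sorts document names into train, dev and test once the dev and test lists have been copied from splits.md. The function name `partition` is hypothetical, and this README does not describe the layout of splits.md, so the lists are passed in by hand rather than parsed from the file.

```python
def partition(doc_names, dev_docs, test_docs):
    """Group document names into 'train', 'dev' and 'test'.

    Any document that is listed in neither dev_docs nor test_docs
    is treated as training data.
    """
    dev_docs, test_docs = set(dev_docs), set(test_docs)
    overlap = dev_docs & test_docs
    if overlap:
        raise ValueError(f"documents listed as both dev and test: {sorted(overlap)}")

    splits = {"train": [], "dev": [], "test": []}
    for name in doc_names:
        if name in dev_docs:
            splits["dev"].append(name)
        elif name in test_docs:
            splits["test"].append(name)
        else:
            splits["train"].append(name)
    return splits
```

With twelve genres and two documents per genre in each held-out set, a full run should give 24 names under "dev" and 24 under "test".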

Citing

To cite this corpus in general, please refer to the following article, or see different citations for specific aspects below:

Zeldes, Amir (2017) "The GUM Corpus: Creating Multilayer Resources in the Classroom". Language Resources and Evaluation 51(3), 581–612.

@Article{Zeldes2017,
  author    = {Amir Zeldes},
  title     = {The {GUM} Corpus: Creating Multilayer Resources in the Classroom},
  journal   = {Language Resources and Evaluation},
  year      = {2017},
  volume    = {51},
  number    = {3},
  pages     = {581--612},
  doi       = {10.1007/s10579-016-9343-x}
}

If you are using the Reddit subset of GUM in particular, please use this citation instead:

Behzad, Shabnam and Zeldes, Amir (2020) "A Cross-Genre Ensemble Approach to Robust Reddit Part of Speech Tagging". In: Proceedings of the 12th Web as Corpus Workshop (WAC-XII).

@InProceedings{BehzadZeldes2020,
  author    = {Shabnam Behzad and Amir Zeldes},
  title     = {A Cross-Genre Ensemble Approach to Robust {R}eddit Part of Speech Tagging},
  booktitle = {Proceedings of the 12th Web as Corpus Workshop (WAC-XII)},
  pages     = {50--56},
  year      = {2020},
}

If you are using the OntoNotes schema version of the coreference annotations (a.k.a. OntoGUM data in coref/ontogum/), please cite this paper instead:

@InProceedings{ZhuEtAl2021,
  author    = {Yilun Zhu and Sameer Pradhan and Amir Zeldes},
  booktitle = {Proceedings of ACL-IJCNLP 2021},
  title     = {{OntoGUM}: Evaluating Contextualized {SOTA} Coreference Resolution on 12 More Genres},
  year      = {2021},
  pages     = {461--467},
  address   = {Bangkok, Thailand}
}

For a full list of contributors please see the corpus website.

Directories

The corpus is downloadable in multiple formats, but not all formats contain all annotations. The most accessible format is probably CoNLL-U dependencies (in dep/), the most complete representation is PAULA XML (in paula/), and the easiest way to search the corpus is with ANNIS, for example to query for phrases headed by 'one' that bridge back to a different, previously mentioned entity. Other formats may be useful for other purposes. See the corpus website for more details.
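
For a quick look at the CoNLL-U files in dep/, a minimal reader can be sketched in plain Python. It relies only on the general CoNLL-U layout (comment lines starting with #, ten tab-separated columns per token, a blank line between sentences). The function name `read_conllu` and the sample sentence are invented for illustration and are not taken from the corpus.

```python
# The ten standard CoNLL-U columns, in order.
FIELDS = ("id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc")


def read_conllu(lines):
    """Yield (comments, tokens) for each sentence in an iterable of lines."""
    comments, tokens = [], []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence.
            if tokens:
                yield comments, tokens
            comments, tokens = [], []
        elif line.startswith("#"):
            comments.append(line[1:].strip())
        else:
            cols = line.split("\t")
            if len(cols) != len(FIELDS):
                raise ValueError(f"expected 10 columns, got {len(cols)}: {line!r}")
            tokens.append(dict(zip(FIELDS, cols)))
    if tokens:
        # The last sentence may not be followed by a blank line.
        yield comments, tokens


# Invented three-token sample, built with explicit tabs.
SAMPLE = "\n".join([
    "# sent_id = example-1",
    "# text = Dogs bark.",
    "\t".join(["1", "Dogs", "dog", "NOUN", "NNS", "Number=Plur", "2", "nsubj", "_", "_"]),
    "\t".join(["2", "bark", "bark", "VERB", "VBP", "_", "0", "root", "_", "SpaceAfter=No"]),
    "\t".join(["3", ".", ".", "PUNCT", ".", "_", "2", "punct", "_", "_"]),
    "",
])

sentences = list(read_conllu(SAMPLE.splitlines()))
```

To read a real file, pass an open file handle (opened with encoding="utf-8") instead of `SAMPLE.splitlines()`. The reader keeps every column as a string, so the extra annotations stored in the comment lines and in the last column are preserved but not interpreted.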

NB: reddit data is not included in top folders - consult README_reddit.md to add it

  • _build/ - The GUM build bot and utilities for data merging and validation
  • annis/ - The entire merged corpus, with all annotations, as a relANNIS 3.3 corpus dump, importable into ANNIS
  • const/ - Constituent trees with function labels and PTB POS tags in the PTB bracketing format (automatic parser output from gold POS with functions projected from gold dependencies)
  • coref/ - Entity and coreference annotation in two formats:
    • conll/ - CoNLL shared task tabular format (with Wikification but no bridging or split antecedent annotations)
    • ontogum/ - alternative version of coreference annotation in CoNLL, tsv and CoNLL-U formats following OntoNotes guidelines (see Zhu et al. 2021)
    • tsv/ - WebAnno .tsv format, including entity and information status annotations, Wikification, bridging, split antecedent and singleton entities
  • dep/ - Dependency trees using Universal Dependencies, enriched with sentence types, enhanced dependencies, entities, information status, coreference, bridging, Wikification, XML markup, morphological tags and Universal POS tags according to the UD standard
  • paula/ - The entire merged corpus in standoff PAULA XML, with all annotations
  • rst/ - Rhetorical Structure Theory analyses in .rs3 format as used by RSTTool and rstWeb, as well as binary and n-ary lisp trees (.dis) and an RST dependency representation (.rsd)
  • xml/ - vertical XML representations with 1 token or tag per line and tab delimited lemmas and POS tags (extended VVZ style, vanilla, UPOS and CLAWS5, as well as dependency functions), compatible with the IMS Corpus Workbench (a.k.a. TreeTagger format).
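
The PTB bracketing format used in const/ can likewise be read with a few lines of Python. The sketch below turns bracketed text into nested lists and extracts (word, POS) pairs. The function names and the sample tree, including its function label, are invented for illustration, and the snippet assumes nothing about how the files in const/ are laid out beyond balanced brackets.

```python
import re

# An opening bracket, a closing bracket, or any run of other non-space characters.
TOKEN_RE = re.compile(r"\(|\)|[^\s()]+")


def parse_brackets(text):
    """Parse bracketed trees into nested lists of the form [label, child, ...]."""
    stack, trees = [], []
    for tok in TOKEN_RE.findall(text):
        if tok == "(":
            stack.append([])
        elif tok == ")":
            if not stack:
                raise ValueError("unbalanced ')'")
            node = stack.pop()
            if stack:
                stack[-1].append(node)   # attach to the parent node
            else:
                trees.append(node)       # a complete top-level tree
        else:
            if not stack:
                raise ValueError(f"token outside brackets: {tok!r}")
            stack[-1].append(tok)
    if stack:
        raise ValueError("unbalanced '('")
    return trees


def leaves(node):
    """Return (word, POS) pairs, reading preterminals of the form [POS, word]."""
    if len(node) == 2 and all(isinstance(part, str) for part in node):
        return [(node[1], node[0])]
    pairs = []
    for child in node:
        if isinstance(child, list):
            pairs.extend(leaves(child))
    return pairs


SAMPLE_TREE = "(ROOT (S (NP-SBJ (NNS Dogs)) (VP (VBP bark)) (. .)))"
tree = parse_brackets(SAMPLE_TREE)[0]
words_and_tags = leaves(tree)
```

Nested lists keep the example free of dependencies; a tree library would be a reasonable alternative for anything beyond inspection.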