
proycon / folia

Licence: GPL-3.0 License
FoLiA: Format for Linguistic Annotation - FoLiA is a rich XML-based annotation format for the representation of language resources (including corpora) with linguistic annotations. A wide variety of linguistic annotations are supported, making FoLiA a useful format for NLP tasks and data interchange. Note that the actual Python library for proces…

Programming Languages

python
shell

Projects that are alternatives of or similar to folia

foliapy
An extensive Python library for dealing with FoLiA (Format for Linguistic Annotation) documents, a rich XML-based format for linguistic annotation finding application in Natural Language Processing (NLP). This library was formerly part of PyNLPl.
Stars: ✭ 13 (-76.79%)
Mutual labels:  xml, computational-linguistics, folia
wikipron
Massively multilingual pronunciation mining
Stars: ✭ 167 (+198.21%)
Mutual labels:  linguistics, computational-linguistics
fuzzing-corpus
My fuzzing corpus
Stars: ✭ 120 (+114.29%)
Mutual labels:  corpus, file-format
Colibri Core
Colibri Core is an NLP tool as well as a C++ and Python library for working with basic linguistic constructions such as n-grams and skipgrams (i.e. patterns with one or more gaps, either of fixed or dynamic size) in a quick and memory-efficient way. At the core is the tool ``colibri-patternmodeller``, which allows you to build, view, manipulate and query pattern models.
Stars: ✭ 112 (+100%)
Mutual labels:  corpus, linguistics
linguistics problems
Natural language processing in examples and games
Stars: ✭ 23 (-58.93%)
Mutual labels:  linguistics, computational-linguistics
ucto
Unicode tokeniser. Ucto tokenizes text files: it separates words from punctuation, and splits sentences. It offers several other basic preprocessing steps such as changing case that you can all use to make your text suited for further processing such as indexing, part-of-speech tagging, or machine translation. Ucto comes with tokenisation rules …
Stars: ✭ 58 (+3.57%)
Mutual labels:  computational-linguistics, folia
Weixin public corpus
WeChat official account corpus
Stars: ✭ 465 (+730.36%)
Mutual labels:  corpus, linguistics
pylangacq
Language Acquisition Research Tools
Stars: ✭ 33 (-41.07%)
Mutual labels:  linguistics, computational-linguistics
proiel-treebank
Official releases of the PROIEL treebank of ancient Indo-European languages
Stars: ✭ 30 (-46.43%)
Mutual labels:  corpus, linguistics
nytwit
New York Times Word Innovation Types dataset
Stars: ✭ 21 (-62.5%)
Mutual labels:  corpus, computational-linguistics
frog
Frog is an integration of memory-based natural language processing (NLP) modules developed for Dutch. All NLP modules are based on Timbl, the Tilburg memory-based learning software package.
Stars: ✭ 70 (+25%)
Mutual labels:  computational-linguistics, folia
xrechnung-visualization
XSL transformators for web and pdf rendering of German CIUS XRechnung or EN16931-1:2017 [MIRROR OF GitLab]
Stars: ✭ 26 (-53.57%)
Mutual labels:  xml
odin
Data-structure definition/validation/traversal, mapping and serialisation toolkit for Python
Stars: ✭ 24 (-57.14%)
Mutual labels:  xml
cljs-corpus
A greppable archive of ClojureScript code
Stars: ✭ 37 (-33.93%)
Mutual labels:  corpus
VectorDrawable2Svg
Converts Android VectorDrawable .xml files to .svg files
Stars: ✭ 50 (-10.71%)
Mutual labels:  xml
dreamland world
DreamLand MUD: all configuration files, and some areas for local dev
Stars: ✭ 16 (-71.43%)
Mutual labels:  xml
naf
Nucleotide Archival Format - Compressed file format for DNA/RNA/protein sequences
Stars: ✭ 35 (-37.5%)
Mutual labels:  file-format
KWDLC
Kyoto University Web Document Leads Corpus
Stars: ✭ 64 (+14.29%)
Mutual labels:  corpus
SentimentAnalysis
Sentiment Analysis: Deep Bi-LSTM+attention model
Stars: ✭ 32 (-42.86%)
Mutual labels:  computational-linguistics
TextDatasetCleaner
🔬 Cleaning datasets of junk (normalisation, preprocessing)
Stars: ✭ 27 (-51.79%)
Mutual labels:  linguistics

FoLiA: Format for Linguistic Annotation

Project Status: Active – The project has reached a stable, usable state and is being actively developed.

Documentation | Examples | Python Library | Python Library Documentation | C++ Library | Rust Library | FoLiA-Tools | FoLiA Utilities | FLAT: Web-based Annotation environment

by Maarten van Gompel, CLST/Radboud University Nijmegen & KNAW Humanities Cluster

https://proycon.github.io/folia

FoLiA is an XML-based annotation format, suitable for the representation of linguistically annotated language resources. FoLiA's intended use is as a format for storing and/or exchanging language resources, including corpora. Our aim is to introduce a single rich format that can accommodate a wide variety of linguistic annotation types through a single generalised paradigm. We do not commit to any label set, language or linguistic theory. This is always left to the developer of the language resource, and provides maximum flexibility.

XML is an inherently hierarchic format. FoLiA does justice to this by maximally utilising a hierarchic, inline setup. We inherit from the D-Coi format, which is itself loosely based on a minimal subset of TEI. Because FoLiA introduces a new and much broader paradigm, it is not backwards-compatible with D-Coi, i.e. validators for D-Coi will not accept FoLiA XML. It is, however, easy to convert FoLiA to less complex or less verbose formats such as D-Coi or plain text. Converters are provided.

The main characteristics of FoLiA are:

  • Generalised paradigm - We use a generalised paradigm, with as few ad-hoc provisions for annotation types as possible.
  • Expressivity - The format is highly expressive: annotations can be expressed in great detail and adapted flexibly to the user's needs, without forcing unwanted details. Moreover, FoLiA has generalised support for representing annotation alternatives, and for annotation metadata such as the annotator, the time of annotation, and the annotation confidence.
  • Extensible - Due to the generalised paradigm and the fact that the format does not commit to any label set, FoLiA is fairly easily extensible.
  • Formalised - The format is formalised and can be validated on both a shallow and a deep level (the latter including tagset validation). It is also easily machine-parsable, and tools are provided for this.
  • Practical - FoLiA has been developed in a bottom-up fashion right alongside applications, libraries, and other toolkits and converters. Whilst the format is rich, we try to keep it as simple and straightforward as possible, minimising the learning curve and making it easy to adopt FoLiA in practical applications.
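As a rough illustration of the expressivity point above, the sketch below reads a primary annotation, one alternative, and their metadata from a small hand-written fragment. It uses only Python's standard library rather than the FoLiA Python library. The fragment is simplified and not validated: the element and attribute names (`alt`, `annotator`, `annotatortype`, `datetime`, `confidence`) reflect our understanding of the FoLiA documentation, and the class labels and annotator names are invented for the example.

```python
import xml.etree.ElementTree as ET

# FoLiA XML namespace, mapped to a short prefix for the queries below.
NS = {"f": "http://ilk.uvt.nl/folia"}

# Hand-written, simplified fragment (assumption: not validated against the
# FoLiA schema; labels, annotators and values are made up).
FRAGMENT = """
<w xmlns="http://ilk.uvt.nl/folia" xml:id="example.s.1.w.1">
  <t>bank</t>
  <pos class="N" annotator="alice" annotatortype="manual"
       datetime="2019-01-01T12:00:00" confidence="0.9"/>
  <alt xml:id="example.s.1.w.1.alt.1">
    <pos class="V" annotator="tagger" annotatortype="auto" confidence="0.1"/>
  </alt>
</w>
""".strip()


def describe(pos):
    """Collect the class and the annotation metadata of one annotation."""
    return {
        "class": pos.get("class"),
        "annotator": pos.get("annotator"),
        "annotatortype": pos.get("annotatortype"),
        "confidence": float(pos.get("confidence")),
    }


word = ET.fromstring(FRAGMENT)

# The primary annotation is a direct child of the token ...
primary = describe(word.find("f:pos", NS))
# ... whereas alternatives are wrapped in <alt> elements.
alternatives = [describe(p) for p in word.findall("f:alt/f:pos", NS)]

print(primary)
print(alternatives)
```

The same pattern would apply to other annotation types; for real documents, the provided libraries handle these details and should be preferred.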

The FoLiA format makes mixed use of inline and stand-off annotation. Inline annotation is used for annotations pertaining to single tokens, whilst stand-off annotation in separate annotation layers is adopted for annotation types that span multiple tokens. This provides FoLiA with the necessary flexibility and extensibility to deal with various kinds of annotations.
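The difference between the two styles can be sketched with a small, hand-written sentence: part-of-speech tags sit inline inside each token, while an entity that covers two tokens lives in a separate layer and points back at the tokens by ID. The fragment is a simplified assumption, not a validated FoLiA document, and the class labels (`N`, `V`, `org`) are invented. It is read with Python's standard library only.

```python
import xml.etree.ElementTree as ET

NS = {"f": "http://ilk.uvt.nl/folia"}
# ElementTree exposes the xml:id attribute under the XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Simplified, hand-written sentence (assumption: illustrative only).
SENTENCE = """
<s xmlns="http://ilk.uvt.nl/folia" xml:id="example.s.1">
  <w xml:id="example.s.1.w.1"><t>Radboud</t><pos class="N"/></w>
  <w xml:id="example.s.1.w.2"><t>University</t><pos class="N"/></w>
  <w xml:id="example.s.1.w.3"><t>teaches</t><pos class="V"/></w>
  <entities>
    <entity xml:id="example.s.1.entity.1" class="org">
      <wref id="example.s.1.w.1"/>
      <wref id="example.s.1.w.2"/>
    </entity>
  </entities>
</s>
""".strip()

sentence = ET.fromstring(SENTENCE)

# Index the tokens by ID so that stand-off references can be resolved.
words = {w.get(XML_ID): w for w in sentence.findall("f:w", NS)}

# Inline annotation: the PoS tag is a child of the token it describes.
inline = [
    (w.find("f:t", NS).text, w.find("f:pos", NS).get("class"))
    for w in words.values()
]

# Stand-off annotation: the entity refers to the tokens it spans.
spans = []
for entity in sentence.findall("f:entities/f:entity", NS):
    tokens = [
        words[ref.get("id")].find("f:t", NS).text
        for ref in entity.findall("f:wref", NS)
    ]
    spans.append((entity.get("class"), " ".join(tokens)))

print(inline)
print(spans)
```

Because the span annotation only references token IDs, further layers can be added over the same tokens without touching the tokens themselves.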

Notable features are:

  • XML-based, UTF-8 encoded
  • Language and tagset independent
  • Can encode both tokenised and untokenised text, with partial reconstructability of the untokenised form even after tokenisation.
  • Generalised paradigm, extensible and flexible
  • Provenance support for all linguistic annotations: annotator, type (automatic or manual), time.
  • Used by various software projects and corpora, especially in the Dutch-Flemish NLP community
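To give an idea of how the untokenised form can be partially reconstructed, the sketch below joins tokens back into a string. It assumes that a token carrying `space="no"` is not followed by a space, which is our reading of the FoLiA documentation. The sentence is hand-written and simplified, and `detokenise` is a hypothetical helper written for this example, not part of any FoLiA library.

```python
import xml.etree.ElementTree as ET

NS = {"f": "http://ilk.uvt.nl/folia"}

# Simplified, hand-written sentence (assumption: illustrative only).
SENTENCE = """
<s xmlns="http://ilk.uvt.nl/folia" xml:id="example.s.1">
  <w xml:id="example.s.1.w.1"><t>Hello</t></w>
  <w xml:id="example.s.1.w.2" space="no"><t>world</t></w>
  <w xml:id="example.s.1.w.3"><t>!</t></w>
</s>
""".strip()


def detokenise(sentence):
    """Join token texts, honouring space="no" (hypothetical helper)."""
    parts = []
    for w in sentence.findall("f:w", NS):
        parts.append(w.find("f:t", NS).text)
        # Assumption: tokens are followed by a space unless space="no".
        if w.get("space", "yes") != "no":
            parts.append(" ")
    return "".join(parts).rstrip()


text = detokenise(ET.fromstring(SENTENCE))
print(text)
```

The reconstruction is only partial: the exact whitespace of the original (for example line breaks or multiple spaces) is not recovered by this approach.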

Paradigm Schema

Resources

A more extensive list of FoLiA-capable software is maintained on the FoLiA website.

Publications

See the FoLiA website for more publications and full text links.

  • Maarten van Gompel (2019). FoLiA: Format for Linguistic Annotation - Documentation. Language and Speech Technology Technical Report Series. Radboud University Nijmegen.
  • Maarten van Gompel, Ko van der Sloot, Martin Reynaert, Antal van den Bosch (2017). FoLiA in Practice: The Infrastructure of a Linguistic Annotation Format. In: CLARIN in the Low Countries. Eds: Jan Odijk and Arjan van Hessen. Pp. 71-81.
  • Maarten van Gompel & Martin Reynaert (2014). FoLiA: A practical XML format for linguistic annotation - a descriptive and comparative study; Computational Linguistics in the Netherlands Journal; 3:63-81; 2013.
  • Maarten van Gompel (2014). FoLiA: Format for Linguistic Annotation. Documentation. Language and Speech Technology Technical Report Series LST-14-01. Radboud University Nijmegen.