pyconll / Pyconll

Licence: MIT
A minimal, pure Python library to interface with CoNLL-U format files.

pyconll

Easily work with CoNLL files using the familiar syntax of Python.

Installation

As with most Python packages, simply use pip to install from PyPI.

pip install pyconll

pyconll is also available as a conda package on the pyconll channel. Only versions 2.2.0 and newer are available on conda at the moment.

conda install -c pyconll pyconll

Starting in version 3.0.0, pyconll supports Python 3.6 and greater. In general, pyconll will focus development efforts on officially supported Python versions. Python 3.5 reached end of support in October 2020.

Use

This tool is intended to be a minimal, low-level, expressive, and pragmatic library in a widely used programming language. pyconll provides a thin, simple, and intuitive API on top of raw CoNLL annotations.

It offers the following features:

  • Regular CI testing and validation against all UD v2.x versions.
  • A strong domain model that includes CoNLL sources, Sentences, Tokens, Trees, etc.
  • A typed API for a better development experience and better semantics.
  • A focus on usability and simplicity in design (no dependencies).
  • Performance optimizations for a smooth development workflow no matter the dataset size (performs about 25%-35% faster than other comparable packages).

See the following code example to understand the basics of the API.

# This snippet finds sentences where a token marked with part of speech 'AUX' is
# governed by a NOUN. For example, in French this is a less common construction
# and we may want to validate these examples because we have previously found some
# problematic examples of this construction.
import pyconll

train = pyconll.load_from_file('./ud/train.conllu')

review_sentences = []

# Conll objects are iterable over their sentences, and sentences are iterable
# over their tokens. Sentences also de/serialize comment information.
for sentence in train:
    for token in sentence:

        # Tokens have attributes such as upos, head, id, deprel, etc, and sentences
        # can be indexed by a token's id. We must check that the token is not the
        # root token, whose id, '0', cannot be looked up.
        if token.upos == 'AUX' and (token.head != '0' and sentence[token.head].upos == 'NOUN'):
            review_sentences.append(sentence)

print('Review the following sentences:')
for sent in review_sentences:
    print(sent.id)
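
To make the attributes used above (id, upos, head) more concrete, here is a minimal sketch of the same check written against raw CoNLL-U lines using only the standard library. It is an illustration of the ten-column format, not pyconll's implementation; the names CONLLU_FIELDS, parse_token_line, and SAMPLE are hypothetical, and the sample sentence is made up for this example.

```python
# Illustrative only: a hand-rolled parse of CoNLL-U token lines, not pyconll code.
# The ten column names follow the CoNLL-U format definition.
CONLLU_FIELDS = ('id', 'form', 'lemma', 'upos', 'xpos',
                 'feats', 'head', 'deprel', 'deps', 'misc')


def parse_token_line(line):
    """Split one tab-separated CoNLL-U token line into a dict of fields."""
    values = line.rstrip('\n').split('\t')
    if len(values) != len(CONLLU_FIELDS):
        raise ValueError(f'expected 10 columns, got {len(values)}')
    return dict(zip(CONLLU_FIELDS, values))


# Hypothetical sample sentence: "She is a doctor", where the copula 'is' (AUX)
# is attached to the noun 'doctor', which is the root (head '0').
SAMPLE = (
    "1\tShe\tshe\tPRON\t_\t_\t4\tnsubj\t_\t_\n"
    "2\tis\tbe\tAUX\t_\t_\t4\tcop\t_\t_\n"
    "3\ta\ta\tDET\t_\t_\t4\tdet\t_\t_\n"
    "4\tdoctor\tdoctor\tNOUN\t_\t_\t0\troot\t_\t_\n"
)

tokens = [parse_token_line(line) for line in SAMPLE.splitlines()]
by_id = {token['id']: token for token in tokens}

# Same condition as the pyconll snippet: an AUX token whose governor is a NOUN.
# The root's head is '0', which has no entry in by_id, so it is excluded first.
flagged = [
    token['form'] for token in tokens
    if token['upos'] == 'AUX'
    and token['head'] != '0'
    and by_id[token['head']]['upos'] == 'NOUN'
]
print(flagged)
```

With pyconll, the parsing and the id lookup are handled by the library, which is why the earlier snippet can index a sentence directly by token.head.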

A full definition of the API can be found in the documentation, and the quick start guide offers a focused introduction.

Uses and Limitations

This package edits CoNLL-U annotations, not the annotated text itself: word forms on Tokens are not editable, and a Sentence's Tokens cannot be reassigned or reordered. pyconll focuses on editing existing CoNLL-U annotation rather than creating it or changing the underlying text. If this functionality interests you, please create a GitHub issue for more visibility.

This package is also validated only against the CoNLL-U format. The CoNLL and CoNLL-X formats are very similar but are not supported. I originally intended to support them as well, but they are not as well defined as CoNLL-U, so they are not included. Please create an issue for visibility if this feature interests you.

Lastly, linguistic data can often be very large, and this package attempts to keep that in mind. pyconll provides methods for creating in-memory Conll objects, along with an iterate-only version in case a corpus is too large to store in memory (the in-memory structure is several times larger than the actual corpus file). The iterate-only version can parse upwards of 100,000 words per second on a machine with 16 GB of RAM, so this package will perform well for most datasets used on a local dev machine. The 2.2.0 release also improves parse time and memory footprint by about 25%.
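
The difference between loading a corpus fully and iterating over it can be sketched with a plain Python generator. This is a standard-library illustration of the iterate-only idea, not pyconll's own loader; the function iter_sentences and the tiny in-memory corpus are hypothetical stand-ins for a large file on disk.

```python
import io


def iter_sentences(lines):
    """Yield one sentence at a time as a list of raw token lines.

    A blank line ends a sentence; lines starting with '#' are comments
    and are skipped in this simplified sketch.
    """
    current = []
    for line in lines:
        line = line.rstrip('\n')
        if not line:
            if current:
                yield current
                current = []
        elif not line.startswith('#'):
            current.append(line)
    # Emit the last sentence if the input does not end with a blank line.
    if current:
        yield current


# Hypothetical two-sentence corpus; a real corpus would be an open file handle.
corpus = io.StringIO(
    "# sent_id = 1\n"
    "1\tHello\thello\tINTJ\t_\t_\t0\troot\t_\t_\n"
    "\n"
    "# sent_id = 2\n"
    "1\tGood\tgood\tADJ\t_\t_\t2\tamod\t_\t_\n"
    "2\tmorning\tmorning\tNOUN\t_\t_\t0\troot\t_\t_\n"
)

# Only one sentence is held in memory at a time while counting tokens.
token_count = sum(len(sentence) for sentence in iter_sentences(corpus))
print(token_count)
```

The trade-off is that a generator can be consumed only once and offers no random access, whereas an in-memory object can be indexed and revisited at the cost of holding the whole corpus.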

Contributing

Contributions to this project are welcome and encouraged! If you are unsure how to contribute, here is a guide from GitHub explaining the basic workflow. After cloning this repo, please run pip install -r requirements.txt to set up your local environment. Some of these tools, like yapf, pylint, and mypy, do not have to be run locally, but CI builds will fail unless they pass. Some release dependencies, like twine and sphinx, are also installed.

When packaging new versions, use setuptools version 24.2.0 or greater so that the resulting package recognizes the python_requires metadata. Final packaging and release are now done with GitHub Actions, so this is less of a concern.

README and CHANGELOG

When changing either of these files, please change the Markdown version and run make gendocs so that the other versions stay in sync.

Release Checklist

The list below enumerates the general release process explicitly. This section is for internal use, and most people do not have to worry about it. Note first that the dev branch is always a direct extension of master with the latest changes since the last release. That is, it is essentially a staging branch for releases.

  • Change the version in pyconll/_version.py appropriately.
  • Merge dev into master locally. GitHub does not offer a fast-forward merge and explicitly uses --no-ff, so merge locally with a fast-forward to keep the history linear. This assumes that the dev branch looks good on CI tests, which do not run automatically in this situation.
  • Push the master branch. This should start some CI tests specifically for master. After validating these results, create a tag corresponding to the next version number and push the tag.
  • Create a new release from this tag on the Releases page. Creating this release starts two workflows: one releases to PyPI, and the other releases to conda.
  • Validate that these workflows pass and that the package is properly released on both platforms.