andreasvc / disco-dop

License: GPL-2.0
Discontinuous Data-Oriented Parsing

Programming Languages

Python, Cython, HTML, C++, JavaScript, C

Discontinuous DOP

contrived discontinuous constituent for expository purposes.

The aim of this project is to parse discontinuous constituents in natural language using Data-Oriented Parsing (DOP), with a focus on global world domination. The grammar is extracted from a treebank of sentences annotated with (discontinuous) phrase-structure trees. Concretely, this project provides a statistical constituency parser with support for discontinuous constituents and Data-Oriented Parsing. Discontinuous constituents are supported through the grammar formalism Linear Context-Free Rewriting System (LCFRS), which is a generalization of Probabilistic Context-Free Grammar (PCFG). Data-Oriented Parsing allows re-use of arbitrary-sized fragments from previously seen sentences using Tree-Substitution Grammar (TSG).
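As a rough illustration of what "discontinuous" means here, a constituent can be described by the set of word positions it covers. In LCFRS terms, the number of contiguous blocks in that set is the constituent's fan-out, and a PCFG corresponds to the case where every constituent has fan-out 1. The sketch below is not taken from discodop; the function names and the example sentence are assumptions made for illustration.

```python
from typing import Iterable, List, Tuple


def contiguous_blocks(indices: Iterable[int]) -> List[Tuple[int, int]]:
    """Group word positions into maximal contiguous blocks (inclusive ranges)."""
    blocks: List[Tuple[int, int]] = []
    for i in sorted(set(indices)):
        if blocks and i == blocks[-1][1] + 1:
            blocks[-1] = (blocks[-1][0], i)  # extend the current block
        else:
            blocks.append((i, i))  # start a new block
    return blocks


def fan_out(indices: Iterable[int]) -> int:
    """Number of contiguous blocks; a value above 1 means discontinuous."""
    return len(contiguous_blocks(indices))


# Hypothetical example: in "What did you eat ?", suppose a VP covers
# "What" (position 0) and "eat" (position 3), skipping "did you".
sentence = ["What", "did", "you", "eat", "?"]
vp_yield = [0, 3]  # discontinuous: two blocks
np_yield = [2]     # continuous: one block

if __name__ == "__main__":
    print(contiguous_blocks(vp_yield), fan_out(vp_yield))
    print(contiguous_blocks(np_yield), fan_out(np_yield))
```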

Features

General statistical parsing:

  • grammar formalisms: PCFG, PLCFRS
  • extract treebank grammar: trees decomposed into productions, relative frequencies as probabilities
  • exact k-best list of derivations
  • coarse-to-fine pruning: posterior threshold, k-best coarse-to-fine
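The treebank grammar extraction mentioned above can be sketched as follows: each tree is decomposed into productions, and each production's probability is its count divided by the total count of productions with the same left-hand side. This is a minimal sketch for continuous trees only (the PCFG case), using nested tuples as a hypothetical tree representation; it does not use discodop's own data structures or API, and it does not distinguish words from nonterminal labels on the right-hand side.

```python
from collections import Counter, defaultdict
from fractions import Fraction


def productions(tree):
    """Yield (lhs, rhs) for each internal node of a nested-tuple tree.

    A tree is (label, child, ...); a child is either a subtree or a word (str).
    """
    label, *children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    yield label, rhs
    for child in children:
        if not isinstance(child, str):
            yield from productions(child)


def treebank_grammar(trees):
    """Map each production to its relative frequency among same-LHS productions."""
    counts = Counter(prod for tree in trees for prod in productions(tree))
    lhs_totals = defaultdict(int)
    for (lhs, _), n in counts.items():
        lhs_totals[lhs] += n
    return {prod: Fraction(n, lhs_totals[prod[0]]) for prod, n in counts.items()}


# Toy treebank with made-up sentences.
toy_treebank = [
    ("S", ("NP", "Mary"), ("VP", ("V", "sleeps"))),
    ("S", ("NP", "John"), ("VP", ("V", "sees"), ("NP", "Mary"))),
]

if __name__ == "__main__":
    for (lhs, rhs), prob in sorted(treebank_grammar(toy_treebank).items()):
        print(f"{lhs} -> {' '.join(rhs)}  {prob}")
```

Exact fractions are used so that the probabilities for each left-hand side visibly sum to 1; a real implementation would more likely store floats or log probabilities.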

DOP specific (parsing with tree fragments):

  • implementations: Goodman's DOP reduction, Double-DOP, DOP1.
  • estimators: relative frequency estimate (RFE), equal weights estimate (EWE).
  • objective functions: most probable parse (MPP), most probable derivation (MPD), most probable shortest derivation (MPSD), most likely tree with shortest derivation (SL-DOP), most constituents correct (MCC).
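The difference between the first two objective functions can be shown with a toy example. In a tree-substitution grammar, several derivations (different combinations of fragments) may produce the same tree. MPD picks the tree of the single best derivation, while MPP sums the probabilities of all derivations per tree. The sketch below assumes the candidate derivations are given as a finite list, for instance a k-best list; the trees and probabilities are made up, and the function names are not discodop's API.

```python
from collections import defaultdict


def most_probable_derivation(derivations):
    """Return the tree of the single highest-probability derivation."""
    tree, _prob = max(derivations, key=lambda pair: pair[1])
    return tree


def most_probable_parse(derivations):
    """Sum derivation probabilities per tree and return the best tree."""
    totals = defaultdict(float)
    for tree, prob in derivations:
        totals[tree] += prob
    return max(totals, key=totals.get)


# Made-up (tree, derivation probability) pairs; the second tree has
# two derivations that are individually weaker but jointly stronger.
derivations = [
    ("(S (NP a) (VP b))", 0.30),
    ("(S (X a b))", 0.25),
    ("(S (X a b))", 0.20),
]

if __name__ == "__main__":
    print("MPD:", most_probable_derivation(derivations))
    print("MPP:", most_probable_parse(derivations))
```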

screenshot of parse tree produced by parser

Installation

Requirements:

Debian, Ubuntu based systems

The following instructions use the --user option, which installs Python packages to your home directory. Make sure that ~/.local/bin is in your PATH, or add it as follows (restart the terminal for the change to take effect):

echo 'export PATH="$HOME/.local/bin:$PATH"' >> ~/.bashrc

To compile the latest development version of discodop, issue the following commands:

sudo apt-get install build-essential python3-dev python3-pip git
git clone --recursive https://github.com/andreasvc/disco-dop.git
cd disco-dop
pip3 install --user -r requirements.txt
make install

Other Linux systems

These instructions assume that you have no root access but that gcc is installed.

Set environment variables so that software can be installed to the home directory (replace with equivalent for your shell if you do not use bash):

mkdir -p ~/.local
echo 'export PATH="$HOME/.local/bin:$PATH"' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH="$HOME/.local/lib:/usr/lib64:/usr/lib"' >> ~/.bashrc
echo 'export PYTHONIOENCODING="utf-8"' >> ~/.bashrc

After this, re-login or restart the shell to activate these settings. Install Python 3 from source, if not installed already. Python may require some libraries such as zlib and readline; installation steps are similar to the ones below:

wget http://www.python.org/ftp/python/3.6.1/Python-3.6.1.tgz
tar -xzf Python-*.tgz
cd Python-*
./configure --prefix=$HOME/.local --enable-shared
make install && cd ..
ldconfig

Run python3 to check that version 3.6.1 was installed successfully and is the default.

Install the latest development version of discodop:

wget -O disco-dop-master.zip https://github.com/andreasvc/disco-dop/archive/master.zip
unzip disco-dop-master.zip
cd disco-dop-master
pip3 install --user -r requirements.txt
make install

Mac OS X

  • Install Xcode and Homebrew

  • Install dependencies using Homebrew:

    brew install gcc python3 git
    git clone --recursive https://github.com/andreasvc/disco-dop.git
    cd disco-dop
    sudo pip3 install -r requirements.txt
    env CC=gcc sudo python3 setup.py install
    

Windows 10

Install the Windows subsystem for Linux (you may need to install a Windows update first), install Ubuntu from the Windows Store, and proceed with the steps above for Ubuntu-based systems.

Other systems

If you do not run Linux, it is possible to run the code inside a virtual machine. To do that, install Docker or VirtualBox, download a minimal Ubuntu image, and follow the installation instructions above.

Usage, documentation

discodop can be used in three ways:

  1. through the command line; cf. the manual pages for the discodop command installed as part of the installation: man discodop.
  2. as a library, cf. the API reference and example notebooks
  3. through web interfaces

NB: avoid running discodop from within the source tree, to ensure that the installed versions of modules are imported.
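One way to check which copy of a module Python would pick up is to ask the import system where it would load the module from. The sketch below uses only the standard library; the helper names are hypothetical, and the module name discodop in the usage section is an assumption based on the command name.

```python
import importlib.util
from pathlib import Path
from typing import Optional


def module_location(name: str) -> Optional[Path]:
    """Return the file a top-level module would be imported from, or None.

    None is returned for missing modules and for modules without a file
    location (built-in or frozen modules, namespace packages).
    """
    spec = importlib.util.find_spec(name)
    if spec is None or not spec.has_location or spec.origin is None:
        return None
    return Path(spec.origin).resolve()


def shadowed_by_cwd(name: str) -> bool:
    """True if the module would be imported from under the current directory."""
    location = module_location(name)
    if location is None:
        return False
    return Path.cwd().resolve() in location.parents


if __name__ == "__main__":
    # "discodop" as the importable module name is an assumption.
    print("location:", module_location("discodop"))
    print("imported from current directory:", shadowed_by_cwd("discodop"))
```

If the second line prints True while you are inside the source tree, change to another directory before running the parser.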

The documentation can be found at http://discodop.readthedocs.io

Grammars, demo

An interactive demo of the parser is available at: https://lang.science.uva.nl/parser/

The pretrained grammars used in this demo are available at: https://lang.science.uva.nl/grammars/

The English, German, and Dutch grammars are described in van Cranenburgh et al. (2016); the French grammar appears in Sangati & van Cranenburgh (2015). For comparison, there is also an English grammar without discontinuous constituents (ptb-nodisc).

Acknowledgments

The Tree data structures in tree.py and the simple binarization algorithm in treetransforms.py were taken from NLTK. The Zhang-Shasha tree-edit distance algorithm in treedist.py was taken from https://github.com/timtadh/zhang-shasha. Elements of the PLCFRS parser and punctuation re-attachment are based on code from rparse. Various other bits were inspired by the Stanford parser, Berkeley parser, Bubs parser, &c.

References

Please cite the following paper if you use this code in the context of a publication:

@article{vancranenburgh2016disc,
    title={Data-Oriented Parsing with discontinuous constituents and function tags},
    author={van Cranenburgh, Andreas and Remko Scha and Rens Bod},
    journal={Journal of Language Modelling},
    year={2016},
    volume={4},
    number={1},
    pages={57--111},
    url={http://dx.doi.org/10.15398/jlm.v4i1.100}
}