X (weighted / probabilistic) Context-Free Grammars

XCFGs

XCFGs aims to unify all extensions of context-free grammars, where X stands for weighted, (compound) probabilistic, neural, and other extensions. Currently, only the data preprocessing module has been implemented.

Update (06/02/2022): Parse MSCOCO and Flickr30k captions, create data splits, and encode images for VC-PCFG.

Update (03/10/2021): Parallel Chinese-English data is supported.

Data

The repo handles WSJ, CTB, and SPMRL. Have a look at treebank.py.

If you are looking for the data used in C-PCFGs, follow the instructions in treebank.py and put all outputs in the same folder, say ./data.punct. The script only removes morphology features and creates data splits. To remove punctuation, use clean_tb.py; for example, I used python clean_tb.py ./data.punct ./data.clean. All the cleaned treebanks will then reside in ./data.clean. Finally, run ./batchify.sh ./data.clean/ to produce all the data needed to reproduce the results in C-PCFGs. Feel free to change the parameters in batchify.sh if you want a different batch size or vocabulary size.

Evaluation

To ease evaluation I represent a gold tree as a tuple:

TREE: TUPLE(sentence: STR, spans: LIST[SPAN], span_labels: LIST[STR], pos_tags: LIST[STR])
SPAN: TUPLE(left_boundary: INT, right_boundary: INT)
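
The tuple layout above can be illustrated with a short Python sketch. This is not the repo's conversion code: the function name tree_to_tuple, the PTB-style bracketed input, the inclusive right boundary, and the choice to keep every constituent (including single-word ones) are all assumptions made for illustration.

```python
from typing import List, Tuple

Span = Tuple[int, int]
Tree = Tuple[str, List[Span], List[str], List[str]]


def tree_to_tuple(bracketed: str) -> Tree:
    """Convert a PTB-style bracketed tree into (sentence, spans, span_labels, pos_tags).

    Assumptions (hypothetical, not taken from the repo): word indices are
    0-based, right boundaries are inclusive, and no spans are filtered out.
    """
    tokens = bracketed.replace("(", " ( ").replace(")", " ) ").split()
    words: List[str] = []
    tags: List[str] = []
    spans: List[Span] = []
    labels: List[str] = []
    stack: List[Tuple[str, int]] = []  # (label, index of first covered word)
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok == "(":
            label = tokens[i + 1]
            if tokens[i + 2] not in ("(", ")"):
                # Preterminal of the form (TAG word): record tag and word.
                tags.append(label)
                words.append(tokens[i + 2])
                i += 4  # skip "(", TAG, word, ")"
                continue
            # Phrase-level node: remember where its span starts.
            stack.append((label, len(words)))
            i += 2
        elif tok == ")":
            label, left = stack.pop()
            spans.append((left, len(words) - 1))
            labels.append(label)
            i += 1
        else:
            raise ValueError(f"unexpected token: {tok!r}")
    return " ".join(words), spans, labels, tags


example = tree_to_tuple("(S (NP (DT the) (NN cat)) (VP (VBD sat)))")
print(example)
```

Spans are emitted in the order their constituents close, so children always precede their parents in the list.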

If you have followed the instructions in the previous section, the command ./binarize.sh ./data.clean/ can convert gold trees into the tuple representation.

Trivial baselines

Even for trivial baselines, e.g., left- and right-branching trees, you may find different F1 numbers in the grammar induction literature, partly because authors use (slightly) different data preprocessing procedures. To encourage a truly fair comparison, I also release a standard procedure, baseline.py. Hopefully, this will help with the situation.
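
To make the two baselines concrete, here is a minimal Python sketch of left- and right-branching span sets and an unlabeled span F1. It is not baseline.py: the helper names, the 0-based inclusive spans, and the decision to count the whole-sentence span while skipping single-word spans are assumptions, and choices exactly like these are one reason reported numbers differ.

```python
from typing import Iterable, List, Tuple

Span = Tuple[int, int]


def left_branching(n: int) -> List[Span]:
    """Spans of a left-branching binary tree over n words: (0,1), (0,2), ..."""
    return [(0, right) for right in range(1, n)]


def right_branching(n: int) -> List[Span]:
    """Spans of a right-branching binary tree over n words: (n-2,n-1), ..., (0,n-1)."""
    return [(left, n - 1) for left in range(n - 2, -1, -1)]


def span_f1(pred: Iterable[Span], gold: Iterable[Span]) -> float:
    """Unlabeled F1 between two span sets (duplicates are collapsed)."""
    pred_set, gold_set = set(pred), set(gold)
    overlap = len(pred_set & gold_set)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_set)
    recall = overlap / len(gold_set)
    return 2 * precision * recall / (precision + recall)


# Hypothetical gold spans for "the cat sat on the mat" (6 words).
gold = [(0, 1), (4, 5), (3, 5), (2, 5), (0, 5)]
print("LB F1:", span_f1(left_branching(6), gold))
print("RB F1:", span_f1(right_branching(6), gold))
```

On this toy English sentence the right-branching tree matches more gold spans than the left-branching one, which is in line with the WSJ row pattern in the table that follows.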

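| Model | WSJ | CTB | Basque | German | French | Hebrew | Hungarian | Korean | Polish | Swedish |
|---|---|---|---|---|---|---|---|---|---|---|
| LB | 8.7 | 7.2 | 17.9 | 10.0 | 5.7 | 8.5 | 13.3 | 18.5 | 10.9 | 8.4 |
| RB | 39.5 | 25.5 | 15.4 | 14.7 | 26.4 | 30.0 | 12.7 | 19.2 | 34.2 | 30.4 |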

An evaluation checklist for phrase-structure grammar induction

Below is a comparison of several critical training / evaluation settings of recent unsupervised parsing models.

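| Model | Sent. F1 | Corpus F1 | Variance | Word repr. | Punct. rm | Length | Dataset |
|---|---|---|---|---|---|---|---|
| PRPN | | | | RAW | | | WSJ |
| ON | | | | RAW | | | WSJ |
| DIORA | | | | ELMo | | | WSJ |
| URNNG | | | | RAW | | | WSJ |
| N-PCFG | | | | RAW | | | WSJ / CTB |
| C-PCFG | | | | RAW | | | WSJ / CTB |
| VG-NSL | | | | RAW / FastText | | | MSCOCO |
| LN-PCFG | | | | RAW | | | WSJ |
| CT | | | | RoBERTa | | | WSJ |
| S-DIORA | | | | ELMo | | | WSJ |
| VC-PCFG | | | | RAW | | | MSCOCO |
| C-PCFG (Zhao 2020) | | | | RAW | | | WSJ / CTB / SPMRL |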
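
The Sent. F1 and Corpus F1 columns refer to two ways of aggregating span F1 that can give different numbers on the same predictions. The sketch below, under assumed definitions (sentence-level F1 averages per-sentence scores; corpus-level F1 pools span counts over all sentences before computing one score), shows the gap on a toy example. The function names and data are hypothetical and not taken from the repo.

```python
from typing import List, Set, Tuple

Span = Tuple[int, int]


def f1_from_counts(overlap: int, n_pred: int, n_gold: int) -> float:
    """F1 from raw counts; returns 0.0 when there is no overlap."""
    if overlap == 0 or n_pred == 0 or n_gold == 0:
        return 0.0
    precision = overlap / n_pred
    recall = overlap / n_gold
    return 2 * precision * recall / (precision + recall)


def sentence_f1(preds: List[Set[Span]], golds: List[Set[Span]]) -> float:
    """Average of per-sentence F1 scores (every sentence weighs the same)."""
    scores = [f1_from_counts(len(p & g), len(p), len(g)) for p, g in zip(preds, golds)]
    return sum(scores) / len(scores)


def corpus_f1(preds: List[Set[Span]], golds: List[Set[Span]]) -> float:
    """Single F1 over pooled counts (long sentences weigh more)."""
    overlap = sum(len(p & g) for p, g in zip(preds, golds))
    n_pred = sum(len(p) for p in preds)
    n_gold = sum(len(g) for g in golds)
    return f1_from_counts(overlap, n_pred, n_gold)


# Toy data: a short sentence parsed perfectly and a longer one parsed poorly.
preds = [{(0, 1), (0, 2)}, {(0, 1), (0, 2), (0, 3), (0, 4)}]
golds = [{(0, 1), (0, 2)}, {(3, 4), (2, 4), (1, 4), (0, 4)}]
print("sentence-level F1:", sentence_f1(preds, golds))
print("corpus-level F1:", corpus_f1(preds, golds))
```

Because the longer sentence contributes more spans, the pooled score is pulled toward its lower accuracy, which is why it matters which of the two a paper reports.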

Citing XCFGs

If you use XCFGs in your research or wish to refer to the results in C-PCFGs, please use the following BibTeX entry.

@inproceedings{zhao-titov-2021-empirical,
    title = "An Empirical Study of Compound {PCFG}s",
    author = "Zhao, Yanpeng and Titov, Ivan",
    booktitle = "Proceedings of the Second Workshop on Domain Adaptation for NLP",
    month = apr,
    year = "2021",
    address = "Kyiv, Ukraine",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.adaptnlp-1.17",
    pages = "166--171",
}

Acknowledgements

batchify.py is borrowed from C-PCFGs.

License

MIT
