zhaoyanpeng / xcfg

Licence: other

X (weighted / probabilistic) Context-Free Grammars

XCFGs

This repo aims to unify all extensions of context-free grammars (XCFGs), where X stands for weighted, (compound) probabilistic, neural, and other extensions. Currently, only the data preprocessing module has been implemented.

Update (06/02/2022): Parse MSCOCO and Flickr30k captions, create data splits, and encode images for VC-PCFG.

Update (03/10/2021): Parallel Chinese-English data is supported.

Data

The repo handles WSJ, CTB, and SPMRL. Have a look at treebank.py.

If you are looking for the data used in C-PCFGs, follow the instructions in treebank.py and put all outputs in the same folder, say ./data.punct. The script only removes morphology features and creates data splits. To remove punctuation, use clean_tb.py. For example, I used python clean_tb.py ./data.punct ./data.clean, after which all the cleaned treebanks reside in ./data.clean. Then execute ./batchify.sh ./data.clean/ to produce all the data needed to reproduce the results in C-PCFGs. Feel free to change the parameters in batchify.sh if you want a different batch size or vocabulary size.
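To illustrate what punctuation removal means for a bracketed tree, here is a minimal sketch. It is not the code in clean_tb.py: the helper names (read_tree, remove_punct, to_string) and the PUNCT_TAGS set are assumptions made for this example, and the actual script may use a different tag set or pruning rule.

```python
# Assumed PTB-style punctuation POS tags; clean_tb.py may use a different set.
PUNCT_TAGS = {",", ".", ":", "``", "''", "-LRB-", "-RRB-"}


def read_tree(text):
    """Parse one bracketed tree into nested lists: [label, child, ...]."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def node():
        nonlocal pos
        assert tokens[pos] == "("
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:  # a word under a preterminal
                children.append(tokens[pos])
                pos += 1
        pos += 1  # skip the closing ")"
        return [label] + children

    return node()


def remove_punct(tree):
    """Drop punctuation preterminals and any constituent left empty."""
    label, children = tree[0], tree[1:]
    if len(children) == 1 and isinstance(children[0], str):
        return None if label in PUNCT_TAGS else tree
    kept = [c for c in map(remove_punct, children) if c is not None]
    return [label] + kept if kept else None


def to_string(tree):
    """Serialize nested lists back into a bracketed string."""
    if isinstance(tree, str):
        return tree
    return "(" + " ".join(to_string(x) for x in tree) + ")"


gold = "(S (NP (DT The) (NN cat)) (VP (VBD sat)) (. .))"
print(to_string(remove_punct(read_tree(gold))))
# (S (NP (DT The) (NN cat)) (VP (VBD sat)))
```

Constituents that contain only punctuation are pruned together with their leaves, so no empty brackets remain in the output.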

Evaluation

To ease evaluation I represent a gold tree as a tuple:

TREE: TUPLE(sentence: STR, spans: LIST[SPAN], span_labels: LIST[STR], pos_tags: LIST[STR])
SPAN: TUPLE(left_boundary: INT, right_boundary: INT)

If you have followed the instructions in the last section, you can run ./binarize.sh ./data.clean/ to convert gold trees into the tuple representation.
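As a rough illustration of the tuple representation, the sketch below converts one bracketed tree into a (sentence, spans, span_labels, pos_tags) tuple. The function name tree_to_tuple is hypothetical, and two conventions are assumptions of this example rather than facts about binarize.sh: span boundaries are inclusive word indices, and single-word constituents are skipped.

```python
def tree_to_tuple(text):
    """Convert a bracketed tree into (sentence, spans, span_labels, pos_tags).

    Assumptions: boundaries are inclusive word indices, and constituents
    covering a single word are not recorded as spans.
    """
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    words, pos_tags, spans, span_labels = [], [], [], []
    stack = []  # open constituents: (label, index of first covered word)
    i = 0
    while i < len(tokens):
        if tokens[i] == "(":
            label = tokens[i + 1]
            if tokens[i + 2] != "(":  # preterminal: ( TAG word )
                pos_tags.append(label)
                words.append(tokens[i + 2])
                i += 4
            else:  # phrase-level constituent opens here
                stack.append((label, len(words)))
                i += 2
        else:  # ")" closes the most recently opened constituent
            label, left = stack.pop()
            right = len(words) - 1
            if right > left:
                spans.append((left, right))
                span_labels.append(label)
            i += 1
    return (" ".join(words), spans, span_labels, pos_tags)


gold = "(S (NP (DT The) (NN cat)) (VP (VBD sat) (PP (IN on) (NP (DT the) (NN mat)))))"
sentence, spans, span_labels, pos_tags = tree_to_tuple(gold)
print(sentence)     # The cat sat on the mat
print(spans)        # [(0, 1), (4, 5), (3, 5), (2, 5), (0, 5)]
print(span_labels)  # ['NP', 'NP', 'PP', 'VP', 'S']
print(pos_tags)     # ['DT', 'NN', 'VBD', 'IN', 'DT', 'NN']
```

Spans appear in the order their constituents close, so spans[i] and span_labels[i] always describe the same constituent.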

Trivial baselines

Even for trivial baselines, e.g., left- and right-branching trees, you may find different F1 numbers in the literature on grammar induction, partly because the authors used (slightly) different procedures for data preprocessing. To encourage a truly fair comparison, I have also released a standard procedure in baseline.py. Hopefully, this will help with the situation.

Model	WSJ	CTB	Basque	German	French	Hebrew	Hungarian	Korean	Polish	Swedish
LB	8.7	7.2	17.9	10.0	5.7	8.5	13.3	18.5	10.9	8.4
RB	39.5	25.5	15.4	14.7	26.4	30.0	12.7	19.2	34.2	30.4
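To make the two baselines concrete, here is a minimal sketch of how left-branching (LB) and right-branching (RB) spans can be generated and scored against gold spans with unlabeled F1. It is not the procedure in baseline.py: the function names are hypothetical, spans use inclusive boundaries by assumption, and the whole-sentence span is kept here, whereas some evaluation setups discard it.

```python
def left_branching(n):
    """Spans of (((w0 w1) w2) ...) over n words, inclusive boundaries."""
    return [(0, right) for right in range(1, n)]


def right_branching(n):
    """Spans of (w0 (w1 (w2 ...))) over n words, inclusive boundaries."""
    return [(left, n - 1) for left in range(n - 1)]


def span_f1(pred, gold):
    """Unlabeled F1 between two collections of spans."""
    pred, gold = set(pred), set(gold)
    tp = len(pred & gold)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(gold)
    return 2 * precision * recall / (precision + recall)


# Gold spans of "The cat sat on the mat" with an NP subject and a VP > PP > NP.
gold = [(0, 1), (4, 5), (3, 5), (2, 5), (0, 5)]
print(left_branching(6))   # [(0, 1), (0, 2), (0, 3), (0, 4), (0, 5)]
print(right_branching(6))  # [(0, 5), (1, 5), (2, 5), (3, 5), (4, 5)]
print(round(span_f1(left_branching(6), gold), 2))   # 0.4
print(round(span_f1(right_branching(6), gold), 2))  # 0.8
```

On this English example RB matches four of the five gold spans and LB only two. Choices such as whether to count the whole-sentence span are exactly the kind of preprocessing detail that can shift reported baseline numbers.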

An evaluation checklist for phrase-structure grammar induction

Below is a comparison of several critical training / evaluation settings of recent unsupervised parsing models.

Model	Sent. F1	Corpus F1	Variance	Word repr.	Punct. rm	Dataset
PRPN	✓			RAW	✓	WSJ
ON	✓			RAW	✓	WSJ
DIORA	✓			ELMo		WSJ
URNNG	✓			RAW	✗	WSJ
N-PCFG	✓			RAW	✓	WSJ / CTB
C-PCFG	✓			RAW	✓	WSJ / CTB
VG-NSL	✓		✓	RAW / FastText	✗	MSCOCO
LN-PCFG	✓			RAW		WSJ
CT	✓			RoBERTa		WSJ
S-DIORA	✓			ELMo		WSJ
VC-PCFG	✓	✓	✓	RAW	✓	MSCOCO
C-PCFG (Zhao 2020)	✓	✓	✓	RAW	✓	WSJ / CTB / SPMRL
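The Sent. F1 and Corpus F1 columns refer to two ways of aggregating span-level scores, and they can give different numbers on the same predictions. The sketch below uses the usual definitions (sentence-level F1 averages per-sentence F1 scores; corpus-level F1 is computed from counts pooled over all sentences). The function names and toy spans are made up for illustration and are not taken from any of the listed models.

```python
def f1(tp, n_pred, n_gold):
    """F1 from true-positive, predicted, and gold span counts."""
    if tp == 0:
        return 0.0
    precision, recall = tp / n_pred, tp / n_gold
    return 2 * precision * recall / (precision + recall)


def sentence_f1(preds, golds):
    """Average of per-sentence F1 scores."""
    scores = [
        f1(len(set(p) & set(g)), len(set(p)), len(set(g)))
        for p, g in zip(preds, golds)
    ]
    return sum(scores) / len(scores)


def corpus_f1(preds, golds):
    """F1 computed from counts pooled over the whole corpus."""
    tp = sum(len(set(p) & set(g)) for p, g in zip(preds, golds))
    n_pred = sum(len(set(p)) for p in preds)
    n_gold = sum(len(set(g)) for g in golds)
    return f1(tp, n_pred, n_gold)


# Toy corpus: a short sentence parsed perfectly, a longer one parsed poorly.
preds = [[(0, 1)], [(0, 1), (0, 2), (0, 3)]]
golds = [[(0, 1)], [(1, 3), (2, 3), (0, 3)]]
print(round(sentence_f1(preds, golds), 3))  # 0.667
print(round(corpus_f1(preds, golds), 3))    # 0.5
```

Sentence-level F1 weights every sentence equally, so short, easy sentences raise the score; corpus-level F1 weights every span equally, so longer sentences dominate. This is why the checklist records which of the two a paper reports.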

Citing XCFGs

If you use XCFGs in your research or wish to refer to the results in C-PCFGs, please use the following BibTeX entry.

@inproceedings{zhao-titov-2021-empirical,
    title = "An Empirical Study of Compound {PCFG}s",
    author = "Zhao, Yanpeng and Titov, Ivan",
    booktitle = "Proceedings of the Second Workshop on Domain Adaptation for NLP",
    month = apr,
    year = "2021",
    address = "Kyiv, Ukraine",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.adaptnlp-1.17",
    pages = "166--171",
}

Acknowledgements

batchify.py is borrowed from C-PCFGs.

License

MIT
