clab / language-universal-parser

License: Apache-2.0

Dependencies

  • boost-1.60.0
  • eigen (hg clone https://bitbucket.org/eigen/eigen)

How to use?

# setup repository #
cd
mkdir git ; cd git/
git clone git@github.com:clab/language-universal-parser.git
cd language-universal-parser
git submodule init
git submodule update
cd dynet
git pull origin master
cd ../

# build the parser (with latest version of dynet) #
cd ~/git/language-universal-parser/dynet
git pull origin master
cd .. ; mkdir build-gpu ; cd build-gpu
cmake -DEIGEN3_INCLUDE_DIR=$EIGEN_ROOT ..  # -DBACKEND=cuda is not supported just yet
make -j 10

# train the parser on small data #
~/git/language-universal-parser/build-gpu/parser/lstm-parse --train -P \
  --training_data $TRAIN_ARCSTD \
  --dev_data $DEV_ARCSTD \
  --pretrained_dim 50 \
  --pretrained $PRETRAINED_EMBEDDINGS \
  --brown_clusters $PRETRAINED_CLUSTERS \
  --epochs 1

How to generate arc-standard transitions?

The parser expects projective treebanks encoded as arc-standard transition sequences (see the commands below). To convert non-projective treebanks in CoNLL 2006 format, first make them pseudo-projective with MaltParser, then generate the arc-standard oracle files:

java -jar maltparser-1.8.1.jar -c pproj -m proj -i $split_lc -o $split_projective -pp baseline
java -jar ParserOracleArcStd.jar -t -1 -l 1 -c treebank.conll -i treebank.conll > treebank.arcstd
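
To illustrate what an arc-standard oracle computes, here is a minimal Python sketch that derives a transition sequence from the gold heads of a projective sentence. It is not the implementation in ParserOracleArcStd.jar, and it does not reproduce that tool's output format. The function name, the unlabeled transition names, and the choice to place the artificial root (index 0) at the bottom of the stack are assumptions made for illustration.

```python
def arc_standard_oracle(heads):
    """Return an unlabeled arc-standard transition sequence.

    heads[i] is the gold head of token i+1 (tokens are 1-based);
    0 denotes the artificial root. The tree must be projective.
    """
    n = len(heads)
    head = [None] + list(heads)      # head[t] for token t in 1..n
    pending = [0] * (n + 1)          # dependents not yet attached
    for h in heads:
        pending[h] += 1

    stack = [0]                      # assumption: root starts on the stack
    next_token = 1                   # front of the buffer
    transitions = []

    while next_token <= n or len(stack) > 1:
        if len(stack) >= 2:
            top, second = stack[-1], stack[-2]
            # LEFT-ARC: the second item becomes a dependent of the top item.
            if second != 0 and head[second] == top and pending[second] == 0:
                transitions.append("LEFT-ARC")
                stack.pop(-2)
                pending[top] -= 1
                continue
            # RIGHT-ARC: the top item becomes a dependent of the second item,
            # but only once the top item has collected all its dependents.
            if head[top] == second and pending[top] == 0:
                transitions.append("RIGHT-ARC")
                stack.pop()
                pending[second] -= 1
                continue
        if next_token > n:
            raise ValueError("no valid transition: tree is not projective")
        transitions.append("SHIFT")
        stack.append(next_token)
        next_token += 1

    return transitions


if __name__ == "__main__":
    # "She reads books": She <- reads, reads <- ROOT, books <- reads
    print(arc_standard_oracle([2, 0, 2]))
```

The sketch raises an error on non-projective input, which is why the treebank has to be projectivized before the oracle is run.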

We recommend that you lowercase word tokens/types in all input files (e.g., pretrained embeddings, Brown clusters, train/dev/test treebanks) before calling the parser.
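
One possible way to lowercase a treebank is sketched below in Python. It assumes tab-separated CoNLL 2006 lines in which the word form is the second column, and it leaves all other columns untouched. The function name is hypothetical. Embedding and cluster files would need a similar pass adapted to their own layouts.

```python
def lowercase_conll_forms(lines):
    """Lowercase the FORM column (second field) of CoNLL-style lines.

    Blank lines, which separate sentences, are preserved as empty strings.
    """
    result = []
    for line in lines:
        text = line.rstrip("\n")
        if not text.strip():
            result.append("")            # keep the sentence boundary
            continue
        columns = text.split("\t")
        if len(columns) > 1:
            columns[1] = columns[1].lower()
        result.append("\t".join(columns))
    return result


if __name__ == "__main__":
    sample = ["1\tThe\tthe\tDT\tDT\t_\t2\tNMOD\t_\t_", ""]
    for row in lowercase_conll_forms(sample):
        print(row)
```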

Language typology embeddings

To enable language typology embeddings, pass the command-line argument --typological_properties typology_file. Sample typology files are provided in the subdirectory typological_properties/. When typology embeddings are enabled, prefix each word in the input files with a two-letter language code followed by a colon (e.g., en:book instead of book). The prefix must match the first field in the typology file.
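
The prefixing step could be scripted roughly as follows. This Python sketch assumes tab-separated CoNLL-style input with the word form in the second column. Both function names are hypothetical, and the language code you pass must be one that appears in your typology file.

```python
def add_language_prefix(word, lang):
    """Return the word with a two-letter language prefix, e.g. en:book."""
    if len(lang) != 2 or not lang.isalpha():
        raise ValueError("expected a two-letter language code, got %r" % lang)
    return lang + ":" + word


def prefix_conll_forms(lines, lang):
    """Prefix the FORM column (second field) of every token line."""
    result = []
    for line in lines:
        text = line.rstrip("\n")
        if not text.strip():
            result.append("")            # sentence boundary
            continue
        columns = text.split("\t")
        if len(columns) > 1:
            columns[1] = add_language_prefix(columns[1], lang)
        result.append("\t".join(columns))
    return result


if __name__ == "__main__":
    print(add_language_prefix("book", "en"))
```

Apply the same prefix to the matching entries in the pretrained embeddings and Brown clusters so that the vocabulary stays consistent across all input files.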

What to cite?

Waleed Ammar, George Mulcaire, Miguel Ballesteros, Chris Dyer, and Noah A. Smith. Many Languages, One Parser. TACL 2016 (to appear).
