strubell / preprocess-conll05

Licence: other
Scripts for preprocessing the CoNLL-2005 SRL dataset.

Programming Languages

shell
python

Projects that are alternatives of or similar to preprocess-conll05

cmip6 preprocessing
Analysis ready CMIP6 data in python the easy way with pangeo tools.
Stars: ✭ 126 (+641.18%)
Mutual labels:  preprocessing
oxygenjs
A JavaScript library for numerical computing and machine learning
Stars: ✭ 13 (-23.53%)
Mutual labels:  preprocessing
skippa
SciKIt-learn Pipeline in PAndas
Stars: ✭ 33 (+94.12%)
Mutual labels:  preprocessing
indic nlp resources
Resources to go with the Indic NLP Library
Stars: ✭ 55 (+223.53%)
Mutual labels:  nlp-resources
sparklanes
A lightweight data processing framework for Apache Spark
Stars: ✭ 17 (+0%)
Mutual labels:  preprocessing
Preprocessing-Method-for-STEMI-Detection
Official source code of "Preprocessing Method for Performance Enhancement in CNN-based STEMI Detection from 12-lead ECG"
Stars: ✭ 12 (-29.41%)
Mutual labels:  preprocessing
SeqTools
A python library to manipulate and transform indexable data (lists, arrays, ...)
Stars: ✭ 42 (+147.06%)
Mutual labels:  preprocessing
BrainPrep
Preprocessing pipeline on Brain MR Images through FSL and ANTs, including registration, skull-stripping, bias field correction, enhancement and segmentation.
Stars: ✭ 107 (+529.41%)
Mutual labels:  preprocessing
TextDatasetCleaner
🔬 Cleaning junk out of datasets (normalization, preprocessing)
Stars: ✭ 27 (+58.82%)
Mutual labels:  preprocessing
tweets-preprocessor
Repo containing the Twitter preprocessor module, developed by the AUTH OSWinds team
Stars: ✭ 26 (+52.94%)
Mutual labels:  preprocessing
contextualSpellCheck
✔️Contextual word checker for better suggestions
Stars: ✭ 274 (+1511.76%)
Mutual labels:  preprocessing
MLLabelUtils.jl
Utility package for working with classification targets and label-encodings
Stars: ✭ 30 (+76.47%)
Mutual labels:  preprocessing
veridical-flow
Making it easier to build stable, trustworthy data-science pipelines.
Stars: ✭ 28 (+64.71%)
Mutual labels:  preprocessing
minie
An open information extraction system that provides compact extractions
Stars: ✭ 83 (+388.24%)
Mutual labels:  nlp-resources
pywedge
Makes Interactive Chart Widget, Cleans raw data, Runs baseline models, Interactive hyperparameter tuning & tracking
Stars: ✭ 49 (+188.24%)
Mutual labels:  preprocessing
NLP-Natural-Language-Processing
Projects and useful articles / links
Stars: ✭ 149 (+776.47%)
Mutual labels:  nlp-resources
nlp-notebooks
A collection of natural language processing notebooks.
Stars: ✭ 19 (+11.76%)
Mutual labels:  nlp-resources
postcss-each
PostCSS plugin to iterate through values
Stars: ✭ 93 (+447.06%)
Mutual labels:  preprocessing
dropEst
Pipeline for initial analysis of droplet-based single-cell RNA-seq data
Stars: ✭ 71 (+317.65%)
Mutual labels:  preprocessing
text-normalizer
Normalize text string
Stars: ✭ 12 (-29.41%)
Mutual labels:  preprocessing

preprocess-conll05

Scripts for preprocessing the CoNLL-2005 SRL dataset.

Requirements:

Basic CoNLL-2005 pre-processing

These pre-processing steps download the CoNLL-2005 data and gather gold part-of-speech tags and parse information from your copy of the Penn Treebank (PTB). The output will look like:

The         DT    (S(NP-SBJ-1(NP*  *    -   -      (A1*      
economy     NN    *                *    -   -      *      
's          POS   *)               *    -   -      *      
temperature NN    *)               *    -   -      *)     
will        MD    (VP*             *    -   -      (AM-MOD*)     
be          VB    (VP*             *    -   -      *      
taken       VBN   (VP*             *    01  take   (V*) 
  • Field 1: word form
  • Field 2: gold part-of-speech tag
  • Field 3: gold syntax
  • Field 4: placeholder
  • Field 5: verb sense
  • Field 6: predicate (infinitive form)
  • Field 7+: for each predicate, a column representing the labeled arguments of the predicate.
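
The columns are whitespace-delimited, so a reader for this format is a simple split. A minimal sketch (the function and field names here are illustrative, not part of this repo):

```python
def parse_basic_line(line):
    """Split one whitespace-delimited token line of the basic
    CoNLL-2005 format into named fields (names are illustrative)."""
    cols = line.split()
    return {
        "word": cols[0],        # field 1: word form
        "gold_pos": cols[1],    # field 2: gold part-of-speech tag
        "gold_parse": cols[2],  # field 3: gold syntax (bracketed constituents)
        "placeholder": cols[3], # field 4: placeholder
        "sense": cols[4],       # field 5: verb sense ('-' if not a predicate)
        "predicate": cols[5],   # field 6: infinitive form ('-' if not a predicate)
        "args": cols[6:],       # fields 7+: one argument column per predicate
    }

tok = parse_basic_line("taken  VBN  (VP*  *  01  take  (V*)")
print(tok["word"], tok["predicate"], tok["args"])  # taken take ['(V*)']
```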

First, set up paths to existing data:

export WSJ="/your/path/to/wsj/"
export BROWN="/your/path/to/brown"

Download CoNLL-2005 data and scripts:

./bin/basic/get_data.sh

Extract pos/parse info from gold data:

./bin/basic/extract_train_from_ptb.sh
./bin/basic/extract_dev_from_ptb.sh
./bin/basic/extract_test_from_ptb.sh
./bin/basic/extract_test_from_brown.sh

Format into combined output files:

./bin/basic/make-trainset.sh
./bin/basic/make-devset.sh 
./bin/basic/make-wsj-test.sh
./bin/basic/make-brown-test.sh 

Further pre-processing (e.g. for LISA)

Sometimes it's useful to convert the constituency parses to dependency parses and to provide automatic part-of-speech tags, e.g. if you wish to train a parsing model. BIO format is also a more standard way of representing spans than the default CoNLL-2005 format. This pre-processing converts the constituency parses to Stanford dependencies (v3.5), assigns automatic part-of-speech tags from the Stanford left3words tagger, and converts SRL spans to BIO format. The output will look like:

conll05 0       0       The         DT      DT      2       det         _       -       -       -       -       O       B-A1
conll05 0       1       economy     NN      NN      4       poss        _       -       -       -       -       O       I-A1
conll05 0       2       's          POS     POS     2       possessive  _       -       -       -       -       O       I-A1
conll05 0       3       temperature NN      NN      7       nsubjpass   _       -       -       -       -       O       I-A1
conll05 0       4       will        MD      MD      7       aux         _       -       -       -       -       O       B-AM-MOD
conll05 0       5       be          VB      VB      7       auxpass     _       -       -       -       -       O       O
conll05 0       6       taken       VBN     VBN     0       root        _       01      take    -       -       O       B-V
  • Field 1: domain placeholder
  • Field 2: sentence id
  • Field 3: token id
  • Field 4: word form
  • Field 5: gold part-of-speech tag
  • Field 6: auto part-of-speech tag
  • Field 7: dependency parse head
  • Field 8: dependency parse label
  • Field 9: placeholder
  • Field 10: verb sense
  • Field 11: predicate (infinitive form)
  • Field 12: placeholder
  • Field 13: placeholder
  • Field 14: NER placeholder
  • Field 15+: for each predicate, a column representing the labeled arguments of the predicate.
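
A similar illustrative sketch for reading one token line of this extended format (0-based column indices follow the 1-based field list above; names are mine, not the repo's):

```python
def parse_extended_line(line):
    """Parse one token line of the extended format described above.

    Field names are illustrative; columns 15+ hold one SRL column
    per predicate in the sentence.
    """
    c = line.split()
    return {
        "domain": c[0],        # field 1: domain placeholder
        "sent_id": int(c[1]),  # field 2: sentence id
        "tok_id": int(c[2]),   # field 3: token id
        "word": c[3],          # field 4: word form
        "gold_pos": c[4],      # field 5: gold part-of-speech tag
        "auto_pos": c[5],      # field 6: auto part-of-speech tag
        "head": int(c[6]),     # field 7: dependency parse head (0 = root)
        "deprel": c[7],        # field 8: dependency parse label
        "sense": c[9],         # field 10: verb sense
        "predicate": c[10],    # field 11: infinitive form
        "srl": c[14:],         # fields 15+: one argument column per predicate
    }

tok = parse_extended_line(
    "conll05 0 6 taken VBN VBN 0 root _ 01 take - - O B-V")
print(tok["word"], tok["head"], tok["deprel"], tok["srl"])  # taken 0 root ['B-V']
```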

First, set up paths to Stanford parser and part-of-speech tagger:

export STANFORD_PARSER="/your/path/to/stanford-parser-full-2017-06-09"
export STANFORD_POS="/your/path/to/stanford-postagger-full-2017-06-09"

The following script will then convert the parses to dependencies, run the tagger, and reformat the data, creating a new file in the $CONLL05 directory with the same name as the input plus the suffix .parse.sdeps.combined. If $CONLL05 is not set, set it to the conll05st-release directory first.

./bin/preprocess_conll05_sdeps.sh $CONLL05/train-set.gz
./bin/preprocess_conll05_sdeps.sh $CONLL05/dev-set.gz
./bin/preprocess_conll05_sdeps.sh $CONLL05/test.wsj.gz
./bin/preprocess_conll05_sdeps.sh $CONLL05/test.brown.gz

Now all that remains is to convert fields to BIO format. The following script will create a new file in the same directory as the old file with the suffix .bio:

./bin/convert-bio.sh $CONLL05/train-set.gz.parse.sdeps.combined
./bin/convert-bio.sh $CONLL05/dev-set.gz.parse.sdeps.combined
./bin/convert-bio.sh $CONLL05/test.wsj.gz.parse.sdeps.combined
./bin/convert-bio.sh $CONLL05/test.brown.gz.parse.sdeps.combined
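
In the original format, argument columns mark spans with brackets: "(A1*" opens an A1 span, "*)" closes the current span, and a single-token span looks like "(V*)". A toy sketch of that conversion, under those assumptions (not the actual convert-bio.sh implementation):

```python
def spans_to_bio(column):
    """Convert one CoNLL-2005 argument column (bracketed spans, one cell
    per token) to a list of BIO tags. Assumes spans do not nest."""
    bio, label = [], None
    for cell in column:
        if cell.startswith("("):
            label = cell[1:].rstrip("*)")   # e.g. '(A1*' -> 'A1'
            bio.append("B-" + label)
        elif label is not None:
            bio.append("I-" + label)        # inside an open span
        else:
            bio.append("O")                 # outside any span
        if cell.endswith(")"):
            label = None                    # span closed on this token
    return bio

# Argument column for the sample sentence above ("The economy 's
# temperature will be taken"):
col = ["(A1*", "*", "*", "*)", "(AM-MOD*)", "*", "(V*)"]
print(spans_to_bio(col))
```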

You may also want to generate a matrix of transition probabilities for performing Viterbi inference at test time. You can use the following to do so:

python3 bin/compute_transition_probs.py --in_file_name $CONLL05/train-set.gz.parse.sdeps.combined.bio > $CONLL05/transition_probs.tsv
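
For intuition, such a transition matrix is essentially bigram counts over adjacent BIO labels, normalized per previous label. A toy sketch under that assumption (the repo's script may differ in smoothing and output details):

```python
from collections import Counter, defaultdict

def transition_probs(sequences):
    """Estimate P(next_label | prev_label) from BIO label sequences
    by counting adjacent label pairs and normalizing each row."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1
    probs = {}
    for prev, ctr in counts.items():
        total = sum(ctr.values())
        probs[prev] = {nxt: c / total for nxt, c in ctr.items()}
    return probs

probs = transition_probs([
    ["B-A1", "I-A1", "O", "B-V"],
    ["B-A1", "O", "O", "B-V"],
])
print(probs["B-A1"])  # {'I-A1': 0.5, 'O': 0.5}
```

At decode time, Viterbi inference can use these probabilities to rule out invalid transitions such as O followed by I-A1.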

Pre-processing for evaluation scripts

To evaluate using the CoNLL eval.pl and srl-eval.pl scripts, you'll need gold files in the formats those scripts expect. To generate the file for parse evaluation (eval.pl), use the following script:

python3 bin/eval/extract_conll_parse_file.py --input_file $CONLL05/dev-set.gz.parse.sdeps.combined --id_field 2 --word_field 3 --pos_field 4 --head_field 6 --label_field 7 > $CONLL05/conll2005-dev-gold-parse.txt

For SRL evaluation, use the following:

python3 bin/eval/extract_conll_prop_file.py --input_file $CONLL05/dev-set.gz.parse.sdeps.combined --take_last --word_field 3 --pred_field 10 --first_prop_field 14 > $CONLL05/conll2005-dev-gold-props.txt