StructMineDataPipeline

Data Processing Pipeline for StructMine Tools: CoType, PLE, AFET.

Description

It generates the train and test JSON files and the Brown cluster file that the three information extraction models above take as input. Each line of a JSON file holds the information for one sentence, including its entity mentions, relation mentions, etc. (an illustrative line is sketched after the input list below). To generate these JSON files, you need to provide the following input files (examples are included in the ./data folder):

Training:

  1. Freebase files (download from here (8 GB) and put the unzipped freebase folder alongside the code/ and data/ folders)

    • The freebase folder should contain the following files (hypothetical sample lines are sketched after the training-input list below):

      freebase-facts.txt (relation triples in the format of 'id of entity 1, relation type, id of entity 2');

      freebase-mid-name.map (entity id to name map in the format of 'entity id, entity surface name');

      freebase-mid-type.map (entity id to type map in the format of 'entity id, entity type').

  2. Raw training corpus file (one document per line)

  3. Entity and relation mention type mapping files, which map Freebase type names to target type names
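
To make the expected formats concrete, here are hypothetical sample lines for the Freebase files and an entity type mapping file. The entity IDs, type names, and delimiters are illustrative assumptions; check the downloaded files and the examples in ./data for the exact formats:

  freebase-facts.txt:     m.02mjmr  /people/person/place_of_birth  m.03gh4
  freebase-mid-name.map:  m.02mjmr  Barack Obama
  freebase-mid-type.map:  m.02mjmr  /people/person
  emTypeMap.txt:          /people/person  person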

Test:

  1. Raw test corpus file (one document per line)
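
For orientation, a generated line might look roughly like the following. The field names and layout here are illustrative assumptions, not the pipeline's guaranteed schema; inspect the generated files for the authoritative format:

  {"articleId": 0, "sentId": 1, "sentText": "Barack Obama was born in Honolulu .", "entityMentions": [{"start": 0, "end": 2, "text": "Barack Obama", "label": "person"}], "relationMentions": [{"em1Text": "Barack Obama", "em2Text": "Honolulu", "label": "/people/person/place_of_birth"}]}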

Dependencies

We use Ubuntu as an example.

  • Python 2.7
  • Python library dependencies:
$ pip install nltk
$ cd code/
$ git clone [email protected]:stanfordnlp/stanza.git
$ cd stanza
$ pip install -e .
$ cd ..    # back into code/, so CoreNLP unzips where the run script expects it
$ wget http://nlp.stanford.edu/software/stanford-corenlp-full-2016-10-31.zip
$ unzip stanford-corenlp-full-2016-10-31.zip
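
As a quick sanity check (our suggestion, not part of the original setup), verify that the Python dependencies import cleanly:

$ python -c "import nltk, stanza; print('dependencies ok')"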

Example Run

Run CoTypeDataProcessing to generate the JSON input files of CoType for the example training and test raw corpora. Start the CoreNLP server first (it keeps running, so launch it in a separate terminal or background it), then run the script:

$ java -mx4g -cp "code/stanford-corenlp-full-2016-10-31/*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer
$ ./getInputJsonFile.sh  

Our example data files are located in the ./data folder. After running the above commands, you should see three files generated in the same folder: train.json, test.json, and brown (the Brown cluster file).
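
If you want to consume the generated files programmatically, a minimal sketch like the following works, assuming the one-JSON-object-per-line layout described above ('entityMentions' is an assumed field name, not guaranteed by the pipeline):

import json

# Each line of ./data/train.json is one standalone JSON object for a sentence.
with open('./data/train.json') as f:
    for line in f:
        sentence = json.loads(line)
        # 'entityMentions' is an assumed field name; print sentence.keys()
        # to discover the actual schema the pipeline produces.
        for mention in sentence.get('entityMentions', []):
            print(mention)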

Parameters - getInputJsonFile.sh

Raw train and test files to run on.

inTrainFile='./data/documents.txt'
inTestFile='./data/test.txt'

Output files (input json files for CoType, PLE, AFET).

outTrainFile='./data/train.json'
outTestFile='./data/test.json'
bcOutFile='./data/brown'

Directory path of freebase files:

freebaseDir='./freebase'

Mention type(s) to generate. You can generate entity mentions only, relation mentions only, or both. Set the value to 'em', 'rm', or 'both'.

mentionType='both'

Target mention type mapping files.

emTypeMapFile='./data/emTypeMap.txt'
rmTypeMapFile='./data/rmTypeMap.txt' # leave empty if only entity mentions are needed

Parsing tool used for sentence splitting, tokenization, entity mention detection, etc. It can be 'nltk' or 'stanford'.

parseTool='stanford'

Set this parameter to true if you already have a pretrained model and only need to generate the test JSON file.

testOnly=false
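
For example, to generate entity mentions only with the NLTK parser (a hypothetical variant of the settings above), the parameters would be set as follows:

mentionType='em'
parseTool='nltk'
rmTypeMapFile=''   # not needed when only entity mentions are generated
testOnly=false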