nasa-jpl-memex / memex-gate

License: Apache-2.0
General Architecture for Text Engineering

Programming Languages

Python, Shell, Java

MemexGATE

Introduction... what is it?

A server-side application and environment for running large-scale General Architecture for Text Engineering (GATE) tasks over document resources such as online ads, debarment information, federal and district court appeals, press releases, news articles, social media streams, etc. MemexGATE runs in conjunction with Behemoth to provide an annotation-based implementation of document corpora and a number of modules that operate on those documents. The project can be used to simplify the deployment of document analysers at large scale.

Features

  • ingesting from common data sources (WARC, Nutch segments, etc.), raw files (PDF, MS Word, Excel, etc.) and Hadoop Sequence Files
  • text processing (Apache Tika, Apache UIMA, GATE, Language Identification) such as tokenization, sentence splitting, part-of-speech tagging, etc.
  • named entity recognition e.g. identification and explicit featurization of proper names, people, locations, organizations, date/time expressions, measures (percent, money, weight), email addresses, U.S. business addresses, U.S. district attorney names, U.S. Federal and State Judges, U.S. Districts, U.S. Courts, dates, websites, ages, genders, legal lexicon, etc.
  • classification of named entities into predefined categories of interest
  • shallow parsing of entities present within taxonomies or lexicon of terms e.g. legal lexicon
  • generating output for external tools (Apache Solr, Elasticsearch, Mahout)

Use Cases

  • Scrape all court documents from prosecution offices (at Federal and State level) and determine, based on terminology used in the releases, how many cases of a particular nature/type are being brought before the court(s).
  • Scrape all press releases from prosecution offices (at Federal and State level) and determine, based on terminology used in the releases, how many cases of a particular type are being brought forward.
  • Based on domain research and use of domain-specific entities, define features (via Natural Language Processing and/or Named Entity Recognition) and make them searchable for researchers and investigators alike.
  • Advance the ability to visualize connections between ads, debarment information, court documents, press releases, etc.

This tool relies heavily on the GATE software. GATE is an acronym for General Architecture for Text Engineering. Please see below for all of the steps required to use the software. The document corpora I've made available can be used with the MemexGATE application to do interesting things with legal documents, such as:

  • natural language processing e.g. tokenization, sentence splitting, part-of-speech tagging, etc.
  • named entity recognition e.g. identification of proper names, people, locations, organizations, date/time expressions, measures (percent, money, weight), email addresses, business addresses, etc.
  • classification of named entities into predefined categories of interest
  • shallow parsing of entities present within taxonomies or lexicon of terms e.g. legal lexicon
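
The lexicon-driven use cases above (counting how many documents point to a particular type of case) can be illustrated without any of the MemexGATE machinery. The following is a minimal plain-Python sketch of a gazetteer-style lookup; it is not how GATE or MemexGATE implements lexicon matching, and the lexicon entries, the function name `count_case_types` and the sample texts are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical mini-lexicon: case type -> terms that signal it.
# These entries are illustrative only; they are not taken from MemexGATE.
LEGAL_LEXICON = {
    "fraud": ["wire fraud", "mail fraud", "securities fraud"],
    "trafficking": ["human trafficking", "sex trafficking"],
}


def count_case_types(documents, lexicon=LEGAL_LEXICON):
    """Count how many documents mention at least one term for each case type."""
    counts = Counter()
    for text in documents:
        lowered = text.lower()
        for case_type, terms in lexicon.items():
            # Word-boundary match so a term does not fire inside a longer word.
            if any(re.search(r"\b" + re.escape(term) + r"\b", lowered) for term in terms):
                counts[case_type] += 1
    return counts


if __name__ == "__main__":
    releases = [
        "Defendant pleaded guilty to wire fraud and mail fraud.",
        "Two charged in human trafficking conspiracy.",
        "Court sets hearing date.",
    ]
    print(dict(count_case_types(releases)))
```

Each document is counted at most once per case type, so the totals approximate "number of releases about X" rather than "number of times X was mentioned". A real pipeline would take these counts from the annotations GATE produces instead of from raw string matching.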

Dockerfile

MemexGATE is available on Docker Hub for rapid deployment and prototyping of textual document engineering and processing pipelines. To get the MemexGATE application and environment, make sure you have Docker installed, then run:

$ docker pull lewismc/memex-gate
$ docker run -t -i lewismc/memex-gate /bin/bash

N.B. If you are on Mac OS X you may need to run the following two commands first:
$ boot2docker start
$ $(boot2docker shellinit)

You will now be inside an environment with all of the tools required to run MemexGATE, namely Hadoop 2.2.0, Mahout 0.10.0, Tika 1.9, GATE 8.1, etc. You can run MemexGATE as follows:

root@e4e137838adc:/usr/local# memexgate
   _____                                 ________    ___________________________
  /     \   ____   _____   ____ ___  ___/  _____/   /  _  \__    ___/\_   _____/
 /  \ /  \_/ __ \ /     \_/ __ \  \/  /   \  ___  /  /_\  \|    |    |    __)_
/    Y    \  ___/|  Y Y  \  ___/ >    <\    \_\  \/    |    \    |    |  v0.1  \
\____|__  /\___  >__|_|  /\___  >__/\_ \______  /\____|__  /____|   /_______  /
        \/     \/      \/     \/      \/       \/         \/                 \/
Server side framework for large scale General Architecture Text Engineering tasks.
Usage: run COMMAND
where COMMAND is one of:
  ioWarc           load documents from WARC
  ioNutch          load documents from Nutch segment(s)
  ioHadoop         load documents from Hadoop Sequence files
  importer         generate a SequenceFile containing BehemothDocuments given a directory of raw docs
  reader           read and inspect document corpus
  exporter         read and execute intermediate document extraction creating new corpus
  filter           filter documents and create new corpus
  gate             process documents using MemexGATE apps
  tika             parse documents using Tika
  uima             process documents using UIMA
  mahout           generate vectors for clustering with Mahout
  solr             send documents to Solr for indexing
  elastic          send documents to ElasticSearch for indexing
  language-id      identify the language of documents
Most commands print help when invoked w/o parameters.

Prerequisites for Manual Installation

Installation

Very little installation is required to run MemexGATE beyond provisioning your Hadoop node/cluster and then installing Behemoth, as stated in the prerequisites above. MemexGATE is a first-class citizen within the Behemoth framework, meaning that the Behemoth Processing with GATE instructions can be followed to the letter.

The procedure is as follows:

  • Push the zipped MemexGATE application onto the distributed filesystem by copying the file from your local file system to HDFS:
hadoop fs -copyFromLocal /mylocalpath/legisgate.zip /apps/legisgate.zip
  • Create a behemoth-site.xml file in your Hadoop/conf directory and add the following properties:
<property>
  <name>gate.annotationset.input</name>
  <value></value>
  <description>Map the information at the behemoth format onto the select annotationset 
  </description>
</property>
<property>
  <name>gate.annotationset.output</name>
  <value></value>
  <description>AnnotationSet to consider when serializing to the behemoth format
  </description>
</property>
<property>
  <name>gate.annotations.filter</name>
  <value>Token</value>
  <description>Annotations types to consider when serializing to the behemoth format, separated by commas 
  </description>
</property>
<property>
  <name>gate.features.filter</name>
  <value>Token.string</value>
  <description>if specified, only the feature listed for a type will be kept
  </description>
</property>
<property>
  <name>gate.emptyannotationset</name>
  <value>false</value>
  <description>if specified all the annotations in the Behemoth document will be deleted before
 processing with GATE </description>
</property>
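
The property listing above is a fragment rather than a complete file. As a hedged sketch, the script below generates a full file from the same five names and values, wrapping them in a `<configuration>` root element. That root element is an assumption based on the usual layout of Hadoop `*-site.xml` files and is not shown in the listing; the function name `build_site_xml` is likewise hypothetical.

```python
import xml.etree.ElementTree as ET

# Property names and values copied from the listing above.
GATE_PROPERTIES = {
    "gate.annotationset.input": "",
    "gate.annotationset.output": "",
    "gate.annotations.filter": "Token",
    "gate.features.filter": "Token.string",
    "gate.emptyannotationset": "false",
}


def build_site_xml(properties):
    """Return a Hadoop-style site file as a string.

    The <configuration> wrapper is an assumption (standard Hadoop site-file
    layout); the source only shows the individual <property> elements.
    """
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    ET.indent(root)  # pretty-print; requires Python 3.9+
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(build_site_xml(GATE_PROPERTIES))
```

Empty values are serialized as self-closing `<value />` elements, which is equivalent XML to `<value></value>`. The descriptions are omitted for brevity; add a `description` child per property if you want to keep them in the generated file.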

Usage

Run MemexGATE on your Behemoth document corpus as follows:

hadoop jar gate/target/behemoth-gate*job.jar com.digitalpebble.behemoth.gate.GATEDriver \
  "input path" "target output path" /apps/legisgate.zip

For example:

hadoop jar gate/target/behemoth-gate*job.jar com.digitalpebble.behemoth.gate.GATEDriver \
  /data/behemothcorpus /data/behemoth_legisgate_corpus /apps/legisgate.zip
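
Because the HDFS path of the application zip appears both in the upload step and in the driver invocation, the two can easily drift apart. The sketch below builds both command lines from a single variable so they always agree. The driver class, jar pattern and `hadoop` subcommands are the ones used in this README; the helper names (`resolve_job_jar`, `build_commands`, `as_shell`) are hypothetical, and nothing here actually invokes Hadoop.

```python
import glob
import shlex

GATE_DRIVER = "com.digitalpebble.behemoth.gate.GATEDriver"
JOB_JAR_PATTERN = "gate/target/behemoth-gate*job.jar"


def resolve_job_jar(pattern=JOB_JAR_PATTERN):
    """Expand the jar glob the way the shell would; fail if nothing matches."""
    matches = sorted(glob.glob(pattern))
    if not matches:
        raise FileNotFoundError(f"no job jar matches {pattern!r}")
    return matches[0]


def build_commands(local_zip, hdfs_zip, job_jar, input_path, output_path):
    """Return (upload, run) argv lists that share a single HDFS zip path."""
    if input_path == output_path:
        raise ValueError("input and output corpus paths must differ")
    upload = ["hadoop", "fs", "-copyFromLocal", local_zip, hdfs_zip]
    run = ["hadoop", "jar", job_jar, GATE_DRIVER, input_path, output_path, hdfs_zip]
    return upload, run


def as_shell(argv):
    """Render an argv list as a copy-pasteable shell line (Python 3.8+)."""
    return shlex.join(argv)
```

From the Behemoth checkout, one would call `build_commands("/mylocalpath/legisgate.zip", "/apps/legisgate.zip", resolve_job_jar(), "/data/behemothcorpus", "/data/behemoth_legisgate_corpus")` and either print each list with `as_shell` or pass it to `subprocess.run`.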

If you've followed the Behemoth installation instructions and successfully run legisgate from within Behemoth, you are ready to explore other Behemoth modules. For example, a next step might be to use the Behemoth Solr Module to persist the data into an indexing engine such as Apache Solr or maybe Elasticsearch.

Acknowledgements

Huge thanks go to Julien Nioche of DigitalPebble Ltd., who developed and maintains the Behemoth software. Thank you, Julien, for licensing your code under ALv2.0. This work is funded through the DARPA Memex project.

Contacts

Lewis John McGibbney

License

MemexGATE is licensed permissively under the Apache License, Version 2.0.
