
KBNLresearch / europeananp-ner

Licence: other
Named Entity Recognition Annotator Tool for Europeana Newspapers

Programming Languages

java
68154 projects - #9 most used programming language
python
139335 projects - #7 most used programming language
shell
77523 projects

Projects that are alternatives of or similar to europeananp-ner

Stanford-NER-Python
Stanford Named Entity Recognizer (NER) - Python Wrapper
Stars: ✭ 63 (+8.62%)
Mutual labels:  named-entity-recognition, stanford-ner
redcoat
A lightweight web-based annotation tool for labelling entity recognition data.
Stars: ✭ 19 (-67.24%)
Mutual labels:  named-entity-recognition
nlp ner workshop
Named-Entity-Recognition Workshop
Stars: ✭ 15 (-74.14%)
Mutual labels:  named-entity-recognition
stack-lstm-ner
Transition-based NER system
Stars: ✭ 35 (-39.66%)
Mutual labels:  named-entity-recognition
clinical concept extraction
Clinical Concept Extraction with Contextual Word Embedding
Stars: ✭ 34 (-41.38%)
Mutual labels:  named-entity-recognition
open-semantic-desktop-search
Virtual Machine for Desktop Search with Open Semantic Search
Stars: ✭ 22 (-62.07%)
Mutual labels:  named-entity-recognition
TorchBlocks
A PyTorch-based toolkit for natural language processing
Stars: ✭ 85 (+46.55%)
Mutual labels:  named-entity-recognition
acl19 subtagger
Code for ACL '19 paper: Towards Improving Neural Named Entity Recognition with Gazetteers
Stars: ✭ 33 (-43.1%)
Mutual labels:  named-entity-recognition
grobid-ner
A Named-Entity Recogniser based on Grobid.
Stars: ✭ 38 (-34.48%)
Mutual labels:  named-entity-recognition
named-entity-recognition
Notebooks for teaching Named Entity Recognition at the Cultural Heritage Data School, run by Cambridge Digital Humanities
Stars: ✭ 18 (-68.97%)
Mutual labels:  named-entity-recognition
LNEx
📍 🏢 🏦 🏣 🏪 🏬 LNEx: Location Name Extractor
Stars: ✭ 21 (-63.79%)
Mutual labels:  named-entity-recognition
SkillsExtractorCognitiveSearch
Azure Search Cognitive Skill to extract technical and business skills from text
Stars: ✭ 51 (-12.07%)
Mutual labels:  named-entity-recognition
weak-supervision-for-NER
Framework to learn Named Entity Recognition models without labelled data using weak supervision.
Stars: ✭ 114 (+96.55%)
Mutual labels:  named-entity-recognition
huner
Named Entity Recognition for biomedical entities
Stars: ✭ 44 (-24.14%)
Mutual labels:  named-entity-recognition
knowledge-graph-nlp-in-action
From model training to deployment: hands-on Knowledge Graph and Natural Language Processing (NLP). Uses TensorFlow, BERT+Bi-LSTM+CRF, Neo4j, etc., and covers tasks such as Named Entity Recognition, Text Classification, Information Extraction, and Relation Extraction.
Stars: ✭ 58 (+0%)
Mutual labels:  named-entity-recognition
IE Paper Notes
Paper notes for Information Extraction, including Relation Extraction (RE), Named Entity Recognition (NER), Entity Linking (EL), Event Extraction (EE), Named Entity Disambiguation (NED).
Stars: ✭ 14 (-75.86%)
Mutual labels:  named-entity-recognition
article-summary-deep-learning
📖 Using deep learning and scraping to analyze/summarize articles! Just drop in any URL!
Stars: ✭ 18 (-68.97%)
Mutual labels:  named-entity-recognition
CrowdLayer
A neural network layer that enables training of deep neural networks directly from crowdsourced labels (e.g. from Amazon Mechanical Turk) or, more generally, labels from multiple annotators with different biases and levels of expertise.
Stars: ✭ 45 (-22.41%)
Mutual labels:  named-entity-recognition
thai-ner
Thai Named Entity Recognition
Stars: ✭ 34 (-41.38%)
Mutual labels:  named-entity-recognition
NER-Multimodal-pytorch
Pytorch Implementation of "Adaptive Co-attention Network for Named Entity Recognition in Tweets" (AAAI 2018)
Stars: ✭ 42 (-27.59%)
Mutual labels:  named-entity-recognition

Named Entity Recognition Tool for Europeana Newspapers

This tool takes container documents (MPEG21-DIDL, METS), parses all references to ALTO files, and tries to find named entities in the pages (with most models: Location, Person, Organisation, Misc). The aim is to keep the physical location on the page available throughout the whole process, so the results can be highlighted in a viewer.

Read more about it on the KBNLresearch blog.

Stanford NER is used for tagging. The goal during development was loose coupling, which lets us quickly inherit from and benefit from upstream development. Most of the development is done at the research department of the KB, national library of the Netherlands. If you are looking for a project that interacts more closely with the core of Stanford NER, take a peek at INL-NERT, the project from our colleagues at INL, the Institute for Dutch Lexicology. Although the two are separate branches now, there is a desire to integrate them in the future.
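
One way to see this loose coupling in practice is that Stanford NER is pulled in as a regular Maven dependency rather than patched locally. For example, with Maven installed you can inspect the dependency tree:

# list the Stanford NER artifact(s) in the dependency tree
mvn dependency:tree | grep -i stanford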

This version is no longer maintained; a maintained version is available at https://github.com/EuropeanaNewspapers/ner-app

Input formats

The following input formats are implemented:

  • ALTO 1.0
  • HTML
  • METS
  • MPEG21 DIDL
  • Text

Output formats

The following output formats are implemented (they correspond to the -f/--export option described below):

  • Log (default)
  • CSV
  • HTML
  • DB
  • ALTO (2.0, 2.1 and 3.0)
  • BIO

Building

Building from source:

Install Maven and Java (v1.7 or later). Clone the source from GitHub and, in the top-level directory, run:

mvn package

This command generates both a JAR and a WAR of the NER tool in the target/ directory. To deploy the WAR, copy it into the Tomcat webapps directory, or use the Tomcat manager to do it for you.
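
A minimal sketch of deploying the WAR by hand; the webapps path varies per Tomcat installation and the exact WAR name depends on the project version:

# copy the freshly built WAR into Tomcat's webapps directory (path is an example)
cp target/NerAnnotator-*.war /var/lib/tomcat7/webapps/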

Or move quickly and run (on *nix systems):

git clone https://github.com/KBNLresearch/europeananp-ner.git
cd europeananp-ner/
./go.sh

Usage: command-line interface

Invoking help:

java -jar NerAnnotator.jar --help

usage: java -jar NerAnnotator.jar [OPTIONS] [INPUTFILES..]
-c,--container <FORMAT>             Input type: mets (Default), didl,
                                    alto, text, html
-d,--output-directory <DIRECTORY>   output DIRECTORY for result files.
                                    Default ./output
-f,--export <FORMAT>                Output type: log (Default), csv,
                                    html, db, alto, alto2_1, alto3, bio.
                                    Multiple formats: "-f html -f csv"
-l,--language <ISO-CODE>            use two-letter ISO-CODE for language
                                    selection: en, de, nl ....
-m,--models <language=filename>     models for languages. Ex. -m
                                    de=/path/to/file/model_de.gz -m
                                    nl=/path/to/file/model_nl.gz
-n,--nthreads <THREADS>             maximum number of threads to be used
                                    for processing. Default 8

If there are no input files specified, a list of file names is read from stdin.
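
For example, a list of ALTO files can be piped in (a sketch; the ./alto-files directory and the Dutch model path are placeholders):

find ./alto-files -name '*.xml' | java -jar NerAnnotator.jar -c alto -f csv -l nl -m nl=./eunews_dutch.crf.gz -n 4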

Example invocation for classification of german_alto.xml:

java -Xmx800m -jar NerAnnotator.jar -c mets -f alto -l de -m de=./test-files/german.ser.gz -n 2 ./test-files/german_alto.xml

The given example takes the language model 'german.ser.gz' and applies it to 'german_alto.xml' using 2 threads and container type METS.
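
Several export formats can also be requested in a single run; building on the same test file, the following writes both HTML and CSV results to ./output:

java -Xmx800m -jar NerAnnotator.jar -c mets -f html -f csv -d ./output -l de -m de=./test-files/german.ser.gz -n 2 ./test-files/german_alto.xml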

Usage: web interface

To run the web interface standalone:

mvn jetty:run

This will try to bind to port 8080, using Jetty.
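
Once Jetty reports that it has started, the landing page should be reachable locally (the context path may differ depending on the Jetty configuration):

curl -s http://localhost:8080/ | head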

Once deployed to Tomcat, the following applies: the default configuration, as well as the test classifiers, resides in src/main/resources/config.ini; this file references the available classifiers.

See the provided sample for some default settings. The landing page of the application shows the available options once opened in a browser. After deployment, config.ini and the classifiers end up in WEB-INF/classes/.

Working with classifiers and binary model generation

To be able to compare your results with a baseline, we provide some test files in the test-files directory.

To run a back-to-front test, try:

cd test-files;./test_europeana_ner.sh

The output should look something like:

Generating new classification model. (de)
-rw-rw-r-- 1 aloha aloha 1.4M Sep 11 15:55 ./eunews_german.crf.gz

real	0m3.984s
user	0m5.452s
sys	0m0.235s
Applying generated model (de).

Results:
    Locations: 4
    Organizations: 0
    Persons: 1071

real	0m13.512s
user	0m17.771s
sys	0m0.336s

Generating new classification model. (nl)
-rw-rw-r-- 1 aloha aloha 1.7M Sep 11 15:56 ./eunews_dutch.crf.gz

real	0m8.816s
user	0m10.437s
sys	0m0.371s
Applying generated model (nl).

Results:
    Locations: 1
    Organizations: 8
    Persons: 0

real	0m5.048s
user	0m9.278s
sys	0m0.233s

To generate a binary classification model, use the following command:

cd test-files; java -Xmx5G -cp ../target/NerAnnotator-0.0.2-SNAPSHOT-jar-with-dependencies.jar edu.stanford.nlp.ie.crf.CRFClassifier -prop austen_dutch.prop

This should result in a file called eunews_dutch.crf.gz, with a file size of about 1 MB.

To verify the NER software, use the created classifier to process the provided example file.

cd test-files; java -jar ../target/NerAnnotator-0.0.2-SNAPSHOT-jar-with-dependencies.jar -c alto -d out -f alto -l nl -m nl=./eunews_dutch.crf.gz -n 8 ./dutch_alto.xml

This results in a directory called out containing ALTO files with inline annotations.
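
A quick way to inspect what was added is to diff the annotated output against the input (a sketch, assuming bash and xmllint are available and that the output file keeps the input's name):

diff <(xmllint --format ./dutch_alto.xml) <(xmllint --format ./out/dutch_alto.xml) | head -40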

General remarks on binary classification model generation

The process of generating a binary classification model is a delicate one. The input .bio file needs to be as clean as possible to avoid garbage in, garbage out. Thus, use noise filters while creating .bio files.
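
A simple sanity check can already catch a lot of noise; the sketch below assumes the common layout of one token per line with a single tab-separated label and blank lines between sentences (dutch_training.bio is a placeholder name):

# report lines that do not have exactly two tab-separated fields
awk -F'\t' 'NF != 2 && NF != 0 {print NR": "$0}' dutch_training.bio | head
# show the label distribution to spot obviously broken annotations
cut -f2 dutch_training.bio | sort | uniq -c | sort -rn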

Gazetteers greatly improve the quality of your classification process, but a big model in memory may slow down processing. Overall, there is a strong correlation between model size and performance.

The Stanford NER package offers many settings that influence the binary model generation process. These settings can be configured in austen.prop; for more information on them, see the Stanford NER FAQ.
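
As an illustration, the snippet below writes a minimal property file in the style of austen_dutch.prop; the flags shown are standard Stanford NER training options, not necessarily the exact contents of the provided file, and the file names are placeholders:

cat > example_dutch.prop <<'EOF'
# training data and output model (placeholder names)
trainFile = dutch_training.bio
serializeTo = eunews_dutch.crf.gz
# column layout of the training file: token, label
map = word=0,answer=1
# standard Stanford NER feature flags
useClassFeature = true
useWord = true
useNGrams = true
noMidNGrams = true
maxNGramLeng = 6
usePrev = true
useNext = true
useSequences = true
usePrevSequences = true
maxLeft = 1
wordShape = chris2useLC
# uncomment to plug in a gazetteer file (see the remark on gazetteers above)
#useGazettes = true
#gazette = dutch_gazette.txt
EOF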

Binary classification models generated with this tool are fully compatible with the upstream version of the Stanford NER.
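
For example, a model produced here can be loaded directly by the stock Stanford NER distribution (a sketch; the stanford-ner.jar path and the sample text file are placeholders):

java -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier eunews_dutch.crf.gz -textFile sample_dutch.txt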
