MantisAI / nervaluate

Licence: MIT license
Full named-entity (i.e., not tag/token) evaluation metrics based on SemEval’13

Programming Languages

python
Makefile

Projects that are alternatives of or similar to nervaluate

Shukongdashi
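A fault-diagnosis expert system for the CNC (numerical control) domain, built in Python using knowledge graphs, natural language processing, convolutional neural networks and related techniques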
Stars: ✭ 109 (+172.5%)
Mutual labels:  named-entity-recognition
rosette-elasticsearch-plugin
Document Enrichment plugin for Elasticsearch
Stars: ✭ 25 (-37.5%)
Mutual labels:  named-entity-recognition
FDDC
Named Entity Recognition & Relation Extraction (named-entity recognition and relation classification)
Stars: ✭ 29 (-27.5%)
Mutual labels:  named-entity-recognition
NER-using-Deep-Learning
A project on achieving Named-Entity Recognition using Deep Learning.
Stars: ✭ 24 (-40%)
Mutual labels:  named-entity-recognition
slotminer
Tool for slot extraction from text
Stars: ✭ 15 (-62.5%)
Mutual labels:  named-entity-recognition
state-spaces
Sequence Modeling with Structured State Spaces
Stars: ✭ 694 (+1635%)
Mutual labels:  sequence-models
spacy-server
🦜 Containerized HTTP API for industrial-strength NLP via spaCy and sense2vec
Stars: ✭ 58 (+45%)
Mutual labels:  named-entity-recognition
Pytorch-NLU
Pytorch-NLU, a Chinese text classification and sequence annotation toolkit. It supports multi-class and multi-label classification of long and short Chinese texts, and sequence annotation tasks such as Chinese named entity recognition, part of speech ta…
Stars: ✭ 151 (+277.5%)
Mutual labels:  named-entity-recognition
subject-extractor
No description or website provided.
Stars: ✭ 21 (-47.5%)
Mutual labels:  named-entity-recognition
deep-atrous-ner
Deep-Atrous-CNN-NER: Word level model for Named Entity Recognition
Stars: ✭ 35 (-12.5%)
Mutual labels:  named-entity-recognition
EntityTargetedActiveLearning
No description or website provided.
Stars: ✭ 17 (-57.5%)
Mutual labels:  named-entity-recognition
PySODEvalToolkit
PySODEvalToolkit: A Python-based Evaluation Toolbox for Salient Object Detection and Camouflaged Object Detection
Stars: ✭ 59 (+47.5%)
Mutual labels:  evaluation-metrics
trinity-ie
Information extraction pipeline containing coreference resolution, named entity linking, and relationship extraction
Stars: ✭ 59 (+47.5%)
Mutual labels:  named-entity-recognition
lingvo--Ner-ru
Named entity recognition (NER) in Russian texts
Stars: ✭ 38 (-5%)
Mutual labels:  named-entity-recognition
simple NER
simple rule based named entity recognition
Stars: ✭ 29 (-27.5%)
Mutual labels:  named-entity-recognition
metamaplite
A near real-time named-entity recognizer
Stars: ✭ 37 (-7.5%)
Mutual labels:  named-entity-recognition
neji
Flexible and powerful platform for biomedical information extraction from text
Stars: ✭ 37 (-7.5%)
Mutual labels:  named-entity-recognition
nested-ner-tacl2020-flair
Implementation of Nested Named Entity Recognition using Flair
Stars: ✭ 23 (-42.5%)
Mutual labels:  named-entity-recognition
DeepNER
An Easy-to-use, Modular and Prolongable package of deep-learning based Named Entity Recognition Models.
Stars: ✭ 9 (-77.5%)
Mutual labels:  named-entity-recognition
Keras ile Derin Ogrenmeye Giris
Prepared by Merve Ayyüce Kızrak for the BTK Akademi 1 Million Employment Project.
Stars: ✭ 109 (+172.5%)
Mutual labels:  sequence-models

nervaluate

nervaluate is a Python module for evaluating Named Entity Recognition (NER) models as defined in the SemEval 2013 - 9.1 task.

The evaluation metrics output by nervaluate go beyond a simple token/tag based schema, and consider different scenarios based on whether all the tokens that belong to a named entity were classified or not, and also whether the correct entity type was assigned.

The problem is described in detail in the original blog post by David Batista, and this package extends the code in the original repository that accompanied the blog post.

The code draws heavily on:

The problem

Token level evaluation for NER is too simplistic

When running machine learning models for NER, it is common to report metrics at the individual token level. This may not be the best approach: a named entity can be made up of multiple tokens, so a full-entity accuracy would be desirable.
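To make the gap concrete, here is a minimal sketch in plain Python (not part of nervaluate; the sentence and tags are invented for illustration) in which most tokens are tagged correctly yet the only entity is not recovered in full.

```python
# Illustrative data only: one sentence with one gold entity ("New York City").
tokens = ["I", "flew", "to", "New", "York", "City", "yesterday", "."]
gold = ["O", "O", "O", "B-LOC", "I-LOC", "I-LOC", "O", "O"]
pred = ["O", "O", "O", "B-LOC", "I-LOC", "O", "O", "O"]

# Token-level accuracy: fraction of positions where the two tags agree.
token_accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)

# Full-entity check: the gold entity covers tokens 3..5, and it only counts
# as found if every tag inside that span was predicted exactly.
entity_start, entity_end = 3, 6  # half-open slice over the gold entity
entity_found = gold[entity_start:entity_end] == pred[entity_start:entity_end]

print(f"token accuracy: {token_accuracy:.3f}")
print(f"entity fully recovered: {entity_found}")
```

Seven of the eight tags agree, so the token-level score looks healthy, while the entity-level check reports that the single entity was missed.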

When comparing the gold standard annotations with the output of an NER system, different scenarios might occur:

I. Surface string and entity type match

Token Gold Prediction
in O O
New B-LOC B-LOC
York I-LOC I-LOC
. O O

II. System hypothesized an incorrect entity

Token Gold Prediction
an O O
Awful O B-ORG
Headache O I-ORG
in O O

III. System misses an entity

Token Gold Prediction
in O O
Palo B-LOC O
Alto I-LOC O
, O O

Based on these three scenarios we have a simple classification evaluation that can be measured in terms of true positives, false positives and false negatives, from which we can compute precision, recall and F1-score for each named-entity type.

However, this simple schema ignores partial matches and other scenarios in which the NER system gets the named-entity surface string correct but the type wrong. We might also want to evaluate these scenarios at a full-entity level.

For example:

IV. System assigns the wrong entity type

Token Gold Prediction
I O O
live O O
in O O
Palo B-LOC B-ORG
Alto I-LOC I-ORG
, O O

V. System gets the boundaries of the surface string wrong

Token Gold Prediction
Unless O B-PER
Karl B-PER I-PER
Smith I-PER I-PER
resigns O O

VI. System gets the boundaries and entity type wrong

Token Gold Prediction
Unless O B-ORG
Karl B-PER I-ORG
Smith I-PER I-ORG
resigns O O

How can we incorporate these scenarios into evaluation metrics? The original blog post gives a great explanation; a summary is included here:

We can use the following five metrics to account for different categories of errors:

Error type Explanation
Correct (COR) both are the same
Incorrect (INC) the output of a system and the golden annotation don’t match
Partial (PAR) system and the golden annotation are somewhat “similar” but not the same
Missing (MIS) a golden annotation is not captured by a system
Spurious (SPU) system produces a response which doesn’t exist in the golden annotation

These five metrics can be measured in four different ways:

Evaluation schema Explanation
Strict exact boundary surface string match and entity type
Exact exact boundary match over the surface string, regardless of the type
Partial partial boundary match over the surface string, regardless of the type
Type some overlap between the system tagged entity and the gold annotation is required

These five error types and four evaluation schemas interact in the following ways:

| Scenario | Gold entity | Gold string | Pred entity | Pred string | Type | Partial | Exact | Strict |
|---|---|---|---|---|---|---|---|---|
| III | BRAND | tikosyn | | | MIS | MIS | MIS | MIS |
| II | | | BRAND | healthy | SPU | SPU | SPU | SPU |
| V | DRUG | warfarin | DRUG | of warfarin | COR | PAR | INC | INC |
| IV | DRUG | propranolol | BRAND | propranolol | INC | COR | COR | INC |
| I | DRUG | phenytoin | DRUG | phenytoin | COR | COR | COR | COR |
| VI | GROUP | contraceptives | DRUG | oral contraceptives | INC | PAR | INC | INC |
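The rows of this table can be reproduced with a few comparisons. The following is a minimal sketch, not nervaluate's implementation: the function name `classify`, the span dict layout, the inclusive offsets and the example offsets are all assumptions made for illustration, and it expects the caller to have already paired each gold entity with the prediction that overlaps it.

```python
SCHEMAS = ("type", "partial", "exact", "strict")


def classify(gold, pred):
    """Label one gold/predicted pairing under each evaluation schema.

    Each argument is a dict with "label", "start" and "end" keys, or None
    when that side has no entity. Offsets are treated as inclusive here,
    which is an assumption of this sketch.
    """
    if pred is None:
        return dict.fromkeys(SCHEMAS, "MIS")  # gold entity never predicted
    if gold is None:
        return dict.fromkeys(SCHEMAS, "SPU")  # prediction with no gold entity

    overlap = gold["start"] <= pred["end"] and pred["start"] <= gold["end"]
    if not overlap:
        raise ValueError("pair the spans first: these two do not overlap")

    same_bounds = (gold["start"], gold["end"]) == (pred["start"], pred["end"])
    same_type = gold["label"] == pred["label"]
    return {
        "type": "COR" if same_type else "INC",
        "partial": "COR" if same_bounds else "PAR",
        "exact": "COR" if same_bounds else "INC",
        "strict": "COR" if same_bounds and same_type else "INC",
    }


# Scenario V: gold "warfarin" vs predicted "of warfarin" (offsets invented).
gold_v = {"label": "DRUG", "start": 3, "end": 10}
pred_v = {"label": "DRUG", "start": 0, "end": 10}
print(classify(gold_v, pred_v))

# Scenario IV: same string "propranolol", different entity type.
gold_iv = {"label": "DRUG", "start": 0, "end": 10}
pred_iv = {"label": "BRAND", "start": 0, "end": 10}
print(classify(gold_iv, pred_iv))
```

Passing `None` for the prediction or for the gold entity yields the all-MIS and all-SPU rows of scenarios III and II respectively.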

Then precision/recall/f1-score are calculated for each different evaluation schema. To do so, two more quantities need to be calculated:

POSSIBLE (POS) = COR + INC + PAR + MIS = TP + FN
ACTUAL (ACT) = COR + INC + PAR + SPU = TP + FP

Then we can compute precision, recall and F1-score. Roughly speaking, precision is the percentage of named entities found by the NER system that are correct, and recall is the percentage of the named entities in the golden annotations that are retrieved by the NER system. These are computed in two different ways, depending on whether we want an exact match (i.e., strict and exact) or a partial match (i.e., partial and type) scenario:

Exact Match (i.e., strict and exact)

Precision = (COR / ACT) = TP / (TP + FP)
Recall = (COR / POS) = TP / (TP+FN)

Partial Match (i.e., partial and type)

Precision = (COR + 0.5 × PAR) / ACT = TP / (TP + FP)
Recall = (COR + 0.5 × PAR) / POS = TP / (TP + FN)

Putting it all together:

Measure Type Partial Exact Strict
Correct 3 3 3 2
Incorrect 2 0 2 3
Partial 0 2 0 0
Missed 1 1 1 1
Spurious 1 1 1 1
Precision 0.5 0.66 0.5 0.33
Recall 0.5 0.66 0.5 0.33
F1 0.5 0.66 0.5 0.33
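The precision, recall and F1 rows follow directly from the counts and the formulas above. The sketch below recomputes them in plain Python; the helper name `precision_recall_f1` and its `partial_credit` flag are made up for this example and are not part of the nervaluate API.

```python
def precision_recall_f1(cor, inc, par, mis, spu, partial_credit=False):
    """Compute precision, recall and F1 from the five error counts.

    With partial_credit=True each partial match counts as half a hit
    (the "Partial Match" formulas); otherwise only correct matches count.
    """
    possible = cor + inc + par + mis  # POS: entities in the gold annotation
    actual = cor + inc + par + spu    # ACT: entities produced by the system
    hits = cor + 0.5 * par if partial_credit else cor
    precision = hits / actual if actual else 0.0
    recall = hits / possible if possible else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Counts copied from the table above, one entry per evaluation schema.
counts = {
    "type": dict(cor=3, inc=2, par=0, mis=1, spu=1),
    "partial": dict(cor=3, inc=0, par=2, mis=1, spu=1),
    "exact": dict(cor=3, inc=2, par=0, mis=1, spu=1),
    "strict": dict(cor=2, inc=3, par=0, mis=1, spu=1),
}

scores = {}
for schema, c in counts.items():
    use_partial = schema in ("type", "partial")
    scores[schema] = precision_recall_f1(**c, partial_credit=use_partial)
    p, r, f = scores[schema]
    print(f"{schema:8s} precision={p:.3f} recall={r:.3f} f1={f:.3f}")
```

The partial schema evaluates to (3 + 0.5 × 2) / 6 = 2/3, which the table shows truncated to 0.66. Precision and recall coincide in this example because POSSIBLE and ACTUAL are both 6.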

Notes:

In scenarios IV and VI the entity types of the true and predicted entities do not match. In both cases we score only against the true entity, not the predicted one. You could argue that the predicted entity should also be scored as spurious, but according to the definition of spurious:

  • Spurious (SPU): system produces a response which doesn’t exist in the golden annotation;

In this case there exists an annotation, but with a different entity type, so we assume it's only incorrect.

Installation

To install the package:

pip install nervaluate

To create a virtual environment for development:

make virtualenv

# Then to activate the virtualenv:

source /build/virtualenv/bin/activate

Alternatively, you can use your own virtualenv manager and run make reqs to install the requirements.

To run tests:

# Will run tox

make test

Example:

The main Evaluator class will accept a number of formats:

  • prodi.gy style lists of spans.
  • Nested lists containing NER labels.
  • CoNLL style tab delimited strings.

Prodigy spans

true = [
    [{"label": "PER", "start": 2, "end": 4}],
    [{"label": "LOC", "start": 1, "end": 2},
     {"label": "LOC", "start": 3, "end": 4}]
]

pred = [
    [{"label": "PER", "start": 2, "end": 4}],
    [{"label": "LOC", "start": 1, "end": 2},
     {"label": "LOC", "start": 3, "end": 4}]
]

from nervaluate import Evaluator

evaluator = Evaluator(true, pred, tags=['LOC', 'PER'])

# Returns overall metrics and metrics for each tag

results, results_by_tag = evaluator.evaluate()

print(results)
{
    'ent_type':{
        'correct':3,
        'incorrect':0,
        'partial':0,
        'missed':0,
        'spurious':0,
        'possible':3,
        'actual':3,
        'precision':1.0,
        'recall':1.0
    },
    'partial':{
        'correct':3,
        'incorrect':0,
        'partial':0,
        'missed':0,
        'spurious':0,
        'possible':3,
        'actual':3,
        'precision':1.0,
        'recall':1.0
    },
    'strict':{
        'correct':3,
        'incorrect':0,
        'partial':0,
        'missed':0,
        'spurious':0,
        'possible':3,
        'actual':3,
        'precision':1.0,
        'recall':1.0
    },
    'exact':{
        'correct':3,
        'incorrect':0,
        'partial':0,
        'missed':0,
        'spurious':0,
        'possible':3,
        'actual':3,
        'precision':1.0,
        'recall':1.0
    }
}
print(results_by_tag)
{
    'LOC':{
        'ent_type':{
            'correct':2,
            'incorrect':0,
            'partial':0,
            'missed':0,
            'spurious':0,
            'possible':2,
            'actual':2,
            'precision':1.0,
            'recall':1.0
        },
        'partial':{
            'correct':2,
            'incorrect':0,
            'partial':0,
            'missed':0,
            'spurious':0,
            'possible':2,
            'actual':2,
            'precision':1.0,
            'recall':1.0
        },
        'strict':{
            'correct':2,
            'incorrect':0,
            'partial':0,
            'missed':0,
            'spurious':0,
            'possible':2,
            'actual':2,
            'precision':1.0,
            'recall':1.0
        },
        'exact':{
            'correct':2,
            'incorrect':0,
            'partial':0,
            'missed':0,
            'spurious':0,
            'possible':2,
            'actual':2,
            'precision':1.0,
            'recall':1.0
        }
    },
    'PER':{
        'ent_type':{
            'correct':1,
            'incorrect':0,
            'partial':0,
            'missed':0,
            'spurious':0,
            'possible':1,
            'actual':1,
            'precision':1.0,
            'recall':1.0
        },
        'partial':{
            'correct':1,
            'incorrect':0,
            'partial':0,
            'missed':0,
            'spurious':0,
            'possible':1,
            'actual':1,
            'precision':1.0,
            'recall':1.0
        },
        'strict':{
            'correct':1,
            'incorrect':0,
            'partial':0,
            'missed':0,
            'spurious':0,
            'possible':1,
            'actual':1,
            'precision':1.0,
            'recall':1.0
        },
        'exact':{
            'correct':1,
            'incorrect':0,
            'partial':0,
            'missed':0,
            'spurious':0,
            'possible':1,
            'actual':1,
            'precision':1.0,
            'recall':1.0
        }
    }
}

Nested lists

true = [
    ['O', 'O', 'B-PER', 'I-PER', 'O'],
    ['O', 'B-LOC', 'I-LOC', 'B-LOC', 'I-LOC', 'O'],
]

pred = [
    ['O', 'O', 'B-PER', 'I-PER', 'O'],
    ['O', 'B-LOC', 'I-LOC', 'B-LOC', 'I-LOC', 'O'],
]

evaluator = Evaluator(true, pred, tags=['LOC', 'PER'], loader="list")

results, results_by_tag = evaluator.evaluate()

CoNLL style tab delimited


true = "word\tO\nword\tO\nword\tB-PER\nword\tI-PER\n"

pred = "word\tO\nword\tO\nword\tB-PER\nword\tI-PER\n"

evaluator = Evaluator(true, pred, tags=['PER'], loader="conll")

results, results_by_tag = evaluator.evaluate()

Extending the package to accept more formats

Additional formats can easily be added to the module by creating a conversion function in nervaluate/utils.py, for example conll_to_spans(). This function must return the spans as the prodigy style dicts shown in the prodigy example above.

The new function can then be added to the list of loaders in nervaluate/nervaluate.py, and can then be selected with the loader argument when instantiating the Evaluator class.
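As a rough illustration of what such a conversion function might look like, here is a sketch that turns one sentence of BIO tags into prodigy style span dicts. The name `bio_to_spans` is hypothetical and does not exist in nervaluate, and the choice of token indices with an inclusive `end` is an assumption; check the convention used by the version you are working with before relying on it. Registering the function as a loader is not shown.

```python
def bio_to_spans(tags):
    """Convert one sentence of BIO tags to prodigy style span dicts.

    Assumption: "start" and "end" are token indices and "end" is inclusive.
    """
    spans = []
    current = None  # the span being built, if any
    for i, tag in enumerate(tags):
        prefix, _, label = tag.partition("-")  # "B-LOC" -> ("B", "-", "LOC")
        continues = (
            prefix == "I" and current is not None and current["label"] == label
        )
        if continues:
            current["end"] = i  # extend the open span by one token
            continue
        if current is not None:
            spans.append(current)  # any other tag closes the open span
            current = None
        if prefix in ("B", "I"):
            # "B-" starts a span; a stray "I-" is treated leniently as a start.
            current = {"label": label, "start": i, "end": i}
    if current is not None:
        spans.append(current)
    return spans


sentence = ["O", "B-LOC", "I-LOC", "B-LOC", "I-LOC", "O"]
print(bio_to_spans(sentence))
```

Under the stated assumption, the example sentence yields two LOC spans covering tokens 1 to 2 and 3 to 4.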

A list of formats we intend to include can be found in #3.
