
microsoft / presidio-research

License: MIT

Programming Languages

python
139335 projects - #7 most used programming language
Jupyter Notebook
11667 projects

Projects that are alternatives of or similar to presidio-research

NER-and-Linking-of-Ancient-and-Historic-Places
An NER tool for ancient place names based on Pleiades and Spacy.
Stars: ✭ 26 (-58.06%)
Mutual labels:  spacy, named-entity-recognition, ner
anonymization-api
How to build and deploy an anonymization API with FastAPI
Stars: ✭ 51 (-17.74%)
Mutual labels:  spacy, named-entity-recognition, ner
Spacy Streamlit
👑 spaCy building blocks and visualizers for Streamlit apps
Stars: ✭ 360 (+480.65%)
Mutual labels:  spacy, named-entity-recognition, ner
anonymisation
Anonymization of legal cases (Fr) based on Flair embeddings
Stars: ✭ 85 (+37.1%)
Mutual labels:  spacy, ner, flair
Spacy Lookup
Named Entity Recognition based on dictionaries
Stars: ✭ 212 (+241.94%)
Mutual labels:  spacy, named-entity-recognition, ner
simple NER
simple rule based named entity recognition
Stars: ✭ 29 (-53.23%)
Mutual labels:  named-entity-recognition, ner
TweebankNLP
[LREC 2022] An off-the-shelf pre-trained Tweet NLP Toolkit (NER, tokenization, lemmatization, POS tagging, dependency parsing) + Tweebank-NER dataset
Stars: ✭ 84 (+35.48%)
Mutual labels:  named-entity-recognition, ner
scikitcrf NER
Python library for custom entity recognition using Sklearn CRF
Stars: ✭ 17 (-72.58%)
Mutual labels:  named-entity-recognition, ner
nlp-cheat-sheet-python
NLP Cheat Sheet, Python, spacy, LexNPL, NLTK, tokenization, stemming, sentence detection, named entity recognition
Stars: ✭ 69 (+11.29%)
Mutual labels:  spacy, named-entity-recognition
PhoNER COVID19
COVID-19 Named Entity Recognition for Vietnamese (NAACL 2021)
Stars: ✭ 55 (-11.29%)
Mutual labels:  named-entity-recognition, ner
molminer
Python library and command-line tool for extracting compounds from scientific literature. Written in Python.
Stars: ✭ 38 (-38.71%)
Mutual labels:  named-entity-recognition, ner
ner-d
Python module for Named Entity Recognition (NER) using natural language processing.
Stars: ✭ 14 (-77.42%)
Mutual labels:  named-entity-recognition, ner
extractacy
Spacy pipeline object for extracting values that correspond to a named entity (e.g., birth dates, account numbers, laboratory results)
Stars: ✭ 47 (-24.19%)
Mutual labels:  spacy, ner
korean ner tagging challenge
KU_NERDY, Dongyub Lee and Heuiseok Lim (gold prize, 2017 Korean Language Information Processing System Competition) - Conference on Korean Language and Information Processing
Stars: ✭ 30 (-51.61%)
Mutual labels:  named-entity-recognition, ner
SynLSTM-for-NER
Code and models for the paper titled "Better Feature Integration for Named Entity Recognition", NAACL 2021.
Stars: ✭ 26 (-58.06%)
Mutual labels:  named-entity-recognition, ner
neural name tagging
Code for "Reliability-aware Dynamic Feature Composition for Name Tagging" (ACL2019)
Stars: ✭ 39 (-37.1%)
Mutual labels:  named-entity-recognition, ner
SkillNER
A (smart) rule based NLP module to extract job skills from text
Stars: ✭ 69 (+11.29%)
Mutual labels:  spacy, ner
lingvo--Ner-ru
Named entity recognition (NER) in Russian texts
Stars: ✭ 38 (-38.71%)
Mutual labels:  named-entity-recognition, ner
KoBERT-NER
NER Task with KoBERT (with Naver NLP Challenge dataset)
Stars: ✭ 76 (+22.58%)
Mutual labels:  named-entity-recognition, ner
CrossNER
CrossNER: Evaluating Cross-Domain Named Entity Recognition (AAAI-2021)
Stars: ✭ 87 (+40.32%)
Mutual labels:  named-entity-recognition, ner

Presidio-research

This package provides data-science-related tools for developing new recognizers for Presidio. It is used to evaluate the entire system, as well as specific PII recognizers or PII detection models. In addition, it contains a fake data generator that creates synthetic sentences based on templates and fake PII values.

Who should use it?

  • Anyone interested in developing or evaluating PII detection models, an existing Presidio instance or a Presidio PII recognizer.
  • Anyone interested in generating new data based on previous datasets or sentence templates (e.g. to increase the coverage of entity values) for Named Entity Recognition models.

Getting started

To install the package:

  1. Clone the repo
  2. Install all dependencies, preferably in a virtual environment:
# Create conda env (optional)
conda create --name presidio python=3.9
conda activate presidio

# Install package+dependencies
pip install -r requirements.txt
python setup.py install

# Download a spaCy model used by presidio-analyzer
python -m spacy download en_core_web_lg

# Verify installation
pytest

Note that some dependencies (such as Flair and Stanza) are not automatically installed to reduce installation complexity.

What's in this package?

  1. Fake data generator for PII recognizers and NER models
  2. Data representation layer for data generation, modeling and analysis
  3. Multiple model/recognizer evaluation files (e.g. for spaCy, Flair, CRF, the Presidio API, the Presidio Analyzer Python package, or specific Presidio recognizers)
  4. Training and modeling code for multiple models
  5. Helper functions for results analysis

1. Data generation

See Data Generator README for more details.

The data generation process receives a file with templates, e.g. My name is {{name}}. It then creates new synthetic sentences by sampling templates and PII values. Furthermore, it tokenizes the data and creates tags (in the IO, BIO, or BILUO scheme) and spans for the newly created samples.
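The generation and tagging steps above can be sketched as follows. Note that fill_template, bio_tags, and the single-value PII_VALUES pool are simplified stand-ins for illustration, not the package's actual API:

```python
import re
import random

# Hypothetical single-value pools standing in for the real PII value sampler
PII_VALUES = {"name": ["Alice Smith"], "city": ["Oslo"]}

def fill_template(template, rng):
    """Replace {{entity}} placeholders with sampled PII values and
    return the sentence plus its (start, end, entity) character spans."""
    text, spans, cursor = "", [], 0
    for match in re.finditer(r"\{\{(\w+)\}\}", template):
        text += template[cursor:match.start()]
        value = rng.choice(PII_VALUES[match.group(1)])
        spans.append((len(text), len(text) + len(value), match.group(1)))
        text += value
        cursor = match.end()
    return text + template[cursor:], spans

def bio_tags(text, spans):
    """Whitespace-tokenize and assign BIO tags from character spans."""
    tags, pos = [], 0
    for token in text.split():
        start = text.index(token, pos)
        end = pos = start + len(token)
        label = "O"
        for s, e, ent in spans:
            if start < e and end > s:  # token overlaps the span
                label = ("B-" if start <= s else "I-") + ent.upper()
        tags.append(label)
    return tags

rng = random.Random(0)
sentence, spans = fill_template("My name is {{name}} and I live in {{city}}.", rng)
tags = bio_tags(sentence, spans)
```

With a larger value pool and many templates, repeatedly sampling like this yields a labeled synthetic dataset.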

Once data is generated, it can be split into train/test/validation sets while ensuring that each template appears in only one set. See this notebook for more details.
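A template-aware split can be sketched like this; split_by_template and the template_id key are hypothetical names for illustration, not the package's API:

```python
import random
from collections import defaultdict

def split_by_template(samples, ratios=(0.7, 0.15, 0.15), seed=42):
    """Split samples into train/test/validation sets so that all samples
    generated from the same template land in the same set.
    Each sample is assumed to carry a "template_id" key."""
    by_template = defaultdict(list)
    for sample in samples:
        by_template[sample["template_id"]].append(sample)
    template_ids = sorted(by_template)
    random.Random(seed).shuffle(template_ids)
    cut1 = int(len(template_ids) * ratios[0])
    cut2 = cut1 + int(len(template_ids) * ratios[1])
    groups = (template_ids[:cut1], template_ids[cut1:cut2], template_ids[cut2:])
    return [[s for t in group for s in by_template[t]] for group in groups]

# 40 samples generated from 10 templates, 4 samples per template
samples = [{"template_id": j % 10, "text": f"sample {j}"} for j in range(40)]
train, test, validation = split_by_template(samples)
```

Because whole templates are assigned to sets, no sentence pattern leaks from train into test.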

2. Data representation

To standardize the process, we use data objects that hold all the information needed for generating, analyzing, modeling, and evaluating data and models. Specifically, see data_objects.py.
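As a rough illustration of what such objects hold (a simplified sketch, not the actual class definitions in data_objects.py), each sample pairs the raw text with its spans, tokens, and tags:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Span:
    entity_type: str      # e.g. "PERSON"
    entity_value: str     # the surface string
    start_position: int   # character offsets into full_text
    end_position: int

@dataclass
class InputSample:
    full_text: str
    spans: List[Span] = field(default_factory=list)
    tokens: List[str] = field(default_factory=list)  # filled by a tokenizer
    tags: List[str] = field(default_factory=list)    # aligned IO/BIO/BILUO labels

sample = InputSample(
    full_text="My name is Alice",
    spans=[Span("PERSON", "Alice", 11, 16)],
)
```

Keeping character offsets alongside token-level tags is what lets the same object feed both span-based evaluation and token-based model training.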

The standardized structure, a List[InputSample], can be translated into different formats:

  • CoNLL
from presidio_evaluator import InputSample
dataset = InputSample.read_dataset_json("data/synth_dataset_v2.json")
conll = InputSample.create_conll_dataset(dataset)
conll.to_csv("dataset.csv", sep="\t")
  • spaCy v3
from presidio_evaluator import InputSample
dataset = InputSample.read_dataset_json("data/synth_dataset_v2.json")
InputSample.create_spacy_dataset(dataset, output_path="dataset.spacy")
  • Flair
from presidio_evaluator import InputSample
dataset = InputSample.read_dataset_json("data/synth_dataset_v2.json")
flair = InputSample.create_flair_dataset(dataset)
  • JSON
from presidio_evaluator import InputSample
dataset = InputSample.read_dataset_json("data/synth_dataset_v2.json")
InputSample.to_json(dataset, output_file="dataset_json")

3. PII models evaluation

The presidio-evaluator framework allows you to evaluate Presidio as a system, a NER model, or a specific PII recognizer, measuring precision and recall and supporting error analysis.
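At its core, entity-level evaluation compares predicted spans against gold spans. A minimal sketch of the idea (not the framework's actual evaluator, which also provides per-entity breakdowns and error analysis):

```python
def precision_recall(gold, predicted):
    """Entity-level scores over collections of (start, end, entity_type)
    spans. Exact match only: a prediction counts as a true positive when
    both its boundaries and its type agree with a gold span."""
    gold, predicted = set(gold), set(predicted)
    tp = len(gold & predicted)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    return precision, recall

gold = [(11, 16, "PERSON"), (25, 33, "LOCATION")]
pred = [(11, 16, "PERSON"), (40, 44, "PHONE_NUMBER")]
p, r = precision_recall(gold, pred)  # one true positive out of two on each side
```

Real PII evaluation usually relaxes exact matching (e.g. partial-overlap credit) and weights recall more heavily, since missed PII is costlier than a false alarm.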

Examples:

4. Training PII detection models

CRF

To train a vanilla CRF on a new dataset, see this notebook. To evaluate, see this notebook.
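Linear-chain CRF taggers operate on per-token feature dictionaries (the sklearn-crfsuite convention). A simplified example of such a feature extractor; this is an illustration, not the notebook's exact feature set:

```python
def word2features(tokens, i):
    """Build a feature dict for token i, mixing surface features of the
    token itself with a small context window around it."""
    word = tokens[i]
    features = {
        "word.lower": word.lower(),
        "word.istitle": word.istitle(),   # capitalization hints at names
        "word.isdigit": word.isdigit(),   # digits hint at phones, IDs, dates
        "suffix3": word[-3:],
        "BOS": i == 0,                    # beginning of sentence
        "EOS": i == len(tokens) - 1,      # end of sentence
    }
    if i > 0:
        features["prev.lower"] = tokens[i - 1].lower()
    if i < len(tokens) - 1:
        features["next.lower"] = tokens[i + 1].lower()
    return features

tokens = "My name is Alice".split()
feats = [word2features(tokens, i) for i in range(len(tokens))]
```

Context features like prev.lower are what let the CRF learn that the token after "name is" tends to be a person name.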

spaCy

To train a new spaCy model, first save the dataset in a spaCy format:

# dataset is a List[InputSample]
InputSample.create_spacy_dataset(dataset, output_path="dataset.spacy")

To evaluate, see this notebook.

Flair

  • To train Flair models, see this helper class or this snippet:
from presidio_evaluator.models import FlairTrainer

# Datasets previously serialized with InputSample.to_json
train_samples = "data/generated_train.json"
test_samples = "data/generated_test.json"
val_samples = "data/generated_validation.json"

trainer = FlairTrainer()
# Convert the json datasets into Flair-formatted corpus files
trainer.create_flair_corpus(train_samples, test_samples, val_samples)

# Read the generated corpus files back and train a Flair sequence tagger
corpus = trainer.read_corpus("")
trainer.train(corpus)

Note that the three json files are created using InputSample.to_json.

For more information

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.

Copyright notice:

Fake Name Generator identities by the Fake Name Generator are licensed under a Creative Commons Attribution-Share Alike 3.0 United States License. Fake Name Generator and the Fake Name Generator logo are trademarks of Corban Works, LLC.

Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].