This repository shows a baseline model for text classification, implemented as an LSTM-based model in PyTorch. To make the model easier to understand, it is applied to a Tweets dataset provided by Kaggle.

Stars: ✭ 45 (+12.5%)

Mutual labels: text-mining, text-processing

Pipeit

PipeIt is a text transformation, conversion, cleansing and extraction tool.

Stars: ✭ 57 (+42.5%)

Mutual labels: text-mining, text-processing

Xioc

Extract indicators of compromise from text, including "escaped" ones.

Stars: ✭ 148 (+270%)

Mutual labels: text-mining, text-processing

support-tickets-classification

This case study shows how to create a model for text analysis and classification and deploy it as a web service in Azure cloud in order to automatically classify support tickets. This project is a proof of concept made by Microsoft (Commercial Software Engineering team) in collaboration with Endava http://endava.com/en

Stars: ✭ 142 (+255%)

Mutual labels: text-mining, text-processing

perke

A keyphrase extractor for Persian

Stars: ✭ 60 (+50%)

Mutual labels: text-mining, text-processing

palladian

Palladian is a Java-based toolkit with functionality for text processing, classification, information extraction, and data retrieval from the Web.

Stars: ✭ 32 (-20%)

Mutual labels: text-mining, information-extraction

Text Mining

Text Mining in Python

Stars: ✭ 18 (-55%)

Mutual labels: text-mining, text-processing

estratto

parsing fixed width files content made easy

Stars: ✭ 12 (-70%)

Mutual labels: text-mining, text-processing

Artificial Adversary

🗣️ Tool to generate adversarial text examples and test machine learning models against them

Stars: ✭ 348 (+770%)

Mutual labels: text-mining, text-processing

Applied Text Mining In Python

Repo for Applied Text Mining in Python (coursera) by University of Michigan

Stars: ✭ 59 (+47.5%)

Mutual labels: text-mining, text-processing

Text-Analysis

Explaining textual analysis tools in Python. Including Preprocessing, Skip Gram (word2vec), and Topic Modelling.

Stars: ✭ 48 (+20%)

Mutual labels: text-mining, text-processing

Awesome Hungarian Nlp

A curated list of NLP resources for Hungarian

Stars: ✭ 121 (+202.5%)

Mutual labels: text-mining, information-extraction

TextDatasetCleaner

🔬 Cleaning datasets of junk (normalization, preprocessing)

Stars: ✭ 27 (-32.5%)

Mutual labels: text-mining, text-processing

advanced-text-mining

Covers natural language processing and text analysis methodology using the TEANAPS library.

Stars: ✭ 15 (-62.5%)

Mutual labels: text-mining, text-processing

text-analysis

Weaving analytical stories from text data

Stars: ✭ 12 (-70%)

Mutual labels: text-mining, text-processing

corpusexplorer2.0

Corpus linguistics has never been so easy...

Stars: ✭ 16 (-60%)

Mutual labels: text-mining, text-processing


Deduce: de-identification method for Dutch medical text

If you are looking for the version of DEDUCE as published with Menger et al (2017), please visit vmenger/deduce-classic, where the original is archived. This version is maintained and improved, thus possibly differing from the validated original.

This project contains the code for DEDUCE: de-identification method for Dutch medical text, initially described in Menger et al (2017). De-identification of medical text is needed before text data can be used for analysis, both to comply with legal requirements and to protect the privacy of patients. Our pattern-matching method removes Protected Health Information (PHI) in the following categories:

Person names, including initials
Geographical locations smaller than a country
Names of institutions that are related to patient treatment
Dates
Ages
Patient numbers
Telephone numbers
E-mail addresses and URLs

Details of the development, workings, and validation of the initial method can be found in:

Menger, V.J., Scheepers, F., van Wijk, L.M., Spruit, M. (2017). DEDUCE: A pattern matching method for automatic de-identification of Dutch medical text, Telematics and Informatics, 2017, ISSN 0736-5853
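As a rough illustration of what pattern-matching annotation looks like, the sketch below tags e-mail addresses and phone numbers with regular expressions. It is not DEDUCE's code: the function name `annotate` and both patterns are simplified assumptions made for this example, and only the tag names are taken from the output shown in the Examples section.

```python
import re

# Hypothetical, deliberately simple patterns; DEDUCE's own rules are more elaborate.
PATTERNS = {
    "URL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),      # e-mail addresses only
    "TELEFOONNUMMER": re.compile(r"\b0\d{1,3}-\d{6,8}\b"),   # e.g. 06-12345678
}


def annotate(text):
    """Wrap every match of each pattern in a <TAG match> annotation."""
    for tag, pattern in PATTERNS.items():
        # Bind tag as a default argument so each lambda keeps its own tag.
        text = pattern.sub(lambda m, tag=tag: f"<{tag} {m.group(0)}>", text)
    return text


print(annotate("e: j.jnsen@email.com, t: 06-12345678"))
```

Each category gets its own pattern, and a match is wrapped in place, so the surrounding text stays untouched.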

Prerequisites

nltk

Installing

Installing can be done through pip:

pip install deduce

Or, to install from source, download the code and run the setup script with Python:

python setup.py install

Getting started

The package provides two functions: annotate_text, which marks PHI in a text with category tags, and deidentify_annotations, which replaces the annotated spans with category placeholders.

import deduce

deduce.annotate_text(
    text,                       # The text to be annotated
    patient_first_names="",     # First names (separated by whitespace)
    patient_initials="",        # Initial
    patient_surname="",         # Surname(s)
    patient_given_name="",      # Given name
    names=True,                 # Person names, including initials
    locations=True,             # Geographical locations
    institutions=True,          # Institutions
    dates=True,                 # Dates
    ages=True,                  # Ages
    patient_numbers=True,       # Patient numbers
    phone_numbers=True,         # Phone numbers
    urls=True,                  # Urls and e-mail addresses
    flatten=True                # Debug option
)

deduce.deidentify_annotations(
    text                        # The annotated text that should be de-identified
)

Examples

>>> import deduce

>>> text = ("Dit is stukje tekst met daarin de naam Jan Jansen. De patient J. Jansen (e: j.jnsen@email.com, t: 06-12345678) is 64 jaar oud "
...         "en woonachtig in Utrecht. Hij werd op 10 oktober door arts Peter de Visser ontslagen van de kliniek van het UMCU.")
>>> annotated = deduce.annotate_text(text, patient_first_names="Jan", patient_surname="Jansen")
>>> deidentified = deduce.deidentify_annotations(annotated)

>>> print (annotated)
"Dit is stukje tekst met daarin de naam <PATIENT Jan Jansen>. De <PATIENT patient J. Jansen> (e: <URL j.jnsen@email.com>, t: <TELEFOONNUMMER 06-12345678>) 
is <LEEFTIJD 64> jaar oud en woonachtig in <LOCATIE Utrecht>. Hij werd op <DATUM 10 oktober> door arts <PERSOON Peter de Visser> ontslagen van de kliniek van het <INSTELLING umcu>."
>>> print (deidentified)
"Dit is stukje tekst met daarin de naam <PATIENT>. De <PATIENT> (e: <URL-1>, t: <TELEFOONNUMMER-1>) is <LEEFTIJD-1> jaar oud en woonachtig in <LOCATIE-1>.
Hij werd op <DATUM-1> door arts <PERSOON-1> ontslagen van de kliniek van het <INSTELLING-1>."
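The step from the annotated text to the de-identified text in the example above can be sketched as a single regex substitution. This is only an illustration under stated assumptions, not the library's implementation: the function name `replace_annotations` is hypothetical, and the numbering rule (one number per distinct span text within a category, with PATIENT left unnumbered) is inferred from the example output rather than from the DEDUCE source.

```python
import re
from collections import defaultdict

# Matches annotations of the form <CATEGORY span text>.
TAG_PATTERN = re.compile(r"<([A-Z]+) ([^<>]+)>")


def replace_annotations(annotated):
    """Replace <CATEGORY text> with <CATEGORY-n>, or <PATIENT> for the patient."""
    seen = defaultdict(dict)  # category -> {span text: assigned number}

    def substitute(match):
        category, span = match.group(1), match.group(2)
        if category == "PATIENT":
            return "<PATIENT>"
        numbers = seen[category]
        if span not in numbers:
            # Assumption: a repeated span reuses the number it got first.
            numbers[span] = len(numbers) + 1
        return f"<{category}-{numbers[span]}>"

    return TAG_PATTERN.sub(substitute, annotated)


sample = "De <PATIENT patient J. Jansen> is <LEEFTIJD 64> jaar oud en woonachtig in <LOCATIE Utrecht>."
print(replace_annotations(sample))
```

Numbering spans per category keeps the de-identified text readable, because a reader can still tell whether two mentions refer to the same person or place.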

Configuring

The lookup lists in the data/ folder can be tailored to the user's specific needs. This is especially recommended for the list of institution names, since by default it is tailored to the location where the method was developed and tested. Regular expressions can be modified in annotate.py; for the same reason, this is recommended for detecting patient numbers.
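To show why a tailored institution list matters, here is a minimal sketch of lookup-list matching. It does not reproduce how DEDUCE reads or applies the files in data/: the function `annotate_institutions` and the sample names are hypothetical, and unlike the example output above it leaves the matched text in its original case.

```python
import re


def annotate_institutions(text, institutions):
    """Tag every occurrence of a listed institution name, ignoring case."""
    if not institutions:
        return text
    # Try longer names first so a short name never shadows a longer one.
    names = sorted(institutions, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(name) for name in names) + r")\b",
        re.IGNORECASE,
    )
    return pattern.sub(lambda m: f"<INSTELLING {m.group(1)}>", text)


# Hypothetical lookup list; a real one would hold the local institution names.
local_institutions = ["umcu", "voorbeeldkliniek"]
print(annotate_institutions("ontslagen van de kliniek van het UMCU.", local_institutions))
```

An institution that is missing from the list is simply never tagged, which is why a list built for one region tends to miss names from another.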

Contributing

Thank you for considering a contribution to DEDUCE; we are very open to your help!

If you need support, have a question, or found a bug/error, please get in touch by creating a New Issue. We don't have an issue template; just try to be specific and complete, so we can tackle it.
If you want to make a contribution either to the code or the docs, please take a few minutes to read our contribution guidelines. This greatly improves the chances of your work being merged into the repository.

Changelog

You may find detailed versioning information in the changelog.

Authors

Vincent Menger - Initial work
Jonathan de Bruin - Code review
Pablo Mosteiro - Bug fixes, structured annotations

License

This project is licensed under the GNU LGPLv3 license - see the LICENSE.md file for details
