Georgetown-IR-Lab / Quickumls

License: MIT
System for Medical Concept Extraction and Linking

NEW: v.1.4 supports starting multiple QuickUMLS matchers concurrently! I've finally added support for unqlite as an alternative to leveldb for storage of CUIs and Semantic Types (see here for more details). unqlite-backed QuickUMLS installations support multiple matchers running at the same time. Besides better multi-processing support, unqlite should also have better support for unicode.

QuickUMLS

QuickUMLS (Soldaini and Goharian, 2016) is a tool for fast, unsupervised biomedical concept extraction from medical text. It takes advantage of Simstring (Okazaki and Tsujii, 2010) for approximate string matching. For more details on how QuickUMLS works, we refer readers to our paper.

This project should be compatible with Python 3 (Python 2 is no longer supported) and run on any UNIX system (support for Windows is experimental, please report bugs!). If you find any bugs, please file an issue on GitHub or email the author at [email protected].

Installation

  1. Obtain a UMLS installation: This tool requires you to have a valid UMLS installation on disk. To install UMLS, you must first obtain a license from the National Library of Medicine; then you should download all UMLS files from this page; finally, you can install UMLS using the MetamorphoSys tool as explained in this guide. The installation can be removed once the system has been initialized.
  2. Install QuickUMLS: You can do so by either running pip install quickumls or python setup.py install. On macOS, using anaconda is strongly recommended.
  3. Create a QuickUMLS installation: Initialize the system by running python -m quickumls.install <umls_installation_path> <destination_path>, where <umls_installation_path> is where the installation files are (in particular, we need MRCONSO.RRF and MRSTY.RRF) and <destination_path> is the directory where the QuickUMLS data files should be installed. This process takes between 5 and 30 minutes, depending on the speed of the CPU and of the drive where the UMLS and QuickUMLS files are stored (on a system with an Intel i7 6700K CPU and a 7200 RPM hard drive, initialization takes 8.5 minutes). python -m quickumls.install supports the following optional arguments:
    • -L / --lowercase: if used, all concept terms are folded to lowercase before being processed. This option typically increases recall, but it might reduce precision;
    • -U / --normalize-unicode: if used, expressions with non-ASCII characters are converted to the closest combination of ASCII characters.
    • -E / --language: Specify the language to consider for UMLS concepts; by default, English is used. For a complete list of languages, please see this table provided by NLM.
    • -d / --database-backend: Specify which database backend to use for QuickUMLS. The two options are leveldb and unqlite. The latter supports multi-process reading and has better unicode compatibility, and it is used as the default for all new 1.4 installations; the former is still used as the default when instantiating a QuickUMLS client. More information about the differences between the two databases, and about migration, is available here.

Note: If the installation fails on macOS when using Anaconda, install leveldb first by running conda install -c conda-forge python-leveldb.
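
As a minimal sketch, the installation command and its optional flags can be assembled from Python before being handed to subprocess. The helper name build_install_command and the example paths are hypothetical; only the -L, -U, -E and -d flags listed above are used, and the value passed to -E is left to the caller.

```python
import sys


def build_install_command(umls_path, destination_path, lowercase=False,
                          normalize_unicode=False, language=None,
                          database_backend=None):
    """Assemble the argument list for `python -m quickumls.install`.

    Hypothetical helper: it only emits the flags documented above.
    """
    command = [sys.executable, "-m", "quickumls.install",
               umls_path, destination_path]
    if lowercase:
        command.append("-L")            # fold concept terms to lowercase
    if normalize_unicode:
        command.append("-U")            # map non-ASCII characters to ASCII
    if language is not None:
        command += ["-E", language]     # language of the UMLS concepts
    if database_backend is not None:
        if database_backend not in ("leveldb", "unqlite"):
            raise ValueError("database_backend must be 'leveldb' or 'unqlite'")
        command += ["-d", database_backend]
    return command


# Placeholder paths; replace them with real directories.
command = build_install_command("/path/to/umls", "/path/to/quickumls",
                                lowercase=True, database_backend="unqlite")
print(" ".join(command))
```

Passing the resulting list to subprocess.run(command, check=True) would start the installation; the sketch itself only prints the command.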

APIs

A QuickUMLS object can be instantiated as follows:

from quickumls import QuickUMLS

matcher = QuickUMLS(quickumls_fp, overlapping_criteria, threshold,
                    similarity_name, window, accepted_semtypes)

Where:

  • quickumls_fp is the directory where the QuickUMLS data files are installed.
  • overlapping_criteria (optional, default: "score") is the criterion used to deal with overlapping concepts; choose "score" if the matching score of the concepts should be considered first, "length" if the longest should be considered first instead.
  • threshold (optional, default: 0.7) is the minimum similarity value between strings.
  • similarity_name (optional, default: "jaccard") is the name of similarity to use. Choose between "dice", "jaccard", "cosine", or "overlap".
  • window (optional, default: 5) is the maximum number of tokens to consider for matching.
  • accepted_semtypes (optional, default: see constants.py) is the set of UMLS semantic types concepts should belong to. Semantic types are identified by the letter "T" followed by three numbers (e.g., "T131", which identifies the type "Hazardous or Poisonous Substance"). See here for the full list.
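
To make the similarity_name and threshold parameters more concrete, the sketch below computes the four measures over sets of character n-grams. It is an illustration only, not the Simstring or QuickUMLS implementation: the n-gram size of 3, the absence of padding, and the function names are assumptions.

```python
import math


def char_ngrams(text, n=3):
    """Return the set of character n-grams of `text` (n=3 is an assumption)."""
    if len(text) < n:
        return {text}
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def similarity(a, b, name="jaccard", n=3):
    """Set-based similarity between the n-gram sets of two strings."""
    x, y = char_ngrams(a, n), char_ngrams(b, n)
    shared = len(x & y)
    if name == "jaccard":
        return shared / len(x | y)
    if name == "dice":
        return 2 * shared / (len(x) + len(y))
    if name == "cosine":
        return shared / math.sqrt(len(x) * len(y))
    if name == "overlap":
        return shared / min(len(x), len(y))
    raise ValueError("unknown similarity: {}".format(name))


threshold = 0.7
for name in ("jaccard", "dice", "cosine", "overlap"):
    score = similarity("diarrhea", "diarrrhea", name)
    print("{:8s} {:.3f} accepted={}".format(name, score, score >= threshold))
```

Under these assumptions the misspelling "diarrrhea" shares six of its seven trigrams with "diarrhea", so it clears the default threshold of 0.7 under every measure; a higher threshold makes matching stricter.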

To use the matcher, simply call

text = "The ulna has dislocated posteriorly from the trochlea of the humerus."
matcher.match(text, best_match=True, ignore_syntax=False)

Set best_match to False to return overlapping candidates, and ignore_syntax to True to disable all heuristics introduced in (Soldaini and Goharian, 2016).
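
The effect of choosing "score" or "length" when overlapping candidates are resolved can be sketched with a greedy selection over hand-written candidates. This is not the library's code: the function name, the candidate dictionaries, and their keys (ngram, start, end, similarity) are assumptions made for illustration.

```python
def select_non_overlapping(candidates, overlapping_criteria="score"):
    """Greedily keep the best candidates whose character spans do not overlap."""
    if overlapping_criteria == "score":
        # Highest similarity first, span length as tie-breaker.
        key = lambda c: (c["similarity"], c["end"] - c["start"])
    elif overlapping_criteria == "length":
        # Longest span first, similarity as tie-breaker.
        key = lambda c: (c["end"] - c["start"], c["similarity"])
    else:
        raise ValueError("overlapping_criteria must be 'score' or 'length'")

    selected = []
    for cand in sorted(candidates, key=key, reverse=True):
        overlaps = any(cand["start"] < kept["end"] and kept["start"] < cand["end"]
                       for kept in selected)
        if not overlaps:
            selected.append(cand)
    return sorted(selected, key=lambda c: c["start"])


# Hypothetical candidates for the text "chest pain, nausea".
candidates = [
    {"ngram": "chest pain", "start": 0, "end": 10, "similarity": 0.9},
    {"ngram": "pain", "start": 6, "end": 10, "similarity": 1.0},
    {"ngram": "nausea", "start": 12, "end": 18, "similarity": 1.0},
]

by_score = [c["ngram"] for c in select_non_overlapping(candidates, "score")]
by_length = [c["ngram"] for c in select_non_overlapping(candidates, "length")]
print(by_score)
print(by_length)
```

With these made-up scores, "score" keeps the exact match "pain" and drops the longer but weaker "chest pain", while "length" does the opposite.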

If the matcher throws a warning during initialization, read this page to learn why and how to stop it from doing so.

spaCy pipeline component

QuickUMLS can be used for standalone processing, but it can also be used as a component in a modular spaCy pipeline. This follows the usual spaCy convention of representing concepts as entity objects added to the Document object. These entity objects carry the CUI, similarity score, and Semantic Types in the spaCy "underscore" object.

Adding QuickUMLS as a component in a pipeline can be done as follows:

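import spacy
from quickumls.spacy_component import SpacyQuickUMLS

# common English pipeline
nlp = spacy.load('en_core_web_sm')

# build the component from the pipeline and the QuickUMLS data directory
quickumls_component = SpacyQuickUMLS(nlp, 'PATH_TO_QUICKUMLS_DATA')
# spaCy 2.x style: the component instance is added to the pipeline directly
nlp.add_pipe(quickumls_component)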

doc = nlp('Pt c/o shortness of breath, chest pain, nausea, vomiting, diarrrhea')

for ent in doc.ents:
    print('Entity text : {}'.format(ent.text))
    print('Label (UMLS CUI) : {}'.format(ent.label_))
    print('Similarity : {}'.format(ent._.similarity))
    print('Semtypes : {}'.format(ent._.semtypes))

Server / Client Support

Starting with v.1.2, QuickUMLS includes support for use in a client-server configuration. That is, you can start one QuickUMLS server and query it from multiple scripts using a client.

To start the server, run python -m quickumls.server:

python -m quickumls.server /path/to/quickumls/files {-P QuickUMLS port} {-H QuickUMLS host} {QuickUMLS options}

Host and port are optional; by default, QuickUMLS runs on localhost:4645. You can also pass any QuickUMLS option mentioned above to the server. To obtain a list of options for the server, run python -m quickumls.server -h.

To load the client, import get_quickumls_client from quickumls:

from quickumls import get_quickumls_client
matcher = get_quickumls_client()
text = "The ulna has dislocated posteriorly from the trochlea of the humerus."
matcher.match(text, best_match=True, ignore_syntax=False)

The API of the client is the same as that of a QuickUMLS object.

In case you wish to run the server in the background, you can do so as follows:

nohup python -m quickumls.server /path/to/QuickUMLS {server options} > /dev/null 2>&1 & echo $! > nohup.pid

When you are done, don't forget to stop the server by running:

kill -9 `cat nohup.pid`
rm nohup.pid
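
The same start and stop lifecycle can also be managed from Python. The sketch below is one possible approach, assuming a hypothetical background_process helper built on subprocess; it uses a stand-in command so that it runs without QuickUMLS installed, and it stops the process with a regular termination request before falling back to a forced kill.

```python
import contextlib
import subprocess
import sys


@contextlib.contextmanager
def background_process(command):
    """Start `command` in the background and stop it when the block exits."""
    process = subprocess.Popen(command,
                               stdout=subprocess.DEVNULL,
                               stderr=subprocess.DEVNULL)
    try:
        yield process
    finally:
        process.terminate()              # polite stop request first
        try:
            process.wait(timeout=10)
        except subprocess.TimeoutExpired:
            process.kill()               # forced stop, like kill -9
            process.wait()


# Stand-in command so the sketch runs anywhere. For QuickUMLS it would be
# [sys.executable, "-m", "quickumls.server", "/path/to/quickumls/files"].
placeholder = [sys.executable, "-c", "import time; time.sleep(60)"]

with background_process(placeholder) as server:
    running = server.poll() is None      # still alive inside the block
stopped = server.poll() is not None      # exited once the block is left
print(running, stopped)
```

Wrapping the server in a context manager avoids the pid file and ensures the process is stopped even if the client code raises an exception.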

References

  • Naoaki Okazaki and Jun'ichi Tsujii. "Simple and efficient algorithm for approximate dictionary matching." COLING 2010.
  • Luca Soldaini and Nazli Goharian. "QuickUMLS: a fast, unsupervised approach for medical concept extraction." MedIR Workshop, SIGIR 2016.