Licence: MIT License

PhenoTagger


This repo contains the source code and dataset for PhenoTagger.

PhenoTagger is a hybrid method that combines dictionary-based and deep learning-based methods to recognize Human Phenotype Ontology (HPO) concepts in unstructured biomedical text. It is ontology-driven and does not require any manually labeled training data. Such data is expensive to produce, and annotating a large-scale training dataset that covers all classes of HPO concepts is highly challenging and unrealistic. Please refer to our paper for more details.

Dependency package

PhenoTagger has been tested with Python 3.7 on CentOS, on both CPU and GPU, and uses the following dependencies:

TF2:

or TF1:

To install all dependencies automatically, use the command:

$ pip install -r requirements.txt

Data and model preparation

  1. To run this code, you first need to download the model file (it includes the trained models, i.e., BioBERT-Base v1.1, pre-trained word embeddings, and two trained models for HPO concept recognition), then unzip it and put the model folder into the PhenoTagger folder.
  2. The corpora used in the experiments are provided in /data/corpus.zip. Unzip the file if you need to use them.

Tagging free text with PhenoTagger

You can use our trained PhenoTagger to identify HPO concepts in biomedical texts with the PhenoTagger_tagging.py file.

The file requires 2 parameters:

  • --input, -i, help="the folder with input files"
  • --output, -o, help="output folder to save the tagged results"
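
The two flags above map naturally onto an argparse declaration. The following is only a minimal sketch of such a command-line interface, not the actual source of PhenoTagger_tagging.py; the function name build_parser and the use of required=True are assumptions made for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the two documented flags."""
    parser = argparse.ArgumentParser(description="Tag HPO concepts in text files")
    parser.add_argument("--input", "-i", required=True,
                        help="the folder with input files")
    parser.add_argument("--output", "-o", required=True,
                        help="output folder to save the tagged results")
    return parser


if __name__ == "__main__":
    # Parse an explicit argument list so the sketch runs without real CLI input.
    args = build_parser().parse_args(
        ["-i", "../example/input/", "-o", "../example/output/"]
    )
    print(args.input, args.output)
```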

The input files can be in BioC (XML) or PubTator (tab-delimited text) format (see our format descriptions). There are some examples in the /example/ folder.

Example:

$ python PhenoTagger_tagging.py -i ../example/input/ -o ../example/output/
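
Tagged results in PubTator format can be post-processed with a few lines of Python. The sketch below assumes the usual PubTator layout (title and abstract lines written as "id|t|..." and "id|a|...", followed by tab-delimited annotation lines with document ID, start offset, end offset, mention, type, and concept ID). The exact columns PhenoTagger writes are not described here, so any trailing columns are kept untouched in a field named extra. The names Annotation and parse_pubtator are hypothetical, and the sample text is made up.

```python
from typing import Dict, List, NamedTuple


class Annotation(NamedTuple):
    doc_id: str
    start: int
    end: int
    mention: str
    concept_type: str
    concept_id: str
    extra: List[str]  # any trailing columns, kept verbatim


def parse_pubtator(text: str) -> Dict[str, List[Annotation]]:
    """Group annotation lines by document ID; title/abstract lines are skipped."""
    result: Dict[str, List[Annotation]] = {}
    for line in text.splitlines():
        cols = line.split("\t")
        # Annotation lines are tab-delimited; "id|t|..." and "id|a|..." lines are not.
        if len(cols) < 6:
            continue
        if not (cols[1].isdigit() and cols[2].isdigit()):
            continue
        ann = Annotation(cols[0], int(cols[1]), int(cols[2]),
                         cols[3], cols[4], cols[5], cols[6:])
        result.setdefault(ann.doc_id, []).append(ann)
    return result


if __name__ == "__main__":
    # Made-up sample document with two annotations.
    sample = (
        "1|t|Short stature in a patient\n"
        "1|a|The patient also had seizures.\n"
        "1\t0\t13\tShort stature\tPhenotype\tHP:0004322\n"
        "1\t48\t56\tseizures\tPhenotype\tHP:0001250\n"
    )
    for ann in parse_pubtator(sample)["1"]:
        print(ann.start, ann.end, ann.mention, ann.concept_id)
```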

We also provide some optional parameters for the different requirements of users in the PhenoTagger_tagging.py file.

para_set={
'model_type':'biobert',   # two deep learning models are provided: cnn or biobert
'onlyLongest':False,  # False: return overlapping concepts; True: only return the longest of the overlapping concepts
'abbrRecog':False,    # False: don't identify abbreviations; True: identify abbreviations
'ML_Threshold':0.95,  # the threshold of the deep learning model
}
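
To make the effect of ML_Threshold and onlyLongest concrete, the following sketch filters a list of candidate spans. It is an illustration under stated assumptions, not PhenoTagger's implementation: the (start, end, hpo_id, score) tuple layout, the function name filter_spans, the use of >= for the threshold comparison, and the tie-breaking order are all hypothetical.

```python
from typing import List, Tuple

# (start, end, hpo_id, score); this tuple layout is an assumption for illustration.
Span = Tuple[int, int, str, float]


def filter_spans(spans: List[Span], threshold: float = 0.95,
                 only_longest: bool = False) -> List[Span]:
    """Drop low-scoring spans, then optionally keep only the longest of overlapping ones."""
    kept = [s for s in spans if s[3] >= threshold]
    if not only_longest:
        return sorted(kept)
    result: List[Span] = []
    # Visit longer spans first so that a shorter overlapping span is rejected.
    for span in sorted(kept, key=lambda s: (-(s[1] - s[0]), s[0])):
        if all(span[1] <= r[0] or span[0] >= r[1] for r in result):
            result.append(span)
    return sorted(result)


if __name__ == "__main__":
    # Made-up candidates: two overlapping spans and one low-scoring span.
    candidates = [
        (0, 13, "HP:0004322", 0.99),
        (6, 13, "HP:0000002", 0.97),
        (20, 28, "HP:0001250", 0.50),
    ]
    print(filter_spans(candidates))                      # overlapping spans kept
    print(filter_spans(candidates, only_longest=True))   # only the longer span kept
```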

Training PhenoTagger with a new ontology

1. Build the ontology dictionary using the Build_dict.py file

The file requires 3 parameters:

  • --input, -i, help="input the ontology .obo file"
  • --output, -o, help="the output folder of dictionary"
  • --rootnode, -r, help="input the root node of the ontology"

Example:

$ python Build_dict.py -i ../ontology/hp.obo -o ../dict/ -r HP:0000118

After the program is finished, 5 files will be generated in the output folder.

  • id_word_map.json
  • lable.vocab
  • noabb_lemma.dic
  • obo.json
  • word_id_map.json
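
The role of the root node can be illustrated with a small OBO reader that collects every term under a given root by following is_a links. This is a simplified sketch and not the code in Build_dict.py: the function names and the returned dictionary layout are assumptions, it handles only the id, name, synonym, and is_a tags, and whether the real script includes the root term itself is not stated here.

```python
from collections import defaultdict
from typing import Dict, List, Set


def parse_obo_terms(obo_text: str) -> Dict[str, dict]:
    """Read [Term] stanzas into {id: {"name": ..., "synonyms": [...], "parents": [...]}}."""
    terms: Dict[str, dict] = {}
    current = None
    for raw in obo_text.splitlines():
        line = raw.strip()
        if line.startswith("["):
            # A new stanza begins; only [Term] stanzas are collected.
            current = {"name": "", "synonyms": [], "parents": []} if line == "[Term]" else None
        elif current is None:
            continue
        elif line.startswith("id: "):
            terms[line[4:]] = current
        elif line.startswith("name: "):
            current["name"] = line[6:]
        elif line.startswith("synonym: "):
            # The synonym text is the first double-quoted string on the line.
            parts = line.split('"')
            if len(parts) >= 3:
                current["synonyms"].append(parts[1])
        elif line.startswith("is_a: "):
            # "is_a: HP:0000001 ! All" -> keep only the identifier.
            current["parents"].append(line[6:].split()[0])
    return terms


def descendants(terms: Dict[str, dict], root: str) -> Set[str]:
    """Return the root and every term reachable from it through is_a links."""
    children: Dict[str, List[str]] = defaultdict(list)
    for term_id, term in terms.items():
        for parent in term["parents"]:
            children[parent].append(term_id)
    seen, stack = {root}, [root]
    while stack:
        for child in children[stack.pop()]:
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


if __name__ == "__main__":
    # Toy hierarchy for illustration only; it does not reproduce the real HPO structure.
    toy = """
[Term]
id: HP:0000001
name: All

[Term]
id: HP:0000118
name: Phenotypic abnormality
is_a: HP:0000001 ! All

[Term]
id: HP:0001250
name: Seizure
synonym: "Seizures" EXACT []
is_a: HP:0000118 ! Phenotypic abnormality
"""
    terms = parse_obo_terms(toy)
    print(sorted(descendants(terms, "HP:0000118")))
```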

2. Build the distantly-supervised training dataset using the Build_distant_corpus.py file

The file requires 4 parameters:

  • --dict, -d, help="the input folder of the ontology dictionary"
  • --fileneg, -f, help="the text file used to generate the negatives" (You can use our negative text "mutation_disease.txt" )
  • --negnum, -n, help="the number of negatives; we suggest using the same number as the positives."
  • --output, -o, help="the output folder of the distantly-supervised training dataset"

Example:

$ python Build_distant_corpus.py -d ../dict/ -f ../data/mutation_disease.txt -n 10000 -o ../data/distant_train_data/

After the program is finished, 3 files will be generated in the output folder:

  • distant_train.conll (distantly-supervised training data)
  • distant_train_pos.conll (distantly-supervised training positives)
  • distant_train_neg.conll (distantly-supervised training negatives)
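
The suggestion to use as many negatives as positives amounts to balanced sampling. The sketch below shows the idea on plain Python lists; it is not how Build_distant_corpus.py selects negatives, and the function name sample_negatives and the fixed seed are assumptions.

```python
import random
from typing import List, Sequence


def sample_negatives(candidates: Sequence[str], num_positives: int,
                     seed: int = 0) -> List[str]:
    """Draw as many negatives as there are positives (or all candidates if fewer exist)."""
    rng = random.Random(seed)  # fixed seed so repeated runs give the same sample
    k = min(num_positives, len(candidates))
    return rng.sample(list(candidates), k)


if __name__ == "__main__":
    # Made-up negative candidates; three positives are assumed to exist.
    negatives_pool = [f"negative phrase {i}" for i in range(10)]
    print(sample_negatives(negatives_pool, num_positives=3))
```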

3. Train PhenoTagger using the PhenoTagger_training.py file

The file takes 4 parameters (the development set file is optional):

  • --trainfile, -t, help="the training file"
  • --devfile, -d, help="the development set file. If no dev file is provided, training stops after the specified number of epochs"
  • --modeltype, -m, help="the deep learning model type (cnn or biobert)"
  • --output, -o, help="the output folder of the model"

Example:

$ python PhenoTagger_training.py -t ../data/distant_train_data/distant_train.conll -d ../data/corpus/GSC/GSCplus_dev_gold.tsv -m biobert -o ../models/

After the program is finished, 2 files will be generated in the output folder:

  • cnn.h5/biobert.h5 (the trained model)
  • cnn_dev_temp.tsv/biobert_dev_temp.tsv (the prediction results of the development set, if you input a development set file)
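
The difference between training with and without a development set can be sketched as a loop that either stops early on the dev score or simply runs for a fixed number of epochs. This is an illustration only: the stopping criterion actually used by PhenoTagger_training.py is not described here, and the function name, the callables, and the patience value are assumptions.

```python
from typing import Callable, Optional, Tuple


def train_with_optional_dev(train_one_epoch: Callable[[int], None],
                            evaluate_dev: Optional[Callable[[], float]],
                            max_epochs: int = 50,
                            patience: int = 5) -> Tuple[int, Optional[float]]:
    """Run up to max_epochs; with a dev scorer, stop after `patience` epochs without gain.

    Returns (epochs_run, best_dev_score); best_dev_score is None without a dev set.
    """
    best: Optional[float] = None
    stale = 0
    for epoch in range(1, max_epochs + 1):
        train_one_epoch(epoch)
        if evaluate_dev is None:
            continue  # no dev set: only the epoch limit ends training
        score = evaluate_dev()
        if best is None or score > best:
            best, stale = score, 0
        else:
            stale += 1
            if stale >= patience:
                return epoch, best
    return max_epochs, best


if __name__ == "__main__":
    # Made-up dev scores that stop improving after the second epoch.
    fake_scores = iter([0.60, 0.70, 0.70, 0.69, 0.68])
    print(train_with_optional_dev(lambda epoch: None, lambda: next(fake_scores),
                                  max_epochs=5, patience=2))
    print(train_with_optional_dev(lambda epoch: None, None, max_epochs=3))
```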

Web API

We also provide a Web API for PhenoTagger for ease of use. Due to limited computing resources, the API runs on a CPU. If you have GPUs, we suggest you download the source code and run PhenoTagger on your own server.

You can use it to process raw text in the same way as the PubTator API. You need to set the [Bioconcept] parameter to "Phenotype". Python code samples can be found in the API_pythonExample folder. We suggest using the PubTator or BioC-XML formats.

The process consists of two primary steps: 1) submitting requests and 2) retrieving results.

1. Submitting requests

$ python SubmitText_request.py [Inputfolder] [Bioconcept:Phenotype] [Outputfile_SessionNumber]

Three parameters are required:

  • [Inputfolder]: a folder with files to submit
  • [Bioconcept]: Phenotype
  • [Outputfile_SessionNumber]: output file to save the session numbers

Example:

$ python SubmitText_request.py input Phenotype SessionNumber.txt

2. Retrieving results

$ python SubmitText_retrieve.py [Inputfolder] [Inputfile_SessionNumber] [Outputfolder]

Three parameters are required:

  • [Inputfolder]: original input folder
  • [Inputfile_SessionNumber]: a file with a list of session numbers
  • [Outputfolder]: Output folder

Example:

$ python SubmitText_retrieve.py input SessionNumber.txt output

Note that each file in the input folder will be submitted for processing separately. After submission, each file may be queued for 10 to 20 minutes, depending on the computer cluster workload.
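
Because results may take a while to become available, a client typically retries retrieval at intervals. The sketch below shows a generic polling loop with a timeout. It does not contact the Web API: fetch is a hypothetical stand-in for whatever retrieval call you use (for example, a wrapper around the retrieve script), and the interval and timeout defaults are assumptions.

```python
import time
from typing import Callable, Optional


def poll_until_ready(fetch: Callable[[], Optional[str]],
                     interval_s: float = 60.0,
                     timeout_s: float = 30 * 60.0,
                     sleep: Callable[[float], None] = time.sleep) -> str:
    """Call fetch() until it returns a result or the timeout elapses."""
    waited = 0.0
    while True:
        result = fetch()
        if result is not None:
            return result
        if waited >= timeout_s:
            raise TimeoutError(f"no result after {waited:.0f} s")
        sleep(interval_s)
        waited += interval_s


if __name__ == "__main__":
    # Simulated service: not ready twice, then returns a made-up result.
    replies = iter([None, None, "tagged output"])
    print(poll_until_ready(lambda: next(replies), interval_s=0.01, timeout_s=1.0))
```

Passing the sleep function as a parameter keeps the loop easy to test without real waiting.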

Performance on HPO GSC+

The following table shows the results of PhenoTagger with the CNN and BioBERT models on the GSC+ test set, together with the training/test times on one NVIDIA Tesla V100 GPU. You can choose the appropriate model according to your needs.

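| Method                | Training/Test time | Men-P | Men-R | Men-F1 | Doc-P | Doc-R | Doc-F1 |
|-----------------------|--------------------|-------|-------|--------|-------|-------|--------|
| PhenoTagger (CNN)     | 2h56m/106s         | 0.772 | 0.706 | 0.738  | 0.735 | 0.706 | 0.720  |
| PhenoTagger (BioBERT) | 15h42m/152s        | 0.789 | 0.722 | 0.754  | 0.774 | 0.740 | 0.757  |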

Here, h, m, and s denote hours, minutes, and seconds, respectively.
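
Assuming the F1 columns are the usual harmonic mean of precision (P) and recall (R), which is not stated explicitly here, the reported values can be recomputed from the P and R columns. The function name f1_score is chosen for illustration.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # (precision, recall) pairs copied from the table above.
    rows = {
        "CNN, mention-level": (0.772, 0.706),
        "CNN, document-level": (0.735, 0.706),
        "BioBERT, mention-level": (0.789, 0.722),
        "BioBERT, document-level": (0.774, 0.740),
    }
    for name, (p, r) in rows.items():
        print(f"{name}: F1 = {f1_score(p, r):.3f}")
```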

Citing PhenoTagger

If you're using PhenoTagger, please cite:

Acknowledgments

This research is supported by the Intramural Research Programs of the National Institutes of Health, National Library of Medicine. Thanks to Dr. Chih-Hsuan Wei for his help with Web APIs.

Disclaimer

This tool shows the results of research conducted in the Computational Biology Branch, NCBI. The information produced on this website is not intended for direct diagnostic use or medical decision-making without review and oversight by a clinical professional. Individuals should not change their health behavior solely on the basis of information produced on this website. NIH does not independently verify the validity or utility of the information produced by this tool. If you have questions about the information produced on this website, please see a health care professional. More information about NCBI's disclaimer policy is available.

