License: MIT
Text pair classification


Bisemantic

Bisemantic identifies semantic relationships between pairs of texts. It uses a shared LSTM to map the two texts to a common representation, which is then aligned with training labels.

Installation

Bisemantic depends on the spaCy natural language processing toolkit. It may be necessary to install spaCy's English language model with a command like python -m spacy download en before running. See spaCy's models documentation for more information.

Running

Run Bisemantic with the command bisemantic. Subcommands enable you to train and use models and partition data into cross-validation sets. Run bisemantic --help for details about specific commands.

Input data takes the form of comma-separated values (CSV) files. Training data has the columns text1, text2, and label. Test data takes the same form minus the label column. Command-line options allow you to read files with different column names and formatting.
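For illustration, training and test files in the default column layout might be produced like this (a minimal sketch using only the standard library; the example sentences and labels are made up):

```python
import csv

# A tiny training set in the default column layout: text1, text2, label.
rows = [
    {"text1": "How do I learn Python?",
     "text2": "What is the best way to learn Python?",
     "label": "1"},
    {"text1": "How do I learn Python?",
     "text2": "What is the capital of France?",
     "label": "0"},
]
with open("train.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["text1", "text2", "label"])
    writer.writeheader()
    writer.writerows(rows)

# Test data is the same layout minus the label column.
with open("test.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["text1", "text2"])
    writer.writeheader()
    writer.writerows([{k: r[k] for k in ("text1", "text2")} for r in rows])
```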

Trained models are written to a directory that contains the following files:

  • model.info.text: a human-readable description of the model and training parameters
  • training-history.json: history of the training procedure, including the loss and accuracy for each epoch
  • model.h5: serialization of the model structure and its weights

Weights from the epoch with the best loss score are saved in model.h5.
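The exact schema of training-history.json is not documented here, but assuming it resembles a Keras History.history dictionary (one list entry per epoch for each metric; the metric names below are hypothetical), finding the epoch whose weights ended up in model.h5 could look like this:

```python
import json

# Hypothetical history contents: one value per epoch for each metric.
history = {
    "loss":     [0.62, 0.45, 0.41, 0.43],
    "acc":      [0.66, 0.79, 0.82, 0.81],
    "val_loss": [0.58, 0.47, 0.44, 0.46],
    "val_acc":  [0.70, 0.77, 0.80, 0.79],
}
with open("training-history.json", "w") as f:
    json.dump(history, f)

# The saved weights correspond to the epoch with the best (lowest) loss;
# here that is epoch 3 (1-based) if validation loss is the criterion.
with open("training-history.json") as f:
    history = json.load(f)
best_epoch = min(range(len(history["val_loss"])),
                 key=history["val_loss"].__getitem__) + 1
print(best_epoch)  # → 3
```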

The model directory can be used to predict probability distributions over labels and score test sets. Further training can be done using an existing model directory as a starting point.

Classifier Model

Text pair classification is framed as a supervised learning problem. The sample is a pair of texts and the label is a categorical class label. The meaning of the class varies from data set to data set but usually represents some kind of semantic relationship between the two texts.

GloVe vectors are used to embed the texts into matrices of size maximum tokens × 300, clipping or padding the first dimension for each individual text as needed. If maximum tokens is not specified, the number of tokens in the longest text in the pairs is used. An (optionally bidirectional) shared LSTM converts these embeddings to single vectors, r1 and r2, which are then concatenated into the vector [r1, r2, r1 · r2, (r1 − r2)²], where the product and the square are taken element-wise. A single-layer perceptron maps this vector to a softmax prediction over the labels.
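The combination step can be sketched in plain Python (a toy illustration in which 3-dimensional lists stand in for the LSTM outputs r1 and r2; in the real model these vectors have as many dimensions as the LSTM has units, and the product and square are assumed to be element-wise):

```python
# Toy stand-ins for the shared LSTM's output vectors r1 and r2.
r1 = [0.5, -1.0, 2.0]
r2 = [1.5, 0.5, 2.0]

# Concatenate [r1, r2, r1 * r2, (r1 - r2)^2], taking the product and
# the square element-wise.
features = (
    r1
    + r2
    + [a * b for a, b in zip(r1, r2)]
    + [(a - b) ** 2 for a, b in zip(r1, r2)]
)
print(features)
# → [0.5, -1.0, 2.0, 1.5, 0.5, 2.0, 0.75, -0.5, 4.0, 1.0, 2.25, 0.0]
```

The resulting 4 × len(r1) values are what the single-layer perceptron sees.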

Example Uses

Bisemantic can be used for tasks like question de-duplication or textual entailment.

Question Deduplication

The Quora question pair corpus contains pairs of questions annotated as either asking the same thing or not.

Bisemantic creates a model similar to that described in [Homma et al.] and [Addair].
The following command can be used to train a model on the train.csv file in this data set.

bisemantic train train.csv \
    --text-1-name question1 --text-2-name question2 \
    --label-name is_duplicate --index-name id \
    --validation-fraction 0.2 --batch-size 1024 \
    --maximum-tokens 75 --dropout 0.5 --units 256 --bidirectional \
    --model-directory-name quora.model

This achieved an accuracy of 83.71% on the validation split after 9 epochs of training.

Textual Entailment

The Stanford Natural Language Inference corpus is a corpus for recognizing textual entailment (RTE). It labels a "premise" sentence as either entailing, contradicting, or being neutral with respect to a "hypothesis" sentence.

Bisemantic creates a model similar to that described in [Bowman et al., 2015]. The following command can be used to train a model on the snli_1.0_train.txt and snli_1.0_dev.txt files in this data set.

bisemantic train snli_1.0_train.txt \
    --text-1-name sentence1 --text-2-name sentence2 \
    --label-name gold_label --index-name pairID \
    --invalid-labels "-" --not-comma-delimited \
    --validation-set snli_1.0_dev.txt --batch-size 1024 \
    --dropout 0.5 --units 256 --bidirectional \
    --model-directory-name snli.model

This achieved an accuracy of 80.16% on the development set and 79.49% on the snli_1.0_test.txt test set after 9 epochs of training.

References

  • Travis Addair. Duplicate Question Pair Detection with Deep Learning. [pdf]

  • Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP). [pdf]

  • Yushi Homma, Stuart Sy, and Christopher Yeh. Detecting Duplicate Questions with Deep Learning. [pdf]
