RandolphVI / Text Pairs Relation Classification

Licence: apache-2.0
About: Text pair (sentence-level) classification (similarity modeling) based on neural networks.

Deep Learning for Text Pairs Relation Classification

This repository is my bachelor graduation project, and it is also a study of TensorFlow and deep learning (CNN, RNN, etc.).

The main objective of the project is to determine whether two given sentences are similar in meaning (a binary classification problem) using neural networks (FastText, CNN, LSTM, etc.).

Requirements

  • Python 3.6
  • Tensorflow 1.15.0
  • Tensorboard 1.15.0
  • Sklearn 0.19.1
  • Numpy 1.16.2
  • Gensim 3.8.3
  • Tqdm 4.49.0

Project

The project structure is below:

.
├── Model
│   ├── test_model.py
│   ├── text_model.py
│   └── train_model.py
├── data
│   ├── word2vec_100.model.* [Need Download]
│   ├── Test_sample.json
│   ├── Train_sample.json
│   └── Validation_sample.json
├── utils
│   ├── checkmate.py
│   ├── data_helpers.py
│   └── param_parser.py
├── LICENSE
├── README.md
└── requirements.txt

Innovation

Data part

  1. Support both Chinese and English data (segment the text with jieba or nltk).
  2. Support your own pre-trained word vectors (for example, trained with gensim).
  3. Add embedding visualization in TensorBoard (you need to create metadata.tsv first).
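
The metadata.tsv file mentioned in item 3 can be produced with a few lines of Python. This is a minimal sketch, not the repository's own code: the function name `write_metadata` and the sample vocabulary are hypothetical. It assumes the single-column layout accepted by the TensorBoard embedding projector (one label per line, no header row), where line i labels embedding row i.

```python
import os
import tempfile


def write_metadata(vocab, path):
    """Write one token per line so that line i labels embedding row i.

    A single-column metadata file has no header row.
    """
    with open(path, "w", encoding="utf-8") as f:
        for token in vocab:
            # Tabs and newlines would break the TSV layout, so replace them.
            f.write(token.replace("\t", " ").replace("\n", " ") + "\n")


# Hypothetical vocabulary, ordered by embedding index.
vocab = ["<PAD>", "invention", "fiber", "calcium"]
metadata_path = os.path.join(tempfile.mkdtemp(), "metadata.tsv")
write_metadata(vocab, metadata_path)

with open(metadata_path, encoding="utf-8") as f:
    lines = f.read().splitlines()
print(lines)
```

The order of `vocab` must match the row order of the embedding matrix, otherwise the labels shown in the projector will be attached to the wrong points.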

Model part

  1. Add the correct L2 loss calculation.
  2. Add gradient clipping to prevent gradient explosion.
  3. Add exponential learning rate decay.
  4. Add a Highway layer (which proved useful judging by model performance).
  5. Add a Batch Normalization layer.
  6. Add several performance measures (especially AUC), since the data is imbalanced.
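
To make items 2 and 3 concrete, here is a pure-Python sketch of the arithmetic behind exponential learning rate decay and gradient clipping by global norm. It follows the formulas documented for `tf.train.exponential_decay` and `tf.clip_by_global_norm` in TensorFlow 1.x, but it is only an illustration: the function names mirror those ops for readability, the hyperparameter values are made up, and the repository's actual training code may differ.

```python
import math


def exponential_decay(initial_lr, global_step, decay_steps, decay_rate, staircase=False):
    """Return initial_lr * decay_rate ** (global_step / decay_steps)."""
    exponent = global_step / decay_steps
    if staircase:
        # Decay in discrete jumps instead of continuously.
        exponent = math.floor(exponent)
    return initial_lr * decay_rate ** exponent


def clip_by_global_norm(grads, clip_norm):
    """Rescale all gradients together so their joint L2 norm is at most clip_norm.

    `grads` is a list of flat gradient vectors (one list of floats per variable).
    Returns the rescaled gradients and the global norm before clipping.
    """
    global_norm = math.sqrt(sum(g * g for grad in grads for g in grad))
    # The scale is 1.0 when the norm is already within the limit.
    scale = clip_norm / max(global_norm, clip_norm)
    return [[g * scale for g in grad] for grad in grads], global_norm


# Illustrative values only.
lr = exponential_decay(0.001, global_step=1000, decay_steps=500, decay_rate=0.5)
clipped, norm = clip_by_global_norm([[3.0, 4.0], [0.0, 12.0]], clip_norm=5.0)
print(lr, norm, clipped)
```

Clipping by the global norm keeps the direction of the overall update unchanged, which is why it is often preferred over clipping each gradient independently.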

Code part

  1. You can choose to train the model from scratch or restore it from a checkpoint in train.py.
  2. You can create a prediction file that includes the predicted values and predicted labels for the test set in test.py.
  3. Add other useful data preprocessing functions in data_helpers.py.
  4. Use logging to record all run information (parameter display, model training info, etc.).
  5. Provide the ability to save the best n checkpoints in checkmate.py, whereas tf.train.Saver can only save the last n checkpoints.
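
The idea behind item 5 (keeping the best n checkpoints rather than the last n) can be sketched with a small min-heap. This is not the code in checkmate.py: the class `BestCheckpointTracker`, its methods, and the checkpoint names are hypothetical, and it assumes that a higher metric value is better.

```python
import heapq


class BestCheckpointTracker:
    """Keep the n checkpoints with the highest metric value."""

    def __init__(self, num_to_keep):
        self.num_to_keep = num_to_keep
        # Min-heap of (metric, step, path); the worst kept checkpoint sits on top.
        self._heap = []

    def update(self, metric, step, path):
        """Register a checkpoint and return the path that should be deleted, or None."""
        entry = (metric, step, path)
        if len(self._heap) < self.num_to_keep:
            heapq.heappush(self._heap, entry)
            return None
        # Adds the new entry, then pops the smallest one (possibly the new entry itself).
        evicted = heapq.heappushpop(self._heap, entry)
        return evicted[2]

    def best(self):
        """Return the kept checkpoints, best first."""
        return sorted(self._heap, reverse=True)


tracker = BestCheckpointTracker(num_to_keep=2)
history = [(100, 0.71), (200, 0.78), (300, 0.74), (400, 0.69)]  # (step, metric)
deleted = [tracker.update(metric, step, f"model-{step}") for step, metric in history]
print(deleted)
print(tracker.best())
```

In a real training loop the returned path would be removed from disk, and a checkpoint that is worse than everything already kept could be skipped before it is written at all.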

Data

See the data format in the /data folder, which includes the sample data files. For example:

{"front_testid": "4270954", "behind_testid": "7075962", "front_features": ["invention", "inorganic", "fiber", "based", "calcium", "sulfate", "dihydrate", "calcium"], "behind_features": ["vcsel", "structure", "thermal", "management", "structure", "designed"], "label": 0}
  • "testid": the id of each sentence ("front_testid" and "behind_testid").
  • "features": the segmented words of each sentence (after removing stopwords).
  • "label": 0 or 1. 1 means the two sentences are similar; 0 means they are not.
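
A record in this format can be read and checked with the standard library alone. The sketch below assumes that each sample file stores one JSON object per line, which the sample above suggests but the text does not state; the helper name `parse_records` is hypothetical and is not part of data_helpers.py.

```python
import json

# The sample record shown above, as a single line of text.
SAMPLE = (
    '{"front_testid": "4270954", "behind_testid": "7075962", '
    '"front_features": ["invention", "inorganic", "fiber", "based", '
    '"calcium", "sulfate", "dihydrate", "calcium"], '
    '"behind_features": ["vcsel", "structure", "thermal", "management", '
    '"structure", "designed"], "label": 0}'
)

REQUIRED_KEYS = {
    "front_testid", "behind_testid", "front_features", "behind_features", "label",
}


def parse_records(lines):
    """Parse one JSON object per line and check the expected keys and label values."""
    records = []
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # Skip blank lines.
        record = json.loads(line)
        missing = REQUIRED_KEYS - record.keys()
        if missing:
            raise ValueError(f"line {line_no}: missing keys {sorted(missing)}")
        if record["label"] not in (0, 1):
            raise ValueError(f"line {line_no}: label must be 0 or 1")
        records.append(record)
    return records


records = parse_records([SAMPLE])
print(len(records[0]["front_features"]), len(records[0]["behind_features"]))
```

To read a real file, pass an open file object instead of the list, for example `parse_records(open(path, encoding="utf-8"))`.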

Text Segment

  1. Use the nltk package if you are dealing with English text.

  2. Use the jieba package if you are dealing with Chinese text.
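
Whichever tokenizer you pick, the segmentation step has the same shape: split the text, then drop stopwords. The sketch below keeps the tokenizer pluggable so that it runs without extra packages. The function `segment` and the tiny stopword set are hypothetical; `nltk.word_tokenize` and `jieba.lcut` are the calls you would typically pass in, assuming those packages (and the nltk tokenizer data) are installed.

```python
def segment(text, tokenizer=str.split, stopwords=frozenset()):
    """Split text into lowercase tokens and remove stopwords.

    `tokenizer` is any callable that maps a string to a list of tokens,
    for example nltk.word_tokenize (English) or jieba.lcut (Chinese).
    The default, str.split, only splits on whitespace.
    """
    tokens = [token.lower() for token in tokenizer(text)]
    return [token for token in tokens if token and token not in stopwords]


# Illustrative stopword list; real lists are much longer.
stopwords = {"the", "is", "for"}
features = segment("The structure is designed for thermal management", stopwords=stopwords)
print(features)
```

The resulting list is what goes into the "front_features" or "behind_features" field of a record.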

Data Format

This repository can be used with other datasets (text pair similarity classification) in two ways:

  1. Convert your dataset into the same format as the sample.
  2. Modify the data preprocessing code in data_helpers.py.

Which option is better depends on your data and task.
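
As an illustration of the first option, the sketch below converts a tab-separated file of sentence pairs into records with the same keys as the sample. The input layout (id_a, id_b, text_a, text_b, label) and the function name `convert_pairs` are assumptions made for this example; whitespace splitting stands in for a real tokenizer, and no stopwords are removed.

```python
import csv
import io
import json


def convert_pairs(tsv_file, tokenize=str.split):
    """Yield one JSON line per TSV row with columns id_a, id_b, text_a, text_b, label."""
    reader = csv.reader(tsv_file, delimiter="\t")
    for id_a, id_b, text_a, text_b, label in reader:
        record = {
            "front_testid": id_a,
            "behind_testid": id_b,
            "front_features": tokenize(text_a),
            "behind_features": tokenize(text_b),
            "label": int(label),
        }
        # ensure_ascii=False keeps Chinese tokens readable in the output file.
        yield json.dumps(record, ensure_ascii=False)


# In-memory stand-in for a real TSV file.
raw = io.StringIO("1\t2\tcalcium sulfate fiber\tthermal management structure\t0\n")
converted = list(convert_pairs(raw))
print(converted[0])
```

Writing each yielded string on its own line produces a file with one record per line.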

Pre-trained Word Vectors

You can download the Word2vec model file (dim=100). Make sure the files are unzipped and placed under the /data folder.

You can pre-train your own word vectors (based on your corpus) in many ways:

  • Use the gensim package.
  • Use the GloVe tools.
  • Or even use a fastText network.

🤔 Before you open a new issue, please check the data sample files under the data folder and read the other open issues first, because someone may have asked the same question already.

Usage

See Usage.

Network Structure

FastText

References:


TextANN

References:

  • Personal ideas 🙃

TextCNN

References:


TextRNN

Warning: The model is usable but not finished yet 🤪!

TODO

  1. Add BN-LSTM cell unit.
  2. Add attention.

References:


TextCRNN

References:

  • Personal ideas 🙃

TextRCNN

References:

  • Personal ideas 🙃

TextHAN

References:


TextSANN

Warning: The model is usable but not finished yet 🤪!

TODO

  1. Add attention penalization loss.
  2. Add visualization.

References:


TextABCNN

Warning: Only the ABCNN-1 model is implemented so far 🤪!

TODO

  1. Add ABCNN-3 model.

References:


About Me

黄威,Randolph

SCU SE Bachelor; USTC CS Ph.D.

Email: [email protected]

My Blog: randolph.pro

LinkedIn: randolph's linkedin
