cjymz886 / Text_rnn_attention

Licence: mit
Chinese text classification with RNN+ATTENTION on top of embedded Word2vec word vectors

Text classification with RNN+Attention and Word2vec

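This is a follow-up to my earlier blog post "text-cnn": using the same dataset and word-level embeddings, it reports text classification results for an RNN+ATTENTION model.

The main goal of this experiment is to compare how the CNN model and the RNN+attention model train on the same data. On the validation set, the CNN reaches 96.5% accuracy and RNN+attention reaches 96.8%.

If you are interested, see my other project: text-cnn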

1 Environment

python3
tensorflow 1.3 or later (CPU environment)
gensim
jieba
scipy
numpy
scikit-learn

2 RNN with attention mechanism

The parameters of the RNN+ATTENTION model are configured in text_model.py:

image

The overall structure of the RNN+ATTENTION model is:

image
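
The structure diagram is not reproduced here, so the following is only a minimal sketch of one common way to put attention on top of RNN outputs (additive attention pooling), not necessarily what text_model.py does. The function name attention_pool and the weights w, b, v are hypothetical; in a real model the weights are learned and the hidden states come from the RNN.

```python
import math


def attention_pool(hidden_states, w, b, v):
    """Additive attention pooling over RNN outputs (illustrative sketch).

    hidden_states: list of T vectors, each of length d (one per time step)
    w: k x d matrix, b: length-k bias, v: length-k scoring vector
    Returns (alphas, context): T attention weights and a length-d vector.
    """
    scores = []
    for h in hidden_states:
        # u = tanh(W h + b), then score = v . u
        u = [
            math.tanh(sum(w_ij * h_j for w_ij, h_j in zip(row, h)) + b_i)
            for row, b_i in zip(w, b)
        ]
        scores.append(sum(v_i * u_i for v_i, u_i in zip(v, u)))

    # Softmax over time steps (shifted by the max for numerical stability).
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    alphas = [e / total for e in exps]

    # Context vector: attention-weighted sum of the hidden states.
    dim = len(hidden_states[0])
    context = [
        sum(a * h[j] for a, h in zip(alphas, hidden_states)) for j in range(dim)
    ]
    return alphas, context


# Toy values, hypothetical and chosen only to exercise the function.
states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w = [[1.0, 0.0], [0.0, 1.0]]
b = [0.0, 0.0]
v = [1.0, 1.0]
alphas, context = attention_pool(states, w, b, v)
```

The context vector would then feed a fully connected layer and a softmax over the 10 categories. The attention weights sum to 1, so they can be read as how much each time step contributes to the document representation.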

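3 Dataset

This experiment again uses a subset of THUCNews for training and testing. Please download the dataset yourself from THUCTC (an efficient Chinese text classification toolkit) and follow the data provider's open-source license.

The texts cover 10 categories: categories = ['体育', '财经', '房产', '家居', '教育', '科技', '时尚', '时政', '游戏', '娱乐'] (sports, finance, real estate, home, education, technology, fashion, politics, games, entertainment), with 6,500 samples per category, split as follows:

cnews.train.txt: training set (5000*10)

cnews.val.txt: validation set (500*10)

cnews.test.txt: test set (1000*10)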
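
A quick way to check the per-category counts is sketched below. It assumes each line of the cnews.*.txt files has the form "label<TAB>text"; this layout is an assumption, not something stated above, so adjust the separator if your copy differs. The helper names read_labeled_lines and count_per_class are hypothetical.

```python
import io
from collections import Counter


def read_labeled_lines(handle):
    """Yield (label, text) pairs from lines assumed to be 'label<TAB>text'."""
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue
        label, sep, text = line.partition("\t")
        if not sep:
            continue  # skip lines that do not match the assumed layout
        yield label, text


def count_per_class(handle):
    """Count how many samples each label has."""
    return Counter(label for label, _ in read_labeled_lines(handle))


# In-memory stand-in for a data file, so the sketch runs without the dataset.
sample = io.StringIO("体育\t比赛 结束\n财经\t股市 上涨\n体育\t球队 获胜\n")
counts = count_per_class(sample)
```

With the real data, open the file with encoding="utf-8" and pass the handle to count_per_class; if the layout assumption holds, the training set should show 5000 samples for each of the 10 categories.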

The training data and the trained word vectors can be downloaded here: https://pan.baidu.com/s/1cBZZE6UTsNb5utkg4k6TOQ (password: 5y1a)

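4 Preprocessing

The training text is segmented into words, for two reasons: the word vectors are trained on segmented text, and the model takes its input in the form of word vectors.

Punctuation is also removed, and stop words are filtered out using the ./data/stopwords.txt file.

All preprocessing code is in loader.py.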
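
The steps above can be sketched roughly as follows. This is not the code in loader.py; the function names are hypothetical, and the tokenizer is passed in as a parameter so the sketch runs without jieba installed.

```python
import unicodedata


def is_punctuation(token):
    """True if every character is punctuation, a symbol, or a separator."""
    return all(unicodedata.category(ch)[0] in ("P", "S", "Z") for ch in token)


def preprocess(text, tokenize, stopwords):
    """Segment text, then drop empty tokens, punctuation, and stop words."""
    tokens = tokenize(text)
    return [
        t
        for t in tokens
        if t.strip() and not is_punctuation(t) and t not in stopwords
    ]


# Demo on a pre-segmented sentence, using str.split as a stand-in tokenizer.
# The stop-word set here is made up for illustration.
stopwords = {"我", "看"}
tokens = preprocess("我 喜欢 看 足球 比赛 ， 非常 精彩 ！", str.split, stopwords)
```

For raw Chinese text, jieba.lcut can be passed as the tokenize argument, and the stop-word set can be built from ./data/stopwords.txt (assuming one word per line).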

5 How to run

python train_word2vec.py: segments the training data and trains the word vectors with Word2vec (vector_word.txt)

python text_train.py: trains the model

python text_test.py: tests the model

python text_predict.py: runs predictions with the model

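6 Training results

Run: python text_train.py

Training ended after 2 epochs, once the termination condition was met. The best validation accuracy, 96.8%, was reached at global_step=1500.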

image
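
The termination condition is not spelled out here. A common choice, shown below purely as an assumption, is to stop when validation accuracy has not improved for a fixed number of steps (the patience) and keep the best checkpoint. The function name and the toy history values are hypothetical and are not results from this experiment.

```python
def best_step_with_early_stopping(val_history, patience):
    """Return (best_step, best_acc), stopping when accuracy stalls.

    val_history: iterable of (global_step, val_accuracy) in training order.
    patience: stop once this many steps pass without a new best accuracy.
    """
    best_step, best_acc = None, float("-inf")
    for step, acc in val_history:
        if acc > best_acc:
            best_step, best_acc = step, acc
        elif step - best_step >= patience:
            break  # no improvement within the patience window
    return best_step, best_acc


# Toy history (made-up numbers): the run stops at step 400 and never
# sees the later improvement at step 500.
history = [(100, 0.50), (200, 0.70), (300, 0.65), (400, 0.68), (500, 0.90)]
best = best_step_with_early_stopping(history, patience=200)
```

The toy run also shows the trade-off: a short patience saves time but can stop before a later improvement.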

7 Test results

Run: python text_test.py

On the test set, test_loss=0.14 and test_accuracy=95.8%. The "体育" (sports) category scores 100%, and overall precision=recall=F1=96%.
For comparison, the CNN model's test results are: test_loss=0.13, test_accuracy=96.7%, precision=recall=F1=97%.

image
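
For readers unfamiliar with the metrics above, here is a minimal pure-Python sketch of per-class precision, recall, and F1. It is not the code in text_test.py; the function name and the toy labels are hypothetical.

```python
from collections import Counter


def per_class_prf(y_true, y_pred):
    """Return {label: (precision, recall, f1)} for every label seen."""
    true_pos, pred_count, true_count = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        true_count[t] += 1
        pred_count[p] += 1
        if t == p:
            true_pos[t] += 1

    report = {}
    for label in sorted(set(true_count) | set(pred_count)):
        # Precision: correct predictions of this label / all predictions of it.
        precision = (
            true_pos[label] / pred_count[label] if pred_count[label] else 0.0
        )
        # Recall: correct predictions of this label / all true samples of it.
        recall = (
            true_pos[label] / true_count[label] if true_count[label] else 0.0
        )
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        report[label] = (precision, recall, f1)
    return report


# Toy labels, made up for illustration only.
y_true = ["体育", "体育", "财经", "财经"]
y_pred = ["体育", "体育", "体育", "财经"]
report = per_class_prf(y_true, y_pred)
```

In the toy data, "体育" has perfect recall but lower precision, because one "财经" sample was predicted as "体育". scikit-learn, listed in the environment section, provides sklearn.metrics.classification_report for the same kind of per-class table.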

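8 Prediction results

Run: python text_predict.py

Five samples were picked at random from the test data. The output shows each original text, its true label, and the predicted label; all 5 samples in the figure below are predicted correctly.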

image

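9 Comparison and conclusion

Compared with the CNN model, the RNN+ATTENTION model's validation accuracy during training (96.8%) is slightly better, but it does not perform as well as the CNN on the test set. My guess at one reason: the CNN processes texts of length 600, while RNN+ATTENTION processes texts of length 200 and cannot handle much longer input. The longer the text, the more feature information it contains, so overall I think the CNN model is better suited to classifying long texts.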
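
To make the length argument concrete, the sketch below shows how fixing the input length discards tokens from long documents. It assumes truncation and padding at the end of the sequence with pad id 0; the actual padding convention in loader.py may differ, and the function name is hypothetical.

```python
def pad_or_truncate(token_ids, max_len, pad_id=0):
    """Clip a sequence to max_len, or right-pad it with pad_id."""
    clipped = token_ids[:max_len]
    return clipped + [pad_id] * (max_len - len(clipped))


# A hypothetical 700-token document under the two length settings.
doc = list(range(1, 701))
kept_rnn = sum(1 for t in pad_or_truncate(doc, 200) if t != 0)
kept_cnn = sum(1 for t in pad_or_truncate(doc, 600) if t != 0)
short = pad_or_truncate([5, 8, 3], 5)
```

For this 700-token example, a length of 200 keeps 200 tokens and a length of 600 keeps 600, so the model with the longer input sees three times as much of the document.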

10 References

  1. Convolutional Neural Networks for Sentence Classification
  2. gaussic/text-classification-cnn-rnn
  3. YCG09/tf-text-classification
