cjymz886 / Text Cnn

Licence: mit
Chinese text classification with a CNN that embeds Word2vec word vectors

Programming Languages

python


Text classification with CNN and Word2vec

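This project follows gaussic's "text-classification-cnn-rnn" and reports a CNN text classification experiment on the same dataset, using word-level embeddings; gaussic's original is character-level.

A second version has been released with four changes: 1. convolution kernels of different sizes were added; 2. regularization was added; 3. only Chinese or English words are kept, and tokens such as digits and symbols are removed from the text; 4. words of length 1 are removed.

Results improved over the first version: validation accuracy rose from 96.5% to 97.1%, and test accuracy rose from 96.7% to 97.2%.

The main purpose of this experiment is to examine how embedding Word2vec-trained word vectors in a CNN affects the model. The resulting model reaches 97.1% on the validation set, compared with 94.12% for gaussic's.

For more details, see gaussic's blog: text-classification-cnn-rnn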

1 Environment

python3
tensorflow 1.3 or later (CPU environment)
gensim
jieba
scipy
numpy
scikit-learn
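
A quick way to confirm that the packages above are installed is to probe for them before running any script. The sketch below uses only the standard library; the helper name `missing_packages` is hypothetical and not part of the repository. Note that scikit-learn is imported under the name `sklearn`.

```python
import importlib.util

# Import names for the packages listed above (scikit-learn imports as "sklearn").
REQUIRED = ["tensorflow", "gensim", "jieba", "scipy", "numpy", "sklearn"]


def missing_packages(names=REQUIRED):
    """Return the subset of `names` that cannot be found by this interpreter."""
    return [name for name in names if importlib.util.find_spec(name) is None]


print("missing:", missing_packages() or "none")
```

This only checks that each package can be located; it does not check versions, so the "tensorflow 1.3 or later" requirement still has to be verified separately.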

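2 CNN (convolutional neural network)

The CNN configuration parameters are defined in text_model.py:

[image: CNN configuration parameters]

The overall structure of the CNN model is:

[image: CNN model structure]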
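
Since the structure is only shown as an image, here is a minimal pure-Python sketch of the general idea behind a text CNN with several kernel sizes, in the style of reference 1: convolve over the word sequence, apply ReLU, then take the maximum over time for each kernel. The kernel sizes, weights, and function names are all hypothetical; the real layers and parameters are the ones in text_model.py.

```python
def conv1d_valid(seq, kernel):
    """Slide `kernel` (k x dim weights) over `seq` (length x dim) without padding.

    Assumes len(seq) >= len(kernel).
    """
    k = len(kernel)
    out = []
    for start in range(len(seq) - k + 1):
        window = seq[start:start + k]
        # Dot product between the window and the kernel, followed by ReLU.
        total = sum(w * x for w_row, x_row in zip(kernel, window)
                    for w, x in zip(w_row, x_row))
        out.append(max(0.0, total))
    return out


def text_cnn_features(seq, kernels):
    """Max-over-time pooling of each kernel's feature map, concatenated."""
    return [max(conv1d_valid(seq, kernel)) for kernel in kernels]


# Toy input: 5 "words", each a 2-dimensional embedding (made-up numbers).
sentence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0], [1.0, 0.0]]
# One all-ones kernel for each hypothetical kernel size 2, 3 and 4.
kernels = [[[1.0, 1.0]] * k for k in (2, 3, 4)]
features = text_cnn_features(sentence, kernels)
print(features)  # [3.0, 4.0, 4.0]
```

Each kernel contributes one pooled value regardless of sentence length, which is why kernels of different sizes can be concatenated into a single fixed-length feature vector before the classification layer.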

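3 Dataset

Like gaussic's experiment, this one trains and tests on a subset of THUCNews. Please download the dataset yourself from THUCTC (an efficient Chinese text classification toolkit), and follow the data provider's open-source license.

The texts cover 10 categories: categories = ['体育', '财经', '房产', '家居', '教育', '科技', '时尚', '时政', '游戏', '娱乐'] (sports, finance, real estate, home, education, technology, fashion, politics, games, entertainment), with 6,500 samples per category:

cnews.train.txt: training set (5000*10)

cnews.val.txt: validation set (500*10)

cnews.test.txt: test set (1000*10)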
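
To sanity-check a downloaded split, one can count the samples per category. The sketch below assumes each line has the form `label<TAB>text`; that layout is an assumption and is not stated in this README, so adjust the parsing if the files differ. The function name `count_labels` is hypothetical.

```python
from collections import Counter
import io

CATEGORIES = ['体育', '财经', '房产', '家居', '教育', '科技', '时尚', '时政', '游戏', '娱乐']


def count_labels(lines):
    """Count samples per category, assuming each line is 'label<TAB>text'."""
    counts = Counter()
    for line in lines:
        label, sep, _text = line.rstrip("\n").partition("\t")
        # Skip lines without a tab or with an unknown label.
        if sep and label in CATEGORIES:
            counts[label] += 1
    return counts


# Tiny in-memory stand-in for cnews.train.txt (made-up lines).
demo = io.StringIO("体育\t比赛 结束\n财经\t股市 上涨\n体育\t球队 获胜\n")
print(count_labels(demo))

# The split sizes stated above add up to the per-category total.
assert 5000 + 500 + 1000 == 6500
```

With the real data, pass an open file instead of the `StringIO` object, for example `open("cnews.train.txt", encoding="utf-8")` (the encoding is also an assumption); each category should then show 5000 samples.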

The training data and the pre-trained word vectors can be downloaded from: https://pan.baidu.com/s/1DOgxlY42roBpOKAMKPPKWA, password: up9d
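
The README does not describe the format of the word-vector file. If it follows the common plain-text word2vec layout (a header line with the vocabulary size and dimension, then one `word v1 v2 ...` row per word), it can be read without any third-party library. The sketch below works under that assumption; `load_text_vectors` is a hypothetical helper, not a function from the repository.

```python
import io


def load_text_vectors(handle):
    """Parse word2vec text format: a 'count dim' header, then 'word v1 ... v_dim' rows.

    Returns (vectors, dim), where vectors maps each word to a list of floats.
    """
    _count, dim = (int(x) for x in handle.readline().split())
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip blank or malformed rows
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors, dim


# In-memory stand-in for a vector file (made-up 3-dimensional vectors).
demo = io.StringIO("2 3\n体育 0.1 0.2 0.3\n财经 0.4 0.5 0.6\n")
vectors, dim = load_text_vectors(demo)
print(dim, sorted(vectors))
```

Words that are missing from the file need a fallback row (for example zeros or small random values) when the embedding matrix is built; which fallback the repository uses is not stated here.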

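4 Preprocessing

Preprocessing mainly consists of word segmentation of the training text, for two reasons: the word vectors are trained on segmented text, and the model takes its input in the form of word vectors.

In addition, only Chinese or English words are kept, and each word must be longer than one character.

All preprocessing code is in loader.py.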
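
The filtering rule described above can be sketched as follows. This is an illustration under assumptions, not the code from loader.py: the regular expression (the common CJK range `\u4e00-\u9fa5` plus ASCII letters) and the name `filter_tokens` are hypothetical, and the input is taken to be tokens that a segmenter such as jieba has already produced.

```python
import re

# A token is kept only if it is entirely Chinese characters or entirely ASCII letters.
WORD_RE = re.compile(r"[\u4e00-\u9fa5]+|[A-Za-z]+")


def filter_tokens(tokens):
    """Keep Chinese-only or English-only tokens that are longer than one character."""
    return [t for t in tokens if len(t) > 1 and WORD_RE.fullmatch(t)]


# Tokens as a segmenter might emit them (hand-written here, not real jieba output).
tokens = ["2018", "年", "世界杯", "决赛", ",", "NBA", "a", "3G", "精彩"]
print(filter_tokens(tokens))  # ['世界杯', '决赛', 'NBA', '精彩']
```

In this version a mixed token such as "3G" is dropped because it contains a digit; whether loader.py treats mixed tokens the same way is not specified in this README.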

5 How to run

python train_word2vec.py: segments the training data and trains word vectors with Word2vec (vector_word.txt)

python text_train.py: trains the model

python text_test.py: tests the model

python text_predict.py: runs predictions with the model
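
The four steps above can also be chained from one small driver script, so that a failure in an early step stops the rest. This is a convenience sketch only; the names `build_commands` and `run_all` are hypothetical, and the script names are the ones listed above.

```python
import subprocess
import sys

# Script order taken from the steps above.
STEPS = ["train_word2vec.py", "text_train.py", "text_test.py", "text_predict.py"]


def build_commands(python=sys.executable, steps=STEPS):
    """Return one argv list per step, using the current interpreter by default."""
    return [[python, step] for step in steps]


def run_all(commands, runner=subprocess.run):
    """Run the commands in order; check=True raises at the first failing step."""
    for argv in commands:
        runner(argv, check=True)


# Print the commands without executing them.
for argv in build_commands():
    print(" ".join(argv))
```

To actually execute the pipeline, call `run_all(build_commands())` from the repository root, with the dataset files already in the location the scripts expect.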

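6 Training results

Run: python text_train.py

Training met the stopping condition and ended after 6 epochs; the best validation accuracy, 97.1%, was reached at global_step=2000.

[image: training log]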
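
The stopping condition itself is not spelled out in this README. A common choice is patience-based early stopping: stop once validation accuracy has gone a fixed number of evaluations without a new best. The sketch below illustrates that generic rule; the patience value, evaluation interval, and accuracy curve are all made up, and text_train.py may use a different criterion.

```python
def best_step_with_early_stopping(val_accuracies, eval_every, patience):
    """Scan validation accuracies; stop after `patience` evaluations without a new best.

    Returns (best_step, best_accuracy, stopped_step).
    """
    best_acc, best_step, since_best = float("-inf"), 0, 0
    step = 0
    for i, acc in enumerate(val_accuracies, start=1):
        step = i * eval_every
        if acc > best_acc:
            best_acc, best_step, since_best = acc, step, 0
        else:
            since_best += 1
            if since_best >= patience:
                break
    return best_step, best_acc, step


# Made-up accuracy curve, evaluated every 500 steps (not the repository's real log).
curve = [0.90, 0.95, 0.96, 0.971, 0.970, 0.969, 0.968]
print(best_step_with_early_stopping(curve, eval_every=500, patience=2))
# (2000, 0.971, 3000)
```

In practice the model checkpoint is saved whenever a new best is seen, so the weights used for testing are those from the best step rather than from the step at which training stopped.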

7 Test results

Run: python text_test.py

On the test set, test_loss=0.1 and test_accuracy=97.23%. The "体育" (sports) category scores 100%, and the overall precision, recall, and F1 are all 97%.

[image: test results]
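
For readers who want to see how per-category precision, recall, and F1 are derived, here is a small pure-Python sketch. The labels are made up, and the function name `per_class_prf` is hypothetical; it is not the code used by text_test.py.

```python
def per_class_prf(y_true, y_pred, labels):
    """Return {label: (precision, recall, f1)} computed from two label lists."""
    scores = {}
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        # F1 is the harmonic mean of precision and recall.
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores[label] = (precision, recall, f1)
    return scores


# Toy labels (made up): every "体育" sample is right, one "财经" sample is mislabeled.
y_true = ["体育", "体育", "财经", "财经", "科技", "科技"]
y_pred = ["体育", "体育", "财经", "科技", "科技", "科技"]
scores = per_class_prf(y_true, y_pred, ["体育", "财经", "科技"])
for label, (p, r, f1) in scores.items():
    print(label, round(p, 3), round(r, 3), round(f1, 3))
```

Because F1 is the harmonic mean of precision and recall, it equals both whenever the two are equal, which is consistent with precision, recall, and F1 all rounding to 97% above. scikit-learn, listed in the environment section, offers `sklearn.metrics.classification_report` for producing this kind of per-category table.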

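8 Prediction results

Run: python text_predict.py

Five samples are picked at random from the test data, and the script prints each original text together with its true label and its predicted label. In the figure below, all 5 samples are predicted correctly.

[image: prediction output for five samples]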

9 References

  1. Convolutional Neural Networks for Sentence Classification
  2. gaussic/text-classification-cnn-rnn
  3. YCG09/tf-text-classification
