yongyehuang / Zhihu Text Classification

[2017 Zhihu Kanshan Cup, multi-label text classification] Solution of team ye (6th place)

2017 Zhihu Kanshan Cup: Multi-Label Text Classification

Competition summary: 2017 Zhihu Kanshan Cup Summary (Multi-Label Text Classification)

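1. Environment

The environment and dependencies used in my experiments are listed below. The versions are for reference only.

Environment/Library   Version
Ubuntu                14.04.5 LTS
python                2.7.12
jupyter notebook      4.2.3
tensorflow-gpu        1.2.1
numpy                 1.12.1
pandas                0.19.2
matplotlib            2.0.0
word2vec              0.9.1
tqdm                  4.11.2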

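2. File structure

|- zhihu-text-classification
|  |- raw_data                 # raw data provided by the competition
|  |- data                     # preprocessed data
|  |- data_process             # data preprocessing code
|  |- models                   # model code
|  |  |- wd-1-1-cnn-concat
|  |  |  |- network.py         # network definition
|  |  |  |- train.py           # model training
|  |  |  |- predict.py         # validation/test set prediction, produces probability matrices
...
|  |- ckpt                     # saved trained models
|  |- summary                  # TensorBoard data
|  |- scores                   # predicted probability matrices for the test set
|  |- local_scores             # predicted probability matrices for the validation set
|  |- doc                      # documentation and related papers
|  |- notebook-old             # untidied code from the competition
|  |- local_ensemble.ipynb     # model ensembling on the validation set
|  |- ensemble.py              # model ensembling on the test set
|  |- data_helpers.py          # data processing functions
|  |- evaluator.py             # evaluation function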

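3. Data preprocessing

  • Extract all the data provided by the competition into the raw_data/ directory.
  • Run each .py script in order, without any arguments.
    Alternatively, run all of them from the current directory with the following commands:
    dos2unix run_all_data_process.sh # convert the script to Unix format with the dos2unix tool (from Cygwin)
    sh run_all_data_process.sh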

3.1 embed2ndarray.py

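The organizers provide word vectors and character vectors in txt format. This script converts each embedding matrix to an np.ndarray and saves the results as data/word_embedding.npy and data/char_embedding.npy. A pd.Series maps each word (character) to its row number (id) in the embedding matrix and is stored in data/sr_word2id.pkl and data/sr_char2id.pkl.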
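The conversion can be sketched roughly as follows (Python 3). This is not the repository's script: it assumes the txt files follow the common word2vec text layout (an optional "count dim" header line, then one token followed by its space-separated vector per line), and the function name `load_text_embedding` and the sample tokens are made up for illustration.

```python
import io

import numpy as np
import pandas as pd


def load_text_embedding(lines):
    """Parse word2vec-style text lines into (matrix, token -> row id Series).

    Assumed layout: "token v1 v2 ... vd" per line, optionally preceded by
    a "count dim" header line.
    """
    tokens, vectors = [], []
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) <= 2:
            # Skip blank lines and a possible "count dim" header.
            continue
        tokens.append(parts[0])
        vectors.append([float(x) for x in parts[1:]])
    embedding = np.asarray(vectors, dtype=np.float32)
    # Row number of each token in the embedding matrix.
    sr_token2id = pd.Series(range(len(tokens)), index=tokens)
    return embedding, sr_token2id


# Tiny in-memory stand-in for the provided txt file.
sample = io.StringIO("2 3\nw11 0.1 0.2 0.3\nw54 0.4 0.5 0.6\n")
embedding, sr_word2id = load_text_embedding(sample)

# Saving would then look like this (paths taken from the text above):
# np.save("data/word_embedding.npy", embedding)
# sr_word2id.to_pickle("data/sr_word2id.pkl")
```

Whether the real script also reserves rows for padding or unknown tokens is not stated here, so the sketch leaves that out.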

3.2 question_and_topic_2id.py

Converts questions and topics to id form and saves the mappings in data/sr_question2id.pkl and data/sr_id2question.pkl.
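A pair of forward and reverse mappings of this kind might look like the sketch below. The question identifiers are made up, and how the repository actually orders or numbers them is an assumption.

```python
import pandas as pd

# Made-up raw question identifiers, for illustration only.
questions = ["q_a", "q_b", "q_c"]

# Forward map: raw question identifier -> integer id.
sr_question2id = pd.Series(range(len(questions)), index=questions)
# Reverse map: integer id -> raw question identifier.
sr_id2question = pd.Series(questions, index=range(len(questions)))

# Saving would then look like this (paths taken from the text above):
# sr_question2id.to_pickle("data/sr_question2id.pkl")
# sr_id2question.to_pickle("data/sr_id2question.pkl")
```

Keeping both directions makes it cheap to turn predicted ids back into the original identifiers when writing a submission.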

3.3 char2id.py

Using the sr_char2id obtained above, converts the characters of every question to their ids and stores the results as
data/ch_train_title.npy
data/ch_train_content.npy
data/ch_eval_title.npy
data/ch_eval_content.npy
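A minimal sketch of the lookup step is shown below. It assumes each title or content field is a comma-separated string of character tokens, and it sends tokens missing from the vocabulary to a hypothetical `UNKNOWN_ID`. Neither detail is confirmed by this README.

```python
import pandas as pd

# Toy vocabulary standing in for data/sr_char2id.pkl.
sr_char2id = pd.Series([1, 2, 3], index=["c1", "c2", "c3"])

UNKNOWN_ID = 0  # hypothetical id for out-of-vocabulary tokens


def tokens_to_ids(tokens, sr_token2id, unknown_id=UNKNOWN_ID):
    """Map a list of tokens to integer ids, falling back to unknown_id."""
    return [int(sr_token2id.get(tok, unknown_id)) for tok in tokens]


# Assumed raw format: tokens separated by commas.
title = "c1,c3,c9".split(",")
title_ids = tokens_to_ids(title, sr_char2id)
```

The same function would work for word2id.py by passing the word-level Series instead.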

3.4 word2id.py

Same as char2id.py, applied to words instead of characters.

3.5 creat_batch_data.py

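Packs all the data into batches of batch_size (128). With a fixed seed, 100,000 samples are drawn at random as the validation set. Each batch is stored as one npz file containing two parts, X and y. All sequences are truncated to a fixed length, and shorter ones are padded with 0 up to that length.
Save locations: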
wd_train_path = '../data/wd-data/data_train/'
wd_valid_path = '../data/wd-data/data_valid/'
wd_test_path = '../data/wd-data/data_test/'
ch_train_path = '../data/ch-data/data_train/'
ch_valid_path = '../data/ch-data/data_valid/'
ch_test_path = '../data/ch-data/data_test/'
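The batching step described above can be sketched as follows (Python 3). This is an illustration on toy data, not the repository's script: the helper names, the seed value, the file naming scheme and the multi-hot `y` matrix are all assumptions, and the sizes are shrunk so the example runs quickly.

```python
import os
import tempfile

import numpy as np


def pad_or_truncate(ids, length, pad_id=0):
    """Cut a sequence to `length`, or right-pad it with pad_id."""
    ids = list(ids)[:length]
    return ids + [pad_id] * (length - len(ids))


def save_batches(X, y, out_dir, batch_size=128):
    """Write consecutive batches as 0.npz, 1.npz, ... holding X and y."""
    os.makedirs(out_dir, exist_ok=True)
    n_batches = 0
    for start in range(0, len(X), batch_size):
        path = os.path.join(out_dir, "%d.npz" % n_batches)
        np.savez(path, X=X[start:start + batch_size],
                 y=y[start:start + batch_size])
        n_batches += 1
    return n_batches


# Toy sizes; the README uses batch_size=128 and 100,000 validation samples.
n_samples, title_len, n_valid, batch_size = 10, 30, 4, 3
rng = np.random.RandomState(13)  # fixed seed (value is arbitrary here)

X = np.asarray([pad_or_truncate(range(1, rng.randint(1, 50)), title_len)
                for _ in range(n_samples)])
y = rng.randint(0, 2, size=(n_samples, 5))  # toy multi-hot labels

# Random but reproducible train/validation split.
perm = rng.permutation(n_samples)
valid_idx, train_idx = perm[:n_valid], perm[n_valid:]

out_dir = tempfile.mkdtemp()
n_train_batches = save_batches(X[train_idx], y[train_idx],
                               os.path.join(out_dir, "data_train"), batch_size)
n_valid_batches = save_batches(X[valid_idx], y[valid_idx],
                               os.path.join(out_dir, "data_valid"), batch_size)

first = np.load(os.path.join(out_dir, "data_train", "0.npz"))
```

Storing one file per batch lets training read batches straight from disk without holding the whole dataset in memory.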

3.6 creat_batch_seg.py

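Same as creat_batch_data.py, except that the content part is split into sentences. Used for the hierarchical models. Lengths with sentence splitting:
wd_title_len = 30, wd_sent_len = 30, wd_doc_len = 10 (i.e. content is split into 10 sentences of 30 words each).
ch_title_len = 52, ch_sent_len = 52, ch_doc_len = 10.
Lengths without sentence splitting:
wd_title_len = 30, wd_content_len = 150.
ch_title_len = 52, ch_content_len = 300.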
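One simple way to obtain a doc_len x sent_len array is sketched below. It assumes the content is cut into consecutive fixed-length chunks, which may differ from how the repository finds sentence boundaries. The function name `split_content` is made up.

```python
import numpy as np


def split_content(content_ids, sent_len=30, doc_len=10, pad_id=0):
    """Reshape a flat id sequence into doc_len "sentences" of sent_len ids.

    The sequence is truncated to doc_len * sent_len ids and right-padded
    with pad_id if it is shorter.
    """
    max_len = sent_len * doc_len
    ids = list(content_ids)[:max_len]
    ids = ids + [pad_id] * (max_len - len(ids))
    return np.asarray(ids).reshape(doc_len, sent_len)


# A content of 75 word ids fills two full rows and half of the third.
doc = split_content(range(1, 76))
```

With the word-level settings above, a hierarchical model would receive a (10, 30) array per content instead of a flat vector of length 150.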

4. Model training

Change to the directory of the model, then run training and prediction. For example:

cd zhihu-text-classification/models/wd-1-1-cnn-concat/
# train
python train.py [--max_epoch 1 --max_max_epoch 6 --lr 1e-3 --decay_rate 0.65 --decay_step 15000 --last_f1 0.4]
# predict
python predict.py

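Only some of the models have been tidied up here, and all of them use word vectors. To use character vectors, just change the input and the sequence lengths in the model.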
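The lr, decay_rate and decay_step values in the training command above look like the parameters of an exponential learning-rate schedule. This README does not say how train.py uses them, so the sketch below is only an assumption about their meaning, written in plain Python 3 rather than TensorFlow.

```python
def decayed_lr(step, lr=1e-3, decay_rate=0.65, decay_step=15000,
               staircase=True):
    """Exponential decay: lr * decay_rate ** (step / decay_step).

    With staircase=True the exponent is an integer, so the rate drops
    in discrete jumps every decay_step steps.
    """
    exponent = step // decay_step if staircase else step / decay_step
    return lr * decay_rate ** exponent


lr_start = decayed_lr(0)
lr_after_one_period = decayed_lr(15000)
```

Under this reading, the learning rate would be multiplied by 0.65 after every 15000 training steps.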

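5. Model ensembling

Linear weighted ensembling, with the weights found by a search strategy that imitates gradient descent. See local_ensemble.ipynb. Notes:

  • This method may overfit the validation set, so the result needs to be checked further on the test set. It works better when the number of models is fairly large.
  • The weights need to be initialized manually according to the performance of each single model. char and word models cannot be compared directly: char single models perform worse on their own, but they improve the ensemble markedly.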
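
A weight search in this spirit can be sketched as below. It is not the code from local_ensemble.ipynb: the update rule (nudge one weight at a time, keep the change only if the validation score improves, shrink the step when nothing helps) is one plausible reading of "imitating gradient descent", and `top1_accuracy` is a stand-in for the competition metric in evaluator.py. All names and data are made up.

```python
import numpy as np


def top1_accuracy(scores, labels):
    """Stand-in metric: share of samples whose top class is the label."""
    return float(np.mean(scores.argmax(axis=1) == labels))


def search_weights(score_list, labels, init_weights, metric=top1_accuracy,
                   step=0.1, min_step=0.01, decay=0.5):
    """Greedy coordinate search over linear ensemble weights."""
    stacked = np.stack(score_list)  # (n_models, n_samples, n_classes)
    weights = np.asarray(init_weights, dtype=float)

    def blend(w):
        # Weighted sum of the per-model probability matrices.
        return np.tensordot(w, stacked, axes=1)

    best = metric(blend(weights), labels)
    while step >= min_step:
        improved = False
        for i in range(len(weights)):
            for delta in (step, -step):
                trial = weights.copy()
                trial[i] += delta
                score = metric(blend(trial), labels)
                if score > best:  # keep only strict improvements
                    weights, best, improved = trial, score, True
                    break
        if not improved:
            step *= decay  # no direction helped: take smaller steps
    return weights, best


# Toy validation data: two noisy "models" for a 5-class problem.
rng = np.random.RandomState(0)
labels = rng.randint(0, 5, size=200)
onehot = np.eye(5)[labels]
model_a = onehot + rng.normal(scale=1.0, size=onehot.shape)
model_b = onehot + rng.normal(scale=1.5, size=onehot.shape)

# Manual initialization: the stronger model starts with the larger weight.
weights, best = search_weights([model_a, model_b], labels, [1.0, 0.5])
```

Because the weights are tuned directly on the validation scores, the overfitting caveat from the notes above applies to this sketch as well.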