zhanzecheng / Sohu_competition

First-place solution to Sohu's 2018 content recognition competition

Labels

jupyter-notebook nlp competition stacking

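Introduction

This is the solution from team LuckyRabbit, champion of the second Sohu content recognition competition. For competition details and a full walkthrough, see the documentation.

Code Pipeline

The code is divided into five parts: data preprocessing, feature extraction, single models, stacking-based model fusion, and tricks.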

Input

The input data consists of news articles in HTML format together with their accompanying images.

<title>惠尔新品 | 冷色系实木多层地板系列</title> <p>  </p> <br/><p>  <span style="font-size: 16px;">冷色系实木多层系列全新上市</span></p>	P0000001.JPEG;P0000002.JPEG;

Preprocessing

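Text data augmentation by back-translation: translate the Chinese text into English, then translate the English back into Chinese. This code is not included; call a translation API of your choice.
Image data augmentation: rotation, translation (shifting), added noise, and oversampling.
jieba is used as the basic word segmentation component.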
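The image augmentations listed above can be sketched with NumPy alone. This is a minimal illustration, not the repository's code: the function names, the restriction to 90-degree rotations, the zero fill for shifted pixels, and the noise level `sigma` are all assumptions made for the example.

```python
import numpy as np


def rotate90(image, k=1):
    """Rotate an H x W (x C) image by k * 90 degrees counter-clockwise."""
    return np.rot90(image, k=k, axes=(0, 1))


def shift(image, dx, dy):
    """Shift an image by (dx, dy) pixels; uncovered pixels are filled with zeros."""
    h, w = image.shape[:2]
    out = np.zeros_like(image)
    # Region of the source image that remains visible after the shift.
    src = image[max(0, -dy):min(h, h - dy), max(0, -dx):min(w, w - dx)]
    top, left = max(0, dy), max(0, dx)
    out[top:top + src.shape[0], left:left + src.shape[1]] = src
    return out


def add_gaussian_noise(image, sigma=10.0, seed=None):
    """Add zero-mean Gaussian noise to a uint8 image and clip back to [0, 255]."""
    rng = np.random.default_rng(seed)
    noisy = image.astype(np.float32) + rng.normal(0.0, sigma, size=image.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)


def augment(image, seed=0):
    """Return three augmented variants of one image (hypothetical helper)."""
    return [
        rotate90(image),
        shift(image, dx=2, dy=1),
        add_gaussian_noise(image, seed=seed),
    ]
```

Applying `augment` only to images of the under-represented class is one simple way to combine augmentation with oversampling.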

Feature Extraction

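300-dimensional word vectors trained with gensim (available on Baidu Cloud drive)
TF-IDF features, reduced in dimensionality with SVD
Character-vector features
Basic features: whether the text contains a mobile phone number, a WeChat ID, and so on
OCR text extraction: text contained in the images is extracted to supplement the information available for text classification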
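The "basic features" item above (does the text contain a phone number or WeChat ID) can be illustrated with two regular expressions. This is a rough sketch under stated assumptions: the patterns, the keyword list (`微信`, `weixin`, `wechat`, `vx`) and the function name `extract_basic_features` are hypothetical and are not taken from the repository.

```python
import re

# Assumption: a mainland-China mobile number is 11 digits, starting with 1
# followed by a digit in 3-9, and not embedded in a longer digit run.
PHONE_RE = re.compile(r"(?<!\d)1[3-9]\d{9}(?!\d)")

# Assumption: a WeChat ID is announced by a keyword and is 6-20 characters,
# starting with a letter, then letters, digits, underscores, or hyphens.
WECHAT_RE = re.compile(
    r"(?:微信号?|weixin|wechat|vx)\s*[:：]?\s*[a-zA-Z][-_a-zA-Z0-9]{5,19}",
    re.IGNORECASE,
)


def extract_basic_features(text):
    """Return binary indicator features for one news article's text."""
    return {
        "has_phone": int(bool(PHONE_RE.search(text))),
        "has_wechat": int(bool(WECHAT_RE.search(text))),
    }
```

Indicators like these can be appended as extra columns next to the TF-IDF and embedding features before training the tree-based models.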

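Single Models

Here we take one classic model as an example: the text extracted by OCR and the news text are each fed into the same embedding layer, and the two outputs are then concatenated for classification. The scores of the various models are as follows:

Model or method	F1-measure score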
catboost	0.611
xgboost	0.621
lightgbm	0.625
dnn	0.621
textCNN	0.617
capsule	0.625
covlstm	0.630
dpcnn	0.626
lstm+gru	0.635
lstm+gru+attention	0.640
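(PS: because the competition's scoring system has been shut down, the scores of some individual models may not be fully accurate.)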
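The shared-embedding idea described in this section (OCR text and news text pass through the same embedding layer and are then concatenated) can be shown as a small NumPy forward pass. This is only a sketch: mean pooling stands in for the recurrent or convolutional encoders used by the actual models, and the function name and parameter shapes are assumptions.

```python
import numpy as np


def shared_embedding_forward(news_ids, ocr_ids, embedding, weights, bias):
    """Forward pass of a two-input classifier that shares one embedding table.

    news_ids, ocr_ids: 1-D integer arrays of token ids
    embedding:         (vocab_size, dim) table used by BOTH inputs
    weights, bias:     (2 * dim, n_classes) and (n_classes,) output layer
    """
    # Both inputs look up vectors in the same table, so they share parameters.
    news_vec = embedding[news_ids].mean(axis=0)
    ocr_vec = embedding[ocr_ids].mean(axis=0)
    # Concatenate the two pooled representations before classification.
    features = np.concatenate([news_vec, ocr_vec])
    logits = features @ weights + bias
    # Numerically stable softmax over the classes.
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()
```

Sharing the table means words seen mostly in OCR text still benefit from vectors learned on the much larger news text.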

Model Fusion

Stacking

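A good introduction to stacking and other model fusion methods is available here. The stacking model structure we used in the competition is shown in the accompanying figure.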
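For readers new to stacking, the core mechanism is the out-of-fold (OOF) prediction: each base model predicts every training row using a copy of itself that never saw that row, and those predictions become the input features of the second-layer model. The sketch below shows that step with NumPy only. The `fit_predict` callback interface and the toy `centroid_scorer` base learner are assumptions for illustration, not the models used in the competition.

```python
import numpy as np


def out_of_fold_predictions(fit_predict, X, y, X_test, n_folds=5, seed=0):
    """Return (oof, test_pred) for one base model.

    fit_predict(X_train, y_train, X_eval) must train a fresh model and
    return a 1-D array of scores for the rows of X_eval.
    """
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), n_folds)
    oof = np.zeros(len(X))
    test_pred = np.zeros(len(X_test))
    for k, val_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != k])
        # Score the held-out fold and the test set with the same fitted model.
        scores = fit_predict(X[train_idx], y[train_idx], np.vstack([X[val_idx], X_test]))
        oof[val_idx] = scores[:len(val_idx)]
        test_pred += scores[len(val_idx):] / n_folds  # average over folds
    return oof, test_pred


def centroid_scorer(X_train, y_train, X_eval):
    """Toy binary base learner: relative closeness to the positive-class centroid."""
    pos = X_train[y_train == 1].mean(axis=0)
    neg = X_train[y_train == 0].mean(axis=0)
    d_pos = np.linalg.norm(X_eval - pos, axis=1)
    d_neg = np.linalg.norm(X_eval - neg, axis=1)
    return d_neg / (d_pos + d_neg + 1e-12)
```

With several base models, the second-layer training matrix is `np.column_stack([oof_1, oof_2, ...])` and the second-layer test matrix is built the same way from the averaged test predictions.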

Snapshot Ensemble

In the second-layer stacking model we also added a deep fusion method; see the linked paper.
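Assuming the linked paper is the Snapshot Ensembles paper (Huang et al., 2017), its central ingredient is a cyclic cosine learning-rate schedule: the rate restarts at its initial value at the start of each cycle and decays toward zero, and a model snapshot is saved at the end of every cycle. The function below is a sketch of that schedule only; the name `snapshot_lr` and its parameters are illustrative, and the repository may implement it differently.

```python
import math


def snapshot_lr(t, total_iters, n_cycles, lr0):
    """Cyclic cosine-annealed learning rate for iteration t (1-indexed).

    lr(t) = lr0 / 2 * (cos(pi * ((t - 1) mod L) / L) + 1),  L = ceil(T / M)
    where T is the total number of iterations and M the number of cycles.
    """
    cycle_len = math.ceil(total_iters / n_cycles)
    position = (t - 1) % cycle_len
    return lr0 / 2.0 * (math.cos(math.pi * position / cycle_len) + 1.0)


def is_snapshot_step(t, total_iters, n_cycles):
    """True on the last iteration of each cycle, when a snapshot would be saved."""
    return t % math.ceil(total_iters / n_cycles) == 0
```

At prediction time the saved snapshots are combined, for example by averaging their predicted probabilities.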

Pseudo Labeling

Another trick we used is pseudo-labeling, which applies to any competition in which the test set is provided; see the linked tutorial.
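The usual form of pseudo-labeling is to predict on the unlabeled test set, keep only the confident predictions, and add those rows with their predicted labels to the training data before retraining. The sketch below follows that recipe; the confidence threshold of 0.95 and both function names are assumptions for the example, not values taken from the competition code.

```python
import numpy as np


def select_pseudo_labels(test_probs, threshold=0.95):
    """Return (row indices, predicted labels) for confident test predictions.

    test_probs: (n_samples, n_classes) array of predicted class probabilities.
    """
    test_probs = np.asarray(test_probs)
    keep = np.flatnonzero(test_probs.max(axis=1) >= threshold)
    return keep, test_probs[keep].argmax(axis=1)


def extend_training_set(X_train, y_train, X_test, test_probs, threshold=0.95):
    """Append confidently predicted test rows to the training data."""
    keep, labels = select_pseudo_labels(test_probs, threshold)
    X_new = np.vstack([X_train, X_test[keep]])
    y_new = np.concatenate([y_train, labels])
    return X_new, y_new
```

A stricter threshold adds fewer rows but keeps label noise lower, so it is worth tuning against a validation set.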

The effect of these methods is as follows:

Model or method	F1-measure score
single model	0.642
stacking	0.647
stacking+trick	0.652

Code Structure

|- SOHU_competition
|　　|- data
|　　|　　|- result　　　　　　　# model output results
|　　|　　|- ···
|　　|- ckpt　　　　　　　　　　　# saved models
|　　|- img　　　　　　　# documentation images
|　　|- src　　　　　　　　　　# model code
|　　|　　|- model　　　　 # models
|　　|　　|　　|- model_basic　　　　# defines model training methods, etc.
|　　|　　|　　|- attention_model　　　　# model definition
|　　|　　|　　|- ···
|　　|　　|- preprocess
|　　|　　|　　|- EDA&Extract.ipynb　　# feature processing and extraction pipeline
|　　|　　|　　|- ···
|　　|　　|- ocr
|　　|　　|- train&predict.ipynb # single-model training and testing
|　　|　　|- stacking.ipynb # model fusion

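Usage:

git clone https://github.com/zhanzecheng/SOHU_competition.git
Download the news text training files and place them under ./data/ (see the download link). The image files are too large to host; please download them from the official competition website.
Download the OCR model files and place them under ./src/ocr/ctpn/checkpoints/ (see the model link).
pip3 install -r requirement.txt
Download the word vectors and place them in the ./data directory.
Run EDA&Extract.ipynb
Run train&predict.ipynb
Run stacking.ipynb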

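Acknowledgements

Thanks to my two handsome teammates, HiYellowC and yupeihua.

The slides from our final defense presentation are also available here; feel free to download them if needed.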
