
Lapis-Hong / Atec Nlp

License: MIT
ATEC Financial Brain: financial intelligent NLP service

Programming Languages

python

Projects that are alternatives of or similar to Atec Nlp

Meta Learning Bert
Meta learning with BERT as a learner
Stars: ✭ 52 (-29.73%)
Mutual labels:  text-classification
Textblob Ar
Arabic support for textblob
Stars: ✭ 60 (-18.92%)
Mutual labels:  text-classification
Keras bert classification
Bert-classification and bert-dssm implementation with keras.
Stars: ✭ 67 (-9.46%)
Mutual labels:  text-classification
Scdv
Text classification with Sparse Composite Document Vectors.
Stars: ✭ 54 (-27.03%)
Mutual labels:  text-classification
Freediscovery
Web Service for E-Discovery Analytics
Stars: ✭ 59 (-20.27%)
Mutual labels:  text-classification
Nlp News Classification
Train and deploy a News Classifier using language model (ULMFit) - Serverless container
Stars: ✭ 63 (-14.86%)
Mutual labels:  text-classification
Hiagm
Hierarchy-Aware Global Model for Hierarchical Text Classification
Stars: ✭ 49 (-33.78%)
Mutual labels:  text-classification
Sarcasm Detection
Detecting sarcasm on Twitter using both traditional machine learning and deep learning techniques.
Stars: ✭ 73 (-1.35%)
Mutual labels:  text-classification
Mycail
China "Law Research Cup" (CAIL) judicial artificial intelligence challenge
Stars: ✭ 60 (-18.92%)
Mutual labels:  text-classification
Text Analytics With Python
Learn how to process, classify, cluster, summarize, understand syntax, semantics and sentiment of text data with the power of Python! This repository contains code and datasets used in my book, "Text Analytics with Python" published by Apress/Springer.
Stars: ✭ 1,132 (+1429.73%)
Mutual labels:  text-classification
Very Deep Convolutional Networks For Natural Language Processing In Tensorflow
Implementation of the paper "Very Deep Convolutional Networks for Natural Language Processing" (https://arxiv.org/abs/1606.01781) in TensorFlow
Stars: ✭ 54 (-27.03%)
Mutual labels:  text-classification
Applied Text Mining In Python
Repo for Applied Text Mining in Python (coursera) by University of Michigan
Stars: ✭ 59 (-20.27%)
Mutual labels:  text-classification
Deep Atrous Cnn Sentiment
Deep-Atrous-CNN-Text-Network: End-to-end word level model for sentiment analysis and other text classifications
Stars: ✭ 64 (-13.51%)
Mutual labels:  text-classification
Text Classification Keras
📚 Text classification library with Keras
Stars: ✭ 53 (-28.38%)
Mutual labels:  text-classification
Char Cnn Text Classification Tensorflow
Reproduction of the paper "Character-level Convolutional Networks for Text Classification"
Stars: ✭ 72 (-2.7%)
Mutual labels:  text-classification
Textclassification
All kinds of neural text classifiers implemented by Keras
Stars: ✭ 51 (-31.08%)
Mutual labels:  text-classification
Sentiment analysis albert
Sentiment analysis and text classification with ALBERT, TextCNN, BERT, and CNN in TensorFlow
Stars: ✭ 61 (-17.57%)
Mutual labels:  text-classification
Nlp Tutorial
A list of NLP(Natural Language Processing) tutorials
Stars: ✭ 1,188 (+1505.41%)
Mutual labels:  text-classification
Text Classifier
text-classifier is a toolkit for text classification. It was developed to facilitate the designing, comparing, and sharing of text classification models.
Stars: ✭ 72 (-2.7%)
Mutual labels:  text-classification
Lstm Cnn classification
Stars: ✭ 64 (-13.51%)
Mutual labels:  text-classification

ATEC NLP sentence pair similarity competition

https://dc.cloud.alipay.com/index#/topic/intro?id=3

1. Task Description

Question similarity computation: given two user utterances from a customer-service setting, use an algorithm to judge whether they express the same meaning.

Examples:

a. "花呗如何还款" -- "花呗怎么还款": synonymous
b. "花呗如何还款" -- "我怎么还我的花被呢": synonymous
c. "花呗分期后逾期了如何还款" -- "花呗分期后逾期了哪里还款": not synonymous

Example a can be judged synonymous with fairly simple methods. Example b contains a misspelling ("花被" for "花呗"), synonyms, and word-order changes; the two sentences do not look alike at first glance, so judging them correctly is challenging. In example c the two sentences are very similar, differing only in "如何" (how) versus "哪里" (where), yet that one subtle difference makes the meanings inconsistent.

2. Data

All data in this competition comes from real application scenarios of Ant Financial's Financial Brain. The competition has two stages, a preliminary round and a final round:

Preliminary Round

We provide 100,000 labeled sentence pairs (released in batches) as training data, including both synonymous and non-synonymous pairs, available for download. Each line of the dataset is one sample, in the following format:

line_no\tsentence1\tsentence2\tlabel, for example: 1	花呗如何还款	花呗怎么还款	1

The line number gives the position of the question pair in the training set; sentence 1 and sentence 2 are the two sentences of the pair; the label marks the pair as synonymous (1) or non-synonymous (0). The evaluation set contains 10,000 pairs in total. To keep the competition fair and prevent leaderboard gaming, this set is not released; contestants submit evaluation code and models to obtain predictions and rankings. Its format is:

line_no\tsentence1\tsentence2
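As a minimal sketch (the function name is ours, not part of the competition toolkit), the tab-separated training and evaluation formats described above can be read like this:

```python
# Parse the tab-separated competition files described above.
# Training lines:   line_no \t sentence1 \t sentence2 \t label
# Evaluation lines: line_no \t sentence1 \t sentence2

def parse_line(line, has_label=True):
    """Split one TSV line into its fields; the label becomes an int."""
    fields = line.rstrip("\n").split("\t")
    if has_label:
        line_no, s1, s2, label = fields
        return line_no, s1, s2, int(label)
    line_no, s1, s2 = fields
    return line_no, s1, s2

pair = parse_line(u"1\t花呗如何还款\t花呗怎么还款\t1")
# pair == ("1", "花呗如何还款", "花呗怎么还款", 1)
```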

In the preliminary round, the evaluation set resides at a fixed path in the evaluation system, and the official platform invokes the evaluation tool submitted by each contestant.

Final Round

The training set grows to a massive scale in this round. Its data is not downloadable; it is provided as database tables on Ant Financial's Shuchao platform. As in the preliminary round, the dataset has four fields: line number, sentence 1, sentence 2, and label.

The evaluation set remains 10,000 pairs, likewise provided as a table on the Shuchao platform, with three fields: line number, sentence 1, and sentence 2.

3. Evaluation and Metrics

In the preliminary round, contestants train and tune their models locally, then package the evaluation code and model and submit them to the official evaluation system, which runs the predictions and updates the ranking. The evaluation system is a standard Linux environment with 8 GB of RAM, 4 CPU cores, and no network access. Installed software includes Python 2.7, Java 8, TensorFlow 1.5, jieba 0.39, PyTorch 0.4.0, Keras 2.1.6, gensim 3.4.0, pandas 0.22.0, scikit-learn 0.19.1, XGBoost 0.71, and LightGBM 2.1.1. After the submitted archive is unpacked, its top-level directory must contain a script named run.sh that takes the evaluation file as input and writes the predictions as output (each prediction is 0 or 1), one line per pair in the format "line_no\tprediction". The command times out after 30 minutes and is executed as:

bash run.sh INPUT_PATH OUTPUT_PATH

If the prediction file is empty or has the wrong number of lines, the score is 0.
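A skeleton of the submission interface above might look like the following, where run.sh simply forwards its two arguments to a Python script. The file name predict.py and the constant "identical sentences" model are placeholders for illustration, not the author's method:

```python
# predict.py -- invoked by run.sh as: python predict.py INPUT_PATH OUTPUT_PATH
# Reads "line_no \t sentence1 \t sentence2" lines and writes
# "line_no \t prediction" lines, where each prediction is 0 or 1.
import io
import sys

def predict(sentence1, sentence2):
    """Placeholder model: predict 1 only for identical sentences."""
    return 1 if sentence1 == sentence2 else 0

def run(input_path, output_path):
    with io.open(input_path, encoding="utf-8") as fin, \
         io.open(output_path, "w", encoding="utf-8") as fout:
        for line in fin:
            line_no, s1, s2 = line.rstrip("\n").split("\t")
            fout.write(u"%s\t%d\n" % (line_no, predict(s1, s2)))

if __name__ == "__main__" and len(sys.argv) == 3:
    run(sys.argv[1], sys.argv[2])
```

A matching run.sh would contain a single line such as `python predict.py "$1" "$2"`.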

In the final round, model training, tuning, and prediction are all done on Ant Financial's machine learning platform, so contestants only need to provide a UDF that takes the two sentences of a pair as input and outputs the similarity prediction (0 or 1). As before, an empty output terminates the evaluation with a score of 0.

Submissions are scored by F1-score; ties are broken by accuracy. Predictions are compared against the true labels using the following definitions:

True Positive (TP): a synonymous judgment that is correct; the TP count is the number of correct synonymous judgments;

likewise, the False Positive (FP) count is the number of incorrect synonymous judgments;

the True Negative (TN) count is the number of correct non-synonymous judgments;

and the False Negative (FN) count is the number of incorrect non-synonymous judgments.

From these we can compute precision, recall, accuracy, and F1-score:

precision = TP / (TP + FP)

recall = TP / (TP + FN)

accuracy = (TP + TN) / (TP + FP + TN + FN)

F1-score = 2 * precision * recall / (precision + recall)
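The four formulas above translate directly into code. A quick sanity-check implementation (ours, not the official scorer), guarding the zero-denominator cases:

```python
# Compute precision, recall, accuracy, and F1 from 0/1 labels and predictions.
def scores(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / float(tp + fp) if tp + fp else 0.0
    recall = tp / float(tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / float(tp + fp + tn + fn)
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, accuracy, f1
```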
