fandywang / Nuts

Licence: other

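A testbed of solutions for common natural language processing tasks (mainly text classification, sequence labelling, and question answering)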

Nuts

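A testbed of solutions for common Natural Language Processing (NLP) tasks, mainly text classification, sequence labelling, and automatic question answering.

Text Classification

Text classification assigns one or more predefined labels to a piece of text, which may be a word, a sentence, or a whole document. It is a very common application requirement, for example:

Spam filtering: a typical binary classification problem, often used as a textbook example
Web page classification: covers news, e-commerce product pages, WeChat official account articles, and ordinary web pages; the text is usually long and information-rich, but noisy and spread over many categories
Query classification: a core processing module in search engines; queries are usually short, so the lack of information is the main challenge
Sentiment classification: the text is usually colloquial reviews; granularity ranges over word level, aspect level, sentence level, and document level, and the number of classes is usually small (positive/negative/neutral, or joy/anger/sadness/fear/surprise)
Intent recognition/classification: a core processing module in chatbots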

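Datasets

In this project we compare in detail a range of traditional machine learning and deep learning approaches on the datasets below, such as LR, MaxEnt, FastText, TextCNN, TextRNN, and attention models. The aim is to find broadly applicable, best-performing models or techniques and to build up a reusable toolkit.

Taobao e-commerce product classification: a product and category dataset released by Alibaba on the Tianchi platform, covering a Taobao product catalogue of roughly 15 million items with nearly 2,000 categories.
20 Newsgroups news classification: contains 18K+ news posts across 20 categories.
IMDB movie review sentiment classification: movie reviews collected from imdb.com, labelled as positive or negative.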
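
As an illustration of the simplest baseline named above (LR), here is a minimal sketch of binary logistic regression over bag-of-words counts, written in plain Python. It is not the project's implementation: the function names (featurize, train_lr, predict), the hyperparameters, and the four toy reviews are assumptions made up for this example, standing in for a real dataset such as IMDB.

```python
import math
from collections import Counter


def featurize(text):
    """Bag-of-words counts over lower-cased whitespace tokens."""
    return Counter(text.lower().split())


def score(weights, bias, feats):
    """Linear score w.x + b for a sparse feature dict."""
    return bias + sum(weights.get(tok, 0.0) * cnt for tok, cnt in feats.items())


def train_lr(samples, epochs=200, lr=0.1):
    """Train binary logistic regression with plain SGD.

    samples: list of (text, label) pairs, label in {0, 1}.
    epochs and lr are illustrative values, not tuned settings.
    """
    weights, bias = {}, 0.0
    for _ in range(epochs):
        for text, label in samples:
            feats = featurize(text)
            prob = 1.0 / (1.0 + math.exp(-score(weights, bias, feats)))
            grad = prob - label  # gradient of the log loss w.r.t. the score
            bias -= lr * grad
            for tok, cnt in feats.items():
                weights[tok] = weights.get(tok, 0.0) - lr * grad * cnt
    return weights, bias


def predict(model, text):
    """Return 1 (positive) if the linear score is above zero, else 0."""
    weights, bias = model
    return 1 if score(weights, bias, featurize(text)) > 0 else 0


# Hypothetical toy data standing in for a sentiment dataset.
TOY_REVIEWS = [
    ("a wonderful and moving film", 1),
    ("great acting and a great story", 1),
    ("boring plot and terrible acting", 0),
    ("a dull and terrible film", 0),
]

model = train_lr(TOY_REVIEWS)
```

On real data the same structure applies, but the features would typically be TF-IDF weighted n-grams and the optimizer a library implementation with regularization.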

Classification Strategies

Evaluation and Analysis

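Sequence Labelling

Sequence labelling assigns each element of a given one-dimensional linear input sequence a label from a label set. In essence it classifies every element of the sequence according to its context, and it is a special case of structured prediction. Common tasks include:

Chinese word segmentation: splitting a sequence of Chinese characters into a sequence of words.
Part-of-speech tagging: assigning a part-of-speech category to every word in a sentence.
Named entity recognition: locating and identifying entities such as person names, place names, and organization names in the word sequence of a sentence.
Semantic role labelling: a shallow semantic analysis technique that marks certain phrases in a sentence as arguments (semantic roles) of a given predicate, such as agent, patient, time, and location.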
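
To make the reduction concrete, the sketch below shows how Chinese word segmentation can be cast as per-character labelling. It assumes the widely used BMES tag scheme (Begin, Middle, End, Single), which the text above does not name; the function names are also hypothetical.

```python
def words_to_bmes(words):
    """Convert a segmented sentence (list of words) to one tag per character."""
    tags = []
    for word in words:
        if len(word) == 1:
            tags.append("S")  # single-character word
        else:
            # first character B, last character E, everything between M
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return tags


def bmes_to_words(chars, tags):
    """Rebuild words from characters and their BMES tags."""
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):  # a word ends here
            words.append(current)
            current = ""
    if current:  # tolerate a sequence that ends mid-word
        words.append(current)
    return words


sentence = ["我", "喜欢", "自然语言"]
tags = words_to_bmes(sentence)
restored = bmes_to_words("".join(sentence), tags)
```

With this encoding, a segmenter only has to predict one of four tags per character, which is exactly the form of problem the models in this section solve.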

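The most commonly used models for sequence labelling are HMM, MEMM, CRF, and Structured Perceptron/SVM, with CRF and Structured Perceptron being the most mainstream. In recent years, RNN-based deep learning methods, especially LSTM, BiLSTM, LSTM+CRF, and BiLSTM+CRF, have achieved better results and have largely become the standard approach to sequence labelling.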
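
For the HMM case, the best tag sequence is usually found with Viterbi decoding. The following is a minimal sketch under stated assumptions: the two-tag model (n for noun, v for verb), the three-word vocabulary, and all probabilities are invented for illustration and are not taken from any corpus or from this project.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable state sequence under an HMM.

    Works in log space; assumes every probability used is non-zero.
    """
    # scores[t][s]: best log-probability of any path ending in state s at step t
    scores = [{s: math.log(start_p[s]) + math.log(emit_p[s][observations[0]])
               for s in states}]
    backptr = [{}]
    for t in range(1, len(observations)):
        scores.append({})
        backptr.append({})
        for s in states:
            # best previous state for arriving in s at step t
            prev = max(states,
                       key=lambda p: scores[t - 1][p] + math.log(trans_p[p][s]))
            scores[t][s] = (scores[t - 1][prev]
                            + math.log(trans_p[prev][s])
                            + math.log(emit_p[s][observations[t]]))
            backptr[t][s] = prev
    # follow back-pointers from the best final state
    path = [max(states, key=lambda s: scores[-1][s])]
    for t in range(len(observations) - 1, 0, -1):
        path.append(backptr[t][path[-1]])
    path.reverse()
    return path


# Hypothetical toy model: two tags and three words.
STATES = ["n", "v"]
START_P = {"n": 0.7, "v": 0.3}
TRANS_P = {"n": {"n": 0.3, "v": 0.7}, "v": {"n": 0.8, "v": 0.2}}
EMIT_P = {
    "n": {"dogs": 0.4, "chase": 0.1, "cats": 0.5},
    "v": {"dogs": 0.1, "chase": 0.8, "cats": 0.1},
}

best_path = viterbi(["dogs", "chase", "cats"], STATES, START_P, TRANS_P, EMIT_P)
```

The same dynamic program carries over to CRF and BiLSTM+CRF decoding, with learned transition and emission scores in place of the log-probabilities.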

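Datasets

In this project we mainly compare three approaches, CRF, Structured Perceptron, and RNN, on the datasets below.

Chinese word segmentation, bakeoff2005: the international Chinese word segmentation dataset released by the Second International Chinese Word Segmentation Bakeoff. Four institutions provided corpora (Academia Sinica, City University, Peking University, and Microsoft Research); the first two are in Traditional Chinese and the last two in Simplified Chinese. The icwb2-data resource distributed with the bakeoff contains the training and testing sets from these four institutions, together with the gold-standard answers for each test set under each institution's own segmentation standard (icwb2-data/scripts/gold). The icwb2-data/scripts directory also contains score, a Perl script that scores segmentation output automatically.
Part-of-speech tagging, People's Daily: the plain-text People's Daily corpus from the first half of 1998, word-segmented and POS-tagged, arranged strictly by date, page, and article order. Every word carries a POS tag. The tag set has 26 basic word-class tags (noun n, time word t, place word s, locality word f, numeral m, classifier q, distinguishing word b, pronoun r, verb v, adjective a, state word z, adverb d, preposition p, conjunction c, auxiliary u, modal particle y, interjection e, onomatopoeia o, idiom i, fixed expression l, abbreviation j, prefix h, suffix k, morpheme g, non-morpheme character x, punctuation w). For corpus applications it adds proper-noun tags (person name nr, place name ns, organization name nt, other proper noun nz), and some further tags are added on linguistic grounds, for more than 40 tags in total.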
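
Segmentation output is conventionally scored by word-level precision, recall, and F1 against the gold standard. The sketch below shows that computation by comparing character spans; it is an illustrative approximation, not a port of the bakeoff's Perl score script, and the function names and the example sentence are assumptions.

```python
def word_spans(words):
    """Map a segmented sentence to a set of (start, end) character offsets."""
    spans, start = set(), 0
    for word in words:
        spans.add((start, start + len(word)))
        start += len(word)
    return spans


def segmentation_prf(gold_words, pred_words):
    """Word-level precision, recall, and F1 for one sentence.

    A predicted word counts as correct only if both of its boundaries
    match a gold word exactly.
    """
    gold, pred = word_spans(gold_words), word_spans(pred_words)
    correct = len(gold & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical example: the prediction wrongly splits one two-character word.
gold = ["我", "喜欢", "自然", "语言"]
pred = ["我", "喜", "欢", "自然", "语言"]
precision, recall, f1 = segmentation_prf(gold, pred)
```

Here three of the five predicted words match gold words, so precision is 3/5 and recall is 3/4. For a whole test set the counts would be summed over all sentences before dividing.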

Labelling Strategies

Evaluation and Analysis

Question Answering

Machine Translation
