FudanNLP / Nlpcc Wordseg Weibo

NLPCC 2016 Weibo Word Segmentation Shared Task

Programming Languages

python

Projects that are alternatives to or similar to Nlpcc Wordseg Weibo

Jcseg
Jcseg is a lightweight NLP framework developed in Java. It provides CJK and English segmentation based on the MMSEG algorithm, along with keyword extraction, key sentence extraction, and summary extraction based on the TEXTRANK algorithm. Jcseg has a built-in HTTP server and search modules for the latest Lucene, Solr, and Elasticsearch.
Stars: ✭ 754 (+528.33%)
Mutual labels:  natural-language-processing, chinese-word-segmentation
Pyhanlp
Chinese word segmentation, part-of-speech tagging, named entity recognition, dependency parsing, new word discovery, keyword and phrase extraction, automatic summarization, text classification and clustering, pinyin and simplified/traditional conversion, natural language processing
Stars: ✭ 2,564 (+2036.67%)
Mutual labels:  natural-language-processing, chinese-word-segmentation
Deeplearning nlp
A natural language processing library based on deep learning
Stars: ✭ 154 (+28.33%)
Mutual labels:  natural-language-processing, chinese-word-segmentation
Deepnlp
A natural language processing library based on deep learning
Stars: ✭ 34 (-71.67%)
Mutual labels:  natural-language-processing, chinese-word-segmentation
Dat8
General Assembly's 2015 Data Science course in Washington, DC
Stars: ✭ 1,516 (+1163.33%)
Mutual labels:  natural-language-processing
Commonsense Rc
Code for Yuanfudao at SemEval-2018 Task 11: Three-way Attention and Relational Knowledge for Commonsense Machine Comprehension
Stars: ✭ 112 (-6.67%)
Mutual labels:  natural-language-processing
Nlp Papers
Papers and Book to look at when starting NLP 📚
Stars: ✭ 111 (-7.5%)
Mutual labels:  natural-language-processing
Awesome Emotion Recognition In Conversations
A comprehensive reading list for Emotion Recognition in Conversations
Stars: ✭ 111 (-7.5%)
Mutual labels:  natural-language-processing
Discobert
Code for paper "Discourse-Aware Neural Extractive Text Summarization" (ACL20)
Stars: ✭ 120 (+0%)
Mutual labels:  natural-language-processing
Nonautoreggenprogress
Tracking the progress in non-autoregressive generation (translation, transcription, etc.)
Stars: ✭ 118 (-1.67%)
Mutual labels:  natural-language-processing
Unified Summarization
Official codes for the paper: A Unified Model for Extractive and Abstractive Summarization using Inconsistency Loss.
Stars: ✭ 114 (-5%)
Mutual labels:  natural-language-processing
Deep Nlp Seminars
Materials for deep NLP course
Stars: ✭ 113 (-5.83%)
Mutual labels:  natural-language-processing
Stanford Tensorflow Tutorials
This repository contains code examples for Stanford's course: TensorFlow for Deep Learning Research.
Stars: ✭ 10,098 (+8315%)
Mutual labels:  natural-language-processing
Opus Mt
Open neural machine translation models and web services
Stars: ✭ 111 (-7.5%)
Mutual labels:  natural-language-processing
Pytextrank
Python implementation of TextRank for phrase extraction and summarization of text documents
Stars: ✭ 1,675 (+1295.83%)
Mutual labels:  natural-language-processing
Danlp
DaNLP is a repository for Natural Language Processing resources for the Danish Language.
Stars: ✭ 111 (-7.5%)
Mutual labels:  natural-language-processing
Rbert
Implementation of BERT in R
Stars: ✭ 114 (-5%)
Mutual labels:  natural-language-processing
Dynamic Coattention Network Plus
Dynamic Coattention Network Plus (DCN+) TensorFlow implementation. Question answering using Deep NLP.
Stars: ✭ 117 (-2.5%)
Mutual labels:  natural-language-processing
Tensorflow Nlp
NLP and Text Generation Experiments in TensorFlow 2.x / 1.x
Stars: ✭ 1,487 (+1139.17%)
Mutual labels:  natural-language-processing
Declutr
The corresponding code from our paper "DeCLUTR: Deep Contrastive Learning for Unsupervised Textual Representations". Do not hesitate to open an issue if you run into any trouble!
Stars: ✭ 111 (-7.5%)
Mutual labels:  natural-language-processing

NLPCC2016-WordSeg-Weibo

NLPCC 2016 Weibo Word Segmentation Shared Task

## Description of the Task

The word is the fundamental unit in natural language understanding. However, Chinese sentences consist of continuous Chinese characters without natural delimiters. Chinese word segmentation, which identifies the sequence of words in a sentence and marks the boundaries between words, is therefore the first step in Chinese natural language processing.
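
As a concrete illustration, segmented text is often represented either as space-separated words or as per-character boundary tags. The sketch below converts a segmented sentence into the common BMES (begin/middle/end/single) tags; both the space-separated format and the tag scheme are widely used conventions assumed here, not formats specified by this task description.

```python
# A minimal sketch of segmentation as boundary marking, assuming the common
# space-separated gold format and the BMES tag scheme (neither is specified
# by this README).

def words_to_bmes(words):
    """Map every character of every word to a B/M/E/S boundary tag."""
    tags = []
    for word in words:
        if len(word) == 1:
            tags.append((word, "S"))                   # single-character word
        else:
            tags.append((word[0], "B"))                # word-initial character
            tags.extend((c, "M") for c in word[1:-1])  # word-internal characters
            tags.append((word[-1], "E"))               # word-final character
    return tags

if __name__ == "__main__":
    gold = "我 爱 自然 语言 处理"  # hypothetical segmented micro-blog text
    print(words_to_bmes(gold.split()))
```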

Unlike the commonly used news datasets, we use more informal texts from Sina Weibo. The training and test data consist of micro-blogs on various topics, such as finance, sports, and entertainment.

Each participant will be allowed to submit three runs: a closed track run, a semi-open track run, and an open track run.

  1. In the closed track, participants may use only the information found in the provided training data. External resources such as externally obtained word counts, part-of-speech information, or name lists are excluded.
  2. In the semi-open track, participants may additionally use information extracted from the provided background data. External resources such as externally obtained word counts, part-of-speech information, or name lists are still excluded.
  3. In the open track, participants may use any information that is public and easily obtained, but results may not be produced through manual labeling or crowdsourcing.

## Data

The data are collected from Sina Weibo. Both the training and test files are UTF-8 encoded. Besides the training data, we also provide background data, from which the training and test data are drawn. The background data is provided so that more sophisticated features can be derived in an unsupervised way.
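
For example, one simple unsupervised statistic that could be drawn from the unlabeled background data is the pointwise mutual information (PMI) between adjacent characters, where low PMI is a weak hint of a word boundary. The sketch below only illustrates this idea; the particular feature and the file name are assumptions, not part of the task definition.

```python
# Illustrative only: character-bigram PMI estimated from unlabeled text.
# Low PMI between adjacent characters is a weak hint of a word boundary.
import math
from collections import Counter

def char_pmi(lines):
    """Return PMI for every adjacent character pair seen in `lines`."""
    uni, bi = Counter(), Counter()
    for line in lines:
        chars = [c for c in line.strip() if not c.isspace()]
        uni.update(chars)
        bi.update(zip(chars, chars[1:]))
    n_uni, n_bi = sum(uni.values()), sum(bi.values())
    return {
        (a, b): math.log((cnt / n_bi) / ((uni[a] / n_uni) * (uni[b] / n_uni)))
        for (a, b), cnt in bi.items()
    }

# Usage with a hypothetical background file:
# with open("background.txt", encoding="utf-8") as f:
#     pmi = char_pmi(f)
```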

## Download

The dataset provides a standard training/dev/test split. Researchers interested in the dataset should download and fill out this Agreement Form and send the scanned version back to Xipeng Qiu ([email protected]; email subject: Fudan Micro-blog Dataset data request).

This dataset provides a standard training/dev/test split. If you use this dataset in a paper, please send us a signed usage agreement: sign the form, scan it, and send the scanned copy to us (email: [email protected]; subject: Fudan Micro-blog Dataset request).

## Evaluation Metric

Unlike the standard precision, recall, and F1-score, we will provide a new evaluation metric this year. Detailed information can be found at http://aclweb.org/anthology/P/P16/P16-1206.pdf.
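
For reference, the standard metrics mentioned above are computed over word spans, as sketched below. This is only the conventional baseline; the new psychometric-inspired metric itself is defined in the linked paper and is not reproduced here.

```python
# Standard segmentation precision/recall/F1 over (start, end) word spans.
# This is only the conventional baseline; the new metric is defined in the
# ACL 2016 paper linked above.

def spans(words):
    """Turn a word list into a set of (start, end) character offsets."""
    out, pos = set(), 0
    for w in words:
        out.add((pos, pos + len(w)))
        pos += len(w)
    return out

def prf(gold_words, pred_words):
    gold, pred = spans(gold_words), spans(pred_words)
    correct = len(gold & pred)
    p = correct / len(pred) if pred else 0.0
    r = correct / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

# Example: one over-merged word lowers both precision and recall.
print(prf("我 爱 自然 语言 处理".split(), "我 爱 自然语言 处理".split()))
```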

## Papers

  1. Peng Qian, Xipeng Qiu, Xuanjing Huang. A New Psychometric-inspired Evaluation Metric for Chinese Word Segmentation. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), 2016. [PDF]
  2. Xipeng Qiu, Peng Qian, Zhan Shi. Overview of the NLPCC-ICCPOL 2016 Shared Task: Chinese Word Segmentation for Micro-blog Texts. In Proceedings of the Fifth Conference on Natural Language Processing and Chinese Computing & the Twenty-Fourth International Conference on Computer Processing of Oriental Languages, 2016.

## Citation

If you use this dataset in a paper, please cite the following reference.

```bibtex
@InProceedings{qiu2016overview,
  Title     = {Overview of the {NLPCC-ICCPOL} 2016 Shared Task: Chinese Word Segmentation for Micro-blog Texts},
  Author    = {Xipeng Qiu and Peng Qian and Zhan Shi},
  Booktitle = {Proceedings of The Fifth Conference on Natural Language Processing and Chinese Computing \& The Twenty Fourth International Conference on Computer Processing of Oriental Languages},
  Year      = {2016}
}
```

## Contact Information

For any questions about this shared task, please contact: Xipeng Qiu, Group of NLP & DL, School of Computer Science, Fudan University. Email: [email protected]
