howl-anderson / Mitie_chinese_wikipedia_corpus

Licence: mit
Pre-trained Wikipedia corpus by MITIE


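Chinese Wikipedia MITIE Corpus

This project provides tools and a guide for training MITIE on a Chinese Wikipedia corpus. Training typically takes about three days on a server with high-end hardware and a fast network connection, so to save time this project also provides pre-trained models.

Training from scratch

Build the Wikipedia corpus

See the chinese-wikipedia-corpus-creator project. The final data directory for the Wikipedia corpus is third-party/chinese-wikipedia-corpus-creator/token_cleaned_plain_files. There are two ways to obtain the data: download the preprocessed corpus directly, or process the corpus from scratch.

Download the preprocessed corpus directly

Download the files already processed by chinese-wikipedia-corpus-creator. The download links are on the Release page of chinese-wikipedia-corpus-creator. After downloading, place the files in third-party/chinese-wikipedia-corpus-creator/token_cleaned_plain_files.

Process the corpus from scratch

Download or clone the chinese-wikipedia-corpus-creator source code into third-party/chinese-wikipedia-corpus-creator, then follow that project's documentation to run the code and generate the Chinese Wikipedia corpus. Make sure the final output files end up in third-party/chinese-wikipedia-corpus-creator/token_cleaned_plain_files.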
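
Whichever route is used, it may be worth confirming that the corpus directory exists and is not empty before starting a multi-day training run. The following is a minimal sketch, not part of either project; the function name corpus_ready is hypothetical and chosen only for illustration.

```shell
#!/bin/sh
# Hypothetical pre-flight check (illustrative name, not part of this project
# or of chinese-wikipedia-corpus-creator).
corpus_ready() {
    dir="$1"
    # Fail early if the directory does not exist.
    [ -d "$dir" ] || { echo "missing: $dir" >&2; return 1; }
    # Count the regular files below the directory.
    count=$(find "$dir" -type f | wc -l | tr -d ' ')
    [ "$count" -gt 0 ] || { echo "empty: $dir" >&2; return 1; }
    echo "$count file(s) in $dir"
}

# Check the corpus directory named in this guide; the guard keeps the
# script from aborting when the corpus has not been prepared yet.
corpus_ready third-party/chinese-wikipedia-corpus-creator/token_cleaned_plain_files \
    || echo "corpus is not ready yet"
```

The function returns a non-zero status when the directory is missing or empty, so it can be chained in front of the training command with &&.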

Build the MITIE tools

Get the MITIE source code

Here MITIE is cloned into this project's third-party directory:

$ git clone https://github.com/mit-nlp/MITIE.git third-party/MITIE

Compile MITIE

MITIE is a collection of tools; this project only needs the wordrep tool.

$ cd third-party/MITIE/tools/wordrep
$ mkdir build
$ cd build
$ cmake ..
$ cmake --build . --config Release
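
The location of the resulting binary depends on the CMake generator: single-configuration generators such as Makefiles place it directly in the build directory, while multi-configuration generators such as Visual Studio use a per-configuration subdirectory. The sketch below looks in both places. The function name find_wordrep is hypothetical, and the Release/wordrep.exe path is an assumption about a Windows build rather than something this project documents.

```shell
#!/bin/sh
# Hypothetical helper (illustrative name): print the path of the wordrep
# binary if it can be found in one of the usual CMake output locations.
find_wordrep() {
    build_dir="${1:-third-party/MITIE/tools/wordrep/build}"
    # First candidate: single-config generators (e.g. Makefiles).
    # Second candidate: assumed multi-config layout (e.g. Visual Studio).
    for candidate in "$build_dir/wordrep" "$build_dir/Release/wordrep.exe"; do
        if [ -f "$candidate" ]; then
            echo "$candidate"
            return 0
        fi
    done
    echo "wordrep not found under $build_dir" >&2
    return 1
}

# Report the binary location, or note that the build has not produced one.
find_wordrep || echo "build wordrep first"
```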

Train the model

$ ./third-party/MITIE/tools/wordrep/build/wordrep --count-words 800000 --word-vects --basic-morph --cca-morph ./third-party/chinese-wikipedia-corpus-creator/token_cleaned_plain_files
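
Once training finishes, a quick check that a model file was actually written can save a confusing failure later. The sketch below assumes that wordrep writes its final model to total_word_feature_extractor.dat in the current working directory; that file name and location are assumptions not stated in this guide, so adjust the argument if your run differs. The function name model_ready is hypothetical.

```shell
#!/bin/sh
# Hypothetical post-training check (illustrative name). The default file
# name is an assumption about wordrep's output, not taken from this guide.
model_ready() {
    model="${1:-total_word_feature_extractor.dat}"
    # -s is true only if the file exists and has a size greater than zero.
    [ -s "$model" ] || { echo "not found or empty: $model" >&2; return 1; }
    size=$(wc -c < "$model" | tr -d ' ')
    echo "$model: $size bytes"
}

# Check the assumed default output in the current directory.
model_ready || echo "no model file yet"
```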

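Download pre-trained models

The list of downloadable models is on the releases page (fast download links for users in China are provided).

Contributing

Please read CONTRIBUTING.md and send us pull requests.

Versioning

This project uses SemVer for versioning. For all available versions, see the tags on this repository.

Authors

All contributors are listed under contributors.

License

This project is licensed under the MIT License; see LICENSE.md for details.

Acknowledgements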
