
howl-anderson / Mitie_chinese_wikipedia_corpus

Licence: mit

Pre-trained Wikipedia corpus by MITIE

Labels

nlp nlp-machine-learning corpus

Projects that are alternatives to or similar to Mitie chinese wikipedia corpus

Russian news corpus

Russian mass media stemmed texts corpus (lemmatized, i.e. morphologically normalized, texts from Russian mass media)

Stars: ✭ 76 (+76.74%)

Mutual labels: corpus, nlp-machine-learning

Letslearnai.github.io

Lets Learn AI

Stars: ✭ 33 (-23.26%)

Mutual labels: nlp-machine-learning

Tapas

End-to-end neural table-text understanding models.

Stars: ✭ 583 (+1255.81%)

Mutual labels: nlp-machine-learning

Click2analyze Androiddevchallenge

An app that analyzes text and fixes anomalies in messages that deviate from what is standard, normal, or expected. #AndroidDevChallenge

Stars: ✭ 20 (-53.49%)

Mutual labels: nlp-machine-learning

Quanteda

An R package for the Quantitative Analysis of Textual Data

Stars: ✭ 647 (+1404.65%)

Mutual labels: corpus

Lyrics Corpora

An unofficial Python API that allows users to create a corpus of lyrical text from their favorite artists and billboard charts

Stars: ✭ 13 (-69.77%)

Mutual labels: corpus

Nlp base

Foundational natural language models

Stars: ✭ 524 (+1118.6%)

Mutual labels: nlp-machine-learning

Tika Python

Tika-Python is a Python binding to the Apache Tika™ REST services allowing Tika to be called natively in the Python community.

Stars: ✭ 997 (+2218.6%)

Mutual labels: nlp-machine-learning

Chatterbot Corpus

A multilingual dialog corpus

Stars: ✭ 964 (+2141.86%)

Mutual labels: corpus

Naive Bayes Classifier

A Naive Bayes classifier is a classification algorithm. It uses the Bernoulli and multinomial Naive Bayes models to classify documents (text) as ham or spam.

Stars: ✭ 6 (-86.05%)

Mutual labels: corpus

Insuranceqa Corpus Zh

🚁 Insurance industry corpus, chatbot

Stars: ✭ 821 (+1809.3%)

Mutual labels: corpus

Nlp chinese corpus

Large Scale Chinese Corpus for NLP

Stars: ✭ 6,656 (+15379.07%)

Mutual labels: corpus

Sdtm mapper

AI SDTM mapping (R for ML, Python, TensorFlow for DL)

Stars: ✭ 27 (-37.21%)

Mutual labels: nlp-machine-learning

Deeppavlov

An open source library for deep learning end-to-end dialog systems and chatbots.

Stars: ✭ 5,525 (+12748.84%)

Mutual labels: nlp-machine-learning

Talismane

NLP framework: sentence detector, tokeniser, pos-tagger and dependency parser

Stars: ✭ 38 (-11.63%)

Mutual labels: nlp-machine-learning

Chinese models for spacy

Models for SpaCy that support Chinese

Stars: ✭ 543 (+1162.79%)

Mutual labels: nlp-machine-learning

Rasa Ui

Rasa UI is a frontend for the Rasa Framework

Stars: ✭ 796 (+1751.16%)

Mutual labels: nlp-machine-learning

Company Names Corpus

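Corpus of company and organization names: company short names, abbreviations, brand terms, and enterprise names. Useful for Chinese word segmentation and organization named-entity recognition.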

Stars: ✭ 868 (+1918.6%)

Mutual labels: corpus

Predicting Myers Briggs Type Indicator With Recurrent Neural Networks

Stars: ✭ 43 (+0%)

Mutual labels: nlp-machine-learning

Coursera Natural Language Processing Specialization

Programming assignments from all courses in the Coursera Natural Language Processing Specialization offered by deeplearning.ai.

Stars: ✭ 39 (-9.3%)

Mutual labels: nlp-machine-learning


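Chinese Wikipedia MITIE Corpus

This project provides tools and a guide for training a MITIE model on a Chinese Wikipedia corpus. Training typically takes about three days on a high-spec server with a fast network connection, so to save time the project also provides pre-trained models.

Training from scratch

Building the Wikipedia corpus

See the chinese-wikipedia-corpus-creator project. The final data directory for the Wikipedia corpus is third-party/chinese-wikipedia-corpus-creator/token_cleaned_plain_files. There are two ways to obtain the data: download the already preprocessed corpus, or process the corpus from scratch.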

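Downloading the preprocessed corpus

Download the files that chinese-wikipedia-corpus-creator has already processed from the Release page of chinese-wikipedia-corpus-creator, then place them in third-party/chinese-wikipedia-corpus-creator/token_cleaned_plain_files.

Processing the corpus from scratch

Download or clone the chinese-wikipedia-corpus-creator source code into third-party/chinese-wikipedia-corpus-creator, then follow that project's documentation to run the relevant code and generate the Chinese Wikipedia corpus. Make sure the final output files end up in third-party/chinese-wikipedia-corpus-creator/token_cleaned_plain_files.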
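Whichever route is used, training expects files under third-party/chinese-wikipedia-corpus-creator/token_cleaned_plain_files. The following is a minimal sketch of a sanity check to run before starting a multi-day training job. The function name `check_corpus_dir` is hypothetical and is not part of either project; it only confirms that the directory exists and holds at least one file, and it does not validate the file contents.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of either project): succeed only if the
# given corpus directory exists and contains at least one regular file.
check_corpus_dir() {
    local dir="$1"
    if [ ! -d "$dir" ]; then
        echo "missing directory: $dir" >&2
        return 1
    fi
    # List regular files under the directory and keep only the first hit.
    if [ -z "$(find "$dir" -type f | head -n 1)" ]; then
        echo "no files found in: $dir" >&2
        return 1
    fi
    echo "corpus directory looks populated: $dir"
}

# Example call, run from the project root:
#   check_corpus_dir third-party/chinese-wikipedia-corpus-creator/token_cleaned_plain_files
```

The function returns a non-zero status on failure, so it can guard a later command with `&&`.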

Building the `MITIE` tools

Getting the `MITIE` source code

Here MITIE is cloned into this project's third-party directory:

$ git clone https://github.com/mit-nlp/MITIE.git third-party/MITIE

Compiling `MITIE`

MITIE is a collection of tools; this project needs only the wordrep tool.

$ cd third-party/MITIE/tools/wordrep
$ mkdir build
$ cd build
$ cmake ..
$ cmake --build . --config Release

Training the model

$ ./third-party/MITIE/tools/wordrep/build/wordrep --count-words 800000 --word-vects --basic-morph --cca-morph ./third-party/chinese-wikipedia-corpus-creator/token_cleaned_plain_files
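Because a run can take days, it may be worth failing fast when a prerequisite is missing. The sketch below wraps the training command in a preflight check. The function name `train_wordrep` and the `WORDREP` and `CORPUS_DIR` variables are hypothetical names introduced for illustration; the default paths and the flags are copied from the command above and assume the script is run from the project root.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper; variable and function names are illustrative only.
# Defaults mirror the paths used in this README and can be overridden.
WORDREP="${WORDREP:-./third-party/MITIE/tools/wordrep/build/wordrep}"
CORPUS_DIR="${CORPUS_DIR:-./third-party/chinese-wikipedia-corpus-creator/token_cleaned_plain_files}"

train_wordrep() {
    # Refuse to start if the compiled tool is missing or not executable.
    if [ ! -x "$WORDREP" ]; then
        echo "wordrep binary not found or not executable: $WORDREP" >&2
        return 1
    fi
    # Refuse to start if the corpus directory is missing.
    if [ ! -d "$CORPUS_DIR" ]; then
        echo "corpus directory not found: $CORPUS_DIR" >&2
        return 1
    fi
    # Same flags as the training command in this README.
    "$WORDREP" --count-words 800000 --word-vects --basic-morph --cca-morph "$CORPUS_DIR"
}

# Example call, run from the project root:
#   train_wordrep
```

Keeping the paths in overridable variables lets the same function be pointed at a smaller sample directory for a quick trial run before committing to the full corpus.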

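Downloading pre-trained models

See releases for the list of downloadable models (fast download links for users in China are provided).

Contributing

Please read CONTRIBUTING.md and send us pull requests.

Versioning

This project uses SemVer for versioning. For the available versions, see the tags on this repository.

Authors

Xiaoquan Kong - Initial work - howl-anderson

See contributors for the full list of contributors.

License

This project is licensed under the MIT License; see LICENSE.md for details.

Acknowledgments

The MITIE compilation steps are based on WANG Guan's blog post 用Rasa NLU构建自己的中文NLU系统 ("Building your own Chinese NLU system with Rasa NLU").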


Stars: ✭ 43
