
howl-anderson / Chinese_models_for_spacy

Licence: MIT
Models for SpaCy that support Chinese


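The official SpaCy Chinese models are now available (https://spacy.io/models/zh). This project's mission of "promoting the development of Chinese models for SpaCy" has been accomplished, and the project is now in maintenance mode: future updates will be limited to bug fixes. Thank you to all users for your long-term interest and support.

Chinese models for SpaCy

Chinese language models for SpaCy. The models are currently in public beta.

Online demo

An online demo based on Jupyter notebook is available on Binder.

Features

Some of the attributes of the Doc object for the sentence 王小明在北京的清华大学读书 ("Wang Xiaoming studies at Tsinghua University in Beijing"):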

attributes_of_doc
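
As a rough illustration of how attributes like these can be read, the sketch below collects a few per-token fields from a Doc. It assumes the model has already been installed and linked under the alias zh; the helper name token_attributes is hypothetical, while text, pos_, tag_, dep_, is_stop and is_oov are standard SpaCy Token attributes.

```python
def token_attributes(doc):
    """Return one dict per token with a few commonly inspected attributes."""
    fields = ("text", "pos_", "tag_", "dep_", "is_stop", "is_oov")
    return [{name: getattr(token, name) for name in fields} for token in doc]


def main():
    # Assumption: the model was installed and linked under the alias "zh".
    import spacy

    nlp = spacy.load("zh")
    doc = nlp("王小明在北京的清华大学读书")
    for row in token_attributes(doc):
        print(row)
```

Calling main() prints one row per token. The actual values depend on the model version, so no output is reproduced here.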

NER (New!)

Some of the NER information of the Doc object for the sentence 王小明在北京的清华大学读书:

ner_of_doc
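
Named entities can be read from the Doc in a similar way. The following is a minimal sketch, again assuming the model is linked as zh; entity_spans is a hypothetical helper name, while doc.ents, text and label_ are part of the regular SpaCy API.

```python
def entity_spans(doc):
    """Return (text, label) pairs for each named entity found in the Doc."""
    return [(ent.text, ent.label_) for ent in doc.ents]


def main():
    # Assumption: the model was installed and linked under the alias "zh".
    import spacy

    nlp = spacy.load("zh")
    doc = nlp("王小明在北京的清华大学读书")
    print(entity_spans(doc))
```

The labels returned depend on the tag set the NER component was trained with.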

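Getting started

The models are distributed as binary files. Users are expected to have basic knowledge of SpaCy (version > 2).

System requirements

Python 3 (Python 2 may work, but has not been well tested)

Installation

Download the model

Download the model from the releases page (New! Accelerated download links are provided for users in China). Suppose the downloaded model file is named zh_core_web_sm-2.x.x.tar.gz.

Install the model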

pip install zh_core_web_sm-2.x.x.tar.gz
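
One possible way to confirm that the installation succeeded is to ask Python's import system whether the package can be found. This is only a sketch: it assumes the package name is zh_core_web_sm, as the file name above suggests, and is_model_installed is a hypothetical helper.

```python
import importlib.util


def is_model_installed(package="zh_core_web_sm"):
    """Return True if the named top-level package can be found by the import system."""
    return importlib.util.find_spec(package) is not None


# Prints True once the pip install above has completed in this environment.
print(is_model_installed())
```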

To make the model convenient to use later in frameworks such as Rasa NLU, create a link for it by running the following command:

spacy link zh_core_web_sm zh

Once the command completes, the model can be accessed under the alias zh.
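
In Python, the alias can then be passed to spacy.load. The sketch below tries the alias first and falls back to the package name, which may be useful in environments where the link step was skipped. The helper load_first_available and its loader parameter are hypothetical and exist only to keep the example testable without SpaCy installed.

```python
def load_first_available(names=("zh", "zh_core_web_sm"), loader=None):
    """Try each model name in order and return the first pipeline that loads."""
    if loader is None:
        import spacy  # imported lazily so the helper can be exercised without SpaCy

        loader = spacy.load
    for name in names:
        try:
            return loader(name)
        except OSError:
            # SpaCy raises OSError when a model name or link cannot be found.
            continue
    raise OSError("none of the models could be loaded: " + ", ".join(names))
```

With the model installed, nlp = load_first_available() returns a pipeline that can be called as nlp("王小明在北京的清华大学读书").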

Run the demo code

The demo code is in test.py. After installing the model, download or clone this repository, then run:

python3 ./test.py

Open http://127.0.0.1:5000 and you will see the following:

Dependency of doc
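
The contents of test.py are not reproduced here. As an assumption about how such a demo could be written, the sketch below uses SpaCy's displaCy visualizer, whose built-in server listens on port 5000 by default; the function name serve_dependency_view is hypothetical, and the real script may differ.

```python
def serve_dependency_view(text, model_name="zh", port=5000):
    """Parse `text` and serve its dependency tree with displaCy (blocks until stopped)."""
    import spacy
    from spacy import displacy

    # Assumption: the model was installed and linked under the alias "zh".
    nlp = spacy.load(model_name)
    doc = nlp(text)
    # displacy.serve starts a small web server; style="dep" renders the dependency parse.
    displacy.serve(doc, style="dep", port=port)
```

Calling serve_dependency_view("王小明在北京的清华大学读书") starts the server, which keeps running until it is interrupted.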

How to build this model from scratch

workflow

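Corpus

This project uses the OntoNotes 5.0 corpus.

Because OntoNotes 5.0 is copyrighted material of the LDC (Linguistic Data Consortium), it cannot be included directly in this project. The good news is that OntoNotes 5.0 is completely free for organizational users (including companies and academic institutions). Users can register a company or academic organization account and then obtain OntoNotes 5.0 free of charge.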

TODO list

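  • The pos_ attribute is incorrect. This is related to the Chinese language class in SpaCy.
  • The shape_ and is_alpha attributes do not seem meaningful for Chinese, but this needs to be confirmed against an authoritative source.
  • The is_stop attribute is incorrect. This is related to the Chinese language class in SpaCy.
  • The vector attribute does not seem to be trained very well.
  • The is_oov attribute is completely wrong. Fixing this is the top priority.
  • The NER model is currently unavailable because the LDC corpus is missing. This is being addressed, and training is in progress.
  • Release the intermediate results used during training so that users can customize their own models.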

Components used

  • TODO

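How to contribute

Please read CONTRIBUTING.md, then submit pull requests to us.

Versioning

We use SemVer for versioning. See the tags for all available versions.

Authors

For more information about contributors, see contributors.

License

MIT License - see LICENSE.md for details

Acknowledgements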

  • TODO