
nonamestreet / Weixin_public_corpus

WeChat Official Accounts Corpus

Projects that are alternatives to or similar to Weixin_public_corpus

Gossiping Chinese Corpus
Chinese question-answer corpus from the PTT Gossiping board
Stars: ✭ 137 (-70.54%)
Mutual labels:  corpus, chinese-nlp
Awesome Persian Nlp Ir
Curated List of Persian Natural Language Processing and Information Retrieval Tools and Resources
Stars: ✭ 460 (-1.08%)
Mutual labels:  corpus, natural-language-processing
Prosody
Helsinki Prosody Corpus and A System for Predicting Prosodic Prominence from Text
Stars: ✭ 139 (-70.11%)
Mutual labels:  corpus, natural-language-processing
Colibri Core
Colibri core is an NLP tool as well as a C++ and Python library for working with basic linguistic constructions such as n-grams and skipgrams (i.e., patterns with one or more gaps, either of fixed or dynamic size) in a quick and memory-efficient way. At the core is the tool ``colibri-patternmodeller``, which allows you to build, view, manipulate and query pattern models.
Stars: ✭ 112 (-75.91%)
Mutual labels:  corpus, linguistics
Fakenewscorpus
A dataset of millions of news articles scraped from a curated list of data sources.
Stars: ✭ 255 (-45.16%)
Mutual labels:  corpus, natural-language-processing
Awesome Hungarian Nlp
A curated list of NLP resources for Hungarian
Stars: ✭ 121 (-73.98%)
Mutual labels:  corpus, natural-language-processing
Efaqa Corpus Zh
❤️ Emotional First Aid Dataset: a psychological counseling Q&A and chatbot corpus
Stars: ✭ 170 (-63.44%)
Mutual labels:  corpus, natural-language-processing
Typing Assistant
Typing Assistant provides the ability to autocomplete words and suggests predictions for the next word. This makes typing faster, more intelligent and reduces effort.
Stars: ✭ 32 (-93.12%)
Mutual labels:  corpus, natural-language-processing
folia
FoLiA: Format for Linguistic Annotation - FoLiA is a rich XML-based annotation format for the representation of language resources (including corpora) with linguistic annotations. A wide variety of linguistic annotations are supported, making FoLiA a useful format for NLP tasks and data interchange. Note that the actual Python library for proces…
Stars: ✭ 56 (-87.96%)
Mutual labels:  corpus, linguistics
proiel-treebank
Official releases of the PROIEL treebank of ancient Indo-European languages
Stars: ✭ 30 (-93.55%)
Mutual labels:  corpus, linguistics
Ua Gec
UA-GEC: Grammatical Error Correction and Fluency Corpus for the Ukrainian Language
Stars: ✭ 108 (-76.77%)
Mutual labels:  corpus, natural-language-processing
Pynlpl
PyNLPl, pronounced as 'pineapple', is a Python library for Natural Language Processing. It contains various modules useful for common, and less common, NLP tasks. PyNLPl can be used for basic tasks such as the extraction of n-grams and frequency lists, and to build simple language models. There are also more complex data types and algorithms. Moreover, there are parsers for file formats common in NLP (e.g. FoLiA/Giza/Moses/ARPA/Timbl/CQL). There are also clients to interface with various NLP-specific servers. PyNLPl most notably features a very extensive library for working with FoLiA XML (Format for Linguistic Annotation).
Stars: ✭ 426 (-8.39%)
Mutual labels:  natural-language-processing, linguistics
Ja.text8
Japanese text8 corpus for word embedding.
Stars: ✭ 79 (-83.01%)
Mutual labels:  corpus, natural-language-processing
Chinese Nlp Corpus
Collections of Chinese NLP corpus
Stars: ✭ 438 (-5.81%)
Mutual labels:  corpus, chinese-nlp
Coarij
Corpus of Annual Reports in Japan
Stars: ✭ 55 (-88.17%)
Mutual labels:  corpus, natural-language-processing
Nlp bahasa resources
A Curated List of Dataset and Usable Library Resources for NLP in Bahasa Indonesia
Stars: ✭ 158 (-66.02%)
Mutual labels:  corpus, natural-language-processing
Nlp chinese corpus
Large Scale Chinese Corpus for NLP
Stars: ✭ 6,656 (+1331.4%)
Mutual labels:  corpus, chinese-nlp
Insuranceqa Corpus Zh
🚁 Insurance industry corpus, chatbot
Stars: ✭ 821 (+76.56%)
Mutual labels:  corpus, natural-language-processing
Nlvr
Cornell NLVR and NLVR2 are natural language grounding datasets. Each example shows a visual input and a sentence describing it, and is annotated with the truth-value of the sentence.
Stars: ✭ 192 (-58.71%)
Mutual labels:  corpus, natural-language-processing
Ltp
Language Technology Platform
Stars: ✭ 3,648 (+684.52%)
Mutual labels:  natural-language-processing, chinese-nlp

WeChat Official Accounts Corpus

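A partial collection of articles crawled from WeChat official accounts. The HTML has been removed, so only plain text remains. Each line is one article in JSON format: name is the official account's name, account is the official account's ID, title is the article title, and content is the body text.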
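The one-article-per-line JSON layout described above can be read with a few lines of Python. This is a minimal sketch, not the project's own tooling: the helper name iter_articles and the sample records are hypothetical, and only the four field names (name, account, title, content) come from the description.

```python
import io
import json
from typing import Iterator, TextIO


def iter_articles(lines: TextIO) -> Iterator[dict]:
    """Yield one article dict per non-empty line of JSON text."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        yield json.loads(line)


# Hypothetical sample standing in for an extracted data file;
# the values are invented, only the field names follow the README.
sample = io.StringIO(
    '{"name": "Example Account", "account": "example_id", '
    '"title": "Hello", "content": "First article."}\n'
    '{"name": "Example Account", "account": "example_id", '
    '"title": "World", "content": "Second article."}\n'
)

articles = list(iter_articles(sample))
titles = [article["title"] for article in articles]
```

For a real file, passing an open file handle (opened with encoding="utf-8") in place of the in-memory sample should work the same way, and iterating lazily avoids loading the roughly 3 GB corpus into memory at once.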

The data is compressed as a multi-volume (split) zip archive with no password. For a preview, see preview.json.

The data is currently about 3 GB and will be updated and expanded periodically.

Please use it for research purposes only.

If you have questions or special requests, open an Issue directly.

[email protected]

Like-minded people are welcome to join 校宝 (Xiaobao) and work on interesting things together: https://www.xiaobaoonline.com/pc/contactjoin
