Colibri Core
Colibri Core is an NLP tool as well as a C++ and Python library for working with basic linguistic constructions such as n-grams and skipgrams (i.e., patterns with one or more gaps, either of fixed or dynamic size) in a quick and memory-efficient way. At the core is the tool ``colibri-patternmodeller``, which allows you to build, view, manipulate and query pattern models.
Stars: ✭ 112 (+273.33%)
gum
Repository for the Georgetown University Multilayer Corpus (GUM)
Stars: ✭ 71 (+136.67%)
folia
FoLiA: Format for Linguistic Annotation - FoLiA is a rich XML-based annotation format for the representation of language resources (including corpora) with linguistic annotations. A wide variety of linguistic annotations are supported, making FoLiA a useful format for NLP tasks and data interchange. Note that the actual Python library for proces…
Stars: ✭ 56 (+86.67%)
Nlp bahasa resources
A Curated List of Datasets and Usable Library Resources for NLP in Bahasa Indonesia
Stars: ✭ 158 (+426.67%)
Pansori
Tools for ASR Corpus Generation from Online Video
Stars: ✭ 106 (+253.33%)
Pyclue
Python toolkit for the Chinese Language Understanding Evaluation (CLUE) benchmark
Stars: ✭ 91 (+203.33%)
Blacklab
A corpus retrieval engine based on Apache Lucene
Stars: ✭ 69 (+130%)
WonderfulPolishLanguage
A repository listing resources for learning and exploring the wonderful Polish language.
Stars: ✭ 31 (+3.33%)
Prosody
Helsinki Prosody Corpus and a System for Predicting Prosodic Prominence from Text
Stars: ✭ 139 (+363.33%)
Typing Assistant
Typing Assistant autocompletes words and suggests predictions for the next word, which makes typing faster and smarter and reduces effort.
Stars: ✭ 32 (+6.67%)
Datasets
Poetry-related datasets developed by the THUAIPoet (Jiuge) group.
Stars: ✭ 111 (+270%)
Nlvr
Cornell NLVR and NLVR2 are natural language grounding datasets. Each example shows a visual input and a sentence describing it, and is annotated with the truth-value of the sentence.
Stars: ✭ 192 (+540%)
poesy
Poetic processing, for Python.
Stars: ✭ 28 (-6.67%)
Ja.text8
Japanese text8 corpus for word embedding.
Stars: ✭ 79 (+163.33%)
Wp2txt
WP2TXT extracts plain text from Wikipedia dump files (encoded in XML and compressed with Bzip2), stripping all MediaWiki markup and other metadata.
Stars: ✭ 145 (+383.33%)
pylangacq
Language Acquisition Research Tools
Stars: ✭ 33 (+10%)
Code Docstring Corpus
Preprocessed Python functions and docstrings for automated code documentation (code2doc) and automated code generation (doc2code) tasks.
Stars: ✭ 137 (+356.67%)
Nlp chinese corpus
Large Scale Chinese Corpus for NLP
Stars: ✭ 6,656 (+22086.67%)
Khcoder
KH Coder: for Quantitative Content Analysis or Text Mining
Stars: ✭ 126 (+320%)
Weibo terminater
The final Weibo crawler: scrape anything from Weibo, including comments, post contents, and followers. The Terminator.
Stars: ✭ 2,295 (+7550%)
Ua Gec
UA-GEC: Grammatical Error Correction and Fluency Corpus for the Ukrainian Language
Stars: ✭ 108 (+260%)
Pubmed Rct
PubMed 200k RCT dataset: a large dataset for sequential sentence classification.
Stars: ✭ 101 (+236.67%)
Efaqa Corpus Zh
❤️ Emotional First Aid Dataset: a psychological counseling Q&A and chatbot corpus
Stars: ✭ 170 (+466.67%)
Dataset List
Lists of text corpora and more (mainly Japanese)
Stars: ✭ 84 (+180%)
Russian news corpus
Corpus of stemmed (lemmatized, morphologically normalized) Russian mass media texts
Stars: ✭ 76 (+153.33%)
pfootprint
Political Discourse Analysis Using Pre-Trained Word Vectors.
Stars: ✭ 20 (-33.33%)
Coarij
Corpus of Annual Reports in Japan
Stars: ✭ 55 (+83.33%)
Clue
Chinese Language Understanding Evaluation Benchmark: datasets, baselines, pre-trained models, corpus and leaderboard
Stars: ✭ 2,425 (+7983.33%)
DANeS
DANeS is an open-source e-newspaper dataset created in collaboration between DATASET JSC (dataset.vn) and AIV Group (aivgroup.vn)
Stars: ✭ 64 (+113.33%)
Lyrics Corpora
An unofficial Python API that allows users to create a corpus of lyrical text from their favorite artists and billboard charts
Stars: ✭ 13 (-56.67%)
Naive Bayes Classifier
A Naive Bayes classifier is a classification algorithm. This one uses the Bernoulli and multinomial Naive Bayes models to classify text documents as ham or spam.
Stars: ✭ 6 (-80%)
megs
A merged version of multiple open-source German speech datasets.
Stars: ✭ 21 (-30%)
Seq2seq Chatbot
Chatbot in 200 lines of code using TensorLayer
Stars: ✭ 777 (+2490%)
Awesome Chatbot
Awesome Chatbot Projects, Corpus, Papers, Tutorials. Chinese Chatbot =>:
Stars: ✭ 1,785 (+5850%)
Quanteda
An R package for the Quantitative Analysis of Textual Data
Stars: ✭ 647 (+2056.67%)
rclc
Rich Context leaderboard competition, including the corpus and current SOTA for required tasks.
Stars: ✭ 20 (-33.33%)
Dialog corpus
Datasets for training Chinese and English chatbot (dialogue) systems
Stars: ✭ 1,662 (+5440%)
Bookcorpus
Crawl BookCorpus
Stars: ✭ 443 (+1376.67%)
Awesome Persian Nlp Ir
Curated List of Persian Natural Language Processing and Information Retrieval Tools and Resources
Stars: ✭ 460 (+1433.33%)
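Chinese Names Corpus
A corpus of Chinese personal names and a name generator. Covers Chinese full names, surnames, given names, forms of address, Japanese names, transliterated names, and English names. Can be used for Chinese word segmentation and person-name entity recognition.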
Stars: ✭ 3,053 (+10076.67%)
nyt-first-said
Tweets when words are published for the first time in the NYT
Stars: ✭ 222 (+640%)
german-nouns
A list of ~100,000 German nouns and their grammatical properties, compiled from WiktionaryDE as a CSV file, plus a module to look up the data and parse compound words.
Stars: ✭ 101 (+236.67%)