dariusk / Corpora

A collection of small corpora of interesting data for the creation of bots and similar stuff.

Corpora

This project is a collection of static corpora (plural of "corpus") that are potentially useful in the creation of weird internet stuff. I've found that, as a creator, sometimes I am making something that needs access to a lot of adjectives, but not necessarily every adjective in the English language. So for the last year I've been copy/pasting an adjs.json file from project to project. This is kind of awful, so I'm hoping that this project will at least help me keep everything in one place.

I would like this to help with rapid prototyping of projects. For example: you might use nouns.json to start with, just to see if an idea you had was any good. Once you've built the project quickly around the nouns collection, you can then rip it out and replace it with a more complex or exhaustive data source.
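As a rough illustration of that prototyping workflow, here is a minimal Python sketch that parses a corpus and picks a few random entries. The file layout (a JSON object with a list stored under a "nouns" key) and the helper names `load_items` and `pick` are assumptions made for this example, not something this README specifies, so check the actual file before relying on it.

```python
import json
import random

# Hypothetical stand-in for the text of a file such as nouns.json.
# The layout (an object with a list under "nouns") is an assumption.
SAMPLE_CORPUS = (
    '{"description": "A few common nouns", '
    '"nouns": ["apple", "bridge", "cloud", "door"]}'
)


def load_items(text, key):
    """Parse corpus text and return the list stored under `key`."""
    data = json.loads(text)
    items = data[key]
    if not isinstance(items, list):
        raise ValueError(f"expected a list under {key!r}")
    return items


def pick(items, count, seed=None):
    """Return `count` distinct random items; a seed makes the choice repeatable."""
    rng = random.Random(seed)
    return rng.sample(items, count)


nouns = load_items(SAMPLE_CORPUS, "nouns")
print(pick(nouns, 2))
```

To try this against a real file, read its text with `open(path, encoding="utf-8")` and pass it to `load_items`. Because the data source sits behind one small function, swapping in a more exhaustive source later only means replacing `load_items`.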

I'm also hoping that this can be used as a teaching tool: maybe someone has three hours to teach how to make Twitter bots. That doesn't give the student much time to find/scrape/clean/parse interesting data. My hope is that students can be pointed to this project and they can pick and choose different interesting data sources to meld together for the creation of prototypes.

License

Since Corpora is more data than code, I have chosen to CC0 license this (rather than MIT license or similar).

To the extent possible under law, Darius Kazemi has waived all copyright and related or neighboring rights to Corpora. This work is published from: United States.

What is Corpora NOT?

This project is not meant to replace exhaustive APIs -- if you want nouns, and you want every noun in the English language, replete with metadata, consider Wordnik. If you want the title of every Wikipedia article, use the MediaWiki API.

What is Corpora?

  • Corpora is a repository of JSON files, meant to be language-neutral. If you want to create an NPM repo or whatever based on this, be my guest, but this repository will remain a collection of data files that can be interpreted by any language that can parse JSON.
  • Corpora is a collection of small files. It is not meant to be an exhaustive source of anything: a list of resources should contain somewhere in the vicinity of 1000 items.
    • For example, Corpora will not contain any complete "dictionary" style files. Instead we host a sampling of 1000 common nouns, adjectives, and verbs.
    • Some lists are small enough by nature that we may include a complete list of things in their category. For example, a list of heavily populated U.S. cities may only have 75 cities and be considered complete.

List of Corpora-related tools

I have some data, how do I submit?

We accept pull requests to this repository. Some guidelines:

  • BY SUBMITTING DATA AS A PULL REQUEST, YOU AGREE TO OUR APPLYING A CC0 FREE CULTURE LICENSE TO THE DATA, MEANING THAT ANYONE CAN USE THE DATA FOR ANY REASON WITHOUT ATTRIBUTION IN PERPETUITY.
  • Please submit all data in JSON format, in a file with a .json extension, and run your files through JSONLint before submitting. Thanks to Matt Rothenberg we also have Travis-CI testing, which will jsonlint your pull request automatically. If you see a test failure notification in your PR after you submit, there's a problem with your JSON!
  • Keep individual files to about 1000 "things" maximum. Fewer than 1000 is fine, too.
  • If you'd like attribution, I'm happy to include your name in this Readme file. Just remember that nobody who uses this data is obligated to include attribution in their own projects.
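
If you want a quick local check before opening a pull request, something like the following Python sketch may help. It is not JSONLint and not the project's Travis-CI setup. It only uses the standard library to confirm that the text parses as JSON and that no list in it goes past 1000 entries. The names `check_submission` and `longest_list`, and the choice to measure "things" as the longest list in the file, are assumptions made for this example.

```python
import json

MAX_ITEMS = 1000  # the guideline above, treated here as a hard limit (an assumption)


def longest_list(value):
    """Return the length of the longest list found anywhere inside `value`."""
    if isinstance(value, list):
        return max([len(value)] + [longest_list(v) for v in value])
    if isinstance(value, dict):
        return max([0] + [longest_list(v) for v in value.values()])
    return 0


def check_submission(text):
    """Return a list of problems found in a candidate corpus file's text."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError as err:
        return [f"invalid JSON: {err.msg} at line {err.lineno}, column {err.colno}"]
    problems = []
    size = longest_list(data)
    if size > MAX_ITEMS:
        problems.append(f"longest list has {size} items, over the {MAX_ITEMS} guideline")
    return problems


print(check_submission('{"colors": ["red", "green"]}'))   # a small, well-formed file
print(check_submission('{"colors": ["red", "green",]}'))  # trailing commas are not valid JSON
```

An empty result means the sketch found nothing to complain about. The automated test on the pull request remains the real check.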

Contributors

By Darius Kazemi and Many Wonderful Contributors.
