rodrigopivi / Chatito

Licence: mit
🎯🗯 Generate datasets for AI chatbots, NLP tasks, named entity recognition or text classification models using a simple DSL!

Programming Languages

typescript

Projects that are alternatives of or similar to Chatito

Chatette
A powerful dataset generator for Rasa NLU, inspired by Chatito
Stars: ✭ 205 (-69.76%)
Mutual labels:  chatbot, chatbots, nlu, nlg
Snips Nlu
Snips Python library to extract meaning from text
Stars: ✭ 3,583 (+428.47%)
Mutual labels:  chatbot, text-classification, named-entity-recognition, nlu
Botpress
🤖 Dev tools to reliably understand text and automate conversations. Built-in NLU. Connect & deploy on any messaging channel (Slack, MS Teams, website, Telegram, etc).
Stars: ✭ 9,486 (+1299.12%)
Mutual labels:  chatbot, chatbots, nlu
Chatbot cn
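A chatbot for the finance and legal domains (with some chit-chat capability). Its main modules include information extraction, NLU, NLG, and a knowledge graph; the front end is integrated with Django, and the NLP and KG components are already exposed through RESTful interfaces.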
Stars: ✭ 791 (+16.67%)
Mutual labels:  text-classification, nlu, nlg
Awesome Hungarian Nlp
A curated list of NLP resources for Hungarian
Stars: ✭ 121 (-82.15%)
Mutual labels:  dataset, named-entity-recognition, nlu
virtual-assistant
Virtual Assistant
Stars: ✭ 67 (-90.12%)
Mutual labels:  chatbot, nlu, chatbots
Botfuel Dialog
Botfuel SDK to build highly conversational chatbots
Stars: ✭ 96 (-85.84%)
Mutual labels:  chatbot, chatbots, nlu
Rasa
💬 Open source machine learning framework to automate text- and voice-based conversations: NLU, dialogue management, connect to Slack, Facebook, and more - Create chatbots and voice assistants
Stars: ✭ 13,219 (+1849.71%)
Mutual labels:  chatbot, chatbots, nlu
Wisty.js
🧚‍♀️ Chatbot library turning conversations into actions, locally, in the browser.
Stars: ✭ 24 (-96.46%)
Mutual labels:  nlu, named-entity-recognition, chatbots
Chatbot ner
chatbot_ner: Named Entity Recognition for chatbots.
Stars: ✭ 273 (-59.73%)
Mutual labels:  chatbot, chatbots, named-entity-recognition
Clause
🏇 Chatbot, natural language understanding, semantic understanding
Stars: ✭ 323 (-52.36%)
Mutual labels:  chatbot, nlu
Facemoji
😆 A voice chatbot that can imitate your expression. OpenCV+Dlib+Live2D+Moments Recorder+Turing Robot+Iflytek IAT+Iflytek TTS
Stars: ✭ 320 (-52.8%)
Mutual labels:  chatbot, chatbots
Rivescript Js
A RiveScript interpreter for JavaScript. RiveScript is a scripting language for chatterbots.
Stars: ✭ 350 (-48.38%)
Mutual labels:  chatbot, chatbots
Nlp Recipes
Natural Language Processing Best Practices & Examples
Stars: ✭ 5,783 (+752.95%)
Mutual labels:  text-classification, nlu
Spacy Streamlit
👑 spaCy building blocks and visualizers for Streamlit apps
Stars: ✭ 360 (-46.9%)
Mutual labels:  text-classification, named-entity-recognition
Dynamic Seq2seq
seq2seq Chinese chatbot
Stars: ✭ 303 (-55.31%)
Mutual labels:  chatbot, chatbots
Poshbot
Powershell-based bot framework
Stars: ✭ 410 (-39.53%)
Mutual labels:  chatbot, chatbots
Botlibre
An open platform for artificial intelligence, chat bots, virtual agents, social media automation, and live chat automation.
Stars: ✭ 412 (-39.23%)
Mutual labels:  chatbot, nlu
Bertweet
BERTweet: A pre-trained language model for English Tweets (EMNLP-2020)
Stars: ✭ 282 (-58.41%)
Mutual labels:  text-classification, named-entity-recognition
Bert Multitask Learning
BERT for Multitask Learning
Stars: ✭ 380 (-43.95%)
Mutual labels:  text-classification, named-entity-recognition

Chatito

Try the online IDE!

Overview

Chatito helps you generate datasets for training and validating chatbot models using a simple DSL.

If you are building chatbots using commercial models, open source frameworks or writing your own natural language processing model, you need training and testing examples. Chatito is here to help you.

This project contains the:

Chatito language

For the full language specification and documentation, please refer to the DSL spec document.

Tips

Prevent overfit

Overfitting can be prevented if Chatito is used correctly. The idea behind this tool is to sit at the intersection of data augmentation and a description of possible sentence combinations. It is not intended to generate deterministic datasets that may overfit a single-sentence model; in those cases, you can control the generation paths and pull only as many samples as required.
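One way to picture "pulling samples as required" is to sample from the space of sentence combinations instead of enumerating it. The Python sketch below is illustrative only and is not Chatito's generator: `sample_sentences` and its arguments are hypothetical names, and each inner list stands in for the alternatives of one slot or alias.

```python
import random


def sample_sentences(parts, max_samples, seed=None):
    """Draw up to max_samples distinct sentences from all combinations of parts.

    parts is a list of lists; each inner list holds the alternatives for one
    position in the sentence (a stand-in for a slot or alias definition).
    """
    rng = random.Random(seed)

    # Total number of distinct combinations, computed without building them.
    total = 1
    for alternatives in parts:
        total *= len(alternatives)

    # Pick distinct combination indexes; range() is lazy, so nothing is enumerated.
    picked = rng.sample(range(total), min(max_samples, total))

    sentences = []
    for index in picked:
        words = []
        # Decode the index as a mixed-radix number, one digit per position.
        for alternatives in reversed(parts):
            index, choice = divmod(index, len(alternatives))
            words.append(alternatives[choice])
        sentences.append(" ".join(reversed(words)))
    return sentences


parts = [["hi", "hello", "hey"], ["there", "friend"], ["please", "now"]]
samples = sample_sentences(parts, max_samples=5, seed=7)
```

Capping `max_samples` well below the total number of combinations keeps the dataset from simply memorizing one sentence template.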

Tools and resources

Adapters

The language is independent of the generated output format, and each model can receive different parameters and settings. These are the currently implemented data formats; if your provider is not listed, the Tools and resources section has more information on how to support more formats.

NOTE: Samples are not shuffled between intents, both for easier review and because some adapters stream samples directly to the file. It is recommended to split intents into different files for easier review and maintenance.

Rasa

Rasa is an open source machine learning framework for automated text and voice-based conversations. Understand messages, hold conversations, and connect to messaging channels and APIs. Chatito can help you build a dataset for the Rasa NLU component.

One particular behavior of the Rasa adapter is that when a slot definition sentence only contains one alias, and that alias defines the 'synonym' argument with 'true', the generated Rasa dataset will map the alias as a synonym. e.g.:

%[some intent]('training': '1')
    @[some slot]

@[some slot]
    ~[some slot synonyms]

~[some slot synonyms]('synonym': 'true')
    synonym 1
    synonym 2

In this example, the generated Rasa dataset will contain the entity_synonyms of synonym 1 and synonym 2 mapping to some slot synonyms.
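To make the synonym mapping concrete, the Python sketch below builds a lookup table from an `entity_synonyms` section. It assumes the Rasa NLU JSON training-data layout (a top-level `rasa_nlu_data` object containing `entity_synonyms`, each entry with a `value` and a list of `synonyms`). The `dataset` literal is hand-written for illustration, not actual Chatito output, and `build_synonym_lookup` is a hypothetical helper.

```python
def build_synonym_lookup(dataset):
    """Map every synonym string to the value it should be normalized to."""
    lookup = {}
    for entry in dataset["rasa_nlu_data"].get("entity_synonyms", []):
        for synonym in entry["synonyms"]:
            lookup[synonym] = entry["value"]
    return lookup


# Hand-written stand-in for the example above (assumed shape, not real output).
dataset = {
    "rasa_nlu_data": {
        "common_examples": [],
        "entity_synonyms": [
            {"value": "some slot synonyms", "synonyms": ["synonym 1", "synonym 2"]}
        ],
    }
}

lookup = build_synonym_lookup(dataset)
```

In practice the dictionary would come from `json.load` on the generated training file rather than from a literal.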

Flair

Flair is a very simple framework for state-of-the-art NLP, developed by Zalando Research. It provides state-of-the-art pre-trained embeddings (GPT, BERT, RoBERTa, XLNet, ELMo, etc.) for many languages that work out of the box. This adapter supports the text classification dataset in FastText format and the named entity recognition dataset as two-column BIO-annotated words, as documented in the Flair corpus documentation. These two data formats are very common and work with many other providers or models.

The NER dataset requires word tokenization, which is currently done with a simple tokenizer.
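As a rough illustration of the two output shapes, the Python sketch below tokenizes on whitespace, writes two-column BIO lines for NER, and builds a FastText-style `__label__` line for classification. It is not the adapter's code: the whitespace tokenizer, the `(text, slot)` fragment representation, and the function names are assumptions made for this example.

```python
def to_bio_lines(fragments):
    """Convert (text, slot_name_or_None) fragments into 'token tag' lines."""
    lines = []
    for text, slot in fragments:
        for position, token in enumerate(text.split()):
            if slot is None:
                tag = "O"  # token is outside any entity
            else:
                # First token of an entity gets B-, the rest get I-.
                tag = ("B-" if position == 0 else "I-") + slot
            lines.append(f"{token} {tag}")
    return lines


def to_fasttext_line(intent, fragments):
    """Build a '__label__<intent> <sentence>' line for text classification."""
    sentence = " ".join("".join(text for text, _ in fragments).split())
    return f"__label__{intent} {sentence}"


fragments = [("book a table for ", None), ("tomorrow night", "date")]
bio = to_bio_lines(fragments)
line = to_fasttext_line("reservation", fragments)
```

In a column-format file, each sentence's token lines are typically followed by a blank line to mark the sentence boundary.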

NOTE: Flair adapter is only available for the NodeJS NPM CLI package, not for the IDE.

LUIS

LUIS is part of Microsoft's Cognitive Services. Chatito supports training a LUIS NLU model through its batch add labeled utterances endpoint and its batch testing API.

To train a LUIS model, you will need to post the utterances in batches to the relevant API for training or testing.
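Posting in batches amounts to slicing the generated utterances into fixed-size chunks and sending one request per chunk. The Python sketch below covers only the slicing step. `batches` is a hypothetical helper, the record fields are placeholders, and the batch size of 100 is an assumption to verify against the current LUIS API limits.

```python
def batches(utterances, batch_size=100):
    """Yield consecutive slices of at most batch_size utterances."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(utterances), batch_size):
        yield utterances[start:start + batch_size]


# Placeholder records; the real field names depend on the LUIS endpoint used.
utterances = [{"text": f"example {i}", "intentName": "greet"} for i in range(250)]
chunks = list(batches(utterances))
```

Each chunk would then be serialized to JSON and posted to the training or testing endpoint with an HTTP client of your choice.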

Reference issue: #61

Snips NLU

Snips NLU is another great open source framework for NLU. One particular behavior of the Snips adapter is that you can define entity types for the slots. e.g.:

%[date search]('training':'1')
    for @[date]

@[date]('entity': 'snips/datetime')
    ~[today]
    ~[tomorrow]

In the previous example, all @[date] values will be tagged with the snips/datetime entity tag.

Default format

Use the default format if you plan to train a custom model or are writing a custom adapter. This is the most flexible format because you can annotate slots and intents with custom entity arguments, all of which are present in the generated output. For example, you could also include dialog/response generation logic with the DSL. E.g.:

%[some intent]('context': 'some annotation')
    @[some slot] ~[please?]

@[some slot]('required': 'true', 'type': 'some type')
    ~[some alias here]

Custom entities like 'context', 'required' and 'type' will be available in the output, so you can handle these custom arguments however you want.
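If you consume the default format yourself, a common first step is to flatten each example into plain text plus character spans for its slots. The Python sketch below assumes the output is a JSON object mapping each intent name to a list of examples, where every example is a list of fragments with `type` (`Text` or `Slot`), `value`, and, for slots, `slot`. That shape is an assumption, so check a generated file before relying on it. `flatten_example` is a hypothetical helper.

```python
def flatten_example(fragments):
    """Join fragment values into one string and record slot character spans."""
    text = ""
    entities = []
    for fragment in fragments:
        start = len(text)
        text += fragment["value"]
        if fragment["type"] == "Slot":
            entities.append(
                {
                    "slot": fragment["slot"],
                    "start": start,
                    "end": len(text),  # exclusive end offset
                    "value": fragment["value"],
                }
            )
    return text, entities


# Hand-written stand-in for generated output (assumed shape, not real output).
dataset = {
    "some intent": [
        [
            {"type": "Slot", "value": "some alias here", "slot": "some slot"},
            {"type": "Text", "value": " please"},
        ]
    ]
}

text, entities = flatten_example(dataset["some intent"][0])
```

The resulting spans can feed most NER training formats, since they only need the sentence text and start/end offsets.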

NPM package

Chatito supports Node.js >= v8.11.

Install it with yarn or npm:

npm i chatito --save

Then create a definition file (e.g.: trainClimateBot.chatito) with your code.

Run the npm generator:

npx chatito trainClimateBot.chatito

The generated dataset should be available next to your definition file.

Here are the full npm generator options:

npx chatito <pathToFileOrDirectory> --format=<format> --formatOptions=<formatOptions> --outputPath=<outputPath> --trainingFileName=<trainingFileName> --testingFileName=<testingFileName> --defaultDistribution=<defaultDistribution> --autoAliases=<autoAliases>
  • <pathToFileOrDirectory> Path to a .chatito file or a directory that contains chatito files. If it is a directory, the generator searches recursively for all *.chatito files inside and uses them to generate the dataset. e.g.: lightsChange.chatito or ./chatitoFilesFolder

  • <format> Optional. default, rasa, luis, flair or snips.

  • <formatOptions> Optional. Path to a .json file that each adapter can optionally use.

  • <outputPath> Optional. The directory where to save the generated datasets. Uses the current directory as default.

  • <trainingFileName> Optional. The name of the generated training dataset file. Do not forget to add a .json extension at the end. Uses <format>_dataset_training.json as default file name.

  • <testingFileName> Optional. The name of the generated testing dataset file. Do not forget to add a .json extension at the end. Uses <format>_dataset_testing.json as default file name.

  • <defaultDistribution> Optional. The default frequency distribution if not defined at the entity level. Defaults to regular and can be set to even.

  • <autoAliases> Optional. The generator behavior when finding an undefined alias. Valid options are allow, warn, restrict. Defaults to 'allow'.
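When the generator runs as part of a larger pipeline, the options above can be assembled programmatically. The Python sketch below only builds the argument list, using flags documented above. `build_chatito_command` is a hypothetical helper, and actually running the command assumes Node.js and the chatito package are installed.

```python
FORMATS = {"default", "rasa", "luis", "flair", "snips"}
AUTO_ALIASES = {"allow", "warn", "restrict"}


def build_chatito_command(path, fmt=None, output_path=None, auto_aliases=None):
    """Return the npx argument list for the chatito generator."""
    if fmt is not None and fmt not in FORMATS:
        raise ValueError(f"unknown format: {fmt}")
    if auto_aliases is not None and auto_aliases not in AUTO_ALIASES:
        raise ValueError(f"unknown autoAliases value: {auto_aliases}")

    command = ["npx", "chatito", path]
    # Only pass the flags that were explicitly set; the rest keep their defaults.
    if fmt is not None:
        command.append(f"--format={fmt}")
    if output_path is not None:
        command.append(f"--outputPath={output_path}")
    if auto_aliases is not None:
        command.append(f"--autoAliases={auto_aliases}")
    return command


command = build_chatito_command(
    "trainClimateBot.chatito", fmt="rasa", output_path="./dataset"
)
```

The list can be passed to `subprocess.run(command, check=True)` so that a failed generation stops the pipeline.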

Author and maintainer

Rodrigo Pimentel

sr.rodrigopv[at]gmail
