
Erlemar / Simple_chat_bot

Licence: other
Simple NLP chatbot

Programming Languages

python

Projects that are alternatives of or similar to Simple chat bot

img2vec-keras
Image to dense vector embedding. Clone of https://github.com/christiansafka/img2vec for Keras users
Stars: ✭ 36 (+56.52%)
Mutual labels:  embeddings
Entity Embedding
Reference implementation of the paper "Word Embeddings for Entity-annotated Texts"
Stars: ✭ 19 (-17.39%)
Mutual labels:  embeddings
meemi
Improving cross-lingual word embeddings by meeting in the middle
Stars: ✭ 20 (-13.04%)
Mutual labels:  embeddings
geometric embedding
"Zero-Training Sentence Embedding via Orthogonal Basis" paper implementation
Stars: ✭ 19 (-17.39%)
Mutual labels:  embeddings
event-embedding-multitask
*SEM 2018: Learning Distributed Event Representations with a Multi-Task Approach
Stars: ✭ 22 (-4.35%)
Mutual labels:  embeddings
embedding evaluation
Evaluate your word embeddings
Stars: ✭ 32 (+39.13%)
Mutual labels:  embeddings
Whatlies
Toolkit to help understand "what lies" in word embeddings. Also benchmarking!
Stars: ✭ 246 (+969.57%)
Mutual labels:  embeddings
ar-embeddings
Sentiment Analysis for Arabic Text (tweets, reviews, and standard Arabic) using word2vec
Stars: ✭ 83 (+260.87%)
Mutual labels:  embeddings
image embeddings
Using efficientnet to provide embeddings for retrieval
Stars: ✭ 107 (+365.22%)
Mutual labels:  embeddings
ClusterTransformer
Topic clustering library built on Transformer embeddings and cosine similarity metrics. Compatible with all BERT base transformers from huggingface.
Stars: ✭ 36 (+56.52%)
Mutual labels:  embeddings
Probabilistic-RNN-DA-Classifier
Probabilistic Dialogue Act Classification for the Switchboard Corpus using an LSTM model
Stars: ✭ 22 (-4.35%)
Mutual labels:  embeddings
embedding study
Learning character vectors generated by Chinese pre-trained models; testing how BERT and ELMo perform on Chinese
Stars: ✭ 94 (+308.7%)
Mutual labels:  embeddings
PersianNER
Named-Entity Recognition in Persian Language
Stars: ✭ 48 (+108.7%)
Mutual labels:  embeddings
word-embeddings-from-scratch
Creating word embeddings from scratch and visualize them on TensorBoard. Using trained embeddings in Keras.
Stars: ✭ 22 (-4.35%)
Mutual labels:  embeddings
cskg
CSKG: The CommonSense Knowledge Graph
Stars: ✭ 86 (+273.91%)
Mutual labels:  embeddings
simple elmo
Simple library to work with pre-trained ELMo models in TensorFlow
Stars: ✭ 49 (+113.04%)
Mutual labels:  embeddings
DeepLearningReading
Deep Learning and Machine Learning mini-projects. Current Project: Deepmind Attentive Reader (rc-data)
Stars: ✭ 78 (+239.13%)
Mutual labels:  embeddings
TCE
This repository contains the code implementation used in the paper Temporally Coherent Embeddings for Self-Supervised Video Representation Learning (TCE).
Stars: ✭ 51 (+121.74%)
Mutual labels:  embeddings
CaRE
EMNLP 2019: CaRe: Open Knowledge Graph Embeddings
Stars: ✭ 34 (+47.83%)
Mutual labels:  embeddings
SubGNN
Subgraph Neural Networks (NeurIPS 2020)
Stars: ✭ 136 (+491.3%)
Mutual labels:  embeddings

Telegram chat-bot is stopped

This bot was a project for fun, so it had few users. Running a chat-bot on an Amazon instance is quite costly (about $40 per month), so I stopped it.

Basic NLP chat-bot

This chatbot was created based on the final project of this course: https://www.coursera.org/learn/language-processing/home/welcome and later updated to meet the requirements of the honor assignment.

The main functionality of the bot is to distinguish two types of questions (questions related to programming and others) and then either give an answer or talk using a conversational model.

Distinguishing intent

At first I had two datasets: StackOverflow posts (programming questions for 10 languages) and dialogue phrases from movie subtitles (non-programming questions). A TfidfVectorizer combined with LogisticRegression was used to build a model that classifies the user's question into one of these two categories.
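
A minimal sketch of the intent classifier described above, assuming scikit-learn's TfidfVectorizer and LogisticRegression with default settings. The toy sentences and the `predict_intent` helper are invented for illustration; they are not the project's data or code.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy stand-ins for the two corpora (hypothetical examples, not the real data).
programming = [
    "How do I reverse a list in Python?",
    "Java throws a null pointer exception",
    "Merge two dictionaries in Python",
    "C++ segfault when dereferencing a pointer",
]
dialogue = [
    "What are you doing tonight?",
    "I love you so much",
    "Where were you last night?",
    "Nice to meet you, my friend",
]

texts = programming + dialogue
labels = ["programming"] * len(programming) + ["dialogue"] * len(dialogue)

# TF-IDF features feed a binary logistic regression classifier.
intent_clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
intent_clf.fit(texts, labels)


def predict_intent(question):
    """Return "programming" or "dialogue" for a single user question."""
    return intent_clf.predict([question])[0]
```

With real data the two corpora would be far larger, and the fitted pipeline could be pickled so that the bot only has to load it at start-up.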

Programming language classification

A OneVsRestClassifier was trained on StackOverflow posts labelled with 10 tags, so that it can predict the tag (programming language) of a question.
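
The tag prediction step could look roughly like the sketch below. It assumes TF-IDF features and LogisticRegression as the base estimator inside scikit-learn's OneVsRestClassifier; the source names only OneVsRestClassifier, so the base estimator, the three toy tags and the `predict_tag` helper are assumptions made for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline

# Toy posts with one tag each (hypothetical; the project used 10 tags).
posts = [
    "python list comprehension syntax",
    "python dictionary merge keys",
    "java null pointer exception",
    "java interface abstract class",
    "javascript promise async await",
    "javascript array map callback",
]
tags = ["python", "python", "java", "java", "javascript", "javascript"]

# One binary classifier is fitted per tag; the highest-scoring tag wins.
tag_clf = make_pipeline(
    TfidfVectorizer(),
    OneVsRestClassifier(LogisticRegression()),
)
tag_clf.fit(posts, tags)


def predict_tag(question):
    """Return the most likely tag for a programming question."""
    return tag_clf.predict([question])[0]
```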

Finding the most relevant answer for the programming question

StarSpace (a Facebook model) embeddings were trained on StackOverflow posts, and all posts were represented as vectors using these embeddings. A separate embeddings file was created for each tag, so that the embeddings never have to be loaded into memory all at once. The user's question is vectorized in the same way, and the most relevant answer is selected by cosine similarity between the question and the answers belonging to the predicted tag.
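
The retrieval step can be sketched in plain Python as follows. The in-memory `EMBEDDINGS_BY_TAG` dictionary, the post ids and the three-dimensional vectors are hypothetical placeholders for the per-tag embedding files; only the cosine similarity ranking restricted to the predicted tag reflects the description above.

```python
import math

# Hypothetical per-tag stores: tag -> list of (post_id, embedding vector).
# In the project each tag has its own embeddings file, so only the
# store for the predicted tag would need to be in memory.
EMBEDDINGS_BY_TAG = {
    "python": [(101, [0.9, 0.1, 0.0]), (102, [0.1, 0.9, 0.2])],
    "java": [(201, [0.0, 0.2, 0.9])],
}


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # a zero vector has no direction
    return dot / (norm_u * norm_v)


def most_relevant_post(question_vec, tag):
    """Return the id of the post under `tag` closest to the question."""
    candidates = EMBEDDINGS_BY_TAG[tag]
    best_id, _ = max(
        candidates,
        key=lambda item: cosine_similarity(question_vec, item[1]),
    )
    return best_id
```

Restricting the search to one tag keeps both the memory footprint and the number of similarity computations small, which matters on a small instance.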

Conversation

If the question is classified as non-programming, the conversational bot is activated. I used the chatbot idea from the Seq2seq-Chatbot-for-Keras repository (https://github.com/oswaldoludwig/Seq2seq-Chatbot-for-Keras). The chatbot is trained using teacher forcing. I took the pre-trained weights and fine-tuned the model on my own data. I also rewrote most of the code to make it easier to use and understand. Most of the model's parameters are set in the settings.ini file.
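
As a generic illustration of teacher forcing (not the exact data layout of the referenced repository, which may differ), the decoder is fed the reference answer shifted by one position and is asked to predict the next token at every step. The `<bos>` and `<eos>` marker names and the `teacher_forcing_pair` helper are assumptions.

```python
BOS, EOS = "<bos>", "<eos>"  # hypothetical start/end markers


def teacher_forcing_pair(answer_tokens):
    """Build (decoder_input, decoder_target) for one reference answer.

    The decoder sees the ground-truth previous token at each step,
    rather than its own earlier prediction.
    """
    full = [BOS] + list(answer_tokens) + [EOS]
    decoder_input = full[:-1]   # everything except the final token
    decoder_target = full[1:]   # the same sequence shifted left by one
    return decoder_input, decoder_target


dec_in, dec_out = teacher_forcing_pair(["i", "am", "fine"])
for step, (seen, expected) in enumerate(zip(dec_in, dec_out)):
    print(step, seen, "->", expected)
```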

Additional functionality

I decided to include some more functionality.

  • The bot can give a weather forecast for 5 days, using the OpenWeatherMap API;
  • The bot can show the latest tweet of a given user, using the Twitter API;
  • The bot can give a random fact about the current date, using http://numbersapi.com/.
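
As an example of how one of these integrations might be wired up, the sketch below builds a Numbers API request for a date fact. The `date_fact_url` and `fetch_date_fact` helpers are hypothetical and the bot's actual implementation is not shown in this README; the sketch assumes the service's `/<month>/<day>/date` URL form.

```python
import datetime
from urllib.request import urlopen


def date_fact_url(month, day):
    """Build the Numbers API URL for a fact about the given date."""
    return f"http://numbersapi.com/{month}/{day}/date"


def fetch_date_fact(month, day, timeout=10):
    """Download the fact as plain text (requires network access)."""
    with urlopen(date_fact_url(month, day), timeout=timeout) as response:
        return response.read().decode("utf-8")


def todays_fact_url():
    """URL for a fact about the current date."""
    today = datetime.date.today()
    return date_fact_url(today.month, today.day)
```

Calling `fetch_date_fact(2, 29)` would return a one-sentence fact as a string, which the bot could forward to the user unchanged.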

Bot limitations

Originally this bot was hosted on the t2.micro tier of Amazon EC2, which meant quite limited resources. This is why the embeddings for each tag were saved separately: to limit memory usage. Currently the bot is hosted on the t2.medium tier, so that the Keras model has enough memory. The quality of the model could be better, but that would require a lot more resources.

Files

dialogue_manager.py - generates an answer for the user's input.

main_bot.py - main functionality of the bot - receiving user's input and sending the answer.

utils.py - additional functions for dealing with data.

settings.ini - model parameters.

settings_secret.ini - twitter and telegram tokens, paths to files.

data folder - contains embeddings and pickled models.

thread_embeddings_by_tags - embeddings for stackoverflow posts.

processer.py - processing data for training with chatbot.

chatbot.py - chatbot on keras.

Files in these two folders are too big to be uploaded to GitHub. Also, I'm not sure I may upload them, as they are part of the Coursera course. If you wish to make something similar, please join the course.

Link to the bot

http://t.me/amlnlpbot

Training the bot on your own data

It is possible to train the bot on your own data. The data can be in one or several files, with each utterance on a separate line. If utterances are marked by names or some special symbols, add them to the to_spaces list, so that they will be replaced by spaces. The model will perform better if dialogues have several lines, as in this case more questions will have context.
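
The effect of the to_spaces list can be pictured with the following sketch. The `clean_utterances` function is hypothetical and only mirrors the behaviour described above (markers replaced by spaces); the actual handling inside processer.py is not shown in this README.

```python
def clean_utterances(lines, to_spaces):
    """Replace every marker in `to_spaces` with a space and tidy whitespace."""
    cleaned = []
    for line in lines:
        for marker in to_spaces:
            line = line.replace(marker, " ")
        line = " ".join(line.split())  # collapse runs of whitespace
        if line:                       # skip lines that end up empty
            cleaned.append(line)
    return cleaned


raw = ["ANNA: How are you?", "BOB: -- Fine, thanks.", "   "]
cleaned = clean_utterances(raw, to_spaces=["ANNA:", "BOB:", "--"])
print(cleaned)
```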

If you want to train the model from scratch, you can do the following:

from processer import DataProcesser
from chatbot import Chatbot

# list of files to use
list_of_files = []
processer = DataProcesser()
processer.process_text_initial(list_of_files)

bot = Chatbot()
bot.train()

In this case a new vocabulary will be created based on your data. You'll need to download GloVe embeddings (or some other embeddings) and specify them in the settings.ini file. I'd recommend training for at least 100 epochs. An important point is that the model requires a lot of memory due to its architecture and teacher forcing. If you get a memory error, set num_subsets in settings.ini to a higher value.
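
To illustrate why a higher num_subsets helps, the sketch below reads the value with configparser and splits the training samples into that many chunks, so that only one chunk has to be prepared at a time. The `[training]` section name and both helper functions are assumptions; the real layout of settings.ini and the way the project uses num_subsets may differ.

```python
import configparser

# Hypothetical excerpt of settings.ini; the section name is an assumption.
SAMPLE_INI = """
[training]
num_subsets = 4
"""


def read_num_subsets(ini_text):
    """Parse num_subsets from INI-formatted text."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    return parser.getint("training", "num_subsets")


def split_into_subsets(samples, num_subsets):
    """Split samples into at most num_subsets consecutive chunks."""
    if not samples:
        return []
    size = -(-len(samples) // num_subsets)  # ceiling division
    return [samples[i:i + size] for i in range(0, len(samples), size)]


num_subsets = read_num_subsets(SAMPLE_INI)
subsets = split_into_subsets(list(range(10)), num_subsets)
print([len(chunk) for chunk in subsets])
```

More subsets means smaller chunks, which lowers peak memory at the cost of more passes over the data per epoch.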

If you want to fine-tune the model on additional data, use the add_data method of DataProcesser. You will also need to download the pre-trained weights file, my_model_weights20.h5.
