
adamshamsudeen / Vaaku2Vec

License: GPL-3.0
Language Modeling and Text Classification in Malayalam Language using ULMFiT

Programming Languages

  • Jupyter Notebook
  • Python

Projects that are alternatives to or similar to Vaaku2Vec

Nlp chinese corpus
Large-scale Chinese corpus for natural language processing (Large Scale Chinese Corpus for NLP)
Stars: ✭ 6,656 (+9688.24%)
Mutual labels:  text-classification, word2vec, language-model
Lightnlp
A deep learning framework for natural language processing based on PyTorch and torchtext.
Stars: ✭ 739 (+986.76%)
Mutual labels:  text-classification, word2vec, language-model
Sentiment analysis fine grain
Multi-label Classification with BERT; Fine Grained Sentiment Analysis from AI challenger
Stars: ✭ 546 (+702.94%)
Mutual labels:  text-classification, language-model
Nlp In Practice
Starter code to solve real world text data problems. Includes: Gensim Word2Vec, phrase embeddings, Text Classification with Logistic Regression, word count with pyspark, simple text preprocessing, pre-trained embeddings and more.
Stars: ✭ 790 (+1061.76%)
Mutual labels:  text-classification, word2vec
Bert language understanding
Pre-training of Deep Bidirectional Transformers for Language Understanding: pre-train TextCNN
Stars: ✭ 933 (+1272.06%)
Mutual labels:  text-classification, language-model
Text Cnn
Chinese text classification with a CNN using Word2vec word embeddings
Stars: ✭ 298 (+338.24%)
Mutual labels:  text-classification, word2vec
Nlp Projects
word2vec, sentence2vec, machine reading comprehension, dialog system, text classification, pretrained language model (i.e., XLNet, BERT, ELMo, GPT), sequence labeling, information retrieval, information extraction (i.e., entity, relation and event extraction), knowledge graph, text generation, network embedding
Stars: ✭ 360 (+429.41%)
Mutual labels:  text-classification, word2vec
Deep-NLP-Resources
Curated list of all NLP Resources
Stars: ✭ 65 (-4.41%)
Mutual labels:  text-classification, language-model
Text rnn attention
Chinese text classification with an RNN + attention model using Word2vec word embeddings
Stars: ✭ 117 (+72.06%)
Mutual labels:  text-classification, word2vec
Ml Projects
ML based projects such as Spam Classification, Time Series Analysis, Text Classification using Random Forest, Deep Learning, Bayesian, Xgboost in Python
Stars: ✭ 127 (+86.76%)
Mutual labels:  text-classification, word2vec
Fasttext.js
FastText for Node.js
Stars: ✭ 127 (+86.76%)
Mutual labels:  text-classification, word2vec
Bertweet
BERTweet: A pre-trained language model for English Tweets (EMNLP-2020)
Stars: ✭ 282 (+314.71%)
Mutual labels:  text-classification, language-model
Product-Categorization-NLP
Multi-Class Text Classification for products based on their description with Machine Learning algorithms and Neural Networks (MLP, CNN, Distilbert).
Stars: ✭ 30 (-55.88%)
Mutual labels:  text-classification, word2vec
Spacy
💫 Industrial-strength Natural Language Processing (NLP) in Python
Stars: ✭ 21,978 (+32220.59%)
Mutual labels:  text-classification, tokenization
FNet-pytorch
Unofficial implementation of Google's FNet: Mixing Tokens with Fourier Transforms
Stars: ✭ 204 (+200%)
Mutual labels:  text-classification, language-model
Text Pairs Relation Classification
Text pairs (sentence-level) classification (similarity modeling) based on neural networks.
Stars: ✭ 182 (+167.65%)
Mutual labels:  text-classification, word2vec
sarcasm-detection-for-sentiment-analysis
Sarcasm Detection for Sentiment Analysis
Stars: ✭ 21 (-69.12%)
Mutual labels:  text-classification, word2vec
text-classification-cn
Chinese text classification practice based on the Sogou news corpus, using traditional machine learning methods as well as pre-trained models
Stars: ✭ 81 (+19.12%)
Mutual labels:  text-classification, word2vec
Few Shot Text Classification
Few-shot binary text classification with Induction Networks and Word2Vec weights initialization
Stars: ✭ 32 (-52.94%)
Mutual labels:  text-classification, word2vec
Lotclass
[EMNLP 2020] Text Classification Using Label Names Only: A Language Model Self-Training Approach
Stars: ✭ 160 (+135.29%)
Mutual labels:  text-classification, language-model

Vaaku2Vec

State-of-the-Art Language Modeling and Text Classification in Malayalam Language

Results

We trained a Malayalam language model on the Wikipedia article dump from October 2018. The Wikipedia dump had 55k+ articles. The difficulty in training a Malayalam language model is text tokenization, since Malayalam is a highly inflectional and agglutinative language. The current model uses the NLTK tokenizer (we will try better alternatives in the future) and the vocab size is 30k. The language model was used to train a classifier that classifies a news article into 5 categories (India, Kerala, Sports, Business, Entertainment). Our classifier achieved a whopping 92% accuracy in the classification task.
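
A minimal sketch of the tokenization step described above, assuming the Wikipedia articles are available as plain-text strings (the function and variable names here are illustrative, not the repository's own):

    import nltk
    from nltk.tokenize import word_tokenize

    nltk.download('punkt')  # one-time download of the NLTK tokenizer models

    def tokenize_articles(articles):
        """Tokenize a list of Malayalam article strings with the NLTK word tokenizer."""
        return [word_tokenize(article) for article in articles]

    # tok_trn = tokenize_articles(train_articles)  # train_articles: list of strings
    # tok_val = tokenize_articles(valid_articles)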

Releases

  • Processed Wikipedia dump of articles split into test and train sets.
  • Script and weights for the Malayalam language model.
  • Malayalam text classifier with pretrained weights.
  • Inference code for the text classifier.

Downloads

Requirements

Installing dependencies

Python >= 3.6

If you are using virtualenvwrapper, use the following steps:

  1. git clone https://github.com/adamshamsudeen/Vaaku2Vec.git
  2. mkvirtualenv -p python3.6 venv
  3. workon venv
  4. cd Vaaku2Vec
  5. pip install -r requirements.txt

Usage

Training language model with preprocessed data:

  1. Download the pretrained language model folder; it contains the preprocessed test and train CSVs. If you would like to preprocess and retrain the LM on the latest article dump, use the scripts provided here.
  2. Create tokens:
    python lm/create_toks.py <path_to_processed_wiki_dump>
    eg: python lm/create_toks.py /home/adamshamsudeen/mal/Vaaku2Vec/wiki/ml/
  3. Create a token to id mapping (a conceptual sketch of steps 2 and 3 follows this list):
    python lm/tok2id.py <path_to_processed_wiki_dump>
    eg: python lm/tok2id.py /home/adamshamsudeen/mal/Vaaku2Vec/wiki/ml/
  4. Train language model:
    python lm/pretrain_lm.py <path_to_processed_wiki_dump> 0 --lr 1e-3 --cl 40
    eg: python lm/pretrain_lm.py /home/adamshamsudeen/mal/Vaaku2Vec/wiki/ml/ 0 --lr 1e-3 --cl 40
    lr is the learning rate and cl is the number of epochs.
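
A conceptual sketch of what steps 2 and 3 produce: a capped vocabulary (roughly the 30k tokens mentioned in Results) and integer-id versions of the tokenized articles. The real lm/create_toks.py and lm/tok2id.py scripts handle chunking, special tokens and file layout; the function names and paths below are placeholders only.

    import collections
    import pickle

    import numpy as np

    def build_itos(tok_trn, max_vocab=30000, min_freq=2):
        """Most-frequent-first token list, with unknown/padding markers up front."""
        freq = collections.Counter(t for doc in tok_trn for t in doc)
        itos = [t for t, c in freq.most_common(max_vocab) if c >= min_freq]
        return ['_unk_', '_pad_'] + itos

    def numericalize(tok_docs, itos):
        """Map each token to its id; unseen tokens fall back to the _unk_ id (0)."""
        stoi = collections.defaultdict(int, {t: i for i, t in enumerate(itos)})
        return [np.array([stoi[t] for t in doc]) for doc in tok_docs]

    # itos = build_itos(tok_trn)
    # trn_ids = numericalize(tok_trn, itos)
    # pickle.dump(itos, open('wiki/ml/tmp/itos.pkl', 'wb'))  # placeholder path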

Training the classifier:

  1. Use train_classifier.ipynb to train a Malayalam text classifier (a data-preparation sketch follows this list).
  2. We have not released the news dataset; raise a request if you want to experiment with it.
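
A hypothetical data-preparation sketch for the classifier notebook, assuming a CSV with text and category columns (the news dataset is not released, so the path and column names below are assumptions, not the notebook's actual inputs). It uses the same NLTK tokenizer as the language model:

    import pandas as pd
    from nltk.tokenize import word_tokenize
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import LabelEncoder

    CATEGORIES = ['India', 'Kerala', 'Sports', 'Business', 'Entertainment']

    df = pd.read_csv('news.csv')                        # placeholder path and columns
    y = LabelEncoder().fit(CATEGORIES).transform(df['category'])
    tokens = [word_tokenize(t) for t in df['text']]     # same tokenizer as the LM
    trn_toks, val_toks, trn_y, val_y = train_test_split(
        tokens, y, test_size=0.1, stratify=y, random_state=42)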

Testing the classifier:

  1. To test the classifier trained on Manorama news, download the pretrained Malayalam text classifier mentioned in Downloads.
  2. Use prediction.ipynb to test it on your own input.

We manually tested the model on news from other leading newspapers, and it performed quite well (see result).

Word2Vec:

  1. We also trained a word2vec model using gensim with the Wikipedia dump.
  2. You can also use the word2vec model to train a text classifier (see the sketch below). News Classifier
  3. You can see the word2vec demo at the link below.

Demo
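
A minimal sketch of training a word2vec model with gensim on the tokenized Wikipedia articles and using averaged vectors for a simple classifier. The parameter values are illustrative, not the ones used for the released model, and gensim >= 4 is assumed (older versions use size instead of vector_size):

    import numpy as np
    from gensim.models import Word2Vec
    from sklearn.linear_model import LogisticRegression

    # tokenized_articles: list of token lists, e.g. from the NLTK tokenization step
    w2v = Word2Vec(sentences=tokenized_articles, vector_size=300, window=5,
                   min_count=5, workers=4, sg=1)

    def doc_vector(tokens, model):
        """Average the vectors of in-vocabulary tokens; zeros if none are known."""
        vecs = [model.wv[t] for t in tokens if t in model.wv]
        return np.mean(vecs, axis=0) if vecs else np.zeros(model.vector_size)

    # X_trn = np.stack([doc_vector(t, w2v) for t in trn_toks])
    # clf = LogisticRegression(max_iter=1000).fit(X_trn, trn_y)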

TODO

  • Malayalam Language modeling based on wikipedia articles.
  • Release Trained Language Models weights.
  • Malayalam Text classifier script.
  • Benchmark with mlmorph for tokenization.
  • Benchmark with byte-pair encoding for tokenization.
  • UI to train and test classifier.
  • Basic Chatbot using this implementation.

Thanks

  1. Special thanks to Sebastian Ruder, Jeremy Howard, and other contributors to fastai and ULMFiT.
  2. Logo base design
  3. Raeesa for designing the logo.

Contributors

  1. Kamal K Raj
  2. Adam Shamsudeen