
tugstugi / Mongolian Bert

Pre-trained Mongolian BERT models

Projects that are alternatives of or similar to Mongolian Bert

Python Tutorial Notebooks
Python tutorials as Jupyter Notebooks for NLP, ML, AI
Stars: ✭ 52 (+147.62%)
Mutual labels:  jupyter-notebook, natural-language-processing, natural-language-understanding
Speech Emotion Analyzer
The neural network model is capable of detecting five different emotions in male and female speech audio. (Deep Learning, NLP, Python)
Stars: ✭ 633 (+2914.29%)
Mutual labels:  jupyter-notebook, natural-language-processing, natural-language-understanding
Deep Nlp Seminars
Materials for deep NLP course
Stars: ✭ 113 (+438.1%)
Mutual labels:  jupyter-notebook, natural-language-processing, natural-language-understanding
Coursera Natural Language Processing Specialization
Programming assignments from all courses in the Coursera Natural Language Processing Specialization offered by deeplearning.ai.
Stars: ✭ 39 (+85.71%)
Mutual labels:  jupyter-notebook, natural-language-processing, natural-language-understanding
Spark Nlp Models
Models and Pipelines for the Spark NLP library
Stars: ✭ 88 (+319.05%)
Mutual labels:  jupyter-notebook, natural-language-processing, natural-language-understanding
Natural Language Processing Specialization
This repo contains my coursework, assignments, and slides for the Natural Language Processing Specialization by deeplearning.ai on Coursera
Stars: ✭ 151 (+619.05%)
Mutual labels:  jupyter-notebook, natural-language-processing, natural-language-understanding
Practical Nlp
Official Repository for 'Practical Natural Language Processing' by O'Reilly Media
Stars: ✭ 452 (+2052.38%)
Mutual labels:  jupyter-notebook, natural-language-processing, natural-language-understanding
Conv Emotion
This repo contains implementation of different architectures for emotion recognition in conversations.
Stars: ✭ 646 (+2976.19%)
Mutual labels:  natural-language-processing, natural-language-understanding
Madewithml
Learn how to responsibly deliver value with ML.
Stars: ✭ 29,253 (+139200%)
Mutual labels:  jupyter-notebook, natural-language-processing
Ai Series
📚 [.md & .ipynb] A series on artificial intelligence & deep learning, including mathematics fundamentals, Python practice, NLP applications, etc. 💫 Hands-on AI and deep learning: mathematical statistics | machine learning | deep learning | natural language processing | tool practice with Scikit, TensorFlow & PyTorch | industry applications & course notes
Stars: ✭ 702 (+3242.86%)
Mutual labels:  jupyter-notebook, natural-language-processing
Coursera
Quiz & Assignment of Coursera
Stars: ✭ 774 (+3585.71%)
Mutual labels:  jupyter-notebook, natural-language-processing
Machine Learning
A repository created to help machine learning beginners and those preparing for a study group.
Stars: ✭ 705 (+3257.14%)
Mutual labels:  jupyter-notebook, natural-language-processing
Nlp In Practice
Starter code to solve real world text data problems. Includes: Gensim Word2Vec, phrase embeddings, Text Classification with Logistic Regression, word count with pyspark, simple text preprocessing, pre-trained embeddings and more.
Stars: ✭ 790 (+3661.9%)
Mutual labels:  jupyter-notebook, natural-language-processing
Nlp Recipes
Natural Language Processing Best Practices & Examples
Stars: ✭ 5,783 (+27438.1%)
Mutual labels:  natural-language-processing, natural-language-understanding
This Word Does Not Exist
This Word Does Not Exist
Stars: ✭ 640 (+2947.62%)
Mutual labels:  natural-language-processing, natural-language-understanding
Bert
TensorFlow code and pre-trained models for BERT
Stars: ✭ 29,971 (+142619.05%)
Mutual labels:  natural-language-processing, natural-language-understanding
Me bot
Build a bot that speaks like you!
Stars: ✭ 641 (+2952.38%)
Mutual labels:  jupyter-notebook, natural-language-processing
Ecco
Visualize and explore NLP language models. Ecco creates interactive visualizations directly in Jupyter notebooks explaining the behavior of Transformer-based language models (like GPT2).
Stars: ✭ 723 (+3342.86%)
Mutual labels:  jupyter-notebook, natural-language-processing
Covid 19 Bert Researchpapers Semantic Search
A BERT semantic search engine for searching COVID-19 research papers in Google Colab
Stars: ✭ 23 (+9.52%)
Mutual labels:  jupyter-notebook, natural-language-processing
Awesome Ai Ml Dl
Awesome Artificial Intelligence, Machine Learning and Deep Learning as we learn it. Study notes and a curated list of awesome resources of such topics.
Stars: ✭ 831 (+3857.14%)
Mutual labels:  jupyter-notebook, natural-language-processing

Mongolian BERT models

This repository contains pre-trained Mongolian BERT models trained by tugstugi, enod and sharavsambuu. Special thanks to nabar who provided 5x TPUs.

This repository is based on the following open source projects: google-research/bert, huggingface/pytorch-pretrained-BERT and yoheikikuta/bert-japanese.

Models

SentencePiece with a vocabulary size of 32,000 is used as the text tokenizer. You can use the masked language model notebook (Open in Colab) to test how well the pre-trained models predict masked Mongolian words.
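
As a rough illustration of what the notebook does, the sketch below predicts a masked Mongolian word with Hugging Face transformers. It assumes the downloaded PyTorch checkpoint has been exported to a transformers-compatible directory with a matching tokenizer; the mongolian-bert-cased path and the example sentence are hypothetical, and the Colab notebook remains the reference.

import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_dir = "mongolian-bert-cased"  # hypothetical local export of the PyTorch model
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForMaskedLM.from_pretrained(model_dir).eval()

# Example sentence: "I will go [MASK] today."
inputs = tokenizer("Би өнөөдөр [MASK] явна.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Top 5 candidate tokens for the masked position
mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
print(tokenizer.convert_ids_to_tokens(logits[0, mask_pos].topk(5).indices[0].tolist()))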

Cased BERT-Base

Download either TensorFlow checkpoint or PyTorch model. Eval results:

global_step = 4000000
loss = 1.3476765
masked_lm_accuracy = 0.7069192
masked_lm_loss = 1.2822781
next_sentence_accuracy = 0.99875
next_sentence_loss = 0.0038988923

Uncased BERT-Base

Download either TensorFlow checkpoint or PyTorch model. Eval results:

global_step = 4000000
loss = 1.3115116
masked_lm_accuracy = 0.7018335
masked_lm_loss = 1.3155857
next_sentence_accuracy = 0.995
next_sentence_loss = 0.015816934

Loading in Tensorflow 2.x

Only minor changes are needed to load the weights as a Keras layer in TensorFlow 2.x (Open in Colab).
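
Independent of the notebook, a quick way to confirm in TensorFlow 2.x that a downloaded checkpoint is readable is to list its variables. The checkpoint path below is a hypothetical local path; adjust it to wherever you extracted the download.

import tensorflow as tf

ckpt = "cased_bert_base/model.ckpt-4000000"  # hypothetical path to the extracted checkpoint
reader = tf.train.load_checkpoint(ckpt)
# Print the first few variable names and shapes stored in the checkpoint
for name, shape in sorted(reader.get_variable_to_shape_map().items())[:10]:
    print(name, shape)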

Finetuning

This repo contains only pre-trained BERT models; for fine-tuning, see:

Pre-Training

This repo already provides pre-trained models. If you really want to pre-train from scratch, you will need a TPU. A base model can be trained in 13 days (4M steps) on a TPUv2; a large model needs more than a month. We used max_seq_length=512 for the entire training instead of training first with max_seq_length=128 and then with max_seq_length=512, because it gave better masked LM accuracy.
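
For a rough sense of scale, "4M steps in 13 days" implies a throughput of a few steps per second:

# Back-of-envelope throughput implied by 4M steps in 13 days on a TPUv2
steps = 4_000_000
seconds = 13 * 24 * 3600
print(round(steps / seconds, 1), "steps/second")  # ~3.6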

Install

Check out the project and install its dependencies:

git clone --recursive https://github.com/tugstugi/mongolian-bert.git
pip3 install -r requirements.txt

Data preparation

Download the Mongolian Wikipedia and the 700 million word Mongolian news data set and pre-process them into the directory mn_corpus/:

# Mongolian Wikipedia
python3 datasets/dl_and_preprop_mn_wiki.py
# 700 million words Mongolian news data set
python3 datasets/dl_and_preprop_mn_news.py

After pre-processing, the dataset will contain around 500M words.
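
A quick, rough way to verify that figure is a whitespace-based word count over the pre-processed files:

# Rough word count over the pre-processed corpus in mn_corpus/
import glob
total = 0
for path in glob.glob("mn_corpus/*.txt"):
    with open(path, encoding="utf-8") as f:
        total += sum(len(line.split()) for line in f)
print(f"{total:,} words")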

Train SentencePiece vocabulary

Now train the cased SentencePiece model, i.e. with a vocabulary size of 32000:

cd sentencepiece
cat ../mn_corpus/*.txt > all.txt
python3 train_sentencepiece.py --input all.txt --vocab-size 32000 --prefix mn_cased

If the training was successful, the following files should be created: mn_cased.model and mn_cased.vocab. You can also test whether the SentencePiece model is working as intended:

>>> import sentencepiece as spm
>>> s = spm.SentencePieceProcessor()
>>> s.Load('mn_cased.model')
>>> s.EncodeAsPieces('Мөнгөө тушаачихсаныхаа дараа мэдэгдээрэй')
['▁Мөнгөө', '▁тушаа', 'чихсан', 'ыхаа', '▁дараа', '▁мэдэгд', 'ээрэй']

For an uncased SentencePiece model, convert the content of all.txt to lower case (see the sketch after the command below) and train with:

python3 train_sentencepiece.py --input all.txt --vocab-size 32000 --prefix mn_uncased
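
One simple way to produce the lower-cased corpus is sketched below; it overwrites all.txt in place, so keep a copy of the cased version if you still need it.

# Lower-case all.txt before training the uncased SentencePiece model
import os
with open("all.txt", encoding="utf-8") as fin, open("all_lower.txt", "w", encoding="utf-8") as fout:
    for line in fin:
        fout.write(line.lower())
os.replace("all_lower.txt", "all.txt")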

Create/Upload TFRecord files

Create the TFRecord files for the cased model:

python3 create_pretraining_data_helper.py --max_seq_length=512 --max_predictions_per_seq=77 --cased

Upload to your GCloud bucket:

gsutil cp mn_corpus/maxseq512*.tfrecord gs://YOUR_BUCKET/data-cased/

For the uncased model, adjust the above steps accordingly.
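
Before or after uploading, you can optionally sanity-check the generated files by counting the pre-training examples in each TFRecord (the filenames follow the maxseq512 prefix used above):

import glob
import tensorflow as tf
# Count serialized pre-training examples in each generated TFRecord file
for path in sorted(glob.glob("mn_corpus/maxseq512*.tfrecord")):
    count = sum(1 for _ in tf.data.TFRecordDataset(path))
    print(path, count)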

Train a model

To train, e.g., an uncased BERT-Base on a TPUv2, use the following command:

export INPUT_FILES=gs://YOUR_BUCKET/data-uncased/maxseq512-mn_news_700m_1.tfrecord,gs://YOUR_BUCKET/data-uncased/maxseq512-mn_news_700m_10.tfrecord,gs://YOUR_BUCKET/data-uncased/maxseq512-mn_news_700m_11.tfrecord,gs://YOUR_BUCKET/data-uncased/maxseq512-mn_news_700m_12.tfrecord,gs://YOUR_BUCKET/data-uncased/maxseq512-mn_news_700m_13.tfrecord,gs://YOUR_BUCKET/data-uncased/maxseq512-mn_news_700m_14.tfrecord,gs://YOUR_BUCKET/data-uncased/maxseq512-mn_news_700m_15.tfrecord,gs://YOUR_BUCKET/data-uncased/maxseq512-mn_news_700m_16.tfrecord,gs://YOUR_BUCKET/data-uncased/maxseq512-mn_news_700m_17.tfrecord,gs://YOUR_BUCKET/data-uncased/maxseq512-mn_news_700m_18.tfrecord,gs://YOUR_BUCKET/data-uncased/maxseq512-mn_news_700m_19.tfrecord,gs://YOUR_BUCKET/data-uncased/maxseq512-mn_news_700m_2.tfrecord,gs://YOUR_BUCKET/data-uncased/maxseq512-mn_news_700m_3.tfrecord,gs://YOUR_BUCKET/data-uncased/maxseq512-mn_news_700m_4.tfrecord,gs://YOUR_BUCKET/data-uncased/maxseq512-mn_news_700m_5.tfrecord,gs://YOUR_BUCKET/data-uncased/maxseq512-mn_news_700m_6.tfrecord,gs://YOUR_BUCKET/data-uncased/maxseq512-mn_news_700m_7.tfrecord,gs://YOUR_BUCKET/data-uncased/maxseq512-mn_news_700m_8.tfrecord,gs://YOUR_BUCKET/data-uncased/maxseq512-mn_news_700m_9.tfrecord,gs://YOUR_BUCKET/data-uncased/maxseq512-mn_wiki.tfrecord
python3 bert/run_pretraining.py \
  --input_file=$INPUT_FILES \
  --output_dir=gs://YOUR_BUCKET/uncased_bert_base \
  --use_tpu=True \
  --tpu_name=YOUR_TPU_ADDRESS \
  --num_tpu_cores=8 \
  --do_train=True \
  --do_eval=True \
  --bert_config_file=bert_configs/bert_base_config.json \
  --train_batch_size=256 \
  --max_seq_length=512 \
  --max_predictions_per_seq=77 \
  --num_train_steps=4000000 \
  --num_warmup_steps=10000 \
  --learning_rate=1e-4

For a large model, use bert_config_file=bert_configs/bert_large_config.json and train_batch_size=32.

Citation

@misc{mongolian-bert,
  author = {Tuguldur, Erdene-Ochir and Gunchinish, Sharavsambuu and Bataa, Enkhbold},
  title = {BERT Pretrained Models on Mongolian Datasets},
  year = {2019},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/tugstugi/mongolian-bert/}}
}