
tugstugi / Mongolian Bert

Pre-trained Mongolian BERT models

Projects that are alternatives of or similar to Mongolian Bert

Python Tutorial Notebooks
Python tutorials as Jupyter Notebooks for NLP, ML, AI
Stars: ✭ 52 (+147.62%)
Mutual labels:  jupyter-notebook, natural-language-processing, natural-language-understanding
Speech Emotion Analyzer
The neural network model is capable of detecting five different emotions in male and female speech audio. (Deep Learning, NLP, Python)
Stars: ✭ 633 (+2914.29%)
Mutual labels:  jupyter-notebook, natural-language-processing, natural-language-understanding
Deep Nlp Seminars
Materials for deep NLP course
Stars: ✭ 113 (+438.1%)
Mutual labels:  jupyter-notebook, natural-language-processing, natural-language-understanding
Coursera Natural Language Processing Specialization
Programming assignments from all courses in the Coursera Natural Language Processing Specialization offered by deeplearning.ai.
Stars: ✭ 39 (+85.71%)
Mutual labels:  jupyter-notebook, natural-language-processing, natural-language-understanding
Spark Nlp Models
Models and Pipelines for the Spark NLP library
Stars: ✭ 88 (+319.05%)
Mutual labels:  jupyter-notebook, natural-language-processing, natural-language-understanding
Natural Language Processing Specialization
This repo contains my coursework, assignments, and slides for the Natural Language Processing Specialization by deeplearning.ai on Coursera
Stars: ✭ 151 (+619.05%)
Mutual labels:  jupyter-notebook, natural-language-processing, natural-language-understanding
Practical Nlp
Official Repository for 'Practical Natural Language Processing' by O'Reilly Media
Stars: ✭ 452 (+2052.38%)
Mutual labels:  jupyter-notebook, natural-language-processing, natural-language-understanding
Conv Emotion
This repo contains implementation of different architectures for emotion recognition in conversations.
Stars: ✭ 646 (+2976.19%)
Mutual labels:  natural-language-processing, natural-language-understanding
Madewithml
Learn how to responsibly deliver value with ML.
Stars: ✭ 29,253 (+139200%)
Mutual labels:  jupyter-notebook, natural-language-processing
Ai Series
📚 [.md & .ipynb] A series on artificial intelligence & deep learning, including mathematics fundamentals, Python practice, NLP applications, etc. 💫 Hands-on AI and deep learning: mathematical statistics | machine learning | deep learning | natural language processing | tool practice with Scikit, TensorFlow & PyTorch | industry applications & course notes
Stars: ✭ 702 (+3242.86%)
Mutual labels:  jupyter-notebook, natural-language-processing
Coursera
Quiz & Assignment of Coursera
Stars: ✭ 774 (+3585.71%)
Mutual labels:  jupyter-notebook, natural-language-processing
Machine Learning
A repository created to help machine learning beginners and those preparing for a study group.
Stars: ✭ 705 (+3257.14%)
Mutual labels:  jupyter-notebook, natural-language-processing
Nlp In Practice
Starter code to solve real world text data problems. Includes: Gensim Word2Vec, phrase embeddings, Text Classification with Logistic Regression, word count with pyspark, simple text preprocessing, pre-trained embeddings and more.
Stars: ✭ 790 (+3661.9%)
Mutual labels:  jupyter-notebook, natural-language-processing
Nlp Recipes
Natural Language Processing Best Practices & Examples
Stars: ✭ 5,783 (+27438.1%)
Mutual labels:  natural-language-processing, natural-language-understanding
This Word Does Not Exist
This Word Does Not Exist
Stars: ✭ 640 (+2947.62%)
Mutual labels:  natural-language-processing, natural-language-understanding
Bert
TensorFlow code and pre-trained models for BERT
Stars: ✭ 29,971 (+142619.05%)
Mutual labels:  natural-language-processing, natural-language-understanding
Me bot
Build a bot that speaks like you!
Stars: ✭ 641 (+2952.38%)
Mutual labels:  jupyter-notebook, natural-language-processing
Ecco
Visualize and explore NLP language models. Ecco creates interactive visualizations directly in Jupyter notebooks explaining the behavior of Transformer-based language models (like GPT2).
Stars: ✭ 723 (+3342.86%)
Mutual labels:  jupyter-notebook, natural-language-processing
Covid 19 Bert Researchpapers Semantic Search
A BERT semantic search engine for searching COVID-19 research papers in Google Colab
Stars: ✭ 23 (+9.52%)
Mutual labels:  jupyter-notebook, natural-language-processing
Awesome Ai Ml Dl
Awesome Artificial Intelligence, Machine Learning and Deep Learning as we learn it. Study notes and a curated list of awesome resources of such topics.
Stars: ✭ 831 (+3857.14%)
Mutual labels:  jupyter-notebook, natural-language-processing

Mongolian BERT models

This repository contains pre-trained Mongolian BERT models trained by tugstugi, enod and sharavsambuu. Special thanks to nabar who provided 5x TPUs.

This repository is based on the following open source projects: google-research/bert, huggingface/pytorch-pretrained-BERT and yoheikikuta/bert-japanese.

Models

SentencePiece with a vocabulary size of 32,000 is used as the text tokenizer. You can use the masked language model notebook (Open in Colab) to test how well the pre-trained models predict masked Mongolian words.
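
As a rough illustration of what the notebook does, the sketch below predicts a masked Mongolian word with Hugging Face transformers. It assumes the downloaded PyTorch checkpoint has been exported to a transformers-compatible directory with a matching tokenizer; the mongolian-bert-cased path and the example sentence are hypothetical, and the Colab notebook remains the reference.

import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_dir = "mongolian-bert-cased"  # hypothetical local export of the PyTorch model
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForMaskedLM.from_pretrained(model_dir).eval()

# Example sentence: "I will go [MASK] today."
inputs = tokenizer("Би өнөөдөр [MASK] явна.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Top 5 candidate tokens for the masked position
mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
print(tokenizer.convert_ids_to_tokens(logits[0, mask_pos].topk(5).indices[0].tolist()))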

Cased BERT-Base

Download either TensorFlow checkpoint or PyTorch model. Eval results:

global_step = 4000000
loss = 1.3476765
masked_lm_accuracy = 0.7069192
masked_lm_loss = 1.2822781
next_sentence_accuracy = 0.99875
next_sentence_loss = 0.0038988923

Uncased BERT-Base

Download either TensorFlow checkpoint or PyTorch model. Eval results:

global_step = 4000000
loss = 1.3115116
masked_lm_accuracy = 0.7018335
masked_lm_loss = 1.3155857
next_sentence_accuracy = 0.995
next_sentence_loss = 0.015816934

Loading in Tensorflow 2.x

Only minor changes are needed to load the weights as a Keras layer in TensorFlow 2.x (Open in Colab).
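
Independent of the notebook, a quick way to confirm in TensorFlow 2.x that a downloaded checkpoint is readable is to list its variables. The checkpoint path below is a hypothetical local path; adjust it to wherever you extracted the download.

import tensorflow as tf

ckpt = "cased_bert_base/model.ckpt-4000000"  # hypothetical path to the extracted checkpoint
reader = tf.train.load_checkpoint(ckpt)
# Print the first few variable names and shapes stored in the checkpoint
for name, shape in sorted(reader.get_variable_to_shape_map().items())[:10]:
    print(name, shape)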

Finetuning

This repo contains only pre-trained BERT models; for fine-tuning, see:

Pre-Training

This repo already provides pre-trained models. If you really want to pre-train from scratch, you will need a TPU. A base model can be trained in 13 days (4M steps) on a TPUv2; a large model needs more than a month. We used max_seq_length=512 for the entire training instead of training first with max_seq_length=128 and then with max_seq_length=512, because it gave better masked LM accuracy.
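
For a rough sense of scale, "4M steps in 13 days" implies a throughput of a few steps per second:

# Back-of-envelope throughput implied by 4M steps in 13 days on a TPUv2
steps = 4_000_000
seconds = 13 * 24 * 3600
print(round(steps / seconds, 1), "steps/second")  # ~3.6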

Install

Check out the project and install its dependencies:

git clone --recursive https://github.com/tugstugi/mongolian-bert.git
pip3 install -r requirements.txt

Data preparation

Download the Mongolian Wikipedia and the 700 million word Mongolian news data set and pre-process them into the directory mn_corpus/:

# Mongolian Wikipedia
python3 datasets/dl_and_preprop_mn_wiki.py
# 700 million words Mongolian news data set
python3 datasets/dl_and_preprop_mn_news.py

After pre-processing, the dataset will contain around 500M words.
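
A quick, rough way to verify that figure is a whitespace-based word count over the pre-processed files:

# Rough word count over the pre-processed corpus in mn_corpus/
import glob
total = 0
for path in glob.glob("mn_corpus/*.txt"):
    with open(path, encoding="utf-8") as f:
        total += sum(len(line.split()) for line in f)
print(f"{total:,} words")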

Train SentencePiece vocabulary

Now train the cased SentencePiece model, i.e. with a vocabulary size of 32000:

cd sentencepiece
cat ../mn_corpus/*.txt > all.txt
python3 train_sentencepiece.py --input all.txt --vocab-size 32000 --prefix mn_cased

If the training was successful, the following files should be created: mn_cased.model and mn_cased.vocab. You can also test whether the SentencePiece model is working as intended:

>>> import sentencepiece as spm
>>> s = spm.SentencePieceProcessor()
>>> s.Load('mn_cased.model')
>>> s.EncodeAsPieces('Мөнгөө тушаачихсаныхаа дараа мэдэгдээрэй')
['▁Мөнгөө', '▁тушаа', 'чихсан', 'ыхаа', '▁дараа', '▁мэдэгд', 'ээрэй']

For an uncased SentencePiece model, convert the content of all.txt to lower case (see the sketch after the command below) and train with:

python3 train_sentencepiece.py --input all.txt --vocab-size 32000 --prefix mn_uncased
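
One simple way to produce the lower-cased corpus is sketched below; it overwrites all.txt in place, so keep a copy of the cased version if you still need it.

# Lower-case all.txt before training the uncased SentencePiece model
import os
with open("all.txt", encoding="utf-8") as fin, open("all_lower.txt", "w", encoding="utf-8") as fout:
    for line in fin:
        fout.write(line.lower())
os.replace("all_lower.txt", "all.txt")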

Create/Upload TFRecord files

Create the TFRecord files for the cased model:

python3 create_pretraining_data_helper.py --max_seq_length=512 --max_predictions_per_seq=77 --cased

Upload to your GCloud bucket:

gsutil cp mn_corpus/maxseq512*.tfrecord gs://YOUR_BUCKET/data-cased/

For the uncased model, adjust the above steps accordingly.
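
Before or after uploading, you can optionally sanity-check the generated files by counting the pre-training examples in each TFRecord (the filenames follow the maxseq512 prefix used above):

import glob
import tensorflow as tf
# Count serialized pre-training examples in each generated TFRecord file
for path in sorted(glob.glob("mn_corpus/maxseq512*.tfrecord")):
    count = sum(1 for _ in tf.data.TFRecordDataset(path))
    print(path, count)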

Train a model

To train, e.g., an uncased BERT-Base on a TPUv2, use the following command:

export INPUT_FILES=gs://YOUR_BUCKET/data-uncased/maxseq512-mn_news_700m_1.tfrecord,gs://YOUR_BUCKET/data-uncased/maxseq512-mn_news_700m_10.tfrecord,gs://YOUR_BUCKET/data-uncased/maxseq512-mn_news_700m_11.tfrecord,gs://YOUR_BUCKET/data-uncased/maxseq512-mn_news_700m_12.tfrecord,gs://YOUR_BUCKET/data-uncased/maxseq512-mn_news_700m_13.tfrecord,gs://YOUR_BUCKET/data-uncased/maxseq512-mn_news_700m_14.tfrecord,gs://YOUR_BUCKET/data-uncased/maxseq512-mn_news_700m_15.tfrecord,gs://YOUR_BUCKET/data-uncased/maxseq512-mn_news_700m_16.tfrecord,gs://YOUR_BUCKET/data-uncased/maxseq512-mn_news_700m_17.tfrecord,gs://YOUR_BUCKET/data-uncased/maxseq512-mn_news_700m_18.tfrecord,gs://YOUR_BUCKET/data-uncased/maxseq512-mn_news_700m_19.tfrecord,gs://YOUR_BUCKET/data-uncased/maxseq512-mn_news_700m_2.tfrecord,gs://YOUR_BUCKET/data-uncased/maxseq512-mn_news_700m_3.tfrecord,gs://YOUR_BUCKET/data-uncased/maxseq512-mn_news_700m_4.tfrecord,gs://YOUR_BUCKET/data-uncased/maxseq512-mn_news_700m_5.tfrecord,gs://YOUR_BUCKET/data-uncased/maxseq512-mn_news_700m_6.tfrecord,gs://YOUR_BUCKET/data-uncased/maxseq512-mn_news_700m_7.tfrecord,gs://YOUR_BUCKET/data-uncased/maxseq512-mn_news_700m_8.tfrecord,gs://YOUR_BUCKET/data-uncased/maxseq512-mn_news_700m_9.tfrecord,gs://YOUR_BUCKET/data-uncased/maxseq512-mn_wiki.tfrecord
python3 bert/run_pretraining.py \
  --input_file=$INPUT_FILES \
  --output_dir=gs://YOUR_BUCKET/uncased_bert_base \
  --use_tpu=True \
  --tpu_name=YOUR_TPU_ADDRESS \
  --num_tpu_cores=8 \
  --do_train=True \
  --do_eval=True \
  --bert_config_file=bert_configs/bert_base_config.json \
  --train_batch_size=256 \
  --max_seq_length=512 \
  --max_predictions_per_seq=77 \
  --num_train_steps=4000000 \
  --num_warmup_steps=10000 \
  --learning_rate=1e-4

For a large model, use bert_config_file=bert_configs/bert_large_config.json and train_batch_size=32.

Citation

@misc{mongolian-bert,
  author = {Tuguldur, Erdene-Ochir and Gunchinish, Sharavsambuu and Bataa, Enkhbold},
  title = {BERT Pretrained Models on Mongolian Datasets},
  year = {2019},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/tugstugi/mongolian-bert/}}
}