
nguyenvulebinh / vietnamese-roberta

Licence: other
A Robustly Optimized BERT Pretraining Approach for Vietnamese

Programming Languages

python

Projects that are alternatives of or similar to vietnamese-roberta

les-military-mrc-rank7
LES Cup: 2nd National "Military Intelligence Machine Reading" Challenge, Rank 7 solution
Stars: ✭ 37 (+68.18%)
Mutual labels:  transformer, bert, roberta
roberta-wwm-base-distill
A distilled RoBERTa-wwm-base model, distilled with RoBERTa-wwm-large as the teacher
Stars: ✭ 61 (+177.27%)
Mutual labels:  pretrained-models, bert, roberta
transformer-models
Deep Learning Transformer models in MATLAB
Stars: ✭ 90 (+309.09%)
Mutual labels:  transformer, pretrained-models, bert
Clue
Chinese Language Understanding Evaluation Benchmark (CLUE): datasets, baselines, pre-trained models, corpus and leaderboard
Stars: ✭ 2,425 (+10922.73%)
Mutual labels:  pretrained-models, bert, roberta
Bertviz
Tool for visualizing attention in the Transformer model (BERT, GPT-2, Albert, XLNet, RoBERTa, CTRL, etc.)
Stars: ✭ 3,443 (+15550%)
Mutual labels:  transformer, bert, roberta
COVID-19-Tweet-Classification-using-Roberta-and-Bert-Simple-Transformers
Rank 1 / 216
Stars: ✭ 24 (+9.09%)
Mutual labels:  transformer, bert, roberta
Transformers
🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
Stars: ✭ 55,742 (+253272.73%)
Mutual labels:  transformer, pretrained-models, bert
SIGIR2021 Conure
One Person, One Model, One World: Learning Continual User Representation without Forgetting
Stars: ✭ 23 (+4.55%)
Mutual labels:  transformer, bert
bert in a flask
A dockerized flask API, serving ALBERT and BERT predictions using TensorFlow 2.0.
Stars: ✭ 32 (+45.45%)
Mutual labels:  transformer, bert
AiSpace
AiSpace: Better practices for deep learning model development and deployment For Tensorflow 2.0
Stars: ✭ 28 (+27.27%)
Mutual labels:  pretrained-models, bert
Bert Keras
Keras implementation of BERT with pre-trained weights
Stars: ✭ 820 (+3627.27%)
Mutual labels:  transformer, pretrained-models
bert-as-a-service TFX
End-to-end pipeline with TFX to train and deploy a BERT model for sentiment analysis.
Stars: ✭ 32 (+45.45%)
Mutual labels:  transformer, bert
text-generation-transformer
text generation based on transformer
Stars: ✭ 36 (+63.64%)
Mutual labels:  transformer, bert
Ghostnet
CV backbones including GhostNet, TinyNet and TNT, developed by Huawei Noah's Ark Lab.
Stars: ✭ 1,744 (+7827.27%)
Mutual labels:  transformer, pretrained-models
Filipino-Text-Benchmarks
Open-source benchmark datasets and pretrained transformer models in the Filipino language.
Stars: ✭ 22 (+0%)
Mutual labels:  transformer, bert
Bert Pytorch
Google AI 2018 BERT pytorch implementation
Stars: ✭ 4,642 (+21000%)
Mutual labels:  transformer, bert
semantic-document-relations
Implementation, trained models and result data for the paper "Pairwise Multi-Class Document Classification for Semantic Relations between Wikipedia Articles"
Stars: ✭ 21 (-4.55%)
Mutual labels:  transformer, bert
Bert Multitask Learning
BERT for Multitask Learning
Stars: ✭ 380 (+1627.27%)
Mutual labels:  transformer, pretrained-models
Nlp Tutorial
Natural Language Processing Tutorial for Deep Learning Researchers
Stars: ✭ 9,895 (+44877.27%)
Mutual labels:  transformer, bert
word tokenize
Vietnamese Word Tokenize
Stars: ✭ 45 (+104.55%)
Mutual labels:  vietnamese, vietnamese-nlp

Pre-trained embedding using RoBERTa architecture on Vietnamese corpus

Overview

RoBERTa is an improved recipe for training BERT models that can match or exceed the performance of all post-BERT methods. The differences between RoBERTa and BERT:

  • Training the model longer, with bigger batches, over more data.
  • Removing the next sentence prediction objective.
  • Training on longer sequences.
  • Dynamically changing the masking pattern applied to the training data (see the sketch after this list).
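
The last point is worth a small illustration. The sketch below is not this project's code; it only shows the idea of dynamic masking (the token IDs, MASK_ID and VOCAB_SIZE are assumptions): the masked positions are re-sampled every time a sequence is fed to the model, whereas BERT fixes them once during preprocessing.

import random

# Illustrative sketch of dynamic masking; all IDs are made up, not the real vocabulary.
MASK_ID = 4          # assumed id of the <mask> token
VOCAB_SIZE = 1000    # assumed vocabulary size

def dynamic_mask(token_ids, mask_prob=0.15):
    """Return a freshly masked copy of token_ids; called again at every epoch."""
    masked = list(token_ids)
    for i in range(len(masked)):
        if random.random() < mask_prob:
            r = random.random()
            if r < 0.8:
                masked[i] = MASK_ID                       # 80%: replace with <mask>
            elif r < 0.9:
                masked[i] = random.randrange(VOCAB_SIZE)  # 10%: replace with a random token
            # remaining 10%: keep the original token
    return masked

sentence = [12, 345, 67, 89, 10, 11]
print(dynamic_mask(sentence))  # a different masking every call
print(dynamic_mask(sentence))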

The training data is a Vietnamese corpus crawled from many online sources: 50GB of text with approximately 7.7 billion words, covering many domains including news, law, entertainment, Wikipedia and so on. The data was cleaned with the visen library and tokenized with SentencePiece. For the envibert model, an additional 50GB of English text is used, so a total of 100GB of text goes into training envibert.
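
The tokenization step can be reproduced with the released SentencePiece model; a minimal sketch (the path assumes the model-bin/ layout shown in the next section):

import sentencepiece as spm

# Load the released SentencePiece model and tokenize a sentence.
sp = spm.SentencePieceProcessor(model_file='model-bin/envibert/sentencepiece.bpe.model')

text = 'Đại học Bách Khoa Hà Nội .'
print(sp.encode(text, out_type=str))  # subword pieces
print(sp.encode(text, out_type=int))  # piece ids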

Prepare environment

model-bin
├── envibert
│   ├── dict.txt
│   ├── model.pt
│   └── sentencepiece.bpe.model
├── uncased
│   ├── dict.txt
│   ├── model.pt
│   └── sentencepiece.bpe.model
└── cased
    ├── dict.txt
    ├── model.pt
    └── sentencepiece.bpe.model

  • Install the environment libraries:
pip install -r requirements.txt
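
The authoritative dependency list is requirements.txt in the repository; as a rough guide, the examples below use torch, transformers, fairseq and sentencepiece, so an assumed direct install would be:

pip install torch transformers fairseq sentencepiece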

Example usage

Load envibert model with Huggingface

from transformers import RobertaModel
from transformers.file_utils import cached_path, hf_bucket_url
from importlib.machinery import SourceFileLoader
import os

cache_dir='./cache'
model_name='nguyenvulebinh/envibert'

def download_tokenizer_files():
  resources = ['envibert_tokenizer.py', 'dict.txt', 'sentencepiece.bpe.model']
  for item in resources:
    if not os.path.exists(os.path.join(cache_dir, item)):
      tmp_file = hf_bucket_url(model_name, filename=item)
      tmp_file = cached_path(tmp_file,cache_dir=cache_dir)
      os.rename(tmp_file, os.path.join(cache_dir, item))
      
download_tokenizer_files()
tokenizer = SourceFileLoader(
    "envibert.tokenizer",
    os.path.join(cache_dir, 'envibert_tokenizer.py')
).load_module().RobertaTokenizer(cache_dir)
model = RobertaModel.from_pretrained(model_name, cache_dir=cache_dir)

# Encode text
text_input = 'Đại học Bách Khoa Hà Nội .'
text_ids = tokenizer(text_input, return_tensors='pt').input_ids
# tensor([[   0,  705,  131, 8751, 2878,  347,  477,    5,    2]])

# Extract features
text_features = model(text_ids)
text_features['last_hidden_state'].shape
# torch.Size([1, 9, 768])
len(text_features['hidden_states'])
# 7
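
A common follow-up, not part of the original example but a minimal sketch of how the extracted features can be used, is to pool the token vectors into a single sentence embedding:

import torch

# Mean-pool the token embeddings of the last layer into one sentence vector.
with torch.no_grad():
    outputs = model(text_ids)
sentence_embedding = outputs['last_hidden_state'].mean(dim=1)
sentence_embedding.shape
# torch.Size([1, 768])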

Load RoBERTa model

from fairseq.models.roberta import XLMRModel

# Using the envibert model (swap in './model-bin/cased/' or './model-bin/uncased/' for the other versions)
pretrained_path = './model-bin/envibert/'

# Load the RoBERTa model. This also loads the SentencePiece model.
roberta = XLMRModel.from_pretrained(pretrained_path, checkpoint_file='model.pt')
roberta.eval()  # disable dropout (or leave in train mode to finetune)

Extract features from RoBERTa

text_input = 'Đại học Bách Khoa Hà Nội.'
# Encode using roberta class
tokens_ids = roberta.encode(text_input)
# assert tokens_ids.tolist() == [0, 451, 71, 3401, 1384, 168, 234, 5, 2]
# Extracted feature using roberta model
tokens_embed = roberta.extract_features(tokens_ids)
# assert tokens_embed.shape == (1, 9, 512)
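
Two small extensions of this example, sketched with standard fairseq hub-interface helpers rather than code from this repository:

# Per-layer features: a list containing the embedding output plus one entry per layer
all_layers = roberta.extract_features(tokens_ids, return_all_hiddens=True)
len(all_layers)

# Map the token ids back to text (roughly reconstructs the input sentence)
roberta.decode(tokens_ids)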

Filling masks

RoBERTa can be used to fill <mask> tokens in the input.

masked_line = 'Đại học <mask> Khoa Hà Nội'
roberta.fill_mask(masked_line, topk=5)

#('Đại học Bách Khoa Hà Nội', 0.9954977035522461, ' Bách'),
#('Đại học Y Khoa Hà Nội', 0.001166337518952787, ' Y'),
#('Đại học Đa Khoa Hà Nội', 0.0005696234875358641, ' Đa'),
#('Đại học Văn Khoa Hà Nội', 0.000467598409159109, ' Văn'),
#('Đại học Anh Khoa Hà Nội', 0.00035955727798864245, ' Anh')

Model detail

This model is a custom RoBERTa variant with fewer hidden layers (6 layers). Three versions are released: envibert (case-sensitive vocabulary covering two languages, Vietnamese and English), cased (case-sensitive Vietnamese vocabulary) and uncased (all words lowercased).
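
The shapes in the Hugging Face example above match this description: last_hidden_state has size 768 and hidden_states has length 7 (the embedding output plus 6 layers). As a hedged sketch, a matching configuration could look like the following; only these three values are taken from the outputs above, the rest are RobertaConfig defaults and may differ from the released checkpoints:

from transformers import RobertaConfig

# Sketch only: values other than these three are RobertaConfig defaults and may
# differ from the released envibert/cased/uncased checkpoints.
config = RobertaConfig(
    num_hidden_layers=6,        # 6 transformer layers
    hidden_size=768,            # matches the [1, 9, 768] feature shape above
    output_hidden_states=True,  # needed for the 'hidden_states' field used above
)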

Training model

To train this model, please follow the instructions in this repository.

Citation

@inproceedings{nguyen20d_interspeech,
  author={Thai Binh Nguyen and Quang Minh Nguyen and Thi Thu Hien Nguyen and Quoc Truong Do and Chi Mai Luong},
  title={{Improving Vietnamese Named Entity Recognition from Speech Using Word Capitalization and Punctuation Recovery Models}},
  year=2020,
  booktitle={Proc. Interspeech 2020},
  pages={4263--4267},
  doi={10.21437/Interspeech.2020-1896}
}

Please CITE our repo when it is used to help produce published results or is incorporated into other software.

Contact

[email protected]
