ymcui / Macbert

Licence: apache-2.0

Revisiting Pre-trained Models for Chinese Natural Language Processing (Findings of EMNLP)

This repository contains the resources in our paper "Revisiting Pre-trained Models for Chinese Natural Language Processing", which will be published in "Findings of EMNLP". You can read our camera-ready paper through ACL Anthology or arXiv pre-print.

Revisiting Pre-trained Models for Chinese Natural Language Processing
Yiming Cui, Wanxiang Che, Ting Liu, Bing Qin, Shijin Wang, Guoping Hu

For resources other than MacBERT, please visit the following repositories:

Chinese BERT-wwm series: https://github.com/ymcui/Chinese-BERT-wwm
Chinese ELECTRA: https://github.com/ymcui/Chinese-ELECTRA
Chinese XLNet: https://github.com/ymcui/Chinese-XLNet

More resources by HFL: https://github.com/ymcui/HFL-Anthology

News

[Nov 3, 2020] Pre-trained MacBERT models are available through direct Download or Quick Load. Use them as you would the original BERT (except that they cannot perform the original MLM).

[Sep 15, 2020] Our paper "Revisiting Pre-Trained Models for Chinese Natural Language Processing" has been accepted to Findings of EMNLP as a long paper.

Guide

Section	Description
Introduction	Introduction to MacBERT
Download	Download links for MacBERT
Quick Load	Learn how to quickly load our models through 🤗Transformers
Results	Results on several Chinese NLP datasets
FAQ	Frequently Asked Questions
Citation	Citation

Introduction

MacBERT is an improved BERT with a novel MLM-as-correction (Mac) pre-training task, which mitigates the discrepancy between pre-training and fine-tuning.

Instead of masking with the [MASK] token, which never appears in the fine-tuning stage, we propose to use similar words for masking. A similar word is obtained with the Synonyms toolkit (Wang and Hu, 2017), which is based on word2vec (Mikolov et al., 2013) similarity calculations. If an N-gram is selected for masking, we find a similar word for each word in it individually. In rare cases, when there is no similar word, we fall back to random word replacement.

Here is an example of our pre-training task.

| | Example |
| -------------- | ----------------- |
| Original Sentence | we use a language model to predict the probability of the next word. |
| MLM | we use a language [M] to [M] ##di ##ct the pro [M] ##bility of the next word . |
| Whole word masking | we use a language [M] to [M] [M] [M] the [M] [M] [M] of the next word . |
| N-gram masking | we use a [M] [M] to [M] [M] [M] the [M] [M] [M] [M] [M] next word . |
| MLM as correction | we use a text system to ca ##lc ##ulate the po ##si ##bility of the next word . |
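To make the replacement step concrete, here is a minimal Python sketch of the idea described above. It is not the authors' pre-training code: the `SIMILAR` table stands in for the Synonyms toolkit, `VOCAB` is a made-up fallback vocabulary, and the function name `mac_corrupt` is hypothetical. It works on whole words rather than WordPiece tokens to keep the example short.

```python
import random

# Hypothetical toy similarity table; the paper obtains similar words
# from the Synonyms toolkit instead.
SIMILAR = {
    "language": "text",
    "model": "system",
    "predict": "calculate",
}
# Hypothetical fallback vocabulary for random word replacement.
VOCAB = ["apple", "river", "window"]


def mac_corrupt(tokens, span_start, span_len, rng):
    """Replace each word of the selected N-gram with a similar word.

    Each word is looked up individually. When no similar word is known,
    a random vocabulary word is used instead. Returns the corrupted
    tokens and the positions the model would have to correct.
    """
    corrupted = list(tokens)
    positions = []
    for i in range(span_start, min(span_start + span_len, len(tokens))):
        replacement = SIMILAR.get(tokens[i])
        if replacement is None:
            replacement = rng.choice(VOCAB)  # fallback: random word
        corrupted[i] = replacement
        positions.append(i)
    return corrupted, positions


tokens = "we use a language model to predict the next word".split()
corrupted, positions = mac_corrupt(tokens, span_start=3, span_len=2, rng=random.Random(0))
print(" ".join(corrupted))  # we use a text system to predict the next word
```

The training target at each returned position is the original word, so the model learns to correct the replaced words rather than to fill in a [MASK] placeholder.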

In addition to the new pre-training task, we also incorporate the following techniques:

Whole Word Masking (WWM)
N-gram masking
Sentence-Order Prediction (SOP)

Note that MacBERT can directly replace the original BERT, as there are no differences in the main neural architecture.

For more technical details, please check our paper: Revisiting Pre-trained Models for Chinese Natural Language Processing

Download

We mainly provide pre-trained MacBERT models in TensorFlow 1.x.

MacBERT-large, Chinese: 24-layer, 1024-hidden, 16-heads, 324M parameters
MacBERT-base, Chinese: 12-layer, 768-hidden, 12-heads, 102M parameters

Model	Google Drive	iFLYTEK Cloud	Size
`MacBERT-large, Chinese`	TensorFlow	TensorFlow (pw: 3Yg3)	1.2G
`MacBERT-base, Chinese`	TensorFlow	TensorFlow (pw: E2cP)	383M

PyTorch/TensorFlow2 Version

If you need these models in PyTorch/TensorFlow2, you have two options:

Convert the TensorFlow checkpoint into PyTorch/TensorFlow2 using 🤗Transformers
Download from https://huggingface.co/hfl

Download steps: select one of the models on the page above → click "list all files in model" at the end of the model page → download the bin/json files from the pop-up window.

Quick Load

With Huggingface-Transformers, the models above can be easily accessed and loaded with the following code.

from transformers import BertTokenizer, BertModel
# Replace "MODEL_NAME" with one of the names listed below.
tokenizer = BertTokenizer.from_pretrained("MODEL_NAME")
model = BertModel.from_pretrained("MODEL_NAME")

**Notice: Please use BertTokenizer and BertModel for loading MacBERT models.**

The actual model and its MODEL_NAME are listed below.

Original Model	MODEL_NAME
MacBERT-large	hfl/chinese-macbert-large
MacBERT-base	hfl/chinese-macbert-base
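As a usage sketch for the model names above, the following Python snippet loads a model and encodes one sentence. It assumes `torch` and `transformers` are installed and that the weights can be downloaded from the Hugging Face hub on first use. The helper name `encode` and the example sentence are illustrative and not part of this repository.

```python
# Model names copied from the table above.
MODEL_NAMES = {
    "MacBERT-large": "hfl/chinese-macbert-large",
    "MacBERT-base": "hfl/chinese-macbert-base",
}


def encode(text, size="MacBERT-base"):
    """Return the last hidden states for `text` (hypothetical helper).

    Imports are done lazily so that MODEL_NAMES can be used without
    loading the heavy dependencies. Weights are downloaded on first use.
    """
    import torch
    from transformers import BertModel, BertTokenizer

    name = MODEL_NAMES[size]
    # MacBERT is loaded with the plain BERT classes, as noted above.
    tokenizer = BertTokenizer.from_pretrained(name)
    model = BertModel.from_pretrained(name)
    model.eval()

    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # Shape: (batch_size, sequence_length, hidden_size)
    return outputs.last_hidden_state
```

Calling `encode("今天天气很好")` should return a tensor whose last dimension is 768 for the base model and 1024 for the large model, matching the hidden sizes listed in the Download section.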

Results

We present the results of MacBERT on the following six tasks (please read our paper for other results).

To ensure the stability of the results, we run each experiment 10 times and report the maximum score, followed by the average score in brackets.
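The "maximum (average)" cells in the tables below can be produced from per-run scores as in this small Python sketch. The scores in `runs` are invented for illustration and are not results from the paper, and the function name `summarize` is hypothetical.

```python
from statistics import mean


def summarize(scores):
    """Format run scores as 'max (average)', rounded to one decimal."""
    return f"{max(scores):.1f} ({mean(scores):.1f})"


# Hypothetical accuracies from 10 runs of one experiment.
runs = [79.1, 79.5, 80.3, 79.8, 79.6, 79.9, 79.4, 79.7, 80.0, 79.7]
print(summarize(runs))  # 80.3 (79.7)
```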

CMRC 2018

The CMRC 2018 dataset is released by the Joint Laboratory of HIT and iFLYTEK Research. The model should answer the questions based on the given passage, in the same format as SQuAD. Evaluation metrics: EM / F1

Model	Development	Test	Challenge	#Params
BERT-base	65.5 (64.4) / 84.5 (84.0)	70.0 (68.7) / 87.0 (86.3)	18.6 (17.0) / 43.3 (41.3)	102M
BERT-wwm	66.3 (65.0) / 85.6 (84.7)	70.5 (69.1) / 87.4 (86.7)	21.0 (19.3) / 47.0 (43.9)	102M
BERT-wwm-ext	67.1 (65.6) / 85.7 (85.0)	71.4 (70.0) / 87.7 (87.0)	24.0 (20.0) / 47.3 (44.6)	102M
RoBERTa-wwm-ext	67.4 (66.5) / 87.2 (86.5)	72.6 (71.4) / 89.4 (88.8)	26.2 (24.6) / 51.0 (49.1)	102M
ELECTRA-base	68.4 (68.0) / 84.8 (84.6)	73.1 (72.7) / 87.1 (86.9)	22.6 (21.7) / 45.0 (43.8)	102M
MacBERT-base	68.5 (67.3) / 87.9 (87.1)	73.2 (72.4) / 89.5 (89.2)	30.2 (26.4) / 54.0 (52.2)	102M
ELECTRA-large	69.1 (68.2) / 85.2 (84.5)	73.9 (72.8) / 87.1 (86.6)	23.0 (21.6) / 44.2 (43.2)	324M
RoBERTa-wwm-ext-large	68.5 (67.6) / 88.4 (87.9)	74.2 (72.4) / 90.6 (90.0)	31.5 (30.1) / 60.1 (57.5)	324M
MacBERT-large	70.7 (68.6) / 88.9 (88.2)	74.8 (73.2) / 90.7 (90.1)	31.9 (29.6) / 60.2 (57.6)	324M
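For readers unfamiliar with the EM and F1 metrics used for the reading comprehension datasets, the Python sketch below shows a simplified character-level version for a single prediction. It is not the official CMRC 2018 or DRCD evaluation script, which handles normalization and matching differently. The function names and the example strings are illustrative.

```python
from collections import Counter


def exact_match(prediction, reference):
    """Return 1.0 if the answers are identical after trimming, else 0.0."""
    return float(prediction.strip() == reference.strip())


def char_f1(prediction, reference):
    """Character-overlap F1 between a predicted and a reference answer.

    Simplified sketch: characters are treated as a bag, so character
    order is ignored.
    """
    pred_chars = list(prediction.strip())
    ref_chars = list(reference.strip())
    overlap = sum((Counter(pred_chars) & Counter(ref_chars)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_chars)
    recall = overlap / len(ref_chars)
    return 2 * precision * recall / (precision + recall)


# The prediction contains the reference plus one extra character:
# EM is 0, precision is 2/3, recall is 1, so F1 is 0.8.
print(exact_match("北京市", "北京"), round(char_f1("北京市", "北京"), 2))
```

Dataset-level scores are averages of these per-question values, usually reported as percentages.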

DRCD

DRCD is also a span-extraction machine reading comprehension dataset, released by Delta Research Center. The text is written in Traditional Chinese. Evaluation metrics: EM / F1

Model	Development	Test	#Params
BERT-base	83.1 (82.7) / 89.9 (89.6)	82.2 (81.6) / 89.2 (88.8)	102M
BERT-wwm	84.3 (83.4) / 90.5 (90.2)	82.8 (81.8) / 89.7 (89.0)	102M
BERT-wwm-ext	85.0 (84.5) / 91.2 (90.9)	83.6 (83.0) / 90.4 (89.9)	102M
RoBERTa-wwm-ext	86.6 (85.9) / 92.5 (92.2)	85.6 (85.2) / 92.0 (91.7)	102M
ELECTRA-base	87.5 (87.0) / 92.5 (92.3)	86.9 (86.6) / 91.8 (91.7)	102M
MacBERT-base	89.4 (89.2) / 94.3 (94.1)	89.5 (88.7) / 93.8 (93.5)	102M
ELECTRA-large	88.8 (88.7) / 93.3 (93.2)	88.8 (88.2) / 93.6 (93.2)	324M
RoBERTa-wwm-ext-large	89.6 (89.1) / 94.8 (94.4)	89.6 (88.9) / 94.5 (94.1)	324M
MacBERT-large	91.2 (90.8) / 95.6 (95.3)	91.7 (90.9) / 95.6 (95.3)	324M

XNLI

We use XNLI data for testing the NLI task. Evaluation metrics: Accuracy

Model	Development	Test	#Params
BERT-base	77.8 (77.4)	77.8 (77.5)	102M
BERT-wwm	79.0 (78.4)	78.2 (78.0)	102M
BERT-wwm-ext	79.4 (78.6)	78.7 (78.3)	102M
RoBERTa-wwm-ext	80.0 (79.2)	78.8 (78.3)	102M
ELECTRA-base	77.9 (77.0)	78.4 (77.8)	102M
MacBERT-base	80.3 (79.7)	79.3 (78.8)	102M
ELECTRA-large	81.5 (80.8)	81.0 (80.9)	324M
RoBERTa-wwm-ext-large	82.1 (81.3)	81.2 (80.6)	324M
MacBERT-large	82.4 (81.8)	81.3 (80.6)	324M

ChnSentiCorp

We use ChnSentiCorp data for testing sentiment analysis. Evaluation metrics: Accuracy

Model	Development	Test	#Params
BERT-base	94.7 (94.3)	95.0 (94.7)	102M
BERT-wwm	95.1 (94.5)	95.4 (95.0)	102M
BERT-wwm-ext	95.4 (94.6)	95.3 (94.7)	102M
RoBERTa-wwm-ext	95.0 (94.6)	95.6 (94.8)	102M
ELECTRA-base	93.8 (93.0)	94.5 (93.5)	102M
MacBERT-base	95.2 (94.8)	95.6 (94.9)	102M
ELECTRA-large	95.2 (94.6)	95.3 (94.8)	324M
RoBERTa-wwm-ext-large	95.8 (94.9)	95.8 (94.9)	324M
MacBERT-large	95.7 (95.0)	95.9 (95.1)	324M

LCQMC

LCQMC is a sentence pair matching dataset, which could be seen as a binary classification task. Evaluation metrics: Accuracy

Model	Development	Test	#Params
BERT	89.4 (88.4)	86.9 (86.4)	102M
BERT-wwm	89.4 (89.2)	87.0 (86.8)	102M
BERT-wwm-ext	89.6 (89.2)	87.1 (86.6)	102M
RoBERTa-wwm-ext	89.0 (88.7)	86.4 (86.1)	102M
ELECTRA-base	90.2 (89.8)	87.6 (87.3)	102M
MacBERT-base	89.5 (89.3)	87.0 (86.5)	102M
ELECTRA-large	90.7 (90.4)	87.3 (87.2)	324M
RoBERTa-wwm-ext-large	90.4 (90.0)	87.0 (86.8)	324M
MacBERT-large	90.6 (90.3)	87.6 (87.1)	324M

BQ Corpus

BQ Corpus is a sentence pair matching dataset, which could be seen as a binary classification task. Evaluation metrics: Accuracy

Model	Development	Test	#Params
BERT	86.0 (85.5)	84.8 (84.6)	102M
BERT-wwm	86.1 (85.6)	85.2 (84.9)	102M
BERT-wwm-ext	86.4 (85.5)	85.3 (84.8)	102M
RoBERTa-wwm-ext	86.0 (85.4)	85.0 (84.6)	102M
ELECTRA-base	84.8 (84.7)	84.5 (84.0)	102M
MacBERT-base	86.0 (85.5)	85.2 (84.9)	102M
ELECTRA-large	86.7 (86.2)	85.1 (84.8)	324M
RoBERTa-wwm-ext-large	86.3 (85.7)	85.8 (84.9)	324M
MacBERT-large	86.2 (85.7)	85.6 (85.0)	324M

FAQ

Question 1: Do you have an English version of MacBERT?

A1: Sorry, we do not have an English version of pre-trained MacBERT.

Question 2: How to use MacBERT?

A2: Use it as you would the original BERT in the fine-tuning stage (just replace the checkpoint and config files). You can also perform further pre-training on our checkpoint with MLM/NSP/SOP objectives.

Question 3: Could you provide pre-training code for MacBERT?

A3: Sorry, we cannot provide the source code at the moment. We may release it in the future, but there is no guarantee.

Question 4: How about releasing the pre-training data?

A4: We have no right to redistribute this data, and doing so could raise legal issues.

Question 5: Will you release pre-trained MacBERT on larger data?

A5: Currently, we have no plans for this.

Citation

If you find our resources or paper useful, please consider including the following citation in your paper.

https://arxiv.org/abs/2004.13922

@inproceedings{cui-etal-2020-revisiting,
    title = "Revisiting Pre-Trained Models for {C}hinese Natural Language Processing",
    author = "Cui, Yiming  and
      Che, Wanxiang  and
      Liu, Ting  and
      Qin, Bing  and
      Wang, Shijin  and
      Hu, Guoping",
    booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.findings-emnlp.58",
    pages = "657--668",
}

Acknowledgment

The first author would like to thank Google TensorFlow Research Cloud (TFRC) Program.

Issues

Before you submit an issue:

You are advised to read the FAQ before you submit an issue.
Repetitive and irrelevant issues will be ignored and closed by the stale bot (available on GitHub Marketplace). Thank you for your understanding and support.
We cannot accommodate EVERY request, so please bear in mind that there is no guarantee that your request will be met.
Always be polite when you submit an issue.
