To further accelerate Chinese natural language processing research, we provide Chinese pre-trained BERT with Whole Word Masking. We also compare state-of-the-art Chinese pre-trained models in depth, including BERT, ERNIE, and BERT-wwm.
Pre-Training with Whole Word Masking for Chinese BERT
Yiming Cui, Wanxiang Che, Ting Liu, Bing Qin, Ziqing Yang, Shijin Wang, Guoping Hu
This repository is developed based on: https://github.com/google-research/bert
You may also be interested in:
More resources by HFL: https://github.com/ymcui/HFL-Anthology
2021/1/27 All models support TensorFlow 2 now. Please use the transformers library to access them or download them from https://huggingface.co/hfl
2020/9/15 Our paper "Revisiting Pre-Trained Models for Chinese Natural Language Processing" is accepted to Findings of EMNLP as a long paper.
2020/8/27 We are happy to announce that our model is on top of the GLUE benchmark, check leaderboard.
2020/3/23 The models in this repository can now be easily accessed through PaddleHub, check Quick Load
2020/2/26 We release a knowledge distillation toolkit TextBrewer
2020/1/20 Happy Chinese New Year! We've released RBT3 and RBTL3 (3-layer RoBERTa-wwm-ext-base/large), check Small Models
2019/10/14 We release RoBERTa-wwm-ext-large, check Download
2019/9/10 We release RoBERTa-wwm-ext, check Download
2019/7/30 We release BERT-wwm-ext, which was trained on larger data, check Download
2019/6/20 Initial version, pre-trained models could be downloaded through Google Drive, check Download
Section | Description |
---|---|
Introduction | Introduction to BERT with Whole Word Masking (WWM) |
Download | Download links for Chinese BERT-wwm |
Quick Load | Learn how to quickly load our models through 🤗Transformers or PaddleHub |
Model Comparison | Compare the models published in this repository |
Baselines | Baseline results for several Chinese NLP datasets (partial) |
Small Models | 3-layer Transformer models |
Useful Tips | Provide several useful tips for using Chinese pre-trained models |
English BERT-wwm | Download English BERT-wwm (by Google) |
FAQ | Frequently Asked Questions |
Citation | Citation |
Whole Word Masking (wwm) is an upgraded version of BERT released by Google in late May 2019.
The following introduction is copied from the BERT repository.
In the original pre-processing code, we randomly select WordPiece tokens to mask. For example:
Input Text: the man jumped up , put his basket on phil ##am ##mon ' s head
Original Masked Input: [MASK] man [MASK] up , put his [MASK] on phil [MASK] ##mon ' s head
The new technique is called Whole Word Masking. In this case, we always mask all of the tokens corresponding to a word at once. The overall masking rate remains the same.
Whole Word Masked Input: the man [MASK] up , put his basket on [MASK] [MASK] [MASK] ' s head
The training is identical -- we still predict each masked WordPiece token independently. The improvement comes from the fact that the original prediction task was too 'easy' for words that had been split into multiple WordPieces.
Important Note: The term masking does not only mean replacing a word with the [MASK] token. It can also take other forms, such as keeping the original word or randomly replacing it with another word.
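To make the mechanism concrete, here is a minimal Python sketch (not the repository's actual pre-processing code) that groups WordPiece tokens into whole words via the '##' continuation marker and applies BERT's usual 80% [MASK] / 10% random-replace / 10% keep choice to every piece of a selected word:

```python
import random

def whole_word_mask(tokens, random_vocab, mask_rate=0.15, seed=12345):
    """Sketch of whole word masking: every WordPiece of a chosen word is
    masked together; '##' marks a continuation piece of the previous word."""
    rng = random.Random(seed)

    # Group token indices into whole-word spans.
    spans = []
    for i, tok in enumerate(tokens):
        if tok.startswith("##") and spans:
            spans[-1].append(i)
        else:
            spans.append([i])

    num_to_mask = max(1, int(round(len(tokens) * mask_rate)))
    rng.shuffle(spans)

    output = list(tokens)
    masked = 0
    for span in spans:
        if masked >= num_to_mask:
            break
        for i in span:  # act on the whole word at once
            p = rng.random()
            if p < 0.8:
                output[i] = "[MASK]"                  # 80%: replace with [MASK]
            elif p < 0.9:
                output[i] = rng.choice(random_vocab)  # 10%: random replacement
            # else: 10%: keep the original token
        masked += len(span)
    return output

tokens = "the man jumped up , put his basket on phil ##am ##mon ' s head".split()
print(whole_word_mask(tokens, random_vocab=["apple", "runs", "blue"]))
```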
In Chinese, it is straightforward to apply whole word masking, as traditional Chinese text processing typically includes Chinese Word Segmentation (CWS).
In the original BERT-base, Chinese released by Google, the text is segmented into individual Chinese characters, neglecting CWS.
In this repository, we use the Language Technology Platform (LTP) by Harbin Institute of Technology for CWS, and apply whole word masking to Chinese text.
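As a minimal sketch of this idea (assuming a CWS tool such as LTP has already produced the word segmentation, and that BERT tokenizes Chinese into single characters), non-initial characters of each segmented word can be tagged with '##' so that a whole Chinese word is treated as one masking unit:

```python
def add_whole_word_refs(segmented_words):
    """Mark non-initial characters of each CWS word with '##' so that the
    whole word can be treated as a single masking unit."""
    ref_tokens = []
    for word in segmented_words:
        for i, ch in enumerate(word):
            ref_tokens.append(ch if i == 0 else "##" + ch)
    return ref_tokens

# Example CWS output (e.g. from LTP) for a short sentence
words = ["使用", "语言", "模型", "来", "预测", "下", "一个", "词"]
print(add_whole_word_refs(words))
# ['使', '##用', '语', '##言', '模', '##型', '来', '预', '##测', '下', '一', '##个', '词']
```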
As all models are 'BERT-base' variants, we do not indicate 'base' in the following model names.
BERT-base: 12-layer, 768-hidden, 12-heads, 110M parameters

Model | Data | Google Drive | iFLYTEK Cloud |
---|---|---|---|
RBT6, Chinese | Wikipedia+Extended data[1] | - | TensorFlow(pw:XNMA) |
RBT4, Chinese | Wikipedia+Extended data[1] | - | TensorFlow(pw:e8dN) |
RBTL3, Chinese | Wikipedia+Extended data[1] | TensorFlow PyTorch | TensorFlow(pw:vySW) |
RBT3, Chinese | Wikipedia+Extended data[1] | TensorFlow PyTorch | TensorFlow(pw:b9nx) |
RoBERTa-wwm-ext-large, Chinese | Wikipedia+Extended data[1] | TensorFlow PyTorch | TensorFlow(pw:u6gC) |
RoBERTa-wwm-ext, Chinese | Wikipedia+Extended data[1] | TensorFlow PyTorch | TensorFlow(pw:Xe1p) |
BERT-wwm-ext, Chinese | Wikipedia+Extended data[1] | TensorFlow PyTorch | TensorFlow(pw:4cMG) |
BERT-wwm, Chinese | Wikipedia | TensorFlow PyTorch | TensorFlow(pw:07Xj) |
BERT-base, Chinese (Google) | Wikipedia | Google Cloud | - |
BERT-base, Multilingual Cased (Google) | Wikipedia | Google Cloud | - |
BERT-base, Multilingual Uncased (Google) | Wikipedia | Google Cloud | - |
If you need these models in PyTorch, you can either:
1. Convert the TensorFlow checkpoint into PyTorch using 🤗Transformers, or
2. Download them from https://huggingface.co/hfl
Steps: select one of the models on the page above → click "list all files in model" at the end of the model page → download the bin/json files from the pop-up window.
The whole zip package is roughly 400MB and includes the following files:
chinese_wwm_L-12_H-768_A-12.zip
|- bert_model.ckpt # Model Weights
|- bert_model.meta # Meta info
|- bert_model.index # Index info
|- bert_config.json # Config file
|- vocab.txt # Vocabulary
bert_config.json and vocab.txt are identical to those in the original BERT-base, Chinese by Google.
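Building on the zip layout above, here is a minimal, hedged sketch of option (1), converting the TensorFlow checkpoint to PyTorch with 🤗Transformers (the paths are assumptions based on the files listed above; loading with from_tf=True requires TensorFlow to be installed):

```python
from transformers import BertConfig, BertModel, BertTokenizer

# Assumed extraction directory, following the zip layout shown above
ckpt_dir = "./chinese_wwm_L-12_H-768_A-12"

config = BertConfig.from_json_file(f"{ckpt_dir}/bert_config.json")
tokenizer = BertTokenizer(f"{ckpt_dir}/vocab.txt")

# from_tf=True reads the original TensorFlow checkpoint (bert_model.ckpt.*)
model = BertModel.from_pretrained(
    f"{ckpt_dir}/bert_model.ckpt.index", from_tf=True, config=config
)

# Save in PyTorch format (pytorch_model.bin + config.json) for later use
model.save_pretrained("./chinese_wwm_pytorch")
tokenizer.save_pretrained("./chinese_wwm_pytorch")
```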
With Hugging Face Transformers, the models above can be easily loaded with the following code.
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("MODEL_NAME")
model = BertModel.from_pretrained("MODEL_NAME")
Notice: Please use BertTokenizer and BertModel for loading these models. DO NOT use RobertaTokenizer/RobertaModel!
The actual models and their MODEL_NAME values are listed below; a concrete usage sketch follows the table.
Original Model | MODEL_NAME |
---|---|
RoBERTa-wwm-ext-large | hfl/chinese-roberta-wwm-ext-large |
RoBERTa-wwm-ext | hfl/chinese-roberta-wwm-ext |
BERT-wwm-ext | hfl/chinese-bert-wwm-ext |
BERT-wwm | hfl/chinese-bert-wwm |
RBT3 | hfl/rbt3 |
RBTL3 | hfl/rbtl3 |
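As a concrete usage sketch (the model name is taken from the table above), loading RoBERTa-wwm-ext and running a forward pass:

```python
import torch
from transformers import BertTokenizer, BertModel

# Note: even for the RoBERTa-wwm models, BertTokenizer/BertModel are used.
tokenizer = BertTokenizer.from_pretrained("hfl/chinese-roberta-wwm-ext")
model = BertModel.from_pretrained("hfl/chinese-roberta-wwm-ext")

inputs = tokenizer("哈尔滨工业大学", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (1, sequence_length, 768)
```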
With PaddleHub, we can download and install the model with one line of code.
import paddlehub as hub
module = hub.Module(name=MODULE_NAME)
The actual model and its MODULE_NAME are listed below.
Original Model | MODULE_NAME |
---|---|
RoBERTa-wwm-ext-large | chinese-roberta-wwm-ext-large |
RoBERTa-wwm-ext | chinese-roberta-wwm-ext |
BERT-wwm-ext | chinese-bert-wwm-ext |
BERT-wwm | chinese-bert-wwm |
RBT3 | rbt3 |
RBTL3 | rbtl3 |
We list comparisons of the models released in this project.
~BERT means the attribute is inherited from Google's original BERT.
- | BERTGoogle | BERT-wwm | BERT-wwm-ext | RoBERTa-wwm-ext | RoBERTa-wwm-ext-large |
---|---|---|---|---|---|
Masking | WordPiece | WWM[1] | WWM | WWM | WWM |
Type | BERT-base | BERT-base | BERT-base | BERT-base | BERT-large |
Data Source | wiki | wiki | wiki+ext[2] | wiki+ext | wiki+ext |
Training Tokens # | 0.4B | 0.4B | 5.4B | 5.4B | 5.4B |
Device | TPU Pod v2 | TPU v3 | TPU v3 | TPU v3 | TPU Pod v3-32[3] |
Training Steps | ? | 100K (MAX_LEN=128) + 100K (MAX_LEN=512) | 1M (MAX_LEN=128) + 400K (MAX_LEN=512) | 1M (MAX_LEN=512) | 2M (MAX_LEN=512) |
Batch Size | ? | 2,560 / 384 | 2,560 / 384 | 384 | 512 |
Optimizer | AdamW | LAMB | LAMB | AdamW | AdamW |
Vocabulary | 21,128 | ~BERT[4] vocab | ~BERT vocab | ~BERT vocab | ~BERT vocab |
Init Checkpoint | Random Init | ~BERT weight | ~BERT weight | ~BERT weight | Random Init |
We experiment on several Chinese datasets, ranging from sentence-level to document-level tasks.
We only list partial results here and kindly advise readers to consult our technical report for the full results.
Best learning rates (see the fine-tuning sketch after the table):
Model | BERT | ERNIE | BERT-wwm* |
---|---|---|---|
CMRC 2018 | 3e-5 | 8e-5 | 3e-5 |
DRCD | 3e-5 | 8e-5 | 3e-5 |
CJRC | 4e-5 | 8e-5 | 4e-5 |
XNLI | 3e-5 | 5e-5 | 3e-5 |
ChnSentiCorp | 2e-5 | 5e-5 | 2e-5 |
LCQMC | 2e-5 | 3e-5 | 2e-5 |
BQ Corpus | 3e-5 | 5e-5 | 3e-5 |
THUCNews | 2e-5 | 5e-5 | 2e-5 |
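To illustrate how these learning rates plug into fine-tuning, here is a minimal, hedged sketch (not the exact baseline scripts, which are based on Google's run_classifier.py): a sentiment classification setup with BERT-wwm at the 2e-5 rate listed for ChnSentiCorp. Data loading is omitted and the two example sentences are made up.

```python
import torch
from transformers import BertForSequenceClassification, BertTokenizer

model_name = "hfl/chinese-bert-wwm"
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Learning rate taken from the table above (2e-5 for ChnSentiCorp with BERT-wwm)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5, weight_decay=0.01)

# Toy batch standing in for real ChnSentiCorp examples
texts = ["这家酒店的服务非常好", "房间又小又吵，不推荐"]
labels = torch.tensor([1, 0])
batch = tokenizer(texts, padding=True, truncation=True, max_length=128, return_tensors="pt")

model.train()
loss = model(**batch, labels=labels).loss
loss.backward()
optimizer.step()
optimizer.zero_grad()
print(float(loss))
```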
Note: To ensure the stability of the results, we run each experiment 10 times and report both maximum and average scores.
Average scores are shown in brackets; maximum scores are shown outside the brackets.
The CMRC 2018 dataset is released by the Joint Laboratory of HIT and iFLYTEK Research. The model should answer questions based on a given passage, in the same format as SQuAD. Evaluation Metrics: EM / F1
Model | Development | Test | Challenge |
---|---|---|---|
BERT | 65.5 (64.4) / 84.5 (84.0) | 70.0 (68.7) / 87.0 (86.3) | 18.6 (17.0) / 43.3 (41.3) |
ERNIE | 65.4 (64.3) / 84.7 (84.2) | 69.4 (68.2) / 86.6 (86.1) | 19.6 (17.0) / 44.3 (42.8) |
BERT-wwm | 66.3 (65.0) / 85.6 (84.7) | 70.5 (69.1) / 87.4 (86.7) | 21.0 (19.3) / 47.0 (43.9) |
BERT-wwm-ext | 67.1 (65.6) / 85.7 (85.0) | 71.4 (70.0) / 87.7 (87.0) | 24.0 (20.0) / 47.3 (44.6) |
RoBERTa-wwm-ext | 67.4 (66.5) / 87.2 (86.5) | 72.6 (71.4) / 89.4 (88.8) | 26.2 (24.6) / 51.0 (49.1) |
RoBERTa-wwm-ext-large | 68.5 (67.6) / 88.4 (87.9) | 74.2 (72.4) / 90.6 (90.0) | 31.5 (30.1) / 60.1 (57.5) |
DRCD is also a span-extraction machine reading comprehension dataset, released by Delta Research Center. The text is written in Traditional Chinese. Evaluation Metrics: EM / F1
Model | Development | Test |
---|---|---|
BERT | 83.1 (82.7) / 89.9 (89.6) | 82.2 (81.6) / 89.2 (88.8) |
ERNIE | 73.2 (73.0) / 83.9 (83.8) | 71.9 (71.4) / 82.5 (82.3) |
BERT-wwm | 84.3 (83.4) / 90.5 (90.2) | 82.8 (81.8) / 89.7 (89.0) |
BERT-wwm-ext | 85.0 (84.5) / 91.2 (90.9) | 83.6 (83.0) / 90.4 (89.9) |
RoBERTa-wwm-ext | 86.6 (85.9) / 92.5 (92.2) | 85.6 (85.2) / 92.0 (91.7) |
RoBERTa-wwm-ext-large | 89.6 (89.1) / 94.8 (94.4) | 89.6 (88.9) / 94.5 (94.1) |
CJRC is a Chinese judiciary reading comprehension dataset, released by the Joint Laboratory of HIT and iFLYTEK Research. Note that the data used in these experiments is NOT identical to the official release. Evaluation Metrics: EM / F1
Model | Development | Test |
---|---|---|
BERT | 54.6 (54.0) / 75.4 (74.5) | 55.1 (54.1) / 75.2 (74.3) |
ERNIE | 54.3 (53.9) / 75.3 (74.6) | 55.0 (53.9) / 75.0 (73.9) |
BERT-wwm | 54.7 (54.0) / 75.2 (74.8) | 55.1 (54.1) / 75.4 (74.4) |
BERT-wwm-ext | 55.6 (54.8) / 76.0 (75.3) | 55.6 (54.9) / 75.8 (75.0) |
RoBERTa-wwm-ext | 58.7 (57.6) / 79.1 (78.3) | 59.0 (57.8) / 79.0 (78.0) |
RoBERTa-wwm-ext-large | 62.1 (61.1) / 82.4 (81.6) | 62.4 (61.4) / 82.2 (81.0) |
We use the XNLI data for testing the NLI task. Evaluation Metrics: Accuracy
Model | Development | Test |
---|---|---|
BERT | 77.8 (77.4) | 77.8 (77.5) |
ERNIE | 79.7 (79.4) | 78.6 (78.2) |
BERT-wwm | 79.0 (78.4) | 78.2 (78.0) |
BERT-wwm-ext | 79.4 (78.6) | 78.7 (78.3) |
RoBERTa-wwm-ext | 80.0 (79.2) | 78.8 (78.3) |
RoBERTa-wwm-ext-large | 82.1 (81.3) | 81.2 (80.6) |
We use ChnSentiCorp data for testing sentiment analysis. Evaluation Metrics: Accuracy
Model | Development | Test |
---|---|---|
BERT | 94.7 (94.3) | 95.0 (94.7) |
ERNIE | 95.4 (94.8) | 95.4 (95.3) |
BERT-wwm | 95.1 (94.5) | 95.4 (95.0) |
BERT-wwm-ext | 95.4 (94.6) | 95.3 (94.7) |
RoBERTa-wwm-ext | 95.0 (94.6) | 95.6 (94.8) |
RoBERTa-wwm-ext-large | 95.8 (94.9) | 95.8 (94.9) |
We use LCQMC data for testing sentence pair matching. Evaluation Metrics: Accuracy
Model | Development | Test |
---|---|---|
BERT | 89.4 (88.4) | 86.9 (86.4) |
ERNIE | 89.8 (89.6) | 87.2 (87.0) |
BERT-wwm | 89.4 (89.2) | 87.0 (86.8) |
BERT-wwm-ext | 89.6 (89.2) | 87.1 (86.6) |
RoBERTa-wwm-ext | 89.0 (88.7) | 86.4 (86.1) |
RoBERTa-wwm-ext-large | 90.4 (90.0) | 87.0 (86.8) |
We use BQ Corpus data for testing sentence pair matching in the banking domain. Evaluation Metrics: Accuracy
Model | Development | Test |
---|---|---|
BERT | 86.0 (85.5) | 84.8 (84.6) |
ERNIE | 86.3 (85.5) | 85.0 (84.6) |
BERT-wwm | 86.1 (85.6) | 85.2 (84.9) |
BERT-wwm-ext | 86.4 (85.5) | 85.3 (84.8) |
RoBERTa-wwm-ext | 86.0 (85.4) | 85.0 (84.6) |
RoBERTa-wwm-ext-large | 86.3 (85.7) | 85.8 (84.9) |
THUCNews is released by Tsinghua University and contains news in 10 categories. Evaluation Metrics: Accuracy
Model | Development | Test |
---|---|---|
BERT | 97.7 (97.4) | 97.8 (97.6) |
ERNIE | 97.6 (97.3) | 97.5 (97.3) |
BERT-wwm | 98.0 (97.6) | 97.8 (97.6) |
BERT-wwm-ext | 97.7 (97.5) | 97.7 (97.5) |
RoBERTa-wwm-ext | 98.3 (97.9) | 97.7 (97.5) |
RoBERTa-wwm-ext-large | 98.3 (97.7) | 97.8 (97.6) |
We list RBT3 and RBTL3 results on several NLP tasks. Note that we only list test set results.
Model | CMRC 2018 | DRCD | XNLI | CSC | LCQMC | BQ | Average | Params |
---|---|---|---|---|---|---|---|---|
RoBERTa-wwm-ext-large | 74.2 / 90.6 | 89.6 / 94.5 | 81.2 | 95.8 | 87.0 | 85.8 | 87.335 | 325M |
RoBERTa-wwm-ext | 72.6 / 89.4 | 85.6 / 92.0 | 78.8 | 95.6 | 86.4 | 85.0 | 85.675 | 102M |
RBTL3 | 63.3 / 83.4 | 77.2 / 85.6 | 74.0 | 94.2 | 85.1 | 83.6 | 80.800 | 61M (59.8%) |
RBT3 | 62.2 / 81.8 | 75.0 / 83.9 | 72.3 | 92.8 | 85.1 | 83.3 | 79.550 | 38M (37.3%) |
Relative performance (a sketch of how these numbers can be derived follows the table):
Model | CMRC 2018 | DRCD | XNLI | CSC | LCQMC | BQ | Average | AVG-C |
---|---|---|---|---|---|---|---|---|
RoBERTa-wwm-ext-large | 102.2% / 101.3% | 104.7% / 102.7% | 103.0% | 100.2% | 100.7% | 100.9% | 101.9% | 101.2% |
RoBERTa-wwm-ext | 100% / 100% | 100% / 100% | 100% | 100% | 100% | 100% | 100% | 100% |
RBTL3 | 87.2% / 93.3% | 90.2% / 93.0% | 93.9% | 98.5% | 98.5% | 98.4% | 94.3% | 97.35% |
RBT3 | 85.7% / 91.5% | 87.6% / 91.2% | 91.8% | 97.1% | 98.5% | 98.0% | 92.9% | 96.35% |
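The relative numbers appear to be ratios against RoBERTa-wwm-ext, with AVG-C read as the average over the four classification tasks (XNLI / CSC / LCQMC / BQ). A small sketch of that reading, using RBT3 as the example:

```python
# Test-set accuracies taken from the tables above (classification tasks only).
baseline = {"XNLI": 78.8, "CSC": 95.6, "LCQMC": 86.4, "BQ": 85.0}  # RoBERTa-wwm-ext
rbt3 = {"XNLI": 72.3, "CSC": 92.8, "LCQMC": 85.1, "BQ": 83.3}

relative = {task: 100.0 * rbt3[task] / baseline[task] for task in baseline}
avg_c = sum(relative.values()) / len(relative)

print({task: round(score, 1) for task, score in relative.items()})
print(round(avg_c, 2))  # close to the 96.35% AVG-C reported for RBT3
```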
We also list the official English BERT-wwm models (by Google) here for your convenience.
BERT-Large, Uncased (Whole Word Masking): 24-layer, 1024-hidden, 16-heads, 340M parameters
BERT-Large, Cased (Whole Word Masking): 24-layer, 1024-hidden, 16-heads, 340M parameters
Q: How to use this model?
A: Use it as if you were using the original BERT. Note that you don't need to perform CWS on your text, as wwm only changes the pre-training input, not the input for downstream tasks.
Q: Do you have any plans to release the code?
A: Unfortunately, I am not able to release the code at the moment. As the implementation is quite easy, I suggest reading issues #10 and #13.
Q: How can I download XXXXX dataset?
A: We only provide data that is publicly available; check the data directory. For copyright reasons, some datasets are not publicly available. In that case, please search on GitHub or contact the original authors for access.
Q: Do you have any plans to release a larger model, say BERT-large-wwm?
A: If we see significant gains from BERT-large, we will release a larger version in the future.
Q: You liar! I cannot reproduce the results! 😂
A: We use the simplest models for the downstream tasks. For example, for the classification tasks we directly use run_classifier.py by Google. If you cannot reach the average scores that we reported, there are probably bugs in your code. As there is randomness in reaching the maximum scores, there is no guarantee that you will reproduce them exactly.
Q: I could get better performance than you!
A: Congratulations!
Q: How long did it take to train such a model?
A: The training was done on a Google Cloud TPU v3 with 128GB HBM, and it took roughly 1.5 days. Note that, in the pre-training stage, we use the LAMB optimizer, which is optimized for large batches. For fine-tuning downstream tasks, we use the normal AdamWeightDecayOptimizer by default.
Q: Who is ERNIE?
A: The ERNIE in this repository refers to the model released by Baidu, not the model of the same name published by Tsinghua University.
Q: BERT-wwm does not perform well on some tasks.
A: The aim of this project is to provide researchers with a variety of pre-trained models.
You are free to choose one of these models.
We only provide experimental results, and we strongly suggest trying these models in your own task.
One more model, one more choice.
Q: Why not try more datasets?
A: To be honest: 1) no time to find more data; 2) no need; 3) no money;
Q: Say something about these models
A: Each has its own emphasis and merits. Development of Chinese NLP needs joint efforts.
Q: Any comments on the name of the next generation of pre-trained models?
A: Maybe ZOE: Zero-shOt Embeddings from language model
Q: Tell me a little bit more about RoBERTa-wwm-ext
A: We integrate whole word masking (wwm) into the RoBERTa model. Specifically, the model is trained directly with max_len=512 (rather than with max_len=128 for several steps and then max_len=512).

If you find the technical report or resources useful, please cite the following technical report in your paper.
@inproceedings{cui-etal-2020-revisiting,
title = "Revisiting Pre-Trained Models for {C}hinese Natural Language Processing",
author = "Cui, Yiming and
Che, Wanxiang and
Liu, Ting and
Qin, Bing and
Wang, Shijin and
Hu, Guoping",
booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings",
month = nov,
year = "2020",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/2020.findings-emnlp.58",
pages = "657--668",
}
@article{chinese-bert-wwm,
title={Pre-Training with Whole Word Masking for Chinese BERT},
author={Cui, Yiming and Che, Wanxiang and Liu, Ting and Qin, Bing and Yang, Ziqing and Wang, Shijin and Hu, Guoping},
journal={arXiv preprint arXiv:1906.08101},
year={2019}
}
This is NOT an official project by Google, nor an official product by HIT or iFLYTEK. The experiments only represent empirical results under certain conditions and should not be regarded as the inherent nature of the respective models. The results may vary with different random seeds, computing devices, etc. The contents of this repository are for academic research purposes, and we do not provide any conclusive remarks. Users are free to use anything in this repository within the scope of the Apache-2.0 license. However, we are not responsible for direct or indirect losses caused by using the content of this project.
The first author of this project is partially supported by Google TensorFlow Research Cloud (TFRC) Program.
If there is any problem, please submit a GitHub Issue.