Chinese LERT | Chinese/English PERT | Chinese MacBERT | Chinese ELECTRA | Chinese XLNet | Chinese BERT | TextBrewer | TextPruner
More resources by HFL: https://github.com/ymcui/HFL-Anthology
Oct 29, 2022 We released a new pre-trained model called LERT. Check https://github.com/ymcui/LERT/
Aug 23, 2022 CINO has been accepted as a long paper at COLING 2022. We will update the final paper and release the corresponding resources after the camera-ready deadline.
Feb 21, 2022 CINO-small (6-layer, 148M parameters) has been released.
Jan 25, 2022 CINO-base-v2, CINO-large-v2, and WCM-v2 have been released.
Dec 17, 2021 We have released a model pruning toolkit TextPruner. Check https://github.com/airaria/TextPruner
Oct 25, 2021 CINO-large and the Wiki-Chinese-Minority (WCM) dataset have been released.
Section | Description |
---|---|
Introduction | Introduction to CINO |
Download | Download links and how-to-use |
Quick Load | Learn how to quickly load our models through 🤗Transformers |
Dataset for Chinese Minority Languages | Introduce Wiki-Chinese-Minority (WCM) and other datasets |
Results | Results on several datasets |
Citation | Citation and technical report |
Multilingual pre-trained language models, such as mBERT and XLM-R, use masked language modeling (MLM) and other self-supervised objectives on corpora in many languages to support multilingual and cross-lingual abilities in NLP systems.
However, due to the scarcity of corpora in Chinese minority languages and the neglect of relevant research, current multilingual PLMs are not capable of dealing with these languages.
We make the following contributions:
We propose CINO (Chinese mINOrity PLM), which is built on XLM-R and further pre-trained on corpora in Chinese minority languages.
To evaluate CINO as well as other multilingual PLMs, we also propose a new classification dataset called Wiki-Chinese-Minority (WCM), which is built from Wikipedia.
The experimental results on WCM, the Tibetan News Classification Corpus (TNCC), and KLUE-TC (YNAT) show that CINO achieves state-of-the-art performance.
CINO supports the following languages: Standard Chinese (zh), Cantonese (yue), Tibetan (bo), Mongolian (mn), Uyghur (ug), Kazakh (kk), Korean (ko), and Zhuang.
We provide CINO-small, CINO-base, and CINO-large in PyTorch (the v2 versions are preferred). We will release more models in the future.
CINO-large-v2: 24-layer, 1024-hidden, 16-heads, vocabulary size 136K, 442M parameters
CINO-base-v2: 12-layer, 768-hidden, 12-heads, vocabulary size 136K, 190M parameters
CINO-small-v2: 6-layer, 768-hidden, 12-heads, vocabulary size 136K, 148M parameters
CINO-large: 24-layer, 1024-hidden, 16-heads, vocabulary size 275K, 585M parameters
Notice: the v2 models use a vocabulary pruned to the supported languages (136K entries instead of the original 275K), which reduces the model size.
Model | Size | Google Drive | Baidu Disk |
---|---|---|---|
CINO-large-v2 | 1.6GB | PyTorch | PyTorch (pw: 3fjt) |
CINO-base-v2 | 705MB | PyTorch | PyTorch (pw: qnvc) |
CINO-small-v2 | 564MB | PyTorch | PyTorch (pw: 9mc8) |
CINO-large | 2.2GB | PyTorch | PyTorch (pw: wpyh) |
You can also download our models from the 🤗Transformers Model Hub, including both PyTorch and TensorFlow 2 models.
Model | Size | transformers model hub URL |
---|---|---|
CINO-large-v2 | 1.6GB | https://huggingface.co/hfl/cino-large-v2 |
CINO-base-v2 | 705MB | https://huggingface.co/hfl/cino-base-v2 |
CINO-small-v2 | 564MB | https://huggingface.co/hfl/cino-small-v2 |
CINO-large | 2.2GB | https://huggingface.co/hfl/cino-large |
How-to: click the model link that you wish to download (e.g., https://huggingface.co/hfl/cino-large) → Select "Files and versions" tab → Download!
There are three files in a PyTorch model:
pytorch_model.bin # Model Weight
config.json # Model Config
sentencepiece.bpe.model # Vocabulary
CINO uses exactly the same neural architecture as XLM-R, so it can be directly loaded with the XLMRobertaModel class in Transformers.
from transformers import XLMRobertaTokenizer, XLMRobertaModel
tokenizer = XLMRobertaTokenizer.from_pretrained("PATH_TO_MODEL_DIR")
model = XLMRobertaModel.from_pretrained("PATH_TO_MODEL_DIR")
With 🤗Transformers, the models above can be easily loaded with the following code.
from transformers import XLMRobertaTokenizer, XLMRobertaModel
tokenizer = XLMRobertaTokenizer.from_pretrained("MODEL_NAME")
model = XLMRobertaModel.from_pretrained("MODEL_NAME")
The actual model and its MODEL_NAME
are listed below.
Actual Model | MODEL_NAME |
---|---|
CINO-large-v2 | hfl/cino-large-v2 |
CINO-base-v2 | hfl/cino-base-v2 |
CINO-small-v2 | hfl/cino-small-v2 |
CINO-large | hfl/cino-large |
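For example, a minimal sketch that loads CINO-base-v2 from the Model Hub and extracts sentence representations (the sample sentence and printout are only for illustration):

import torch
from transformers import XLMRobertaTokenizer, XLMRobertaModel

# Load CINO-base-v2 directly from the 🤗 Model Hub
tokenizer = XLMRobertaTokenizer.from_pretrained("hfl/cino-base-v2")
model = XLMRobertaModel.from_pretrained("hfl/cino-base-v2")
model.eval()

# Encode a sentence; any of the supported languages is handled the same way
inputs = tokenizer("这是一个测试句子。", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# outputs.last_hidden_state has shape [batch_size, seq_len, hidden_size]
print(outputs.last_hidden_state.shape)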
We built a new classification dataset, Wiki-Chinese-Minority (WCM). The dataset covers Mongolian, Tibetan, Uyghur, Cantonese, Korean, Kazakh, and Chinese, across ten categories: art, geography, history, nature, natural science, people, technology, education, economy, and health.
We use weighted-F1 for evaluation.
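As a quick illustration of the metric, here is a sketch using scikit-learn (an external dependency assumed only for illustration; the labels below are made up):

from sklearn.metrics import f1_score

# Hypothetical gold labels and predictions for one test set
y_true = [0, 2, 1, 1, 3]
y_pred = [0, 2, 1, 0, 3]

# weighted-F1: per-class F1 scores averaged with weights proportional to class support
print(f1_score(y_true, y_pred, average="weighted"))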
Name | Google Drive | Baidu Disk |
---|---|---|
Wiki-Chinese-Minority-v2 (WCM-v2) | Google Drive | - |
Wiki-Chinese-Minority (WCM) | Google Drive | - |
WCM-v2 has a more balanced data distribution across categories and languages.
Dataset Statistics of WCM-v2:
Category | mn | bo | ug | yue | ko | kk | zh-Train | zh-Dev | zh-Test |
---|---|---|---|---|---|---|---|---|---|
Art | 135 | 141 | 3 | 387 | 806 | 348 | 2657 | 331 | 335 |
Geography | 76 | 339 | 256 | 1550 | 1197 | 572 | 12854 | 1589 | 1644 |
History | 66 | 111 | 0 | 499 | 776 | 491 | 1771 | 227 | 248 |
Nature | 7 | 0 | 7 | 606 | 442 | 361 | 1105 | 134 | 110 |
Natural Science | 779 | 133 | 20 | 336 | 532 | 880 | 2314 | 317 | 287 |
People | 1402 | 111 | 0 | 1230 | 684 | 169 | 7706 | 953 | 924 |
Technology | 191 | 163 | 8 | 329 | 808 | 515 | 1184 | 134 | 152 |
Education | 6 | 1 | 0 | 289 | 439 | 1392 | 936 | 130 | 118 |
Economy | 205 | 0 | 0 | 445 | 575 | 637 | 922 | 113 | 109 |
Health | 106 | 111 | 6 | 272 | 299 | 893 | 551 | 67 | 73 |
Total | 2973 | 1110 | 300 | 5943 | 6558 | 6258 | 32000 | 3995 | 4000 |
Note: the zh portion is split into training, dev, and test sets, while the minority-language portions are used as test sets only.
The dataset is still in its alpha stage and may be modified in the future.
We evaluate on YNAT, TNCC, and Wiki-Chinese-Minority. For each dataset, we use the same hyper-params for all models.
Dataset Statistics of YNAT (KLUE-TC):
#Train | #Dev | #Test | #Classes | Metric |
---|---|---|---|---|
45,678 | 9,107 | 9,107 | 7 | macro-F1 |
Hyper-params: initial LR 1e-5, batch size 16.
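A minimal fine-tuning sketch with the hyper-parameters above, using the standard 🤗 Trainer API; the load_ynat() helper, the epoch count, and the output path are hypothetical placeholders, not part of this repository:

from transformers import (XLMRobertaTokenizer, XLMRobertaForSequenceClassification,
                          Trainer, TrainingArguments)

tokenizer = XLMRobertaTokenizer.from_pretrained("hfl/cino-large-v2")
model = XLMRobertaForSequenceClassification.from_pretrained(
    "hfl/cino-large-v2", num_labels=7)  # YNAT has 7 classes

# Hypothetical helper returning tokenized 🤗 Datasets for train/dev
train_dataset, dev_dataset = load_ynat(tokenizer)

args = TrainingArguments(
    output_dir="ynat-cino",            # hypothetical output path
    learning_rate=1e-5,                # initial LR as listed above
    per_device_train_batch_size=16,    # batch size as listed above
    num_train_epochs=3,                # assumption; not specified in this README
)

trainer = Trainer(model=model, args=args,
                  train_dataset=train_dataset, eval_dataset=dev_dataset)
trainer.train()
print(trainer.evaluate())  # dev-set loss; pass compute_metrics to Trainer for macro-F1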
Results:
Model | Dev |
---|---|
XLM-R-large[1] | 87.3 |
XLM-R-large[2] | 86.3 |
CINO-small-v2 | 84.1 |
CINO-base-v2 | 85.5 |
CINO-large-v2 | 87.2 |
CINO-large | 87.4 |
[1] The results in the original paper.
[2] Reproduced result using the same initial LR with CINO-large.
Dataset Statistics of TNCC:
#Train[1] | #Dev | #Test | #Classes | Metric |
---|---|---|---|---|
7,363 | 920 | 920 | 12 | macro-F1 |
Hyper-params: initial LR 5e-6, batch size 16.
Results:
Model | Dev | Test |
---|---|---|
TextCNN | 65.1 | 63.4 |
XLM-R-large | 14.3 | 13.3 |
CINO-small-v2 | 72.1 | 66.7 |
CINO-base-v2 | 70.3 | 68.4 |
CINO-large-v2 | 72.9 | 71.0 |
CINO-large | 71.3 | 68.6 |
Note: there is no official train/dev/test split for this dataset, so we split it with a ratio of 8:1:1. Our splits are available at data/TNCC. The "with_space_separated" version preserves the spaces provided by the original author; in our paper, we use the "without_space_separated" version, where the separating spaces have been removed.
We use the Chinese training set to train our model and test on the other languages (zero-shot). We use weighted-F1 for evaluation.
Hyper-params: initial LR 7e-6, batch size 32.
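A sketch of this zero-shot protocol: the model is fine-tuned on zh-Train only and then scored on each language's test set with weighted-F1. Here model and tokenizer are assumed to come from a fine-tuning run like the one sketched above, and wcm_test_sets is a hypothetical mapping from a language code to (texts, gold labels):

import torch
from sklearn.metrics import f1_score

model.eval()
for lang, (texts, labels) in wcm_test_sets.items():  # hypothetical per-language test data
    preds = []
    for text in texts:
        inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
        with torch.no_grad():
            logits = model(**inputs).logits
        preds.append(logits.argmax(dim=-1).item())
    # Score each language separately, as in the table below
    print(lang, round(100 * f1_score(labels, preds, average="weighted"), 1))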
Results on WCM-v2:
Model | MN | BO | UG | YUE | KO | KK | ZH | Average |
---|---|---|---|---|---|---|---|---|
XLM-R-base | 41.2 | 25.7 | 84.5 | 66.1 | 43.1 | 23.0 | 88.3 | 53.1 |
XLM-R-large | 53.8 | 24.5 | 89.4 | 67.3 | 45.4 | 30.0 | 88.3 | 57.0 |
CINO-small-v2 | 60.3 | 47.9 | 86.5 | 64.6 | 43.2 | 33.2 | 87.9 | 60.5 |
CINO-base-v2 | 62.1 | 52.7 | 87.8 | 68.1 | 45.6 | 38.3 | 89.0 | 63.4 |
CINO-large-v2 | 73.1 | 58.9 | 90.1 | 66.9 | 45.1 | 42.0 | 88.9 | 66.4 |
See examples. It currently includes:
If you find the technical report or resources useful, please cite our work in your paper.
@inproceedings{yang-etal-2022-cino,
title = "{CINO}: A {C}hinese Minority Pre-trained Language Model",
author = "Yang, Ziqing and
Xu, Zihang and
Cui, Yiming and
Wang, Baoxin and
Lin, Min and
Wu, Dayong and
Chen, Zhigang",
booktitle = "Proceedings of the 29th International Conference on Computational Linguistics",
month = oct,
year = "2022",
address = "Gyeongju, Republic of Korea",
publisher = "International Committee on Computational Linguistics",
url = "https://aclanthology.org/2022.coling-1.346",
pages = "3937--3949"
}
Follow our official WeChat account to keep updated with our latest technologies!
If you have any questions, please submit an issue.