MiLMo: Minority Multilingual Pre-trained Language Model - Summary
Arxiv URL: https://arxiv.org/abs/2212.01779v2
Authors: Junjie Deng, Hanru Shi, Xinhe Yu, Wugedele Bao, Yuan Sun, Xiaobing Zhao
Summary:
The paper presents a multilingual pre-trained language model named MiLMo that performs better on minority language tasks, covering Mongolian, Tibetan, Uyghur, Kazakh and Korean. The authors also construct a minority multilingual text classification dataset named MiTC, and train a word2vec model for each language to provide an optimal scheme for downstream task research on minority languages.
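Since the summary itself contains no code, the following is a minimal sketch of the kind of downstream fine-tuning described: a multilingual encoder fine-tuned for text classification on minority-language data. The checkpoint name `xlm-roberta-base` is a stand-in (MiLMo's public checkpoint name is not given here), and the CSV file names and label count are illustrative placeholders.

```python
# Hypothetical sketch: fine-tuning a multilingual encoder for minority-language
# text classification. "xlm-roberta-base" stands in for a MiLMo checkpoint;
# file paths and NUM_LABELS are illustrative placeholders, not the real MiTC setup.
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

MODEL_NAME = "xlm-roberta-base"   # assumption: stand-in for MiLMo
NUM_LABELS = 10                   # assumption: number of MiTC classes

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME, num_labels=NUM_LABELS)

# Assumed layout: CSV files with "text" and "label" columns for one language.
dataset = load_dataset("csv", data_files={"train": "mitc_train.csv",
                                          "test": "mitc_test.csv"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=256)

dataset = dataset.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="milmo-mitc", num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    tokenizer=tokenizer,
)
trainer.train()
print(trainer.evaluate())
```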
Key Insights & Learnings:
- MiLMo is a multilingual pre-trained language model that performs better than existing multilingual pre-trained models on minority language tasks.
- MiTC is a minority multilingual text classification dataset constructed to address the scarcity of minority language datasets.
- MiLMo outperforms the word2vec representation on the downstream text classification task (a sketch of such a word2vec baseline follows this list).
- The paper provides an optimal scheme for downstream task research on minority languages.
- Existing multilingual pre-trained models do not work well on minority languages, which seriously hinders the informatization of minority languages.
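For contrast, here is a minimal sketch of a word2vec-style baseline of the kind the paper compares against: static embeddings trained per language, averaged into document vectors, and fed to a linear classifier. The toy corpus, labels, and hyperparameters are illustrative placeholders, not the paper's actual configuration.

```python
# Hypothetical sketch of a word2vec baseline for text classification:
# per-language static embeddings -> averaged document vectors -> linear classifier.
import numpy as np
from gensim.models import Word2Vec
from sklearn.linear_model import LogisticRegression

# Assumed toy corpus: tokenised documents of one minority language, with labels.
train_docs = [["token1", "token2", "token3"], ["token2", "token4"]]
train_labels = [0, 1]

# Train a word2vec model for this language (the paper trains one per language).
w2v = Word2Vec(sentences=train_docs, vector_size=100, window=5,
               min_count=1, workers=4)

def doc_vector(tokens, model):
    """Average the word vectors of in-vocabulary tokens; zeros if none found."""
    vecs = [model.wv[t] for t in tokens if t in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(model.vector_size)

X_train = np.vstack([doc_vector(d, w2v) for d in train_docs])
clf = LogisticRegression(max_iter=1000).fit(X_train, train_labels)
print(clf.predict(X_train))
```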
Terms Mentioned: pre-trained language models, multilingual pre-trained language models, minority languages, MiLMo, MiTC, word2vec model, downstream task
Technologies / Libraries Mentioned: BERT, ELMo, Transformer, GPT, ALBERT, SpanBERT, RoBERTa, XLM, XLM-R, mBERT