MiLMo: Minority Multilingual Pre-trained Language Model - Summary

Arxiv URL: https://arxiv.org/abs/2212.01779v2

Authors: Junjie Deng, Hanru Shi, Xinhe Yu, Wugedele Bao, Yuan Sun, Xiaobing Zhao

Summary:

The paper presents MiLMo, a multilingual pre-trained language model that performs better on minority-language tasks, covering Mongolian, Tibetan, Uyghur, Kazakh, and Korean. The authors also construct MiTC, a minority multilingual text classification dataset, and train a word2vec model for each language to provide a reference scheme for downstream-task research on minority languages.
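
The summary states that a separate word2vec model is trained for each language as a static-embedding baseline, but it does not give the corpora, tokenization, or hyperparameters. A minimal gensim sketch under assumed settings (illustrative file layout, 300-dimensional skip-gram vectors) might look like the following; none of these choices are confirmed by the paper.

```python
# Hedged sketch: one word2vec model per minority language, as described in
# the summary. File paths, tokenization, and hyperparameters are assumptions.
from gensim.models import Word2Vec

LANGUAGES = ["mongolian", "tibetan", "uyghur", "kazakh", "korean"]

def load_tokenized_corpus(path):
    """Yield one whitespace-tokenized sentence per line.

    Real minority-language corpora would need language-specific segmentation
    (e.g., syllable segmentation for Tibetan); splitting on whitespace here
    is only a placeholder.
    """
    with open(path, encoding="utf-8") as f:
        for line in f:
            tokens = line.strip().split()
            if tokens:
                yield tokens

for lang in LANGUAGES:
    sentences = list(load_tokenized_corpus(f"corpora/{lang}.txt"))  # assumed layout
    model = Word2Vec(
        sentences,
        vector_size=300,   # assumed embedding dimension
        window=5,
        min_count=2,
        sg=1,              # skip-gram; CBOW (sg=0) is the other common choice
        workers=4,
        epochs=10,
    )
    model.save(f"word2vec_{lang}.model")
```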

Key Insights & Learnings:

  • MiLMo is a multilingual pre-trained language model that performs better on minority-language tasks than existing multilingual pre-trained models.
  • MiTC is a minority multilingual text classification dataset, constructed to address the scarcity of minority-language datasets.
  • MiLMo outperforms the word2vec representation on the downstream task of text classification (see the sketch after this list).
  • The paper provides the best scheme for downstream-task research on minority languages.
  • Existing multilingual pre-trained models do not work well on minority languages, which seriously hinders the informatization of these languages.
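
The MiLMo-versus-word2vec comparison on MiTC follows the standard recipe of fine-tuning a pre-trained encoder for sequence classification. Since neither the MiLMo checkpoint identifier nor the MiTC file format appears in this summary, the sketch below uses xlm-roberta-base (one of the models listed under Technologies) as a stand-in and assumes a hypothetical CSV layout with "text" and "label" columns; it is illustrative only, not the authors' setup.

```python
# Hedged sketch: fine-tune a multilingual encoder for text classification on a
# MiTC-style dataset. Model name, file paths, and label count are assumptions.
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    Trainer,
    TrainingArguments,
)
from datasets import load_dataset

MODEL_NAME = "xlm-roberta-base"   # stand-in for the MiLMo checkpoint
NUM_LABELS = 10                   # assumed number of MiTC classes

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME, num_labels=NUM_LABELS
)

# Hypothetical CSV layout: columns "text" (string) and "label" (int).
dataset = load_dataset(
    "csv", data_files={"train": "mitc_train.csv", "test": "mitc_test.csv"}
)

def tokenize(batch):
    # Fixed-length padding keeps the default data collator happy in a sketch.
    return tokenizer(batch["text"], truncation=True,
                     padding="max_length", max_length=256)

dataset = dataset.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="mitc_classifier",
        num_train_epochs=3,
        per_device_train_batch_size=16,
    ),
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
)
trainer.train()
print(trainer.evaluate())
```

A word2vec baseline would instead average (or pool) the per-language static vectors from the sketch above and feed them to a lightweight classifier, which is the kind of comparison the third bullet describes.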


Terms Mentioned: pre-trained language models, multilingual pre-trained language models, minority languages, MiLMo, MiTC, word2vec model, downstream task

Technologies / Libraries Mentioned: BERT, ELMo, Transformer, GPT, ALBERT, SpanBERT, RoBERTa, XLM, XLM-R, mBERT