Comparing Pre-Training Schemes for Luxembourgish BERT Models

Cedric Lothritz, Saad Ezzini, Christoph Purschke, Tegawendé F. Bissyandé, Jacques Klein, Isabella Olariu, Andrey Boytsov, Clément Lefebvre, Anne Goujon

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

2 Scopus citations

Abstract

Despite the widespread use of pre-trained models in NLP, well-performing pre-trained models for low-resource languages are scarce. To address this issue, we propose two novel BERT models for the Luxembourgish language that improve on the state of the art. We also present an empirical study on both the performance and robustness of the investigated BERT models. We compare the models on a set of downstream NLP tasks and evaluate their robustness against different types of data perturbations. Additionally, we provide novel datasets to evaluate the performance of Luxembourgish language models. Our findings reveal that pre-training a pre-loaded model has a positive effect on both the performance and robustness of fine-tuned models, and that using the German GottBERT model yields higher performance while the multilingual mBERT results in a more robust model. This study provides valuable insights for researchers and practitioners working with low-resource languages and highlights the importance of considering pre-training strategies when building language models.
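The "pre-loaded" strategy discussed in the abstract amounts to continued masked-language-model pre-training: a checkpoint such as GottBERT or mBERT is loaded and further pre-trained on Luxembourgish text before fine-tuning. The sketch below illustrates this idea with the Hugging Face Transformers library; it is not the authors' released code, and the corpus path, checkpoint names, and hyperparameters are illustrative assumptions.

```python
# Minimal sketch of continued MLM pre-training from a pre-loaded checkpoint
# (e.g. German GottBERT or multilingual mBERT) on a Luxembourgish corpus.
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)
from datasets import load_dataset

# Assumed checkpoint name; "bert-base-multilingual-cased" would be the mBERT variant.
checkpoint = "uklfr/gottbert-base"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForMaskedLM.from_pretrained(checkpoint)

# Hypothetical plain-text Luxembourgish corpus, one document per line.
corpus = load_dataset("text", data_files={"train": "lb_corpus.txt"})["train"]
corpus = corpus.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
    remove_columns=["text"],
)

# Standard BERT-style masking of 15% of the tokens.
collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)
args = TrainingArguments(
    output_dir="lux-bert-continued",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    learning_rate=5e-5,
)

# Continue pre-training; the resulting model is then fine-tuned on downstream tasks.
Trainer(model=model, args=args, train_dataset=corpus,
        data_collator=collator).train()
```

The resulting checkpoint can then be fine-tuned on the downstream tasks studied in the paper in the same way as any other BERT model.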

Original language: English
Title of host publication: 19th Conference on Natural Language Processing, KONVENS 2023 - Proceedings of the Conference
Editors: Munir Georges, Aaricia Herygers, Annemarie Friedrich, Benjamin Roth
Publisher: Association for Computational Linguistics (ACL)
Pages: 17-27
Number of pages: 11
ISBN (Electronic): 9798891760295
State: Published - 2023
Externally published: Yes
Event: 19th Conference on Natural Language Processing, KONVENS 2023 - Ingolstadt, Germany
Duration: 18 Sep 2023 - 22 Sep 2023

Publication series

Name: 19th Conference on Natural Language Processing, KONVENS 2023 - Proceedings of the Conference

Conference

Conference: 19th Conference on Natural Language Processing, KONVENS 2023
Country/Territory: Germany
City: Ingolstadt
Period: 18/09/23 - 22/09/23

Bibliographical note

Publisher Copyright:
© 2023 Association for Computational Linguistics.

Keywords

  • BERT
  • Downstream NLP tasks
  • GottBERT
  • Language models
  • Low-resource languages
  • LuxemBERT
  • Luxembourgish
  • Pre-training

ASJC Scopus subject areas

  • Software
