Abstract
Despite the widespread use of pre-trained models in NLP, well-performing pre-trained models for low-resource languages are scarce. To address this issue, we propose two novel BERT models for the Luxembourgish language that improve on the state of the art. We also present an empirical study of both the performance and the robustness of the investigated BERT models. We compare the models on a set of downstream NLP tasks and evaluate their robustness against different types of data perturbations. Additionally, we provide novel datasets for evaluating the performance of Luxembourgish language models. Our findings reveal that continuing pre-training from an already pre-trained (pre-loaded) model has a positive effect on both the performance and the robustness of the fine-tuned models, and that starting from the German GottBERT model yields higher performance, while the multilingual mBERT results in a more robust model. This study provides valuable insights for researchers and practitioners working with low-resource languages and highlights the importance of considering pre-training strategies when building language models.
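The "pre-loaded" strategy described in the abstract amounts to continued masked-language-model pre-training of an existing checkpoint (e.g., mBERT or GottBERT) on Luxembourgish text before fine-tuning. The following is a minimal sketch of that idea using the HuggingFace Transformers library; it is not the authors' code, and the corpus file name, checkpoint choice, and hyperparameters are illustrative assumptions.

```python
# Minimal sketch (not the authors' implementation): continued masked-language-model
# pre-training of an already pre-trained ("pre-loaded") checkpoint on Luxembourgish
# text. The corpus path and hyperparameters are hypothetical.
from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

checkpoint = "bert-base-multilingual-cased"  # mBERT; a German model such as GottBERT could be used analogously
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForMaskedLM.from_pretrained(checkpoint)

# Hypothetical plain-text Luxembourgish corpus, one document per line.
raw = load_dataset("text", data_files={"train": "luxembourgish_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = raw.map(tokenize, batched=True, remove_columns=["text"])

# Dynamic masking for the MLM objective (15% of tokens masked per batch).
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="luxembert-continued",
    per_device_train_batch_size=16,
    num_train_epochs=3,
    learning_rate=5e-5,
    save_steps=10_000,
)

Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    data_collator=collator,
).train()
```

The resulting checkpoint would then be fine-tuned on the downstream tasks and evaluated under data perturbations, as the abstract describes.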
| Original language | English |
|---|---|
| Title of host publication | 19th Conference on Natural Language Processing, KONVENS 2023 - Proceedings of the Conference |
| Editors | Munir Georges, Aaricia Herygers, Annemarie Friedrich, Benjamin Roth |
| Publisher | Association for Computational Linguistics (ACL) |
| Pages | 17-27 |
| Number of pages | 11 |
| ISBN (Electronic) | 9798891760295 |
| State | Published - 2023 |
| Externally published | Yes |
| Event | 19th Conference on Natural Language Processing, KONVENS 2023, Ingolstadt, Germany; Duration: 18 Sep 2023 → 22 Sep 2023 |
Publication series
| Name | 19th Conference on Natural Language Processing, KONVENS 2023 - Proceedings of the Conference |
|---|---|
Conference
| Conference | 19th Conference on Natural Language Processing, KONVENS 2023 |
|---|---|
| Country/Territory | Germany |
| City | Ingolstadt |
| Period | 18/09/23 → 22/09/23 |
Bibliographical note
Publisher Copyright: © 2023 Association for Computational Linguistics.
Keywords
- BERT
- Downstream NLP tasks
- GottBERT
- Language models
- Low-resource languages
- LuxemBERT
- Luxembourgish
- Pre-training
ASJC Scopus subject areas
- Software