LuxemBERT – A Language Model for the Luxembourgish Language

Video recording:

Speaker: Cedric Lothritz (Interdisciplinary Centre for Security, Reliability and Trust; University of Luxembourg)
Title: LuxemBERT – A Language Model for the Luxembourgish Language
Time: Wednesday, 2023.03.15, 10:00 a.m. (CET)
Place: fully virtual (contact Dr. Jakub Lengiewicz to register)
Format: 30 min. presentation + 30 min. discussion

Abstract: Pre-trained language models such as BERT have become ubiquitous in NLP, achieving state-of-the-art performance on most tasks. While these models are readily available for English and other widely spoken languages, they remain scarce for low-resource languages such as Luxembourgish.

In this talk, I will give a quick overview of the language-model architectures leading up to BERT. The main topic, however, will be LuxemBERT, a BERT model for the Luxembourgish language that we created using the following approach: we augmented the pre-training dataset with text data from a closely related language, which we partially translated using a simple, straightforward method. Our LuxemBERT model outperforms the multilingual mBERT model, which, at the time of creation, was the only transformer-based language model available for Luxembourgish. I will conclude with a quick demo of a Luxembourgish chatbot built on LuxemBERT.
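The partial-translation idea can be illustrated with a minimal sketch. The function and the toy lexicon below are hypothetical, assuming a dictionary-based approach that swaps in Luxembourgish words wherever a bilingual lexicon has an entry and leaves the rest of the source-language text untouched; the talk will describe the actual method used.

```python
def partial_translate(sentence, lexicon):
    """Replace each word found in the bilingual lexicon; keep other words as-is.

    Hypothetical illustration of dictionary-based partial translation
    for augmenting a low-resource pre-training corpus.
    """
    out = []
    for tok in sentence.split():
        # Strip simple punctuation so the dictionary lookup can match.
        core = tok.strip(".,!?").lower()
        out.append(lexicon.get(core, tok))
    return " ".join(out)

# Toy German-to-Luxembourgish lexicon (illustrative entries only).
lexicon = {"und": "an", "ich": "ech", "nicht": "net"}
print(partial_translate("ich lese und schreibe", lexicon))
# → ech lese an schreibe
```

Even such a shallow word-level substitution can shift the vocabulary distribution of the augmented text toward the target language without requiring a full machine-translation system.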

Cedric Lothritz received his Master’s degree in Computer Science from the University of Fribourg (Switzerland) in 2017. His research interests are in the fields of machine learning and natural language processing. Cedric joined the Security, Design and Validation research group (SERVAL), headed by Prof. Yves Le Traon, and he is advised by Dr. Jacques Klein and Dr. Tegawendé Bissyande.