The Multilingual Corpus of World’s Constitutions (MCWC)

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

The “Multilingual Corpus of World’s Constitutions” (MCWC) is a rich resource available in English, Arabic, and Spanish, encompassing constitutions from various nations. This corpus serves as a vital asset for the NLP community, facilitating advanced research in constitutional analysis, machine translation, and cross-lingual legal studies. To ensure comprehensive coverage, for constitutions not originally available in Arabic and Spanish, we employed a fine-tuned state-of-the-art machine translation model. MCWC prepares its data to ensure high quality and minimal noise, while also providing valuable mappings of constitutions to their respective countries and continents, facilitating comparative analysis. Notably, the corpus offers pairwise sentence alignments across languages, supporting machine translation experiments. We utilise a leading Machine Translation model, fine-tuned on the MCWC to achieve accurate and context-aware translations. Additionally, we introduce an independent Machine Translation model as a comparative baseline. Fine-tuning the model on MCWC improves accuracy, highlighting the significance of such a legal corpus for NLP and Machine Translation. MCWC’s diverse multilingual content and commitment to data quality contribute to advancements in legal text analysis within the NLP community, facilitating exploration of constitutional texts and multilingual data analysis.

Original language: English
Title of host publication: 6th Workshop on Open-Source Arabic Corpora and Processing Tools, OSACT 2024 with Shared Tasks on Arabic LLMs Hallucination and Dialect to MSA Machine Translation at LREC-COLING 2024 - Workshop Proceedings
Editors: Hend Al-Khalifa, Kareem Darwish, Hamdy Mubarak, Mona Ali, Tamer Elsayed
Publisher: European Language Resources Association (ELRA)
Pages: 57-66
Number of pages: 10
ISBN (Electronic): 9782493814364
State: Published - 2024
Externally published: Yes
Event: 6th Workshop on Open-Source Arabic Corpora and Processing Tools, OSACT 2024 with Shared Tasks on Arabic LLMs Hallucination and Dialect to MSA Machine Translation - Torino, Italy
Duration: 25 May 2024 → …

Publication series

Name: 6th Workshop on Open-Source Arabic Corpora and Processing Tools, OSACT 2024 with Shared Tasks on Arabic LLMs Hallucination and Dialect to MSA Machine Translation at LREC-COLING 2024 - Workshop Proceedings

Conference

Conference: 6th Workshop on Open-Source Arabic Corpora and Processing Tools, OSACT 2024 with Shared Tasks on Arabic LLMs Hallucination and Dialect to MSA Machine Translation
Country/Territory: Italy
City: Torino
Period: 25/05/24 → …

Bibliographical note

Publisher Copyright:
© 2024 ELRA Language Resource Association.

Keywords

  • Constitutions
  • Corpus
  • Fine-tuning
  • Legal Documents
  • Machine Translation

ASJC Scopus subject areas

  • Language and Linguistics
  • Education
  • Library and Information Sciences
  • Linguistics and Language
