The KIND Dataset: A Social Collaboration Approach for Nuanced Dialect Data Collection

Asma Z. Yamani, Raghad Alziyady, Reem AlYami, Salma A. Albelali, Leina Abouhagar, Jawharah Almulhim, Amjad Alsulami, Motaz Alfarraj, Rabeah Al-Zaidy

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Nuanced dialects are linguistic variants that pose several challenges for NLP models and techniques. One of the main challenges is the limited number of datasets available for extensive research and experimentation. We propose an approach for efficiently collecting nuanced dialectal datasets that are not only of high quality but also versatile enough to serve multiple purposes. To test our approach, we collect the KIND corpus, a collection of fine-grained Arabic dialect data. The data consists of short texts and, unlike many nuanced dialectal datasets, is curated manually through social collaboration rather than crawled from social media. The collaborative approach is incentivized through educational gamification and competitions, and the community itself benefits from the resulting open-source dataset. Our approach aims to: (1) cover dialects from under-represented groups and fine-grained dialectal varieties, (2) provide aligned parallel corpora between Modern Standard Arabic (MSA) and multiple dialects to enable translation and comparison studies, and (3) promote innovative approaches for nuanced dialect data collection. We describe the steps of the competition as well as the resulting datasets and the competing data collection systems. The KIND dataset is shared with the research community.

Original language: English
Title of host publication: EACL 2024 - 18th Conference of the European Chapter of the Association for Computational Linguistics, Proceedings of the Student Research Workshop
Editors: Neele Falk, Sara Papi, Mike Zhang
Publisher: Association for Computational Linguistics (ACL)
Pages: 32-43
Number of pages: 12
ISBN (Electronic): 9798891760905
State: Published - 2024
Event: 18th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2024 - Student Research Workshop, SRW 2024 - St. Julian's, Malta
Duration: 21 Mar 2024 - 22 Mar 2024

Publication series

Name: EACL 2024 - 18th Conference of the European Chapter of the Association for Computational Linguistics, Proceedings of the Student Research Workshop

Conference

Conference: 18th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2024 - Student Research Workshop, SRW 2024
Country/Territory: Malta
City: St. Julian's
Period: 21/03/24 - 22/03/24

Bibliographical note

Publisher Copyright: © 2024 Association for Computational Linguistics.

ASJC Scopus subject areas

  • Computational Theory and Mathematics
  • Software
  • Linguistics and Language
