Abstract
Purpose/Aim and Background: Although Arabic is spoken in twenty-two countries by more than 250 million speakers, Natural Language Processing (NLP) practitioners still consider it a low-resource language. Formal Arabic texts are typically written in Modern Standard (or Written) Arabic (MSA), the form used in formal writing and taught in schools to Arabic speakers. Informal communication among Arabic speakers, however, takes place in local diglossic dialects. A diglossic language is one where speakers of the same language have varying dialects. In Arabic, there are multiple dialects in different regions of the Arab world: Gulf, Levantine and North African. Users commonly communicate on social media in their local dialect rather than in formal MSA. This introduces a core NLP problem for Arabic: dialect identification. It is essential to identify the specific dialect before performing tasks such as parsing and tokenizing, and downstream tasks such as semantic inference. Processing massive amounts of data written in these local dialects requires this identification step to improve accuracy, especially for automatic text comprehension tasks. Although Arabic dialects share a majority of their words, it is not uncommon for the same word to have different meanings across dialects. In addition to improving NLP task accuracy, Arabic Dialect Identification (ADI) enables finer-grained demographic identification when mining texts such as consumer reports, health forums, entertainment and tourism reviews, and many others, which ultimately leads to improved services for each demographic. The problem of ADI has been addressed by several studies, such as (Al-Walaie & Khan, 2017) and (Harrat et al., 2019).
Some works focus mainly on curating datasets for the problem, such as the Sham dataset proposed by Abu Kwaik et al. (2018). In this work we focus on both tasks: we curate an Arabic dialect dataset for two variants of Arabic (Saudi Arabian and Egyptian), and we train supervised machine learning models to address the identification task.
| Original language | English |
|---|---|
| Title of host publication | ICCAIS 2020 - 3rd International Conference on Computer Applications and Information Security |
| Publisher | Institute of Electrical and Electronics Engineers Inc. |
| ISBN (Electronic) | 9781728142128 |
| State | Published - Mar 2020 |
Publication series
| Name | ICCAIS 2020 - 3rd International Conference on Computer Applications and Information Security |
|---|---|
Bibliographical note
Publisher Copyright: © 2020 IEEE.
Keywords
- Machine Learning
- NLP
- Text Classification
ASJC Scopus subject areas
- Computer Science Applications
- Information Systems
- Software
- Information Systems and Management
- Safety, Risk, Reliability and Quality
- Computer Networks and Communications