
Enhancing Sindhi Word Segmentation Using Subword Representation Learning and Position-Aware Self-Attention

  • Wazir Ali*
  • Jay Kumar
  • Saifullah Tumrani
  • Redhwan Nour
  • Adeeb Noor
  • Zenglin Xu

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

Sindhi word segmentation is challenging because of space omission and space insertion. The script adds to the difficulty: it is cursive, and its characters have inherent joining and non-joining properties that are independent of word boundaries. Existing Sindhi word segmentation methods rely on designing and combining hand-crafted features. These methods have limitations, such as difficulty handling out-of-vocabulary words, limited robustness for other languages, and inefficiency with large amounts of noisy or raw text. Neural network-based models, in contrast, can automatically capture word boundary information without requiring prior knowledge. In this paper, we propose a Subword-Guided Neural Word Segmenter (SGNWS) that addresses word segmentation as a sequence labeling task. The SGNWS model incorporates subword representation learning through a bidirectional long short-term memory encoder, position-aware self-attention, and a conditional random field. Our empirical results demonstrate that the SGNWS model achieves state-of-the-art performance in Sindhi word segmentation on six datasets.
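
To illustrate what "word segmentation as a sequence labeling task" can mean in practice, here is a minimal sketch that converts a segmented sentence into per-character boundary tags and back. The B/I/E/S tag scheme, the function names `words_to_tags` and `tags_to_words`, and the Latin-letter example are assumptions made for illustration; the abstract does not state which tag set SGNWS uses, and the neural components (BiLSTM encoder, position-aware self-attention, CRF) that would predict the tags are not shown.

```python
from typing import List


def words_to_tags(words: List[str]) -> List[str]:
    """Map every character of every word to a boundary tag.

    Assumed scheme: B = begin, I = inside, E = end, S = single-character word.
    """
    tags: List[str] = []
    for word in words:
        if not word:
            continue  # ignore empty tokens
        if len(word) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["I"] * (len(word) - 2) + ["E"])
    return tags


def tags_to_words(chars: str, tags: List[str]) -> List[str]:
    """Rebuild words from an unspaced character sequence and its tags."""
    words: List[str] = []
    current = ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):  # a word ends at this character
            words.append(current)
            current = ""
    if current:  # flush a trailing word that never received E or S
        words.append(current)
    return words


# Hypothetical example: the gold segmentation and its unspaced form.
gold_words = ["a", "new", "model"]
unspaced = "".join(gold_words)
gold_tags = words_to_tags(gold_words)
recovered = tags_to_words(unspaced, gold_tags)
```

Under this framing, a segmenter only has to predict one tag per character; space omission corresponds to missing boundaries in the input, which the tags restore.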

Original language: English
Pages (from-to): 183133-183142
Number of pages: 10
Journal: IEEE Access
Volume: 12
DOIs
State: Published - 2024
Externally published: Yes

Bibliographical note

Publisher Copyright:
© 2024 The Authors. This work is licensed under a Creative Commons Attribution 4.0 License.

Keywords

  • Attention mechanism
  • long short-term memory
  • neural network
  • representation learning
  • word segmentation

ASJC Scopus subject areas

  • General Computer Science
  • General Materials Science
  • General Engineering
