Abstract
Sindhi word segmentation is challenging because spaces are frequently omitted or inserted in the wrong place. The script adds to this complexity: it is cursive, and its characters have inherent joining and non-joining properties that are independent of word boundaries. Existing Sindhi word segmentation methods rely on designing and combining hand-crafted features. These methods have limitations, such as difficulty handling out-of-vocabulary words, limited robustness when applied to other languages, and inefficiency on large amounts of noisy or raw text. Neural network-based models, in contrast, can capture word boundary information automatically, without requiring prior knowledge. In this paper, we propose a Subword-Guided Neural Word Segmenter (SGNWS) that treats word segmentation as a sequence labeling task. The SGNWS model incorporates subword representation learning through a bidirectional long short-term memory encoder, position-aware self-attention, and a conditional random field. Our empirical results demonstrate that the SGNWS model achieves state-of-the-art performance in Sindhi word segmentation on six datasets.
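To make the sequence labeling framing concrete, the sketch below converts a segmented sentence into per-character boundary tags and back. The B/M/E/S tag set, the function names `words_to_tags` and `tags_to_words`, and the Latin placeholder text are assumptions made for illustration; the abstract does not state which tag scheme SGNWS uses, and the neural encoder, attention, and CRF layers are not reproduced here.

```python
from typing import List


def words_to_tags(words: List[str]) -> List[str]:
    """Label each character B(egin), M(iddle), E(nd), or S(ingle-character word).

    Hypothetical helper: the tag scheme is an assumption, not taken from the paper.
    """
    tags: List[str] = []
    for word in words:
        if not word:
            continue  # skip empty tokens so they produce no tags
        if len(word) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return tags


def tags_to_words(chars: str, tags: List[str]) -> List[str]:
    """Rebuild words from an unspaced character string and its boundary tags."""
    words: List[str] = []
    current = ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):  # a word closes on E or S
            words.append(current)
            current = ""
    if current:  # tolerate a tag sequence that ends without E or S
        words.append(current)
    return words


# Placeholder Latin text; a real pipeline would operate on Sindhi code points.
gold_words = ["neural", "word", "segmenter"]
unspaced = "".join(gold_words)
tags = words_to_tags(gold_words)
recovered = tags_to_words(unspaced, tags)
```

Under this framing, a segmenter only has to predict one tag per character, and word boundaries follow from the tags, which is why omitted or extra spaces in the input do not need to be trusted.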
| Original language | English |
|---|---|
| Pages (from-to) | 183133-183142 |
| Number of pages | 10 |
| Journal | IEEE Access |
| Volume | 12 |
| DOIs | |
| State | Published - 2024 |
| Externally published | Yes |
Bibliographical note
Publisher Copyright: © 2024 The Authors. This work is licensed under a Creative Commons Attribution 4.0 License.
Keywords
- Attention mechanism
- long short-term memory
- neural network
- representation learning
- word segmentation
ASJC Scopus subject areas
- General Computer Science
- General Materials Science
- General Engineering
Fingerprint
Dive into the research topics of 'Enhancing Sindhi Word Segmentation Using Subword Representation Learning and Position-Aware Self-Attention'. Together they form a unique fingerprint.