Abstract
Large language models (LLMs) demonstrate impressive capabilities across a range of natural language processing (NLP) tasks. However, they are highly sensitive to prompt design, which significantly affects their ability to align outputs with user intent. Poorly crafted prompts can result in misleading or irrelevant responses, yet selecting the most effective prompt from several candidates remains an open challenge. Despite the growing importance of prompt engineering, there is no comprehensive framework for systematically evaluating prompts across multiple dimensions, such as similarity, performance, efficiency, and consistency, particularly in scenarios where performance can be traded off against computational cost or consistency. In this study, we propose a novel scoring framework to evaluate handcrafted prompts across four essential dimensions: similarity, performance, efficiency (measured by latency, input tokens, and output tokens), and consistency. Taking Arabic, a relatively low-resource, morphologically rich language, as a case study, we evaluate the framework on six diverse text classification tasks: dialect identification, sentiment analysis, offensive language detection, stance detection, emotion detection, and sarcasm detection. Our methodology assesses prompts across multiple LLMs (GPT-4o mini, LLaMA, ALLAM, and Claude 3.5 Haiku), providing insights into model-specific and task-specific performance patterns. The results demonstrate that no single prompt universally excels across all dimensions; rather, the optimal prompt varies with task requirements and evaluation priorities. The proposed framework enables the identification of the most effective prompt for each application context while revealing important trade-offs between performance metrics. By addressing the unique challenges of Arabic NLP, this research not only advances prompt engineering for underrepresented languages but also provides a systematic, adaptable methodology for prompt evaluation that can enhance LLM performance across diverse linguistic contexts, domains, tasks, and model architectures.
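To make the four-dimensional evaluation concrete, the sketch below shows one way such a composite prompt score could be computed. The abstract does not specify the paper's actual formula, so the normalization, weights, and all names here (`PromptMetrics`, `efficiency`, `prompt_score`, the bounds `max_latency_s`, `max_in_tok`, `max_out_tok`) are illustrative assumptions, not the authors' method.

```python
# Hypothetical sketch of a multi-dimensional prompt score. The paper's actual
# scoring formula is not given in the abstract; the normalization scheme and
# the equal-by-default weights below are assumptions for illustration only.
from dataclasses import dataclass


@dataclass
class PromptMetrics:
    similarity: float    # e.g., semantic similarity to the task intent, in [0, 1]
    performance: float   # e.g., macro-F1 on the classification task, in [0, 1]
    latency_s: float     # mean response latency in seconds
    input_tokens: float  # mean input tokens per request
    output_tokens: float # mean output tokens per request
    consistency: float   # e.g., agreement across repeated runs, in [0, 1]


def efficiency(m: PromptMetrics, max_latency_s: float,
               max_in_tok: float, max_out_tok: float) -> float:
    """Map latency and token counts onto [0, 1]; higher means cheaper/faster."""
    cost = (m.latency_s / max_latency_s
            + m.input_tokens / max_in_tok
            + m.output_tokens / max_out_tok) / 3.0
    return max(0.0, 1.0 - cost)


def prompt_score(m: PromptMetrics,
                 weights: tuple[float, float, float, float] = (0.25, 0.25, 0.25, 0.25),
                 max_latency_s: float = 10.0,
                 max_in_tok: float = 1000.0,
                 max_out_tok: float = 200.0) -> float:
    """Weighted sum over the four dimensions; weights encode evaluation priorities."""
    w_sim, w_perf, w_eff, w_con = weights
    return (w_sim * m.similarity
            + w_perf * m.performance
            + w_eff * efficiency(m, max_latency_s, max_in_tok, max_out_tok)
            + w_con * m.consistency)


# Rank two candidate prompts for, say, sentiment analysis, weighting
# performance most heavily (the metric values are made up for the example).
candidates = {
    "prompt_a": PromptMetrics(0.82, 0.74, 1.9, 310, 12, 0.91),
    "prompt_b": PromptMetrics(0.78, 0.79, 3.4, 540, 40, 0.85),
}
best = max(candidates,
           key=lambda k: prompt_score(candidates[k], weights=(0.1, 0.6, 0.15, 0.15)))
print(best)
```

Shifting the weight vector is what surfaces the trade-offs the abstract describes: a latency-sensitive deployment would raise the efficiency weight and may select a different winning prompt than an accuracy-first evaluation.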
| Original language | English |
|---|---|
| Pages (from-to) | 171468-171492 |
| Number of pages | 25 |
| Journal | IEEE Access |
| Volume | 13 |
| DOIs | |
| State | Published - 2025 |
Bibliographical note
Publisher Copyright: © 2013 IEEE.
Keywords
- Arabic
- LLMs
- NLP
- large language models
- natural language processing
- prompt design
- prompt engineering
- prompt scoring
ASJC Scopus subject areas
- General Computer Science
- General Materials Science
- General Engineering