
A component-level evaluation framework for diagnosing LLM errors in optimization modelling

  • Dania Refai*
  • Moataz Ahmed

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

Large language models (LLMs) are increasingly used to convert natural language descriptions into mathematical optimization formulations. Current evaluations often treat a formulation as a whole, relying on coarse metrics such as solution accuracy or runtime, which obscure structural and numerical errors. In this study, we present a comprehensive, component-level evaluation framework for LLM-generated formulations. Beyond the conventional optimality gap, the framework introduces precision and recall for decision variables and constraints, root mean squared error for constraints and objectives (Cons-RMSE and Obj-RMSE), and efficiency indicators based on token usage and latency. We evaluate GPT-5, LLaMA 3.1 Instruct, and DeepSeek Math on optimization problems of varying complexity under six prompting strategies. Results show that GPT-5 consistently outperforms the other models, with Chain-of-Thought, Self-Consistency, and Modular Prompting proving the most effective strategies. Our analysis indicates that solver performance depends primarily on high Constraint Recall and low Cons-RMSE, which together ensure structural correctness and solution reliability. Constraint Precision and decision variable metrics play secondary roles, while concise outputs enhance computational efficiency. These findings highlight three principles for NLP-to-optimization modeling: (i) complete constraint coverage prevents violations, (ii) minimizing Cons-RMSE ensures solver-level accuracy, and (iii) concise outputs improve computational efficiency. The proposed framework establishes a foundation for fine-grained, diagnostic evaluation of LLMs in optimization modeling.
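
To make the component-level metrics named in the abstract more concrete, the following is a minimal Python sketch of how such quantities could be computed. The abstract does not give the paper's exact definitions, so everything here is an assumption for illustration: components are matched by name, precision and recall are set-based, the RMSE is taken over coefficients keyed by variable name (a missing coefficient counts as 0), and the optimality gap is a relative difference in objective value. The function names `precision_recall`, `coefficient_rmse`, and `optimality_gap` are hypothetical and are not taken from the paper.

```python
import math


def precision_recall(predicted, reference):
    """Set-based precision and recall over component names (assumed matching rule)."""
    predicted, reference = set(predicted), set(reference)
    matched = len(predicted & reference)
    precision = matched / len(predicted) if predicted else 0.0
    recall = matched / len(reference) if reference else 0.0
    return precision, recall


def coefficient_rmse(predicted, reference):
    """RMSE over coefficients keyed by name; a missing coefficient is treated as 0."""
    keys = set(predicted) | set(reference)
    if not keys:
        return 0.0
    squared_error = sum(
        (predicted.get(k, 0.0) - reference.get(k, 0.0)) ** 2 for k in keys
    )
    return math.sqrt(squared_error / len(keys))


def optimality_gap(candidate_objective, reference_objective):
    """Relative gap between the candidate and reference objective values."""
    denominator = max(abs(reference_objective), 1e-12)  # guard against division by zero
    return abs(candidate_objective - reference_objective) / denominator


# Hypothetical example: the generated model recovers 2 of 4 reference constraints.
cons_precision, cons_recall = precision_recall(
    {"c1", "c2"}, {"c1", "c2", "c3", "c4"}
)

# Hypothetical example: every objective coefficient is off by 3.
obj_rmse = coefficient_rmse({"x": 4.0, "y": 5.0}, {"x": 1.0, "y": 2.0})

# Hypothetical example: the solver reaches 90 where the reference optimum is 100.
gap = optimality_gap(90.0, 100.0)
```

In this toy case the generated formulation has perfect constraint precision but only half the constraint recall, which is the kind of gap a whole-formulation metric would hide and that the abstract identifies as a primary driver of solver performance.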

Original language: English
Article number: 133981
Journal: Neurocomputing
Volume: 695
State: Published - 28 Sep 2026

Bibliographical note

Publisher Copyright:
© 2026 Elsevier B.V.

Keywords

  • Combinatorial optimization
  • Fine-tuning
  • In-context learning
  • Large language models
  • Linear programming
  • Optimization
  • Optimization modeling
  • Prompt engineering

ASJC Scopus subject areas

  • Computer Science Applications
  • Cognitive Neuroscience
  • Artificial Intelligence
