Abstract
Large language models (LLMs) are increasingly used to convert natural language descriptions into mathematical optimization formulations. Current evaluations often treat formulations as a whole, relying on coarse metrics such as solution accuracy or runtime, which obscure structural or numerical errors. In this study, we present a comprehensive, component-level evaluation framework for LLM-generated formulations. Beyond the conventional optimality gap, our framework introduces metrics such as precision and recall for decision variables and constraints, constraint and objective root mean squared error (Cons-RMSE and Obj-RMSE), and efficiency indicators based on token usage and latency. We evaluate GPT-5, LLaMA 3.1 Instruct, and DeepSeek Math across optimization problems of varying complexity under six prompting strategies. Results show that GPT-5 consistently outperforms other models, with Chain-of-Thought, Self-Consistency, and Modular Prompting proving most effective. Analysis indicates that solver performance depends primarily on high Constraint Recall and low Cons-RMSE, which together ensure structural correctness and solution reliability. Constraint Precision and decision variable metrics play secondary roles, while concise outputs enhance computational efficiency. These findings highlight three principles for NLP-to-optimization modeling: (i) complete constraint coverage prevents violations, (ii) minimizing Cons-RMSE ensures solver-level accuracy, and (iii) concise outputs improve computational efficiency. The proposed framework establishes a foundation for fine-grained, diagnostic evaluation of LLMs in optimization modeling.
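The abstract names the component-level metrics but does not give their formulas. As a rough illustration only, the sketch below assumes that precision and recall are computed over sets of matched component labels and that Cons-RMSE compares numeric constraint values (for example, right-hand sides) of a generated formulation against a reference. The function names, the label-matching rule, and the example data are all hypothetical and are not taken from the paper.

```python
from math import sqrt


def precision_recall(generated, reference):
    """Set-based precision and recall over component labels (assumed matching rule)."""
    generated, reference = set(generated), set(reference)
    matched = generated & reference
    precision = len(matched) / len(generated) if generated else 0.0
    recall = len(matched) / len(reference) if reference else 0.0
    return precision, recall


def rmse(generated_values, reference_values):
    """Root mean squared error between two equal-length numeric sequences."""
    if len(generated_values) != len(reference_values):
        raise ValueError("sequences must have the same length")
    if not reference_values:
        return 0.0
    squared = [(g - r) ** 2 for g, r in zip(generated_values, reference_values)]
    return sqrt(sum(squared) / len(squared))


# Hypothetical example: the generated model misses one reference constraint
# and adds one constraint that the reference does not contain.
reference_constraints = {"capacity", "demand", "nonnegativity"}
generated_constraints = {"capacity", "demand", "budget"}
p, r = precision_recall(generated_constraints, reference_constraints)

# Hypothetical right-hand-side values of the constraints both models share,
# listed in the same order for the generated and the reference formulation.
cons_rmse = rmse([2.0, 3.0, 10.0], [2.0, 4.0, 10.0])

print(f"constraint precision={p:.3f}, recall={r:.3f}, Cons-RMSE={cons_rmse:.3f}")
```

Under these assumptions, a missing constraint lowers recall while an invented one lowers precision, and a wrong coefficient shows up in the RMSE even when both set-based scores are perfect.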
| Original language | English |
|---|---|
| Article number | 133981 |
| Journal | Neurocomputing |
| Volume | 695 |
| State | Published - 28 Sep 2026 |
Bibliographical note
Publisher Copyright: © 2026 Elsevier B.V.
Keywords
- Combinatorial optimization
- Fine-tuning
- In-context learning
- Large language models
- Linear programming
- Optimization
- Optimization modeling
- Prompt engineering
ASJC Scopus subject areas
- Computer Science Applications
- Cognitive Neuroscience
- Artificial Intelligence
Fingerprint
Dive into the research topics of 'A component-level evaluation framework for diagnosing LLM errors in optimization modelling'. Together they form a unique fingerprint.