DefAn: Definitive Answer Dataset for LLM Hallucination Evaluation

  • A. B. M. Ashikur Rahman
  • Saeed Anwar*
  • Muhammad Usman
  • Irfan Ahmad
  • Ajmal Mian

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

Large Language Models (LLMs) represent a major step in AI development and are increasingly used in daily applications. However, they are prone to hallucinations: generating claims that contradict established facts, deviating from prompts, and producing inconsistent responses when the same prompt is presented multiple times. Addressing these issues is challenging due to the lack of comprehensive and easily assessable benchmark datasets. Most existing datasets are limited in scale and scope and rely on multiple-choice questions, which are insufficient for evaluating the generative capabilities of LLMs. To assess hallucination in LLMs, this paper introduces a comprehensive benchmark dataset consisting of over 20,000 unique prompts (more than 75,000 prompts in total) across eight domains. These prompts are designed to elicit definitive, concise, and informative answers. The dataset is divided into two segments: a public segment for testing and assessing LLM performance, and a hidden segment for benchmarking various LLMs. In our experiments, we tested nine state-of-the-art (SoTA) models (GPT-4o, GPT-3.5, Llama 2 7B, Llama 3 8B, Gemini 1.0 Pro, Mixtral 8x7B, Zephyr 7B, Deepseek-r1-7b, and Qwen2.5-14B), revealing that overall factual hallucination ranges from 48% to 82% on the public dataset and from 31% to 76% on the hidden benchmark. Prompt misalignment hallucination reaches up to 95% on the public dataset and up to 94% on the hidden benchmark. Average consistency ranges from 21% to 61% on the public dataset and from 44% to 63% on the hidden benchmark. Domain-wise analysis reveals that LLM performance deteriorates significantly when the models are asked for specific numeric information, whereas they perform moderately on queries involving persons, locations, and dates. These results demonstrate the dataset's efficacy as a comprehensive benchmark for evaluating LLM performance.
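
The three quantities reported in the abstract (factual hallucination, prompt misalignment, and consistency) can be illustrated with a minimal Python sketch. The abstract does not give the paper's exact metric definitions, so everything below is an assumption made for illustration: the record fields (`question_id`, `response`, `reference`), the exact-match comparison, the `max_words` length check used as a stand-in for prompt alignment, and the majority-agreement notion of consistency are all hypothetical and may differ from the authors' evaluation protocol.

```python
from collections import Counter, defaultdict


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different strings compare equal."""
    return " ".join(text.lower().split())


def hallucination_metrics(records, max_words=10):
    """Compute three illustrative rates over a list of response records.

    Each record is a dict with hypothetical keys 'question_id', 'response',
    and 'reference'. These definitions are assumptions, not the paper's.
    """
    factual_errors = 0
    misaligned = 0
    by_question = defaultdict(list)

    for rec in records:
        answer = normalize(rec["response"])
        # Assumed factual check: exact match against the reference answer.
        if answer != normalize(rec["reference"]):
            factual_errors += 1
        # Assumed alignment check: the answer must be concise.
        if len(answer.split()) > max_words:
            misaligned += 1
        by_question[rec["question_id"]].append(answer)

    # Assumed consistency: per question, the share of responses that agree
    # with the most frequent response, averaged over all questions.
    agreement = [
        Counter(answers).most_common(1)[0][1] / len(answers)
        for answers in by_question.values()
    ]

    total = len(records)
    return {
        "factual_hallucination": factual_errors / total,
        "prompt_misalignment": misaligned / total,
        "consistency": sum(agreement) / len(agreement),
    }


# Toy data: two questions, each asked several times (e.g. as paraphrases).
records = [
    {"question_id": "q1", "response": "Paris", "reference": "Paris"},
    {"question_id": "q1", "response": " paris ", "reference": "Paris"},
    {"question_id": "q1", "response": "Lyon", "reference": "Paris"},
    {"question_id": "q2", "response": "1969", "reference": "1969"},
    {
        "question_id": "q2",
        "response": "The first Moon landing took place in the year 1969 "
                    "according to most historical sources",
        "reference": "1969",
    },
]

metrics = hallucination_metrics(records)
print(metrics)
```

In this toy example, two of the five responses fail the exact-match check and one exceeds the word limit. Exact matching is deliberately strict: a verbose response that contains the right value still counts as a factual error here, which is one reason a real evaluation would likely need a more careful answer-extraction step.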

Original language: English
Article number: 937
Journal: Information (Switzerland)
Volume: 16
Issue number: 11
State: Published - Nov 2025

Bibliographical note

Publisher Copyright:
© 2025 by the authors.

Keywords

  • evaluation
  • hallucination
  • LLMs
  • novel dataset

ASJC Scopus subject areas

  • Information Systems
