Abstract
Large Language Models (LLMs) have gained widespread popularity due to their instruction-following abilities. In this study, we evaluate their ability in simulating user interactions for task-oriented dialogue (TOD) systems. Our findings demonstrate that prompting LLMs reveals their promising capabilities for training and testing dialogue policies, eliminating the need for domain expertise in crafting complex rules or relying on large annotated data, as required by traditional simulators. The results show that the dialogue system trained with the ChatGPT simulator achieves a success rate of 59%, comparable to a 62% success rate of the dialogue system trained with the manual-rules, agenda-based user simulator (ABUS). Furthermore, the dialogue system trained with the ChatGPT simulator demonstrates better generalization ability compared to the dialogue system trained with the ABUS. Its success rate outperforms that of the dialogue system trained with the ABUS by 4% on GenTUS, 5% on the ChatGPT Simulator, and 3% on the Llama simulator. Nevertheless, LLM-based user simulators provide challenging environment, lexically rich, diverse, and random responses. Llama simulator outperforms the human reference in all lexical diversity metrics with a margin of 0.66 in SE, 0.39 in CE, 0.01 in MSTTR, 0.04 in HDD, and 0.55 in MTLD, while the ChatGPT simulator achieves comparable results. This ultimately contributes to enhancing the system's ability to generalize more effectively.
Original language | English |
---|---|
Article number | 101697 |
Journal | Computer Speech and Language |
Volume | 89 |
DOIs | |
State | Published - Jan 2025 |
Bibliographical note
Publisher Copyright:© 2024 Elsevier Ltd
Keywords
- Agenda-based simulator
- Large language models
- Prompting
- Task Oriented Dialogue
- User simulator
ASJC Scopus subject areas
- Software
- Theoretical Computer Science
- Human-Computer Interaction