Sequential Vision to Language as Story: A Storytelling Dataset and Benchmarking

  • Zainy M. Malakan*
  • Saeed Anwar
  • Ghulam Mubashar Hassan
  • Ajmal Mian

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

3 Scopus citations

Abstract

Storytelling is a remarkable human skill that plays a significant role in learning and experiencing everyday life. Developing narratives is central to human mental development, simultaneously encapsulating broad concerns such as psychology, morality, and common sense. Contemporary deep-learning algorithms require similar skills to be able to tell a story from a visual perspective. However, most algorithms function at a superficial or factual level, aligning descriptive text with images in a one-to-one manner without considering temporal relations. Stories are more expressive in style, language, and content, involving imaginary concepts not explicit in the images. An ideal deep-learning system should learn to develop cohesive, meaningful, and causal stories. Unfortunately, most existing storytelling methods are trained and evaluated on a single dataset, i.e., the VIsual STorytelling (VIST) dataset. Multiple datasets are essential to test the generalization ability of algorithms. We bridge this gap and present a new dataset for expressive and coherent story creation: the Sequential Storytelling Image Dataset (SSID, https://ieee-dataport.org/documents/sequential-storytelling-image-dataset-ssid), consisting of open-source video frames accompanied by story-like annotations. We provide four annotations (stories) for each set of five images. The image sets are collected manually from publicly available videos in three domains: documentaries, lifestyle, and movies, and then annotated manually using Amazon Mechanical Turk. We perform a detailed analysis and benchmarking of the existing VIST dataset and our new SSID dataset and show that both datasets exhibit high variance among the multiple ground-truth stories corresponding to the same image set. Moreover, our dataset achieves lower mean average scores across all metrics, meaning that the ground-truth stories of our dataset are more diverse. Finally, we train and evaluate existing state-of-the-art visual storytelling methods on both datasets and show that our dataset is more challenging, requiring sophisticated techniques to accurately detect a significant variety of events.
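The abstract fixes the dataset's unit of annotation: each sample is a set of five temporally ordered frames paired with four independent ground-truth stories, drawn from one of three domains. As a minimal sketch of that structure (the field names, file paths, and domain labels here are illustrative assumptions, not SSID's actual file format):

```python
from dataclasses import dataclass

@dataclass
class SSIDSample:
    """One SSID unit as described in the abstract: five ordered frames,
    four human-written reference stories, and a source domain."""
    frame_paths: list[str]  # 5 frames; order encodes the temporal sequence
    stories: list[str]      # 4 independent ground-truth stories for the set
    domain: str             # "documentary", "lifestyle", or "movie"

    def __post_init__(self) -> None:
        # Enforce the 5-images / 4-stories structure from the abstract.
        assert len(self.frame_paths) == 5, "each image set contains 5 frames"
        assert len(self.stories) == 4, "each set has 4 story annotations"
        assert self.domain in {"documentary", "lifestyle", "movie"}

# Hypothetical example record (paths and story text are placeholders).
sample = SSIDSample(
    frame_paths=[f"frames/clip_001/frame_{i}.jpg" for i in range(5)],
    stories=["story A ...", "story B ...", "story C ...", "story D ..."],
    domain="documentary",
)
print(len(sample.frame_paths), len(sample.stories), sample.domain)
```

Having four references per image set is what enables the abstract's variance analysis: story-level metrics can be computed against each reference separately, and the spread across the four scores measures ground-truth diversity.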

Original language: English
Pages (from-to): 70805-70818
Number of pages: 14
Journal: IEEE Access
Volume: 11
DOIs
State: Published - 2023

Bibliographical note

Publisher Copyright:
© 2013 IEEE.

Keywords

  • Storytelling
  • computer vision
  • image and video captioning
  • sequential storytelling image dataset (SSID)
  • visual understanding dataset

ASJC Scopus subject areas

  • General Computer Science
  • General Materials Science
  • General Engineering
