The short stories corpus

  • Faisal Alvi
  • Mark Stevenson
  • Paul Clough

Research output: Contribution to journal › Conference article › peer-review

Abstract

In this work, we describe the construction of a plagiarism detection/text reuse corpus submitted to the PAN-2015 Evaluation Lab. Our corpus consists of four text reuse scenarios, namely (1) no-plagiarism, (2) story-retelling, (3) synonym-replacement and (4) character-substitution. Among these scenarios, the most interesting is story retelling, through which we find patterns of textual similarity between story retellings. We use the Brothers Grimm fairy tales, as provided by Project Gutenberg, as the source of our documents. The corpus consists of 200 pairs of documents, with 50 document pairs for each type of text reuse. Empirical observation shows interesting patterns of textual similarity within the corpus. Furthermore, plagiarism detection using various approaches shows the difficulty of detecting various groups within the corpus.
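
As an illustration of why a scenario such as character-substitution can be hard for detectors, the following is a minimal sketch and not the authors' construction procedure. The abstract does not say how characters were substituted; the sketch assumes that Latin letters are replaced with visually similar Cyrillic ones. The `HOMOGLYPHS` table, both function names, and the sample sentence are hypothetical.

```python
# Assumption: substitution swaps Latin letters for look-alike Cyrillic letters.
# This table is illustrative only and is not taken from the corpus.
HOMOGLYPHS = {
    "a": "\u0430",  # Cyrillic small a
    "e": "\u0435",  # Cyrillic small ie
    "o": "\u043e",  # Cyrillic small o
    "c": "\u0441",  # Cyrillic small es
    "p": "\u0440",  # Cyrillic small er
}


def substitute_characters(text):
    """Replace each character found in HOMOGLYPHS with its look-alike."""
    return "".join(HOMOGLYPHS.get(ch, ch) for ch in text)


def word_overlap(a, b):
    """Jaccard similarity over lowercased, whitespace-separated tokens."""
    tokens_a = set(a.lower().split())
    tokens_b = set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


source = "the cat sat on the mat"          # hypothetical sample text
reused = substitute_characters(source)     # looks the same when rendered

print(word_overlap(source, source))  # identical text
print(word_overlap(source, reused))  # same text after substitution
```

In this sketch every word of the sample contains at least one substituted letter, so a naive exact-token comparison finds no shared words even though the two strings look alike to a human reader.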

Original language: English
Journal: CEUR Workshop Proceedings
Volume: 1391
State: Published - 2015

ASJC Scopus subject areas

  • General Computer Science
