Abstract
In this work we describe the construction of a plagiarism detection/text reuse corpus submitted to the PAN-2015 Evaluation Lab. Our corpus covers four text reuse scenarios: (1) no-plagiarism, (2) story-retelling, (3) synonym-replacement and (4) character-substitution. The most interesting of these is story-retelling, through which we find patterns of textual similarity between retellings of the same story. We use the Brothers Grimm fairy tales, as available from Project Gutenberg, as the source of our documents. The corpus consists of 200 document pairs, 50 for each scenario. Empirical observation shows interesting patterns of textual similarity within the corpus. Furthermore, applying various plagiarism detection approaches illustrates how difficult the various groups within the corpus are to detect.
| Original language | English |
|---|---|
| Journal | CEUR Workshop Proceedings |
| Volume | 1391 |
| State | Published - 2015 |
ASJC Scopus subject areas
- General Computer Science