Hashing and merging heuristics for text reuse detection: Notebook for PAN at CLEF-2014

Faisal Alvi, Mark Stevenson, Paul Clough

Research output: Contribution to journal › Conference article › peer-review

16 Scopus citations

Abstract

This paper describes a joint software entry by King Fahd University of Petroleum & Minerals and the University of Sheffield for the text-alignment task at PAN-2014. We perform text alignment in three steps: seeding, extension and filtering. For seeding, we use character n-grams with a variant of the Rabin-Karp algorithm for multiple pattern search. We then use an elaborate merging mechanism with several cases to combine the individually found seeds. A short filtering step then removes extraneous passages. This approach achieved plagdet scores of 0.65954 and 0.73416 on test corpora 2 and 3, respectively, during the final test run.
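
The seeding step described in the abstract can be illustrated with a minimal sketch. It shows only the general idea of Rabin-Karp-style multiple pattern search over character n-grams. It is not the authors' variant, and it does not cover their merging or filtering steps. The function names (`rolling_hashes`, `find_seeds`), the hash base and modulus, and the default n-gram length of 5 are assumptions chosen for illustration.

```python
from collections import defaultdict

# Assumed hash parameters for this sketch; not taken from the paper.
BASE = 257
MOD = (1 << 61) - 1


def rolling_hashes(text, n):
    """Yield (start, hash) for every character n-gram of text."""
    if len(text) < n:
        return
    high = pow(BASE, n - 1, MOD)  # weight of the character leaving the window
    h = 0
    for ch in text[:n]:
        h = (h * BASE + ord(ch)) % MOD
    yield 0, h
    for i in range(1, len(text) - n + 1):
        # Drop the leftmost character, then append the next one.
        h = (h - ord(text[i - 1]) * high) % MOD
        h = (h * BASE + ord(text[i + n - 1])) % MOD
        yield i, h


def find_seeds(source, suspicious, n=5):
    """Return sorted (source_pos, suspicious_pos) pairs of matching n-grams."""
    # Index every source n-gram by its hash so that all of them are
    # searched for at once in a single pass over the suspicious text.
    table = defaultdict(list)
    for pos, h in rolling_hashes(source, n):
        table[h].append(pos)

    seeds = []
    for pos, h in rolling_hashes(suspicious, n):
        for src_pos in table.get(h, ()):
            # Compare the actual characters to rule out hash collisions.
            if source[src_pos:src_pos + n] == suspicious[pos:pos + n]:
                seeds.append((src_pos, pos))
    return sorted(seeds)
```

Calling `find_seeds(source_text, suspicious_text, n=5)` returns position pairs for every shared character 5-gram. In a full pipeline, such seeds would then be merged into longer aligned passages and filtered, as the abstract outlines.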

Original language: English
Pages (from-to): 939-946
Number of pages: 8
Journal: CEUR Workshop Proceedings
Volume: 1180
State: Published - 2014

ASJC Scopus subject areas

  • General Computer Science
