Plagiarism detection in texts obfuscated with homoglyphs

Faisal Alvi*, Mark Stevenson, Paul Clough

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

10 Scopus citations

Abstract

Homoglyphs can be used to disguise plagiarized text by replacing letters in source texts with visually identical letters from other scripts. Most current plagiarism detection systems cannot detect plagiarism in text that has been obfuscated using homoglyphs. In this work, we present two alternative approaches for detecting plagiarism in homoglyph-obfuscated texts. The first approach uses the Unicode list of confusables to replace homoglyphs with visually identical letters, while the second uses a similarity score computed from normalized Hamming distance to match homoglyph-obfuscated words with source words. Empirical testing on datasets from PAN-2015 shows that both approaches perform equally well for plagiarism detection in homoglyph-obfuscated texts.
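
The two approaches named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `CONFUSABLES` table is a small hand-picked subset standing in for the full Unicode confusables list, and the function names, the equal-length restriction, and the `threshold` value are all assumptions made for illustration.

```python
# Hypothetical subset of Cyrillic-to-Latin confusable pairs (an assumption for
# illustration; the real Unicode confusables list is far larger).
CONFUSABLES = {
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
    "\u0435": "e",  # CYRILLIC SMALL LETTER IE
    "\u043e": "o",  # CYRILLIC SMALL LETTER O
    "\u0440": "p",  # CYRILLIC SMALL LETTER ER
    "\u0441": "c",  # CYRILLIC SMALL LETTER ES
    "\u0456": "i",  # CYRILLIC SMALL LETTER BYELORUSSIAN-UKRAINIAN I
}


def replace_homoglyphs(text: str) -> str:
    """Approach 1 (sketch): map each known homoglyph to its Latin look-alike."""
    return "".join(CONFUSABLES.get(ch, ch) for ch in text)


def normalized_hamming_similarity(a: str, b: str) -> float:
    """Approach 2 (sketch): 1 - (mismatching positions / word length).

    Hamming distance is only defined for equal-length strings, so words of
    different lengths (and empty words) score 0.0 here by assumption.
    """
    if len(a) != len(b) or not a:
        return 0.0
    mismatches = sum(x != y for x, y in zip(a, b))
    return 1.0 - mismatches / len(a)


def best_match(word, vocabulary, threshold=0.5):
    """Return the most similar source word at or above the (assumed) threshold."""
    scored = [(normalized_hamming_similarity(word, v), v) for v in vocabulary]
    score, candidate = max(scored, default=(0.0, None))
    return candidate if score >= threshold else None


if __name__ == "__main__":
    obfuscated = "pl\u0430gi\u0430rism"  # both "a" letters are Cyrillic
    print(replace_homoglyphs(obfuscated))
    print(normalized_hamming_similarity(obfuscated, "plagiarism"))
    print(best_match(obfuscated, ["plagiarism", "pluralisms", "detection"]))
```

In this sketch the first function normalizes the text before any standard detector runs, while the second leaves the text untouched and instead tolerates a limited number of character mismatches when pairing a suspicious word with a source word.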

Original language: English
Title of host publication: Advances in Information Retrieval - 39th European Conference on IR Research, ECIR 2017, Proceedings
Editors: Claudia Hauff, Joemon M. Jose, Dyaa Albakour, Ismail Sengor Altingovde, John Tait, Dawei Song, Stuart Watt
Publisher: Springer Verlag
Pages: 669-675
Number of pages: 7
ISBN (Print): 9783319566078
State: Published - 2017

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 10193 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Bibliographical note

Publisher Copyright:
© Springer International Publishing AG 2017.

ASJC Scopus subject areas

  • Theoretical Computer Science
  • General Computer Science
