Abstract
Homoglyphs can be used to disguise plagiarized text by replacing letters in the source text with visually identical letters from other scripts. Most current plagiarism detection systems cannot detect plagiarism in text that has been obfuscated with homoglyphs. In this work, we present two alternative approaches for detecting plagiarism in homoglyph-obfuscated texts. The first approach uses the Unicode list of confusables to replace homoglyphs with their visually identical counterparts, while the second uses a similarity score computed from the normalized Hamming distance to match homoglyph-obfuscated words with source words. Empirical testing on datasets from PAN-2015 shows that both approaches perform equally well for plagiarism detection in homoglyph-obfuscated texts.
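The two ideas named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `CONFUSABLES` table is a tiny hand-picked subset of Cyrillic-to-Latin pairs standing in for the full Unicode confusables list, and the function names and the `threshold` value are hypothetical choices made only for illustration.

```python
# Assumption: a tiny illustrative subset of confusable pairs (Cyrillic -> Latin).
# The real Unicode confusables list is far larger.
CONFUSABLES = {
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
    "\u0435": "e",  # CYRILLIC SMALL LETTER IE
    "\u043e": "o",  # CYRILLIC SMALL LETTER O
    "\u0440": "p",  # CYRILLIC SMALL LETTER ER
    "\u0441": "c",  # CYRILLIC SMALL LETTER ES
    "\u0445": "x",  # CYRILLIC SMALL LETTER HA
}


def normalize_homoglyphs(text, table=CONFUSABLES):
    """Approach 1 (sketch): replace each known homoglyph with its Latin look-alike."""
    return "".join(table.get(ch, ch) for ch in text)


def hamming_similarity(a, b):
    """Approach 2 (sketch): 1 minus the normalized Hamming distance.

    Hamming distance is only defined for equal-length strings, so words of
    different lengths (or empty words) are given a similarity of 0.0 here.
    """
    if len(a) != len(b) or not a:
        return 0.0
    mismatches = sum(x != y for x, y in zip(a, b))
    return 1.0 - mismatches / len(a)


def best_match(word, vocabulary, threshold=0.5):
    """Return the vocabulary word most similar to `word`, or None.

    The threshold is a hypothetical cut-off, not a value from the paper.
    """
    best, best_score = None, threshold
    for candidate in vocabulary:
        score = hamming_similarity(word, candidate)
        if score >= best_score:
            best, best_score = candidate, score
    return best


# "plagiarism" with both Latin "a" letters swapped for Cyrillic "а" (U+0430).
obfuscated = "pl\u0430gi\u0430rism"
source_words = ["plagiarism", "detection", "homoglyphs"]

restored = normalize_homoglyphs(obfuscated)
matched = best_match(obfuscated, source_words)
```

In this sketch the first function restores the original spelling directly, while the second tolerates substituted characters by matching the obfuscated word to the closest equal-length source word, so it does not depend on the substitution appearing in a lookup table.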
Original language | English |
---|---|
Title of host publication | Advances in Information Retrieval - 39th European Conference on IR Research, ECIR 2017, Proceedings |
Editors | Claudia Hauff, Joemon M. Jose, Dyaa Albakour, Ismail Sengor Altingovde, John Tait, Dawei Song, Stuart Watt |
Publisher | Springer Verlag |
Pages | 669-675 |
Number of pages | 7 |
ISBN (Print) | 9783319566078 |
State | Published - 2017 |
Publication series
Name | Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) |
---|---|
Volume | 10193 LNCS |
ISSN (Print) | 0302-9743 |
ISSN (Electronic) | 1611-3349 |
Bibliographical note
Publisher Copyright: © Springer International Publishing AG 2017.
ASJC Scopus subject areas
- Theoretical Computer Science
- General Computer Science