Analysis and extraction of sentence-level paraphrase sub-corpus in CS education

Faisal Alvi*, El Sayed M. El-Alfy, Wasfi G. Al-Khatib, Radwan E. Abdel-Aal

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

5 Scopus citations

Abstract

Since the advent of the Internet, plagiarism has become a widespread problem in student submissions. Paraphrasing is one of several types of plagiarism that students employ to mask the original source. In this work, we construct a sub-corpus of paraphrased sentences by extracting all lightly and heavily revised sentences from the Corpus of Plagiarized Short Answers, using modified criteria for sentences. We then apply document similarity measures to this sub-corpus and derive some of its characteristic features. Our findings suggest that this sub-corpus is better suited to testing paraphrase detection techniques than the original corpus, because it provides sentence-level paraphrase samples rather than file-level classification. Additional sentence samples may also be added to this sub-corpus to achieve variety and scale.
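The abstract does not name the similarity measures that were applied. As a rough illustration only, the sketch below computes three commonly used measures (Jaccard similarity, cosine similarity over word counts, and word n-gram containment) for a single sentence pair. The function names, the tokenization rule, and the sample sentences are assumptions made for this example and are not taken from the paper or from the Corpus of Plagiarized Short Answers.

```python
import math
import re
from collections import Counter


def tokens(sentence):
    """Lowercase a sentence and split it into alphanumeric word tokens (assumed rule)."""
    return re.findall(r"[a-z0-9]+", sentence.lower())


def jaccard(a, b):
    """Jaccard similarity over the sets of word tokens of two sentences."""
    sa, sb = set(tokens(a)), set(tokens(b))
    if not sa and not sb:
        return 1.0  # two empty sentences are treated as identical
    return len(sa & sb) / len(sa | sb)


def cosine(a, b):
    """Cosine similarity over raw word-count vectors of two sentences."""
    ca, cb = Counter(tokens(a)), Counter(tokens(b))
    dot = sum(ca[w] * cb[w] for w in ca)
    norm = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(
        sum(v * v for v in cb.values())
    )
    return dot / norm if norm else 0.0


def ngram_containment(source, suspicious, n=2):
    """Fraction of the suspicious sentence's word n-grams that also occur in the source."""

    def ngrams(sentence):
        t = tokens(sentence)
        return {tuple(t[i:i + n]) for i in range(len(t) - n + 1)}

    src, sus = ngrams(source), ngrams(suspicious)
    return len(src & sus) / len(sus) if sus else 0.0


if __name__ == "__main__":
    # Hypothetical sentence pair, invented for illustration.
    original = "Inheritance allows a class to reuse the behaviour of another class."
    revised = "A class can reuse the behaviour of another class through inheritance."
    print("jaccard:", round(jaccard(original, revised), 3))
    print("cosine:", round(cosine(original, revised), 3))
    print("bigram containment:", round(ngram_containment(original, revised), 3))
```

Scores of this kind are computed per sentence pair, which is the granularity the sub-corpus is meant to support; thresholds for separating light from heavy revision would have to be chosen empirically and are not suggested here.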

Original language: English
Title of host publication: SIGITE'12 - Proceedings of the ACM Special Interest Group for Information Technology Education Conference
Pages: 49-54
Number of pages: 6
State: Published - 2012

Publication series

Name: SIGITE'12 - Proceedings of the ACM Special Interest Group for Information Technology Education Conference

Keywords

  • Paraphrasing
  • Plagiarism
  • Similarity measures

ASJC Scopus subject areas

  • Computer Networks and Communications
  • Education
