The Effect of Low and High Frequent Term Removal on Documents Clustering

Research output: Chapter in Book/Report/Conference proceeding › Chapter › peer-review

2 Scopus citations

Abstract

In information retrieval systems, it is very important to index documents with appropriate terms for the purpose of document clustering. Common words that are of little help in selecting documents matching a user need are called stop words. Many researchers agree on a common list of stop words for different natural languages, and past studies on document clustering have taken these lists as a working assumption. This paper, based on the same assumptions, considers words with high or low frequency that are not on these lists and could potentially be added to the stop word list in order to improve the overall document clustering process. Such augmented stop words could differ across collections of documents. However, the impact of augmenting the stop word list in document clustering remains debatable. A few studies claim that stop word removal improves purity, which in turn enhances overall document clustering performance. The discriminating power of a word is determined based on Zipf’s law, using document frequency instead of term frequency. This paper studies the effect of removing augmented stop words in the clustering process. In particular, it investigates whether removing stop words enhances the effectiveness of document clustering techniques. The study applies three different document clustering methods and observes how removing augmented stop words affects each of them. Moreover, it assesses the impact of removing stop words by observing the change in cluster purity at different thresholds. Lastly, it presents the results of an experimental study that implements several document clustering techniques with and without augmented stop words at different frequency thresholds, and investigates the combined impact of these thresholds through a series of comprehensive experiments.
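As a rough illustration of the idea described in the abstract, the following minimal sketch selects augmented stop words by document frequency and computes cluster purity. It is not the authors' implementation: the function names, the `low` and `high` ratio thresholds, and the toy documents are all hypothetical, and the clustering step itself is omitted because the abstract does not name the three methods used.

```python
from collections import Counter


def document_frequency(docs):
    """Count, for each term, the number of documents that contain it."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # set() so a term counts once per document
    return df


def augmented_stopwords(docs, low=0.0, high=1.0):
    """Return terms whose document-frequency ratio is below `low` or above `high`.

    `low` and `high` are hypothetical thresholds chosen for illustration only.
    """
    n = len(docs)
    df = document_frequency(docs)
    return {term for term, count in df.items()
            if count / n < low or count / n > high}


def remove_terms(docs, stopwords):
    """Drop every occurrence of the given terms from each tokenized document."""
    return [[t for t in tokens if t not in stopwords] for tokens in docs]


def purity(cluster_labels, class_labels):
    """Fraction of documents that fall in the majority class of their cluster."""
    clusters = {}
    for cluster, klass in zip(cluster_labels, class_labels):
        clusters.setdefault(cluster, Counter())[klass] += 1
    majority_total = sum(max(counts.values()) for counts in clusters.values())
    return majority_total / len(class_labels)


# Toy collection (invented for this sketch).
docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the stock market fell".split(),
    "the market rallied".split(),
]

stop = augmented_stopwords(docs, low=0.3, high=0.9)
filtered = remove_terms(docs, stop)
example_purity = purity([0, 0, 1, 1], ["a", "a", "b", "a"])
```

In this sketch, purity would be computed on the clustering output with and without the removal step, and the thresholds varied, to observe the kind of change the abstract refers to.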

Original language: English
Title of host publication: Studies in Systems, Decision and Control
Publisher: Springer Science and Business Media Deutschland GmbH
Pages: 699-710
Number of pages: 12
State: Published - 2024
Externally published: Yes

Publication series

Name: Studies in Systems, Decision and Control
Volume: 528
ISSN (Print): 2198-4182
ISSN (Electronic): 2198-4190

Bibliographical note

Publisher Copyright:
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024.

Keywords

  • Clustering
  • Document frequency
  • Information retrieval
  • Term frequency

ASJC Scopus subject areas

  • Computer Science (miscellaneous)
  • Control and Systems Engineering
  • Automotive Engineering
  • Social Sciences (miscellaneous)
  • Economics, Econometrics and Finance (miscellaneous)
  • Control and Optimization
  • Decision Sciences (miscellaneous)
