Abstract
In information retrieval systems, choosing appropriate index terms is essential for document clustering. Common words that contribute little to selecting documents matching a user's need are called stop words. Many researchers agree on a common list of stop words for different natural languages, and past studies on document clustering have taken these lists as an assumption. Building on the same assumption, this paper considers non-listed words with high or low frequency that could potentially be added to the stop word list to improve the overall document clustering process. Such augmented stop words may differ from one document collection to another. However, the impact of augmenting the stop word list on document clustering remains debatable. A few studies claim that stop word removal improves purity, which in turn enhances overall document clustering performance. The discrimination power of a word is determined based on Zipf’s law, using document frequency instead of term frequency. This paper studies the effect of removing augmented stop words in the clustering process; in particular, it investigates whether removing stop words enhances the effectiveness of document clustering techniques. The study applies three different document clustering methods and observes how removing augmented stop words affects them. Moreover, it assesses the impact of removing stop words by observing the change in cluster purity at different thresholds. Lastly, it presents the results of an experimental study that implements a few document clustering techniques with and without augmented stop words at different frequency thresholds and investigates the combined impact of these thresholds through a series of comprehensive experiments.
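The two quantities the abstract relies on, document-frequency thresholds for augmenting a stop word list and cluster purity, can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the toy documents, and the threshold values (`low=1`, `high_ratio=0.8`) are assumptions chosen only for illustration, and purity is computed with its common definition (the fraction of documents that belong to the majority class of their cluster).

```python
from collections import Counter


def document_frequency(docs):
    """Map each word to the number of documents that contain it."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # count each word once per document
    return df


def augmented_stop_words(docs, base_stop_words, low=1, high_ratio=0.8):
    """Add words with very low or very high document frequency to a base list.

    `low` and `high_ratio` are hypothetical thresholds, not values from the paper.
    """
    n_docs = len(docs)
    df = document_frequency(docs)
    extra = {
        word
        for word, count in df.items()
        if count <= low or count / n_docs >= high_ratio
    }
    return set(base_stop_words) | extra


def purity(cluster_labels, true_labels):
    """Fraction of documents that fall in the majority class of their cluster."""
    per_cluster = {}
    for cluster, label in zip(cluster_labels, true_labels):
        per_cluster.setdefault(cluster, Counter())[label] += 1
    majority_total = sum(max(counts.values()) for counts in per_cluster.values())
    return majority_total / len(true_labels)


# Toy collection (invented for this example), already tokenized.
docs = [
    ["the", "market", "report", "shows", "growth"],
    ["the", "team", "report", "wins", "match"],
    ["market", "report", "growth", "slows"],
    ["team", "report", "match", "postponed"],
]
stop_words = augmented_stop_words(docs, base_stop_words={"the"})
filtered_docs = [[w for w in tokens if w not in stop_words] for tokens in docs]
example_purity = purity([0, 0, 1, 1], ["a", "a", "b", "a"])
```

In this toy collection, "report" appears in every document and is added as a high-frequency stop word, while words that occur in a single document are added as low-frequency ones. Comparing purity for clusterings built from `docs` and from `filtered_docs` at several threshold settings mirrors the kind of comparison the abstract describes.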
| Original language | English |
|---|---|
| Title of host publication | Studies in Systems, Decision and Control |
| Publisher | Springer Science and Business Media Deutschland GmbH |
| Pages | 699-710 |
| Number of pages | 12 |
| State | Published - 2024 |
| Externally published | Yes |
Publication series
| Name | Studies in Systems, Decision and Control |
|---|---|
| Volume | 528 |
| ISSN (Print) | 2198-4182 |
| ISSN (Electronic) | 2198-4190 |
Bibliographical note
Publisher Copyright: © The Author(s), under exclusive license to Springer Nature Switzerland AG 2024.
Keywords
- Clustering
- Document frequency
- Information retrieval
- Term frequency
ASJC Scopus subject areas
- Computer Science (miscellaneous)
- Control and Systems Engineering
- Automotive Engineering
- Social Sciences (miscellaneous)
- Economics, Econometrics and Finance (miscellaneous)
- Control and Optimization
- Decision Sciences (miscellaneous)