A New Heuristic for the Data Clustering Problem

  • Umair F. Siddiqi
  • , Sadiq M. Sait*
  • *Corresponding author for this work

Research output: Contribution to journalArticlepeer-review

13 Scopus citations

Abstract

This paper presents a new heuristic for the data clustering problem. It comprises two parts. The first part is a greedy algorithm, which selects the data points that can act as the centroids of well-separated clusters. The second part is a single-solution-based heuristic, which performs clustering with the objective of optimizing a cluster validity index. Single-solution-based heuristics are memory efficient as compared with population-based heuristics. The proposed heuristic is inspired from evolutionary algorithms (EAs) and consists of five main components: 1) genes; 2) fitness of genes; 3) selection; 4) mutation operation; and 5) diversification. The attributes of the centroids of clusters are considered as genes. The fitness of a gene is a function of two factors: 1) difference between its value and the same attribute of the mean of the data points assigned to its cluster and 2) the frequency with which it has been mutated in previous iterations. The genes that have low fitness values should be updated through the mutation operation. The mutation operation performs small change (positive or negative) in the value of the gene. The mutants are accepted if they are better (with respect to objective function) than their parents. However, diversification in the search process is maintained by allowing, with a small probability, the mutants to replace their parents even they are not better than them. The objective functions used in the proposed heuristic are Calinski Harabasz index and Dunn index. The proposed algorithm has been experimented using real-life numeric data sets of UCI repository. The number of data points and number of attributes in the datasets lie between 150-11 000 and 4-60, respectively. The results indicate that the proposed algorithm performs better than two standard EAs: 1) simulated annealing algorithm and 2) differential evolution algorithm and a genetic algorithm-based clustering method.

Original languageEnglish
Article number7892860
Pages (from-to)6801-6812
Number of pages12
JournalIEEE Access
Volume5
DOIs
StatePublished - 2017

Bibliographical note

Publisher Copyright:
© 2017 IEEE.

Keywords

  • Data clustering
  • applications of evolutionary algorithms
  • genetic algorithm
  • simulated evolution

ASJC Scopus subject areas

  • General Computer Science
  • General Materials Science
  • General Engineering

Fingerprint

Dive into the research topics of 'A New Heuristic for the Data Clustering Problem'. Together they form a unique fingerprint.

Cite this