XML classification using ensemble learning on extracted features

Issam H. Laradji, Mohammed Salahadin, Lahouari Ghouti

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Unlike plain-text files, XML-based documents have structure, a property that facilitates their categorization into their respective classes. It is therefore imperative to make structure an integral part of the feature vector representing an XML document. To date, many models incorporating structural information have been proposed. Structure Link Vector Model (SLVM) is one popular model that generates high-dimensional, sparse feature vectors incorporating both text and structure as features. However, sparsity and high dimensionality are notorious for reducing classification performance. To mitigate this limitation, we proposed Concise Feature Vector (CFV), which models a document, using both text and structure, as a smaller feature vector whose features are meant to be more discriminating than those of SLVM. Using chi-square statistics to select which features to retain, we showed that classifiers trained on retained features from CFV outperformed those trained on the same number of retained features from SLVM. Furthermore, we proposed Majority Voting Ensemble (MVE), a heterogeneous ensemble comprising several classifiers whose independent decisions are combined to classify an XML document. Experimental results showed that MVE achieved considerable improvement in recall and precision rates over single classifiers.
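As a rough illustration of the two generic techniques the abstract names, chi-square feature retention and majority voting, the following minimal Python sketch may help. It is not the paper's CFV or MVE implementation: the function names, the representation of a document as a set of feature names, and the one-class-versus-rest scoring are all assumptions made for this example.

```python
from collections import Counter


def chi_square(a, b, c, d):
    """Chi-square statistic for a 2x2 feature/class contingency table.

    a: documents in the class that contain the feature
    b: documents outside the class that contain the feature
    c: documents in the class that lack the feature
    d: documents outside the class that lack the feature
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        # A feature present in every document (or in none) carries no signal.
        return 0.0
    return n * (a * d - c * b) ** 2 / denom


def top_k_features(docs, labels, target, k):
    """Rank features by chi-square against one target class and keep k.

    docs is a list of sets of feature names (hypothetical representation).
    """
    vocab = set().union(*docs)
    scores = {}
    for f in vocab:
        a = sum(1 for doc, y in zip(docs, labels) if f in doc and y == target)
        b = sum(1 for doc, y in zip(docs, labels) if f in doc and y != target)
        c = sum(1 for doc, y in zip(docs, labels) if f not in doc and y == target)
        d = sum(1 for doc, y in zip(docs, labels) if f not in doc and y != target)
        scores[f] = chi_square(a, b, c, d)
    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(scores, key=lambda f: (-scores[f], f))[:k]


def majority_vote(predictions):
    """Return the label predicted by the most classifiers.

    Ties go to the label that appears first in the list.
    """
    return Counter(predictions).most_common(1)[0][0]
```

In this sketch, each base classifier would be trained on the retained features, and `majority_vote` would be called with one predicted label per classifier for a given document.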

Original language: English
Title of host publication: Proceedings of the 2014 ACM Southeast Regional Conference, ACM SE 2014
Publisher: Association for Computing Machinery
ISBN (Electronic): 9781450329231
State: Published - 28 Mar 2014
Event: 2014 ACM Southeast Regional Conference, ACM SE 2014 - Kennesaw, United States
Duration: 28 Mar 2014 to 29 Mar 2014

Publication series

Name: Proceedings of the 2014 ACM Southeast Regional Conference, ACM SE 2014

Conference

Conference: 2014 ACM Southeast Regional Conference, ACM SE 2014
Country/Territory: United States
City: Kennesaw
Period: 28/03/14 to 29/03/14

Bibliographical note

Publisher Copyright:
Copyright 2014 ACM.

Keywords

  • Ensemble learning
  • Extensible Markup Language
  • Feature extraction
  • Machine learning models

ASJC Scopus subject areas

  • Computer Graphics and Computer-Aided Design
  • Computer Science Applications
  • Hardware and Architecture
  • Software
