Abstract
Unlike plain-text files, XML documents have structure, a property that helps categorize them into their respective classes. It is therefore important to make structure an integral part of the feature vector that represents an XML document. To date, many models incorporating structural information have been proposed. Structure Link Vector Model (SLVM) is one popular model that generates high-dimensional, sparse feature vectors incorporating both text and structure as features. However, sparsity and high dimensionality are known to reduce classification performance. To mitigate this limitation, we proposed Concise Feature Vector (CFV), which models a document, using both text and structure, as a smaller feature vector whose features are meant to be more discriminating than those of SLVM. Using the chi-square statistic to select which features to retain, we showed that classifiers trained on features retained from CFV outperformed those trained on the same number of features retained from SLVM. Furthermore, we proposed Majority Voting Ensemble (MVE), a heterogeneous ensemble comprising several classifiers whose independent decisions are combined to classify an XML document. Experimental results showed that MVE achieved considerable improvement in recall and precision over single classifiers.
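The idea of combining independent classifier decisions by majority vote can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `majority_vote` and `classify_documents` are hypothetical, and the tie-breaking rule (the earliest vote among the tied labels wins) is an assumption, since the abstract does not say how MVE resolves ties.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the label that most classifiers predicted for one document.

    Assumption for this sketch: on a tie, the tied label that appears
    first in `predictions` wins.
    """
    if not predictions:
        raise ValueError("need at least one prediction")
    counts = Counter(predictions)
    best = max(counts.values())
    # Scan in the original order so ties resolve to the earliest vote.
    for label in predictions:
        if counts[label] == best:
            return label


def classify_documents(per_classifier_predictions):
    """Combine predictions from several classifiers, document by document.

    `per_classifier_predictions` holds one list of labels per classifier,
    all of the same length (one label per document).
    """
    # zip(*...) regroups the labels so each tuple holds the votes for one document.
    return [majority_vote(list(votes)) for votes in zip(*per_classifier_predictions)]
```

For example, if three hypothetical classifiers predict `["sports", "news", "news"]`, `["sports", "sports", "news"]`, and `["news", "sports", "tech"]` for three documents, `classify_documents` returns `["sports", "sports", "news"]`.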
Original language | English |
---|---|
Title of host publication | Proceedings of the 2014 ACM Southeast Regional Conference, ACM SE 2014 |
Publisher | Association for Computing Machinery |
ISBN (Electronic) | 9781450329231 |
State | Published - 28 Mar 2014 |
Event | 2014 ACM Southeast Regional Conference, ACM SE 2014 - Kennesaw, United States; Duration: 28 Mar 2014 → 29 Mar 2014 |
Publication series
Name | Proceedings of the 2014 ACM Southeast Regional Conference, ACM SE 2014 |
---|---|
Conference
Conference | 2014 ACM Southeast Regional Conference, ACM SE 2014 |
---|---|
Country/Territory | United States |
City | Kennesaw |
Period | 28/03/14 → 29/03/14 |
Bibliographical note
Publisher Copyright: Copyright 2014 ACM.
Keywords
- Ensemble learning
- Extensible Markup Language
- Feature extraction
- Machine learning models
ASJC Scopus subject areas
- Computer Graphics and Computer-Aided Design
- Computer Science Applications
- Hardware and Architecture
- Software