Un-Compromised Credibility: Social Media Based Multi-Class Hate Speech Classification for Text

  • Khubaib Ahmed Qureshi*
  • Muhammad Sabih
  • *Corresponding author for this work

Research output: Contribution to journal, Article, peer-review

74 Scopus citations

Abstract

Social media has grown enormously, and its anonymity strongly promotes freedom of expression. Freedom of expression is a human right, but hate speech towards a person or group based on race, caste, religion, ethnic or national origin, sex, disability, gender identity, etc. is an abuse of this right. It seriously promotes violence and hate crimes and creates an imbalance in society by damaging peace, credibility, and human rights. Detecting hate speech in social media discourse is essential but complex. The challenges include the availability of an appropriate, social media-specific dataset and of a high-performing supervised classifier for text-based hate speech detection. This study addresses these issues with a broad, balanced, social media-specific dataset with multi-class labels and a corresponding automatic classifier; the dataset captures language subtleties, is labeled under a comprehensive definition and well-defined rules, and has strong annotator agreement. Addressing different categories of hate separately, this paper aims to accurately predict their different forms by exploring a group of text mining features. Two distinct groups of features are explored for problem suitability: baseline features and self-discovered (new) features. Baseline features include the most commonly used effective features from related studies. The exploration found that a few of them, such as character and word n-grams, dependency tuples, sentiment scores, and counts of first- and second-person pronouns, are more effective than others. Because latent semantic analysis (LSA) is applied for dimensionality reduction, the problem benefits from the use of many complex, non-linear models, of which CatBoost performed best. The proposed model is compared with related studies as well as with system baseline models, and its results were encouraging.
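
As an illustration of three of the baseline features named in the abstract (character n-grams, word n-grams, and first- and second-person pronoun counts), a minimal sketch is given below. It is not the authors' implementation: the pronoun lists, the tokenizer, the n-gram sizes, and all function names are assumptions made for this example, and the later steps (LSA and the CatBoost classifier) are not shown.

```python
import re
from collections import Counter

# Hypothetical pronoun lists; the paper's exact lexicon is not given in the abstract.
FIRST_PERSON = {"i", "me", "my", "mine", "myself",
                "we", "us", "our", "ours", "ourselves"}
SECOND_PERSON = {"you", "your", "yours", "yourself", "yourselves"}


def tokenize(text):
    """Lowercase the text and split it into word tokens (letters and apostrophes)."""
    return re.findall(r"[a-z']+", text.lower())


def word_ngrams(tokens, n):
    """Count contiguous word n-grams, joined by single spaces."""
    return Counter(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def char_ngrams(text, n):
    """Count contiguous character n-grams over the lowercased raw text."""
    s = text.lower()
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))


def pronoun_counts(tokens):
    """Return (first-person count, second-person count) for a token list."""
    first = sum(t in FIRST_PERSON for t in tokens)
    second = sum(t in SECOND_PERSON for t in tokens)
    return first, second


def extract_features(text):
    """Bundle the illustrative features for one post (assumed n = 2 and n = 3)."""
    tokens = tokenize(text)
    first, second = pronoun_counts(tokens)
    return {
        "word_bigrams": word_ngrams(tokens, 2),
        "char_trigrams": char_ngrams(text, 3),
        "first_person": first,
        "second_person": second,
    }


features = extract_features("You and I know you are wrong")
```

In a full pipeline, count features like these would typically be turned into a sparse matrix, reduced with LSA, and passed to a classifier; those steps depend on third-party libraries and on choices the abstract does not specify.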

Original language: English
Article number: 9503413
Pages (from-to): 109465-109477
Number of pages: 13
Journal: IEEE Access
Volume: 9
State: Published - 2021
Externally published: Yes

Bibliographical note

Publisher Copyright:
© 2013 IEEE.

UN SDGs

This output contributes to the following UN Sustainable Development Goals (SDGs)

  1. SDG 16 - Peace, Justice and Strong Institutions

Keywords

  • Machine learning
  • features exploration
  • hate speech classification
  • multi-class hate speech
  • multi-class hate speech dataset
  • natural language processing
  • social media microblogs
  • text mining
  • twitter hate speech

ASJC Scopus subject areas

  • General Computer Science
  • General Materials Science
  • General Engineering
