SDLER: stacked dedupe learning for entity resolution in big data era

Alladoumbaye Ngueilbaye, Hongzhi Wang*, Daouda Ahmat Mahamat, Ibrahim A. Elgendy

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

8 Scopus citations

Abstract

In the Big Data era, Entity Resolution (ER) faces many challenges, such as the need for high scalability, the coexistence of complex similarity metrics, tautonymy and synonymy, and the requirement of data quality evaluation. Moreover, despite more than seventy years of development effort, there is still high demand for democratizing ER by reducing human participation in parameter tuning, data labeling, defining blocking functions, and feature engineering. This study aimed to explore a novel Stacked Dedupe Learning (SDL) ER system with high accuracy and efficiency. The study evaluated sophisticated composition methods, such as Bidirectional Recurrent Neural Networks (BiRNNs) with Long Short-Term Memory (LSTM) hidden units, to convert each tuple into a Word Representation Distribution (WRD) that captures similarities among tuples. For cases where pre-trained word embeddings were not available, ways to learn and tune a WRD customized for ER tasks under different scenarios were also considered. In addition, a Locality Sensitive Hashing (LSH) based blocking approach was assessed; it considers all attributes of a tuple and produces smaller blocks than traditional methods that use only a few attributes. The algorithm was tested on multiple datasets, namely benchmark and multilingual data. The experimental results showed that Stacked Dedupe Learning achieves high quality and good performance, and scales well compared with existing solutions.
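As a rough illustration of the LSH-based blocking idea mentioned in the abstract, the following minimal Python sketch groups records into candidate blocks using tokens from every attribute of a tuple. It is not the authors' implementation: the abstract does not say which hash family or input representation SDLER uses, so MinHash over word tokens is an assumption made here, and the function names (`tokens`, `minhash_signature`, `lsh_blocks`, `candidate_pairs`) and the parameters `num_hashes` and `bands` are hypothetical.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def tokens(record):
    """Lower-cased word tokens drawn from every attribute of the tuple."""
    return {tok for value in record for tok in str(value).lower().split()}


def _hash(seed, tok):
    """Deterministic 64-bit hash of a token under a given seed."""
    digest = hashlib.blake2b(f"{seed}:{tok}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def minhash_signature(token_set, num_hashes):
    """MinHash signature: the minimum hash value per seeded hash function."""
    return [min(_hash(seed, tok) for tok in token_set)
            for seed in range(num_hashes)]


def lsh_blocks(records, num_hashes=32, bands=16):
    """Return sets of record ids whose signatures agree on at least one band.

    Assumption: MinHash with banding stands in for whatever LSH scheme
    the paper actually uses.
    """
    rows = num_hashes // bands
    buckets = defaultdict(set)
    for rec_id, record in records.items():
        toks = tokens(record)
        if not toks:  # nothing to hash, so the record joins no block
            continue
        sig = minhash_signature(toks, num_hashes)
        for b in range(bands):
            band = tuple(sig[b * rows:(b + 1) * rows])
            buckets[(b, band)].add(rec_id)
    return [ids for ids in buckets.values() if len(ids) > 1]


def candidate_pairs(blocks):
    """Unique record-id pairs that share at least one block."""
    pairs = set()
    for block in blocks:
        pairs.update(combinations(sorted(block), 2))
    return pairs


if __name__ == "__main__":
    # Hypothetical toy records: (name, address)
    records = {
        1: ("John Smith", "12 Main St"),
        2: ("john smith", "12 MAIN st"),
        3: ("Alice Wong", "99 Oak Ave"),
    }
    print(candidate_pairs(lsh_blocks(records)))
```

Only pairs that land in a shared bucket are passed on to the more expensive matching step, which is what keeps blocking cheap. Raising `bands` (with fewer rows per band) makes blocks more inclusive; lowering it makes them stricter.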

Original language: English
Pages (from-to): 10959-10983
Number of pages: 25
Journal: Journal of Supercomputing
Volume: 77
Issue number: 10
DOIs
State: Published - Oct 2021
Externally published: Yes

Bibliographical note

Publisher Copyright:
© 2021, The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature.

Keywords

  • Bidirectional RNN
  • Big data
  • Data quality
  • Entity resolution
  • Stacked Dedupe Learning (SDL)
  • Word Representation Distribution (WRD)

ASJC Scopus subject areas

  • Theoretical Computer Science
  • Software
  • Information Systems
  • Hardware and Architecture
