Performance Evaluation of Machine Learning Models on Apache Spark: An Empirical Study

  • Asma Z. Yamani
  • Shikah J. Alsunaidi
  • Imane Boudellioua

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Artificial intelligence (AI) and machine learning have significantly improved many sectors, such as education, healthcare, and industry. Machine learning techniques depend mainly on the volume and diversity of training data. With the ongoing digital transformation, an abundant amount of data can be collected from different sources. However, the problem that needs to be addressed is how this amount of data can be processed and where it can be stored. Cloud services and distributed file systems (DFSs) help address this issue. Many DFSs, such as Hadoop, Quantcast, and Apache Spark, differ in many aspects, including scheduling algorithms, data management protocol, throughput, and runtime. Some DFSs may be better suited to specific applications than others. Apache Spark can handle iterative operations such as machine learning workloads, and it provides an integrated library of machine learning algorithms called MLlib. In this paper, we evaluated the use of Spark using two machine learning algorithms, namely Logistic Regression (LR) and Random Forests (RF). We investigated the effect of varying the memory allocation configuration and the use of a GPU. We concluded that the use of Spark greatly improves runtime and memory consumption. However, its use has to be justified by the size of the data, due to different factors that affect the machine learning model's accuracy. The memory allocation should be kept to the minimum needed, and a GPU should only be used when the machine learning algorithm supports parallelization.

Original language: English
Title of host publication: Proceedings - 2022 14th IEEE International Conference on Computational Intelligence and Communication Networks, CICN 2022
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 498-504
Number of pages: 7
ISBN (Electronic): 9781665487719
DOIs
State: Published - 2022
Event: 14th IEEE International Conference on Computational Intelligence and Communication Networks, CICN 2022 - Al-Khobar, Saudi Arabia
Duration: 4 Dec 2022 → 6 Dec 2022

Publication series

Name: Proceedings - 2022 14th IEEE International Conference on Computational Intelligence and Communication Networks, CICN 2022

Conference

Conference: 14th IEEE International Conference on Computational Intelligence and Communication Networks, CICN 2022
Country/Territory: Saudi Arabia
City: Al-Khobar
Period: 4/12/22 → 6/12/22

Bibliographical note

Publisher Copyright:
© 2022 IEEE.

Keywords

  • Apache Spark
  • Bigdata
  • GPU
  • MLlib
  • machine learning

ASJC Scopus subject areas

  • Artificial Intelligence
  • Computer Networks and Communications
  • Computer Science Applications
  • Computer Vision and Pattern Recognition
