Abstract
Artificial intelligence (AI) and machine learning have significantly improved many sectors, such as education, healthcare, and industry. Machine learning techniques depend mainly on the volume and diversity of the training data. With the ongoing digital transformation, abundant data can be collected from different sources. However, the problem that needs to be addressed is how this amount of data can be processed and where it can be stored. Cloud services and distributed file systems (DFSs) help address this issue. Many DFSs, such as Hadoop, Quantcast, and Apache Spark, differ in many aspects, including scheduling algorithms, data management protocols, throughput, and runtime. Some DFSs may be better suited to specific applications than others. Apache Spark can handle iterative operations such as machine learning workloads, and it provides an integrated library of machine learning algorithms called MLlib. In this paper, we evaluated the use of Spark with two machine learning algorithms, namely Logistic Regression (LR) and Random Forests (RF). We investigated the effect of varying the memory allocation configuration and of using a GPU. We concluded that the use of Spark greatly improves runtime and memory consumption. However, its use has to be justified by the size of the data, because several factors affect the machine learning model's accuracy. The memory allocation should be kept to the minimum needed, and a GPU should be used only when the machine learning algorithm supports parallelization.
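As a rough illustration of the kind of experiment the abstract describes, the following sketch varies Spark's memory properties and times MLlib's Logistic Regression and Random Forest classifiers. It is not the authors' code: the function names `memory_settings` and `run_once`, the LIBSVM input format, the 80/20 split, and the hyperparameters (`maxIter=20`, `numTrees=50`) are all assumptions made for the example.

```python
import time


def memory_settings(executor_memory: str, driver_memory: str) -> dict:
    """Return the Spark memory properties for one run (hypothetical helper)."""
    return {
        "spark.executor.memory": executor_memory,
        "spark.driver.memory": driver_memory,
    }


def run_once(train_path: str, settings: dict) -> dict:
    """Train LR and RF on a LIBSVM file; report fit time and test accuracy.

    Assumes labels are 0-based class indices, and that each configuration is
    run in a fresh process (driver memory is usually fixed once the JVM
    starts, so it is often set through spark-submit instead).
    """
    # Imported here so memory_settings() stays usable without pyspark.
    from pyspark.sql import SparkSession
    from pyspark.ml.classification import (
        LogisticRegression,
        RandomForestClassifier,
    )
    from pyspark.ml.evaluation import MulticlassClassificationEvaluator

    builder = SparkSession.builder.appName("mllib-memory-sketch")
    for key, value in settings.items():
        builder = builder.config(key, value)
    spark = builder.getOrCreate()

    # The LIBSVM reader yields the default "label" and "features" columns.
    data = spark.read.format("libsvm").load(train_path)
    train, test = data.randomSplit([0.8, 0.2], seed=42)
    evaluator = MulticlassClassificationEvaluator(metricName="accuracy")

    results = {}
    for name, estimator in [
        ("LR", LogisticRegression(maxIter=20)),
        ("RF", RandomForestClassifier(numTrees=50)),
    ]:
        start = time.perf_counter()
        model = estimator.fit(train)
        elapsed = time.perf_counter() - start
        accuracy = evaluator.evaluate(model.transform(test))
        results[name] = {"seconds": elapsed, "accuracy": accuracy}

    spark.stop()
    return results
```

Calling `run_once("data.libsvm", memory_settings("2g", "1g"))` with a hypothetical input file, and repeating it with different memory values, would give per-algorithm timings and accuracies to compare.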
| Original language | English |
|---|---|
| Title of host publication | Proceedings - 2022 14th IEEE International Conference on Computational Intelligence and Communication Networks, CICN 2022 |
| Publisher | Institute of Electrical and Electronics Engineers Inc. |
| Pages | 498-504 |
| Number of pages | 7 |
| ISBN (Electronic) | 9781665487719 |
| State | Published - 2022 |
| Event | 14th IEEE International Conference on Computational Intelligence and Communication Networks, CICN 2022 - Al-Khobar, Saudi Arabia. Duration: 4 Dec 2022 → 6 Dec 2022 |
Publication series
| Name | Proceedings - 2022 14th IEEE International Conference on Computational Intelligence and Communication Networks, CICN 2022 |
|---|---|
Conference
| Conference | 14th IEEE International Conference on Computational Intelligence and Communication Networks, CICN 2022 |
|---|---|
| Country/Territory | Saudi Arabia |
| City | Al-Khobar |
| Period | 4/12/22 → 6/12/22 |
Bibliographical note
Publisher Copyright: © 2022 IEEE.
Keywords
- Apache Spark
- Bigdata
- GPU
- MLlib
- machine learning
ASJC Scopus subject areas
- Artificial Intelligence
- Computer Networks and Communications
- Computer Science Applications
- Computer Vision and Pattern Recognition