Comparing Reinforcement Learning Algorithms for Online Couriers Scheduling in Crowdsourced Last-Mile Delivery

Research output: Contribution to journal › Conference article › peer-review

Abstract

Crowdsourced delivery platforms face challenges in matching couriers to customer orders because of fluctuating demand and uncertain courier availability. The platform's courier workforce consists of two types: committed couriers, who commit to working for a specific period, and occasional couriers, who log in to the platform at times of their choosing. Traditionally, these platforms set "offline" schedules for committed couriers in advance, based on forecasts of the deliveries anticipated within defined time windows. However, because actual order numbers are unpredictable in real time, efficient operations require reactive scheduling that matches resources optimally to changing demand trends. A recent paper, Saleh et al. (2024), proposed extending the shifts of committed couriers as needed in response to changing demand throughout the day. The problem was formulated as a Markov Decision Process (MDP) and solved with a popular value-based algorithm, Deep Q-Network (DQN), to maximize the platform's expected reward. In this work, we extend Saleh et al. (2024) by investigating two alternative approaches from the class of policy gradient-based algorithms, namely Proximal Policy Optimization (PPO) and Advantage Actor Critic (A2C), which optimize policy parameters to maximize the expected reward. Through a comparative analysis, we evaluate the effectiveness of PPO and A2C against DQN in addressing the scheduling challenges faced by crowdsourced delivery platforms. The results show that PPO achieved the best results in terms of total reward, lost requests, and shift extension costs, effectively learning an optimized policy through its clipped objective function. While A2C achieved a higher expected reward than DQN, it struggled to optimize its policy directly and to prioritize timely deliveries. Finally, DQN was found to rely excessively on shift extensions.
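The abstract contrasts a value-based method (DQN) with two policy-gradient methods (A2C and PPO) and attributes PPO's performance to its clipped objective. As a minimal sketch of the textbook per-sample quantities each method works with, the block below computes the DQN one-step target, the A2C actor loss, and the PPO clipped surrogate objective. This is not the authors' implementation: the shift-extension MDP (states, actions, rewards) is not described here, and the function names, the clipping parameter `eps=0.2`, and the toy inputs are illustrative assumptions.

```python
import math
from typing import Sequence


def dqn_target(reward: float, next_q_values: Sequence[float],
               gamma: float, done: bool) -> float:
    """DQN regression target: r + gamma * max_a' Q(s', a'), or r at episode end."""
    if done:
        return reward
    return reward + gamma * max(next_q_values)


def a2c_policy_loss(log_probs: Sequence[float],
                    advantages: Sequence[float]) -> float:
    """A2C actor loss: negative mean of log pi(a|s) * A(s, a) (to be minimized)."""
    n = len(log_probs)
    return -sum(lp * adv for lp, adv in zip(log_probs, advantages)) / n


def ppo_clip_objective(new_log_probs: Sequence[float],
                       old_log_probs: Sequence[float],
                       advantages: Sequence[float],
                       eps: float = 0.2) -> float:
    """PPO clipped surrogate: mean of min(r*A, clip(r, 1-eps, 1+eps)*A) (to be maximized).

    r = pi_new(a|s) / pi_old(a|s), computed from log-probabilities for stability.
    """
    total = 0.0
    for new_lp, old_lp, adv in zip(new_log_probs, old_log_probs, advantages):
        ratio = math.exp(new_lp - old_lp)
        clipped_ratio = min(max(ratio, 1.0 - eps), 1.0 + eps)
        # Taking the minimum keeps the more pessimistic of the two estimates.
        total += min(ratio * adv, clipped_ratio * adv)
    return total / len(advantages)


if __name__ == "__main__":
    # Hypothetical single transition: the new policy raises an action's
    # probability from 0.4 to 0.6 (ratio 1.5).
    print(ppo_clip_objective([math.log(0.6)], [math.log(0.4)], [2.0]))
    print(dqn_target(1.0, [0.5, 2.0], gamma=0.9, done=False))
```

With a positive advantage and a ratio of 1.5, PPO caps the contribution at (1 + eps) · A = 2.4 rather than 3.0, which limits how far a single update can move the policy; with a negative advantage the unclipped, more pessimistic term is kept. A2C has no such cap, and DQN learns action values rather than a policy.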

Original language: English
Pages (from-to): 121-128
Number of pages: 8
Journal: Transportation Research Procedia
Volume: 84
State: Published - 2025
Event: 1st International Conference on Smart Mobility and Logistics Ecosystems, SMiLE 2024 - Dhahran, Saudi Arabia
Duration: 17 Sep 2024 to 19 Sep 2024

Bibliographical note

Publisher Copyright:
© 2024 The Authors. Published by ELSEVIER B.V.

Keywords

  • A2C
  • Crowd-sourced last mile delivery
  • DQN
  • Deep Reinforcement Learning
  • PPO

ASJC Scopus subject areas

  • Transportation
