Reinforcement learning with parameterized action space and sparse reward for UAV navigation
Abstract
Autonomous navigation of unmanned aerial vehicles (UAVs) is widely used in building rescue systems. As task complexity increases, traditional methods that rely on environment models become difficult to apply. In this paper, a reinforcement learning (RL) algorithm is proposed to solve the UAV navigation problem. The UAV navigation task is modeled as a Markov Decision Process (MDP) with parameterized actions, and the sparse reward problem is also taken into account. To address these issues, we develop HER-MPDQN by combining the Multi-Pass Deep Q-Network (MP-DQN) and Hindsight Experience Replay (HER). Two UAV navigation simulation environments with progressive difficulty are constructed to evaluate our method. The results show that HER-MPDQN outperforms the baselines in the relatively simple task and that, for the complex task involving relay operations, only our method achieves satisfactory performance.
1. INTRODUCTION
In recent years, unmanned aerial vehicles (UAVs) have been widely used in emergency rescue within the framework of Industry 4.0[1–3]. The key technologies behind these applications are UAV attitude control[4, 5] and autonomous navigation[6]. Traditional approaches address navigation challenges using modeling techniques[7–9] and simultaneous localization and mapping (SLAM) techniques[10–12]. While these methods have demonstrated satisfactory performance, they heavily rely on prior knowledge of the environment, limiting their applicability in complex and dynamic navigation scenarios.
Deep reinforcement learning (DRL) has developed rapidly and achieved remarkable success in recent years. DRL aims to derive a policy by maximizing the long-term cumulative reward of a Markov decision process (MDP). An MDP characterizes a process in which an agent in some state takes an action, transitions to another state, and obtains a reward. For continuous decision problems, Silver and colleagues developed deterministic policy gradient methods, on which DDPG[30] builds.
Recent work on UAV navigation based on DRL has been successful, but some potential issues have been overlooked. On the one hand, existing studies usually define the behavior of UAVs as single-type actions. However, UAVs often need both discrete decision-making and continuous control when performing tasks. In [18], Masson et al. formalized reinforcement learning with parameterized actions, in which each discrete action is associated with its own continuous parameters.
On the other hand, the studies above train agents with custom reward functions to speed up network convergence[22, 23]. However, overly complex reward signals may cause the agent to get stuck in local optima. Moreover, it is hard to assign correct rewards based on whether the UAV is moving toward or away from the target when obstacles are present. This problem can be avoided by adopting a more general sparse reward scheme. While defining a sparse reward is simple, the resulting learning problem is much harder to solve, because the lack of intermediate rewards hinders learning. Numerous studies propose curiosity-based approaches to encourage agents to explore previously rare states[24, 25]. These methods introduce additional models that fit the environmental dynamics to measure curiosity; however, such models reduce efficiency when the dynamics are unpredictable. Andrychowicz et al.[26] proposed hindsight experience replay (HER), which relabels the goals of failed trajectories with states that were actually reached, turning failures into useful learning signals without changing the sparse reward function.
Parameterized action spaces and sparse rewards motivate us to reformulate the UAV navigation problem as a new form of RL problem. The contributions of this paper can be summarized as follows:
The rest of this article is organized as follows. PAMDP and the sparse reward problem are described in Section 2. In Section 3, the navigation problem is modeled as a PAMDP with a sparse reward scheme. Our approach to this problem is elaborated in Section 4. In Section 5, we compare HER-MPDQN with other baselines in two UAV navigation simulation environments, followed by our discussion in Section 6 and conclusions in Section 7.
2. BACKGROUND
In this section, we briefly introduce MDP and PAMDP, followed by the issue of sparse reward.
2.1. MDP and PAMDP
An MDP is built on a set of interacting objects, namely the agent and the environment, and is specified by the tuple $(\mathcal{S}, \mathcal{A}, P, R, \gamma)$: a state space $\mathcal{S}$, an action space $\mathcal{A}$, a state transition function $P(s_{t+1} \mid s_t, a_t)$, a reward function $R(s_t, a_t)$, and a discount factor $\gamma \in [0, 1)$,

where $s_t \in \mathcal{S}$ is the state observed by the agent at time step $t$, $a_t \in \mathcal{A}$ is the action it takes, and $r_t = R(s_t, a_t)$ is the reward it receives after transitioning to $s_{t+1}$.

The agent learns an optimal policy $\pi^{*}$ by maximizing the expected discounted return (objective function)

$$ J(\pi) = \mathbb{E}_{\pi}\!\left[ \sum_{t=0}^{\infty} \gamma^{t} r_t \right]. $$

If the action space contains both discrete choices and continuous quantities, a standard MDP with a single-type action space can no longer describe the decision problem directly, which motivates the PAMDP introduced below.
PAMDP is an extension of MDP on the action space, allowing decision-making with parameterized actions. Parameterized actions flexibly integrate discrete actions and continuous actions and provide richer expressiveness. In tasks such as UAV navigation, which demand precise parameter control, using parameterized actions enables finer-grained control. Moreover, from an interpretability standpoint, the structural characteristics of parameterized actions make the decision-making process of agents more understandable and explainable. The action space of a PAMDP can be expressed as

$$ \mathcal{A} = \bigcup_{k \in [K]} \big\{ (k, x_k) \mid x_k \in \mathcal{X}_k \big\}, $$

where $[K] = \{1, \ldots, K\}$ is the finite set of discrete actions and $\mathcal{X}_k \subseteq \mathbb{R}^{m_k}$ is the continuous parameter space associated with the $k$-th discrete action.
2.2. Sparse reward problems
The sparse reward scheme is a reward mechanism that uses only binary values to indicate whether a task is eventually completed. Specifically, the agent receives a positive reward only when it completes the task and a negative reward at all other steps during exploration. Although this mechanism reduces the effort required for human design, it creates a learning problem: the lack of informative rewards prevents the agent from judging the quality of its behavior and thus from improving its policy.
3. PROBLEM FORMULATION
In this section, a UAV navigation task from an initial to a target position is first formulated. The UAV will learn a policy by mapping internal state information to action sequences. Then the PAMDP modeling process with sparse reward is described in detail.
3.1. UAV navigation tasks
In this paper, UAV navigation tasks are divided into the direct navigation task [Figure 1] and the relay navigation task [Figure 2]. The distinction between these tasks lies in the presence of intermediate operations that the UAV needs to perform. As a result, the flight requirements for UAVs in relay navigation tasks are more demanding. Nevertheless, the simple task is also considered to verify the effectiveness and generality of the algorithm.
Figure 1. The direct navigation task. The UAV only needs to fly from the initial point to the target area.
Figure 2. The relay navigation task. The UAV needs to perform a "catch supply" operation during the navigation process.
Typically, the position, direction, and motion of the UAV are described with respect to both the earth coordinate frame and the body coordinate frame. The earth coordinate frame is used to describe the position and direction of the UAV, while the body coordinate frame, fixed to the airframe, is used to describe its motion.
For simplicity, the UAV is assumed to fly at a fixed altitude, i.e., its motion is constrained within the x-y plane. Moreover, steering actions take effect immediately because momentum is ignored. As a result, the vector describing the motion of the UAV reduces to its planar position, flight speed, and heading angle, where the position is expressed in the earth coordinate frame and the speed and heading jointly determine the displacement at each time step.
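Under these assumptions, the planar motion can be propagated with a simple kinematic update. The following sketch is our own illustration; the symbol names, the Euler integration step, and the non-negative speed constraint are assumptions rather than the paper's exact simulator dynamics.

```python
import numpy as np

def step_kinematics(x, y, v, psi, accel, turn, dt=1.0):
    """Propagate the simplified planar UAV state one time step.

    x, y  : position in the earth frame
    v     : forward speed
    psi   : heading angle; steering is applied instantaneously (momentum ignored)
    accel : commanded change of speed
    turn  : commanded change of heading, applied immediately
    """
    psi = psi + turn                      # instantaneous steering
    v = max(0.0, v + accel * dt)          # speed cannot become negative
    x = x + v * np.cos(psi) * dt
    y = y + v * np.sin(psi) * dt
    return x, y, v, psi
```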
3.2. State representation specification
State representation is essential for the agent to perceive the surrounding environment and reach the target position. In our setting, the state vector is kept as compact as possible to avoid interference from irrelevant information and to speed up training. Specifically, the observable information of the UAV comes from three sources. The first is the internal state of the UAV, represented in terms of its position, speed, and heading. The second is information about the task objects in the environment, i.e., the target area and, in the relay task, the supply. The third is the goal information required by hindsight experience replay, namely the desired goal and the achieved goal.
3.3. Parameterized action design
Considering that the flight altitude of the UAV has been fixed above, its movement in the vertical direction can be ignored. For the direct navigation task, the optional discrete actions of the UAV are defined in a small discrete set of maneuvers, each paired with its own continuous parameter (a hypothetical encoding is sketched below).
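As an illustration, such a hybrid action space can be declared with standard Gym spaces. The concrete maneuvers below (ACCELERATE, TURN, BREAK) and their parameter ranges are modeled after the open-source gym-hybrid environment cited in Section 5 and are our assumption rather than the paper's exact definition; for the relay task, a parameterless CATCH action would be appended to the discrete set.

```python
import numpy as np
from gym import spaces

ACCELERATE, TURN, BREAK = 0, 1, 2          # illustrative action indices

# A discrete maneuver plus the continuous parameter of each maneuver;
# parameters of unselected maneuvers are simply ignored by the environment.
action_space = spaces.Tuple((
    spaces.Discrete(3),                                             # which maneuver to apply
    spaces.Box(low=0.0, high=1.0, shape=(1,), dtype=np.float32),    # ACCELERATE: throttle
    spaces.Box(low=-1.0, high=1.0, shape=(1,), dtype=np.float32),   # TURN: normalized steering angle
))

k, throttle, steering = action_space.sample()
```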
3.4. Sparse reward function
Although the sparse reward setting is simple and universal, each task still needs its own reward function, and different tasks exhibit different degrees of sparsity. The direct navigation task only requires the UAV to reach the target area; it is a single-stage task with relatively low reward sparsity, and its reward is non-negative only at the time step when the UAV enters the target area, with a negative penalty at every other step. In the relay navigation task that contains supplies, an effective reward can be obtained only when the UAV finds the supply and carries it to the target area: the reward is non-negative only once the supply has been delivered to the target area and negative otherwise. The relay operation significantly increases the reward sparsity, and the existing baselines cannot handle it effectively; this phenomenon and our solution are described in detail in Section 4.2.
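A minimal sketch of the two sparse reward functions follows, assuming the 0/-1 convention commonly used with HER; the exact reward values, the success radius, and the function names are our assumptions rather than the paper's definitions.

```python
import numpy as np

def direct_reward(uav_pos, target_pos, success_radius):
    """Sparse reward for the direct task: 0 inside the target area, -1 otherwise."""
    reached = np.linalg.norm(np.asarray(uav_pos) - np.asarray(target_pos)) <= success_radius
    return 0.0 if reached else -1.0

def relay_reward(supply_pos, target_pos, success_radius):
    """Sparse reward for the relay task: 0 only once the supply is in the target area."""
    delivered = np.linalg.norm(np.asarray(supply_pos) - np.asarray(target_pos)) <= success_radius
    return 0.0 if delivered else -1.0
```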
4. DEEP REINFORCEMENT LEARNING AGENT
To solve the navigation task presented in Section 3, the MP-DQN algorithm is considered first, as it has been proven effective for PAMDPs. We then introduce how HER is extended to solve the sparse reward problem in the relay task, followed by the implementation and training process of HER-MPDQN.
4.1. Multi-Pass Deep Q-Network (MP-DQN)
MP-DQN is an off-policy RL algorithm designed for parameterized action spaces. In MP-DQN, the goal of the agent is to optimize a policy that selects both a discrete action and its continuous parameters so as to maximize the expected discounted return.
Because it integrates the DQN[29] and DDPG[30] algorithms, MP-DQN has a network architecture similar to the actor-critic architecture. Specifically, the MP-DQN agent has an actor-parameter network, which outputs the continuous parameters of every discrete action, and a Q-value network, which estimates the value of each discrete action given those parameters.
For the actor-parameter network, the MP-DQN agent learns a deterministic mapping from the state to the joint parameter vector

$$ \boldsymbol{x}(s;\theta_x) = \big(x_1(s;\theta_x), \ldots, x_K(s;\theta_x)\big), $$

where $x_k(s;\theta_x)$ is the continuous parameter generated for the $k$-th discrete action and $\theta_x$ denotes the network weights. To prevent the parameters of other actions from disturbing the Q-value of action $k$, MP-DQN performs a multi-pass evaluation: the state is duplicated $K$ times and combined with a matrix in which the $k$-th row contains only $x_k$ while the parameters of all other actions are set to zero. Each row of the matrix is fed into the Q-value network separately, thereby eliminating the influence of irrelevant parameters on the estimation of the Q-value function.
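In code, the multi-pass evaluation amounts to duplicating the state $K$ times, zeroing out the parameters of all but one action in each copy, and reading the $k$-th Q-value from the $k$-th pass. The PyTorch sketch below is our own simplified illustration (tensor shapes, the `q_net` interface, and variable names are assumptions):

```python
import torch

def multi_pass_q(q_net, state, all_params, param_slices):
    """Compute Q(s, k, x_k) for every discrete action k with the multi-pass trick.

    q_net        : network mapping [state, params] -> K Q-values per input row
    state        : tensor of shape (state_dim,)
    all_params   : joint parameter vector of shape (param_dim,)
    param_slices : list of K slice objects, one per discrete action
    """
    K = len(param_slices)
    states = state.unsqueeze(0).expand(K, -1)            # (K, state_dim)
    params = torch.zeros(K, all_params.numel())          # start from all-zero parameters
    for k, sl in enumerate(param_slices):
        params[k, sl] = all_params[sl]                   # row k keeps only x_k
    q_all = q_net(torch.cat([states, params], dim=1))    # (K, K) Q-values
    # Row k is the pass in which only x_k is visible; take its k-th output.
    return q_all.diagonal()                              # (K,) vector of Q(s, k, x_k)
```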
For the Q-value network, MP-DQN evaluates actions using an action-value function similar to DDPG. Since MP-DQN needs to optimize both discrete actions and continuous parameters, the Bellman equation becomes

$$ Q(s_t, k_t, x_{k_t}) = \mathbb{E}_{r_t, s_{t+1}}\!\left[ r_t + \gamma \max_{k} Q\big(s_{t+1}, k, x_k(s_{t+1};\theta_x)\big) \,\middle|\, s_t, k_t, x_{k_t} \right], $$

where the inner maximization over continuous parameters is replaced by the output of the actor-parameter network $x_k(\cdot;\theta_x)$, so that only the discrete action needs to be maximized explicitly. The Q-value network with weights $\theta_Q$ approximates this function as $Q(s, k, x_k; \theta_Q)$, where target networks with weights $\theta_Q^{-}$ and $\theta_x^{-}$ are used to stabilize learning. Then, similar to DQN, the Q-value network is updated by minimizing the mean squared error

$$ L_Q(\theta_Q) = \mathbb{E}\!\left[ \tfrac{1}{2}\big( y_t - Q(s_t, k_t, x_{k_t}; \theta_Q) \big)^2 \right], \qquad y_t = r_t + \gamma \max_{k} Q\big(s_{t+1}, k, x_k(s_{t+1};\theta_x^{-}); \theta_Q^{-}\big), $$

where $y_t$ is the target value computed with the target networks. The loss of the actor-parameter network is given by the negative sum of Q-values as

$$ L_x(\theta_x) = -\,\mathbb{E}\!\left[ \sum_{k=1}^{K} Q\big(s_t, k, x_k(s_t;\theta_x); \theta_Q\big) \right]. $$
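Putting the two losses together, one update step of an MP-DQN-style agent could look roughly like the following PyTorch sketch; the optimizers, the batch layout, and the `q_net`/`actor` interfaces are our assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def mpdqn_update(batch, q_net, q_target, actor, actor_target, q_opt, actor_opt, gamma=0.99):
    """One simplified MP-DQN-style update step.

    q_net(s, x) is assumed to return Q-values for all K discrete actions given the
    state s and joint continuous parameters x (internally using the multi-pass trick).
    """
    s, k, x, r, s_next, done = batch        # tensors with a leading batch dimension

    # Q-value network update: regress towards the bootstrapped target.
    with torch.no_grad():
        x_next = actor_target(s_next)                                  # parameters from target actor
        q_next = q_target(s_next, x_next).max(dim=1).values            # max over discrete actions
        y = r + gamma * (1.0 - done) * q_next
    q_pred = q_net(s, x).gather(1, k.long().unsqueeze(1)).squeeze(1)   # Q(s, k, x_k)
    q_loss = F.mse_loss(q_pred, y)
    q_opt.zero_grad()
    q_loss.backward()
    q_opt.step()

    # Actor-parameter network update: minimize the negative sum of Q-values.
    actor_loss = -q_net(s, actor(s)).sum(dim=1).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()
```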
4.2. Hindsight experience replay
HER is another important baseline algorithm that we consider in our method. The core idea is to expand the experience by constructing the goal variable as shown in Figure 3 (a). There are two ways to construct the goal variable:
Figure 3. Illustration of positive reward sparsity for HER. (a) In a task that does not contain a manipulated object, the achieved goal is directly affected by the behavior of the agent and changes constantly in each rollout. In this case, HER can generate valuable learning experience. (b) In a task containing a manipulated object, the achieved goal remains unchanged until the agent comes into contact with the object. In this case, all the experience generated by HER includes positive rewards but provides no substantial help to the agent's learning.
The new goals constructed in this way have the potential to generalize to unseen real goals. Although the agent fails to achieve the given goal in the current episode, it still learns action sequences that may achieve different given goals in future episodes. Therefore, an original failure transition can be transformed into a virtual success transition by selecting a new goal from the states experienced to replace the original goal.
However, HER tends to be less efficient in some relay tasks. As shown in Figure 3(b), the goal of the agent is to deliver the block to the target area. It should be noted that the achieved goal, i.e., the position of the block, does not change until the agent touches the block. Consequently, nearly all hindsight goals constructed by HER coincide with the block's initial position, and the relabeled transitions obtain positive rewards without providing any useful guidance for learning.
To solve this problem, we propose the goal-switch mechanism (GSM). The GSM generates meaningful goal variables by dynamically assigning goals during experience expansion. In the relay navigation task, the conventional HER approach always designates the target area as the desired goal, regardless of whether the supply has been picked up. In contrast, the GSM switches the goal with the task stage: before the supply is caught, the supply position serves as the goal, and once it is caught, the target area becomes the goal. A sketch of this mechanism is given below.
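The following minimal sketch shows how the GSM could assign goals in the relay task; the function name, the stage test, and the use of the UAV/supply positions as achieved and desired goals are our interpretation of the mechanism rather than the authors' exact code.

```python
import numpy as np

def switch_goal(uav_pos, supply_pos, supply_caught, target_pos):
    """Goal-switch mechanism (GSM): pick the goal variables for hindsight relabeling
    according to the current stage of the relay task.

    Stage 1 (supply not caught): the UAV must reach the supply, so the achieved goal
        is the UAV position and the desired goal is the supply position.
    Stage 2 (supply caught): the supply must reach the target area, so the achieved
        goal is the supply position and the desired goal is the target area.
    """
    if not supply_caught:
        achieved_goal = np.asarray(uav_pos, dtype=np.float32)
        desired_goal = np.asarray(supply_pos, dtype=np.float32)
    else:
        achieved_goal = np.asarray(supply_pos, dtype=np.float32)
        desired_goal = np.asarray(target_pos, dtype=np.float32)
    return achieved_goal, desired_goal
```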
4.3. HER-MPDQN
MP-DQN samples a fixed batch from an offline experience replay buffer to update the networks, and HER expands the original experience by goal replacement. On this basis, we further eliminate hindsight experience that has no guiding significance. This means that our proposal can be effectively integrated with MP-DQN. As shown in Figure 4, the input vectors of the networks in MP-DQN are extended with the goal. Specifically, the input vector of the actor-parameter network becomes the concatenation of the state and the goal.
Figure 4. The network architecture of HER-MPDQN. The state and the goal are concatenated as the input of the actor-parameter network, and the resulting continuous parameters are combined with the state and the goal as the input of the Q-value network.
Correspondingly, the input vector of the Q-value network becomes the concatenation of the state, the goal, and the continuous parameters produced by the actor-parameter network.
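In code, the extended inputs reduce to simple concatenations; the dimensions below are purely illustrative.

```python
import torch

state = torch.randn(8)       # illustrative state vector
goal = torch.randn(2)        # desired goal (e.g., a 2-D position)
params = torch.randn(2)      # joint continuous parameters from the actor-parameter network

actor_input = torch.cat([state, goal])        # input of the actor-parameter network
q_input = torch.cat([state, goal, params])    # input of the Q-value network (per pass)
```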
In our method, the original experience is stored together with the hindsight experience in the replay buffer. A fixed batch of data is sampled from this buffer whenever the networks need to be updated. The training process in Algorithm 1 is divided into three steps: (1) the agent interacts with the environment to accumulate real experience; (2) the original experience and the hindsight experience of each time step are collected and stored in the replay buffer; and (3) the network parameters are updated a number of times proportional to the length of each episode.
Algorithm 1: HER-MPDQN
Input: minibatch size N, replay buffer D, number of training episodes M, maximum episode length T
Initialize the weights of the Q-value network and the actor-parameter network.
Initialize the target networks by hard update.
for episode = 1 to M do
    Sample an initial state and a desired goal, then roll out one episode with the current exploratory policy.
    Store the original transitions in D; construct hindsight transitions with the goal-switch mechanism and store them in D.
    Sample minibatches of size N from D and update the Q-value network and the actor-parameter network, with the number of updates proportional to the episode length.
    Update the target networks.
end for
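The training procedure can be summarized by the following Python skeleton of the three steps above; `env`, `agent`, `replay_buffer`, and `relabel_with_gsm` are placeholder names, not the authors' API.

```python
def train(env, agent, replay_buffer, relabel_with_gsm, episodes, batch_size):
    """Skeleton of the HER-MPDQN training loop (all helper names are placeholders)."""
    for _ in range(episodes):
        # (1) Interact with the environment to accumulate real experience.
        trajectory, done = [], False
        obs, goal = env.reset()
        while not done:
            k, x_k = agent.act(obs, goal)                      # parameterized action
            next_obs, reward, done, info = env.step((k, x_k))
            trajectory.append((obs, goal, k, x_k, reward, next_obs, done))
            obs = next_obs

        # (2) Store the original experience and the hindsight experience produced
        #     by goal replacement with the goal-switch mechanism.
        for transition in trajectory:
            replay_buffer.add(transition)
        for transition in relabel_with_gsm(trajectory):
            replay_buffer.add(transition)

        # (3) Update the networks, with the number of updates tied to episode length.
        for _ in range(len(trajectory)):
            agent.update(replay_buffer.sample(batch_size))
```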
5. RESULTS
To evaluate the effectiveness of HER-MPDQN, a UAV is fully trained and tested in two environments. A relatively simple direct navigation environment and the corresponding experimental results are presented first in Section 5.1. After that, a more complex relay navigation task is considered; the related experiments and result analysis are given in Section 5.2.
5.1. The direct navigation task
To our knowledge, open-source benchmark environments for UAV navigation tasks with parameterized action spaces and sparse rewards are currently lacking. Therefore, we simulate a large-scale environment inspired by an existing open-source simulation1. Figure 1 shows a top view of the navigation environment. The UAV can fly within an area of four square kilometers (2 km × 2 km).
1 https://github.com/thomashirtz/gym-hybrid
We chose Python 3.7 as our development language due to its simplicity, flexibility, and efficient development capabilities. For algorithm development and testing, we utilize the OpenAI Gym library, which is an open-source RL library that offers convenient tools for creating custom environments. In terms of hardware, we employ an AMD Ryzen 7 5800H processor and 16GB of RAM. This configuration is relatively new and provides sufficient computing resources to support the training and testing of the algorithms developed in this study.
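For orientation, a custom Gym environment for the direct navigation task can be outlined as follows. This is a deliberately simplified sketch under our own assumptions (observation layout, bounds, and the 0/-1 reward); the authors' actual environment is available from the repository listed in the Availability section.

```python
import numpy as np
import gym
from gym import spaces

class DirectNavigationEnv(gym.Env):
    """Minimal sketch of a sparse-reward direct navigation environment."""

    def __init__(self, half_size=1000.0, success_radius=50.0, max_steps=100):
        super().__init__()
        self.half_size, self.success_radius, self.max_steps = half_size, success_radius, max_steps
        self.observation_space = spaces.Box(-np.inf, np.inf, shape=(4,), dtype=np.float32)
        self.action_space = spaces.Tuple((
            spaces.Discrete(3),                                   # discrete maneuver
            spaces.Box(-1.0, 1.0, shape=(2,), dtype=np.float32),  # continuous parameters
        ))

    def reset(self):
        self.steps = 0
        self.uav = np.zeros(2, dtype=np.float32)
        self.target = np.random.uniform(-self.half_size, self.half_size, 2).astype(np.float32)
        return np.concatenate([self.uav, self.target])

    def step(self, action):
        self.steps += 1
        self._apply(action)                                    # kinematic update, omitted here
        success = np.linalg.norm(self.uav - self.target) <= self.success_radius
        out = np.any(np.abs(self.uav) > self.half_size)
        reward = 0.0 if success else -1.0                      # sparse reward (assumed 0/-1 scheme)
        done = success or out or self.steps >= self.max_steps
        return np.concatenate([self.uav, self.target]), reward, done, {"is_success": bool(success)}

    def _apply(self, action):
        pass  # placeholder for the simplified UAV kinematics
```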
In the context of sparse rewards, an episode terminates either when the task is completed or when the agent flies out of bounds. Because an early out-of-bounds termination accumulates fewer step penalties, it may even yield a higher episode reward than a longer successful flight. Therefore, relying solely on the episode reward is inadequate to assess learning effectiveness. To evaluate algorithm performance, we use the success rate as the metric, which directly reflects how often the agent completes the task; during evaluation, a higher success rate also implies a higher reward. The success rate is defined as

$$ \text{Success rate} = \frac{N_{\text{success}}}{N_{\text{total}}}, $$

where $N_{\text{success}}$ is the number of episodes in which the task is completed and $N_{\text{total}}$ is the total number of episodes.
We evaluate the performance of HER-MPDQN in the above environment and also implement and test three baselines: HER-PDQN[31], MP-DQN[21], and P-DQN[20]. Each algorithm uses the same network architecture with hidden layer sizes (128, 64). The maximum number of steps per episode is limited to 100, the size of the replay buffer
Figure 5 shows the training process of the UAV in the direct navigation task, and Table 1 reports the evaluation performance of each algorithm over 1000 episodes. The results show that HER-MPDQN converges faster and achieves a higher average success rate. Compared with HER-PDQN, our method evaluates the Q-value more accurately and updates the actor-parameter network without bias. At the same time, introducing HER allows the agent to learn from effective experience early, which gives it a stronger learning ability than the original MP-DQN. Since P-DQN has no additional mechanism to eliminate the influence of irrelevant parameters or to deal with sparse rewards, its performance is unsatisfactory. In addition, the learning curve of HER-MPDQN has a smaller shaded area, indicating better robustness across experiments.
Figure 5. Comparisons of the results using HER-MPDQN and other baselines. The left figure shows the periodic calculation of the task completion rate of the last 10 episodes, and the right figure shows the total success rate during the entire training process. The shaded area is the variance in multiple experiments. Smaller shading indicates that the algorithm is less sensitive to random seeds.
Table 1. Average performance over 1000 evaluation episodes in the direct navigation task

| Algorithm | Success rate | Mean reward |
|---|---|---|
| HER-MPDQN (ours) | 0.810 | -40.095 |
| HER-PDQN | 0.725 | -43.141 |
| MP-DQN | 0.412 | -70.504 |
| P-DQN | 0.305 | -84.504 |
5.2. The relay navigation task
Considering the need for UAVs to deliver supplies, a relay navigation environment is further simulated. Compared with the previous environment, the following changes are made: (1) a supply point is added that requires the UAV to perform a relay operation, and (2) a new discrete action CATCH is introduced to represent the grabbing operation. As shown in Figure 2, the UAV must deliver the supply to the target area within a limited number of steps. CATCH is valid only when the UAV is in contact with the supply. Each episode ends when the supply has been transported to the target area (the winning state) or when the time limit is exceeded.
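The environment changes for the relay task can be captured with a small amount of extra state, as in the following sketch; the CATCH index, the catch radius, and the function names are our assumptions.

```python
import numpy as np

CATCH = 3  # index of the new parameterless discrete action (illustrative)

def apply_catch(uav_pos, supply_pos, supply_caught, catch_radius=20.0):
    """Handle the CATCH action: it is valid only when the UAV touches the supply."""
    if not supply_caught and np.linalg.norm(uav_pos - supply_pos) <= catch_radius:
        supply_caught = True
    return supply_caught

def update_supply(uav_pos, supply_pos, supply_caught):
    """Once caught, the supply moves with the UAV; otherwise it stays in place."""
    return uav_pos.copy() if supply_caught else supply_pos
```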
Due to the increased task complexity, the hyperparameters for this task are adjusted as follows. The hidden layer sizes of all networks are set to (256, 128, 64). The size of the replay buffer
Figure 6 shows the learning curves in the relay task, and the evaluation results are given in Table 2. It is obvious that HER-MPDQN is far superior to the other algorithms. On the one hand, the sparser reward makes it impossible for the UAV to complete the task through random actions, so MP-DQN and P-DQN, which lack HER, can hardly learn at all. On the other hand, although introducing HER generates many positive experiences, they have no guiding significance, so HER-PDQN cannot learn an effective policy. By treating the relay task as a continuous multi-stage task, HER-MPDQN automatically allocates the goal according to the current state, so the hindsight experience of each stage provides correct guidance. Figure 7 shows the flight trajectories of the trained UAV when performing specific relay navigation tasks; it can be seen that the UAV has learned the correct action policy. Moreover, HER-MPDQN exhibits good scalability in our experiments: the multi-goal relay navigation task can be completed by expanding the goal space. We leave this research for future work.
Figure 6. Comparisons of the results using HER-MPDQN and other baselines. The left figure shows the periodic calculation of the task completion rate of the last 100 episodes; other descriptions for evaluating algorithm performance are the same as Figure 5.
Table 2. Average performance over 1000 evaluation episodes in the relay navigation task

| Algorithm | Success rate | Mean reward |
|---|---|---|
| HER-MPDQN (ours) | 0.885 | -57.758 |
| HER-PDQN | 0.000 | -00.000 |
| MP-DQN | 0.000 | -00.000 |
| P-DQN | 0.000 | -00.000 |
6. DISCUSSION
Existing baselines can learn effective policies in the direct navigation task but fail to handle the relay navigation task. In contrast, HER-MPDQN achieves satisfactory results in both the simple and the complex task, which indicates that it is versatile across different types of navigation tasks. General-purpose agents are a critical direction of DRL research because they avoid repeated algorithm design. HER-MPDQN is particularly advantageous in tasks for which reward functions are difficult to design and flexible rescue strategies are required. However, the algorithm has not yet been tested in environments with obstacles, and the flying altitude of the UAV is fixed. These limitations make the transfer from simulation to reality more difficult. Future research can address the sparse reward problem under more realistic simulation conditions to narrow the sim-to-real gap.
7. CONCLUSIONS
In this paper, the HER-MPDQN algorithm is developed to address UAV navigation tasks with a parameterized action space and sparse rewards. In addition, a goal-switch mechanism is proposed to correct meaningless hindsight experience in the relay navigation task. The experiments show that HER-MPDQN outperforms the baselines in terms of training speed and converged performance. Especially in the relay task, only the agent trained by HER-MPDQN learns effectively, owing to its reasonable experience expansion. Further research could consider the energy consumption model of UAVs and test real UAVs in high-dimensional environments containing obstacles.
DECLARATIONS
Authors' contributions
Made contributions to the conception and design of the work, developed most associated code for the simulation environment and the reinforcement learning method, and performed data analysis and visualizations, manuscript writing, and related tasks: Feng S
Contributed to parts of the conception and design and performed the validation of the experimental design and related tasks: Li X
Contributed to the research supervision and guidance, provided environmental equipment, and performed analysis and transition of formal specifications, manuscript writing, and related tasks: Ren L
Participated in part of the experimental data analysis and visualizations and performed data collation and related tasks: Xu S
Availability of data and materials
This work is based on the Gym platform. All codes can be found at https://github.com/No-human-is-limited/HER-MPDQN. There are no additional data or materials associated with this study.
Financial support and sponsorship
This work was supported in part by the Open Project Program of Fujian Provincial Key Laboratory of Intelligent Identification and Control of Complex Dynamic System No: 2022A0001; in part by the National Natural Science Foundation of China No: 62203002; and in part by the Natural Science Foundation of Anhui Province No: 2208085QF204.
Conflicts of interest
All authors declared that there are no conflicts of interest.
Ethical approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Copyright
© The Author(s) 2023.
REFERENCES
1. Mourtzis D. Simulation in the design and operation of manufacturing systems: state of the art and new trends. Int J Prod Res 2020;58:1927-49.
2. Mourtzis D. Design and operation of production networks for mass personalization in the era of cloud technology Amsterdam: Elsevier; 2021. pp. 1-393.
3. Atif M, Ahmad R, Ahmad W, Zhao L, Rodrigues JJ. UAV-assisted wireless localization for search and rescue. IEEE Syst J 2021;15:3261-72.
4. Zhao L, Liu Y, Peng Q, Zhao L. A dual aircraft maneuver formation controller for mav/uav based on the hybrid intelligent agent. Drones 2023;7:282.
5. Mourtzis D, Angelopoulos J, Panopoulos N. Unmanned aerial vehicle (UAV) manipulation assisted by augmented reality (AR): the case of a drone. IFAC-PapersOnLine 2022;55:983-8.
6. Walker O, Vanegas F, Gonzalez F, et al. A deep reinforcement learning framework for UAV navigation in indoor environments. 2019 IEEE Aerospace Conference 2019:1-14.
7. Wang Q, Zhang A, Qi L. Three-dimensional path planning for UAV based on improved PSO algorithm. In: The 26th Chinese Control and Decision Conference (2014 CCDC). IEEE; 2014. pp. 3981-5.
8. Yao P, Xie Z, Ren P. Optimal UAV route planning for coverage search of stationary target in river. IEEE Trans Contr Syst Technol 2017;27:822-9.
9. Shin J, Bang H, Morlier J. UAV path planning under dynamic threats using an improved PSO algorithm. Int J Aerospace Eng 2020;2020:1-17.
10. Wen C, Qin L, Zhu Q, Wang C, Li J. Three-dimensional indoor mobile mapping with fusion of two-dimensional laser scanner and RGB-D camera data. IEEE Geosci Remote Sensing Lett 2013;11:843-7.
11. Mu B, Giamou M, Paull L, et al. Information-based active SLAM via topological feature graphs. In: 2016 IEEE 55th Conference on Decision and Control (CDC). IEEE; 2016. pp. 5583-90.
12. Weiss S, Scaramuzza D, Siegwart R. Monocular-SLAM–based navigation for autonomous micro helicopters in GPS-denied environments. J Field Robot 2011;28:854-74.
13. Silver D, Huang A, Maddison CJ, et al. Mastering the game of Go with deep neural networks and tree search. Nature 2016;529:484-9.
14. Levine S, Finn C, Darrell T, Abbeel P. End-to-end training of deep visuomotor policies. J Mach Learn Res 2016;17:1334-73.
15. Levine S, Pastor P, Krizhevsky A, Ibarz J, Quillen D. Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection. Ind Robot 2018;37:421-36.
16. Yan C, Xiang X. A path planning algorithm for UAV based on improved Q-learning. In: 2018 2nd International Conference on Robotics and Automation Sciences (ICRAS). IEEE; 2018. pp. 1-5.
17. Bouhamed O, Ghazzai H, Besbes H, Massoud Y. Autonomous UAV navigation: A DDPG-based deep reinforcement learning approach. In: 2020 IEEE International Symposium on circuits and systems (ISCAS). IEEE; 2020. pp. 1-5.
18. Masson W, Ranchod P, Konidaris G. Reinforcement learning with parameterized actions. In: Thirtieth AAAI Conference on Artificial Intelligence; 2016.
19. Hausknecht M, Stone P. Deep reinforcement learning in parameterized action space. In: Proceedings of the International Conference on Learning Representations. 2016.
20. Xiong J, Wang Q, Yang Z, et al. Parametrized deep q-networks learning: reinforcement learning with discrete-continuous hybrid action space. arXiv preprint arXiv:181006394. 2018.
21. Bester CJ, James SD, Konidaris GD. Multi-pass q-networks for deep reinforcement learning with parameterised action spaces. arXiv preprint arXiv:190504388. 2019.
22. Wang W, Luo X, Li Y, Xie S. Unmanned surface vessel obstacle avoidance with prior knowledge-based reward shaping. Concurrency Computat Pract Exper 2021;33:e6110.
23. Okudo T, Yamada S. Subgoal-based reward shaping to improve efficiency in reinforcement learning. IEEE Access 2021;9:97557-68.
24. Burda Y, Edwards H, Storkey A, Klimov O. Exploration by random network distillation. arXiv preprint arXiv:181012894. 2018.
25. Badia AP, Sprechmann P, Vitvitskyi A, et al. Never give up: learning directed exploration strategies. arXiv preprint arXiv:200206038. 2020.
26. Andrychowicz M, Wolski F, Ray A, et al. Hindsight experience replay. Adv Neural Inf Process Syst 2017:30.
27. Lanka S, Wu T. Archer: aggressive rewards to counter bias in hindsight experience replay. arXiv preprint arXiv:180902070. 2018.
28. Schramm L, Deng Y, Granados E, Boularias A. USHER: unbiased sampling for hindsight experience replay. arXiv preprint arXiv:220701115. 2022.
29. Mnih V, Kavukcuoglu K, Silver D, et al. Human-level control through deep reinforcement learning. Nature 2015;518:529-33.
30. Lillicrap TP, Hunt JJ, Pritzel A, et al. Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971. 2015.