# Learning based multi-obstacle avoidance of unmanned aerial vehicles with a novel reward

*Complex Eng Syst*2023;3:21.

## Abstract

In this paper, a novel reward-based learning method is proposed for unmanned aerial vehicles to achieve multi-obstacle avoidance. The Markov jump model was first formulated for the unmanned aerial vehicle obstacle avoidance problem. A distinctive reward shaping function is proposed to adaptively avoid obstacles and finally reach the target position via an optimal approach such that an adaptive Q-learning algorithm called the improved prioritized experience replay is developed. Simulation results show that the proposed algorithm can achieve autonomous obstacle avoidance in complex environments with improved performance.

## Keywords

*,*multi-obstacle avoidance

*,*adaptive Q-learning

## 1. INTRODUCTION

The rapid advancement of artificial intelligence and computer technology has significantly facilitated the extensive adoption of unmanned aerial vehicles (UAVs) in diverse domains, encompassing civil, commercial, and military sectors. Collision avoidance of UAVs continues to be a significant concern due to its direct implications on safety and the successful completion of tasks^{[1–3]}. Therefore, there has been considerable research interest in collision avoidance techniques, which are designed to mitigate the occurrence of collisions between UAVs and other objects^{[4]}.

The Dijkstra algorithm is adopted to address the issue of collision avoidance among UAVs and solve the shortest path problem in a power graph^{[5]}. In three-dimensional (3D) space, the A* (A-star) algorithm is used to evaluate the generation value of scalable waypoints within the path region using heuristic functions. The value of each generation was then compared with the search operation time and the cost of distance for waypoints in order to identify the optimal path^{[6]}. The Rapidly Exploring Random Trees (RRT) algorithm generates a search tree by randomly selecting leaf nodes and subsequently extends the search tree to cover the entire search space in order to identify the desired path^{[7]}. One study utilized the artificial potential field method to address the issue of UAV movement in the presence of targets and obstacles. This method involves converting the influence of the target and obstacle on the UAV's movement into an artificial potential field. By utilizing the gravitational and repulsive forces between the UAV and the obstacle, the researchers were able to effectively control the UAV's movement towards the target point, following the negative gradient of the potential field^{[8]}. The aforementioned methods pertain to the category of classical collision avoidance techniques. Although it is possible to achieve collision avoidance between UAVs and obstacles, these methods are generally limited in their applicability to addressing collision avoidance tasks in simple environments. In the presence of intricate challenges, the task of resolving them becomes increasingly arduous as the dimensions of the system expand, thereby imposing certain constraints.

With the advancement of computer and big data technologies, a variety of intelligent algorithms, including ant colony algorithms, genetic algorithms, and particle swarm optimization algorithms. In a previous study^{[9]}, the researchers employed the ant colony algorithm, which utilized pheromone as a guiding factor in the decision-making processes of ants for route selection. This approach enabled the ants to concentrate their efforts on identifying the most optimal path. Genetic algorithms can also be used to address the obstacle avoidance issue in UAVs^{[10]}. On the basis of the particle swarm optimization algorithm, the authors^{[11]} proposed a 3D path planning algorithm for UAV formation. The algorithm incorporates comprehensive improvements to automatically generate optimized flying paths. Researchers have extensively examined the benefits of the genetic algorithm and particle swarm optimization algorithm. Furthermore, they have integrated these two algorithms to compute the feasible and quasi-optimal trajectory for a fixed-wing UAV in a complex 3D environment^{[12]}. The above smart algorithms can theoretically find the optimal solution of the path, which is suitable for solving UAV collision avoidance and path planning problems in stationary environments. In practical terms, the environment is typically characterized by its dynamic and unknown nature. Reinforcement learning (RL), as an emerging research trend, has demonstrated significant potential in the domains of aircraft collision avoidance and route planning through intelligent interaction with the surrounding environment.

SARSA and Q-learning methods can enhance the obstacle avoidance ability of UAVs and optimize the calculation of the shortest path. Both methods are model-free algorithms that aim to eliminate reliance on the environmental model and make action selections based on the values associated with all available actions^{[13,14]}. Despite possessing distinctive characteristics and benefits, classical RL methods demonstrate significant limitations when confronted with high-dimensional motion environments and multi-action inputs, particularly as the complexity of the mission environment for UAVs increases. Mnih *et al*., (2013) were the first to propose the integration of RL with neural networks (NN) by utilizing them to approximate the Q function^{[15]}. Afterward, there have been significant advancements in RL methods, such as the Deep Q-network (DQN) RL-based approaches. These developments have led to the creation of deep RL algorithms that effectively address decision-making problems in complex and continuous dynamic systems^{[16,17]}. Therefore, deep RL can enable the UAV to continuously improve its obstacle avoidance strategy by interacting with the environment, thereby enabling it to have autonomous learning capabilities. The deep RL model can adapt to these complex environments through continuous trial and error learning, which means that UAVs can better adapt to different environments and tasks.

It is noteworthy to mention that the reward component in the RL method plays a pivotal role in enhancing performance. Inadequate reward configurations not only result in insufficient data and sparse rewards but also affect the intended efficacy of UAV collision avoidance. Notice that the reward is typically formulated by artificially assigning predetermined values. The adaptive rewarding or punishing of UAVs in order to optimize the prescribed performance while successfully avoiding collisions remains an unresolved matter. In this paper, a novel RL method is proposed, which incorporates an adaptive reward setting to effectively tackle the aforementioned issue. Experiments demonstrate that the proposed method has the capability to achieve autonomous obstacle avoidance for UAVs in complex and unfamiliar environments, leading to a substantial enhancement in UAV performance. The primary contributions of this paper are outlined as follows:

1. Compared to conventional optimization control methods utilized for obstacle avoidance in UAVs^{[9–12]}, the developed RL method has the capability to dynamically learn the optimal control strategy in an unfamiliar environment. This enables the UAV to effectively avoid obstacles and ultimately reach the target destination using an approximately optimal approach.

2. A new reward function has been developed, which is combined with adaptive weights to guide the agent's focus towards important aspects of the task. This approach effectively balances obstacle avoidance and reaching the target point.

The subsequent sections of this paper are structured as follows. Section 2 outlines the problem of collision avoidance for UAVs. Section 3 introduces a novel self-learning method for UAVs to avoid obstacles, incorporating an adaptive changing reward function. In Section 4, this paper presents numerical simulation results to validate the effectiveness of the developed method and analyze the advantages of the new reward function. Finally, the conclusions are presented in Section 5.

## 2. PROBLEM FORMULATION

Consider a UAV obstacle avoidance problem where the goal is to reach a target point in a bounded 3D region and avoid obstacles in an unknown environment. Assume that the obstacles in the search region are both static and unknown a priori.

### 2.1. UAV model

The kinematic model of the UAV is presented as follows:

where

To avoid the multiple obstacles in environments, we will monitor the nearest obstacle to the UAV. The following three detection ranges according to the distance

●

●

●

where

The overall goal of this paper is to self-learn control decisions for the UAV such that it can travel on obstacle-free paths to the prescribed target destination.

## 3. SELF-LEARNING OBSTACLE AVOIDANCE ALGORITHM

In this section, we are going to present a self-learning framework based on the DQN algorithm for the obstacle avoidance problem of UAVs in 3D space. Firstly, the anti-collision problem of UAV is formulated as a Markov decision process (MDP), and then, a DQN algorithm with the novel reward is developed for decision-making such that the UAV can reach the target point without collision with the obstacles.

### 3.1. MDP formulation of UAV obstacle avoidance

The problem of UAV obstacle avoidance can be formulated as an MDP

(1) State: The UAV state consists of the current coordinates of the UAV in the 3D coordinate system, that is

(2) Action: The horizontal and vertical heading angles of the UAV are assumed to change simultaneously. Angle variations are indicated by the angular velocity:

where

The navigational angle of the UAV in 3D space is controlled by varying its horizontal and vertical directional angular velocities. Therefore, the action

All actions compose the space of actions A.

(3) Reward: It is crucial to design the reward function for efficiently avoiding the obstacles and reaching the prescribed target point. The design of the reward function of traditional RL algorithms is usually relatively simple, requiring an artificial setting of weights to achieve a balance between different rewards, thereby encouraging the agent to complete the training task, which has certain limitations. Considering the overall UAV navigation cost and task, a novel reward function with a weight

where

From (7), one can find that there are three parts in the reward, and they are respectively responsible for obstacle avoidance, the target point, and energy consumption caused by actions. The parameter

where

The obstacle avoidance module reward function is as follows:

where

The reward function of the distance loss is as follows:

where both

The reward function for deflection angle loss is given below:

**Remark 1:** The design of the reward function plays an important role in the learning and training of UAV collision avoidance problems, which also determines the effectiveness and efficiency of NN training. In existing approaches to reward setting^{[19]}, the reward function is mostly defined as a fixed positive reward value when the UAV's next state is closer to the goal point after the UVA has executed the action. Alternatively, a fixed negative reward value is given for punishment. For collision avoidance problems, the drawback is that it is not possible to quantitatively characterize the impact of the current chosen action on its future. Moreover, that is a subjective decision and cannot guarantee the performance optimality. The improved deep Q network algorithm, based on prioritized experience replay (PER), proposed in this article, utilizes adaptive weights to dynamically adjust the reward function. Compared to the traditional DQN and PER algorithms, the improved PER (IPER) algorithm can accelerate learning convergence speed and improve sample efficiency by better handling sparse reward signals in tasks and incorporating important experiences.

**Remark 2:** From (8) and (9), one can notice that the weight

### 3.2. DQN algorithm design based on prioritized experience replay

An important component of the DQN algorithm is experience replay, which stores data generated from past learning into an experience pool and randomly draws past data from the experience pool for learning in the next network training. In this way, serial correlations can be broken, and past experiences can be reused. However, the random sampling approach also leads to wasted experience, so the PER approach can effectively address this issue^{[20]}. PER means that when extracting experience, the most valuable experience is extracted first with a certain probability to avoid overfitting. Therefore, compared to traditional experience replay, PER assigns a priority value

where

The goal is to make the TD error close and possibly small. If the TD error is relatively large, it means that our current Q-function is still far from the target Q-function and should be updated more often. Therefore, the TD error is used to measure the empirical value.

According to the above formula, the probability formula for the extracted sample k is as follows:

where

Recall the DQN algorithm^{[21]}; the Q-function update formula is as follows:

where

The objective function

The DQN algorithm uses two NNs for error backpropagation and weight update. One of them is called evaluate_ net, which is used to generate an estimated value. Another is called target_ net, which generates a target

A dynamically decaying

where

where

Algorithm 1 Improved PER |

1: Initialize the replay memory D with the capacity N |

2: Initialize evaluate_ net with weights |

3: Initialize target_ net with |

4: for episode = 1, M do |

5: Initialize the reward and set the UAV to the starting point |

6: fordo |

7: Choose action |

8: Otherwise, select action: |

9: Perform action |

10: Calculate: |

11: Order |

12: Calculate the precedence of state action transition experience: |

13: Store the transformed experience |

14: Calculate the sampling probability |

15: Draw samples according to the sampling probability |

16: Calculate loss function: |

17: Update the weights |

18: Every C steps assign the weights of the evaluate_ net to targe_ net |

19: end for |

20: end for |

**Remark 3:** In Algorithm 1, the highlighted contribution is that the adaptive reward works for the reward or punishment for the behavior of a UAV. Moreover, it plays a key role in performance evaluation and policy updates, together with DQN and PER. In the sense of the novel reward function, the developed Algorithm 1 is called IPER.

**Remark 4:** Notice that the natural exponential decay algorithm^{[22]} to the time-varying greedy policy is employed for fully exploring the environment at the start of learning.

## 4. SIMULATION RESULTS

This section aims to evaluate the effectiveness of the proposed algorithm by implementing the developed DQN algorithm with adaptive changing rewards.

This section assumes a 3D map area of 100 m

Parameters of the environment and UAV model

Parameter | Value |

Map size | 100 |

Flight speed | 10 m/s |

Safety distance | 5 m |

Warning distance | 10 m |

Parameters of network training

Reply memory size | 1, 000 |

Mini-batch size | 64 |

Learning rate | 0.01 |

Discount factor | 0.8 |

By implementing Algorithm 1, Figure 3 and Figure 4 show the UAV trajectories in three obstacle environments with different numbers and positions. By adopting the proposed algorithm, the UAV can successfully accomplish obstacle avoidance in some complex environments following a certain amount of training time.

Figure 3. UAV trajectories in 3D multi-obstacle environments. (A) Environment 1; (B) Environment 2; (C) Environment 3.

Figure 4. UAV trajectories in 3D multi-obstacle environments (aerial view). (A) Environment 1; (B) Environment 2; (C) Environment 3.

The obstacle avoidance capability of the algorithm under consideration is evaluated in 3D mountainous terrain. Figure 5 illustrates the trajectories of UAVs as they navigate through obstacle avoidance paths in three distinct environments, each characterized by varying numbers and heights of peaks. It is evident that the UAV is capable of effectively avoiding obstacles in challenging environments through the utilization of the proposed algorithm.

Figure 5. UAV trajectories in 3D mountain environments. (A) Environment 1; (B) Environment 2; (C) Environment 3.

Figure 6 depicts the diagram of the loss function for the algorithm proposed in this paper. It is evident that the loss function tends to converge after 120 rounds. In particular, the validation of the loss function performs satisfactorily as 0.83733 at round 121.

For comparisons, the PER and DQN methods are implemented in the multiple mountain environments as well. The IPER algorithm proposed in this article combines PER and adaptive weight dynamic adjustment of the reward function. Compared to the traditional DQN algorithm and PER algorithm, it has the advantages of improving sample utilization and accelerating the convergence speed of the algorithm. Figure 7 and Figure 8 plot the average reward curves and the average step of reaching the target point using PER, DQN, and IPER in this paper. From Figure 7 and Figure 8, one can see that the IPER algorithm developed in this paper outperforms the PER and DQN methods.

The above figures show the flight trajectories of the UAV in multiple obstacle environments and mountain simulation environments from different angles, showing that the UAV can reach the target point in a shorter path while successfully avoiding obstacles. In addition, by comparing the two indicators of average reward and average step size, it is evident that the IPER algorithm proposed in this paper outperforms both the traditional DQN algorithm and the PER algorithm.

## 5. CONCLUSIONS

This paper proposes a deep Q network algorithm that integrates an adaptive reward function and a PER method to address the challenge of intelligent obstacle avoidance for UAVs. Compared to the traditional DQN algorithm, rewards are dynamically adjusted based on the spatial distance between the drone and various obstacles along its path. This adjustment aims to strike a balance between obstacle avoidance and performance indicators. According to our experimental results, this adaptive reward method can effectively enable the UAV to navigate to the target point successfully without colliding with obstacles. In addition, deep RL also suffers from the issue of requiring a substantial amount of data samples, which is both unavoidable and expensive in real-world scenarios. In future research, we will address this problem and explore ways to effectively utilize existing samples for training in order to enhance efficiency.

The DQN algorithm learns the optimal strategy through a large number of training samples in a simulated environment. However, this strategy may not be very suitable for UAV navigation tasks in real-world environments because there may be significant differences between the real environment and the simulation environment. Therefore, studying how to adapt the strategies learned in the simulation environment to the real environment is a question worth exploring. Therefore, investigation transfer RL for practical obstacle avoidance of UAVs will be our future research direction.

## DECLARATIONS

### Authors' contributions

Made significant contributions to the conception and experiments: Gao H

Made significant contributions to the writing: Kong B

Made substantial contributions to the revision: Yu M, Li J

### Availability of data and materials

Not applicable.

### Financial support and sponsorship

This work was supported in part by the National Natural Science Foundation of China under Grant 62073158 and the Basic research project of the Education Department of Liaoning Province (LJKZ0401).

### Conflicts of interest

All authors declared that there are no conflicts of interest.

### Ethical approval and consent to participate

Not applicable.

### Consent for publication

Not applicable.

### Copyright

© The Author(s) 2023.

## REFERENCES

1. Zhou Y, Baras JS. Reachable set approach to collision avoidance for UAVs. In 2015 54th IEEE Conference on Decision and Control (CDC). 2015;pp. 5947-52.

2. Iacono M, Sgorbissa A. Path following and obstacle avoidance for an autonomous UAV using a depth camera. *Robot Auton Syst* 2018;106:38-46.

3. Radmanesh M, Kumar M, Guentert PH, Sarim M. Overview of path-planning and obstacle avoidance algorithms for UAVs: a comparative study. *Unmanned Syst* 2018;6:95-118.

4. Ait Saadi A, Soukane A, Meraihi Y, Benmessaoud Gabis A, Mirjalili S, Ramdane-Cherif A. UAV path planning using optimization approaches: a survey. *Arch Comput Methods Eng* 2022;29:4233-84.

5. Maini P, Sujit PB. Path planning for a UAV with kinematic constraints in the presence of polygonal obstacles. In 2016 International Conference on Unmanned Aircraft Systems (ICUAS); 2016. pp. 62-7.

6. Mandloi D, Arya R, Verma AK. Unmanned aerial vehicle path planning based on A* algorithm and its variants in 3D environment. *Int J Syst Assur Eng Manag* 2021;12:990-1000.

7. Wu X, Xu L, Zhen R, Wu X. Biased sampling potentially guided intelligent bidirectional RRT*Math Probl Eng* 2019;2019:1-12.

8. Liu H, Liu HH, Chi C, Zhai Y, Zhan X. Navigation information augmented artificial potential field algorithm for collision avoidance in UAV formation flight. *Aerosp Syst* 2020;3:229-41.

9. Perez-Carabaza S, Besada-Portas E, Lopez-Orozco JA, de la Cruz JM. Ant colony optimization for multi-UAV minimum time search in uncertain domains. *Appl Soft Comput* 2018;62:789-806.

10. Li J, Deng G, Luo C, Lin Q, Yan Q, Ming Z. A hybrid path planning method in unmanned air/ground vehicle (UAV/UGV) cooperative systems. *IEEE Trans Veh Technol* 2016;65:9585-96.

11. Shao S, Peng Y, He C, Du Y. Efficient path planning for UAV formation via comprehensively improved particle swarm optimization. *ISA Trans* 2020;97:415-30.

12. Roberge V, Tarbouchi M, Labonte G. Comparison of parallel genetic algorithm and particle swarm optimization for real-time UAV path planning. *IEEE Trans Industr Inform* 2013;9:132-41.

13. Jembre YZ, Nugroho YW, Khan MT, et al. Evaluation of reinforcement and deep learning algorithms in controlling unmanned aerial vehicles. *Appl Sci* 2021;11:7240.

14. Wu J, Sun Y, Li D, et al. An adaptive conversion speed Q-learning algorithm for search and rescue UAV path planning in unknown environments. *IEEE Trans Veh Technol* 2023:1-14.

15. Mnih V, Kavukcuoglu K, Silver D, et al. Playing Atari with deep reinforcement learning. *arXiv* 2013. Available from: https://doi.org/10.48550/arXiv.1312.5602[Last accessed on 16 Aug 2023].

16. Ye Z, Wang K, Chen Y, Jiang X, Song G. Multi-UAV navigation for partially observable communication coverage by graph reinforcement learning. *IEEE Trans Mob Comput* 2023;22:4056-69.

17. Wang L, Wang K, Pan C, Xu W, Aslam N, Nallanathan A. Deep reinforcement learning based dynamic trajectory control for UAV-assisted mobile edge computing. *IEEE Trans Mob Comput* 2022;21:3536-50.

18. Lin Z, Castano L, Mortimer E, Xu H. Fast 3D collision avoidance algorithm for Fixed Wing UAS. *J Intell Robot Syst* 2019;97:577-604.

19. Jiang L, Huang H, Ding Z. Path planning for intelligent robots based on deep Q-learning with experience replay and heuristic knowledge. *IEEE/CAA J Automatica Sinica* 2020;7:1179-89.

20. Schaul T, Quan J, Antonoglou I, Silver D. Prioritized experience replay. *arXiv* 2016. Available from: https://arxiv.org/abs/1511.05952[Last accessed on 16 Aug 2023].

21. Mnih V, Kavukcuoglu K, Silver D, et al. Human-level control through deep reinforcement learning. *Nature* 2015;518:529-33.

## Cite This Article

## How to Cite

Gao, H.; Kong B.; Yu M.; Li J. Learning based multi-obstacle avoidance of unmanned aerial vehicles with a novel reward. *Complex. Eng. Syst.* **2023**, *3*, 21. http://dx.doi.org/10.20517/ces.2023.24

## Download Citation

## Export Citation File:

## Type of Import

### Tips on Downloading Citation

### Citation Manager File Format

### Type of Import

**Direct Import:**When the Direct Import option is selected (the default state), a dialogue box will give you the option to Save or Open the downloaded citation data. Choosing Open will either launch your citation manager or give you a choice of applications with which to use the metadata. The Save option saves the file locally for later use.

**Indirect Import:**When the Indirect Import option is selected, the metadata is displayed and may be copied and pasted as needed.

## About This Article

### Copyright

**Open Access**This article is licensed under a Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, sharing, adaptation, distribution and reproduction in any medium or format, for any purpose, even commercially, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

## Data & Comments

### Data

**Views**

**Downloads**

**Citations**

**Comments**

**8**

### Comments

Comments must be written in English. Spam, offensive content, impersonation, and private information will not be permitted. If any comment is reported and identified as inappropriate content by OAE staff, the comment will be removed without notice. If you have any queries or need any help, please contact us at support@oaepublish.com.

^{0}