Research Article  |  Open Access  |  12 Sep 2023

A distributed multi-vehicle pursuit scheme: generative multi-adversarial reinforcement learning

Intell Robot 2023;3(3):436-52.
10.20517/ir.2023.25 |  © The Author(s) 2023.

Abstract

Multi-vehicle pursuit (MVP) is one of the most challenging problems for intelligent traffic management systems due to multi-source heterogeneous data and its mission nature. While many reinforcement learning (RL) algorithms have shown promising abilities for MVP in structured grid-pattern roads, their lack of dynamic and effective traffic awareness limits pursuing efficiency. The sparse reward of pursuing tasks still hinders the optimization of these RL algorithms. Therefore, this paper proposes a distributed generative multi-adversarial RL for MVP (DGMARL-MVP) in urban traffic scenes. In DGMARL-MVP, a generative multi-adversarial network is designed to improve the Bellman equation by generating the potential dense reward, thereby properly guiding strategy optimization of distributed multi-agent RL. Moreover, a graph neural network-based intersecting cognition is proposed to extract integrated features of traffic situations and relationships among agents from multi-source heterogeneous data. These integrated and comprehensive traffic features are used to assist RL decision-making and improve pursuing efficiency. Extensive experimental results show that the DGMARL-MVP can reduce the pursuit time by 5.47% compared with proximal policy optimization and improve the average pursuing success rate to 85.67%. Codes are open-sourced on GitHub.

Keywords

Generative multi-adversarial reinforcement learning, graph neural network, intersecting cognition, multi-vehicle pursuit

1. INTRODUCTION

Enabled by novel sensing technology[1] and the self-learning ability of reinforcement learning (RL)[2], the intelligent traffic management system is enjoying a significant upgrade and showing great potential to solve various problems in intelligent transportation systems (ITS)[3]. As a complex special scene, multi-vehicle pursuit (MVP) describes the problem of multiple vehicles capturing several moving targets[4], represented by the New York City Police Department guideline on the pursuit of suspicious vehicles[5]. Moreover, various military intelligence combat scenes can also be modeled as MVP[6]. Effective reward guidance[7] and comprehensive perception[8] of complex and dynamic urban traffic environments are the keys to solving the MVP problem and are gradually becoming hot topics.

Aiming at the MVP problem, Garcia et al. extended classical differential game theory and devised saddle-point strategies[9] to address multi-player pursuit-evasion problems. Xu et al. considered greedy, lazy, and traitorous pursuers during the pursuit and rigorously re-analyzed Nash equilibrium[10]. A graph-theoretic approach[11] was employed to study the interactions of the agents and obtain distributed control policies for pursuers. A region-based relay pursuit scheme[12] was designed for the pursuers to capture one evader. Jia et al. proposed a policy iteration method-based continuous-time Markov decision process (MDP)[13] to optimize the pursuer strategy. However, these classical methods for MVP are not competent for complex traffic scenes with more constraints due to poor robustness. De Souza et al. introduced distributed multi-agent RL and curriculum learning to MVP problems[14]. To improve pursuing efficiency, Zhang et al. constructed a multi-agent coronal bidirectionally coordinated with a target prediction network[15] based on the multi-agent deep deterministic policy gradient. For efficient cooperation among pursuers, Yang et al. designed a hierarchical collaborative framework[16]. Zheng et al. extended multi-to-multi competition to air combat among unmanned aerial vehicles[17]. However, due to the mission nature of MVP, the pursuers only obtain a sparse reward after successfully capturing an evader. None of the aforementioned RL-based methods have addressed the sparse reward problem. This issue blurs the direction of the gradient descent of neural networks and seriously affects the strategy optimization. In addition, the lack of dynamic and effective awareness in the above MVP methods limits pursuing efficiency.

Due to powerful capabilities of distribution feature extraction and data generation, generative adversarial networks (GANs) have drawn growing interest in recent years[18] and have been combined with RL to optimize strategies. To address the problem of incomplete observation of traffic information, Wang et al. used GANs for traffic data recovery to assist in deep RL (DRL) decision-making[19]. A GAN-assisted human preference-based RL approach[20] was proposed that adopted a GAN to learn human preferences. Li et al. designed a conditional deep generative model to predict future trajectory distribution[21]. The adversarial training of GANs was introduced into the policy network and critic network[22] to optimize RL training. Zheng et al. developed a reward-reinforced GAN[23] to represent the distribution of the value function. However, mission-critical requirements of MVP pose significant challenges to these methods. The problem of the sparse reward remains unsolved, hindering the RL optimization.

Graph neural networks (GNNs) have an excellent ability to handle unstructured data and are widely applied to modeling multi-agent interactions and feature extraction of traffic information. Liu et al. modeled the relationship between agents by a complete graph[24] to indicate the importance of the interaction between two agents. For cooperation among heterogeneous agents, Du et al. proposed a heterogeneous graph attention network[25] to model the relationships among these diverse agents. GNNs were employed to model vehicle relationships and extract traffic features to enhance autonomous driving[26,27]. A GNN with spatial-temporal clustering[28] was designed for traffic flow forecasting. However, the single-layer GNN structure in the above methods did not couple the interaction model and traffic information of agents, which affects the RL collaborative game decision-making in complex urban traffic scenes.

In summary, for the existing MVP approaches, sparse reward and the lack of comprehensive traffic cognition severely limit collaborative pursuing efficiency. To address these problems, this paper proposes distributed generative multi-adversarial RL for MVP (DGMARL-MVP) in urban traffic scenes, as shown in Figure 1. Firstly, a generative multi-adversarial network (GMAN) is designed to guide RL strategy optimization by generating dense rewards, replacing the approximation of Bellman updates. The generative multi-adversarial RL can be applied to a wide range of multi-agent systems with sparse rewards to improve task-related performance. Moreover, a proposed GNN-based intersecting cognition promotes deep coupling of traffic information and multi-agent interaction features. The contributions of this paper are summarized as follows.


Figure 1. Architecture of DGMARL-MVP. Urban traffic environments for MVP (A) provide complex pursuit-evasion scenes and interactive environments for RL. Every pursuing vehicle targets the nearest evading vehicle and launches a collaborative pursuit. GNN-based intersecting cognition (B) couples the traffic information and multi-agent interaction features to assist GMAN boosting reinforcement learning (C) in decision-making. The GMAN (D) guides RL strategy optimization by generating dense rewards, replacing the approximation of Bellman updates. MVP: Multi-vehicle pursuit; GNNs: graph neural networks; GMAN: generative multi-adversarial network.

● This paper proposes DGMARL-MVP in urban traffic scenes. In DGMARL-MVP, a GMAN is designed to improve the Bellman equation by generating the potential dense reward, thereby properly guiding strategy optimization of distributed multi-agent RL (MARL).

● GNN-based intersecting cognition is proposed to promote deep coupling of traffic information and multi-agent interaction features to assist in improving the pursuing efficiency.

● This paper applies DGMARL-MVP to the simulated urban roads with 16 junctions and sets different pursuing difficulty levels with variable numbers of pursuing vehicles and evading vehicles. In the three tested difficulty levels, DGMARL-MVP reduces the pursuit time by 5.47% on average compared to proximal policy optimization (PPO) and improves the average pursuing success rate to 85.67%. Codes are open-sourced at https://github.com/BUPT-ANTlab/DGMARL-MVP.

The rest of this paper is organized as follows. Section 2 describes MVP in an urban traffic scene and models the MVP problem based on the MDP. Section 3 presents generative multi-adversarial RL (GMARL) and its training process. Section 4 presents distributed GMARL with GNN-based intersecting cognition for MVP. Section 5 gives the performance of the proposed method. Section 6 draws conclusions.

2. MULTI-VEHICLE PURSUIT IN DYNAMIC URBAN TRAFFIC

This section first introduces the details of the complex urban traffic environment for the MVP problem. Then, the modeling process of the MVP problem is stated as an MDP, and the basic Q-learning algorithm focusing on the update process is introduced.

2.1. Complex urban traffic environment for MVP

This paper focuses on the problem of MVP under complex urban traffic and constructs a multi-intersection traffic scene. Each road is set to bidirectional two lanes with fixed-phase traffic lights at each intersection. In this scene, there are $$ M $$ pursuing vehicles, $$ N $$ evading vehicles ($$ M > N $$), $$ B $$ background vehicles, and $$ L $$ lanes. Specifically, the background vehicles follow randomly selected routes, and the evading vehicles randomly select routes from the preset routes. The pursuing vehicles share their positions and target information via roadside units. Each pursuing vehicle integrates multi-source heterogeneous data to make decisions in a distributed manner. If the evading vehicles are not all captured within $$ st $$ time steps, the MVP task fails. When all evading vehicles are captured or the time steps reach $$ st $$, the pursuit episode ends.

Furthermore, the following constraints are set in the MVP environment: (1) All vehicles obey the traffic rules for collision-free driving; (2) The maximum speed $$ v_{\max} $$, maximum acceleration $$ ac_{\max} $$, and maximum deceleration $$ de_{\max} $$ are identical for all pursuing vehicles and evading vehicles; (3) Pursuing vehicles and evading vehicles are randomly initialized at the edges of the traffic map.

2.2. MDP-based MVP problem formulation

In this paper, the decision-making of each pursuing vehicle only depends on the current state, so the decision process can be modeled as the MDP defined by a tuple $$ \{ S, A, P, R \} $$. $$ s_{t} \in S $$, $$ a_{t} \in A $$ denote the state and action at time step $$ t $$. $$ P $$ is the state transition probability from the current state $$ s_{t} $$ to the next state $$ s_{t+1} $$ by executing the action $$ a_{t} $$. $$ r_{t} \in R: a_{t} \times s_{t} \rightarrow \mathbb{R} $$ is a real valued reward.

RL provides an excellent solution to MDP games. As an advanced RL algorithm for problems with discrete action spaces, Q-learning enables decision-making without the exact state transition probability and initial state. For a Q-learning-based agent, the expectation values $$ q $$ of all actions in state $$ s_{t} $$ are evaluated by the $$ Q^\pi(s_t, a) $$ function, and the strategy $$ \pi $$ chooses the action with the greatest expectation value to execute. Through continuous interaction between the agent and the environment, the $$ Q^\pi(s, a) $$ function is updated, eventually converging to the optimal strategy. According to the Bellman equation, the optimal state-action value function can then be derived as

$$ \begin{equation} Q^{\pi^*}(s_t, a_t) = \mathbb{E}_{s_{t+1} \sim P(\cdot|s_t, a_t)}\left[ r_t + \gamma \max\limits_{a_{t+1}} Q^\pi(s_{t+1}, a_{t+1}) \right]. \end{equation} $$

And the updating process of Q-learning can be expressed as

$$ \begin{equation} \left\{ \begin{array}{l} Q_{target}(s_t, a_t) = r_t + \gamma \max\limits_{a_{t+1}} Q^\pi(s_{t+1}, a_{t+1}), \\ Q^\pi(s_t, a_t) \leftarrow Q^\pi(s_t, a_t) + \alpha \left[ Q_{target}(s_t, a_t) - Q^\pi(s_t, a_t) \right], \end{array} \right. \end{equation} $$

where $$ \alpha $$ is the learning rate and $$ \gamma $$ is the discount factor, indicating the impact of future earnings on the current expectation value. For pursuing vehicle $$ m $$, the function $$ Q^\pi(s_t, a) $$ calculates the expectation values of turning left, turning right, and going straight according to the current state $$ s_t $$ to assist the vehicle in selecting the optimal route to pursue the evader.
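To make the update in Eq. (2) concrete, the following minimal sketch (a tabular Q function with hypothetical state encodings, not the authors' implementation) shows how one interaction step of a pursuing vehicle would update its action values.

```python
def q_learning_update(Q, s_t, a_t, r_t, s_next, alpha=1e-4, gamma=0.9):
    """One tabular Q-learning update following Eq. (2).

    Q       : dict mapping (state, action) -> expected return
    s_t,a_t : current state and the executed action (left / right / straight)
    r_t     : reward observed after executing a_t
    s_next  : resulting state
    """
    actions = ["left", "right", "straight"]
    # Bellman target: immediate reward plus discounted best future value.
    q_target = r_t + gamma * max(Q.get((s_next, a), 0.0) for a in actions)
    # Move the current estimate towards the target by the learning rate alpha.
    q_sa = Q.get((s_t, a_t), 0.0)
    Q[(s_t, a_t)] = q_sa + alpha * (q_target - q_sa)
    return Q
```

In DGMARL-MVP this tabular update is replaced by a deep Q network (Section 3.2), but the target construction is the same.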

3. GENERATIVE MULTI-ADVERSARIAL REINFORCEMENT LEARNING

In order to effectively solve the reward sparsity problem of MVP, a GMAN is introduced to improve the Bellman equation by generating suitable potential dense rewards during RL optimization. Section 3.1 describes the principles and details of how a GMAN generates estimated rewards. Section 3.2 presents GMAN boosting RL and its training process.

3.1. Generative multi-adversarial network for dense reward

As a special game task, the reward of MVP is extremely sparse. Only when the pursuing vehicle captures an evading vehicle can the RL-based agent obtain a reward. The sparse reward blurs the optimization direction of RL, thus seriously hindering the strategy update. In this paper, a conditional generative network $$ G $$ is designed to estimate the potential future rewards and guide the RL training. Therefore, this paper adopts a GMAN to train the generative network $$ G $$ and generate appropriate dense rewards.

Suppose the state of the agent $$ n $$ at time step $$ t $$ is $$ s_t^n $$, the action is $$ a_t $$, and the cumulative rewards obtained by $$ n $$ from $$ t $$ until the end of the episode are

$$ \begin{equation} R_t = \sum\limits_{T = t + 1} \gamma^{T-t} r_T. \end{equation} $$

The optimization objective of the generative network is to learn the distribution of the cumulative rewards $$ p_R $$ and make its output $$ G(z, [s_t, a_t]) $$ fit $$ R_t $$, where $$ z \sim p_z $$ is a simple fixed distribution that is easy to draw samples from.

In GMAN, the generator network $$ G $$ plays a minimax game with $$ I $$ discriminators to update parameters. For discriminator $$ D_i $$, the optimization objective is to distinguish data generated by $$ G $$ from the original data,

$$ \begin{equation} \arg \max\limits_{D_i} V'_i(D_i, G) = \mathbb{E}_{R \sim p_R} [\log (D_i(R))] + \mathbb{E}_{z \sim p_z} [\log (1 - D_i(G(z, [s, a])))]. \end{equation} $$

In practice, training against a far superior discriminator can impede the learning of the generator. To solve this problem and increase the stability of the generator, a classical Pythagorean mean is chosen as the fusion function $$ F_{soft} $$ to soften $$ V'_i(\mathop D\nolimits_i , G) $$, which is parameterized by $$ \lambda $$ where $$ \lambda = 0 $$ corresponds to the mean and the max is recovered as $$ \lambda \to \infty $$.

$$ \begin{equation} V = F_{soft}(V') = - \exp \left( \sum\limits_{i}^{I} \frac{e^{\lambda V'_i}}{\sum_j e^{\lambda V'_j}} \log (-V'_i) \right). \end{equation} $$
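Before moving to the gradient updates, a minimal NumPy sketch of this soft fusion may help; the per-discriminator objectives $$ V'_i $$ are assumed to be available as a vector of (negative) values, and the function name is illustrative.

```python
import numpy as np

def soft_fusion(v_prime, lam=0.5):
    """Softened fusion F_soft of per-discriminator objectives, Eq. (5).

    v_prime : array of objectives V'_i (each negative, being sums of log-probabilities)
    lam     : softness parameter; lam = 0 weights all discriminators equally,
              while lam -> inf recovers the max (i.e., the strongest critic).
    """
    v_prime = np.asarray(v_prime, dtype=float)
    weights = np.exp(lam * v_prime)
    weights = weights / weights.sum()            # softmax over lambda * V'_i
    return -np.exp(np.sum(weights * np.log(-v_prime)))
```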

Then, the discriminator $$ D_i $$ and generator $$ G $$ are updated by descending their stochastic gradient,

$$ \begin{equation} \nabla_{\theta_{D_i}} \frac{1}{K}\sum\limits_{k}^{K} \left[ \log (D_i(R)) + \log (1 - D_i(G(z, [s, a]))) \right], \end{equation} $$

$$ \begin{equation} \nabla_{\theta_G} \frac{1}{K}\sum\limits_{k}^{K} \log \left( 1 - F_{soft}\left( D_i(G(z, [s, a])) \right) \right). \end{equation} $$
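The following PyTorch-style sketch illustrates one such adversarial update on a minibatch. It is a simplified reading of Eqs. (6) and (7), not the authors' code: module names, tensor shapes, the sigmoid on the discriminator outputs, and the softmax-weighted combination used in place of the exact $$ F_{soft} $$ term are all assumptions.

```python
import torch

def gman_step(G, discriminators, opt_G, opt_Ds, R, s, a, z, lam=0.5):
    """One adversarial update: I discriminators (Eq. 6), then the generator (Eq. 7)."""
    cond = torch.cat([s, a], dim=-1)               # condition [s, a]
    fake_R = G(z, cond)                            # generated cumulative reward

    # Discriminator updates: separate real R_t from generated R_t.
    # A sigmoid maps the raw discriminator output to a probability.
    for D, opt_D in zip(discriminators, opt_Ds):
        d_loss = -(torch.log(torch.sigmoid(D(R))).mean()
                   + torch.log(1.0 - torch.sigmoid(D(fake_R.detach()))).mean())
        opt_D.zero_grad()
        d_loss.backward()
        opt_D.step()

    # Generator update: fool a softened (softmax-weighted) combination of critics.
    scores = torch.stack([torch.sigmoid(D(fake_R)).mean() for D in discriminators])
    weights = torch.softmax(lam * scores, dim=0)
    g_loss = torch.log(1.0 - (weights * scores).sum() + 1e-8)
    opt_G.zero_grad()
    g_loss.backward()
    opt_G.step()
    return d_loss.item(), g_loss.item()
```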

Therefore, by training with historical experience replay, a GMAN is able to generate potential future rewards $$ G(z, [s_t^n, a_t^n]) $$ according to the current state and action of the agent. $$ G(z, [s_t^n, a_t^n]) $$ is used to boost the training process of RL and make its policy forward-looking. Meanwhile, the generated dense reward also effectively promotes RL convergence.

3.2. GMAN boosting reinforcement learning

The sparsity of rewards is a great challenge for the optimization of RL. In MVP, the RL-based agent explores many steps to obtain only one positive or negative reward, which leads to a vague direction of gradient descent for the agent. Therefore, this paper proposes a novel GMAN boosting RL. GMAN boosting RL generates reasonable dense rewards by virtue of the powerful generative capability of the generative network. The generated reward also encodes the potential future benefit of the RL decision, improving the learning efficiency and the decision foresight of RL.

In GMAN boosting RL, the Bellman equation is modified using the generated reward: the approximation of the future reward is replaced by $$ G(z, [s_t, a_t]) $$. The equation for the target Q after modifying the Bellman equation is

$$ \begin{equation} Q_{target}(s_t, a_t) = r_t + G(z, [s_t, a_t]). \end{equation} $$

In this paper, the deep neural network $$ Q $$ is employed to fit the values of actions. Therefore, the loss of the $$ Q $$ network is calculated as

$$ \begin{equation} loss = \frac{1}{K}\sum\limits_{k} \left( Q(s_t, a_t) - Q_{target} \right)^2. \end{equation} $$
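A minimal PyTorch-style sketch of Eqs. (8) and (9) on a sampled minibatch is given below; the tensor shapes and the one-hot action encoding fed to the generator are assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def gman_boosted_loss(Q_net, G, s, a_idx, r, z, n_actions=3):
    """Q-network loss with the Bellman future term replaced by G(z, [s, a]).

    s     : (K, state_dim) minibatch of states
    a_idx : (K,) indices of executed actions (left / right / straight)
    r     : (K,) immediate rewards
    z     : (K, z_dim) noise samples for the generator
    """
    a_onehot = F.one_hot(a_idx, n_actions).float()            # action encoding for G
    with torch.no_grad():
        # Eq. (8): generated potential future reward replaces gamma * max Q.
        q_target = r + G(z, torch.cat([s, a_onehot], dim=-1)).squeeze(-1)
    q_sa = Q_net(s).gather(1, a_idx.unsqueeze(1)).squeeze(1)  # Q(s_t, a_t)
    return ((q_sa - q_target) ** 2).mean()                    # Eq. (9)
```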

Distributed on-policy training is adopted in the proposed GMAN boosting RL. The overall training process is shown in Algorithm 1. For every RL-based agent, the experience collected by the current policy is stored in the replay buffer $$ {\cal G} $$. At the start of training, experience $$ (s_t, a_t, r_t, s_{t+1}, R_t) $$ is sampled from $$ {\cal G} $$. With the assistance of the generative network, the parameters of RL are updated via Eq. (9). Finally, the discriminators and the generator are trained in turn to help the generator learn the distribution of cumulative rewards under different state-action pairs. Through multiple cycles of training, GMAN boosting RL can obtain a forward-looking and optimal strategy to handle complex MVP problems with sparse reward.

  Algorithm 1: Training Process of GMAN Boosting Reinforcement Learning
  Input: RL-based agent Q, generator G, I discriminators, and the experience replay buffer $${\cal G}$$ collected by Q
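Since Algorithm 1 is summarized above only in prose, the following sketch illustrates one possible realization of the loop using the two sketches given earlier (gman_boosted_loss and gman_step); the replay-buffer interface and hyper-parameter defaults are hypothetical.

```python
import torch
import torch.nn.functional as F

def train_gman_boosted_rl(Q_net, G, discriminators, buffer, opt_Q, opt_G, opt_Ds,
                          epochs=2600, batch_size=32, z_dim=8, n_actions=3):
    """Sketch of the GMAN boosting RL training loop (Algorithm 1)."""
    for _ in range(epochs):
        # Sample experience (s_t, a_t, r_t, s_{t+1}, R_t) collected by the current policy.
        s, a_idx, r, s_next, R = buffer.sample(batch_size)
        z = torch.randn(batch_size, z_dim)

        # 1) Update the RL agent with the GMAN-boosted target, Eq. (9).
        loss = gman_boosted_loss(Q_net, G, s, a_idx, r, z)
        opt_Q.zero_grad()
        loss.backward()
        opt_Q.step()

        # 2) Train discriminators and generator in turn so that G learns the
        #    distribution of cumulative rewards R_t under state-action pairs.
        a_onehot = F.one_hot(a_idx, n_actions).float()
        gman_step(G, discriminators, opt_G, opt_Ds, R.unsqueeze(-1), s, a_onehot, z)
```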

4. DISTRIBUTED GMARL WITH GNN-BASED COGNITION FOR MVP

To enhance comprehensive cognition of complex urban traffic in MVP, a novel double-layer intersecting GNN is proposed to couple the traffic information and multi-agent interaction features. Section 4.1 introduces the details of GNN-based intersecting cognition. DGMARL-MVP is described in Section 4.2. Finally, the decision-making and training flowchart of the proposed DGMARL-MVP is presented in Section 4.3.

4.1. GNN-based intersecting cognition

In this paper, a double-layer intersecting graph network is used with a road graph to perceive the traffic condition and a vehicle graph to extract efficient information for pursuing vehicles, as shown in Figure 2. And the main idea of intersecting lies in using the perceived traffic information to construct the vehicle graph. It enables a deep coupling of road information with vehicle information.


Figure 2. Architecture of GNN-based Intersecting Cognition. GNNs: graph neural networks.

Each lane is modeled as a node of the first road graph, and the topological relationship of the roads is regarded as the edges of the graph. More formally, the constructed road graph is described as $$ G^1=\{N^1, A^1\} $$, where $$ N^1=\{n_{i}^1, i \in\{1, 2, \ldots, l\}\} $$ is the set of node attributes and $$ A^1=\{e_{ij}^1, i, j \in\{1, 2, \ldots, l\} \} $$ is the set of edge attributes. Specifically, $$ n_{i}^1=[{nb}_i , {np}_i, {ne}_i] $$ is the node feature consisting of the numbers of the three types of vehicles (background, pursuing, and evading vehicles) in lane $$ i $$. $$ l $$ denotes the number of nodes in the constructed graph, which is equal to the total number of lanes. $$ e_{ij}^1 $$ denotes the edge value between lane $$ i $$ and lane $$ j $$: $$ e_{ij}^1 = 1 $$ when lane $$ i $$ and lane $$ j $$ are connected, while $$ e_{ij}^1 = 0 $$ when lane $$ i $$ and lane $$ j $$ are not adjacent.
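The road graph construction described above can be sketched as follows; the lane occupancy counts and the lane-connectivity list are assumed to be available from the traffic simulator, and the helper names are illustrative.

```python
import numpy as np

def build_road_graph(lane_counts, lane_connections, num_lanes):
    """Construct the road graph G^1 = {N^1, A^1}.

    lane_counts      : dict lane_id -> (nb_i, np_i, ne_i), i.e., counts of
                       background, pursuing, and evading vehicles in the lane
    lane_connections : iterable of (i, j) pairs of connected lanes
    """
    # Node features: [nb_i, np_i, ne_i] per lane.
    N1 = np.zeros((num_lanes, 3))
    for i, counts in lane_counts.items():
        N1[i] = counts
    # Adjacency: e_ij = 1 iff lane i and lane j are connected.
    A1 = np.zeros((num_lanes, num_lanes))
    for i, j in lane_connections:
        A1[i, j] = A1[j, i] = 1
    return N1, A1
```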

The first road graph network consists of a fully connected layer (FC) $$ \Phi_1^{FC} $$ and a graph convolutional network (GCN) $$ \Phi_1^{G} $$. The node features are first fed into the FC to assist the GCN in understanding their semantic information, represented as $$ N_{FC}^1 = \Phi_1^{FC}(N^1) $$. Then, the GCN merges the global traffic node information and produces high-level semantic information. The process can be formulated as follows

$$ \begin{equation} G\_out_1 = \Phi_1^G\left( N_{FC}^1, A^1 \right) = D_1^{-\frac{1}{2}} A^1 D_1^{-\frac{1}{2}} N_{FC}^1 W + b, \end{equation} $$

where $$ D_1 $$ is the degree matrix with $$ D_{ii}^1= \sum_{j} A_{ij}^1 $$, and $$ W $$ is the trainable weight matrix. $$ {G\_out}_1 $$ is set to the shape of ($$ l\times l $$), representing the relationship between roads based on traffic density.
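A minimal NumPy sketch of this graph-convolution step is given below; $$ W $$ and $$ b $$ are trainable parameters (here simply supplied arrays), and choosing $$ W $$ with output dimension $$ l $$ yields the ($$ l\times l $$) relation map described above.

```python
import numpy as np

def road_gcn_forward(N_fc, A1, W, b):
    """One graph convolution over the road graph, Eq. (10)."""
    deg = A1.sum(axis=1)                                  # degree of each lane node
    d_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(deg, 1e-8)))
    A_norm = d_inv_sqrt @ A1 @ d_inv_sqrt                 # symmetric normalization
    return A_norm @ N_fc @ W + b                          # G_out1; shape (l, l) if W is (d, l)
```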

The vehicle graph $$ G^2=\{N^2, A^2\} $$ takes the ego pursuing vehicle and all evading vehicles as nodes $$ N^2=\{n_{i}^2, i \in\{1, 2, \ldots, M+1\}\} $$, where $$ M $$ is the number of evading vehicles and $$ n_{i}^2 $$ is the position embedding of the pursuing vehicle or an evading vehicle. Each pursuing vehicle thus possesses an independent sub-graph. The adjacency matrix $$ A^2=\{e_{ij}^2, i, j \in\{1, 2, \ldots, M+1\} \} $$ is deliberately designed and central to this paper. Given that $$ G^1 $$ contains the relationship of each lane based on traffic density, a threshold $$ \varepsilon $$ is set to identify traffic-sensing connections between the ego vehicle and each evading vehicle:

$$ \begin{equation} e_{ij}^2 = \left\{ {\begin{array}{*{20}{l}} 1 & G\_out_1\left( I_i, I_j \right) > \varepsilon, \\ 0 & \text{otherwise}, \end{array}} \right. \end{equation} $$

where $$ I_i $$ denotes the index of the lane where the ego pursuing vehicle is located, and $$ I_j $$ denotes that of evading vehicle $$ j $$. The calculation process of the vehicle graph is the same as that of the road graph network. The output of vehicle graph $$ G^2 $$ is $$ {G\_out}_2 = \Phi_2^G\left(N_{FC}^2, A^2\right) $$, where $$ N_{FC}^2=\Phi_2^{FC}(N^2) $$ is the output of FC $$ \Phi_2^{FC} $$.
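The traffic-sensing adjacency of Eq. (11) can be sketched as follows; the lane indices of the ego pursuer and the evaders are assumed to be known from their positions, and node 0 of the sub-graph is taken to be the ego vehicle.

```python
import numpy as np

def vehicle_graph_adjacency(G_out1, ego_lane, evader_lanes, eps=0.0):
    """Build A^2 for one pursuing vehicle's sub-graph from the road relation map G_out1.

    An edge between the ego pursuer (node 0) and evader j (node j) is kept only
    if the lane-to-lane relation value exceeds the threshold eps.
    """
    n = len(evader_lanes) + 1
    A2 = np.zeros((n, n))
    for j, lane_j in enumerate(evader_lanes, start=1):
        if G_out1[ego_lane, lane_j] > eps:
            A2[0, j] = A2[j, 0] = 1
    return A2
```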

4.2. Distributed GMARL for MVP

In this paper, a deep neural network is adopted to fit $$ Q^\pi(s_t, a_t) $$. For pursuing vehicle $$ n $$, a GMARL-based agent is used to plan the optimal pursuit route. The agent contains an online network $$ Q^\pi_n(s_t^n, a_t^n | \theta_n) $$ with parameters $$ \theta_{n} $$ to estimate the $$ Q $$ value of the action $$ a_t^n $$ in the current state $$ s_t^n $$. The architecture of DGMARL-MVP is shown in Figure 3.


Figure 3. Architecture of Distributed GMARL for MVP. GMARL: Generative multi-adversarial reinforcement learning; MVP: multi-vehicle pursuit.

Each pursuing vehicle makes decisions distributedly based on its own observations and shared information. For pursuing vehicle $$ n $$, its state $$ s_t^n $$ consists of three parts: its own position $$ {loc}^n_t $$, the position of the closest evading vehicle $$ {loc}^m_t $$, and the road-vehicle hybrid features $$ {G\_out}_2 $$. In this paper, the location of vehicle $$ n $$ is denoted by $$ {loc}^n_t = \left\{ {Code_l}, pos^{n, l}_t \right\} $$, and the length of $$ {loc}^n_t $$ is denoted by $$ len_{loc} $$. $$ {Code_l} $$ denotes the binary code of lane $$ l $$ where vehicle $$ n $$ is located, and $$ pos^{n, l}_{t} $$ denotes the distance between vehicle $$ n $$ and the start of lane $$ l $$ at time $$ t $$.

Due to the constraints of the traffic scene, the action space of the pursuing vehicles contains three elements, i.e., turning left, turning right, and going straight at the next intersection. The expectation values of turning left $$ {q}^{n, lef}_{t} $$, turning right $$ {q}^{n, rig}_{t} $$, and going straight $$ {q}^{n, str}_{t} $$ are calculated according to the pursuing vehicle $$ n $$'s current state $$ s^{n}_{t} $$. The agent explores by randomly choosing an action with probability $$ \epsilon $$ or exploits with probability $$ 1-\epsilon $$ by selecting the action with the largest Q value. In particular, if a vehicle encounters a T-junction that only allows going straight and turning right but chooses to turn left, the agent randomly selects an action to execute from the reasonable action space.
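This ε-greedy selection with the feasibility fallback can be sketched as follows; the feasible-action lookup for the upcoming junction is an assumed helper provided by the environment.

```python
import random

def select_action(q_values, feasible_actions, epsilon=0.05):
    """Epsilon-greedy choice over {left, right, straight} with a feasibility fallback.

    q_values         : dict action -> expectation value from the Q network
    feasible_actions : actions allowed at the upcoming junction (e.g., a T-junction
                       may exclude 'left')
    """
    if random.random() < epsilon:
        action = random.choice(list(q_values))            # explore
    else:
        action = max(q_values, key=q_values.get)          # exploit
    if action not in feasible_actions:
        # Infeasible choice (e.g., 'left' at a T-junction): fall back to a
        # random action from the reasonable action space.
        action = random.choice(list(feasible_actions))
    return action
```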

To motivate pursuing vehicles to capture evaders and to incentivize efficient training, an elaborately designed reward $$ r_t^n $$ consists of two parts.

1. Only when pursuing vehicle $$ n $$ successfully captures an evading vehicle does it obtain a positive reward $$ R_V $$.

2. A distance-sensitive reward is set to improve the pursuing efficiency. When a pursuing vehicle reduces the distance from the closest evading vehicle compared to that at the last time step, it will obtain a positive reward, and conversely, it will be punished with a negative reward.

Therefore, the formulation of $$ r_t^n $$ is expressed as

$$ \begin{equation} r_t^n = \left\{ {\begin{array}{*{20}{l}} R_V & \text{if successful pursuit}, \\ \sigma \left( d_t^{n, m} - d_{t - 1}^{n, m} \right) & \text{else}, \end{array}} \right. \end{equation} $$

where $$ \sigma $$ is a negative reward factor, and $$ d^{n, m}_{t} $$ denotes the distance of the pursuing vehicle $$ n $$ from the closest evading vehicle $$ m $$ at the time step $$ t $$.
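A minimal sketch of this reward is given below; since $$ \sigma $$ is described as a negative factor, the magnitude-5 value listed in Table 2 is used here with a negative sign, and $$ R_V $$ is the capture bonus of 500, both assumptions about how the constants are applied.

```python
def pursuit_reward(captured, d_t, d_prev, R_V=500.0, sigma=-5.0):
    """Reward of pursuing vehicle n at time step t, Eq. (12).

    captured : True if n captured an evading vehicle at this step
    d_t      : current distance to the closest evading vehicle m
    d_prev   : that distance at the previous time step
    sigma    : negative reward factor, so closing the gap gives a positive
               reward and falling behind is penalized
    """
    if captured:
        return R_V
    return sigma * (d_t - d_prev)
```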

Each agent is updated by gradient descent in distributed training. Specifically, $$ \theta_{n} $$ is iteratively updated through stochastic gradient descent (SGD) using random samples from the experience replay buffer. Suppose the sampled data is denoted as $$ \left\{s^{n}_{\tau}, a^{n}_{\tau}, r^{n}_{\tau}, s^{n}_{\tau+1}\right\} $$. The gradient of the online network $$ Q^\pi_n (s, a | \theta_n) $$ is derived as

$$ \begin{equation} \nabla_{\theta_n} J = \frac{1}{K}\sum\limits_{\tau} \nabla_{\theta_n} \left( r^n_\tau + G(z, [s_\tau^n, a_\tau^n]) - Q_n^\pi \left( s^n_\tau , a^n_\tau | \theta_n \right) \right)^2. \end{equation} $$

4.3. Decision-making and training process of DGMARL-MVP

This part presents the overall decision-making and online training process of DGMARL-MVP, as shown in Algorithm 2. At the beginning of each episode, the urban pursuit-evasion environment and the local state of all agents are initialized. Then, the road information and the position information of vehicles are fed into the intersecting graph network, which outputs $$ {G\_out}_2 $$, extracting integrated features of the traffic situation and agent relationships from multi-source heterogeneous data. Equipped with $$ {G\_out}_2 $$, the $$ n $$ agents form their own local states using their individual position information $$ loc_t^n $$ and the position information of the evading vehicles $$ loc_t^m $$. Each pursuing vehicle $$ n $$ then selects an action $$ a_t^n $$ according to the current policy $$ Q_n^\pi $$ with the local state. The local state is updated and the reward is fed back after each agent performs $$ a_t^n $$. Each transition $$ (s_t^{n}, a_t^{n}, r_t^{n}, s_{t + 1}^{n}) $$ is then stored in the separate replay buffer $$ {\cal G}_n $$. The potential future rewards $$ R_t^n $$ are generated by the conditional generative network, and $$ R_t^n $$ is appended to the experience replay buffer to estimate and guide the RL training. The distributed networks are then trained according to Algorithm 1.

  Algorithm 2: DGMARL-MVP Decision-making and Online Training Algorithm

In the decision-making process of DGMARL-MVP, collaboration among agents is achieved through information sharing. During the pursuit process, every pursuing vehicle shares its own position and observation information with the other pursuing vehicles for collaboration. The shared information is used to develop GNN-based intersecting cognition and to gain effective and comprehensive awareness of the agent relationships and traffic situations.

5. EXPERIMENTS AND RESULTS

5.1. Simulator and parameter settings

As a MARL algorithm, DGMARL-MVP collects training data and updates parameters by interacting with the simulated urban traffic environment. This paper constructs a complex urban traffic environment based on SUMO[29] to verify the effect of DGMARL-MVP. The environment, with a $$ 3 \times 3 $$ grid-pattern urban road structure, simulates continuous dynamic random traffic flow. In the simulated closed scene, there are four intersections and eight T-junctions. During the simulation process, the number of background vehicles is fixed, and the background vehicles follow randomly selected routes. Moreover, to evaluate the robustness of DGMARL-MVP, this paper designs three difficulty levels of MVP tasks with variable numbers of pursuing vehicles and evading vehicles: four pursuing vehicles chasing two evading vehicles (P4-E2), five pursuing vehicles chasing three evading vehicles (P5-E3), and seven pursuing vehicles chasing four evading vehicles (P7-E4). All evading and pursuing vehicles randomly select their initialization locations, as shown in Figure 4. The GPU used to train our model is an NVIDIA Tesla T4. Notably, all algorithms in our experiments are trained in the same environment and evaluated by averaging various metrics over 100 test epochs, including the average reward, the average time steps, and the success rate. The simulation parameters are shown in Table 1, and the parameter settings of DGMARL-MVP are shown in Table 2. The internal structures of the Deep Q Network (DQN) and the discriminators are shown in Table 3 and Table 4, respectively.


Figure 4. Traffic Simulation Environment for MVP. MVP: Multi-vehicle pursuit.

Table 1

Simulation settings

| Parameters | Values |
| --- | --- |
| Maximum time steps $$ st $$ | 800 |
| Maximum speed $$ v_{\max} $$ | 20 $$ m/s $$ |
| Maximum acceleration $$ ac_{\max} $$ | 0.5 $$ m/s^2 $$ |
| Maximum deceleration $$ de_{\max} $$ | 4.5 $$ m/s^2 $$ |
| Number of lanes $$ L $$ | 48 |
| Length of location code $$ len_{loc} $$ | 7 |
| Number of junctions | 16 |
| Length of each lane | 500 $$ m $$ |
| Number of background vehicles | 200 |
Table 2

Parameter settings

| Parameters | Values | Parameters | Values |
| --- | --- | --- | --- |
| $$ \alpha $$ | $$ 10^{-4} $$ | $$ R_V $$ | 500 |
| $$ \gamma $$ | 0.9 | $$ \varepsilon $$ | 0 |
| $$ \lambda $$ | 0.5 | $$ \sigma $$ | 5 |
| $$ \epsilon $$ | 0.05 | $$ max\_epoch $$ | 2600 |
Table 3

Structure of the deep Q network

| Layers | Deep Q network |
| --- | --- |
| Input | (batch size, $$ (M+1)^2+2\times{len}_{loc} $$) |
| Dense Layer 1 | ($$ (M+1)^2+2\times{len}_{loc} $$, 32) |
| Activation Function | $$ Elu $$ |
| Dense Layer 2 | (32, 48) |
| Activation Function | $$ Elu $$ |
| Dense Layer 3 | (48, 32) |
| Activation Function | $$ Elu $$ |
| Dense Layer 4 | (32, 16) |
| Activation Function | $$ Elu $$ |
| Dense Layer 5 | (16, 3) |
| Activation Function | $$ SoftMax $$ |
| Output | (batch size, 3) |
Table 4

Structure of the discriminator

| Layers | Discriminator |
| --- | --- |
| Input | (batch size, 1) |
| Dense Layer 1 | (1, 128) |
| Activation Function | $$ LeakyReLU (negative\_slope=0.02) $$ |
| Dense Layer 2 | (128, 64) |
| Activation Function | $$ LeakyReLU (negative\_slope=0.02) $$ |
| Dense Layer 3 | (64, 1) |
| Output | (batch size, 1) |

5.2. Ablation experiments

Ablation experiments are conducted to further demonstrate the effectiveness of the proposed method and to examine the impact of GMAN boosting RL and GNN-based intersecting cognition in DGMARL-MVP. Specifically, method a is DQN equipped only with GNN-based intersecting cognition, and method b is GMAN boosting RL without GNN-based intersecting cognition. The results are shown in Table 5.

Table 5

Evaluation results

| Method | P4-E2 (Avg. reward / Avg. time steps / Success rate) | P5-E3 (Avg. reward / Avg. time steps / Success rate) | P7-E4 (Avg. reward / Avg. time steps / Success rate) |
| --- | --- | --- | --- |
| DGMARL-MVP | 8.688 / 644.96 / 0.91 | 8.827 / 698.20 / 0.85 | 9.213 / 731.64 / 0.81 |
| a | 7.407 / 695.33 / 0.86 | 8.592 / 728.53 / 0.81 | 8.791 / 751.32 / 0.78 |
| b | 8.094 / 684.01 / 0.87 | 8.692 / 714.36 / 0.82 | 8.959 / 739.55 / 0.80 |
| DQN | 6.953 / 736.09 / 0.81 | 7.195 / 763.31 / 0.72 | 8.122 / 749.76 / 0.76 |
| PPO | 7.513 / 717.41 / 0.86 | 8.368 / 731.88 / 0.80 | 8.844 / 745.51 / 0.78 |
| QMIX | 6.339 / 745.27 / 0.77 | 8.645 / 749.09 / 0.74 | 8.725 / 755.48 / 0.73 |

a: DQN equipped with GNN-based intersecting cognition; b: GMARL without GNN-based intersecting cognition.

Compared with a, the average reward of DGMARL-MVP is increased by 17.29%, 2.74%, and 4.80% in the P4-E2, P5-E3, and P7-E4 scenes, respectively. DGMARL-MVP also exhibits fewer average time steps than a in the same scene. Specifically, compared with a, the average time steps of DGMARL-MVP are reduced by 7.24% in the P4-E2 scene at most, by 2.62% in the P7-E4 scene at least, and by 4.68% in all scenes on average. Also, DGMARL-MVP has the highest success rate, which is 5.81%, 4.94%, and 3.85% higher than that of a in the P4-E2, P5-E3, and P7-E4 scenes, respectively. These results reveal that the proposed GMAN boosting RL algorithm can effectively alleviate the problem of sparse reward caused by MVP under urban environments and successfully indicate the optimization direction of the RL policy through the suitable potential dense reward generated by the GMAN, thus enhancing the optimality of the agent policy.

In addition, Table 5 shows that the proposed DGMARL-MVP achieves a higher average reward than b, exactly 7.38%, 1.55%, and 2.84% higher in the P4-E2, P5-E3, and P7-E4 scenes, respectively. Also, the average time steps of DGMARL-MVP are 5.71%, 2.26%, and 1.07% less than those of b in the P4-E2, P5-E3, and P7-E4 scenes, respectively. As for the success rate, DGMARL-MVP also shows a clear superiority over b. Concretely, compared with b, the success rate of DGMARL-MVP increases by 4.60% in the P4-E2 scene at most, by 1.25% in the P7-E4 scene at least, and by 3.17% in all scenes on average. This demonstrates that the proposed GNN-based intersecting cognition can effectively assist pursuing vehicles in dealing with multi-source heterogeneous data from complex dynamic environments and enables agents to adaptively extract the situation information of other vehicles and the interaction features among agents so as to improve the pursuing efficiency.

Furthermore, Figure 5 depicts the bar chart comparison of the three metrics, average reward, time steps, and success rate, for DGMARL-MVP, a, and b in P4-E2, P5-E3, and P7-E4 scenes, offering a more intuitive illustration of the effectiveness of the proposed modules. It is evident that DGMARL-MVP has the best performance for all metrics in any scene. The ablation experiments confirm that the proposed GMAN boosting RL algorithm can generate appropriate potential dense rewards, which makes RL more forward-looking in policy updating and correctly guides the optimization direction of RL policy, thereby improving the stability of distributed multi-agent system and enhancing the optimality of agent decision-making. Meanwhile, the proposed GNN-based intersecting cognition can adequately couple the interaction features of agents with traffic information and enhance their ability to handle multi-source heterogeneous data so as to promote the adaptability of the agents to the dynamic environment and improve the pursuing efficiency.


Figure 5. Results of Ablation Experiments. (A): Average Reward of Ablation Experiments. (B): Average Time Steps of Ablation Experiments. (C): Success Rate of Ablation Experiments.

5.3. Comparison with other methods

This part demonstrates the performance of applying DGMARL-MVP and other algorithms to three scenes of MVP problems. This paper uses DQN, QMIX[30], and PPO for comparison. The details are shown in Table 5.

In the MVP problem of P4-E2, the three metrics show consistency in performance evaluation. It is clear that DGMARL-MVP is noticeably the strongest performer on all of the metrics, which indicates the superiority of the proposed DGMARL-MVP. The success rate is an appreciable 91%, which is 12.35% higher than DQN, 18.18% higher than QMIX, and 5.81% higher than PPO. Among the three metrics, the proposed DGMARL-MVP shows the most significant advantage in the average reward, 15.64% higher than the sub-optimal algorithm PPO. This indicates that the proposed DGMARL-MVP provides better guidance for agents in the pursuing process; in other words, it makes better local decisions, which also lead to better final results. For the other comparison algorithms, QMIX performs the worst of all the algorithms on all metrics in this scene, and PPO shows a sub-optimal performance.

Upgrading the difficulty to P5-E3, the proposed DGMARL-MVP algorithm still shows superior performance over the other comparison algorithms. This superiority is specifically manifested in that our algorithm is 2.1%, 4.60%, and 6.25% better than the sub-optimal algorithm on average reward, average time steps, and success rate, respectively. The largest performance gap can be seen on the average reward metric compared with DQN, an exceedance of 22.68%. These data show that DGMARL-MVP's advantage is most pronounced on the success rate, which indicates that DGMARL-MVP has high stability in a relatively difficult scene, resulting in an improvement in success rate. In this scene, the other comparison methods show instability on the three metrics to some extent. DQN performs worst on all metrics. QMIX lags PPO by 2.35% in terms of average time steps, although it is suboptimal (second only to DGMARL-MVP) in terms of average reward.

The difficulty setting of P7-E4 is approximately the same as that of P5-E3, but the increase of vehicles in both the pursuing team and the evading team increases the difficulty of global scheduling. Nevertheless, as shown in Table 5, the proposed DGMARL-MVP still shows the best performance among the compared algorithms. In terms of the average reward, DGMARL-MVP is 4.17% better than the sub-optimal algorithm PPO and 13.43% better than the worst algorithm DQN. In terms of the average time steps, the difference between DGMARL-MVP and the other algorithms is not as large, but compared with the sub-optimal algorithm, there is still a 1.86% improvement of 14.12 time steps, which illustrates that DGMARL-MVP can steadily exploit its decision-making advantages in more difficult global scheduling scenes. In addition, DGMARL-MVP performs better by 7.05% on the success rate than the other algorithms on average. PPO shows sub-optimal performance on all three metrics.

Comparing the performance of all algorithms across these three scenes, the proposed DGMARL-MVP is the most stable algorithm and also performs the best. Despite this stability, the performance of DGMARL-MVP differs across the three scenes. As the difficulty of the scene increases, for example, from P4-E2 to P5-E3, the success rate of the pursuit decreases by 7.06%, and the average time steps increase by 8.25%. This illustrates that the negative impact of increasing pursuing difficulty on the success rate is indisputable, but the proposed DGMARL-MVP is more stable than the other comparison algorithms. It is worth mentioning that DQN, which is the basis of DGMARL-MVP, is among the weakest performers in the three scenes. From this perspective, the proposed DGMARL-MVP makes considerable improvements.

In order to show the performance variation and comparison of all algorithms more clearly, a bar chart is used to show the changing trend of the three metrics in the three scenes, as shown in Figure 6. Intuitively, as the number of evading vehicles increases, the average reward increases at the same time. In Figure 6A, the average reward of the proposed DGMARL-MVP is the highest but shows only a slight increase, while QMIX presents the largest increase. A possible reason is that DGMARL-MVP provides better decisions during the pursuit, resulting in a high reward accumulation. As the difficulty increases in Figure 6B, DGMARL-MVP has a larger increase than the other algorithms on the metric of average time steps, which illustrates that DGMARL-MVP has a larger advantage in a simple scene than in difficult scenes. In terms of the success rate in Figure 6C, DGMARL-MVP, like the other algorithms except DQN (whose success rate rises from P5-E3 to P7-E4), shows a downward trend, although it remains the highest. The negative impact of increasing pursuing difficulty on success rates is indisputable, but the performance of DGMARL-MVP remains more stable, which shows the generalization ability of DGMARL-MVP.


Figure 6. Results of Comparison Experiments. (A): Average Reward of Comparison Experiments. (B): Average Time Steps of Comparison Experiments. (C): Success Rate of Comparison Experiments.

5.4. Convergence comparison during training

In order to more convincingly prove the advantages of the proposed method, Figure 7 describes the convergence curves of the average reward with training steps for various methods, including DGMARL-MVP, a, PPO, and QMIX, in the P4-E2, P5-E3, and P7-E4 scenes. In Figure 7A, for the P4-E2 scene, it can be seen that, compared with a, whose fluctuation has a slight advantage over the other methods in the last stage of training, DGMARL-MVP has better performance in both convergence rate and convergence target. For the P5-E3 scene, as shown in Figure 7B, although all the methods show similar convergence stability at the last stage of training, our method has a better convergence trend, growing throughout training towards a higher convergence target. Figure 7C depicts the convergence curve of the average reward with training steps in the P7-E4 scene, showing that DGMARL-MVP has superior performance over the other methods in both convergence rate and convergence trend. In conclusion, Figure 7 illustrates that, compared with PPO and QMIX, which are respectively the best and the worst of all comparison methods, DGMARL-MVP achieves a competitive convergence rate and trend, demonstrating its superiority and effectiveness on MVP under urban environments.


Figure 7. Convergence Process During Training. (A): Convergence Process of Average Reward for Methods in P4-E2. (B): Convergence Process of Average Reward for Methods in P5-E3. (C): Convergence Process of Average Reward for Methods in P7-E4.

A horizontal comparison of the convergence of the proposed DGMARL-MVP in the three scenes shows that its convergence performance is basically the same in all of them, with convergence starting at about 1950 training steps. In all three scenes with different difficulties, the convergence point of DGMARL-MVP is basically the same as that of the other algorithms. It is worth mentioning that the proposed DGMARL-MVP improves surprisingly quickly at the beginning of the training process, which indicates that our algorithm can better guide the direction of training at the initial stage. Accordingly, it is not surprising that the reward of DGMARL-MVP remains the highest from the beginning to the end of training in the P4-E2 and P7-E4 scenes. An exception occurs in the P5-E3 scene, where the average reward is overtaken by DQN in the later stages of training, but the final result is still the best. In addition, the proposed DGMARL-MVP exhibits much smaller fluctuations in this scene, which shows the stability of the proposed algorithm.

6. CONCLUSIONS

This paper has proposed DGMARL-MVP to address the sparse rewards and insufficient perception of complex traffic situations brought by MVP under urban traffic environments. In DGMARL-MVP, a GMAN has been designed to generate potential dense rewards and provide proper guidance for distributed RL optimization. Equipped with the GMAN, DGMARL-MVP effectively solves the problem of optimization direction ambiguity caused by reward sparsity via the enhanced Bellman equation. In addition, this paper has proposed a GNN-based intersecting cognition, where the construction of vehicle graphs encourages a deep coupling between traffic information and multi-agent information. It thoroughly extracts and utilizes the multi-source heterogeneous data of urban traffic and the complicated multi-agent interaction features, thus considerably improving pursuing efficiency. Extensive experimental results have demonstrated that DGMARL-MVP can significantly improve the average pursuing success rate to 85.67%. In the future, the impact of additional factors, such as pedestrians and communication delays, on the design and analysis of MVP methods will be investigated. More realistic scenes, such as evading vehicles not following traffic rules, will also be considered to design smarter MVP methods.

DECLARATIONS

Acknowledgments

The authors would like to thank the editor-in-chief, the associate editor, and the anonymous reviewers for their valuable comments.

Authors' contributions

Made contributions to the research, idea generation, conception, and design of the work and wrote and edited the original draft: Zhang L, Li X, Yang Y, Wang Q

Made contributions to the algorithm design and simulation and developed the majority of the associated code for the simulation environment and the proposed method: Li X, Yuan Z, Yang Y

Participated in part of the experimental data analysis and visualizations and performed data collation and related tasks: Li L, Xu C, Wang Q

Performed critical review and revision and provided administrative, technical, and material support: Zhang L, Li L, Xu C

Availability of data and materials

The codes of this paper are open-sourced and available at https://github.com/BUPT-ANTlab/DGMARL-MVP.

Financial support and sponsorship

This work is supported by the National Natural Science Foundation of China (Grants No. 61971096 and No. 62176024), the National Key R & D Program of China (2022ZD01161, 2022YFB2503202), Beijing Municipal Science & Technology Commission (Grant No. Z181100001018035) and Engineering Research Center of Information Networks, Ministry of Education.

Conflicts of interest

All authors declared that there are no conflicts of interest.

Ethical approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Copyright

© The Author(s) 2023.

REFERENCES

1. Chen C, Zou W, Xiang Z. Event-triggered consensus of multiple uncertain euler-lagrange systems with limited communication range. IEEE Trans Syst Man Cybern, Syst 2023;53:5945-54.

2. Boin C, Lei L, Yang SX. AVDDPG-Federated reinforcement learning applied to autonomous platoon control. Intell Robot 2022;2:145-67.

3. Zhu Z, Pivaro N, Gupta S, Gupta A, Canova M. Safe model-based off-policy reinforcement learning for eco-Driving in connected and automated hybrid electric vehicles. IEEE Trans Intell Veh 2022;7:387-98.

4. Cao Z, Xu S, Jiao X, Peng H, Yang D. Trustworthy safety improvement for autonomous driving using reinforcement learning. Trans Res Part C-Emer Technol 2022;138:103656.

5. Patrol guide. section: Tactical operations. procedure no: 221-15; 2016. Available from: https://www1.nyc.gov/assets/ccrb/downloads/pdf/investigations_pdf/pg221-15-vehicle-pursuits.pdf.

6. Qi Q, Zhang X, Guo X. A Deep Reinforcement Learning Approach for the Pursuit Evasion Game in the Presence of Obstacles. In: 2020 IEEE International Conference on Real-time Computing and Robotics (RCAR). IEEE; 2020. pp. 68–73.

7. Xu B, Wang Y, Wang Z, Jia H, Lu Z. Hierarchically and cooperatively learning traffic signal control. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 35; 2021. pp. 669–77.

8. Li S, Yan Z, Wu C. Learning to delegate for large-scale vehicle routing. Adv Neural Inf Process Syst 2021;34. Available from: https://proceedings.neurips.cc/paper_files/paper/2021/file/dc9fa5f217a1e57b8a6adeb065560b38-Paper.pdf.

9. Garcia E, Casbeer DW, Von Moll A, Pachter M. Multiple pursuer multiple evader differential games. IEEE Trans Automat Contr 2020;66:2345-50.

10. Xu Y, Yang H, Jiang B, Polycarpou MM. Multiplayer pursuit-evasion differential games with malicious pursuers. IEEE Trans Automat Contr 2022;67:4939-46.

11. Lopez VG, Lewis FL, Wan Y, Sanchez EN, Fan L. Solutions for multiagent pursuit-evasion games on communication graphs: finite-time capture and asymptotic behaviors. IEEE Trans Automat Contr 2020;65:1911-23.

12. Pan T, Yuan Y. A region-based relay pursuit scheme for a pursuit-evasion game with a single evader and multiple pursuers. IEEE Trans Syst Man Cybern, Syst 2023;53:1958-69.

13. Jia S, Wang X, Shen L. A continuous-time markov decision process-based method with application in a pursuit-evasion example. IEEE Trans Syst Man Cybern, Syst 2016;46:1215-25.

14. De Souza C, Newbury R, Cosgun A, et al. Decentralized multi-agent pursuit using deep reinforcement learning. IEEE Robot Autom Lett 2021;6:4552-59.

15. Zhang R, Zong Q, Zhang X, Dou L, Tian B. Game of drones: multi-uav pursuit-evasion game with online motion planning by deep reinforcement learning. IEEE Trans Neural Netw Learn Syst 2022; doi: 10.1109/TNNLS.2022.3146976.

16. Yang Y, Li X, Yuan Z, Wang Q, Xu C, et al. Graded-Q reinforcement learning with information-enhanced state encoder for hierarchical collaborative multi-vehicle pursuit. In: 2022 18th International Conference on Mobility, Sensing and Networking (MSN); 2022. pp. 534–41.

17. Zheng Z, Duan H. UAV maneuver decision-making via deep reinforcement learning for short-range air combat. Intell Robot 2023;3:76-94.

18. Durugkar I, Gemp I, Mahadevan S. Generative multi-adversarial networks. In: International Conference on Learning Representations (ICLR); 2017. Available from: https://openreview.net/forum?id=Byk-VI9eg.

19. Wang Z, Zhu H, He M, et al. Gan and multi-agent drl based decentralized traffic light signal control. IEEE Trans Veh Technol 2021;71:1333-48.

20. Zhan H, Tao F, Cao Y. Human-guided robot behavior learning: a gan-assisted preference-based reinforcement learning approach. IEEE Robot Autom Lett 2021;6:3545-52.

21. Li L, Yao J, Wenliang L, et al. Grin: Generative relation and intention network for multi-agent trajectory prediction. Adv Neural Inf Process Syst 2021;34:27107-18.

22. Xia Y, Zhou J, Shi Z, Lu C, Huang H. Generative adversarial regularized mutual information policy gradient framework for automatic diagnosis. In: Proceedings of the AAAI conference on artificial intelligence. vol. 34; 2020. pp. 1062–69.

23. Zheng C, Yang S, Parra-Ullauri JM, Garcia-Dominguez A, Bencomo N. Reward-reinforced generative adversarial networks for multi-agent systems. IEEE Trans Emerg Top Comput Intell 2021;6:479-88.

24. Liu Y, Wang W, Hu Y, et al. Multi-agent game abstraction via graph attention neural network. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 34; 2020. pp. 7211–18.

25. Du W, Ding S, Zhang C, Shi Z. Multiagent Reinforcement Learning With Heterogeneous Graph Attention Network. IEEE Trans Neural Netw Learn Syst 2022;PP:1-10.

26. Liu Q, Li Z, Li X, Wu J, Yuan S. Graph convolution-based deep reinforcement learning for multi-agent decision-making in interactive traffic scenarios. In: 2022 IEEE 25th International Conference on Intelligent Transportation Systems (ITSC). IEEE; 2022. pp. 4074–81.

27. Xiaoqiang M, Fan Y, Xueyuan L, et al. Graph Convolution Reinforcement Learning for Decision-Making in Highway Overtaking Scenario. In: 2022 IEEE 17th Conference on Industrial Electronics and Applications (ICIEA). IEEE; 2022. pp. 417–22.

28. Chen Y, Shu T, Zhou X, et al. Graph attention network with spatial-temporal clustering for traffic flow forecasting in intelligent transportation system. IEEE Trans Intell Transport Syst 2022; doi: 10.1109/TITS.2022.3208952.

29. Lopez PA, Behrisch M, Bieker-Walz L, et al. Microscopic Traffic Simulation using SUMO. In: The 21st IEEE International Conference on Intelligent Transportation Systems. IEEE; 2018.

30. Rashid T, Samvelyan M, De Witt CS, et al. Monotonic value function factorisation for deep multi-agent reinforcement learning. J Mach Learn Res 2020;21:7234-84.

Cite This Article

Li X, Yang Y, Wang Q, Yuan Z, Xu C, Li L, Zhang L. A distributed multi-vehicle pursuit scheme: generative multi-adversarial reinforcement learning. Intell Robot 2023;3(3):436-52. http://dx.doi.org/10.20517/ir.2023.25

About This Article

© The Author(s) 2023. Open Access This article is licensed under a Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, sharing, adaptation, distribution and reproduction in any medium or format, for any purpose, even commercially, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
