Energy efficient opportunistic routing for wireless multihop networks: a deep reinforcement learning approach

Journal of University of Chinese Academy of Sciences, 2022, No. 3

JIN Xiaohan, YAN Yan, ZHANG Baoxian

(Research Center of Ubiquitous Sensor Networks, University of Chinese Academy of Sciences, Beijing 100049, China) (Received 20 April 2020; Revised 12 May 2020)

Abstract Opportunistic routing has been an efficient approach for improving the performance of wireless multihop networks due to its ability to exploit the broadcast and lossy nature of wireless channels. In this paper, we propose a deep reinforcement learning based energy efficient opportunistic routing algorithm for wireless multihop networks, which enables a learning agent to learn an optimized routing policy that reduces transmission time while balancing energy consumption to extend network lifetime in an opportunistic way. Furthermore, the proposed algorithm can significantly alleviate the cold start problem and achieve better initial performance. Simulation results demonstrate that the proposed algorithm yields better performance than existing algorithms.

Keywords deep reinforcement learning; wireless multihop networks; opportunistic routing

Opportunistic routing (OR)[1-3] has been an efficient routing strategy for improving the performance of wireless multihop networks. It takes advantage of the broadcast nature of the wireless medium and makes routing decisions by choosing the next relay node from a candidate forwarder list in an online manner. Energy consumption is also a crucial issue in the study of wireless multihop networks: balanced energy consumption leads to a prolonged network lifetime. However, most existing opportunistic routing algorithms in the literature fail to consider the optimal trade-off between routing performance and energy consumption in dynamic wireless environments[4]. It is thus highly desirable to design an efficient opportunistic routing algorithm that can learn energy efficient routing policies intelligently as needed.

Reinforcement learning (RL)[5] is a classic machine learning method for solving sequential decision problems and has been widely used in image processing, natural language processing, and also in wireless environments for improved network performance. The basic framework of RL is shown in Fig. 1. An agent with decision-making capability constantly interacts with the environment to implement the learning process. However, traditional RL methods are not scalable to many complex real-world applications. This scalability issue can be addressed by combining deep neural networks with RL into deep reinforcement learning (DRL). With the generalization and feature extraction abilities of deep neural networks, DRL has achieved breakthroughs in many fields, e.g., AlphaGo in the field of man-machine rivalry[6], automatic navigation[7], and machine translation[8].

Fig.1 Reinforcement learning framework

Recently, RL has been a promising technique for improving the performance of wireless networks[9]. However, existing research on routing and on DRL in the field of wireless multihop networks has mostly been conducted independently, which largely limits DRL's potential to improve the performance of wireless networks. In-depth analysis is needed to exploit the potential of DRL in wireless networks. For this purpose, in this paper, we propose an opportunistic deep Q-network (ODQN) routing algorithm for wireless multihop networks, which is designed to intelligently optimize the routing and energy consumption performance by jointly designing and optimizing opportunistic routing and DRL. The major work and contributions of this paper are as follows:

1) We formulate the opportunistic routing problem in wireless multihop networks as a Markov decision process and define the corresponding state space, action space, and reward function.

2) We propose a new energy efficient opportunistic routing algorithm using DRL to solve the packet routing problem in wireless multihop networks, which can effectively balance routing performance and energy consumption. The proposed algorithm can effectively alleviate the cold start problem in the traditional DQN based algorithm and improve performance at the early stage of learning.

3) Extensive simulations are conducted and the results show that the proposed algorithm demonstrates appreciable gains as compared with existing work.

1 Related work

Traditional routing protocols developed for wireless multihop networks can be divided into the following four types: 1) routing-table-based proactive protocols like DSDV[10] and OLSR[11], 2) on-demand protocols like AODV[12] and DSR[13], 3) hybrid routing protocols like ZRP[14], and 4) opportunistic routing protocols like ExOR[1], MORE[2], and GeRaF[3].

Recently, some RL based routing algorithms have been proposed. Q-routing, proposed by Boyan and Littman[15], uses the Q-learning[16] algorithm to achieve adaptive routing. The simulation results indicated that Q-learning is effective under time-varying network load levels, traffic patterns, and topologies. However, Q-routing and its variants, e.g., predictive Q-routing[17] and dual Q-routing[18], have an obvious disadvantage in wireless networks: they need to maintain a large Q-table and do not scale to large state spaces. In Ref. [19], a reinforcement learning based opportunistic routing (RLOR) algorithm was proposed for providing live video streaming services over multihop wireless networks, which effectively achieves a proper tradeoff between transmission reliability and latency. By designing a distributed energy-efficient RL based routing algorithm for wireless mesh IoT networks, Ref. [20] achieves considerable improvements in power efficiency, failure rate, and spectrum use efficiency.

Compared with the above traditional "shallow" RL based algorithms, DRL based algorithms, which take advantage of deep neural networks[21] to improve the performance of RL[22], have been investigated in the area of wireless networks. The authors of Ref. [23] presented a distributed routing approach based on multi-agent RL which simultaneously optimizes the travel time of routed entities and the energy consumption. Valadarsky et al.[24] applied DRL to network routing and showed that learning from historical routing schemes, with consideration of the demand matrix and link utilization, provides an efficient approach to smartly choose the optimal routing structure for future data forwarding. The DRL method proposed by Stampa et al.[25] optimizes the routing performance in SDN with the aim of reducing transmission delay, and the experimental results show significant improvement in transmission delay over various traffic intensities. Although existing works employ DRL to facilitate network routing in various environments, none of them considers routing performance and network lifetime simultaneously in lossy wireless network environments.

2 Network model and problem formulation

In this section, we formulate the energy efficient wireless opportunistic routing problem as a Markov decision process (MDP).

A. System model

We model a wireless multihop network with N nodes as a graph G=(V, E), where each vertex $v \in V$ represents a node and each edge $e \in E$ represents a link between a pair of nodes. Each node can serve as a source node or a destination node. When a packet p from source node $s \in V$, being sent toward its destination $d \in V$, arrives at a node j, it is processed as follows:

· If j = d, the packet has arrived at its destination and its processing terminates successfully.

· Otherwise, the packet is forwarded to an intermediate next hop chosen according to the learnt routing policy, which will be illustrated in Section 3.

When a node has packets to send or packets waiting in its queue, it is treated as being in the working state, and each transmission consumes a certain amount of energy. Otherwise, the node consumes no energy. When the energy of a node falls below a given threshold, the node is treated as inactive and unreachable.

B. Problem formulation

We model the energy efficient opportunistic routing problem in wireless multihop networks under a centralized RL formulation, where a central agent is assumed to be responsible for learning the routing policy and choosing the action for each node according to the global state. We formulate the optimization of energy efficient opportunistic routing as an MDP defined by a tuple (S, A, P, R), where S is a finite set of states, A is a finite set of actions, P is the transition probability from state $s_t$ to state $s_{t+1}$ after the agent takes action $a_t$ at time slot t, and R is the immediate reward obtained after action $a_t$ is performed, indicating how good the current action is in the immediate sense.

Next, to formulate the above problem as an MDP, we give the details of the state space, action space, and reward function as follows.

1) State space

2) Action space

3) Reward

R: The reward $r_t(s_t, a_t)$ at time slot t is the immediate reward that the agent obtains when performing joint action $a_t$, with the state transiting from $s_t$ to $s_{t+1}$. It is defined as a function of the average packet transmission time and the residual energy as follows:

(1)

(2)
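Since the bodies of Eqs. (1)-(2) are not reproduced above, the following is only a minimal sketch of a reward with this structure, assuming a weighted combination of a (negative) transmission-time term and a normalized residual-energy term with weights w1 and w2 (the simulations in Section 4 mention w2 = 0.93); the exact form used in the paper may differ.

```python
# Hypothetical reward sketch: rewards short transmission times and relays with
# ample residual energy. The weights w1, w2 and the normalization are assumptions;
# they only stand in for the unreproduced Eqs. (1)-(2).
def reward(avg_tx_time, residual_energy, initial_energy, w1=0.07, w2=0.93):
    time_term = -avg_tx_time                          # shorter delivery time -> larger reward
    energy_term = residual_energy / initial_energy    # favor nodes with more remaining energy
    return w1 * time_term + w2 * energy_term
```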

4) Problem definition

By defining the state space, the action space, and the reward, we can formally formulate the energy efficient opportunistic routing problem as an MDP. The objective of the DRL algorithm is to find the deterministic optimal policy $\pi^*$ that maximizes the total accumulated reward. Thus, based on the formulation above, we define the problem of DRL based energy efficient opportunistic routing as maximization of the cumulative future reward:

$G_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots.$

(3)
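As a small worked example of Eq. (3), the discounted return of a finite reward trajectory can be accumulated backwards:

```python
def discounted_return(rewards, gamma=0.9):
    """Compute G_t = r_t + gamma*r_{t+1} + gamma^2*r_{t+2} + ... for a finite reward list."""
    g = 0.0
    for r in reversed(rewards):     # accumulate from the end of the trajectory
        g = r + gamma * g
    return g

# Example: three immediate rewards with gamma = 0.9
print(discounted_return([1.0, 0.5, 2.0]))   # 1.0 + 0.9*0.5 + 0.81*2.0 = 3.07
```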

Our objective is to learn an optimal policy $\pi^*$ that routes each packet correctly while maximizing the network lifetime, which has been proved to be NP-hard[29]. This makes it necessary to explore alternative solutions aimed at maximizing performance. In this paper, we explore the potential of DRL to find the optimal energy efficient opportunistic routing policy. In the next section, we propose the ODQN-Routing algorithm for solving this problem.

3 ODQN-Routing

In this section, we propose the ODQN-Routing algorithm, which can adapt to the dynamic changes of wireless medium and network topology to optimize the network routing and energy use performance by adjusting the routing policy intelligently.

A. Algorithm overview

Our proposed ODQN-Routing is based on the deep Q-network (DQN)[30], which transforms the Q-table update process into a function fitting problem in high-dimensional state-action space.

We assume a centralized ODQN agent that is responsible for interacting with the environment and learning the policy. Moreover, we assume that there exists a separate control channel that can be used to collect and distribute control information directly between the centralized agent and each node in the network, an assumption made in many existing DRL algorithms in this area. The framework of the ODQN agent is shown in Fig. 2. The agent observes the network information and obtains the current network state $s_t$, then selects a joint action $a_t$ for all nodes according to the learnt policy with the $\varepsilon$-greedy principle, which randomly selects a neighbor node as the next hop with probability $\varepsilon$ and selects the next-hop node with the largest Q value with probability $1-\varepsilon$. The agent then receives the reward $r_t(s_t, a_t)$ and enters the next state $s_{t+1}$. The experience tuple $(s_t, a_t, r_t, s_{t+1})$ is stored in the replay buffer. In the training phase, a small batch of state transition samples is randomly drawn from the replay buffer to train the ODQN agent.
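The following sketch illustrates the $\varepsilon$-greedy selection and replay-buffer storage just described. It assumes a Keras-style Q-network passed in as an argument, actions encoded as neighbor-node indices, and a hypothetical environment object `env` with a `step` method; the buffer capacity is also an assumption.

```python
import random
from collections import deque

replay_buffer = deque(maxlen=10000)   # oldest experiences are overwritten when full

def select_action(q_network, state, neighbors, epsilon):
    """Epsilon-greedy next-hop selection over the candidate neighbor set."""
    if random.random() < epsilon:
        return random.choice(neighbors)                 # explore: random neighbor
    q_values = q_network.predict(state[None, :])[0]     # Q values for all actions
    return max(neighbors, key=lambda a: q_values[a])    # exploit: neighbor with largest Q value

def interact(env, q_network, state, neighbors, epsilon):
    """One interaction step: act, observe, and store the experience tuple."""
    action = select_action(q_network, state, neighbors, epsilon)
    next_state, reward, done = env.step(action)         # hypothetical environment API
    replay_buffer.append((state, action, reward, next_state))
    return next_state, done
```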

B. ODQN agent training

The neural network architecture of the ODQN-Routing agent is shown in Fig. 3; it has one input layer, two fully connected hidden layers, and one output layer. Let $\theta$ denote the parameters of the training neural network and $\theta^-$ the parameters of the target neural network. The ODQN agent tries to minimize the loss function defined as the difference between the target value and the estimated value:

$L_{\mathrm{DQN}}(\theta) = \mathbb{E}\left[\left(y - Q(s,a;\theta)\right)^2\right],$

(4)

where y is the target value denoted as

$y = r + \gamma \max_{a'} Q(s', a'; \theta^-).$

(5)

The gradient descent method is used to update the parameters with learning rate $\alpha$:

$\nabla_\theta L_{\mathrm{DQN}}(\theta) = \mathbb{E}\left[\left(Q(s,a;\theta) - y\right)\nabla_\theta Q(s,a;\theta)\right],$

(6)

$\theta \leftarrow \theta - \alpha \nabla_\theta L_{\mathrm{DQN}}(\theta).$

(7)

The parameters of the training neural network are copied to the target neural network every C steps, thereby updating the target neural network.

Fig.2 ODQN agent framework

Fig.3 Neural network structure in ODQN-Routing
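To make the training procedure concrete, the sketch below builds the two networks of Fig. 3 with tf.keras and performs one training step implementing Eqs. (4)-(7) in TF2 eager style. The layer widths, state dimension, action count, optimizer choice, and hyperparameter values are illustrative assumptions, not the paper's exact settings.

```python
import tensorflow as tf

def build_q_network(state_dim, num_actions, hidden_units=64):
    """Input layer, two fully connected hidden layers, and one Q value per action at the output."""
    return tf.keras.Sequential([
        tf.keras.layers.Dense(hidden_units, activation="relu", input_shape=(state_dim,)),
        tf.keras.layers.Dense(hidden_units, activation="relu"),
        tf.keras.layers.Dense(num_actions),                       # linear output: Q(s, a; theta)
    ])

q_network = build_q_network(state_dim=32, num_actions=10)         # training network, parameters theta
target_network = build_q_network(state_dim=32, num_actions=10)    # target network, parameters theta^-
target_network.set_weights(q_network.get_weights())               # start with theta^- = theta

optimizer = tf.keras.optimizers.SGD(learning_rate=0.001)          # plain gradient descent, rate alpha
gamma = 0.9                                                       # discount factor
C = 100                                                           # target-network update period

def dqn_train_step(states, actions, rewards, next_states, step):
    """One gradient step on L_DQN; states/rewards are float tensors, actions are integer indices."""
    # Target value of Eq. (5): y = r + gamma * max_a' Q(s', a'; theta^-)
    y = rewards + gamma * tf.reduce_max(target_network(next_states), axis=1)
    with tf.GradientTape() as tape:
        q = q_network(states)
        q_taken = tf.reduce_sum(q * tf.one_hot(actions, q.shape[1]), axis=1)
        loss = tf.reduce_mean(tf.square(y - q_taken))             # L_DQN(theta) of Eq. (4)
    grads = tape.gradient(loss, q_network.trainable_variables)    # gradient of Eq. (6)
    optimizer.apply_gradients(zip(grads, q_network.trainable_variables))  # update of Eq. (7)
    if step % C == 0:
        target_network.set_weights(q_network.get_weights())       # copy theta into theta^-
    return loss
```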

C. Cold start alleviation

A cold start problem[31] usually arises for DRL based algorithms in real-world applications, as DRL requires a huge amount of data before reaching acceptable performance. The performance during the learning process can thus be extremely poor due to the trial-and-error nature of RL.

Therefore, to alleviate the cold start problem, we adopt the pre-training mechanism of Ref. [31] in our proposed ODQN-Routing algorithm to speed up the agent's learning at the beginning phase. We use demonstration data as the starting policy to pre-train the agent so that it performs well at the start of learning, and then let the agent interact with the environment to continue improving the routing performance using its own collected data. In this paper, the demonstration data is collected by the AODV protocol to pre-train the agent and accelerate the learning process.
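As an illustration of how such demonstration data could be assembled, the sketch below converts AODV-routed packet traces into (s, a, r, s') tuples; the `observe` and `reward_fn` helpers are hypothetical stand-ins for the agent's state encoding and the reward of Eqs. (1)-(2).

```python
# Hypothetical sketch: turn packets routed by AODV into demonstration tuples
# for pre-training. One trace is the hop sequence of one delivered packet.
demo_buffer = []

def collect_demonstrations(aodv_traces, observe, reward_fn):
    for trace in aodv_traces:
        for hop in range(len(trace) - 1):
            state = observe(trace[hop])        # network state seen at the current node
            action = trace[hop + 1]            # next hop chosen by AODV
            next_state = observe(trace[hop + 1])
            demo_buffer.append((state, action, reward_fn(state, action), next_state))
    return demo_buffer
```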

D. Proposed ODQN-Routing algorithm

The detailed ODQN-Routing algorithm is given in Algorithm 1, which can be divided into two phases: pre-training and policy learning.

Initialization: In line 1, we collect the demonstration data generated by AODV and store it in the demonstration data buffer demo, which is not modified during execution; we also initialize the replay buffer replay with a given capacity, the demonstration ratio η, and the number of pre-training steps k. We initialize the training neural network with random weights $\theta$ and the target neural network with weights $\theta^- = \theta$.

Pre-training phase: From line 2 to line 8, before starting interaction with the environment, we perform k pre-training steps by sampling mini-batch tuples from the demonstration data buffer demo. We train the agent for k steps with the loss function $L_{\mathrm{pre}}$ defined as

$L_{\mathrm{pre}} = L_{\mathrm{DQN}} + \gamma_1 L_s + \gamma_2 L_2,$

(8)

where $\gamma_1$ and $\gamma_2$ are parameters that regulate the weights of $L_s$ and $L_2$. $L_{\mathrm{pre}}$ is defined as the combination of the DQN loss $L_{\mathrm{DQN}}$ in (4), the supervised large margin classification loss $L_s$, and the $L_2$ regularization loss on the neural network's weights and biases, which prevents overfitting on the demonstration data. The supervised loss $L_s$ is defined as

$L_s = \max_{a \in A}\left[Q(s,a) + l(a_e, a)\right] - Q(s, a_e),$

(9)

$l(a_e, a) = \begin{cases} 0, & a = a_e \\ l_0, & \text{otherwise}, \end{cases}$

(10)

where $a_e$ is the action from the demonstration data and $l(a_e, a)$ is a margin function that is 0 when $a = a_e$ and takes a positive value $l_0$ otherwise.
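A TensorFlow sketch of the combined pre-training loss of Eqs. (8)-(10) is given below; the margin value and the weights $\gamma_1$, $\gamma_2$ are illustrative assumptions.

```python
import tensorflow as tf

def pretraining_loss(q_network, states, demo_actions, dqn_loss,
                     margin=0.8, gamma1=1.0, gamma2=1e-5):
    """L_pre = L_DQN + gamma1 * L_s + gamma2 * L_2 (margin and weights are illustrative)."""
    q = q_network(states)
    num_actions = q.shape[1]
    # Large-margin supervised loss: penalize non-demonstrated actions whose Q value
    # comes within `margin` of the demonstrated action's Q value.
    margins = margin * (1.0 - tf.one_hot(demo_actions, num_actions))
    q_demo = tf.reduce_sum(q * tf.one_hot(demo_actions, num_actions), axis=1)
    l_s = tf.reduce_mean(tf.reduce_max(q + margins, axis=1) - q_demo)
    # L2 regularization on the network's weights and biases
    l_2 = tf.add_n([tf.nn.l2_loss(v) for v in q_network.trainable_variables])
    return dqn_loss + gamma1 * l_s + gamma2 * l_2
```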

The loss function $L_{\mathrm{pre}}$ is then minimized using the gradient descent optimizer. The target neural network's parameters $\theta^-$ are updated with $\theta$ every C steps. The purpose of the pre-training phase is to let the agent learn a starting policy from the demonstration data so as to gain better initial performance.

Policy learning phase: Once the k pre-training steps are completed, the agent enters the policy learning phase and interacts with the environment to learn its own policy, as in lines 9-23. From line 11 to line 14, the agent collects the network information and gets the state $s_t$, chooses action $a_t$, receives reward $r_t$, and enters the next state $s_{t+1}$. The self-generated data tuple $(s_t, a_t, r_t, s_{t+1})$ is added into the replay buffer replay, where the oldest experience is overwritten when the buffer is full. In line 15, we use both demonstration data and self-generated data as samples to train the neural networks and use η to control the proportion of demonstration data, which is decreased over the training time as indicated in lines 19 to 22, where τ is a small step value used to fine-tune η, and $d_{\mathrm{demo}}$ and $d_{\mathrm{replay}}$ are the tuples sampled from the demonstration data buffer demo and the replay buffer replay, respectively. We calculate the loss function using the target neural network in line 16, using $L_{\mathrm{pre}}$ for demonstration data and $L_{\mathrm{DQN}}$ for self-generated data. Then, in lines 17 and 18, we perform a gradient descent step to update $\theta$ and $\theta^-$.
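The mixed sampling and the decay of η in lines 15 and 19-22 of Algorithm 1 can be sketched as follows; the buffer contents are assumed to follow the earlier sketches.

```python
import random

def sample_mixed_batch(demo_buffer, replay_buffer, batch_size, eta):
    """Draw a fraction eta of the minibatch from the demonstration buffer and the rest from replay."""
    n_demo = min(int(round(eta * batch_size)), len(demo_buffer))
    d_demo = random.sample(demo_buffer, n_demo)
    d_replay = random.sample(list(replay_buffer), min(batch_size - n_demo, len(replay_buffer)))
    return d_demo, d_replay

def decay_demo_ratio(eta, tau):
    """Lines 19-22 of Algorithm 1: shrink eta by the small step tau until it reaches zero."""
    return max(eta - tau, 0.0)
```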

Algorithm 1. ODQN-Routing algorithm

--------------------

1: Initialization:

Initialize the training neural network's parameters $\theta$ and the target neural network's parameters $\theta^-$, the replay buffer replay and the demonstration data buffer demo, the demonstration ratio η, and the number of pre-training steps k.

2: Pre-training:

3: for t = 1, 2, …, k do

4: Sample n tuples from the demonstration data buffer demo to pre-train the agent.

5: Calculate the loss function $L_{\mathrm{pre}}$ using the target neural network.

6: Use gradient descent to update the training neural network's parameters $\theta$.

7: Update the target neural network's parameters $\theta^-$ every C steps: $\theta^- \leftarrow \theta$.

8: end for

9: Policy learning:

10: for t = 1, 2, … do

11: The agent gets the state $s_t$.

12: Select the ε-greedy action $a_t$.

13: Get the reward $r_t$ and enter the next state $s_{t+1}$.

14: Store $(s_t, a_t, r_t, s_{t+1})$ into the replay buffer replay.

15: Sample n tuples from demo and replay, with a fraction η of the samples taken from demo, to train the agent.

16: Calculate the loss using the target neural network: $L_{\mathrm{pre}}$ for demonstration data and $L_{\mathrm{DQN}}$ for self-generated data.

17: Use gradient descent to update the parameters of the training neural network.

18: Update the target neural network's parameters every C steps: $\theta^- \leftarrow \theta$.

19: if η > 0 then

20: η ← η − τ

21: else

22: η ← 0

23: end for

--------------------

4 Simulation results

In this section, we conduct extensive simulations to evaluate the performance of ODQN-Routing algorithm by comparing it with existing work.

A. Simulation environment

We used Python 3.6 to build an event-driven simulator and TensorFlow as the deep learning framework to implement the DRL based routing. We built the wireless multihop network environment on the basis of NetworkX[32]. In the simulations, we implemented the following algorithms for comparison purposes: (a) the proposed ODQN-Routing algorithm, (b) the AODV algorithm, (c) the Q-Routing algorithm, and (d) the ODQN-Routing algorithm without pre-training, referred to as DQN-Routing.

In the simulations, we ran experiments with 10, 20, 30, 40, and 50 nodes in the network, respectively, where each node has a random number of neighbor nodes. Each simulation lasts for 2 500 s. Each reported result was averaged over 50 tests, each with a different random network topology. The bandwidth of each link was randomly set among 1, 2, 5.5, and 11 Mbit/s. Packet generation follows a Poisson distribution with a mean of 10 packets per second, the packet size is set to 1 000 bytes, and the source and destination nodes are randomly generated. The queue length of each node is set to 500 packets. We used 1 000 packets routed by AODV to pre-train the neural networks for 500 steps. The parameter settings in our simulations are shown in Table 1, where the settings related to deep Q-learning follow Refs. [31,33].

In the simulations, the initial energy of each node was set to 1 J. The energy consumed for packet transmission is 50 nJ/bit and that for processing queued packets is 40 nJ/bit. Moreover, when the residual energy of a node drops below half of the mean residual energy of the nodes in the network, the node is treated as inactive and is temporarily removed from the network topology for any routing operations.
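The energy accounting described above can be sketched as follows; the function names are ours, but the per-bit costs and the inactivity rule follow the settings just listed.

```python
TX_ENERGY_PER_BIT = 50e-9      # 50 nJ/bit for packet transmission
PROC_ENERGY_PER_BIT = 40e-9    # 40 nJ/bit for processing queued packets
INITIAL_ENERGY = 1.0           # 1 J per node

def consume(node_energy, packet_bits, queued_bits):
    """Deduct the energy spent by a working node on one transmission plus queue processing."""
    return node_energy - (TX_ENERGY_PER_BIT * packet_bits + PROC_ENERGY_PER_BIT * queued_bits)

def inactive_nodes(residual_energy):
    """Nodes whose residual energy drops below half of the network-wide mean are removed."""
    mean_energy = sum(residual_energy.values()) / len(residual_energy)
    return [n for n, e in residual_energy.items() if e < 0.5 * mean_energy]
```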

Table 1 Parameter settings

In the simulations, we compared the performance of the different algorithms in terms of the following measures.

Average packet transmission time

$T_{\mathrm{avg}} = \frac{1}{|P|}\sum_{p \in P} t_p,$

(11)

where $t_p$ is the total transmission time of packet $p \in P$ from its source node to its destination node, P is the set of packets in the network, and |P| is the total number of packets.

Network lifetime

$\mathrm{Lifetime} = \frac{N_{\mathrm{active}}}{|V|}.$

(12)

We use the proportion of active nodes remaining in the network to characterize the network lifetime, where $N_{\mathrm{active}}$ is the number of active nodes when the simulation terminates and |V| represents the network size.

Average path length

$L_{\mathrm{avg}} = \frac{1}{|P|}\sum_{p \in P} L_p,$

(13)

where $L_p$ is the path length of packet $p \in P$.

Packet loss rate

$\mathrm{PLR} = \frac{D_n}{|P|},$

(14)

where $D_n$ is the number of packets that fail to reach their destinations.
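For reference, the four metrics in Eqs. (11)-(14) can be computed directly from the simulation logs:

```python
def average_transmission_time(tx_times):          # Eq. (11): mean of t_p over delivered packets
    return sum(tx_times) / len(tx_times)

def network_lifetime(num_active, num_nodes):      # Eq. (12): fraction of nodes still active
    return num_active / num_nodes

def average_path_length(path_lengths):            # Eq. (13): mean of L_p over packets
    return sum(path_lengths) / len(path_lengths)

def packet_loss_rate(num_dropped, num_packets):   # Eq. (14): D_n / |P|
    return num_dropped / num_packets
```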

B. Simulation results

Average transmission time: Figure 4 shows the variation of the average packet transmission time of the different algorithms. At the beginning of the simulation, the average packet transmission times of Q-Routing and DQN-Routing are much larger than those of the other two algorithms. This is because their agents are still learning the policy and explore routing paths through continuous trial and error, so the reward obtained is unstable. In contrast, since ODQN-Routing pre-trains the agent using demonstration data from AODV, the routing policy is optimized much more quickly. It can also be seen that during the training process, the transmission times of ODQN-Routing and DQN-Routing both converge to a lower level than that of the traditional reinforcement learning routing algorithm Q-Routing. Overall, the transmission time of ODQN-Routing is 50.1% shorter than that of Q-Routing and 30.2% shorter than that of DQN-Routing. These results verify that ODQN-Routing can significantly improve the average transmission time performance.

Fig.4 Comparison of average transmission time

Fig.5 Comparison of average network lifetime

Network lifetime: Figure 5 compares the network lifetime performance of the different algorithms. In Fig. 5, at the beginning of the simulations, the percentages of active nodes under ODQN-Routing, Q-Routing, and DQN-Routing are quite similar, as all nodes still have high residual energy at this point. As the simulations evolve, ODQN-Routing and DQN-Routing allow the ODQN agent to fully train and learn an optimized routing policy through DRL, adaptively selecting next hops so as to avoid rapid energy depletion of nodes in the network. Specifically, ODQN-Routing prolongs the network lifetime by 57.1% compared with Q-Routing and by 10.7% compared with DQN-Routing. In the simulations, the value of $w_2$ was set to 0.93 to effectively balance the energy consumption while reducing packet transmission time.

Average path length: Figure 6(a) compares the average path length performance of the different algorithms. The results show that the average path length of ODQN-Routing is shorter than those of Q-Routing and DQN-Routing. Compared with Q-Routing, ODQN-Routing reduces the average path length by up to 40.1%, indicating that ODQN-Routing learns a more effective routing policy by considering the ETX values to destinations when making packet forwarding decisions, which effectively avoids detouring data packets onto longer paths. The average path length of ODQN-Routing is 13.8% shorter than that of DQN-Routing, indicating that using demonstration data to pre-train the agent can significantly improve the learning efficiency and enables the agent to learn an optimized routing policy faster.

Fig.6 Comparison of average path length and average packet loss rate

Packet loss rate: Figure 6(b) compares the packet loss rate performance of the different algorithms. It can be seen that ODQN-Routing outperforms the other algorithms in terms of packet loss rate. Moreover, ODQN-Routing reduces the packet loss rate by 30.1% and 24.8% compared with AODV and Q-Routing, respectively.

5 Conclusion

In this paper, we investigated how to improve energy efficiency by applying DRL based routing in wireless multihop networks. To this end, we formulated the energy efficient opportunistic routing problem as an MDP. To address the problem, we first defined the corresponding state space, action space, and reward function, and then proposed the ODQN-Routing algorithm. Extensive simulations were conducted and the results demonstrate that the proposed algorithm yields better performance than existing work. Our work demonstrates that DRL based routing has significant advantages in improving the routing performance of wireless networks. In the future, we shall consider how to combine inverse reinforcement learning with wireless routing to strengthen routing intelligence by learning more intrinsic reward functions. We also plan to design a fully decentralized multi-agent reinforcement learning based algorithm to enable efficient distributed routing in wireless multihop networks.