Blockchain and MEC-Assisted Reliable Billing Data Transmission over Electric Vehicular Network: An Actor–Critic RL Approach

2021-08-21 09:37
China Communications, 2021, Issue 8

Xinyu Ye¹, Meng Li¹,²,*, Pengbo Si¹,², Ruizhe Yang¹,², Enchang Sun¹, Yanhua Zhang¹,²

1 Faculty of Information Technology, Beijing University of Technology, Beijing 100124, China

2 Beijing Laboratory of Advanced Information Networks, Beijing 100124, China

Abstract: Electric vehicles (EVs) have recently been widely adopted in response to calls for green travel and environmental protection, and the demand for charging is growing accordingly. To ensure the authenticity and privacy of charging information exchange, blockchain technology has been proposed and applied in charging station billing systems. However, blockchain itself has some drawbacks, including the low computing efficiency of the nodes and the high energy consumption of the consensus process. To handle these issues, in this paper we combine blockchain and mobile edge computing (MEC) to develop a reliable billing data transmission scheme that improves the computing capacity of the nodes and reduces the energy consumption of the consensus process. By jointly optimizing the offloading decisions of the primary and replica nodes, the block size, and the block interval, the transaction throughput of the blockchain system is maximized while the latency and energy consumption of the system are minimized. Moreover, we formulate the joint optimization problem as a Markov decision process (MDP). To cope with the dynamic and continuous nature of the system state, reinforcement learning (RL) is introduced to solve the MDP. Finally, simulation results demonstrate the performance improvement of the proposed scheme in comparison with other existing schemes.

Keywords: electric vehicles; billing data interaction; blockchain; mobile edge computing; reinforcement learning

I. INTRODUCTION

In recent years, as environmental pollution, fossil energy depletion, and related issues have become increasingly prominent, green travel has become one of the energy conservation and emission reduction methods advocated all over the world [1]. As a green means of transportation, electric vehicles (EVs) have clear advantages over conventional fuel vehicles, notably lower air pollution, and thus provide strong support for the green and intelligent development of the transportation industry [2]. At present, a large number of EV charging infrastructures are being built to facilitate timely charging and electricity trading for EVs [3, 4].

In the transmission of charging billing data, it is crucial to ensure the accuracy and security of the charging information. However, when EVs and charging stations use different smart meters to measure electricity demand, the readings may differ, resulting in invalid or false billing. In addition, the charging information is easily leaked or manipulated, and attackers can track users' billing data to obtain information such as vehicle data and location. Therefore, the security and privacy of charging transaction information cannot be ignored.

To address privacy and security issues in charging transactions, blockchain is considered a promising technology: it is decentralized, trust-free, and tamper-proof, and it promotes the interconnection of information [5, 6]. It can be widely used in electricity market transactions and other energy trading scenarios. For example, in [7], an energy auction system based on smart contracts is designed so that users can trade electricity safely and freely. In [8], a consortium blockchain is used to establish multi-agent nodes that manage the selection of users' charging modes and the associated transactions. In addition, the authors in [9] propose an optimal charging scheme for electric taxis by aggregating the information of each charging station operation node. Although much work has been done on blockchain applications, the consumption of computing resources in the blockchain consensus process still needs to be addressed.

Fortunately, edge computing can be used to alleviate the insufficient computing resources of blockchain nodes. By providing computation offloading services for nodes, it can significantly improve computing efficiency and reduce energy consumption. Mobile edge computing (MEC) is one of the most promising edge computing paradigms and has been used in many research works [10–12] covering resource allocation, offloading strategy, and collaborative cooperation. Moreover, because blockchain and MEC share a decentralized nature and have interdependent functions, their combination is natural [13].

Nevertheless, the joint application of blockchain and MEC in charging station billing systems still faces great challenges. One problem is how to balance the latency and energy consumption of the system given the different offloading decisions of the primary and replica nodes. Another is how to improve transaction throughput while preserving the security of the blockchain system. In addition, since the introduction of MEC increases the total system latency and the coverage of each charging station is limited, how to select a suitable charging station to transmit the billing results to the EVs must also be considered. Meanwhile, given the dynamic characteristics of the environment, a suitable optimization method for the system remains to be determined. These problems should be considered carefully when designing the system.

To address the above challenges, in this paper we propose a performance optimization framework for blockchain and MEC-assisted billing data transmission in electric vehicular networks, which improves both the security of charging information and the system efficiency. In addition, to deal with the dynamic and continuous characteristics of the networks, a reinforcement learning (RL) approach is adopted to learn the optimal strategy. The main contributions of this paper are as follows.

• We propose a reliable billing data transmission scheme for charging stations in electric vehicular networks based on the combination of blockchain and MEC. We jointly consider the offloading decisions of the primary and replica nodes, the block size, and the block interval to minimize the consumption cost (latency and energy consumption) and maximize the transaction throughput of the blockchain system, thereby ensuring data security and computing efficiency. Taking into account the computing task execution, the blockchain consensus process, and the dynamic characteristics of the networks, we formulate the problem as a Markov decision process (MDP) by defining the state space, action space, and reward function.

• To address the dynamic and continuous characteristics of the formulated problem, we employ an RL approach, namely the actor–critic RL algorithm. Through training, the optimal strategy can be determined.

• Simulation results show that the proposed scheme is effective and improves network performance compared with other existing schemes under different parameters.

The rest of this paper is organized as follows. In Section II, we present related works on blockchain and MEC applied to the Internet of Vehicles. Section III introduces the system model. The joint optimization problem of the primary and replica node offloading decisions, block size, and block interval is presented in Section IV. In Section V, we use an actor–critic RL algorithm to solve the problem. Section VI shows and discusses the simulation results. Finally, we conclude this paper and outline future work in Section VII.

II. RELATED WORKS

In this section, we briefly review the frameworks related to the Internet of Vehicles (IoV), and then describe related research on blockchain technology and MEC applied to the IoV. We also discuss the existing problems and challenges.

2.1 Internet of Vehicles

IoV is an intelligent vehicle interconnection network that implements dynamic information sharing and exchange between vehicles, road side units (RSUs), pedestrians, and communication networks in accordance with established communication protocols and data interaction standards [14]. Driven by the needs of users and the development of the IoV, more intelligent vehicles are being put into use. These intelligent vehicles are typically equipped with radio frequency identification technology, intelligent sensor technology, communication technology, computing units, global positioning systems (GPSs), and human-computer interaction equipment [15]. Thanks to these technologies and equipment, vehicle data can be acquired and analyzed to obtain the optimal route for each vehicle. With timely reporting of road conditions and the scheduling of signal light cycles, intelligent monitoring, dispatching, and management of people, vehicles, and roads can be realized. Therefore, as traffic problems become increasingly prominent, the IoV has a very wide range of applications [16].

Wang et al. [17] proposed using machine learning methods to achieve accurate vehicle trajectory clustering and to extract relevant information from the clustering results, such as obtaining the optimal path, monitoring traffic, and predicting the next location. In [18], the authors formulated a multi-camera system that managed vacant parking spaces by mapping the detected vehicle positions to the parking spaces in their corresponding parking lots, so as to provide parking guidance. Besides, Shakerighadi et al. [19] presented a new tri-level game theory method to maximize the financial profits of the EVs and the charging stations. By solving the optimization problem, each charging station determined its electricity price, and the EVs chose the optimal charging station according to location and price. However, these studies ignore the trust and privacy issues that may arise in the process of information sharing when the communication links between vehicles, or between vehicles and RSUs, are unstable or insecure.

2.2 Blockchain for IoV

As an emerging distributed ledger technology, blockchain can be used to solve the problems of trusted interaction and privacy protection in the IoV. The integration of blockchain technology and the IoV can therefore greatly improve the privacy, security, big data storage, trusted sharing, and efficient management of the IoV [20]. Lu et al. [21] presented a blockchain-based anonymous reputation system for vehicle networks. The vehicles rated the credibility of the received information according to the road conditions and calculated the reputation values of the message senders from these ratings to estimate the availability of the messages, so as to ensure that the vehicles could collect safe and reliable traffic information. Liu et al. [22] proposed a blockchain-enabled EV participation charging scheme and used the iceberg order execution algorithm to maximize the benefits of EV users while minimizing the level of power fluctuations in the grid system. In [23], the authors designed a blockchain-based IoV framework using a deep neural network distance evaluation algorithm to improve vehicle GPS positioning accuracy and ensure the robustness and security of the system. In addition, Xia et al. [24] considered a Bayesian game pricing method to improve a blockchain-based vehicle electricity trading scheme and obtained the optimal electricity price to maximize the utility of both parties in the electricity transaction. Qian et al. [25] formulated a blockchain-enabled content caching scheme for the Cognitive Internet of Vehicles (CIoV). The cognitive engine in the CIoV used machine learning methods to perceive the content requirements of vehicles and recommend useful content to content providers for caching, so as to improve the cache hit rate and content acquisition efficiency. Nevertheless, these works ignore the limited computing resources of the nodes used for the consensus process in blockchain systems.

2.3 Blockchain with MEC for IoV

The integration of blockchain and MEC into the IoV can achieve reliable access and control of the IoV and perform data storage and computation at the edge, thereby providing secure data processing and improving the experience of end users [13]. Vangala et al. [26] designed an authentication scheme for vehicle accident detection and notification based on blockchain and MEC to ensure that the accident-related transactions collected by vehicles could be securely transmitted to the MEC for transaction analysis, and that valuable transactions could be sent to the blockchain system for consensus. Kang et al. [27] proposed a reputation-based data sharing scheme that integrated blockchain and MEC. Vehicles selected the most reputable data provider according to accurate reputation management to realize data sharing. In [28], a security architecture for the Vehicular Ad-hoc NETwork based on blockchain and MEC was proposed to ensure the security of data transmission through blockchain technology and to handle computation-intensive tasks, including transaction consensus and image processing, through MEC. In addition, Zhou et al. [29] combined blockchain, contract theory, and MEC into a vehicle-to-grid energy trading framework to incentivize EVs to participate in secure energy trading and to reduce the computational burden of block creation through computation offloading. In [30], the authors designed a blockchain-based vehicle edge computing architecture to support artificial intelligence (AI) within the IoV, securely and effectively solve challenging vehicle problems, and manage the IoV infrastructure. Dai et al. [31] integrated deep reinforcement learning and the permissioned blockchain into vehicular edge computing networks to implement secure and intelligent vehicle content caching. However, the above works lack a comprehensive consideration of the system energy consumption, the system latency, and the transaction throughput of the blockchain system in the IoV charging station scenario.

Motivated by the above discussion, in this paper we propose a reliable charging billing data transmission scheme integrating blockchain and MEC. The scheme jointly considers the energy consumption and latency of the system, as well as the transaction throughput of the blockchain system, and improves system performance through a resource allocation strategy.

III. SYSTEM MODEL

In this section, we first present the network model, and then describe the blockchain model and the transmission model in detail.

3.1 Network Model

Figure 1 shows the architecture of the vehicular network with blockchain and MEC, which consists of a device layer and an edge layer. In the device layer, we consider a unidirectional road with charging stations along the way [32], indexed by C = {1, 2, ..., c}. Each charging station is equipped with a wireless access point (AP) with limited computing resources [33], and all APs are connected by wireless links. Let C∗ = C = {1, 2, ..., c} denote the set of APs. In addition, the energy of the APs is denoted by V(t) = {v_1(t), v_2(t), ..., v_c(t)}. At each time slot t ∈ {1, 2, ..., T} (T is the time instant at which some AP no longer has enough energy to work), each charging station transmits the collected transaction data to the blockchain system formed by the APs for consensus and recording. In the edge layer, a single macrocell base station with an MEC server is located at the center of the study area [34]. Because the APs have weak computing capacity, the MEC server is needed to complete the complicated computing tasks generated by the AP consensus process following a charging station request, thereby improving energy efficiency.

3.2 Blockchain Model

Because the nodes consume different amounts of computing resources under different offloading decisions, we select the N blockchain nodes with the most remaining computing resources as consensus nodes to participate in block generation and verification [34], which improves the energy efficiency of the system and lets it work longer. The consensus mechanism is practical Byzantine fault tolerance (PBFT), which guarantees the correctness of the system as long as there are no more than (N − 1)/3 faulty nodes [35]. We assume that generating or verifying one signature takes α CPU cycles and that generating or verifying one message authentication code (MAC) takes β CPU cycles [36]. Based on PBFT, the consensus mechanism consists of the following five steps [37], as shown in Figure 2.

Figure 1. System model.

Figure 2. Consensus process of PBFT.

1) Request: During time slot t, the charging stations submit offloading requests to the APs, and each AP node broadcasts the transactions sent by its charging station to the whole network. After the transactions are broadcast, the blockchain system randomly assigns a primary node to package the transactions into a new block (the new block is generated by the primary node within the block interval T_i(t)). Then, the primary node verifies the d(t)/δ(t) signatures and MACs of the transactions [37]. Hence, the computing cycles of the primary node can be expressed by

where d(t) is the total size of the transaction batches sent in time slot t, δ(t) is the average transaction size, and d(t)/δ(t) is the total number of transactions.

2) Pre-prepare: After the transactions are verified, the primary node generates one signature and N − 1 MACs, which are sent to each replica node together with the new block. After receiving the new block, the replica nodes first verify the signature and MAC of the block. Since some of the transactions submitted with the offloading requests may be erroneous, the primary node discards the erroneous transactions after verifying all transactions in the request phase. We assume that the remaining correct transactions make up a fraction g of all transactions [36]. In this phase, each replica node needs to verify the signatures and MACs of g · d(t)/δ(t) transactions [37]. Hence, the computing cycles of the primary node and the replica nodes are respectively expressed by

and

where g is the fraction of correct transactions among those sent by the charging stations.

3) Prepare: After the replica nodes verify the new block, each replica node generates one signature and N − 1 MACs and sends them to all other nodes. The computing cycles each replica node spends generating the signature and MACs are c_{r,p}(t) = α + (N − 1)β. Then, both the primary node and the replica nodes need to verify the 2f (where f = (N − 1)/3) signatures and MACs from the other replica nodes. Hence, the computing cycles of the primary node and the replica nodes are respectively expressed by

and

4) Commit: The verified nodes send one signature and N − 1 MACs to all other nodes if they have received more than 2f correct messages. Meanwhile, each node verifies 2f signatures and MACs from the other nodes. Hence, the computing cycles of both the primary node and the replica nodes can be represented as

5) Reply: Once 2f commit messages have been collected, the transactions have reached the consensus of the whole network, and the new block becomes valid and is attached to the blockchain. Meanwhile, each replica node generates g · d(t)/δ(t) signatures and MACs and sends them to the primary node, and the primary node needs to verify 2f signatures and MACs. Hence, the computing cycles of the primary node and the replica nodes are respectively expressed by

and

Thus, according to the above consensus steps, the total computing cycles of one consensus process can be expressed as

Then, the total computing cycles of the primary node and the replica nodes in one consensus process can be respectively expressed as

and

To execute these heavy and complex computing tasks when the local computing capacity is insufficient to support the consensus process, a node offloads the computing tasks to the MEC server, and each node chooses its computing method according to its own computing needs.
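The display equations for the five phases are not recoverable from this copy of the text, but the per-phase workload is described in words above. The following is a minimal sketch that tallies those cycle counts under stated assumptions: verifying one signature together with one MAC costs α + β cycles, generating one signature and N − 1 MACs costs α + (N − 1)β cycles, and in the pre-prepare phase a replica verifies the block's signature and MAC once in addition to the g · d(t)/δ(t) transaction verifications. The function and field names are illustrative and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class PhaseCycles:
    primary: float  # cycles spent by the primary node in this phase
    replica: float  # cycles spent by one replica node in this phase


def pbft_cycles(n_nodes, d, delta, g, alpha, beta):
    """Per-phase CPU cycles, read off the textual description (assumed forms)."""
    f = (n_nodes - 1) // 3                    # f = (N - 1) / 3, rounded down
    n_tx = d / delta                          # number of transactions d(t)/delta(t)
    verify = alpha + beta                     # verify one signature and one MAC
    generate = alpha + (n_nodes - 1) * beta   # one signature and N - 1 MACs
    return {
        "request": PhaseCycles(n_tx * verify, 0.0),
        "pre_prepare": PhaseCycles(generate, (1 + g * n_tx) * verify),
        "prepare": PhaseCycles(2 * f * verify, generate + 2 * f * verify),
        "commit": PhaseCycles(generate + 2 * f * verify, generate + 2 * f * verify),
        "reply": PhaseCycles(2 * f * verify, g * n_tx * verify),
    }


def total_cycles(phases):
    """Sum the per-phase cycles for the primary and for one replica."""
    primary = sum(p.primary for p in phases.values())
    replica = sum(p.replica for p in phases.values())
    return primary, replica
```

Summing per role rather than per phase matters later in the model, since the text derives energy from each node's own cycles while latency follows the whole consensus process.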

When the primary node selects local processing, the replica node has two options: local processing or offloading to the MEC server. If the replica node also selects local processing, both nodes go through the same consensus process, whose length is the total computing cycles of one consensus process. Therefore, the latency generated by the primary node and the replica node at time slot t can be expressed by

where F_l is the computation capacity of the APs.

Latency and energy consumption are computed differently. The latency is determined by the total computing cycles of one consensus process, whereas the energy consumption depends on the computing cycles that each node itself executes during consensus. Therefore, the primary node and the replica nodes consume different amounts of energy according to their respective computing cycles, and their energy consumption at time slot t can be expressed as

and

where k is set to 10^−27 according to actual measurement [38].
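As a rough illustration of the local-processing case, the sketch below assumes the widely used CPU model in which latency is cycles divided by clock rate and energy is k · F_l² per cycle. The value k = 10^−27 quoted above is consistent with that model, but the paper's own equations are missing from this copy, so the exact form is an assumption. Function names are illustrative.

```python
K_EFF = 1e-27  # coefficient k from the text


def local_latency(cycles, f_local):
    """Seconds needed to execute `cycles` CPU cycles at F_l cycles per second."""
    return cycles / f_local


def local_energy(cycles, f_local, k=K_EFF):
    """Energy in joules under the assumed model E = k * F_l^2 * cycles."""
    return k * f_local ** 2 * cycles
```

Following the text, the latency would be evaluated with the total cycles of one consensus process, while the energy would be evaluated with the cycles of the individual node.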

When the primary node selects local processing and the replica node selects the MEC server to execute the computing tasks, the latency generated by the primary node and the replica node at time slot t can be expressed by

and

where R(t) is the transmission rate between the APs and the MEC server at time slot t, and λ(t) is the computation capacity of the MEC server.

When the primary node selects the MEC server to execute the computing tasks, the replica node also has two options: local processing or offloading to the MEC server. If the replica node selects local processing, the latency generated by the primary node and the replica node at time slot t can be expressed by

and

When both the primary node and the replica node select the MEC server to execute the computing tasks, the latency generated by the primary node and the replica node at time slot t can be expressed by

The energy consumption generated by the primary node and the replica node at time slot t can be expressed by

and

where p_t is the transmission power of the APs to the MEC server, and p_m is the computing power of the MEC server.
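For the offloading case, a common way to combine the symbols defined above is to add an upload term and a remote-execution term. The sketch below assumes exactly that: latency = data / R(t) + cycles / λ(t), with energy obtained by weighting the two durations by p_t and p_m. This is a hedged reading of the definitions, not the paper's recovered equations, and the argument names are illustrative.

```python
def offload_latency(data_bits, rate_bps, cycles, mec_capacity):
    """Upload time to the MEC server plus execution time on the MEC server."""
    return data_bits / rate_bps + cycles / mec_capacity


def offload_energy(data_bits, rate_bps, cycles, mec_capacity, p_tx, p_mec):
    """Transmission energy (p_t * upload time) plus MEC computing energy (p_m * run time)."""
    return p_tx * data_bits / rate_bps + p_mec * cycles / mec_capacity
```

Under this reading, offloading pays off when the upload time is small relative to the gap between local and MEC execution times.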

At time slot t, each AP selects its own processing method for the computing task. Therefore, similar to [35], the total latency and total energy consumption generated by the blockchain system can be represented as

and

where t_b is the broadcast delay between nodes.

Then, the transaction throughput of the blockchain system [35] can be denoted as

where S(t) is the size of the block generated at time slot t.
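A frequently used definition of blockchain throughput is the number of transactions that fit in one block divided by the block interval. The sketch below assumes that form, using S(t), δ(t), and T_i(t) from the text; the paper's own expression is not recoverable here, so treat this as an illustration only.

```python
import math


def transaction_throughput(block_size, avg_tx_size, block_interval):
    """Transactions per second, assuming floor(S(t) / delta(t)) / T_i(t)."""
    return math.floor(block_size / avg_tx_size) / block_interval
```

Under this form, throughput rises with a larger block or a shorter interval, which is why both appear as decision variables alongside the latency constraint on block completion.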

3.3 Transmission Model

After the charging data has been uploaded, the APs send the output results to the vehicles so that the EVs can pay and check the charging information. We consider that an EV moving at high speed along the road may pass through multiple charging stations during this process [33].

We assume that the distance between every two adjacent charging stations is L, and the average speed of the vehicles is v. Then, after vehicle n is charged at charging station n, the number of charging stations that the vehicle will pass through is

Because the APs form a blockchain system, the charging stations can share information, so vehicle n will receive the results in the section of charging station m, where m = n + s + 1. Therefore, the latency and transmission energy consumption of AP m sending the results to vehicle n at time slot t are:

and

where d∗(t) is the output data size at time slot t, R_av is the transmission rate between AP m and vehicle n, and p_t∗ is the transmission power of the APs to the EVs.
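The station-selection logic can be sketched as follows. The expression for s is missing from this copy; the sketch assumes s is the number of whole station spacings L the vehicle covers at speed v while the billing result is being produced, which requires an elapsed-time argument that is an assumption of this example. The downlink latency and energy use the symbols d∗(t), R_av, and p_t∗ defined above in the usual size-over-rate form.

```python
import math


def stations_passed(speed, elapsed, spacing):
    """Assumed form of s: whole station spacings covered during `elapsed` seconds."""
    return math.floor(speed * elapsed / spacing)


def delivering_station(n, s):
    """Index m = n + s + 1 of the station whose AP delivers the result."""
    return n + s + 1


def downlink_latency(output_bits, rate_av):
    """Time for AP m to send d*(t) bits to the vehicle at rate R_av."""
    return output_bits / rate_av


def downlink_energy(output_bits, rate_av, p_tx_ev):
    """Transmission energy: AP-to-EV power multiplied by the sending time."""
    return p_tx_ev * downlink_latency(output_bits, rate_av)
```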

Therefore, the total latency and total energy consumption of the system can be calculated as

and

IV. PROBLEM FORMULATION

To reduce the latency and energy consumption of the proposed system while improving the transaction throughput of the blockchain system, we need to jointly optimize the offloading decisions of the primary and replica nodes, the block size, and the block interval. In this section, we formulate the joint optimization problem as a discrete MDP by defining the state space S, action space A, and reward function r.

4.1 Consumption Cost Minimization and Transaction Throughput Maximization Problem

For the blockchain and MEC-assisted electric vehicular network, we can formulate a consumption cost minimization and transaction throughput maximization problem as follows:

where the expectation operator E is taken over the randomness of the system parameters [39, 40] (i.e., the energy of all APs V(t), the transmission rate between the APs and the MEC server R(t), the computing resources of the MEC server λ(t), and the average transaction size δ(t)) and the possibly random actions (i.e., the primary node offloading decision a_p(t), the replica nodes offloading decision a_r(t), the block size S(t), and the block interval T_i(t)) at each time slot. In the proposed problem, ϖ in constraint C1 represents the minimum energy level of the APs. C2 denotes the time limit for block completion, where ε > 1. C3 represents the limit on the task data size.

Solving P1 is very challenging for the following reasons. First, it is usually difficult to obtain an accurate transmission rate between the APs and the MEC server, which is affected by many factors such as the time-varying channel gain and the transmission power. Second, parameters such as the energy of the APs and the computing resources of the MEC server vary over time, and it is intractable to know the statistical distributions of all combinations of the random system parameters [40]. Third, there are time-coupling constraints on the stochastic system parameters, which means that future decisions are affected by the current action. Given the time coupling and high continuity of the system, typical methods based on dynamic programming suffer from “the curse of dimensionality” [41]. In this paper, we formulate problem P1 as a discrete MDP and then adopt an actor-critic RL algorithm to solve it.

4.2 MDP Formulation

In the electric vehicular network, the transmission rate between the APs and the MEC server in the next time slot is determined only by the time-varying channel gain and transmission power of the current time slot. Furthermore, the system energy surplus, latency, and transaction throughput of the next time slot depend only on the current energy surplus and the current node offloading decisions, block size, and block interval, and are independent of previous states and actions. Therefore, the optimization problem P1 can be formulated as an MDP.

The MDP can be described by a five-tuple M = (S, A, P, r, γ) [42–44], where S is the set of all environmental states, A is the set of executable actions of the agent, P: S × A × S → [0, 1] is the state transition probability function, r: S × A → R is the reward function, and γ is a discount factor. In the following parts, we formulate the environment state space S, action space A, and reward function r of the MDP.

1) State Space: At time slot t, we define the state S(t) ∈ S as the union of the energy of all APs V(t) = {v_1(t), v_2(t), ..., v_c(t)}, the transmission rate between the APs and the MEC server R(t) = {r_1(t), r_2(t), ..., r_c(t)}, the computing resources of the MEC server λ(t), and the average transaction size δ(t), which is denoted as

Since the state space is continuous, the probability of being in a particular state is zero. We define the probability of transitioning from the state S(t) to the next state S(t+1) after taking an action A(t) ∈ A as

where f is the state transition probability density function.

2) Action Space: The action space includes the primary node offloading decision a_p(t), the replica nodes offloading decision a_r(t), the block size S(t), and the block interval T_i(t). Therefore, the action A(t) at time slot t can be defined as

where a_p(t) ∈ {0, 1} and a_r(t) ∈ {0, 1} are the offloading decisions of the primary and replica nodes. When a_p(t) = 0 or a_r(t) = 0, the corresponding node performs its tasks locally; otherwise, the tasks are offloaded to the MEC server. Besides, S(t) ∈ {1, 2, ..., S} represents the level of the block size, and T_i(t) ∈ {0.2, 0.5, ..., I} denotes the level of the block interval.
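The state and action definitions above translate directly into data structures. The sketch below flattens the state into one vector and enumerates the discrete action space as the Cartesian product of the four decision variables. The concrete level lists are passed in by the caller because the text only gives the endpoints of the block size and block interval sets; the dictionary keys are illustrative names.

```python
from itertools import product


def build_state(ap_energy, ap_rates, mec_capacity, avg_tx_size):
    """State vector [v_1..v_c, r_1..r_c, lambda(t), delta(t)]."""
    return [*ap_energy, *ap_rates, mec_capacity, avg_tx_size]


def build_action_space(block_size_levels, block_interval_levels):
    """All combinations of (a_p, a_r, block size level, block interval level)."""
    return [
        {"a_p": a_p, "a_r": a_r, "block_size": size, "block_interval": interval}
        for a_p, a_r, size, interval in product(
            (0, 1), (0, 1), block_size_levels, block_interval_levels
        )
    ]
```

With c APs the state has 2c + 2 entries, and the action space has 4 × (number of block size levels) × (number of block interval levels) elements, which stays small enough for a softmax policy over discrete actions.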

3) Reward Function:

V. PROBLEM SOLUTION

In the proposed optimization problem, the features of the scenario are so numerous and complex that decisions must be made from a global perspective. At the same time, the system is highly dynamic and has high continuity. Given the optimization target, a reinforcement learning approach allows us to obtain the best policy effectively.

Generally, RL algorithms include value-based, policy-based, and actor-critic methods [45]. Value-based methods, such as Q-learning [46] and SARSA [47], usually use temporal-difference learning to estimate the expected rewards of a policy and select the action with the highest value. However, when value-based methods face a continuous action space, the action space must first be discretized. This yields a high-dimensional action space, which makes the problem very difficult to solve. Meanwhile, value-based methods aim at deterministic policies and are not suitable for problems whose optimal policy is stochastic. Policy-based methods, such as natural gradient [48], standard gradient ascent [49], and quasi-Newton [50], can be applied to learn stochastic policies in continuous action spaces, and their convergence is relatively fast. However, they tend to converge to a local optimum, and policy evaluation is difficult, which reduces efficiency.

Therefore, in this paper, we use an actor-critic RL algorithm, which combines the policy-based and value-based approaches, to solve the joint optimization problem and maximize long-term rewards [45, 51]. The framework of the actor-critic RL algorithm is shown in Figure 3. The RL agent learns the optimal policy and its value function by interacting with the environment. The agent is composed of two parts: the actor and the critic. The actor defines a parameterized stochastic policy and generates actions, consisting of the primary and replica node offloading decisions, block size, and block interval, according to the environment state, which includes the AP energy, the transmission rate, the MEC server computing resources, and the average transaction size. After observing the reward and the next state of the environment, the critic estimates the approximate value function and its parameters, and generates a temporal difference (TD) error to evaluate the performance of the action. Then, the actor uses the critic's output to update its policy parameters. The structures of the critic and the actor are elaborated in the following subsections.

Figure 3. The framework of the actor-critic RL algorithm.

Figure 4. Total reward under different learning rates.

Figure 5. Total reward under different schemes.

5.1 The Critic Process

The critic evaluates the quality of the policy according to the value function, which is defined as the expectation of the cumulative discounted reward obtained by following the policy throughout the process, denoted as

where γ ∈ (0, 1) is a discount factor. Since V^π(s) cannot be computed exactly for problems with infinitely many states, we use a function approximator such as a deep neural network (DNN) to estimate the value function. The approximated state value function is parameterized by the vector w = (w_1, w_2, ..., w_n)^T and denoted as V_w(s) ≈ V^π(s).

Therefore, the estimated value V_w(s) is provided by the output layer of the DNN. By iteratively minimizing the loss function, the DNN can be trained to learn the optimal weights w. The loss function can be expressed as

where the term r_t + γV_w(s_{t+1}) − V_w(s_t) is the TD error, which is often used to update the estimated value function. It is calculated as the temporal difference between the value estimates of adjacent states in a state transition, which gives the following expression:

The gradient descent method is commonly used to approximate the true value function, and thus the value function parameters w can be adjusted as

whereαc >0 is the learning rate of the value function evaluation.
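As a worked illustration of the critic update, the following minimal sketch applies one TD step to a linear value approximator $V_w(s)=w\cdot\phi(s)$. The linear features `phi_s` and `phi_next` and the function name `td_update` are assumptions made for brevity (the scheme itself uses a DNN), and $\gamma=0.9$ is a hypothetical value; $\alpha_c=0.01$ matches the critic learning rate used in the simulations.

```python
def td_update(w, phi_s, phi_next, reward, gamma=0.9, alpha_c=0.01, done=False):
    """One critic step for V_w(s) = w . phi(s); returns (new_w, td_error)."""
    v_s = sum(wi * xi for wi, xi in zip(w, phi_s))
    # The value of a terminal state is taken as zero.
    v_next = 0.0 if done else sum(wi * xi for wi, xi in zip(w, phi_next))
    delta = reward + gamma * v_next - v_s          # TD error
    # For a linear approximator, the gradient of V_w(s) w.r.t. w is phi(s).
    new_w = [wi + alpha_c * delta * xi for wi, xi in zip(w, phi_s)]
    return new_w, delta


# Hypothetical two-feature example: all weights start at zero, reward is 1.
w0 = [0.0, 0.0]
w1, delta = td_update(w0, phi_s=[1.0, 0.0], phi_next=[0.0, 1.0], reward=1.0)
print(delta, w1)
```

With zero initial weights the TD error equals the reward, so only the weight of the active feature moves, by $\alpha_c\delta_t$.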

5.2 The Actor Process

The actor uses the policy gradient method to evaluate and gradually improve the policy parameters. The policy $\pi(a|s)$ is parameterized by the vector $\theta=(\theta_1,\theta_2,\ldots,\theta_n)^{T}$ and denoted as $\pi_\theta(s,a)=\Pr(a|s,\theta)$. This method is used to optimize the following objective function:

where $d(s)$ is the state distribution and $Q^{\pi}(s,a)$ is the action value function, which is expressed as

In order to significantly reduce the variance in the gradient calculation, we replace the action value function $Q^{\pi}(s,a)$ with the advantage function. The advantage function $A^{\pi}(s,a)=Q_w(s,a)-B(s)$ measures how good it is to select action $a$ in state $s$, where $Q_w(s,a)$ is the approximation of $Q^{\pi}(s,a)$ obtained by function approximation, and the best choice for the baseline function $B(s)$ is the state value function $V^{\pi}(s)$. Thus, the most suitable objective function can be represented as

We take the partial derivative with respect to the parameters $\theta$ and obtain the gradient of the objective function as

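By performing gradient ascent on the policy parameters, a local maximum of the objective function $J(\pi_\theta)$ can be found. Thus, the update of the policy parameters $\theta$ can be expressed as

$$\theta\leftarrow\theta+\alpha_a\nabla_\theta J(\pi_\theta),$$

where $\alpha_a>0$ is the learning rate for the policy update.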
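For reference, the quantities described in this subsection can be sketched in their standard actor-critic form. The block below is an assumption based on the textbook policy-gradient formulation, with the TD error used as a sample estimate of the advantage. It is consistent with the definitions above and with the updates in Algorithm 1, but the exact expressions of the scheme may differ.

```latex
\begin{align}
J(\pi_\theta) &= \mathbb{E}_{s\sim d,\;a\sim\pi_\theta}\left[Q^{\pi}(s,a)\right],\\
Q^{\pi}(s,a) &= \mathbb{E}_{\pi}\left[\sum_{k=0}^{\infty}\gamma^{k} r_{t+k}\,\middle|\,s_t=s,\;a_t=a\right],\\
A^{\pi}(s,a) &= Q_w(s,a)-V^{\pi}(s)\;\approx\;\delta_t = r_t+\gamma V_w(s_{t+1})-V_w(s_t),\\
\nabla_\theta J(\pi_\theta) &= \mathbb{E}\left[\nabla_\theta \log \pi_\theta(s,a)\,A^{\pi}(s,a)\right]
\;\approx\;\nabla_\theta \log \pi_\theta(s_t,a_t)\,\delta_t .
\end{align}
```

Under this reading, the actor step $\theta\leftarrow\theta+\alpha_a\nabla_\theta J(\pi_\theta)$ is computed from a single transition as $\theta\leftarrow\theta+\alpha_a\,\delta_t\,\nabla_\theta\log\pi_\theta(s_t,a_t)$.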

5.3 The Actor-Critic Algorithm

We combine the critic process and the actor process, and update the parameters of both in turn at every time step. The actor-critic RL algorithm is summarized as Algorithm 1.

Line 4 denotes the end of the updates in one episode, which occurs when the state reaches the preset terminal state. In lines 5−9, the actor interacts with the environment and obtains the state transition sample used to calculate the TD error. Then, the parameter updates of the critic network and the actor network are presented in lines 10−13.
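The loop described above can be sketched on a toy problem. The example below is an illustration under stated assumptions, not the simulation code of this paper: the five-state chain environment, the tabular critic (`w[s]`), the softmax actor (`theta[s][a]`), the fixed terminal state and all hyperparameter values are hypothetical, and the actor gradient is estimated from one transition as $\delta_t\nabla_\theta\log\pi_\theta(s_t,a_t)$.

```python
import math
import random

N_STATES, TERMINAL = 5, 4      # toy chain 0..4; state 4 plays the role of s_done
ACTIONS = (0, 1)               # 0 = move left, 1 = move right
GAMMA, ALPHA_C, ALPHA_A = 0.9, 0.2, 0.1   # hypothetical hyperparameters


def policy(theta, s):
    """Softmax policy pi_theta(a|s) over the action preferences theta[s]."""
    m = max(theta[s])
    exps = [math.exp(p - m) for p in theta[s]]
    total = sum(exps)
    return [e / total for e in exps]


def step(s, a):
    """Toy environment: reward 1 on reaching the terminal state, else 0."""
    s_next = min(s + 1, TERMINAL) if a == 1 else max(s - 1, 0)
    return s_next, (1.0 if s_next == TERMINAL else 0.0)


def train(episodes=300, max_steps=500, seed=0):
    rng = random.Random(seed)
    theta = [[0.0, 0.0] for _ in range(N_STATES)]   # actor parameters
    w = [0.0] * N_STATES                            # critic: V_w(s) = w[s]
    for _ in range(episodes):
        s = 0                                       # reset environment state
        for _ in range(max_steps):                  # safety cap on episode length
            if s == TERMINAL:
                break
            probs = policy(theta, s)
            a = rng.choices(ACTIONS, weights=probs)[0]   # sample a_t
            s_next, r = step(s, a)                       # observe r_t, s_{t+1}
            v_next = 0.0 if s_next == TERMINAL else w[s_next]
            delta = r + GAMMA * v_next - w[s]            # TD error
            w[s] += ALPHA_C * delta                      # critic update
            for b in ACTIONS:                            # actor update
                grad_log = (1.0 if b == a else 0.0) - probs[b]
                theta[s][b] += ALPHA_A * delta * grad_log
            s = s_next
    return theta, w


theta, w = train()
print("P(right | state next to goal) =", policy(theta, TERMINAL - 1)[1])
print("critic values:", [round(v, 3) for v in w])
```

The critic is updated before the actor within each step, mirroring lines 10−13 of Algorithm 1. Replacing the tables with neural networks changes only how $V_w$ and $\pi_\theta$ are represented, not the structure of the loop.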

VI.SIMULATION RESULTS AND DISCUSSIONS

In this section, we use computer simulation to evaluate the performance of the proposed architecture. First, the settings of the simulation parameters are presented, and then the simulation results under different parameter settings are discussed.

6.1 Simulation Parameters

In the simulation, the software environment is TensorFlow [52] 1.13.1 with Python 3.6.

For the network scenario, there are 4 charging stations, each equipped with a wireless AP. Since the selection of block producers is not considered in the simulation, it is assumed that all the APs are selected as block producers. The CPU cycle frequency of the APs is 300 MHz. The other parameters are presented in Table 1.

Algorithm 1. Actor-critic RL algorithm with function approximation method.
1: Initialization: parameter vector θ in the actor network, parameter vector w in the critic network.
2: for episode = 1 to E_max do
3:   reset environment state s_0, set random terminal state s_done, and reset reward r = 0.
4:   while s_t ≠ s_done do
5:     generate an action a_t according to π(a|s)
6:     obtain immediate reward r_t
7:     observe the next environmental state s_{t+1}
8:     give the state transition sample <s_t, a_t, r_t, s_{t+1}> and get the TD error:
9:     δ_t = r_t + γV_w(s_{t+1}) − V_w(s_t)
10:    update the parameters of the critic network:
11:    w ← w + α_c δ_t ∇_w V_w(s_t)
12:    update the parameters of the actor network:
13:    θ ← θ + α_a ∇_θ J(π_θ)
14:  end while
15: end for

Table 1. The simulation parameters.

In order to evaluate the effectiveness of the proposed architecture, we select the following five comparison schemes:

1) Proposed scheme without primary node offloading decision (Without-primary-node-offloading): the computation tasks of the primary node are executed locally.
2) Proposed scheme without replica node offloading decision (Without-replica-node-offloading): the computation tasks of the replica node are executed locally.
3) Proposed scheme with fixed block size: every generated block has the same size.
4) Proposed scheme with fixed block interval: blocks are generated at a fixed interval.
5) Existing works: the scheme without actor-critic RL optimization. In this scheme, the actions are fixed and no optimization strategy is generated. Specifically, the block size and block interval are fixed, and the primary and replica nodes always compute locally.
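One way to read the five comparison schemes is as constraints that overwrite parts of the action chosen by the agent. The sketch below illustrates this reading. The `Action` fields, the scheme labels and the fixed values are hypothetical names and numbers introduced only for illustration; they are not taken from Table 1.

```python
from dataclasses import dataclass, replace

FIXED_BLOCK_SIZE_MB = 4.0      # hypothetical fixed block size
FIXED_BLOCK_INTERVAL_S = 5.0   # hypothetical fixed block interval


@dataclass(frozen=True)
class Action:
    primary_offload: bool      # True: primary node offloads to the MEC server
    replica_offload: bool      # True: replica nodes offload to the MEC server
    block_size_mb: float
    block_interval_s: float


# For each scheme, the action components that are forced to a fixed value.
OVERRIDES = {
    "proposed": {},
    "without_primary_node_offloading": {"primary_offload": False},
    "without_replica_node_offloading": {"replica_offload": False},
    "fixed_block_size": {"block_size_mb": FIXED_BLOCK_SIZE_MB},
    "fixed_block_interval": {"block_interval_s": FIXED_BLOCK_INTERVAL_S},
    "existing": {
        "primary_offload": False,
        "replica_offload": False,
        "block_size_mb": FIXED_BLOCK_SIZE_MB,
        "block_interval_s": FIXED_BLOCK_INTERVAL_S,
    },
}


def constrain(scheme: str, action: Action) -> Action:
    """Return the action actually executed under the given scheme."""
    return replace(action, **OVERRIDES[scheme])


agent_choice = Action(True, True, 8.0, 2.0)
for name in OVERRIDES:
    print(name, constrain(name, agent_choice))
```

In this view, each baseline removes some degrees of freedom from the proposed scheme, and the existing-works baseline removes all of them.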

6.2 Performance Comparison of Convergence

We compare the convergence performance of the proposed scheme with that of the comparison schemes to evaluate the scheme intuitively.

Figure 4 shows the convergence of the proposed scheme under different values of the actor’s learning rate $\alpha_a$, where the critic’s learning rate is fixed at $\alpha_c=0.01$. As shown in the figure, the proposed scheme converges faster as the actor’s learning rate increases. However, when the learning rate is too high, the algorithm may settle at a local optimum and miss the global optimum. Meanwhile, a low learning rate leads to slow convergence. Hence, a learning rate with a moderate convergence rate should be selected, which is set to 0.0005 in this paper.

In Figure 5, we show the convergence of the different schemes. The total reward increases with the number of episodes and reaches a stable state after about 50 episodes, which verifies the convergence of the proposed scheme. In addition, the proposed scheme achieves a higher reward than the other schemes, which reflects the advantage of our proposed architecture.

6.3 Performance Comparison of Different Aspects

We separately compare the performance of the proposed scheme with that of the comparison schemes in terms of the total latency, the total energy consumption, the transaction throughput of the blockchain system and the weight of consumption.

Figure 6 shows the relationship between the total system latency and the number of charging stations under different schemes. It can be seen from the figure that the system latency increases markedly with the number of charging stations. The reason is that when the number of charging stations increases, the computing tasks of the system become heavier, which makes the processing time of the tasks longer and thus extends the latency of the system. Meanwhile, the latency of the proposed scheme is always lower than that of the comparison schemes.

Figure 6. Total latency versus the number of charging stations.

Figure 7. Total energy consumption versus the number of charging stations.

Figure 8. Throughput versus the transaction size.

In Figure 7, we show the relationship between the total energy consumption and the number of charging stations under different schemes. The energy consumption increases with the number of charging stations. Meanwhile, the proposed scheme consumes less energy than the comparison schemes. A reasonable explanation is that, in order to reduce the computing burden on the APs caused by the increase in computing tasks, the agent tends to offload the tasks, which effectively saves system energy.

Figure 8 depicts the relationship between the transaction throughput of the blockchain system and the average transaction size under different schemes. As can be seen from the figure, the transaction throughput decreases as the average transaction size increases. The reason is that when the transaction size increases, a block of a given size can contain fewer transactions. The throughput of our proposed scheme is the highest, which indicates that selecting an appropriate block size and block interval improves the system throughput and thus the system performance.
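The dependence of throughput on transaction size can be illustrated with a short calculation. The sketch below assumes the common model in which throughput equals the number of transactions that fit in one block divided by the block interval. The function name and all numerical values are hypothetical and are not taken from Table 1.

```python
def transactions_per_second(block_size_bytes, avg_tx_size_bytes, block_interval_s):
    """Throughput = floor(block size / average transaction size) / block interval."""
    tx_per_block = block_size_bytes // avg_tx_size_bytes
    return tx_per_block / block_interval_s


# Hypothetical setting: 4 MB blocks generated every 10 s.
for tx_size in (200, 400, 800):
    tps = transactions_per_second(4_000_000, tx_size, 10)
    print(f"average transaction size {tx_size} B -> {tps} transactions/s")
```

Under this model, doubling the average transaction size halves the throughput for a fixed block size and block interval, while a larger block or a shorter interval raises it.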

In Figure 9, we show the relationship between the weight of consensus consumption (the weighted value of the latency and energy consumption of the consensus process) and the task data size under different schemes. From the figure, we can observe that the weight of consensus consumption increases with the task data size and gradually reaches a stable state. The reason is that, due to the limitation of the block size, once the amount of data reaches the maximum block size, the block cannot carry more transactions. Therefore, as the task data size increases, the weight of consensus consumption eventually stabilizes. Besides, the proposed scheme always achieves a lower weight of consensus consumption than the comparison schemes.

Figure 9. Weight of consensus consumption versus the task data size.

Figure 10. Total latency versus task data size.

Figure 11. Total energy consumption versus task data size.

Figure 12. Throughput versus block size.

Figure 10 depicts the relationship between the total system latency and the task data size under different schemes. It can be seen that the total system latency of the proposed scheme and of the comparison schemes increases markedly with the task data size. The latency of the proposed scheme increases the most slowly, and it is always the lowest among all the schemes. In addition, due to the limitation of the block size in the blockchain system, the task data in a block cannot exceed the maximum block size. Therefore, when the task data size increases to a certain value, the total latency of the system tends to be stable. Similar observations can be made from Figure 11: the proposed scheme has the lowest total energy consumption among all the schemes, and its energy consumption eventually stabilizes as the task data size increases.

In Figure 12, we show the relationship between the transaction throughput of the blockchain system and the block size under different schemes. It can be seen from the figure that the blockchain-assisted electric vehicular network can handle more transactions as the block size increases, which applies to all the schemes except the scheme with fixed block size. However, the transaction throughput does not increase indefinitely: the latency of block generation and of the consensus process in the blockchain system is subject to a time limit (the C2 condition must be met), so the maximum number of transactions in one block is restricted.

Figure 13 presents the relationship between the weight of system consumption (the weighted value of the latency and energy consumption of the system) and the sum of the power (the total transmit power from AP n to the MEC server and from AP m to vehicle n) under different schemes. We can observe that the weight of system consumption increases as the sum of the power increases. The reason is that, according to Shannon’s theorem, the transmission rate grows with the transmission power, although only logarithmically rather than proportionally. As the transmission power increases, the transmission rate increases accordingly, which reduces the system latency, but the system energy consumption increases. Because the increase in energy consumption outweighs the decrease in latency, the weight of system consumption increases slowly but continuously. Moreover, the weight of system consumption of our proposed scheme is always lower than that of the comparison schemes, which demonstrates the superiority of the actor-critic RL-based solution.
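The trade-off between latency and energy can be illustrated with the Shannon capacity formula $R=B\log_2(1+Ph/N_0)$. The sketch below is an illustration only: the function name `link_metrics` and the values of the bandwidth, channel gain, noise power and data size are hypothetical and are not the simulation parameters of this paper.

```python
import math


def link_metrics(power_w, data_bits=1e6, bandwidth_hz=1e6, gain=1e-6, noise_w=1e-9):
    """Return (rate in bit/s, latency in s, energy in J) for one transmission."""
    rate = bandwidth_hz * math.log2(1.0 + power_w * gain / noise_w)  # Shannon rate
    latency = data_bits / rate            # time needed to send the data
    energy = power_w * latency            # transmit energy = power x time
    return rate, latency, energy


for p in (0.1, 0.2, 0.4):
    rate, latency, energy = link_metrics(p)
    print(f"P = {p} W: rate = {rate:.3e} bit/s, "
          f"latency = {latency:.4f} s, energy = {energy:.4f} J")
```

Because the rate grows only logarithmically with the power, doubling the power shortens the latency slightly but nearly doubles the transmit energy, which is why a weighted sum of latency and energy can rise with the power.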

Figure 13. Weight of system consumption versus the sum of the power.

The comparison figures presented above show that our proposed scheme performs better than the other existing schemes. The framework of this paper can effectively reduce the latency and energy consumption during task processing, and improve the transaction throughput of the blockchain system. The combination of reinforcement learning and blockchain enables the blockchain to find the optimal parameter settings, effectively improves the transaction processing capacity, and improves the overall energy efficiency of the system. Meanwhile, the combination of edge computing, reinforcement learning and blockchain allows tasks to be intelligently offloaded to the edge when the computing resources of the blockchain itself are insufficient, thereby reducing the overall energy consumption and the latency of the system.

In addition, in practical applications, the framework proposed in this paper can further adopt deep reinforcement learning and permissioned blockchain for content caching in vehicular networks [31]. By using deep reinforcement learning, edge computing and blockchain, cached content can be stored efficiently and securely, and the transaction throughput of the blockchain system can be maximized. At the same time, the combination of blockchain and deep reinforcement learning promotes the development of intelligence, which can drive a variety of smart scenarios, such as authorized smart 5G and beyond [53, 54] and the automatic training and operation of Internet of Things devices [55, 56].

VII.CONCLUSIONS AND FUTURE WORK

In this paper, we developed an MEC-based secure billing data transmission scheme for EVs in charging station blockchain systems, and investigated the problem of minimizing the system consumption costs while maximizing the transaction throughput of the blockchain system. To improve the performance of the system, we jointly optimized the primary and replica node offloading decisions, the block size and the block interval. Considering the dynamic characteristics of the system state, the optimization problem was modeled as an MDP, and the actor-critic RL algorithm was developed to solve the problem. Simulation results have shown that our proposed scheme converges well and outperforms the other existing schemes. In future work, we will consider combining deep reinforcement learning, edge computing and permissioned blockchain to enable content caching for vehicles or to drive intelligent applications in the proposed scheme.

ACKNOWLEDGEMENT

This work was supported in part by the National Natural Science Foundation of China under Grant 61901011, and in part by the Foundation of Beijing Municipal Commission of Education under Grants KM202110005021 and KM202010005017.