Nash-Q learning-based collaborative dispatch strategy for interconnected power systems

Global Energy Interconnection, 2020, Issue 3

Ran Li, Yi Han, Tao Ma, Huilan Liu

State Key Laboratory of Alternate Electrical Power System with Renewable Energy Sources, North China Electric Power University, Lianchi District, Baoding 071003, P.R. China

Abstract: The large-scale utilization and sharing of renewable energy in interconnected systems is crucial for realizing "instrumented, interconnected, and intelligent" power grids. Traditional optimal dispatch methods cannot comprehensively coordinate the economic benefits of all the stakeholders across the multiple regions of a transmission network. Hence, this study proposes a large-scale wind-power coordinated consumption strategy based on the Nash-Q method and establishes an economic dispatch model for interconnected systems that considers the uncertainty of wind power, with optimal wind-power consumption as the objective for redistributing the shared benefits between regions. Initially, based on the equivalent cost of the interests of stakeholders from different regions, the state decision models are constructed separately, and the noncooperative game Nash equilibrium model is established. The Q-learning algorithm is then introduced to handle the high-dimensional decision variables in the game model, and a dispatch solution method for interconnected systems is presented that integrates the noncooperative game Nash equilibrium and the Q-learning algorithm. Finally, the proposed method is verified on a modified IEEE 39-bus interconnection system, and it is established that this method achieves a reasonable distribution of interests between regions and promotes large-scale consumption of wind power.

Keywords: Interconnected region, Noncooperative game, Q-learning, Wind power accommodation.

1 Introduction

Wind energy has become one of the most promising new energy sources in the context of global energy interconnection. In China, because the distance between wind-power resources and load centers is considerable, large-scale wind-power grid integration leads to a shortage of backup resources in the regional power grid, rendering complete local consumption of the wind power difficult. Besides, the fluctuation and anti-peak characteristics of wind power lead to insufficient system peak-regulation capacity, resulting in serious wind-power curtailment [1]. Therefore, the maximization of wind-power consumption during grid dispatch has become a research hotspot [2–4].

With the development and expansion of the power system, "instrumentation, interconnection, and intelligence" is the development direction of the power grid, and each isolated power system will inevitably become interconnected through tie-lines [5]. Furthermore, the concept of cross-regional allocation of new energy between nations has been proposed [6–8]. With traditional centralized algorithms, it is difficult to effectively solve the problems of hierarchical and partitioned dispatch in an interconnected power grid. Hence, robust optimization [9] has been adopted to describe the wind-power output, and a decentralized and coordinated scheduling model based on the objective cascade analysis method has been proposed for the decentralized and coordinated dispatch of an interconnected power system. Moreover, the synchronous alternating direction multiplier method has been applied to generate the optimization model for each region, and the large-scale consumption of wind-power dispatch involving energy storage has been considered [10]. A cut-plane consistency algorithm has been proposed to decentralize the economic dispatch models of interconnected areas [11]; however, timely deletion of the nonfunctional cut-planes during the iteration process is necessary. To accelerate convergence, the Ward equivalence method has been applied to decompose interconnected systems, and the objective function has been processed using the modified generalized Benders decomposition method [12]. In addition, a distributed algorithm based on Newton's method has been proposed to solve the problem of multiregional economic scheduling [13], which can reduce the iteration time and effectively increase the calculation efficiency.

The aforementioned studies on interconnected-system scheduling mainly focus on problems such as large-scale complex constraints and the difficulty of combining wind-power interconnected power grids with security constraints; hence, they mainly adopt mathematical methods for objective-function optimization. In most available studies, in the case of multiregional participation in dispatch, the regions are regarded as economic-interest communities with the same objective function in pursuit of the overall maximum interest, which distributes the benefits unreasonably and may even sacrifice the interests of certain regions. The coordinated dispatch of interconnected systems is a process in which two regions pursue their own interests so as to maximize not only their economic interests but also the overall social and economic interests. Therefore, game theory [14–16] can be used to analyze the respective interests of the interconnected regions and the Nash equilibrium that the two may achieve. Additionally, the aforementioned studies optimize the power output of the units through mathematical models based on the load and wind-power forecast data at each moment in isolation, ignoring the fact that power-system dispatch is a continuous and repeated process. For each such process, the uncertainty prediction and wind-power dispatch can draw on accumulated experience. Therefore, reinforcement learning [17–19] provides a new solution for using such experience to optimize the power output of the units more efficiently, maximize wind-power consumption, and complete the game between regions based on system experience and conditions. With online learning ability, the wind-power consumption capacity of an integrated wind-storage system has been strengthened [20]. For coordination and dispatch in an integrated energy microgrid, a dispatch model has been constructed based on the multiagent game and the Q-learning algorithm to maximize the operating incomes and reduce the costs of all the parties [21].

This study proposes a large-scale wind-power coordinated consumption strategy to realize the optimal operation of interconnected systems.The main contributions of this study are summarized as follows:

• A coordinated economic dispatch model based on the noncooperative game Nash equilibrium, considering the interconnected wind-power uncertainty, is established. The interactions between the regions and the tie-line for various operational cases are presented in detail.

• The Q-learning algorithm is used to optimize the states of the two regions separately to determine the Nash equilibrium comprising the maximum wind-power consumption, the most economical unit commitment, and the tie-line power output strategies.

• By analyzing the modified IEEE 39-bus interconnection system, the Nash-Q learning method proposed in this study is compared with the traditional dispatch method, establishing that the proposed method achieves a reasonable distribution of interests among regions and promotes large-scale consumption of wind power.

The remainder of this paper is organized as follows. Sections 2 and 3 introduce the detailed dispatch model and the Nash-Q learning method, respectively. The performed case study and discussions are presented in Section 4. Section 5 highlights the conclusions of this study.

2 Coordinated dispatch model based on noncooperative game Nash equilibrium

2.1 Two-region economic optimization decision model

In interconnected-system dispatch, for a certain area, the start-up/shut-down of conventional thermal power units can be arranged based on the load data, wind-power forecast, and tie-line schedule to calculate the equivalent economic cost of each region and, subsequently, that of the entire system. In the dispatch process of an interconnected system, one region, while optimizing its own equivalent economic cost, does not necessarily cause the other region to suffer an equal economic loss. The two interconnected parties may have strategies that achieve a common balance of interests, i.e., each further emphasizes the maximization of its own benefit from the perspective of cooperation; hence, the coordination of the economic interests of the two regions constitutes a noncooperative game Nash equilibrium model. This study focuses on improving and emphasizing the autonomy of each region itself, as well as the collaboration and cooperation between regions. The start-up/shut-down strategies are formulated separately in the two regions, and coordination and cooperation between them is realized through the adjustment of the tie-line power. The game factor is defined as the equivalent cost of each region. Assuming that region I is a high-generation area for wind power and region J is a wind-power consumption area, the following equations result:

$$f_I=\sum_{t=1}^{T}\left\{\sum_{n=1}^{N_{IG}}\left[u_{n,t}^{I}F_{In}\left(P_{n,t}^{I}\right)+S_{n,t}^{I}\right]+C_W\sum_{j=1}^{N_W}\Delta P_{Wj,t}-\gamma P_t^l\right\}$$

$$f_J=\sum_{t=1}^{T}\left\{\sum_{n=1}^{N_{JG}}\left[u_{n,t}^{J}F_{Jn}\left(P_{n,t}^{J}\right)+S_{n,t}^{J}\right]+\gamma P_t^l\right\}$$

$$S_{n,t}=\begin{cases}S_{n,hot}, & T_{n,min}\le T_{n,t,off}\le T_{n,min}+T_{n,cold}\\ S_{n,cold}, & T_{n,t,off}>T_{n,min}+T_{n,cold}\end{cases}$$

$$F_n\left(P_{n,t}\right)=a_nP_{n,t}^2+b_nP_{n,t}+c_n,\qquad \Delta P_{Wj,t}=P_{Wj,t}-P'_{Wj,t}$$

where $f_I$ and $f_J$ are the equivalent cost functions of the region I and region J game players, respectively; $T$ is the number of dispatch periods; $N_{IG}$ and $N_{JG}$ are the total numbers of thermal power units in each region, respectively; $N_W$ is the number of wind farms in region I; $u_{n,t}^{I}$ and $u_{n,t}^{J}$ are the start-up states of thermal power unit $n$ in each period, which take values of 0 or 1; $F_{In}$ and $F_{Jn}$ are the coal-consumption functions of each region, respectively; $P_{n,t}^{I}$ and $P_{n,t}^{J}$ are the outputs of thermal power unit $n$ in each period; $S_{n,t}^{I}$ and $S_{n,t}^{J}$ are the start-up costs of thermal power unit $n$ in each period of each region; $C_W$ is the penalty cost coefficient of abandoned wind; $\Delta P_{Wj,t}$ is the abandoned wind power of wind turbine $j$ in each period; $\gamma$ is the tie-line price; $P_t^l$ is the power on the tie-line; $S_{n,hot}$ and $S_{n,cold}$ are the hot and cold start-up costs, respectively; $T_{n,t,off}$ is the continuous downtime of unit $n$ up to period $t$; $T_{n,min}$ is the minimum stop time; $T_{n,cold}$ is the cold-start time; $a_n$, $b_n$, and $c_n$ are the operating cost coefficients of unit $n$; $P_{Wj,t}$ is the predicted wind power of wind turbine $j$ in each period; and $P'_{Wj,t}$ is the wind power of wind turbine $j$ paralleled into the grid in each period.
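As a minimal illustration of the hot/cold start-up cost rule above, the following Python sketch reproduces the case logic (the function name and signature are ours; the paper itself provides no code):

```python
def startup_cost(t_off, t_min, t_cold, s_hot, s_cold):
    """Start-up cost of a thermal unit given its continuous downtime t_off.

    Implements the two-case rule above: a hot start applies when the unit
    has been down no longer than t_min + t_cold; otherwise a cold start.
    """
    if t_off <= t_min + t_cold:
        return s_hot   # short outage: hot start-up cost S_n,hot
    return s_cold      # long outage: cold start-up cost S_n,cold
```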

2.2 Constraints

The unit output constraints are as follows:

$$u_{n,t}P_{n,min}\le P_{n,t}\le u_{n,t}P_{n,max}$$

where $P_{n,min}$ and $P_{n,max}$ are the minimum and maximum technical outputs of unit $n$, respectively.

The unit ramp-rate constraints are as follows:

$$-R_{D,n}\le P_{n,t}-P_{n,t-1}\le R_{U,n}$$

where $R_{D,n}$ and $R_{U,n}$ are the downward and upward ramp rates of unit $n$, respectively.

The minimum start-off time constraints are as follows:

$$\left(T_{n,t-1,on}-T_{n,on}\right)\left(u_{n,t-1}-u_{n,t}\right)\ge 0$$

$$\left(T_{n,t-1,off}-T_{n,off}\right)\left(u_{n,t}-u_{n,t-1}\right)\ge 0$$

where $T_{n,on}$ and $T_{n,off}$ are the minimum on and off times of unit $n$, respectively, and $T_{n,t-1,on}$ and $T_{n,t-1,off}$ are the continuous on and off times of unit $n$ up to period $t-1$.

The constructed wind-power output uncertainty set is expressed as follows [22]:

$$W=\left\{P_{Wj,t}:P_{Wj,t}=\hat{P}_{Wj,t}+z_{j,t}\tilde{P}_{Wj,t},\ \left|z_{j,t}\right|\le 1,\ \sum_{j=1}^{N_W}\left|z_{j,t}\right|\le \Gamma^S\ \forall t,\ \sum_{t=1}^{T}\left|z_{j,t}\right|\le \Gamma^T\ \forall j\right\}$$

where $\hat{P}_{Wj,t}$ and $\tilde{P}_{Wj,t}$ are the predicted value and the maximum fluctuation of the wind power of wind turbine $j$ in period $t$, respectively, and $\Gamma^S$ and $\Gamma^T$ are the spatial and temporal uncertainty budgets.

The power-balance constraints of the two regions are as follows:

$$\sum_{n=1}^{N_{IG}}P_{n,t}^{I}+\sum_{j=1}^{N_W}P'_{Wj,t}-\chi_tP_t^l=D_t^{I},\qquad \sum_{n=1}^{N_{JG}}P_{n,t}^{J}+\chi_tP_t^l=D_t^{J}$$

where $\chi_t$ indicates the power-flow direction between the interconnected regions ($\chi_t=1$ if $P_t^l$ flows in the specified positive direction, and $\chi_t=-1$ otherwise), and $D_t$ is the predicted value of the load in the respective area.

The load demand constraints are as follows:

$$\sum_{n=1}^{N_G}u_{n,t}P_{n,max}\ge\left(1+r\right)D_t$$

where $r$ is the load reserve ratio of the respective area and $N_G$ is its number of thermal power units.

The transmission capacity constraints are as follows:

$$0\le P_t^l\le P_{l,max}$$

where $P_{j,max}$ is the maximum wind-power prediction value in a dispatch period; $\eta_t$ is the proportional control coefficient of the adjustment period, $0\le\eta_t\le 1$, which restricts the periods in which the tie-line power can be adjusted; and $P_{l,max}$ is the maximum limit of the transmission power on the tie-line.

The aforementioned interconnected-system economic dispatch model, considering wind-power uncertainty, is a multivariate mixed-integer nonlinear optimization problem; owing to computational limitations, it is difficult to determine its optimal solution directly. In this study, the method presented in [23] is used to linearize the coal-consumption and start-up cost functions, as well as the nonlinear variables in the constraints; the problem is thereby transformed into a mixed-integer linear one.
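For illustration, one standard piecewise linearization consistent with this step (the $K$-segment form below is a common scheme and an assumption on our part; the exact formulation follows [23]) replaces the quadratic coal-consumption curve by $K$ linear segments:

$$F_n\left(P_{n,t}\right)\approx F_n\left(P_{n,min}\right)u_{n,t}+\sum_{k=1}^{K}s_{n,k}\delta_{n,t,k},\qquad P_{n,t}=P_{n,min}u_{n,t}+\sum_{k=1}^{K}\delta_{n,t,k},\qquad 0\le\delta_{n,t,k}\le\bar{\delta}_{n,k}$$

where $s_{n,k}$ is the slope of segment $k$ and $\delta_{n,t,k}$ is the output allocated to it; because $a_n>0$, the slopes increase with $k$, so the segments fill in order without additional binary variables.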

2.3 Interarea Nash equilibrium

Regions I and J constitute a noncooperative game-based model in the process of pursuing their own best interests. The Nash equilibrium points achieved by the regions independently are the optimal strategies of their own objective functions [24–25]. This can be expressed as

$$G=\left\{g;\ I_g;\ f_g^*;\ A_g^*\right\}$$

$$f_I^*=f_I\left(A_{I,t}^*,A_{J,t}^*,P_t^{l*}\right)\le f_I\left(A_{I,t},A_{J,t}^*,P_t^{l*}\right)\qquad(15)$$

$$f_J^*=f_J\left(A_{I,t}^*,A_{J,t}^*,P_t^{l*}\right)\le f_J\left(A_{I,t}^*,A_{J,t},P_t^{l*}\right)\qquad(16)$$

where $G$ is the equilibrium point of the game; $g$ is the game function; $I_g$ is the set of individuals involved in the game; $f_g^*$ is the Nash equilibrium payoff function of the players participating in the game, namely, the equivalent cost; $A_g^*$ is the Nash equilibrium action strategy set of the game players; $A_{I,t}^*$ is the Nash equilibrium strategy for the start-up output of the units in region I; $A_{J,t}^*$ is the Nash equilibrium strategy for the start-up output of the units in region J; and $P_t^{l*}$ is the Nash equilibrium strategy for the tie-line power values. Equations (15) and (16) state that each region's strategy is its own optimal strategy when the other region selects its optimal strategy.

The process of solving the Nash equilibrium problem is described as follows:

Step 1: Input the raw data, including the load prediction data, the wind-power prediction data, and the various data and parameters required by the objective functions $f_I$ and $f_J$.

Step 2: The initial value of the Nash equilibrium solution $\left(A_{I,t}^0,A_{J,t}^0,P_t^{l,0}\right)$ is given.

Step 3: Iterative search: the $k$-th round optimization result is solved according to each region's own equivalent-cost objective function and the optimization results of the previous round, i.e.,

$$A_{I,t}^{k}=\arg\min_{A_{I,t}}f_I\left(A_{I,t},A_{J,t}^{k-1},P_t^{l,k-1}\right),\qquad A_{J,t}^{k}=\arg\min_{A_{J,t}}f_J\left(A_{I,t}^{k-1},A_{J,t},P_t^{l,k-1}\right)$$

Step 4: Determine whether the Nash equilibrium solution has been found, i.e., whether the $k$-th round optimization result is consistent with that of the $(k-1)$-th round:

$$A_{I,t}^{k}=A_{I,t}^{k-1},\qquad A_{J,t}^{k}=A_{J,t}^{k-1},\qquad P_t^{l,k}=P_t^{l,k-1}$$

If this condition is satisfied, the Nash equilibrium solution has been found, i.e., the Nash equilibrium equivalent costs $f_I^*$ and $f_J^*$ of each region, the tie-line power value $P_t^{l*}$ at each moment, and the start-up output strategies $A_{I,t}^*$ and $A_{J,t}^*$ can be determined; otherwise, return to Step 3.
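The iterative search of Steps 2–4 can be sketched as a best-response loop (the two callables stand in for the single-region MILP solvers and are assumed interfaces, not part of the paper):

```python
def nash_iteration(best_response_I, best_response_J, A_J0, P_l0, max_rounds=50):
    """Best-response iteration for the two-region noncooperative game.

    best_response_I(A_J, P_l) -> (A_I, P_l): region I minimizes f_I given J's strategy;
    best_response_J(A_I, P_l) -> A_J:        region J minimizes f_J given I's strategy.
    """
    A_J, P_l = A_J0, P_l0             # Step 2: initial Nash equilibrium guess
    prev = None
    for _ in range(max_rounds):       # Step 3: optimize against the previous round
        A_I, P_l = best_response_I(A_J, P_l)
        A_J = best_response_J(A_I, P_l)
        if (A_I, A_J, P_l) == prev:   # Step 4: two consecutive rounds coincide
            return A_I, A_J, P_l      # Nash equilibrium (A*_I, A*_J, P_l*)
        prev = (A_I, A_J, P_l)
    raise RuntimeError("no Nash equilibrium found within max_rounds")
```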

3 Interconnected system economic dispatch based on the Nash-Q learning method

3.1 Basic principle of the Q-learning algorithm

Reinforcement learning is a process of repeatedly learning and interacting with the environment to strengthen certain decisions. In this process, as the agent acts on the environment through action A, the environment changes and the reward value R is generated. The agent receives the reward value R and selects the next action to be performed according to the enhanced signal and the current state S, with the goal of finding the optimal strategy to accomplish the target task. A typical reinforcement learning model is depicted in Fig. 1.

Fig. 1 Reinforcement learning model
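The loop in Fig. 1 can be summarized in a few lines (the env/agent interfaces below are illustrative assumptions, not objects defined in the paper):

```python
def run_episode(env, agent):
    """One episode of the agent-environment interaction of Fig. 1.

    Assumed interfaces: env.reset() -> state, env.step(action) -> (state, reward, done);
    agent.select_action(state) -> action, agent.learn(s, a, r, s_next).
    """
    s = env.reset()
    done, total_reward = False, 0.0
    while not done:
        a = agent.select_action(s)        # act on the environment (action A)
        s_next, r, done = env.step(a)     # environment changes, reward R is generated
        agent.learn(s, a, r, s_next)      # reinforce the decision using R
        s, total_reward = s_next, total_reward + r
    return total_reward
```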

Q-learning is a type of reinforcement learning, commonly based on the Markov decision process [26–28]. In this study, the Q-learning algorithm used in the Nash game results in more efficient commitment and dispatch decisions. It sets the previous empirical Q-value as the initial value of the subsequent iterative calculation, improving the convergence efficiency of the algorithm. The value function and iterative process of the Q-learning algorithm are expressed as follows:

$$Q^*\left(s,a\right)=\sum_{s'\in S}P_{ss'}\left(a\right)\left[R\left(s,s',a\right)+\beta\max_{a'}Q^*\left(s',a'\right)\right]$$

$$Q_{k+1}\left(s_k,a_k\right)=Q_k\left(s_k,a_k\right)+\alpha\left[R+\beta\max_{a'}Q_k\left(s_{k+1},a'\right)-Q_k\left(s_k,a_k\right)\right]$$

where $s$ and $s'$ are the current and next states, respectively; $S$ is the state-space set; $\beta$ is the discount factor; $R\left(s,s',a\right)$ is the reward value obtained by performing action $a$ to move from state $s$ to state $s'$; $P_{ss'}\left(a\right)$ is the transition probability to state $s'$ after performing action $a$ in state $s$; $Q\left(s,a\right)$ is the Q-value of performing action $a$ in state $s$; $\alpha$ is the learning factor; and $Q_k\left(s_k,a_k\right)$ is the $k$-th iterate of the optimal value function $Q^*$.
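A tabular realization of this update is straightforward; the sketch below uses a dictionary Q-table and ε-greedy exploration (the exploration rate is an assumption; α = 0.01 and β = 0.8 follow the case-study settings in Section 4):

```python
import random
from collections import defaultdict

Q = defaultdict(float)                 # Q-table: (state, action) -> Q-value

def select_action(s, actions, eps=0.1):
    """epsilon-greedy selection over a finite action set."""
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

def q_update(s, a, r, s_next, actions, alpha=0.01, beta=0.8):
    """One Q-learning step: Q <- Q + alpha*(r + beta*max_a' Q(s',a') - Q)."""
    td_target = r + beta * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (td_target - Q[(s, a)])
```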

3.2 Principle of the Nash-Q learning algorithm

In a noncooperative game, as the state constantly changes, the action strategies of the game players change accordingly. At time $t$, player $i$ iteratively learns to update its value $Q_t^i$ while simultaneously accounting for the value $Q_t^j$ of player $j$; a Nash game equilibrium $\left(\pi^1\left(s\right),\dots,\pi^n\left(s\right)\right)$ is then formed in state $s$. By defining the noncooperative game Nash equilibrium solution as $\text{Nash}Q_t^i\left(s'\right)=\pi^1\left(s'\right)\cdots\pi^n\left(s'\right)Q_t^i\left(s'\right)$, the updated iterative equation of the Nash-Q learning algorithm is provided as follows [29]:

$$Q_{t+1}^i\left(s,a^1,\dots,a^n\right)=\left(1-\alpha\right)Q_t^i\left(s,a^1,\dots,a^n\right)+\alpha\left[r_t^i+\beta\,\text{Nash}Q_t^i\left(s'\right)\right]$$
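For two players, the update can be sketched as follows; for brevity, the stage-game equilibrium is searched over pure strategies only, whereas the general algorithm in [29] admits mixed equilibria (rewards here would be the negatives of the regional equivalent costs):

```python
import numpy as np

def pure_nash_value(M1, M2):
    """Return both players' payoffs at a pure-strategy Nash equilibrium of the
    stage game with payoff matrices M1, M2 (shape: n_actions_1 x n_actions_2)."""
    n1, n2 = M1.shape
    for i in range(n1):
        for j in range(n2):
            if M1[i, j] >= M1[:, j].max() and M2[i, j] >= M2[i, :].max():
                return M1[i, j], M2[i, j]   # neither player can improve unilaterally
    return M1.max(), M2.max()               # fallback if no pure equilibrium exists

def nash_q_update(Q1, Q2, s, a1, a2, r1, r2, s_next, alpha=0.01, beta=0.8):
    """Nash-Q step for both players; Q1/Q2 map each state to a payoff matrix."""
    v1, v2 = pure_nash_value(Q1[s_next], Q2[s_next])   # NashQ(s') for each player
    Q1[s][a1, a2] = (1 - alpha) * Q1[s][a1, a2] + alpha * (r1 + beta * v1)
    Q2[s][a1, a2] = (1 - alpha) * Q2[s][a1, a2] + alpha * (r2 + beta * v2)
```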

3.3 Selection of the state space and action strategy set

In general, there are two methods for implementing the Q-function: the neural-network method and the lookup table [26]. In this study, the latter is used to implement Q-learning, where the utility or worth of each corresponding action-state pair is expressed solely by a quantified value (Q-value), which is determined by an action-value function. Therefore, it is necessary to first determine S and A.

In the objective function model, the state variables include the load prediction value $S_{Load}$ and the wind-power prediction value $S_W$ in each period; the action variable set $A$ includes the start-up outputs $a_I$ and $a_J$ of the thermal power units in regions I and J, respectively, and the tie-line power value $a_l$. Before generating the Q-value table, we first discretize the continuous state and action variables to form (state, action) pair functions. The state variables can be discretized by the following expression [30]:

$$\Delta P_i=\frac{P_{i,max}-P_{i,min}}{N_i}\qquad(20)$$

where $\Delta P_i$ is the interval length of the $i$-th variable; $P_{i,max}$ and $P_{i,min}$ are the maximum and minimum values of each variable, respectively; and $N_i$ is the number of intervals of the $i$-th variable. All the state variables can be discretized into interval form using (20), and the state to which each region belongs can be uniquely determined. The action strategy variables need to be discretized into fixed-value form, after which a set of action strategies $a_k=\left\{a_I,a_J,a_l\right\}$ can be uniquely determined according to the unit states and the interval of the tie-line power value.
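Equation (20) maps each continuous measurement to an interval index; a direct implementation follows (the p_min/p_max ranges in the example are assumptions chosen to match the 50-MW interval widths of Section 4):

```python
def interval_index(p, p_min, p_max, n_intervals):
    """Discretize a continuous value into its interval index per Eq. (20)."""
    dp = (p_max - p_min) / n_intervals          # interval length, Eq. (20)
    idx = int((p - p_min) // dp)
    return min(max(idx, 0), n_intervals - 1)    # clamp boundary values

# Example: 16 load intervals and 6 wind intervals of 50 MW each.
load_state = interval_index(612.0, p_min=0.0, p_max=800.0, n_intervals=16)
wind_state = interval_index(187.0, p_min=0.0, p_max=300.0, n_intervals=6)
```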

3.4 Coordinated dispatch based on Nash-Q learning

After the state space S and the action strategy A are determined, prelearning and online learning can be performed. Before reaching the optimal Q-value, prelearning accumulation of experience is necessary to generate a Q-table that approximates the optimal solution; online learning can then be performed to obtain the best action strategy [31].

Fig. 2 Nash-Q learning flow chart
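Consistent with the flow chart, the two stages can be condensed as follows (reusing run_episode from the sketch in Section 3.1; the historical-day interface and the exploration schedule are assumptions):

```python
def pre_learn(env, agent, historical_days, eps=0.3):
    """Stage 1: accumulate experience on historical load/wind data so that the
    Q-table approximates the optimal solution before real-time use."""
    agent.eps = eps                   # explore while pre-learning
    for day in historical_days:
        env.load_profile(day)         # replay one historical dispatch day
        run_episode(env, agent)

def online_dispatch(env, agent):
    """Stage 2: act greedily on the pre-learned Q-table for the actual day."""
    agent.eps = 0.0                   # exploit only
    return run_episode(env, agent)
```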

4 Analysis of examples

4.1 Description of the examples

In this study, the two-region economic dispatch model is solved through Yalmip programming in MATLAB with the Gurobi solver. The modified IEEE 39-bus two-region system depicted in Fig. A1 is used to verify the model. This interconnected system contains four wind turbines and 20 thermal power units; all four wind turbines are located in region I. Fig. 3 displays the wind-power prediction value and its upper and lower bounds. The parameters of the thermal power units and the load demand data of each region are available in [32]. The upper limit of the tie-line power transmission is 500 MW, the penalty cost of abandoned wind is 100 $/MW, the electricity price of the tie-line is γ = 20 $/MW, Γ^S = 4, and Γ^T = 12.

With respect to the parameter settings of the Q-learning algorithm, the learning factor is α = 0.01 and the discount factor is β = 0.8. For the state-space division, the load power is divided into 16 discrete intervals of 50 MW each, and the wind-power output is divided into six discrete intervals of 50 MW each. Therefore, corresponding to a 24-h dispatch period, regions I and J correspond to 2304 and 384 states, respectively. For the action-space division, the operation of each unit takes one of two fixed states, start or stop, and the tie-line power value takes one of the six fixed values {0, 100, 200, 300, 400, 500} MW. Therefore, regions I and J each contain 6144 actions. Utilizing the annual historical load and wind-power data, the Nash-Q learning model is prelearned, and a Q-table approximating the optimal solution is established, which gives Q-learning a higher decision-making ability. During the prelearning stage, the regional costs range over $5.47–6.23 million and $5.51–6.35 million in regions I and J, respectively.
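These counts follow directly from the discretization: region I's state couples the load and wind intervals over the 24 periods, region J's state uses only the load intervals, and each region's action combines the start/stop status of its 10 units with the six tie-line levels:

$$N_S^I=16\times 6\times 24=2304,\qquad N_S^J=16\times 24=384,\qquad N_A=2^{10}\times 6=6144$$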

Fig. 3 Prediction value and upper and lower bounds of the wind-power output

4.2 Impact of tie-line dispatch on the system based on Nash-Q learning

To better compare the analysis results,we designed the following three cases:

Case 1: The tie-line power P_t^l = 0 at all times, and the interconnected system is divided into two isolated systems in which economic dispatch is implemented separately.

Case 2: η_t = 0, which allows the tie-line power to be adjusted at any time. The traditional method is used to solve the interconnected dispatch model.

Case 3: η_t = 0, and the Nash-Q learning method proposed in this study, with prelearned Q-learning, is applied to solve the interconnected dispatch model.

The total costs of the three cases are shown in Table 1.

Table 1 Cost of each region for Cases 1–3

1) Analysis of Cases 3 and 1

Fig. 4 Unit combinations in different cases

In Cases 3 and 1, units 1–4 in both regions are always in the start-up group, whereas the states of units 5–10 differ, as shown in Figs. 4(a) and 4(b). The total system dispatch cost in Case 3 is $1,124,554, which is $23,076 less than that of Case 1. It can also be observed that the dispatch costs of both regions decrease in Case 3. Comparing Figs. 4(a) and 4(b), the number of start-ups and the running time in Case 3 are reduced in both regions, indicating that the interconnected system can not only reduce the operating cost but also promote the consumption of wind power.

2) Analysis of Cases 3 and 2

In Cases 3 and 2, the start-up/shut-down of the units in the two regions differs for units 5–7, as depicted in Figs. 4(b) and 4(c). On comparing the start-up/shut-down status and the equivalent cost of each region between Cases 2 and 3, it can be observed that in Case 3 the number of start-ups and the run-time of the units in region I increase, whereas those in region J decrease; correspondingly, the equivalent cost of region J decreases, whereas that of region I increases. This demonstrates that Nash-Q learning can redistribute the economic benefits of the two regions through iterative solutions to enable more wind-power consumption.

The transmission power of the tie-line and the wind-power dispatch in Cases 3 and 2 are displayed in Fig. 5. The total system dispatch cost in Case 2 is $1,122,752, and the total cost in Case 3 is 0.16% higher than that of Case 2. From Fig. 5, it can be calculated that the wind-power consumption in Case 3 increases by 1.76% compared to Case 2. Moreover, with respect to Case 2, the tie-line power value in Case 3 decreases in the peak-load periods of 10–13 h and 20–21 h, and it also decreases in the two low-load periods of 1–5 h and 23–24 h. In both cases, the wind power can be completely consumed in the 6–22 h period. In the two low-load periods of 1–5 h and 23–24 h, wind-power curtailment occurs in both cases; however, the wind-power consumption capacity in Case 3 is higher. This demonstrates that Nash-Q learning can alleviate the peaking pressure in the peak-load periods by coordinating the interests of the two regions, ensuring the benefit of the wind-power receiving-end region J in the low-load periods and promoting the consumption of wind power, albeit at a small expense to the overall economics of the system.

Fig. 5 Comparison of the tie-line power and wind-power dispatch output between Cases 2 and 3

4.3 Influence of γ and η_t on the economics of interconnected systems

This study analyzes the influence of the adjustment-period control coefficient η_t and the tie-line electricity price γ on the economics of the interconnected system. For η_t = 0 and γ ranging from 0–25 $/MW, the equivalent and total costs of each region are listed in Table 2. It can be seen that as the tie-line electricity price γ increases, the equivalent cost of region J gradually increases relative to that of region I; correspondingly, the overall economy and the wind-power consumption deteriorate. For γ = 20 $/MW and η_t ranging from 0–1, the changes in the total cost are presented in Table 3. It can be seen that with the increase in η_t, the time periods in which the wind power can be shared between the regions decrease, deteriorating the wind-power consumption and the overall economy.

Table 2 Influence of γ on the economic dispatch of interconnected systems

Table 3 Influence of η_t on the economic dispatch of interconnected systems

4.4 Algorithm performance comparison

The discrete particle swarm optimization (DPSO) algorithm is used to solve the optimal dispatch strategy by optimizing the unit output to obtain the equivalent cost of each region. For the overall game process, the Nash equilibrium point is solved using the iterative search method; the solution procedure differs from that of the Q-learning method. The same time section is used for 10 repeated calculations with both algorithms. Four indicators are used for comparison: the mean value of the objective function, the variance (D), the standard deviation (SD), and the relative standard deviation (RSD) of each algorithm; the results are shown in Table 4. The stability of the solution method proposed in this study is slightly better than that of the DPSO, and the objective function values based on particle swarm optimization are relatively dispersed, increasing the uncertainty of the result.
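Assuming the usual definitions of these indicators over the N = 10 repeated runs (the paper does not restate them):

$$D=\frac{1}{N}\sum_{i=1}^{N}\left(f_i-\bar{f}\right)^2,\qquad SD=\sqrt{D},\qquad RSD=\frac{SD}{\bar{f}}\times 100\%$$

where $f_i$ is the objective value of run $i$ and $\bar{f}$ is their mean; a smaller RSD indicates a more stable solution method.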

Table 4 Optimization results of the two algorithms

The iterative convergence process is illustrated in Fig. 6. The DPSO algorithm converges to the optimal value after approximately 1,200 iterations, whereas the Q-learning algorithm, after prelearning, converges to the optimal cost value in approximately 600 iterations. Moreover, the initial total cost of Q-learning is approximately 30% less than that of the DPSO, and it is also more economical after convergence. This shows that, after prelearning, the Q-learning algorithm has a reasonable ability to make optimal decisions based on the learned and accumulated experience; hence, its initial solution is close to the optimal value. Compared to the DPSO heuristic algorithm, the Q-learning algorithm offers a faster solution speed and better decision solutions for multivariate, high-dimensional, complex scheduling optimization problems.

Fig. 6 Comparison of the iterative convergence between the Q-learning and DPSO algorithms

5 Conclusions

In this study, a coordinated economic dispatch model considering the interconnected wind-power uncertainty was presented, which integrates and exploits the synergy between game theory and reinforcement learning algorithms. The rationality and feasibility of this model in dispatch decision-making were analyzed in detail and discussed. After verification through a calculation example, the following conclusions were drawn:

1) The economic dispatch of interconnected systems based on the Nash-Q learning algorithm can not only effectively deal with the uncertainty of the wind-power output and improve wind-power consumption but also redistribute the shared benefits between regions.

2) The tie-line electricity price γ has a significant effect on the dispatch cost of the interconnected system. The larger the value of γ, the higher the cost of purchasing wind power from the sending end and the lower the wind-power consumption. The larger the value of η_t, the less wind power is shared between the interconnected systems in each period, and the overall economics worsen.

3) The prelearned Q-learning algorithm has better convergence and computational efficiency than the DPSO intelligent algorithm.

Further research can focus on improving the generalization of the proposed method, and online transfer learning can be used to apply it to various scenarios.

Acknowledgements

This work is supported by the Fundamental Research Funds for the Central Universities (No. 2017MS093).

Appendix A

Fig. A1 Interconnection structure of the modified IEEE 39-bus system