CAO Jingyu, DONG Lu, and SUN Changyin*
1. School of Automation, Southeast University, Nanjing 210096, China; 2. School of Cyber Science and Engineering, Southeast University, Nanjing 211189, China
Abstract: Driven by the improvement of the smart grid, the active distribution network (ADN) has attracted much attention due to its capability of active management. By making full use of electricity price signals for optimal scheduling, the total cost of the ADN can be reduced. However, the optimal day-ahead scheduling problem is challenging since the future electricity price is unknown. Moreover, in the ADN, some schedulable variables are continuous while others are discrete, which increases the difficulty of determining the optimal scheduling scheme. In this paper, the day-ahead scheduling problem of the ADN is formulated as a Markov decision process (MDP) with a continuous-discrete hybrid action space. Then, an algorithm based on multi-agent hybrid reinforcement learning (HRL) is proposed to obtain the optimal scheduling scheme. The proposed algorithm adopts the structure of centralized training and decentralized execution, and different methods are applied to determine the selection policies of continuous scheduling variables and discrete scheduling variables. Simulation results demonstrate the effectiveness of the algorithm.
Keywords: day-ahead scheduling, active distribution network (ADN), reinforcement learning, hybrid action space.
With the rapid development of science and technology, the load demand of users continues to increase and the requirements for environmental protection are becoming more stringent. It is therefore necessary to improve the traditional distribution mode, in which electricity is generated uniformly by large power plants and then flows to the load nodes through the superior grid, because this mode suffers from high supply pressure during peak hours and large power losses during transmission. These problems can be alleviated by introducing distributed generation (DG) units and battery energy storage systems (BESS). Meanwhile, energy consumption can be reduced by interrupting some unnecessary load on the user side. Since the traditional distribution network can no longer achieve the purpose of active management, the active distribution network (ADN) has been proposed, which can actively manage the DG units, the BESS, and the user side.
In practical applications, optimal scheduling is the key to active management of the ADN. Much research has focused on the BESS and the user side due to their controllability and scheduling flexibility. For example, mixed-integer conic programming (MICP) has been applied to the scheduling of energy storage [1]. In [2–4], electric vehicles were regarded as the BESS, and different optimization algorithms were used to obtain the optimal charging or discharging schedule. Similar algorithms have also been applied to demand response on the user side [5–7]. The basic idea of these methods is to formulate the scheduling problem as a mixed-integer nonlinear program (MINLP) and then explore the optimal policies through different optimization algorithms. For example, in [8], the MINLP was linearized into a mixed-integer linear program (MILP), and the branch and bound method was used to obtain the optimal solution. The authors of [9] directly applied the teaching & learning based optimization (TLBO) algorithm to obtain the optimal value of the MINLP. These scheduling methods have all been verified to be effective, but they consider only a single aspect, either the BESS or the user side, which may lead to poor performance in other aspects. Therefore, many scheduling approaches for the overall architecture of the ADN have been proposed. For instance, the study in [10] proposed a multi-stage optimization approach for the scheduling of the ADN. In addition, the ADN has been modeled as a whole, and different optimization methods have been used to obtain the optimal scheduling scheme. The authors of [11] directly used the general algebraic modeling system (GAMS) to solve the formulated MINLP problem. The rolling optimization method and the robust optimization method were applied in [12] and [13], respectively. Intelligent algorithms have also been used to solve the overall programming problem of the ADN, such as the particle swarm optimization (PSO) algorithm in [14], the grey wolf algorithm in [15], and the hybrid algorithm based on dynamic programming (DP) and the genetic algorithm (GA) in [16].
The above-mentioned optimization methods are carried out on the basis of an established model, so they depend heavily on the model. However, in the actual day-ahead scheduling problem, the electricity price and residential load are not known a day in advance and fluctuate dynamically within a certain range, so it is difficult to establish an accurate model. Therefore, reinforcement learning (RL) is introduced. It does not require a model and obtains the optimal solution through interaction between the agent and the environment. A lot of related work has been done in the literature. For the charging scheduling of the BESS, Q-learning was used in [17] and the deep Q-network (DQN) was used in [18]. These two methods regarded the selection of the charging behavior of the BESS as a discrete variable, and then RL methods for the Markov decision process (MDP) with a discrete action space were applied. In practice, the charge or discharge capacity of the BESS can be any value within the maximum range, that is, treating it as a continuous variable can yield a better scheduling scheme. Similarly, the authors of [19,20] applied DQN to user-side demand response. For the scheduling of DG units, the double DQN (DDQN) was proposed in [21]. Although RL has not been widely applied to the day-ahead scheduling of the ADN, autonomous household energy management of smart homes with independent generators can be extended to the whole ADN. In [22], a deep neural network (DNN) was built and its parameters were trained to obtain the optimal scheduling solution. The study in [23] proposed an algorithm that combined a DNN and Q-learning to improve the optimization performance. DQN was applied in [24,25] and deep deterministic policy gradient (DDPG) was applied in [26–28]. In addition, RL algorithms have been combined with other methods to achieve better results. For example, fuzzy reasoning was introduced into RL in [29].
It is worth noting that these papers formulate the optimal scheduling problem as an MDP with a fully continuous or fully discrete action space, and then apply the appropriate RL methods to obtain the optimal solution. Obviously, these formulations are idealized. In practice, some schedulable variables are continuous, such as the charging or discharging capacity of the BESS and the interrupted load of the user side, while others are discrete, such as the number of operating DG units. Therefore, a new RL algorithm is required to obtain the optimal solution of the MDP with a continuous-discrete hybrid action space. In the literature, the methods for the MDP with a hybrid action space mainly fall into two categories. One is to discretize the continuous action space, so that the problem is transformed into an MDP with a fully discrete action space. For example, fuzzy rules were used in [30] to discretize the continuous variables. However, this kind of approximation through discretization greatly reduces the control accuracy. The other is to relax the discrete action space into a continuous one. The algorithm based on multi-agent DDPG proposed in [31] was applied to obtain the optimal solution, and inverse discretization was then performed to recover the discrete controllable variables. This method greatly increases the complexity of the action space. Therefore, a more reasonable approach is to apply two different algorithms to update the selection policies of discrete actions and continuous actions [32]. The authors of [33] proposed an algorithm called p-DQN that combined DQN and DDPG, where DQN was used to select discrete actions and DDPG was used to select continuous actions. Afterwards, some papers proposed improved algorithms on the basis of p-DQN according to the practical problem, such as the multi-pass DQN (MP-DQN) in [34] and the deep multi-agent parameterized Q-networks (Deep MAPQN) in [35]. However, these algorithms are applied to problems with a parameterized action space, that is, the continuous actions are parameters of the discrete actions. For the parallel structure of discrete and continuous action spaces considered in this paper, the complexity of those algorithms increases exponentially as the dimensionality of the discrete actions increases. To the best of our knowledge, the application of RL to the optimal day-ahead scheduling problem of the ADN with a hybrid action space has not been reported in the literature.
In this paper, the optimal day-ahead scheduling of the ADN is formulated as an MDP with a continuous-discrete hybrid action space. The objective is to obtain the optimal scheduling scheme that minimizes the total cost of the ADN. A novel RL structure is proposed to determine the optimal scheduling scheme. The main contributions of this paper are as follows:
(i) A multi-agent hybrid RL (HRL) based algorithm is proposed for the MDP with a continuous-discrete hybrid action space. In this algorithm, the advantage actor-critic and DDPG are applied to the selection of discrete schedulable variables and continuous schedulable variables, respectively. Moreover, the HRL adopts the structure of centralized training and decentralized execution. Owing to the parallel relationship between the actor networks, the complexity of the algorithm does not increase significantly when the dimensionality of the discrete actions increases.
(ii) The objective function is designed as the sum of the costs of the different aspects of the ADN. The optimal scheduling scheme obtained from this objective function can reduce the total cost of the ADN in one day and alleviate the supply pressure on the superior grid during peak hours.
(iii) In the proposed method, a Gaussian distribution is applied in establishing the forecasting models, which effectively increases the robustness of the forecasting models.
The rest of this paper is organized as follows. The problem formulation is presented in Section 2. After that, the forecasting model and the multi-agent HRL-based algorithm are introduced in Section 3. In Section 4, simulation results based on actual application scenarios are presented to demonstrate the effectiveness of the proposed algorithm. Finally, conclusions are drawn in Section 5.
The optimal day-ahead scheduling problem proposed in this paper aims to minimize the total cost of the ADN. The framework of the ADN is shown in Fig. 1. The red arrows in the figure indicate the electricity exchange between each single aspect and the ADN. It can be seen that the cost of the ADN mainly includes the cost of the electricity exchanged with the superior grid, the BESS, the user side, and the DG units.
Fig.1 Framework of ADN
This section is divided into four parts. First of all, to simplify the optimization process, this paper makes some appropriate assumptions according to the actual scenarios in Subsection 2.1. Afterwards, in Subsection 2.2, the objective functions covering the four aspects of the ADN are introduced in detail. In Subsection 2.3, the constraints on some parameters are explained. In Subsection 2.4, the problem is formulated as an MDP with discrete time steps of one hour. The specific descriptions are as follows.
(i) The electricity loss during transmission is zero.
(ii) The maximum interruptible load cannot exceed 30% of the total residential load at the current hour.
(iii) The difference in dissatisfaction of individual users is ignored, and the interruption of residential load is carried out uniformly by the ADN.
The objective of this optimal scheduling problem is to minimize the total cost of the ADN, which is defined as
where C_GE denotes the cost of the electricity exchanged with the superior grid, C_BESS indicates the sum of the cost of the BESS internal loss and the transmission cost of charging or discharging, C_UD represents the cost of user dissatisfaction caused by interrupting the residential load, and C_DG is the operating cost of the DG units.
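Since the defining equation is not reproduced in this version of the text, a hedged reconstruction of the total-cost objective, based purely on the four components named above, would be

$$C_{\mathrm{total}} = C_{\mathrm{GE}} + C_{\mathrm{BESS}} + C_{\mathrm{UD}} + C_{\mathrm{DG}}.$$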
In particular,
where t denotes every hour of the day, and α_pur and α_sel are 0–1 variables. For time t, α_pur(t) = 1 represents that the ADN purchases electricity from the superior grid and α_sel(t) = 1 represents that the ADN sells surplus electricity to the superior grid. It is worth noting that α_pur(t) + α_sel(t) ≤ 1. c_pur(t) represents the total amount of electricity purchased and c_sel(t) represents the total amount of electricity sold. P_GE(t) represents the electricity price per megawatt (MW).
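As the displayed equation is likewise omitted here, one plausible form of the grid-exchange cost consistent with these variable descriptions (treating sold electricity as revenue, i.e., entering with a negative sign) is

$$C_{\mathrm{GE}} = \sum_{t=1}^{24}\big[\alpha_{\mathrm{pur}}(t)\,c_{\mathrm{pur}}(t) - \alpha_{\mathrm{sel}}(t)\,c_{\mathrm{sel}}(t)\big]\,P_{\mathrm{GE}}(t),$$

which should be read as a sketch rather than the paper's exact formulation.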
In the expression of C_BESS, α_loss represents the loss factor of BESS aging. B(t) denotes the electricity stored in the BESS at time t. When B(t) is in the range of 20%–80% of the maximum capacity of the BESS B_max, the aging cost of the BESS is low. However, when it exceeds this range, the aging cost of the BESS increases. P_tr indicates the transmission price of the charging or discharging process. α_ch and α_d are 0–1 variables. For time t, α_ch(t) = 1 means the BESS is charging and α_d(t) = 1 means the BESS is discharging. Similarly, α_ch(t) + α_d(t) ≤ 1. c_ch(t) and c_d(t) represent the amount of charge and discharge, respectively. B_i denotes the initial electricity of the BESS at the beginning of the day and B_24 denotes the remaining electricity of the BESS at the end of the day. μ indicates the coefficient relating the reduction of the BESS electricity over the day to the extra cost of the BESS.
In the expression of C_UD, c_IL(t) denotes the amount of residential load interrupted according to the current electricity price. The cost of user dissatisfaction is proportional to the square of the interrupted load, and β_dis is the coefficient that relates the cost of user dissatisfaction to the square of the interrupted load. This parameter can be set according to the preference of the user side.
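Given that the cost is stated to be proportional to the square of the interrupted load with coefficient β_dis, a hedged reconstruction of the user-dissatisfaction cost is

$$C_{\mathrm{UD}} = \sum_{t=1}^{24} \beta_{\mathrm{dis}}\, c_{\mathrm{IL}}(t)^{2}.$$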
In the expression of C_DG, o(t) represents the number of operating DG units at time t, and the three cost coefficients denote the sum of the generation cost and the maintenance cost of the DG units when the number of operating DG units is 0, 1, and 2, respectively.
(i) BESS constraint:
(ii) Transmission electricity constraints:
where c_max represents the maximum charging or discharging electricity within one hour.
(iii) Load interruption constraint:
where I_max represents the maximum interruptible load within one hour.
RL is a theoretical framework for modeling the stochastic policy and the received rewards of agents in an environment whose state has the Markov property. The framework is built on a set of interactive objects, namely the agents and the environment. This paper takes the decision makers of the scheduling scheme as the agents and the influencing factors of the scheduling scheme as the environment. Therefore, the optimal scheduling problem is formulated as an MDP, and the objective is achieved through interactive learning between the agents and the environment. The specific descriptions are as follows.
(i) State: the state is defined as s_t = {P_{t-1}, P_t, L_{t-1}, L_t, h_t, B_t}, where P_{t-1} and P_t denote the electricity price in the previous hour and the current hour, L_{t-1} and L_t denote the total residential load in the previous hour and the current hour, h_t represents the current hour, and B_t indicates the electricity of the BESS in the current hour. The above state variables provide a reference for the decision-making of policies.
In addition, this paper introduces an additional state variable Ex_t, which represents the amount of electricity exchanged between the ADN and the superior grid and is denoted as the additional state sa_t. This additional state variable does not affect the decision on the next actions, but it is helpful for evaluating the currently selected actions. Therefore, sa_t is fed into the critic network but not into the actor networks.
The additional state variable Ex_t can be calculated as
where C_t denotes the charging or discharging electricity of the BESS. The value of C_t is positive for charging and negative for discharging; the amount of electricity change is given by the absolute value of C_t, and C_t = 0 means that the BESS is neither charging nor discharging in the current hour. I_t indicates the interrupted load of the user side. D_t denotes the amount of electricity generated by the DG units. When the calculated value of Ex_t is positive, the ADN purchases electricity from the superior grid; otherwise, the ADN sells the surplus electricity to the superior grid.
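The displayed equation for Ex_t is not reproduced here; one plausible power balance consistent with the variable descriptions above (net residential demand after interruption, plus BESS charging, minus DG generation) is

$$Ex_t = (L_t - I_t) + C_t - D_t,$$

offered as a reconstruction under those sign conventions rather than the paper's exact expression.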
(ii) Action: the action is defined as a_t = {C_t, I_t, O_t}, where O_t indicates the number of operating DG units. Among the action variables, C_t and I_t are continuous and can take any value within the restricted range, while O_t is a discrete variable that can only be selected from the discrete action space A_d = {0, 1, 2}. Therefore, the action space of this optimal scheduling problem is a continuous-discrete hybrid action space.
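As an illustration of this hybrid action space, the following minimal Python sketch shows how a state and a continuous-discrete action could be represented; the class layout and the example numbers are assumptions for illustration, not part of the paper.

```python
import numpy as np

class HybridAction:
    """Hybrid action: continuous part (C_t, I_t) and discrete part (O_t)."""
    def __init__(self, C_t: float, I_t: float, O_t: int):
        self.continuous = np.array([C_t, I_t], dtype=np.float32)  # BESS charge, interrupted load
        assert O_t in (0, 1, 2), "O_t must lie in the discrete action space A_d = {0, 1, 2}"
        self.discrete = O_t                                        # number of operating DG units

# a state s_t as described above (values are illustrative only)
state_t = {
    "P_prev": 0.55, "P_now": 0.62,    # electricity price in previous/current hour (k/MW)
    "L_prev": 900.0, "L_now": 950.0,  # total residential load in previous/current hour (MW)
    "h": 10,                          # current hour
    "B": 1500.0,                      # electricity of the BESS (MW)
}
action_t = HybridAction(C_t=120.0, I_t=80.0, O_t=1)
```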
(iii) Policy:
where π(a_t|s_t) reflects the conditional probability distribution of each action a_t given the state s_t.
(iv) Reward:
Reward is the feedback from the environment to the agents after the agents execute the actions. The agents use it to evaluate the performance of the selected actions. Finally, the maximum cumulative reward can be obtained through interactive learning between the agents and the environment, and therefore the minimum total cost can be obtained. Each parameter in the calculation formula of C_total has a certain correspondence with the state or action parameters. Therefore, the optimal values of the parameters in C_total can be determined by the decision-making of policies, and thus the optimal day-ahead scheduling scheme is determined.
(v) Return:
where G is the return, which represents the cumulative reward of the day after being weighted by the discount factor, and γ is the discount factor between 0 and 1. When γ is close to 0, the agent is shortsighted; when γ is close to 1, the agent is farsighted.
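Since the displayed equation is omitted in this version, the one-day discounted return described here takes the standard form (a reconstruction, assuming 24 hourly rewards)

$$G = \sum_{t=0}^{23} \gamma^{t}\, r_{t}.$$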
(vi) State-action value function:
where Q^π(s, a) denotes the state-action value function, which evaluates the performance of the obtained scheduling scheme. The objective of the day-ahead scheduling problem is to obtain the optimal policy π*, i.e., a sequence of actions for the user side, the BESS, and the DG units, that maximizes the state-action value function.
(vii) State transition:
where s_{t+1} = {P_t, P_{t+1}, L_t, L_{t+1}, h_{t+1}, B_{t+1}} represents the next state, which can be expressed as a function of s_t and a_t.
With the aforementioned definitions of the RL framework and the constraints in Subsection 2.3, the following remark is given.
Remark 1  The constraints proposed in Subsection 2.3 are satisfied by restricting the values of the action variables.
For the BESS, the electricity at the next hour is obtained by B_{t+1} = B_t + C_t. Thus, as long as C_t is restricted to satisfy -B_t ≤ C_t ≤ B_max - B_t, the BESS constraint is satisfied. At the same time, the value of C_t needs to satisfy -c_max ≤ C_t ≤ c_max in order to meet the transmission electricity constraint. Furthermore, in order to satisfy the load interruption constraint, the action variable I_t needs to take a value between 0 and I_max.
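A minimal sketch of how Remark 1 can be enforced in code, by clipping the raw continuous actor outputs before execution, is given below; the function name and the example numbers are illustrative assumptions, while B_max, c_max, and I_max follow the symbols defined above.

```python
import numpy as np

def clip_actions(C_raw, I_raw, B_t, B_max, c_max, I_max):
    # BESS energy constraint 0 <= B_t + C_t <= B_max, i.e. -B_t <= C_t <= B_max - B_t
    C_t = np.clip(C_raw, -B_t, B_max - B_t)
    # transmission constraint |C_t| <= c_max
    C_t = np.clip(C_t, -c_max, c_max)
    # load-interruption constraint 0 <= I_t <= I_max
    I_t = np.clip(I_raw, 0.0, I_max)
    return C_t, I_t

# example: clipped charge/discharge and the resulting BESS level B_{t+1} = B_t + C_t
C_t, I_t = clip_actions(C_raw=250.0, I_raw=90.0, B_t=2900.0,
                        B_max=3000.0, c_max=210.0, I_max=120.0)
B_next = 2900.0 + C_t   # = 3000.0 here, since C_t is clipped to 100.0
```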
It is difficult to determine the optimal day-ahead scheduling scheme when the future electricity price and residential load are unknown. Moreover, through the problem formulation, the scheduling problem is transformed into an RL problem with a continuous-discrete hybrid action space. Therefore, obtaining the optimal solution of the MDP with a hybrid action space is the key difficulty addressed in this paper. In view of the above difficulties, first of all, since the values of the electricity price and residential load fluctuate within a certain range with a 24-hour cycle, neural networks are used to fit forecasting models of the electricity price and residential load. Then, to obtain the optimal solution of the MDP with a hybrid action space, a multi-agent HRL algorithm is proposed. The details are as follows.
The electricity price and residential load are both time-related variables. Moreover, in actual application scenarios, to avoid the security risks to the grid caused by sudden changes in the electricity price or residential load, the previous values are usually used as a reference for limiting the current value. Therefore, the current value is determined by both the previous values and the time. The relationship between them is expressed as follows:
Then neural networks are used to fit this unknown relationship. In order to increase the robustness of the forecasting models, the input variables are sampled from N(V_a, 0.1²), where V_a denotes the actual value of the electricity price or residential load.
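The following PyTorch sketch illustrates this forecasting setup: a small feed-forward network mapping (previous value, current value, hour) to the next-hour value, with the training inputs perturbed by N(V_a, 0.1²) noise as described above. The layer sizes and function names are assumptions for illustration (values are assumed to be normalized), not the authors' code.

```python
import numpy as np
import torch
import torch.nn as nn

class Forecaster(nn.Module):
    """Maps (value at t-1, value at t, hour t) to the forecasted value at t+1."""
    def __init__(self, hidden: int = 32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(3, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, x):
        return self.net(x)

def noisy_input(v_prev, v_now, hour):
    # robustness trick described above: sample each actual value V_a from N(V_a, 0.1^2)
    v_prev = np.random.normal(v_prev, 0.1)
    v_now = np.random.normal(v_now, 0.1)
    return torch.tensor([v_prev, v_now, float(hour)], dtype=torch.float32)

model = Forecaster()
prediction = model(noisy_input(0.55, 0.62, 10))   # forecasted value for hour 11
```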
The multi-agent HRL algorithm adopts an actor-critic architecture since this basic architecture can be applied to both continuous and discrete action spaces. In addition, for the policy selection of the different aspects, the multi-agent HRL algorithm adopts an architecture of centralized training and decentralized execution so that each single aspect can obtain its optimal scheduling scheme independently. Consequently, the architecture of the multi-agent HRL algorithm contains several parallel actor networks for execution and a single critic network for training, as shown in Fig. 2.
Fig.2 Architecture of multi-agent HRL algorithm
In order to learn stochastic policies more effectively, different RL algorithms are adopted for the policy selection of continuous and discrete actions. Decision making for discrete action policies is mainly based on the advantage actor-critic, and there is a one-to-one correspondence between the discrete action variables and the discrete actor networks. In contrast, the different continuous action variables are all determined by one continuous actor network based on the DDPG algorithm. The lines with arrows in Fig. 2 indicate the information flow between the decentralized actor networks, the centralized critic network, and the environment. To begin with, each actor network perceives the state s_t and then executes the discrete actions ad_{1,t}, ad_{2,t}, ···, ad_{n,t} and the continuous actions ac_{1,t}, ac_{2,t}, ···, ac_{n,t}. As a result of these actions, the state is transformed from s_t to s_{t+1}, and the additional state sa_{t+1} and the reward r_t are generated, which are transmitted together to the single critic network to evaluate the decision-making of policies.
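A compact PyTorch sketch of this decentralized-actor / centralized-critic layout is shown below; the hidden-layer sizes follow Subsection 4.1, while the module names and exact wiring are illustrative assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class DiscreteActor(nn.Module):
    """Advantage actor-critic style actor: outputs a probability distribution over discrete actions."""
    def __init__(self, state_dim, n_actions, hidden=32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, n_actions), nn.Softmax(dim=-1))

    def forward(self, s):
        return self.net(s)                    # probabilities, sampled during execution

class ContinuousActor(nn.Module):
    """DDPG style actor: outputs deterministic continuous actions scaled to [-1, 1]."""
    def __init__(self, state_dim, action_dim, hidden=32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, action_dim), nn.Tanh())

    def forward(self, s):
        return self.net(s)

class CentralCritic(nn.Module):
    """Single centralized critic: sees the state, the additional state sa_t, and the continuous actions."""
    def __init__(self, state_dim, sa_dim, cont_action_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim + sa_dim + cont_action_dim, 64), nn.ReLU(),
                                 nn.Linear(64, 32), nn.ReLU(),
                                 nn.Linear(32, 1))

    def forward(self, s, sa, ac):
        return self.net(torch.cat([s, sa, ac], dim=-1))   # approximates Q(s_t, sa_t, ac_t)
```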
For a certain discrete actor network, its output is the action probability distribution π_{θdi}(ad_{i,t}|s_t), which represents the probability of each discrete action being selected under the state s_t. The specific discrete action is then obtained through sampling. Therefore, the state value function V(s_t) is generally used to evaluate the policy, and the discrete actor network parameters θ_di are then optimized by increasing the probability that a good action is selected. However, for the continuous actor network, according to the DDPG algorithm, its output is the deterministic action value ac_t, and the currently selected actions are evaluated by the state-action value function Q(s_t, ac_t). Based on the above analysis, a state-action (continuous) value function is proposed to approximate the expected return G_t when the policy π_θ is executed. The function is defined by the Bellman equation [36]:
where sa_t represents the additional state defined in the RL framework, ac_t and ad_t represent all continuous actions and all discrete actions, respectively, θ refers to the parameters of all actor networks, and π_θc(s_{t+1}) denotes the continuous actions ac_{t+1} generated by the continuous actor network based on the state s_{t+1}.
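Written out from the variable descriptions above, this state-action (continuous) value function can be sketched in the Bellman form

$$Q^{\pi_\theta}(s_t, sa_t, ac_t, ad_t) = \mathbb{E}\big[r_t + \gamma\, Q^{\pi_\theta}\big(s_{t+1}, sa_{t+1}, \pi_{\theta_c}(s_{t+1}), ad_{t+1}\big)\big],$$

which is a reconstruction of the omitted display, not necessarily its exact form.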
The method of temporal-difference (TD) learning is used to update the value of Q(s_t, sa_t, ac_t):
where α denotes the update rate. The update objective of the TD method is r_{t+1} + γQ(s_{t+1}, sa_{t+1}, ac_{t+1}).
Then the TD error δ_t is defined as
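Spelled out from the update objective stated above (the displayed equations are omitted in this version), the TD update and the TD error take the standard form

$$Q(s_t, sa_t, ac_t) \leftarrow Q(s_t, sa_t, ac_t) + \alpha\big[r_{t+1} + \gamma Q(s_{t+1}, sa_{t+1}, ac_{t+1}) - Q(s_t, sa_t, ac_t)\big],$$

$$\delta_t = r_{t+1} + \gamma Q(s_{t+1}, sa_{t+1}, ac_{t+1}) - Q(s_t, sa_t, ac_t).$$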
In order to evaluate the policy of continuous actions and the policy of discrete actions, performance objectives are defined.
For the continuous actor network, the performance objective is defined as
where θ_c represents the parameters of the continuous actor network.
The optimal policy of continuous actions is the policy that maximizes this performance objective [37]:
For a certain discrete actor network, first of all, define an advantage function:
where Q(s_t, sa_t, ac_t) is a baseline function that is independent of the discrete action ad_{i,t}. Subtracting this baseline reduces the variance but does not change the gradient itself. As defined in [38], d^π(s) is a discounted weighting of the encountered states. The proof is as follows:
Thus the performance objective J(π_θdi) [39] is defined as
where θ_di represents the parameters of the discrete actor networks. Similarly, the optimal discrete policy is the policy that maximizes J(π_θdi):
As for the calculation of the advantage function A(s_t, sa_t, ac_t, ad_{i,t}), it is troublesome to use two sets of parameters to approximate Q(s_t, sa_t, ac_t, ad_{i,t}) and Q(s_t, sa_t, ac_t) respectively, so the TD error is usually used directly to approximate the advantage function. It can be proved that δ_t defined in (24) is an unbiased estimate of A(s_t, sa_t, ac_t, ad_{i,t}):
However, it can be seen from its definition that the value function Q(s_t, sa_t, ac_t) is defined recursively. Therefore, it is impractical to compute the value of Q by recursion at every step in practical applications, so the single critic network is used to approximate the value of Q:
where θ_w denotes the parameters of the critic network, and the result is the approximate value of Q obtained through θ_w.
Algorithm 1 shows how the network parameters θ of the overall architecture of the multi-agent HRL algorithm are trained. The inputs are the real electricity price and residential load of the previous day. After the HRL training is completed, the trained network parameters θ are output, including the single critic network parameters θ_w, the continuous actor network parameters θ_c, and the discrete actor network parameters θ_d.
First of all, the estimated network parameters θ are initialized randomly. Then the target network parameters are initialized to the same values as θ. After that, the storage of state transition pairs and the update of the network parameters are performed in a loop of 51 000 epochs. Each epoch starts at a random hour t of the day. This randomness improves the robustness of the neural networks and avoids overfitting, so that the optimal scheduling policy can still be obtained when the forecasted electricity price deviates slightly. Then the initial state and the initial additional state are obtained. At each time step, the exploration and exploitation of the discrete actions ad_t are based on the randomness of selecting actions according to the probability distribution of the discrete actions. As for the exploration and exploitation of the continuous actions ac_t, Gaussian noise is added to the decision-making process of ac_t to change it from a deterministic process to a random process, and then ac_t is obtained by sampling from this random process. The actions ac_t and ad_t are then executed to complete the state transition, the reward r_t fed back by the environment is observed, and thus a sequence of state transition pairs is formed and stored in the experience pool D. Once the epoch number exceeds 1 000, the experience pool is full, and the state transition pairs in the experience pool are used to update the parameters θ. Specifically, N state transition pairs are drawn from the experience pool as samples. As the reference objective for optimization, the target state-action (continuous) value y_j is calculated as
Therefore, with the minibatch samples, the loss function is calculated as
which denotes the mean square error between the target state-action (continuous) value y_j and the state-action (continuous) value Q(s_j, sa_j, ac_j; θ_w) approximated by the estimated critic network parameters θ_w. Then, along the gradient direction that minimizes the loss function, the parameters of the critic network are updated as
where l_c indicates the learning rate of the critic network parameters and ∇_{θw}L(θ_w) denotes the gradient of L(θ_w).
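For concreteness, the minibatch critic update described here corresponds to the standard mean-square-error objective and gradient step (a reconstruction of the omitted displays):

$$L(\theta_w) = \frac{1}{N}\sum_{j=1}^{N}\big(y_j - Q(s_j, sa_j, ac_j; \theta_w)\big)^{2}, \qquad \theta_w \leftarrow \theta_w - l_c\, \nabla_{\theta_w} L(\theta_w).$$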
Then, based on the above definition of the performance objectives of the continuous actor network and discrete actor networks, the performance objectives of these minibatch samples are calculated as
Similar to the update of the critic network parameters, the parameters of the actor networks are updated as
where l_a indicates the learning rate of the actor network parameters. Different from the update of the critic network parameters, the update of the actor network parameters is along the gradient direction that maximizes the performance objectives.
The update of the target network parameters adopts a soft update, that is, the parameters of the target networks are updated at every step but the update rate is very small. The update is performed according to the following formula:
where τ denotes the update rate.
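A minimal sketch of this soft (Polyak) target update, assuming PyTorch parameter containers, is shown below; the function name is an illustrative assumption, while τ follows the paper's symbol.

```python
import torch

def soft_update(target_net, source_net, tau):
    # theta_target <- (1 - tau) * theta_target + tau * theta_source, applied every step
    with torch.no_grad():
        for tgt, src in zip(target_net.parameters(), source_net.parameters()):
            tgt.data.mul_(1.0 - tau)
            tgt.data.add_(tau * src.data)
```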
Remark 2  Due to the structure of parallel distributed actor networks, the computational complexity of the proposed multi-agent HRL algorithm is O(n_c + n_d), where n_c and n_d represent the numbers of agents with continuous and discrete action spaces, respectively. When the number of agents increases, the computational complexity of the proposed algorithm increases linearly, while the computational complexity of the previous algorithms [33] increases exponentially.
In this section, the effectiveness of the proposed algorithm is verified through simulation results. This section is divided into two parts. Subsection 4.1 introduces the experimental setup in detail. Then, the simulation results and discussion are presented in Subsection 4.2.
The proposed algorithm is a price-based scheduling method, so the evaluation of its performance is based on real-world hourly electricity prices. However, the settings of the BESS and DG units are hypothetical, based on actual situations. If the proposed algorithm is applied to real-world scenarios, only some coefficients need to be adjusted according to the local conditions. In this experiment, the parameters are set as follows. B_max and c_max do not exceed 3 000 MW and 210 MW, respectively. α_loss is set to 0.08. P_tr is 0.14 k/MW. The coefficient μ is set to 0.5.
In the algorithm structure, each actor network consists of an input layer, an output layer, and a hidden layer with 32 hidden neurons, while the critic network contains an input layer, an output layer, and two hidden layers with 64 and 32 hidden neurons, respectively. In addition, the settings of some hyperparameters of the algorithm are presented in Table 1.
Table 1 Hyperparameters in the algorithm
Remark 3  The hyperparameters of this algorithm are set by referring to [26]. Among them, the learning rate of the actor networks l_a is an order of magnitude lower than the learning rate of the critic network l_c. These hyperparameters are then adjusted according to the neural network structure and the actual training results of this experiment.
(i) Performance of the forecasting models: The past real data of the electricity price and residential load are used to train the neural networks for 5 000 epochs, and thus the forecasting models are obtained. Then the real electricity price and residential load at 23 o'clock on the previous day and 0 o'clock on the current day are input into the models. After continuous state transitions, the forecasted electricity price and residential load for the next day are output. The comparison between the forecasted values and the real values is presented in Fig. 3.
Fig.3 Comparison of the forecasted values and true values
As shown in Fig. 3, although there are slight deviations between the forecasted values and the real values, the deviations are within a reasonable range. Therefore, it is effective to use the value of the current hour, the value of the previous hour, and the current time t to fit the forecasting models. Accordingly, as long as the values of the last hour of the previous day and the first hour of the day are known, the day-ahead forecast can be carried out.
(ii) Performance of the day-ahead scheduling algorithm: The multi-agent HRL algorithm proposed in this paper is used to perform day-ahead scheduling experiments in a simulation scenario, where the scheduling of the ADN is limited to the charge or discharge of the BESS, the interrupted load of the user side, and the number of operating DG units. In order to reduce the influence of the different magnitudes of the schedulable variables on the determination of the scheduling scheme, the schedulable variables are normalized. The optimal day-ahead scheduling scheme obtained after 50 000 epochs of training is shown as follows.
Fig. 4 shows the hourly charging or discharging of the BESS according to the day-ahead forecasted electricity price. It can be seen that the optimal scheduling scheme guides the BESS to charge when the electricity price is low and to discharge when the electricity price is high. Fig. 5 shows the remaining electricity in the BESS per hour after charging or discharging. After a whole day of charging and discharging, the electricity at the end of the day is approximately equal to the value at the beginning of the day, which is beneficial to the long-term scheduling of the BESS. In addition, the electricity over the whole day is maintained within the range of 20% to 80% of B_max, which is conducive to prolonging the service life of the BESS.
Fig.4 Hourly electricity price and charging/discharging of the BESS
Fig.5 Electricity of the BESS per hour
For the interrupted load of the user side, when it increases, the electricity cost of the user side under the current electricity price decreases, but the dissatisfaction cost increases. The purpose of the proposed algorithm is to learn a policy that balances the electricity cost and the user dissatisfaction cost according to the preference of the user side. Fig. 6 shows the relationship between the interrupted load of the user side and the electricity price when the user dissatisfaction factor β is 0.0017. It can be seen that when the electricity price increases, the interrupted load of the user side increases, and otherwise it decreases. Therefore, the effectiveness of the proposed algorithm is demonstrated. Fig. 7 shows the comparison of the interrupted load when the user dissatisfaction factor β takes different values. It can be observed that a larger β corresponds to a larger interrupted load. This is because a larger β means that the user side is more concerned about the dissatisfaction cost than the electricity cost.
Fig.6 Hourly electricity price and the interrupted load of the userside
Fig.7 Comparison of the interrupted load under different β
Fig. 8 shows the number of local DG units in operation per hour according to the forecasted electricity price. As shown in the figure, when the electricity price is in the low range (0.4−0.6 k/MW), it is more economical to purchase electricity from the superior grid, so the DG units are shut down. When the electricity price is in the mid range (0.6−0.8 k/MW), the maintenance cost of operating the two DG units at the same time is relatively high, so only one of the DG units is turned on, which can reduce part of the electricity cost. When the electricity price is in the peak range (0.8−1.0 k/MW), the cost of electricity purchase and sale is high. Therefore, the two DG units operate at the same time to make full use of the electricity supply of the DG units, and the excess electricity is sold to the superior grid, which improves the economy of the ADN.
Fig.8 Hourly electricity price and the number of operating DG units
The scheduling of the above-mentioned aspects together constitutes the optimal day-ahead scheduling scheme of the ADN. The total cost after scheduling is 8 259.35, which is a reduction of 22.68% compared with the value of 10 682.29 before scheduling. Moreover, the comparison of the electricity exchanged between the ADN and the superior grid before and after the optimal day-ahead scheduling is shown in Fig. 9. It is proved that the proposed scheduling scheme can effectively reduce the total cost of the ADN within a day and alleviate the supply pressure on the superior grid during peak hours.
Fig.9 Comparison of the electricity exchanged between the ADN and the superior grid before and after the optimal day-ahead scheduling
In order to evaluate the performance of the proposed multi-agent HRL algorithm in solving MDP problems with a hybrid action space, it is compared with previous methods for such problems [33]. Among them, the p-DQN algorithm adopts the same network architecture as the HRL algorithm. For the DQN algorithm, the charging behavior of the BESS is discretized into two types, charging or discharging, and the load interruption behavior of the user side is discretized into two types, interruption or non-interruption. The other parameters of the DQN are the same as those of the HRL. When the networks start to be updated, the scheduling scheme of the whole episode from 0 o'clock to 23 o'clock is completed every 5 epochs, and the episode return is output to observe the entire optimization process of these algorithms. The experimental result of DDPG is not included here because it fails to obtain the optimal scheduling scheme in the application scenario of this paper. The training results of the different algorithms are compared from the following three dimensions. The comparison of the episode return during the training process is shown in Fig. 10. Moreover, the comparison of the 24-hour total electricity cost reduction rate ΔR_c and the 24-hour variance reduction rate ΔR_v of the electricity exchanged between the ADN and the superior grid after the optimal scheduling is presented in Table 2. The greater the variance reduction rate of the exchanged electricity, the stronger the ability of the algorithm to alleviate the supply pressure on the superior grid during peak hours. It can be seen that the performance of the proposed algorithm is better than that of the previous algorithms. Therefore, the superiority of the proposed algorithm is proved.
Fig.10 Episode return of the proposed multi-agent HRL algorithm compared with the DQN and p-DQN algorithm
Table 2 Performance of different algorithms (%)
In this paper, a multi-agent HRL algorithm is proposed to solve the optimal day-ahead scheduling problem of the ADN, in which continuous schedulable variables and discrete schedulable variables coexist. With the aim of minimizing the total cost of the ADN within a day, the optimal scheduling problem is formulated as an MDP with a continuous-discrete hybrid action space. In the proposed approach, forecasting models are established to overcome the uncertainty of the future electricity price and residential load. Then, the multi-agent HRL algorithm is proposed to learn the optimal scheduling scheme, which adopts the advantage actor-critic and DDPG for the selection of discrete schedulable variables and continuous schedulable variables, respectively. Simulation results show that the multi-agent HRL algorithm can minimize the total cost and alleviate the supply pressure during peak hours. Furthermore, comparisons with previous algorithms indicate the superiority of the proposed algorithm.