Deep Reinforcement Learning Based Approach for Optimal Power Flow of Distribution Networks Embedded with Renewable Energy and Storage Devices

Di Cao; Weihao Hu; Xiao Xu; Qiuwei Wu; Qi Huang; Zhe Chen; Frede Blaabjerg

网刊加载中。。。

使用Chrome浏览器效果最佳，继续浏览，你可能不会看到最佳的展示效果，

确定继续浏览么?

复制成功，请在其他浏览器进行阅读

Deep Reinforcement Learning Based Approach for Optimal Power Flow of Distribution Networks Embedded with Renewable Energy and Storage Devices PDF

- ORCID：
Di Cao
✉
- ORCID：
Weihao Hu (Senior Member, IEEE)
✉
- ORCID：
Xiao Xu
✉
- ORCID：
Qiuwei Wu
✉
- ORCID：
Qi Huang
✉
- ORCID：
Zhe Chen (Fellow, IEEE)
✉
- ORCID：
Frede Blaabjerg (Fellow, IEEE)
✉

School of Mechanical and Electrical Engineering, University of Electronic Science and Technology of China, Chengdu, China； Department of Electrical Engineering, Technical University of Denmark, Kgs. Lyngby 2800, Denmark； Department of Energy Technology, Aalborg University, Aalborg, Denmark

Updated：2021-09-27

DOI：10.35833/MPCE.2020.000557

OUTLINE

Abstract

This study proposes a deep reinforcement learning (DRL) based approach to analyze the optimal power flow (OPF) of distribution networks (DNs) embedded with renewable energy and storage devices. First, the OPF of the DN is formulated as a stochastic nonlinear programming problem. Then, the multi-period nonlinear programming decision problem is formulated as a Markov decision process (MDP), which is composed of multiple single-time-step sub-problems. Subsequently, the state-of-the-art DRL algorithm, i.e., proximal policy optimization (PPO), is used to solve the MDP sequentially considering the impact on the future. Neural networks are used to extract operation knowledge from historical data offline and provide online decisions according to the real-time state of the DN. The proposed approach fully exploits the historical data and reduces the influence of the prediction error on the optimization results. The proposed real-time control strategy can provide more flexible decisions and achieve better performance than the pre-determined ones. Comparative results demonstrate the effectiveness of the proposed approach.

Keywords

Deep reinforcement learning (DRL); optimal power flow (OPF); wind turbine; distribution network

I. Introduction

IN the context of the energy shortage, climate change, and environmental protection, the development of clean energy and low-carbon economy, as well as the optimal allocation of energy, is essential [

1]. It is an effective way to use sustainable energy by realizing the local consumption of renewable energy in a distribution network (DN). However, renewable energy is affected by natural conditions and has the characteristics of intermittence and uncertainty, presenting challenges to the dispatch and operation of the DN [2].

The optimal power flow (OPF) problems of the DN can be classified into two categories. The first category is deterministic OPF problems. Specific values of the load demand, sustainable generation, and particular network conditions are usually needed to solve this type of problem. Various mathematical approaches [

3] and swarm intelligence based approaches are proposed for solving deterministic OPF problems [4], [5]. However, the nonlinear characteristics of these problems (introduced by the constraints of either the network or the devices) make it difficult for the optimization tools to find the global optimum [6]. Evolutionary methods are effective optimization methods when the space of policies is sufficiently small or can be structured and a large amount of time is available for the search [7]. However, power systems have an uncertain nature. It is difficult to implement the intelligence-based methods in the actual operation of power system when considering the uncertainty of the load and the intermittency of renewable energy generation.

The second category is probabilistic OPF (P-OPF) problems. To deal with the uncertainty of the DN, numerous approaches for solving the P-OPF problems have been proposed. References [

8], [9] propose stochastic programming based approaches for the optimization of the DN. The stochastic programming based methods assume the knowledge of the distribution of uncertain variables, based on which the scenarios of uncertainty realizations are generated. These methods suffer from a heavy computational burden, as a large number of scenarios must be considered. Moreover, it is difficult to accurately determine the probability distribution of uncertain variables in practice [10]. In contrast to stochastic programming based methods, robust optimization based methods deal with the uncertainty by constructing an uncertainty set and searching the solutions that are robust to all realizations within the set. Robust optimization based methods are proposed for the management of the DN in [11]-[14]. Reference [13] proposes a robust optimization based method that exploits the convex hull tool for the definition of the uncertainty set. Reference [14] proposes a robust quadratic approach for the operation of a smart DN. In the simulation, the proposed approach achieves better performance than the linearized, nonlinear and quadratically constrained ones. The robust optimization based methods require the obtained solution to be immune to the worst case in the uncertainty set. Thus, the results obtained via these methods are relatively conservative. Chance-constrained methods are also used for the optimization of DN operation [15], [16]. The model predictive control (MPC) algorithm is sometimes used in the two-stage optimization of the management of DNs [17]. However, the performance of the MPC algorithm depends on the accuracy of the prediction of the renewable energy generation and load demand. The past operating experience has not been fully used [18], [19]. The aforementioned methods must resolve a stochastic nonlinear problem partially or completely when a new situation is encountered, which may take some time [20]. Therefore, these methods might not be applicable to real-time control problems. Moreover, these methods greatly depend on accurate information regarding the parameters and topologies of the DN [21]. However, it is hard to obtain the reliable network models in practice.

In recent years, machine learning (ML) has been a popular research topic in computer science. By continuously extracting knowledge from historical data, ML-based methods can generate powerful models to deal with the uncertainty and dynamics of a system without a physical model. The learned models can be generalized to new situations and provide control decisions in real time [

22], [23]. Therefore, ML-based methods are promising alternatives with better dynamic performance of real-time optimization of the DN when accurate parameters are unknown [24]. Among the various ML-based methods, reinforcement learning (RL) has the most potential for the optimization of the DN, as it can learn optimal control strategies from historical data without knowing the global optimum [25]. In [26], a Q-learning method is used for the energy management of a hybrid electric vehicle. As an effective and famous RL algorithm, Q-learning involves learning an action value function, which is a discretized lookup-table matrix. The size of the matrix is determined by the discretized states and actions. When the states and actions are high-dimensional vectors, the sharply increasing matrix size of the action value function makes the convergence difficult. This limits the application of Q-learning in practical scenarios with high-dimensional and continuous states and action spaces. To address this problem, [27] proposes the deep RL (DRL) algorithm using a deep neural network (DNN) as the approximator of the action value function. The DNN can take continuous variables as inputs and does not have to discretize the input states. By combining the strong nonlinear approximation ability of the DNN and the decision-making capacity of RL, DRL gives the computer the human-level performance in various complex tasks [28], e.g., play Atari video games and the game Go. In 2016, Google AlphaGo defeated a human champion in chess, which indicates the remarkable potential of DRL.

Various energy management strategies based on the DRL algorithms have been proposed [

20], [22]. Reference [20] proposes a deep Q-network (DQN) based approach for the management of a battery storage system (BSS) in a micro-grid. Simulation results indicate that the proposed approach can deal with the uncertainty of the environment. However, the DQN must discretize the control variables. For optimization problems with a continuous action domain, the discretization of the control variables unusually leads to suboptimal solutions. The deep policy gradient (DPG) based method has been proven effective in the scenarios with high-dimensional and continuous action spaces. Focusing on the building energy optimization problem, [22] proposes a DPG-based method to perform online management of the building energy. The DPG-based method can take multiple actions at the same time and achieve better results than the DQN.

Inspired by recent research, we develop a DPG-based method with continuous action search to solve the P-OPF problem of the DN with renewable energy generation and BSS. The multi-time P-OPF problem is first formulated as a Markov decision process (MDP). Then, the proximal policy optimization (PPO) algorithm, which is the state-of-the-art DPG-based method, is used to solve the MDP, by sequentially considering the influence of the current action on the future. Neural networks (NNs) are used to extract the optimal operation knowledge to cope with the uncertainties from historical data. This model considers the uncertainty of the demand, the initial energy level of the BSS, and the wind power generation. The objective of this model aims to minimize the cost of the power loss by controlling the BSS and the reactive power of the wind turbine under relevant constraints. Comparative experiments are performed using a modified IEEE 33-bus DN to evaluate the performance of the proposed approach. The main contributions of this paper are presented as follows.

First, a real-time energy management strategy for DN based on the DRL algorithm is proposed. The proposed approach embeds operation knowledge extracted from historical data in the DNN to make near-optimal control decisions in real time. The extracted operation knowledge is adaptive to the uncertainty of the system and can be generalized to newly encountered situations. The decision process is similar to recalling the past experience from the memory when a new state is obtained, without resolving the OPF problem. Therefore, the proposed approach can be used for the online optimization of the DN and provide a better response to system dynamics.

Second, the proposed approach decomposes the multi-period decision problem into multiple single-time-step sub-problems, which are sequentially solved while considering their impact on the future. This reduces the computation complexity introduced by the time correlation of the storage devices.

The remainder of this paper is organized as follows. In Section II, the problem formulation is presented. The principle of the proposed approach and the training process are introduced in Section III. The experimental details and the results of a case study are presented in Section IV. Finally, Section V concludes the paper.

II. Problem Formulation

In this section, the mathematical model of the P-OPF problem with wind turbines, load demand, and BSS is presented.

A. Objective Function

The objective of the P-OPF problem is to minimize the cost of power loss. The optimization horizon is 1 day, and the time interval of optimal scheduling is 1 hour. The objective function is formulated as:

\underset{P_{b s s}, Q_{b s s}, Q_{w}}{m i n} F = m i n \sum_{t = 1}^{T} C_{p} (t) P_{l o s s} (t)

(1)

\begin{array}{l} P_{l o s s} (t) = \frac{1}{2} \sum_{i = 1}^{N} \sum_{j = 1}^{N} G (i, j) [V_{e}^{2} (i, t) + V_{f}^{2} (i, t) + V_{e}^{2} (j, t) + \\ V_{f}^{2} (j, t) - 2 (V_{e} (i, t) V_{e} (j, t) + V_{f} (i, t) V_{f} (j, t))] \end{array}

(2)

where F is the total cost of the power loss for an optimization horizon; $P_{l o s s} (t)$ is the power loss of the DN during hour t; $C_{p} (t)$ is the electricity price during hour t; $G (i, j)$ is the real component of the complex admittance matrix elements; $V_{e} (i, t)$ is the real component of the complex voltage at bus i during hour t; $V_{f} (i, t)$ is the imaginary component of the complex voltage at bus i during hour t; T is the length of one trajectory; and N is the number of nodes in the DN. The control variables are $P_{b s s}$ , $Q_{b s s}$ , and $Q_{w}$ , which represent the active power of the BSS, reactive power of the power conditioning system (PCS) of the BSS, and reactive power of the wind turbine, respectively.

B. Constraints

1)　Wind Power

The constraints of the active and reactive power of the wind turbine are expressed as [

29]:

P_{w} (k, t) = \{\begin{array}{l} P_{w, m a x} (k) v_{r} \leq v \leq v_{c o} \\ P_{w, m a x} (k) {(\frac{v}{v_{r}})}^{3} v_{c i} \leq v < v_{r} \\ 0 o t h e r w i s e \end{array}

(3)

P_{w}^{2} (k, t) + Q_{w}^{2} (k, t) \leq S_{w}^{2} (k)

(4)

where $P_{w} (k, t)$ is the active power of wind turbine k during hour t; $P_{w, m a x} (k)$ is the rated power of wind turbine k; v, $v_{r}$ , $v_{c i}$ , and $v_{c o}$ are the actual speed, rated speed, cut-in speed, and cut-out speed of the wind turbine, respectively; $Q_{w} (k, t)$ is the reactive power of wind turbine k during hour t; and $S_{w} (k)$ is the upper bound of the apparent power of wind turbine k. The parameters of the wind turbine are $v_{c i} = 4$ m/s, $v_{r} = 14$ m/s, and $v_{c o} = 24$ m/s.

2)　BSS

The BSS consists of a storage unit and a PCS unit. The PCS controls the charging and discharging processes and permits the outputs of active and reactive power, in accordance with the following constraints:

P_{b s s}^{2} (k, t) + Q_{b s s}^{2} (k, t) \leq S_{P C S, m a x}^{2} (k)

(5)

| P_{b s s} (k, t) | \leq {\bar{P}}_{b s s} (k)

(6)

where $P_{b s s} (k, t)$ is the active power of BBS k during hour t (when BBS k is charging, $P_{b s s} (k, t)$ is a positive value; when it is discharging, $P_{b s s} (k, t)$ is a negative value); $Q_{b s s} (k, t)$ is the reactive power of BBS k during hour t; $S_{P C S, m a x} (k)$ is the upper limit of the apparent power of BBS k; and ${\bar{P}}_{b s s} (k)$ is the charging power limit of BBS k.

The energy balance of the BSS should satisfy (7).

\{\begin{array}{l} E (k, t + 1) - E (k, t) - η_{c h} P_{b s s} (k, t) = 0 P_{b s s} (k, t) > 0 \\ E (k, t + 1) - E (k, t) - \frac{P_{b s s} (k, t)}{η_{d i s}} = 0 P_{b s s} (k, t) \leq 0 \end{array}

(7)

where $E (k, t)$ is the state of charge (SOC) of BSS k during hour t; and $η_{c h}$ and $η_{d i s}$ are the charging and discharging coefficients, respectively. The storage capacity cannot cross the lower or upper bound (20% or 90% of the storage capacity, respectively).

E_{m i n} \leq E (k, t) \leq E_{m a x}

(8)

where $E_{m i n}$ and $E_{m a x}$ are the lower and upper bounds of the SOC of BSS, respectively. Owing to the uncertainty of load demand and renewable energy generation during the intra-day operation, the BSS needs to be flexibly scheduled to cope with the uncertainties in practice. Therefore, the remaining level of BSS is uncertain. In order to get better simulation results of the real circumstance and fully exploit the BSS, the uncertainty of the initial level of BSS is taken into account.

3)　Power Flow and Voltage Constraints

The power flow constraints are expressed as:

\begin{array}{l} V_{e} (i, t) \sum_{j = 1}^{N} (G (i, j) V_{e} (j, t) - B (i, j) V_{f} (j, t)) + \\ V_{f} (i, t) \sum_{j = 1}^{N} (G (i, j) V_{f} (j, t) + B (i, j) V_{e} (j, t)) + P (i, t) = 0 i \in N \end{array}

(9)

P (i, t) = P_{l o a d} (i, t) - P_{w} (i, t) + P_{b s s} (i, t) i \in N

(10)

\begin{array}{l} V_{f} (i, t) \sum_{j = 1}^{N} (G (i, j) V_{e} (j, t) - B (i, j) V_{f} (j, t)) - \\ V_{e} (i, t) \sum_{j = 1}^{N} (G (i, j) V_{f} (j, t) + B (i, j) V_{e} (j, t)) + Q (i, t) = 0 i \in N \end{array}

(11)

Q (i, t) = Q_{l o a d} (i, t) + Q_{b s s} (i, t) - Q_{w} (i, t) i \in N

(12)

where $B (i, j)$ is the imaginary component of the complex admittance matrix elements; $P (i, t)$ and $Q (i, t)$ are the injection values of the active and reactive power at bus i during hour t, respectively; and $P_{l o a d} (i, t)$ and $Q_{l o a d} (i, t)$ are the active and reactive power of the load demand at bus i during hour t, respectively. Equations (9) and (11) are the active and reactive power flow equations, respectively; and (10) and (12) give the injection values of the active and reactive power, respectively.

The voltage constraint is expressed as:

\begin{matrix} V_{m i n} (i) \leq V (i, t) \leq V_{m a x} (i) & i \in N \end{matrix}

(13)

where $V (i, t)$ is the voltage at bus i during hour t; and $V_{m i n} (i)$ and $V_{m a x} (i)$ are the lower and upper bounds of the voltage at bus i, respectively.

The P-OPF problem formulated above is a stochastic nonlinear programming problem with high complexity owing to the network and time domain introduced by the BSS. This study proposes a DRL-based approach to solve this problem, which is described in detail in Section III.

III. Proposed Control Methodology

In this section, the OPF problem is modelled as an MDP first, and then the PPO algorithm is used to solve the MDP. Subsequently, the DNN architecture for function approximation is presented. Finally, the training process of the proposed approach is illustrated in detail.

A. MDP Modelling

The MDP is used to model RL problems. As the optimization of the DN is a sequential decision-making problem, it can be modelled as an MDP with finite time steps. The MDP can be divided into four parts: $<S, A, P, R>$ .

1) S represents the state set. The state $s_{t}$ is composed of five parts: $P_{l o a d} (i, t)$ , $Q_{l o a d} (i, t)$ , $P_{w} (k, t)$ , $E (k, t)$ , and $C_{p} (t)$ .

2) A represents the action set. The action $a_{t}$ is composed of three parts: $P_{b s s} (k, t)$ , $Q_{b s s} (k, t)$ , and $Q_{w} (k, t)$ .

3) P represents the probability of a transition to the next state $s_{t + 1}$ after action $a_{t}$ is taken in state $s_{t}$ . The state transition from $s_{t}$ to $s_{t + 1}$ can be expressed as $s_{t + 1} = f (s_{t}, P_{b s s} (k, t), ω_{t})$ , where $ω_{t}$ represents the randomness of the environment. The state transition for the SOC of BSS $E (k, t)$ is controlled by $P_{b s s} (k, t)$ . This can be denoted explicitly by the equality constraint in (7). Since the wind power generation and load demand for the next hour are not accurately known, the state transitions of $P_{l o a d} (i, t)$ and $P_{w} (k, t)$ are subject to the environmental randomness. However, it is difficult to accurately model the randomness $ω_{t}$ in practice. To address this problem, a model-free DRL-based approach is used to learn the transition procedure from historical data, as described in Section III-B.

4) R represents the reward $r_{t}$ after action $a_{t}$ is taken in state $s_{t}$ . A single-step reward $r_{t}$ is defined as:

r_{t} = P_{l o s s} (t) C_{p} (t) + δ_{1} + δ_{2} + p (t)

(14)

p (t) = \{\begin{array}{l} - η (20 % - E (t)) & E (t) < 20 % \\ 0 & 20 % \leq E (t) \leq 90 % \\ - η (E (t) - 90 %) & E (t) > 90 % \end{array}

(15)

where $δ_{1}$ is the penalty applied when the voltage exceeds the limit; $δ_{2}$ is the penalty applied when the capability limitation of PCS is not satisfied; $p (t)$ is the penalty applied when the upper or lower bound of the storage unit is exceeded; and $η$ is a coefficient. The units of $δ_{1}$ , $δ_{2}$ , and $η$ are $/MWh, thus, the penalty terms have the same measurement term as the cost of the power loss.

At time step t, the agent makes a decision $a_{t}$ based on the observation of the environment $s_{t}$ and then obtains a reward $r_{t}$ . Then, the environment transfers to the next state $s_{t + 1}$ . This is an MDP. In the context of the P-OPF, the SOC of BSS is a continuous variable, which is affected by the charging/discharging action performed by the agent. Therefore, when determining $a_{t}$ , it is reasonable to consider the future reward that the agent obtains after performing action $a_{t}$ . However, the same reward may not be obtained by the agent the next time, even if the same action is considered, owing to the stochastic nature of the environment (i.e., the uncertainty of wind power generation). Therefore, it is necessary to introduce a discount factor $γ \in [0,1]$ to represent the uncertainty of the environment. The discounted cumulative reward that the agent obtains after action $a_{t}$ is performed in state $s_{t}$ is expressed as:

R (t) = \sum_{k = 0}^{T - t} γ^{k} r_{t + k}

(16)

The objective of the RL is to learn a policy, which maps the state $s_{t}$ to the action $a_{t}$ that can maximize the discounted cumulative reward. By formulating the multi-period optimization problem as an MDP with finite time steps, the problems can be solved sequentially using the DRL algorithm by considering their influence on the future. Instead of solving the multi-period optimization problem by traditional approaches, sequentially solving the MDP helps reduce the computation complexity of the proposed approach. The overall structure of the proposed approach for optimization is illustrated in Fig. 1.

Fig. 1 Overall structure of proposed approach for optimization.

It should be noted that although the introduction of the discount factor reduces the complexity of the proposed approach, the selection of $γ$ requires trial and error process, which is a deficiency of the decomposition.

B. Adopting PPO Algorithm to Solve MDP

PPO is an actor-critic based algorithm (consisting of an actor and a critic). The actor is the policy function that maps the state $s_{t}$ to the action $a_{t}$ . The critic is the value function that maps the state $s_{t}$ to a scalar that measures the quality of the input state.

The actor corresponding to the policy function is parameterized by $θ^{μ}$ . In traditional policy-based approaches, the parameters are updated by maximizing the reward [

7], which is expressed as:

\nabla R_{θ^{μ}} = 𝔼_{τ ~ p_{θ^{μ}} (τ)} (R (t) \nabla l g p_{θ^{μ}} (τ)) \approx \frac{1}{N} \sum_{n = 1}^{K} \sum_{t = 1}^{T} R (t_{n}) \nabla l g p_{θ^{μ}} (a_{t, n}^{} | s_{t, n}^{})

(17)

where $𝔼 (\cdot)$ is the expectation function; K is the number of trajectories; $p_{θ^{μ}} (a_{t, n}^{} | s_{t, n}^{})$ is the probability of taking action $a_{t, n}^{}$ in state $s_{t, n}^{}$ under the policy, which is parameterized by $θ^{μ}$ ; $\nabla l g p_{θ^{μ}} (a_{t, n}^{} | s_{t, n}^{})$ is the direction that improves the probability of choosing action $a_{t, n}^{}$ in state $s_{t, n}^{}$ ; and $R (t_{n})$ is the reward, which indicates the extent of the probability improvement. Therefore, $\nabla R_{θ^{μ}}$ can adjust the strategy in the direction that increases the probability of action with a greater reward value in state $s_{t}$ .

In (17), since $R (t_{n})$ represents the discounted cumulative reward that the agent obtains after state $s_{t, n}$ , the parameters of the actor network can only be updated after one episode is completed, which reduces the learning efficiency. To solve this problem, the critic network parameterized by $θ^{Q}$ is introduced. The critic network maps state $s_{t}$ to a scalar $V^{π} (s_{t})$ , which is the expected cumulative reward that the agent obtains after visiting state $s_{t}$ under policy $π$ . The $R (t_{n})$ in (17) can be replaced with the temporal-difference error, which is given by the value function $A (s_{t}, a_{t})$ , as shown in (18):

A (s_{t}, a_{t}) = r (s_{t}, a_{t}) + γ V^{π} (s_{t + 1}) - V^{π} (s_{t})

(18)

The temporal-difference error indicates the advantage of performing action $a_{t}$ in state $s_{t}$ over the expected reward value of all actions. Since $r (s_{t}, a_{t})$ is the immediate reward, the parameter can be updated step by step. The parameters of the value function are optimized by minimizing $L (θ^{Q})$ .

L (θ^{Q}) = 𝔼 ((V^{π} (s_{t}) - y_{t})^{2})

(19)

y_{t} = r (s_{t}, a_{t}) + γ V^{π} (s_{t + 1})

(20)

However, each batch of data can only be used to update the parameter $θ^{μ}$ once, which is a disadvantage of traditional policy gradient methods. To improve the data efficiency and prevent policy updates from becoming too large simultaneously, a clipped objective function is proposed [

30]. The parameters of the policy function

θ^{μ}

are updated by:

\begin{array}{l} L^{C L I P} (θ^{μ}) = \\ \sum_{(s_{t}, a_{t})} m i n (\frac{p_{θ^{μ}} (a_{t} | s_{t})}{p_{θ^{μ^{'}}} (a_{t} | s_{t})} A (s_{t}, a_{t}), c l i p (\frac{p_{θ^{μ}} (a_{t} | s_{t})}{p_{θ^{μ^{'}}} (a_{t} | s_{t})}, 1 - ε, 1 + ε) A (s_{t}, a_{t})) \end{array}

(21)

where $ε$ is the clipping rate, which restricts the update range of the new policy in a trusted region; and $θ^{μ^{'}}$ is the parameters of the “old” actor, which is in charge of interacting with the environment. The data generated by the “old” actor can be utilized to update the parameters of actor $θ^{μ}$ several times. The clipped function $c l i p (\cdot)$ helps the PPO algorithm achieve a trade-off among simplicity, sample complexity, and wall-time [

30].

C. DNN Architecture for Function Approximation

DNN has a powerful function fitting ability. As reported in [

31], an NN can approximate the functions of arbitrary complexity with arbitrary precision. Therefore, NNs are used to fit the value function and policy function in this paper.

In the PPO algorithm, the actor represents the policy function, which maps state $s_{t}$ to action $a_{t}$ , and $s_{t}$ and $a_{t}$ are the input and output of the policy function, respectively.

a_{t} = z_{l}^{μ} (z_{l - 1}^{μ} (\dots z_{1}^{μ} (s_{t})))

(22)

z_{i}^{μ} = g (W_{i}^{μ} o_{i - 1}^{μ} + b_{i}^{μ}) i = 2,3, \dots, l

(23)

where $z_{i}^{μ}$ is the mapping relationship of the $i^{t h}$ layer of the policy function; $o_{i - 1}^{μ}$ is the output of the ${(i - 1)}^{t h}$ layer; $W_{i}^{μ}$ and $b_{i}^{μ}$ are the weight and bias of the $i^{t h}$ layer of the policy function, respectively; and $g (\cdot)$ is the activation function of the neurons.

The critic represents the value function, which maps the state $s_{t}$ to $V^{π} (s_{t})$ :

V^{π} (s_{t}) = z_{l}^{Q} (z_{l - 1}^{Q} (\dots z_{1}^{Q} (s_{t})))

(24)

z_{i}^{Q} = f (W_{i}^{Q} o_{i - 1}^{Q} + b_{i}^{Q}) i = 2,3, . . ., l

(25)

where $z_{i}^{Q}$ is the mapping relationship of the $i^{t h}$ layer of the value function; $o_{i - 1}^{Q}$ is the output of the ${(i - 1)}^{t h}$ layer; $W_{i}^{Q}$ and $b_{i}^{Q}$ are the weight and bias of the $i^{t h}$ layer of the value function, respectively; and $f (\cdot)$ is the activation function of the neurons.

Therefore, the policy function and value function are parameterized by $θ^{μ} = {W_{1}^{μ}, b_{1}^{μ}, W_{2}^{μ}, b_{2}^{μ}, \dots, W_{l}^{μ}, b_{l}^{μ}}$ and $θ^{Q} = {W_{1}^{Q},$ $b_{1}^{Q}, W_{2}^{Q}, b_{2}^{Q}, \dots, W_{l}^{Q}, b_{l}^{Q}}$ , respectively.

D. Training Process

The training process of the DNN is presented in Algorithm 1. The parameters of the proposed approach can be denoted as $θ = {θ^{μ}, θ^{μ^{'}} θ^{Q}}$ . At the beginning of the training process, the $θ$ of all the NNs are randomly initialized. The parameters of the “old” actor $θ^{μ^{'}}$ are copied from $θ^{μ}$ . Then, the algorithm is trained for M episodes to adjust $θ$ . Several actors parameterized by $θ^{μ^{'}}$ simultaneously interact with the environment. At the beginning of an episode, each “old” actor obtains a start state $s_{1}$ of a day randomly chosen from the training data. At each time step, the actor chooses the action according to the input state $s_{t}$ . The action is then performed, and the environment transfers to the next state; simultaneously, a reward is obtained. Then, the advantage estimates are calculated using (18). When all the actors finish T time steps, the parameters of the policy network $θ^{μ}$ are updated by:

\begin{array}{l} L (θ^{μ}) = \frac{1}{M} \cdot \\ \sum_{(s_{t}, a_{t})} m i n (\frac{p_{θ^{μ}} (a_{t} | s_{t})}{p_{θ^{μ^{'}}} (a_{t} | s_{t})} A (s_{t}, a_{t}), c l i p (\frac{p_{θ^{μ}} (a_{t} | s_{t})}{p_{θ^{μ^{'}}} (a_{t} | s_{t})}, 1 - ε, 1 + ε) A (s_{t}, a_{t})) \end{array}

(26)

θ_{t + 1}^{μ} = θ_{t}^{μ} - η_{μ} \nabla_{θ^{μ}} L (θ^{μ})

(27)

where $η_{μ}$ is the learning rate for the policy network; and M is the mini-batch size. Owing to the introduction of the clipped function, the collected data can be used for updating $θ^{μ}$ several times. Simultaneously, the parameter of the critic network is updated by minimizing the loss $L (θ^{Q})$ .

L (θ^{Q}) = \frac{1}{M} (V^{π} (s_{t}) - y_{t})^{2}

(28)

θ_{t + 1}^{Q} = θ_{t}^{Q} - η_{Q} \nabla_{θ^{Q}} L (θ^{Q})

(29)

where $η_{Q}$ represents the learning rate for the critic network. At the end of each episode, set $θ^{μ^{'}} \leftarrow θ^{μ}$ . When the training is finished, the parameters of the algorithm can be output for real-time optimization of the DN.

Algorithm 1: training process of DNN

Input: $η_{Q}$ , $η_{μ}$ , $ε$ , M, $γ$ , T, $N_{a}$

Output: $π$

1: Model initialization: randomly initialize critic network $Q (s, a | θ^{Q})$ and actor $μ (s | θ^{μ})$ with parameters $θ^{Q}$ and $θ^{μ}$ , and initialize “old” actors with parameters $θ^{μ^{'}} \leftarrow θ^{μ}$

2: for $e p i s o d e = 1 : M$ do

3: for $a c t o r = 1 : N_{a}$ do

4: Start state $s_{1}$ of a random day

5: for time step $t = 1 : T$ do

6: Select action according to (22), execute $a_{t}$ , obtain reward $r_{t}$ , and the environment transfers to next state

7: Compute advantage estimates $A (s_{t}, a_{t})$ according to (18)

8: end for

9: end for

10: Optimize the parameters of the actor network $θ^{μ}$ according to (26) and (27)

Optimize the parameters of the critic network $θ^{Q}$ according to (28) and (29)

11: Update parameters of “old” actors: $θ^{μ^{'}} \leftarrow θ^{μ}$

12: end for

E. Reward Rescaling Based on Clipped Reward Function

Owing to the uncertainty of the environment, the variance of the reward is large. This reduces the accuracy of the value-function estimation and increases the variance of the policy gradient, which may reduce the convergence speed and even lead to a suboptimal policy. To address this problem, a clipped function based reward-rescaling technology is introduced in this paper. The reward sent to the value function is scaled as:

r_{t} = c l i p (\frac{r_{t} - m}{σ}, - b, b)

(30)

where $m$ and $σ$ are the mean value and variance of the cumulative discounted reward of an episode, respectively; and -b and b are the lower and upper bounds of reward $r_{t}$ , respectively. The variance of the rescaled reward is significantly reduced, which helps the value function to learn unbiasedly.

IV. Case Study

In this section, the performance of the proposed approach is analyzed according to numerical results for a DN system. First, the application scenario is presented. Second, the experimental setup is detailed. Third, the training process is described to demonstrate that the algorithm can extract useful operation knowledge from the training data to reduce the cost of power loss. Fourth, a comparison is performed using test data to illustrate the generalization ability of the extracted operation knowledge and the benefits of the proposed approach.

A. Application Scenario

The proposed approach is tested on a modified IEEE 33-bus system to demonstrate the potential for reducing the cost of power loss in the DN. The topology of the DN is shown in Fig. 2. The BSSs are connected to buses 8, 15, 24, and 31. Distributed wind turbines are connected to buses 5, 10, 16, 20, 26, 30, 35, and 36. Bus 1 is selected as the slack bus, and the other buses are PQ buses.

Fig. 2 Topology of DN for case study.

The peak price is 117 $/MWh and the off-peak price is 65 $/MWh. The rated power is 500 kW for all the wind turbines. The installed capacity of the BSS is 1000 kWh. The charging and discharging power limit are 300 kW. $η_{c h}$ and $η_{d i s}$ are both set as 0.9. The lower and upper bounds of the storage capacity are set as 20% and 90%, respectively. The wind power generation data obtained from western Denmark cover 65 days and are divided into the following two groups. The data of the first 60 days are used as training data (to train the algorithm). The data of the remaining 5 days are used as test data to evaluate the generalization ability of the extracted operation knowledge and the performance of the proposed approach.

B. DNN Architecture and Hyper-parameter Setting

The PPO algorithm is an actor-critic based DRL method that employs an online actor network, a critic network, and a target network. The actor network is a copy of the online actor network. The input of the actor network is the system state $s_{t}$ , and the output is the action $a_{t}$ . The input of the critic network is also the system state $s_{t}$ . The output is the value of the state $V^{π} (s_{t})$ . Both the actor and critic networks have three hidden layers, which have 200, 100, and 100 neurons, respectively. The NNs use the rectified linear unit for all the hidden layers and the output layer of the critic networks. The output layer of the actor network uses both the tanh activation unit and the softplus activation unit. A workstation with an NVIDIA GeForce 1080Ti graphics processing unit and an Intel Xeon E5-2630 v4 central processing unit is used for the training. The DRL algorithm is implemented in Python with TensorFlow, and the power loss is computed in MATLAB. The parameters of the DRL algorithm are presented in Table I.

TABLE I Parameters of DRL Algorithm

Parameter	Value	Parameter	Value
$γ$	0.99	$ε$	0.1
$η_{μ}$	$10^{- 3}$	M	32
$η_{Q}$	$2 \times 10^{- 3}$	T	24

C. Training Process

The proposed approach and the original PPO algorithm without the clipped reward function are trained off line for 5500 episodes to learn the operation knowledge from the training data.

There are 24 steps in each epoch, which represents one day. The cumulative reward during the training procedure is depicted in Fig. 3. The cumulative reward is shown on a log scale for better visualization. At the beginning of the training, the agent cannot make good decisions and explore the action spaces to achieve more reward information in each state. Through constant interactions with the environment, the proposed approach finally learns a good policy to achieve high cumulative rewards. The proposed approach with the clipped function converges faster than the original PPO algorithm. This is because the clipped function reduces the variance of the reward, thereby reducing the unfavourable influence of the uncertainty of the wind power generation on the approximation of the value function.

Fig. 3 Cumulative reward during training procedure.

The proportion of satisfied constraints (PSC) and the average cost of the power loss for the training data are shown in Fig. 4. At the beginning of the training, the PSC is almost 0, and the cost of the power loss is high. This is because the agent is unaware of how to make optimal decisions to reduce the cost of the power loss while satisfying the correlated constraints. Therefore, the agent attempts to explore the environment and accumulate experience. The PSC increases sharply until the 1300^th episode. At this stage, the cost of the power loss decreases sharply. In this process, the agent learns to make decisions for reducing the cost of the power loss under the correlated constraints.

Fig. 4 PSC and cost of power loss during training procedure.

From the $2000^{t h}$ to 5200^th episodes, the cost of the power loss is relatively low, while the PSC fluctuates between 0.92 and 1. This indicates that the agent has mastered the skills to reduce the cost of the power loss but sometimes violates the constraints. After approximately the 5200^th episode, the PSC is around 1, suggesting that most of the decisions made by the agent satisfy the correlated constraints. This indicates that the proposed approach can extract powerful operating knowledge from training data via the NN to reduce the cost of the power loss under correlated constraints.

D. Comparison Results

1)　Experimental Setup

To test whether the knowledge extracted by the NN can be generalized to new situations and to evaluate the performance of the proposed approach, comparative experiments are performed using test data, which cover 5 days. An uncontrolled strategy, the double DQN (DDQN) algorithm, and stochastic programming (SP) are used for comparison. The optimal solution of the proposed approach is the output of the NN, whose parameter is fixed after the training. The DDQN algorithm is an improved version of deep Q-learning, which solves the problem of overestimation of the value function when the action dimension is high [

32]. The input of the DDQN algorithm is the state

s_{t}

, and the output comprises the discrete actions. Owing to the characteristic of the DDQN algorithm, the control variables must be aggregated. There are three types of control variables:

P_{b s s}

Q_{b s s}

, and

Q_{w}

. Each action is discretized into five values. Therefore, the output layer has 125 neurons, each corresponding to a set of actions. The value functions are approximated by the NNs containing three hidden layers, the neuron numbers of which are 400, 200, and 200, respectively. Note that for the proposed approach, the batteries are uniformly controlled while the wind turbines are controlled separately. For the DDQN and the proposed approach, the uncertainty of the initial SOC of BSS is considered. At the beginning of each episode, the initial SOC of BSS is sampled from Gaussian distribution, the mean and variance of which are 0.5 and 0.1, respectively. The sampled initial SOC of BSS is bounded between 0.2 and 0.9. For the uncertainty modelling of the SP method, it is assumed that the variation of the load demand and the wind power follows a normal distribution. The mean value of the distribution is the forecasted value of the load and wind power. Two hundred sets of scenarios are generated according to the assumed distributions. Then, the number of scenarios is reduced to 20 to reduce the computation burden. Next, the particle swarm optimization algorithm is used to solve the optimization problem.

2)　Performance Evaluation

The cost of the power loss with four different methods on five consecutive test days is shown in Fig. 5. As shown, the cost of the power loss varies significantly among the different cases, owing to the variations of the distributed wind energy generation and load demand. However, the proposed approach always has the minimum cost of the power loss. This indicates that the operation knowledge extracted by the NN can be generalized to new situations with various levels of the distributed wind power generation and load demand.

Fig. 5 Cost of power loss with four different methods on five consecutive test days.

The quantitative results are presented in Table II. Compared with the DDQN method, the proposed approach requires no discretization of the actions and avoids information loss during the training procedure; thus, it achieves better results. The proposed approach also achieves better results than the SP method. This may be because the adaptive control strategies learned by the proposed approach during the training procedure are scalable to newly encountered situations. When the training is finished and the algorithm is deployed in a practical system, the proposed approach and the DDQN method can provide the control decisions in a few milliseconds. The decision process is similar to recalling past experience from memory, without resolving the optimization problem. Thus, the proposed approach can provide control decisions based on the latest observed state of the DN. However, the control decisions of SP are pre-determined and cannot be adjusted according to the latest information of the DN. The real-time decisions provided by the adaptive strategies based on the latest information of the DN can yield better results than the pre-determined decisions provided by SP method. This confirms the efficiency of the proposed approach.

TABLE II Quantitative Results of Different Methods

Method	Average cost ($/day)	Improvement (%)
Uncontrolled	88.90
DDQN	70.78	20.4
SP	69.96	21.3
Proposed	67.48	24.1

The load demand and wind power on a low-wind-speed day and the changes in the cost of the power loss are presented in Fig. 6. Since the optimization horizon is an entire day, no method ensures the global optimum during each hour. It can be observed that the cost of the power loss is high if no control strategy is applied. When the DDQN and SP methods are used, the cost of the power loss is reduced. Compared with the DDQN method, the proposed approach has the continuous action search ability, thus it avoids the information loss and achieves a better control performance. Since the proposed real-time optimization approach makes decisions based on the latest state of the DN, it obtains less cost of the power loss than the pre-determined decisions made by the SP method, i.e., $t = 9 - 18$ hours for example. This is consistent with Fig. 5 and Table II.

Fig. 6 Comparison results on low-wind-speed day. (a) Changes in load demand and wind power. (b) Cost of power loss with four different methods.

V. Conclusion

The increasing penetration of renewable energy and BSS presents great challenges for the operation of the DN. In this context, we propose a DRL-based approach for the management of the DN under uncertainty. The P-OPF problem is first formulated as an MDP with finite time steps. Then, the PPO algorithm is used to solve the MDP sequentially. NNs are used to obtain the optimal operation knowledge from historical data to deal with the uncertainties. A reward-rescaling function is introduced to reduce the influence of the uncertainty of the environment on the learning process and increase the convergence speed. The operation knowledge extracted from the historical data is scalable to newly encountered situations. When the training is complete, the proposed approach can provide control decisions in real time based on the latest state of the DN, without resolving the OPF problem. Comparative tests confirm that the proposed real-time energy management strategy can provide a more flexible control strategy than the pre-determined decisions provided by the SP method. The proposed DRL-based approach is promising for providing the real-time operation of the DN. Considering that demand response is a promising approach to reduce the power loss by providing consumers with economic incentives, we intend to include it in our future works. The safe DRL-based approach for the optimization of DN while explicitly considering the operation constraints will also be studied in our future works.

References

T. Ding, S. Liu, W. Yuan et al., “A two-stage robust reactive power optimization considering uncertain wind power integration in active distribution networks,” IEEE Transactions on Sustainable Energy, vol. 7, no. 1, pp. 301-311, Jan. 2016. [Baidu Scholar]

A. Gabash and P. Li, “Active-reactive optimal power flow in distribution networks with embedded generation and battery storage,” IEEE Transactions on Power Systems, vol. 27, no. 4, pp. 2026-2035, Nov. 2012. [Baidu Scholar]

M. Aien, M. Rashidinejad, and M. Firuzabad. “Probabilistic optimal power flow in correlated hybrid wind-PV power systems: a review and a new approach,” Renewable & Sustainable Energy Reviews, vol. 41, pp. 1437-1446, Jan. 2015. [Baidu Scholar]

N. Taher, H. Z. Meymand, and H. D. Mojarrad. “An efficient algorithm for multi-objective optimal operation management of distribution network considering fuel cell power plants,” Energy, vol. 36, pp. 119-132, Jan. 2011. [Baidu Scholar]

E. Naderi, H. Narimani, M. Fathi et al., “A novel fuzzy adaptive configuration of particle swarm optimization to solve large-scale optimal reactive power dispatch,” Applied Soft Computing, vol. 53, pp. 441-456, Apr. 2017. [Baidu Scholar]

F. Capitanescu, “Critical review of recent advances and further developments needed in AC optimal power flow,” Electric Power Systems Research, vol. 136, pp. 57-68, Jul. 2016. [Baidu Scholar]

R. S. Sutton and A. G. Barto, Reinforcement Learning: an Introduction. Cambridge: MIT Press, 1998. [Baidu Scholar]

T. Niknam, M. Zare, and J. Aghaei, “Scenario-based multiobjective volt/var control in distribution networks including renewable energy sources,” IEEE Transactions on Power Systems, vol. 27, no. 4, pp. 2004-2019, Jul. 2012. [Baidu Scholar]

Y. Xu, Z. Dong, R. Zhang et al., “Multi-timescale coordinated voltage/var control of high renewable-penetrated distribution systems,” IEEE Transactions on Power Systems, vol. 32, no. 6, pp. 4398-4408, Nov. 2017. [Baidu Scholar]

D. Bertsimas, E. Litvinov, X. A. Sun et al., “Adaptive robust optimization for the security constrained unit commitment problem,” IEEE Transactions on Power Systems, vol. 28, no. 1, pp. 52-63, Jan. 2012. [Baidu Scholar]

Y. Xu, J. Ma, Z. Dong et al., “Robust transient stability-constrained optimal power flow with uncertain dynamic loads,” IEEE Transactions on Smart Grid, vol. 8, no. 4, pp. 1911-1921, Jul. 2017. [Baidu Scholar]

F. Capitanescu and L. Wehenkel, “Computation of worst operation scenarios under uncertainty for static security management,” IEEE Transactions on Power Systems, vol. 28, no. 2, pp. 1697-1705, May 2013. [Baidu Scholar]

T. Soares, R. J. Bessa, P. Pinson et al., “Active distribution grid management based on robust AC optimal power flow,” IEEE Transactions on Smart Grid, vol. 9, no. 6, pp. 6229-6241, Nov. 2018. [Baidu Scholar]

J. F. Franco, L. F. Ochoa, and R. Romero, “AC OPF for smart distribution networks: an efficient and robust quadratic approach,” IEEE Transactions on Smart Grid, vol. 9, no. 5, pp. 4613-4623, Sept. 2018. [Baidu Scholar]

E. Dall’Anese, K. Baker, and T. Summers, “Chance-constrained AC optimal power flow for distribution systems with renewables,” IEEE Transactions on Power Systems, vol. 32, no. 5, pp. 3427-3438, Sept. 2017. [Baidu Scholar]

M. Lubin, Y. Dvorkin, and S. Backhaus, “A robust approach to chance constrained optimal power flow with renewable generation,” IEEE Transactions on Power Systems, vol. 31, no. 5, pp. 3840-3849, Sept. 2016. [Baidu Scholar]

P. Fortenbacher, A. Ulbig, S. Koch et al., “Grid-constrained optimal predictive power dispatch in large multi-level power systems with renewable energy sources, and storage devices,” IEEE PES Innovative Smart Grid Technologies, Istanbul, Turkey, Oct. 2014, pp. 1-6. [Baidu Scholar]

H. Shuai, J. Fang, X. Ai et al., “Stochastic optimization of economic dispatch for microgrid based on approximate dynamic programming,” IEEE Transactions on Smart Grid, vol. 10, no. 3, pp. 2440-2452, May 2019. [Baidu Scholar]

H. Shuai, J. Fang, X. Ai et al., “Optimal real-time operation strategy for microgrid: an ADP-based stochastic nonlinear optimization approach,” IEEE Transactions on Sustainable Energy, vol. 10, no. 2, pp. 931-942, Apr. 2019. [Baidu Scholar]

V. Bui, A. Hussain, and H. Kim, “Double deep Q-learning-based distributed operation of battery energy storage system considering uncertainties,” IEEE Transactions on Smart Grid, vol. 11, no. 1, pp. 457-469, Jan. 2020. [Baidu Scholar]

W. Wang, N. Yu, Y. Gao et al., “Safe off-policy deep reinforcement learning algorithm for volt-var control in power distribution systems,” IEEE Transactions on Smart Grid, vol. 11, no. 4, pp. 3008-3018, Jul. 2020. [Baidu Scholar]

E. Mocanu, D. Mocanu, P. Nguyen et al., “On-line building energy optimization using deep reinforcement learning,” IEEE Transactions on Smart Grid, vol. 10, no. 4, pp. 3698-3708, Jul. 2019. [Baidu Scholar]

G. Zhang, W. Hu, D. Cao et al., “Deep reinforcement learning-based approach for proportional resonance power system stabilizer to prevent ultra-low-frequency oscillations,” IEEE Transactions on Smart Grid, vol. 11, no. 6, pp. 5260-5272, Nov. 2020. [Baidu Scholar]

D. Cao, W. Hu, J. Zhao et al., “Reinforcement learning and its applications in modern power and energy systems: a review,” Journal of Modern Power Systems and Clean Energy, vol. 8, no. 6, pp. 1029-1042, Nov. 2020. [Baidu Scholar]

D. Cao, W. Hu, J. B. Zhao et al., “A multi-agent deep reinforcement learning based voltage regulation using coordinated PV inverters,” IEEE Transactions on Power Systems, vol. 35, no. 5, pp. 4120-4123, Sept. 2020. [Baidu Scholar]

X. Qi, G. Wu, K. Boriboonsomsinet et al., “Data-driven reinforcement learning-based real-time energy management system for plug-in hybrid electric vehicles,” Transportation Research Record, vol. 2572, no. 1, pp. 1-8, Jan. 2016. [Baidu Scholar]

V. Mnih, K. Kavukcuoglu, D. Silver et al. (2013, Dec.). Playing Atari with deep reinforcement learning. [Online]. Available: https://arxiv.org/abs/1312.5602 [Baidu Scholar]

V. Mnih, K. Kavukcuoglu, D. Silver et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, pp. 529-533, Feb. 2015. [Baidu Scholar]

G. Kira. “Harvesting the wind: the physics of wind turbines,” Physics and Astronomy Comps Papers, vol. 2015, pp. 1-41, Apr. 2005. [Baidu Scholar]

J. Schulman, F. Wolski, P. Dhariwal et al. (2017, Jul.). Proximal policy optimization algorithms. [Online]. Available: https://arxiv.org/abs/1707.06347 [Baidu Scholar]

K. Hornik, “Approximation capabilities of multilayer feedforward networks,” Neural Networks, vol. 4, pp. 251-257, Jan. 1991. [Baidu Scholar]

H. Van Hasselt, A. Guez, and D. Silver. (2015, Sept.). Deep reinforcement learning with double q-learning. [Online]. Available: https://arxiv.org/abs/1509.06461 [Baidu Scholar]

Address:No.19 Chengxin Avenue, Jiangning District, Nanjing 211106, China

E-mail: mpce@alljournals.cn

Tel:86-25-81093060

Fax:86-25-81093040

Home

Introduction

Editorial Board

For Author

Call For Papers

APC

Sponsor & Publisher