Abstract
This study proposes a deep reinforcement learning (DRL) based approach to analyze the optimal power flow (OPF) of distribution networks (DNs) embedded with renewable energy and storage devices. First, the OPF of the DN is formulated as a stochastic nonlinear programming problem. Then, the multi-period nonlinear programming decision problem is formulated as a Markov decision process (MDP), which is composed of multiple single-time-step sub-problems. Subsequently, a state-of-the-art DRL algorithm, i.e., proximal policy optimization (PPO), is used to solve the MDP sequentially, considering the impact of current decisions on the future. Neural networks are used to extract operation knowledge from historical data offline and provide online decisions according to the real-time state of the DN. The proposed approach fully exploits the historical data and reduces the influence of the prediction error on the optimization results. The proposed real-time control strategy can provide more flexible decisions and achieve better performance than pre-determined ones. Comparative results demonstrate the effectiveness of the proposed approach.
In the context of the energy shortage, climate change, and environmental protection, the development of clean energy and a low-carbon economy, as well as the optimal allocation of energy, is essential [
The optimal power flow (OPF) problems of the DN can be classified into two categories. The first category is deterministic OPF problems. Specific values of the load demand, sustainable generation, and particular network conditions are usually needed to solve this type of problem. Various mathematical approaches [
The second category is probabilistic OPF (P-OPF) problems. To deal with the uncertainty of the DN, numerous approaches for solving the P-OPF problems have been proposed. References [
In recent years, machine learning (ML) has been a popular research topic in computer science. By continuously extracting knowledge from historical data, ML-based methods can generate powerful models to deal with the uncertainty and dynamics of a system without a physical model. The learned models can be generalized to new situations and provide control decisions in real time [
Various energy management strategies based on the DRL algorithms have been proposed [
Inspired by recent research, we develop a DPG-based method with continuous action search to solve the P-OPF problem of the DN with renewable energy generation and battery storage systems (BSSs). The multi-period P-OPF problem is first formulated as a Markov decision process (MDP). Then, the proximal policy optimization (PPO) algorithm, which is a state-of-the-art DPG-based method, is used to solve the MDP by sequentially considering the influence of the current action on the future. Neural networks (NNs) are used to extract the optimal operation knowledge from historical data to cope with the uncertainties. The model considers the uncertainty of the demand, the initial energy level of the BSS, and the wind power generation. It aims to minimize the cost of the power loss by controlling the BSS and the reactive power of the wind turbine under the relevant constraints. Comparative experiments are performed using a modified IEEE 33-bus DN to evaluate the performance of the proposed approach. The main contributions of this paper are presented as follows.
First, a real-time energy management strategy for the DN based on the DRL algorithm is proposed. The proposed approach embeds the operation knowledge extracted from historical data in a deep neural network (DNN) to make near-optimal control decisions in real time. The extracted operation knowledge is adaptive to the uncertainty of the system and can be generalized to newly encountered situations. The decision process is similar to recalling past experience from memory when a new state is observed, without resolving the OPF problem. Therefore, the proposed approach can be used for the online optimization of the DN and provides a better response to system dynamics.
Second, the proposed approach decomposes the multi-period decision problem into multiple single-time-step sub-problems, which are sequentially solved while considering their impact on the future. This reduces the computation complexity introduced by the time correlation of the storage devices.
The remainder of this paper is organized as follows. In Section II, the problem formulation is presented. The principle of the proposed approach and the training process are introduced in Section III. The experimental details and the results of a case study are presented in Section IV. Finally, Section V concludes the paper.
In this section, the mathematical model of the P-OPF problem with wind turbines, load demand, and BSS is presented.
The objective of the P-OPF problem is to minimize the cost of power loss. The optimization horizon is 1 day, and the time interval of optimal scheduling is 1 hour. The objective function is formulated as:
$$\min\; F=\sum_{t=1}^{T}c_{t}P_{\text{loss},t} \tag{1}$$
$$P_{\text{loss},t}=\sum_{i=1}^{N}\sum_{j=1}^{N}G_{ij}\left(e_{i,t}e_{j,t}+f_{i,t}f_{j,t}\right) \tag{2}$$
where F is the total cost of the power loss for an optimization horizon; $P_{\text{loss},t}$ is the power loss of the DN during hour t; $c_{t}$ is the electricity price during hour t; $G_{ij}$ is the real component of the complex admittance matrix elements; $e_{i,t}$ is the real component of the complex voltage at bus i during hour t; $f_{i,t}$ is the imaginary component of the complex voltage at bus i during hour t; T is the length of one trajectory; and N is the number of nodes in the DN. The control variables are $P^{B}_{k,t}$, $Q^{B}_{k,t}$, and $Q^{W}_{k,t}$, which represent the active power of the BSS, the reactive power of the power conditioning system (PCS) of the BSS, and the reactive power of the wind turbine, respectively.
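For concreteness, the objective (1)-(2) can be evaluated as in the following Python sketch. It assumes the reconstructed notation above (conductance matrix $G$, rectangular voltage components $e$ and $f$, and hourly prices $c_t$) and is an illustration rather than the authors' implementation.

```python
import numpy as np

def power_loss(G, e, f):
    """Network power loss (2): sum_i sum_j G_ij (e_i e_j + f_i f_j)."""
    G, e, f = np.asarray(G), np.asarray(e), np.asarray(f)
    return float(e @ G @ e + f @ G @ f)

def total_cost(G, prices, e_traj, f_traj):
    """Total cost of power loss (1) over the T-hour optimization horizon."""
    return sum(c_t * power_loss(G, e_t, f_t)
               for c_t, e_t, f_t in zip(prices, e_traj, f_traj))
```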
The constraints of the active and reactive power of the wind turbine are expressed as [
$$P^{W}_{k,t}=\begin{cases}0, & v<v_{\text{in}}\ \text{or}\ v>v_{\text{out}}\\[4pt] P^{W}_{k,\text{r}}\dfrac{v-v_{\text{in}}}{v_{\text{r}}-v_{\text{in}}}, & v_{\text{in}}\le v\le v_{\text{r}}\\[4pt] P^{W}_{k,\text{r}}, & v_{\text{r}}<v\le v_{\text{out}}\end{cases} \tag{3}$$
$$\left(P^{W}_{k,t}\right)^{2}+\left(Q^{W}_{k,t}\right)^{2}\le \left(S^{W}_{k}\right)^{2} \tag{4}$$
where $P^{W}_{k,t}$ is the active power of wind turbine k during hour t; $P^{W}_{k,\text{r}}$ is the rated power of wind turbine k; $v$, $v_{\text{r}}$, $v_{\text{in}}$, and $v_{\text{out}}$ are the actual speed, rated speed, cut-in speed, and cut-out speed of the wind turbine, respectively; $Q^{W}_{k,t}$ is the reactive power of wind turbine k during hour t; and $S^{W}_{k}$ is the upper bound of the apparent power of wind turbine k. The parameters of the wind turbine are m/s, m/s, and m/s.
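The wind model can be sketched in Python as follows. The linear ramp between the cut-in and rated speeds is a common simplification assumed here, since the exact power curve used in (3) is not reproduced above; the function and argument names are illustrative.

```python
def wind_power(v, p_rated, v_in, v_rated, v_out):
    """Piecewise wind power curve (3): zero outside [v_in, v_out],
    an assumed linear ramp up to the rated speed, rated power above it."""
    if v < v_in or v > v_out:
        return 0.0
    if v <= v_rated:
        return p_rated * (v - v_in) / (v_rated - v_in)
    return p_rated

def wind_reactive_feasible(p_wt, q_wt, s_max):
    """Apparent power limit (4) of the wind turbine."""
    return p_wt ** 2 + q_wt ** 2 <= s_max ** 2
```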
The BSS consists of a storage unit and a PCS unit. The PCS controls the charging and discharging processes and permits the outputs of active and reactive power, in accordance with the following constraints:
$$\left(P^{B}_{k,t}\right)^{2}+\left(Q^{B}_{k,t}\right)^{2}\le \left(S^{B}_{k}\right)^{2} \tag{5}$$
$$-P^{B}_{k,\max}\le P^{B}_{k,t}\le P^{B}_{k,\max} \tag{6}$$
where $P^{B}_{k,t}$ is the active power of BSS k during hour t (when BSS k is charging, $P^{B}_{k,t}$ is a positive value; when it is discharging, $P^{B}_{k,t}$ is a negative value); $Q^{B}_{k,t}$ is the reactive power of BSS k during hour t; $S^{B}_{k}$ is the upper limit of the apparent power of BSS k; and $P^{B}_{k,\max}$ is the charging power limit of BSS k.
The energy balance of the BSS should satisfy (7).
$$SOC_{k,t+1}=\begin{cases}SOC_{k,t}+\dfrac{\eta_{\text{ch}}P^{B}_{k,t}\Delta t}{E^{B}_{k}}, & P^{B}_{k,t}\ge 0\\[6pt] SOC_{k,t}+\dfrac{P^{B}_{k,t}\Delta t}{\eta_{\text{dis}}E^{B}_{k}}, & P^{B}_{k,t}<0\end{cases} \tag{7}$$
where $SOC_{k,t}$ is the state of charge (SOC) of BSS k during hour t; $\eta_{\text{ch}}$ and $\eta_{\text{dis}}$ are the charging and discharging coefficients, respectively; $E^{B}_{k}$ is the storage capacity of BSS k; and $\Delta t$ is the length of one time interval. The storage capacity cannot cross the lower or upper bound (20% or 90% of the storage capacity, respectively).
$$SOC_{k,\min}\le SOC_{k,t}\le SOC_{k,\max} \tag{8}$$
where $SOC_{k,\min}$ and $SOC_{k,\max}$ are the lower and upper bounds of the SOC of BSS k, respectively. Owing to the uncertainty of the load demand and renewable energy generation during intra-day operation, the BSS needs to be scheduled flexibly to cope with these uncertainties in practice. Therefore, the remaining energy level of the BSS is uncertain. To better simulate real operating conditions and fully exploit the BSS, the uncertainty of the initial energy level of the BSS is taken into account.
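A minimal sketch of the SOC transition (7) and the capacity bounds (8), assuming the SOC is expressed as a fraction of the storage capacity and a 1-hour time interval; the function and argument names are illustrative.

```python
def soc_update(soc, p_bss, eta_ch, eta_dis, capacity, dt=1.0):
    """SOC transition (7): p_bss > 0 means charging, p_bss < 0 discharging."""
    if p_bss >= 0:
        return soc + eta_ch * p_bss * dt / capacity
    return soc + p_bss * dt / (eta_dis * capacity)

def soc_feasible(soc, soc_min=0.2, soc_max=0.9):
    """Capacity bounds (8): 20%-90% of the storage capacity."""
    return soc_min <= soc <= soc_max
```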
The power flow constraints are expressed as:
$$P_{i,t}=e_{i,t}\sum_{j=1}^{N}\left(G_{ij}e_{j,t}-B_{ij}f_{j,t}\right)+f_{i,t}\sum_{j=1}^{N}\left(G_{ij}f_{j,t}+B_{ij}e_{j,t}\right) \tag{9}$$
$$Q_{i,t}=f_{i,t}\sum_{j=1}^{N}\left(G_{ij}e_{j,t}-B_{ij}f_{j,t}\right)-e_{i,t}\sum_{j=1}^{N}\left(G_{ij}f_{j,t}+B_{ij}e_{j,t}\right) \tag{10}$$
$$P_{i,t}=P^{W}_{i,t}-P^{B}_{i,t}-P^{L}_{i,t} \tag{11}$$
$$Q_{i,t}=Q^{W}_{i,t}+Q^{B}_{i,t}-Q^{L}_{i,t} \tag{12}$$
where $B_{ij}$ is the imaginary component of the complex admittance matrix elements; $P_{i,t}$ and $Q_{i,t}$ are the injection values of the active and reactive power at bus i during hour t, respectively; and $P^{L}_{i,t}$ and $Q^{L}_{i,t}$ are the active and reactive power of the load demand at bus i during hour t, respectively. Equations (9)-(12) describe the nodal active and reactive power balance of the DN.
The voltage constraint is expressed as:
$$V_{i,\min}\le V_{i,t}\le V_{i,\max} \tag{13}$$
where $V_{i,t}=\sqrt{e_{i,t}^{2}+f_{i,t}^{2}}$ is the voltage magnitude at bus i during hour t; and $V_{i,\min}$ and $V_{i,\max}$ are the lower and upper bounds of the voltage at bus i, respectively.
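The power flow quantities in (9)-(13) can be evaluated in rectangular coordinates as sketched below; the voltage bounds of 0.95 p.u. and 1.05 p.u. are illustrative defaults, not values taken from the case study.

```python
import numpy as np

def injections(G, B, e, f):
    """Nodal active/reactive power injections (9)-(10) in rectangular coordinates."""
    G, B, e, f = map(np.asarray, (G, B, e, f))
    re_i = G @ e - B @ f          # real part of the injected current at each bus
    im_i = G @ f + B @ e          # imaginary part of the injected current at each bus
    p = e * re_i + f * im_i       # active power injection P_i
    q = f * re_i - e * im_i       # reactive power injection Q_i
    return p, q

def voltage_feasible(e, f, v_min=0.95, v_max=1.05):
    """Voltage magnitude limits (13); the bounds here are illustrative."""
    v = np.sqrt(np.asarray(e) ** 2 + np.asarray(f) ** 2)
    return bool(np.all((v >= v_min) & (v <= v_max)))
```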
The P-OPF problem formulated above is a stochastic nonlinear programming problem with high complexity owing to the coupling in both the network domain and the time domain introduced by the BSS. This study proposes a DRL-based approach to solve this problem, which is described in detail in Section III.
In this section, the OPF problem is modelled as an MDP first, and then the PPO algorithm is used to solve the MDP. Subsequently, the DNN architecture for function approximation is presented. Finally, the training process of the proposed approach is illustrated in detail.
The MDP is used to model RL problems. As the optimization of the DN is a sequential decision-making problem, it can be modelled as an MDP with finite time steps. The MDP can be divided into four parts: $(S, A, P, R)$.
1) S represents the state set. The state $s_t$ is composed of five parts, including the SOC of the BSS, the wind power generation, and the load demand of the DN during hour t.
2) A represents the action set. The action $a_t$ is composed of three parts: the control variables $P^{B}_{k,t}$, $Q^{B}_{k,t}$, and $Q^{W}_{k,t}$ defined in Section II.
3) P represents the probability of a transition to the next state $s_{t+1}$ after action $a_t$ is taken in state $s_t$. The state transition from $s_t$ to $s_{t+1}$ can be expressed as $s_{t+1}=f\left(s_t,a_t,\omega_t\right)$, where $\omega_t$ represents the randomness of the environment. The state transition for the SOC of the BSS is controlled by the charging/discharging action $P^{B}_{k,t}$. This can be denoted explicitly by the equality constraint in (7). Since the wind power generation and load demand for the next hour are not accurately known, the state transitions of the wind power and the load demand are subject to the environmental randomness. However, it is difficult to accurately model the randomness in practice. To address this problem, a model-free DRL-based approach is used to learn the transition procedure from historical data, as described in Section III-B.
4) R represents the reward $r_t$ obtained after action $a_t$ is taken in state $s_t$. A single-step reward is defined as:
$$r_t=-\left(c_t P_{\text{loss},t}+\lambda\sigma_t\right) \tag{14}$$
$$\sigma_t=\sigma_{V,t}+\sigma_{S,t}+\sigma_{SOC,t} \tag{15}$$
where $\sigma_{V,t}$ is the penalty applied when the voltage exceeds the limit; $\sigma_{S,t}$ is the penalty applied when the capability limitation of the PCS is not satisfied; $\sigma_{SOC,t}$ is the penalty applied when the upper or lower bound of the storage unit is exceeded; and $\lambda$ is a coefficient. The units of $\sigma_{V,t}$, $\sigma_{S,t}$, and $\sigma_{SOC,t}$ are $/MWh; thus, the penalty terms have the same unit of measurement as the cost of the power loss.
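Under the reconstructed form of (14)-(15), a single-step reward could be computed as in the following sketch; the penalty terms are passed in as precomputed values, and the default penalty coefficient is illustrative.

```python
def single_step_reward(price, p_loss, sigma_v, sigma_s, sigma_soc, lam=1.0):
    """Single-step reward (14)-(15): negative cost of power loss plus
    penalties for voltage, PCS capability, and SOC violations.
    lam is the penalty coefficient; its value here is illustrative."""
    return -(price * p_loss + lam * (sigma_v + sigma_s + sigma_soc))
```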
At time step t, the agent makes a decision $a_t$ based on the observation $s_t$ of the environment and then obtains a reward $r_t$. Then, the environment transfers to the next state $s_{t+1}$. This is an MDP. In the context of the P-OPF, the SOC of the BSS is a continuous variable, which is affected by the charging/discharging action performed by the agent. Therefore, when determining $a_t$, it is reasonable to consider the future reward that the agent obtains after performing action $a_t$. However, the same reward may not be obtained by the agent the next time, even if the same action is taken, owing to the stochastic nature of the environment (i.e., the uncertainty of wind power generation). Therefore, it is necessary to introduce a discount factor $\gamma$ to represent the uncertainty of the environment. The discounted cumulative reward $R_t$ that the agent obtains after action $a_t$ is performed in state $s_t$ is expressed as:
$$R_t=\sum_{i=t}^{T}\gamma^{\,i-t}r_i \tag{16}$$
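The discounted cumulative reward (16) can be computed recursively from the tail of a trajectory, as sketched below; the discount factor value is illustrative.

```python
def discounted_return(rewards, gamma=0.98):
    """Discounted cumulative reward (16) from time step t onward.
    rewards = [r_t, r_{t+1}, ..., r_T]; gamma is the discount factor."""
    ret = 0.0
    for r in reversed(rewards):
        ret = r + gamma * ret
    return ret
```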
The objective of the RL is to learn a policy, which maps the state to the action that maximizes the discounted cumulative reward. By formulating the multi-period optimization problem as an MDP with finite time steps, the sub-problems can be solved sequentially using the DRL algorithm while considering their influence on the future. Compared with solving the multi-period optimization problem with traditional approaches, sequentially solving the MDP helps reduce the computation complexity of the proposed approach. The overall structure of the proposed approach for optimization is illustrated in Fig. 1.

Fig. 1 Overall structure of proposed approach for optimization.
It should be noted that although the introduction of the discount factor reduces the complexity of the proposed approach, the selection of $\gamma$ requires a trial-and-error process, which is a deficiency of the decomposition.
PPO is an actor-critic based algorithm (consisting of an actor and a critic). The actor is the policy function that maps the state $s_t$ to the action $a_t$. The critic is the value function that maps the state $s_t$ to a scalar that measures the quality of the input state.
The actor corresponding to the policy function is parameterized by $\theta$. In traditional policy-based approaches, the parameters are updated by maximizing the reward [
$$\nabla_{\theta}J(\theta)=\mathbb{E}\left[\nabla_{\theta}\log\pi_{\theta}\left(a_t\mid s_t\right)R_t\right]\approx\frac{1}{K}\sum_{k=1}^{K}\sum_{t=1}^{T}\nabla_{\theta}\log\pi_{\theta}\left(a_{t,k}\mid s_{t,k}\right)R_{t,k} \tag{17}$$
where $\mathbb{E}$ is the expectation function; K is the number of trajectories; $\pi_{\theta}\left(a_t\mid s_t\right)$ is the probability of taking action $a_t$ in state $s_t$ under the policy, which is parameterized by $\theta$; $\nabla_{\theta}\log\pi_{\theta}\left(a_t\mid s_t\right)$ is the direction that improves the probability of choosing action $a_t$ in state $s_t$; and $R_t$ is the reward, which indicates the extent of the probability improvement. Therefore, (17) adjusts the strategy in the direction that increases the probability of choosing actions with greater reward values in a given state.
In (17), since $R_t$ represents the discounted cumulative reward that the agent obtains after state $s_t$, the parameters of the actor network can only be updated after one episode is completed, which reduces the learning efficiency. To solve this problem, the critic network parameterized by $\phi$ is introduced. The critic network maps state $s_t$ to a scalar $V_{\phi}\left(s_t\right)$, which is the expected cumulative reward that the agent obtains after visiting state $s_t$ under policy $\pi_{\theta}$. The $R_t$ in (17) can be replaced with the temporal-difference error, which is given by the value function $V_{\phi}$, as shown in (18):
$$A_t=r_t+\gamma V_{\phi}\left(s_{t+1}\right)-V_{\phi}\left(s_t\right) \tag{18}$$
The temporal-difference error $A_t$ indicates the advantage of performing action $a_t$ in state $s_t$ over the expected reward value of all actions. Since $r_t$ is the immediate reward, the parameter $\theta$ can be updated step by step. The parameters $\phi$ of the value function are optimized by minimizing the loss $L(\phi)$:
$$L(\phi)=\frac{1}{T}\sum_{t=1}^{T}\left(y_t-V_{\phi}\left(s_t\right)\right)^{2} \tag{19}$$
$$y_t=r_t+\gamma V_{\phi}\left(s_{t+1}\right) \tag{20}$$
where $y_t$ is the target value of the critic.
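The advantage estimate (18) and the critic loss (19)-(20) reduce to a few lines of code; the sketch below assumes batched value predictions are already available and is not tied to the authors' implementation.

```python
import numpy as np

def advantage(rewards, values, next_values, gamma=0.98):
    """One-step temporal-difference error (18), used as the advantage estimate."""
    r, v, v_next = map(np.asarray, (rewards, values, next_values))
    return r + gamma * v_next - v

def critic_loss(rewards, values, next_values, gamma=0.98):
    """Mean squared error (19) between the TD target (20) and the value estimate."""
    r, v, v_next = map(np.asarray, (rewards, values, next_values))
    targets = r + gamma * v_next
    return float(np.mean((targets - v) ** 2))
```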
However, each batch of data can only be used to update the parameters $\theta$ once, which is a disadvantage of traditional policy gradient methods. To improve the data efficiency and simultaneously prevent policy updates from becoming too large, a clipped objective function is proposed [
$$L^{\text{CLIP}}(\theta)=\mathbb{E}_t\left[\min\left(\frac{\pi_{\theta}\left(a_t\mid s_t\right)}{\pi_{\theta_{\text{old}}}\left(a_t\mid s_t\right)}A_t,\ \text{clip}\left(\frac{\pi_{\theta}\left(a_t\mid s_t\right)}{\pi_{\theta_{\text{old}}}\left(a_t\mid s_t\right)},1-\varepsilon,1+\varepsilon\right)A_t\right)\right] \tag{21}$$
where $\varepsilon$ is the clipping rate, which restricts the update range of the new policy to a trusted region; and $\theta_{\text{old}}$ are the parameters of the “old” actor, which is in charge of interacting with the environment. The data generated by the “old” actor can be utilized to update the parameters of the actor several times. The clipped function helps the PPO algorithm achieve a trade-off among simplicity, sample complexity, and wall-time [
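The clipped surrogate objective (21) is sketched below using log-probabilities for numerical stability; the clipping rate of 0.2 is the default suggested in the PPO paper and may differ from the value used in this study.

```python
import numpy as np

def ppo_clip_objective(log_prob_new, log_prob_old, advantages, eps=0.2):
    """Clipped surrogate objective (21) of PPO; eps is the clipping rate."""
    ratio = np.exp(np.asarray(log_prob_new) - np.asarray(log_prob_old))
    adv = np.asarray(advantages)
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * adv
    return float(np.mean(np.minimum(unclipped, clipped)))
```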
A DNN has a powerful function-fitting ability. As reported in [
In the PPO algorithm, the actor represents the policy function, which maps state $s_t$ to action $a_t$; $s_t$ and $a_t$ are the input and output of the policy function, respectively.
$$h_{l}=f_{l}\left(h_{l-1}\right)=\sigma\left(W_{l}h_{l-1}+b_{l}\right),\quad l=1,2,\cdots,L \tag{22}$$
$$a_t=\pi_{\theta}\left(s_t\right)=f_{L}\left(f_{L-1}\left(\cdots f_{1}\left(s_t\right)\right)\right) \tag{23}$$
where $f_l$ is the mapping relationship of the $l$-th layer of the policy function; $h_l$ is the output of the $l$-th layer, with $h_0=s_t$; $W_l$ and $b_l$ are the weight and bias of the $l$-th layer of the policy function, respectively; and $\sigma(\cdot)$ is the activation function of the neurons.
The critic represents the value function, which maps the state $s_t$ to $V_{\phi}\left(s_t\right)$:
$$h'_{l}=g_{l}\left(h'_{l-1}\right)=\sigma\left(W'_{l}h'_{l-1}+b'_{l}\right),\quad l=1,2,\cdots,L \tag{24}$$
$$V_{\phi}\left(s_t\right)=g_{L}\left(g_{L-1}\left(\cdots g_{1}\left(s_t\right)\right)\right) \tag{25}$$
where $g_l$ is the mapping relationship of the $l$-th layer of the value function; $h'_l$ is the output of the $l$-th layer, with $h'_0=s_t$; $W'_l$ and $b'_l$ are the weight and bias of the $l$-th layer of the value function, respectively; and $\sigma(\cdot)$ is the activation function of the neurons.
Therefore, the policy function and value function are parameterized by $\theta=\{W_l,b_l\}_{l=1}^{L}$ and $\phi=\{W'_l,b'_l\}_{l=1}^{L}$, respectively.
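Equations (22)-(25) describe standard feedforward compositions, which can be written compactly as follows; the activation choice here is a placeholder, since the actual activations are described in Section IV.

```python
import numpy as np

def mlp_forward(x, weights, biases, activation=np.tanh):
    """Layer-by-layer mapping (22)-(25): an affine transform followed by a
    nonlinear activation at every layer, with h_0 = s_t as the input."""
    h = np.asarray(x)
    for W, b in zip(weights, biases):
        h = activation(np.asarray(W) @ h + np.asarray(b))
    return h
```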
The training process of the DNN is presented in Algorithm 1. The parameters of the proposed approach can be denoted as $\{\theta,\theta_{\text{old}},\phi\}$. At the beginning of the training process, the parameters $\theta$ and $\phi$ of all the NNs are randomly initialized. The parameters $\theta_{\text{old}}$ of the “old” actor are copied from $\theta$. Then, the algorithm is trained for M episodes to adjust the parameters. Several “old” actors parameterized by $\theta_{\text{old}}$ simultaneously interact with the environment. At the beginning of an episode, each “old” actor obtains a start state of a day randomly chosen from the training data. At each time step, the actor chooses the action $a_t$ according to the input state $s_t$. The action is then performed, and the environment transfers to the next state; simultaneously, a reward $r_t$ is obtained. Then, the advantage estimates $A_t$ are calculated using (18). When all the actors finish T time steps, the parameters of the policy network are updated by:
$$\nabla_{\theta}L^{\text{CLIP}}(\theta)\approx\frac{1}{M}\sum_{m=1}^{M}\nabla_{\theta}\min\left(\frac{\pi_{\theta}\left(a_m\mid s_m\right)}{\pi_{\theta_{\text{old}}}\left(a_m\mid s_m\right)}A_m,\ \text{clip}\left(\frac{\pi_{\theta}\left(a_m\mid s_m\right)}{\pi_{\theta_{\text{old}}}\left(a_m\mid s_m\right)},1-\varepsilon,1+\varepsilon\right)A_m\right) \tag{26}$$
$$\theta\leftarrow\theta+\alpha_{\text{a}}\nabla_{\theta}L^{\text{CLIP}}(\theta) \tag{27}$$
where $\alpha_{\text{a}}$ is the learning rate for the policy network; and M is the mini-batch size. Owing to the introduction of the clipped function, the collected data can be used to update $\theta$ several times. Simultaneously, the parameters $\phi$ of the critic network are updated by minimizing the loss $L(\phi)$:
$$\nabla_{\phi}L(\phi)\approx\frac{1}{M}\sum_{m=1}^{M}\nabla_{\phi}\left(y_m-V_{\phi}\left(s_m\right)\right)^{2} \tag{28}$$
$$\phi\leftarrow\phi-\alpha_{\text{c}}\nabla_{\phi}L(\phi) \tag{29}$$
where $\alpha_{\text{c}}$ represents the learning rate for the critic network. At the end of each episode, set $\theta_{\text{old}}\leftarrow\theta$. When the training is finished, the parameters of the algorithm can be output for the real-time optimization of the DN.
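Algorithm 1 can be summarized by the following sketch of one training organization. The environment and network interfaces (env.reset, env.step, actor.update, etc.) are hypothetical placeholders, and the hyper-parameter values are illustrative rather than those used in the case study.

```python
import numpy as np

def train(env, actor, old_actor, critic, episodes=5500, horizon=24,
          gamma=0.98, ppo_epochs=10):
    """One possible single-actor organization of Algorithm 1."""
    for _ in range(episodes):
        old_actor.copy_from(actor)                       # theta_old <- theta
        states, actions, rewards, next_states = [], [], [], []
        s = env.reset()                                  # random start state of a day
        for _ in range(horizon):                         # T = 24 hourly steps
            a = old_actor.sample(s)                      # "old" actor interacts
            s_next, r = env.step(a)
            states.append(s); actions.append(a)
            rewards.append(r); next_states.append(s_next)
            s = s_next
        v = np.asarray(critic.predict(states))
        v_next = np.asarray(critic.predict(next_states))
        adv = np.asarray(rewards) + gamma * v_next - v   # advantage estimates (18)
        targets = np.asarray(rewards) + gamma * v_next   # TD targets (20)
        for _ in range(ppo_epochs):                      # reuse the batch several times
            actor.update(states, actions, adv, old_actor)   # gradient ascent on (26)-(27)
            critic.update(states, targets)                  # gradient descent on (28)-(29)
```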
Owing to the uncertainty of the environment, the variance of the reward is large. This reduces the accuracy of the value-function estimation and increases the variance of the policy gradient, which may reduce the convergence speed and even lead to a suboptimal policy. To address this problem, a clipped-function-based reward-rescaling technique is introduced in this paper. The reward sent to the value function is scaled as:
$$\tilde{r}_t=\text{clip}\left(\frac{r_t-\mu_{R}}{\sigma_{R}},-b,b\right) \tag{30}$$
where $\mu_{R}$ and $\sigma_{R}$ are the mean value and standard deviation of the cumulative discounted reward of an episode, respectively; and $-b$ and $b$ are the lower and upper bounds of the rescaled reward $\tilde{r}_t$, respectively. The variance of the rescaled reward is significantly reduced, which helps the value function to learn in an unbiased manner.
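A minimal sketch of the reward rescaling (30); the clipping bound and the small constant added to the denominator are illustrative choices.

```python
import numpy as np

def rescale_reward(r, mean_return, std_return, b=10.0, eps=1e-8):
    """Clipped reward rescaling (30) applied to the reward sent to the critic."""
    return float(np.clip((r - mean_return) / (std_return + eps), -b, b))
```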
In this section, the performance of the proposed approach is analyzed according to numerical results for a DN system. First, the application scenario is presented. Second, the experimental setup is detailed. Third, the training process is described to demonstrate that the algorithm can extract useful operation knowledge from the training data to reduce the cost of power loss. Fourth, a comparison is performed using test data to illustrate the generalization ability of the extracted operation knowledge and the benefits of the proposed approach.
The proposed approach is tested on a modified IEEE 33-bus system to demonstrate the potential for reducing the cost of power loss in the DN. The topology of the DN is shown in Fig. 2.

Fig. 2 Topology of DN for case study.
The peak price is 117 $/MWh and the off-peak price is 65 $/MWh. The rated power is 500 kW for all the wind turbines. The installed capacity of the BSS is 1000 kWh. The charging and discharging power limits are both 300 kW. $\eta_{\text{ch}}$ and $\eta_{\text{dis}}$ are both set as 0.9. The lower and upper bounds of the storage capacity are set as 20% and 90%, respectively. The wind power generation data obtained from western Denmark cover 65 days and are divided into the following two groups. The data of the first 60 days are used as training data (to train the algorithm). The data of the remaining 5 days are used as test data to evaluate the generalization ability of the extracted operation knowledge and the performance of the proposed approach.
The PPO algorithm is an actor-critic based DRL method that employs an online actor network, a critic network, and an “old” actor network. The “old” actor network is a copy of the online actor network. The input of the actor network is the system state $s_t$, and the output is the action $a_t$. The input of the critic network is also the system state $s_t$. The output is the value of the state $V_{\phi}\left(s_t\right)$. Both the actor and critic networks have three hidden layers, which have 200, 100, and 100 neurons, respectively. The NNs use the rectified linear unit for all the hidden layers and the output layer of the critic network. The output layer of the actor network uses both the tanh activation unit and the softplus activation unit. A workstation with an NVIDIA GeForce 1080Ti graphics processing unit and an Intel Xeon E5-2630 v4 central processing unit is used for the training. The DRL algorithm is implemented in Python with TensorFlow, and the power loss is computed in MATLAB. The parameters of the DRL algorithm are presented in
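The described architecture can be assembled with tf.keras as in the sketch below; interpreting the tanh and softplus outputs as the mean and standard deviation of a Gaussian policy, as well as the state and action dimensions, are assumptions not stated explicitly above.

```python
import tensorflow as tf

def build_actor(state_dim, action_dim):
    """Actor with three ReLU hidden layers (200, 100, 100 neurons)."""
    s = tf.keras.Input(shape=(state_dim,))
    h = tf.keras.layers.Dense(200, activation="relu")(s)
    h = tf.keras.layers.Dense(100, activation="relu")(h)
    h = tf.keras.layers.Dense(100, activation="relu")(h)
    mu = tf.keras.layers.Dense(action_dim, activation="tanh")(h)        # assumed action mean
    sigma = tf.keras.layers.Dense(action_dim, activation="softplus")(h)  # assumed action std
    return tf.keras.Model(s, [mu, sigma])

def build_critic(state_dim):
    """Critic with the same hidden layers and a ReLU output, as stated above."""
    s = tf.keras.Input(shape=(state_dim,))
    h = tf.keras.layers.Dense(200, activation="relu")(s)
    h = tf.keras.layers.Dense(100, activation="relu")(h)
    h = tf.keras.layers.Dense(100, activation="relu")(h)
    v = tf.keras.layers.Dense(1, activation="relu")(h)
    return tf.keras.Model(s, v)
```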
The proposed approach and the original PPO algorithm without the clipped reward-rescaling function are trained offline for 5500 episodes to learn the operation knowledge from the training data.
There are 24 time steps in each episode, which represents one day. The cumulative reward during the training procedure is depicted in Fig. 3.

Fig. 3 Cumulative reward during training procedure.
The proportion of satisfied constraints (PSC) and the average cost of the power loss for the training data are shown in Fig. 4.

Fig. 4 PSC and cost of power loss during training procedure.
From the to 520
To test whether the knowledge extracted by the NN can be generalized to new situations and to evaluate the performance of the proposed approach, comparative experiments are performed using test data, which cover 5 days. An uncontrolled strategy, the double DQN (DDQN) algorithm, and stochastic programming (SP) are used for comparison. The optimal solution of the proposed approach is the output of the NN, whose parameters are fixed after the training. The DDQN algorithm is an improved version of deep Q-learning, which solves the problem of overestimation of the value function when the action dimension is high [
The cost of the power loss with four different methods on five consecutive test days is shown in Fig. 5.

Fig. 5 Cost of power loss with four different methods on five consecutive test days.
The quantitative results are presented in
The load demand and wind power on a low-wind-speed day and the changes in the cost of the power loss are presented in Fig. 6.

Fig. 6 Comparison results on low-wind-speed day. (a) Changes in load demand and wind power. (b) Cost of power loss with four different methods.
The increasing penetration of renewable energy and BSSs presents great challenges for the operation of the DN. In this context, we propose a DRL-based approach for the management of the DN under uncertainty. The P-OPF problem is first formulated as an MDP with finite time steps. Then, the PPO algorithm is used to solve the MDP sequentially. NNs are used to obtain the optimal operation knowledge from historical data to deal with the uncertainties. A reward-rescaling function is introduced to reduce the influence of the uncertainty of the environment on the learning process and increase the convergence speed. The operation knowledge extracted from the historical data is scalable to newly encountered situations. When the training is complete, the proposed approach can provide control decisions in real time based on the latest state of the DN, without resolving the OPF problem. Comparative tests confirm that the proposed real-time energy management strategy can provide more flexible control than the pre-determined decisions provided by the SP method. The proposed DRL-based approach is promising for the real-time operation of the DN. Considering that demand response is a promising approach to reduce the power loss by providing consumers with economic incentives, we intend to include it in our future work. A safe DRL-based approach for the optimization of the DN that explicitly considers the operation constraints will also be studied in our future work.
References
T. Ding, S. Liu, W. Yuan et al., “A two-stage robust reactive power optimization considering uncertain wind power integration in active distribution networks,” IEEE Transactions on Sustainable Energy, vol. 7, no. 1, pp. 301-311, Jan. 2016.
A. Gabash and P. Li, “Active-reactive optimal power flow in distribution networks with embedded generation and battery storage,” IEEE Transactions on Power Systems, vol. 27, no. 4, pp. 2026-2035, Nov. 2012.
M. Aien, M. Rashidinejad, and M. Firuzabad, “Probabilistic optimal power flow in correlated hybrid wind-PV power systems: a review and a new approach,” Renewable & Sustainable Energy Reviews, vol. 41, pp. 1437-1446, Jan. 2015.
N. Taher, H. Z. Meymand, and H. D. Mojarrad, “An efficient algorithm for multi-objective optimal operation management of distribution network considering fuel cell power plants,” Energy, vol. 36, pp. 119-132, Jan. 2011.
E. Naderi, H. Narimani, M. Fathi et al., “A novel fuzzy adaptive configuration of particle swarm optimization to solve large-scale optimal reactive power dispatch,” Applied Soft Computing, vol. 53, pp. 441-456, Apr. 2017.
F. Capitanescu, “Critical review of recent advances and further developments needed in AC optimal power flow,” Electric Power Systems Research, vol. 136, pp. 57-68, Jul. 2016.
R. S. Sutton and A. G. Barto, Reinforcement Learning: an Introduction. Cambridge: MIT Press, 1998.
T. Niknam, M. Zare, and J. Aghaei, “Scenario-based multiobjective volt/var control in distribution networks including renewable energy sources,” IEEE Transactions on Power Systems, vol. 27, no. 4, pp. 2004-2019, Jul. 2012.
Y. Xu, Z. Dong, R. Zhang et al., “Multi-timescale coordinated voltage/var control of high renewable-penetrated distribution systems,” IEEE Transactions on Power Systems, vol. 32, no. 6, pp. 4398-4408, Nov. 2017.
D. Bertsimas, E. Litvinov, X. A. Sun et al., “Adaptive robust optimization for the security constrained unit commitment problem,” IEEE Transactions on Power Systems, vol. 28, no. 1, pp. 52-63, Jan. 2012.
Y. Xu, J. Ma, Z. Dong et al., “Robust transient stability-constrained optimal power flow with uncertain dynamic loads,” IEEE Transactions on Smart Grid, vol. 8, no. 4, pp. 1911-1921, Jul. 2017.
F. Capitanescu and L. Wehenkel, “Computation of worst operation scenarios under uncertainty for static security management,” IEEE Transactions on Power Systems, vol. 28, no. 2, pp. 1697-1705, May 2013.
T. Soares, R. J. Bessa, P. Pinson et al., “Active distribution grid management based on robust AC optimal power flow,” IEEE Transactions on Smart Grid, vol. 9, no. 6, pp. 6229-6241, Nov. 2018.
J. F. Franco, L. F. Ochoa, and R. Romero, “AC OPF for smart distribution networks: an efficient and robust quadratic approach,” IEEE Transactions on Smart Grid, vol. 9, no. 5, pp. 4613-4623, Sept. 2018.
E. Dall’Anese, K. Baker, and T. Summers, “Chance-constrained AC optimal power flow for distribution systems with renewables,” IEEE Transactions on Power Systems, vol. 32, no. 5, pp. 3427-3438, Sept. 2017.
M. Lubin, Y. Dvorkin, and S. Backhaus, “A robust approach to chance constrained optimal power flow with renewable generation,” IEEE Transactions on Power Systems, vol. 31, no. 5, pp. 3840-3849, Sept. 2016.
P. Fortenbacher, A. Ulbig, S. Koch et al., “Grid-constrained optimal predictive power dispatch in large multi-level power systems with renewable energy sources, and storage devices,” IEEE PES Innovative Smart Grid Technologies, Istanbul, Turkey, Oct. 2014, pp. 1-6.
H. Shuai, J. Fang, X. Ai et al., “Stochastic optimization of economic dispatch for microgrid based on approximate dynamic programming,” IEEE Transactions on Smart Grid, vol. 10, no. 3, pp. 2440-2452, May 2019.
H. Shuai, J. Fang, X. Ai et al., “Optimal real-time operation strategy for microgrid: an ADP-based stochastic nonlinear optimization approach,” IEEE Transactions on Sustainable Energy, vol. 10, no. 2, pp. 931-942, Apr. 2019.
V. Bui, A. Hussain, and H. Kim, “Double deep Q-learning-based distributed operation of battery energy storage system considering uncertainties,” IEEE Transactions on Smart Grid, vol. 11, no. 1, pp. 457-469, Jan. 2020.
W. Wang, N. Yu, Y. Gao et al., “Safe off-policy deep reinforcement learning algorithm for volt-var control in power distribution systems,” IEEE Transactions on Smart Grid, vol. 11, no. 4, pp. 3008-3018, Jul. 2020.
E. Mocanu, D. Mocanu, P. Nguyen et al., “On-line building energy optimization using deep reinforcement learning,” IEEE Transactions on Smart Grid, vol. 10, no. 4, pp. 3698-3708, Jul. 2019.
G. Zhang, W. Hu, D. Cao et al., “Deep reinforcement learning-based approach for proportional resonance power system stabilizer to prevent ultra-low-frequency oscillations,” IEEE Transactions on Smart Grid, vol. 11, no. 6, pp. 5260-5272, Nov. 2020.
D. Cao, W. Hu, J. Zhao et al., “Reinforcement learning and its applications in modern power and energy systems: a review,” Journal of Modern Power Systems and Clean Energy, vol. 8, no. 6, pp. 1029-1042, Nov. 2020.
D. Cao, W. Hu, J. B. Zhao et al., “A multi-agent deep reinforcement learning based voltage regulation using coordinated PV inverters,” IEEE Transactions on Power Systems, vol. 35, no. 5, pp. 4120-4123, Sept. 2020.
X. Qi, G. Wu, K. Boriboonsomsin et al., “Data-driven reinforcement learning-based real-time energy management system for plug-in hybrid electric vehicles,” Transportation Research Record, vol. 2572, no. 1, pp. 1-8, Jan. 2016.
V. Mnih, K. Kavukcuoglu, D. Silver et al. (2013, Dec.). Playing Atari with deep reinforcement learning. [Online]. Available: https://arxiv.org/abs/1312.5602
V. Mnih, K. Kavukcuoglu, D. Silver et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, pp. 529-533, Feb. 2015.
G. Kira, “Harvesting the wind: the physics of wind turbines,” Physics and Astronomy Comps Papers, vol. 2015, pp. 1-41, Apr. 2005.
J. Schulman, F. Wolski, P. Dhariwal et al. (2017, Jul.). Proximal policy optimization algorithms. [Online]. Available: https://arxiv.org/abs/1707.06347
K. Hornik, “Approximation capabilities of multilayer feedforward networks,” Neural Networks, vol. 4, pp. 251-257, Jan. 1991.
H. van Hasselt, A. Guez, and D. Silver. (2015, Sept.). Deep reinforcement learning with double Q-learning. [Online]. Available: https://arxiv.org/abs/1509.06461