Abstract
The integration of distributed energy resources (DERs) has escalated the challenge of voltage magnitude regulation in distribution networks. Model-based approaches, which rely on complex sequential mathematical formulations, cannot meet real-time operational requirements. Deep reinforcement learning (DRL) offers an alternative by utilizing offline training with distribution network simulators and then executing online with negligible computational overhead. However, DRL algorithms fail to enforce voltage magnitude constraints during training and testing, potentially leading to serious operational violations. To tackle these challenges, we introduce a novel safety-guaranteed reinforcement learning algorithm, the DistFlow safe reinforcement learning (DF-SRL) algorithm, designed specifically for real-time voltage magnitude regulation in distribution networks. The DF-SRL algorithm incorporates a DistFlow linearization to construct an expert-knowledge-based safety layer. Subsequently, the DF-SRL algorithm overlays this safety layer on top of the agent policy, recalibrating unsafe actions into the safe domain through a quadratic programming formulation. Simulation results show that the DF-SRL algorithm consistently ensures voltage magnitude constraints during the training and real-time operation (test) phases, achieving faster convergence and higher performance, which sets it apart from (safe) DRL benchmark algorithms.
DISTRIBUTION networks have experienced a notable increase in distributed energy resource (DER) integration, including residential photovoltaic (PV) systems, energy storage systems (ESSs), and plug-in electric vehicles (EVs) [
Implementing voltage magnitude regulation adopts one of two approaches: model-based and model-free approaches. Model-based approaches manage voltage magnitude regulation by solving mathematical formulations defined via an objective function and a set of operational constraints [
Several safe DRL algorithms have recently been developed to enforce operational constraints in control systems [
To ensure that the updated policy stays within a feasible set, a cumulative constraint violation index was kept below a predetermined threshold in [
Safety layer-based DRL algorithms are well suited to handling state-wise constraints (i.e., voltage magnitude limits), as they formulate a policy-independent safety layer that projects the actions defined by the DRL algorithm into a feasible set. In [
Drawing on the pivotal insights [
1) The proposed DF-SRL algorithm incorporates a DistFlow linearization to devise a safety layer, leveraging expert knowledge insights to accurately map the relationship between actions of the agent and voltage magnitude variations in distribution networks.
2) The proposed DF-SRL algorithm overlays the safety layer on top of the DRL policy, recalibrating potentially unsafe actions into the safe domain by minimizing the Euclidean distance between the original and projected actions.
3) The error of the safety layer introduced by the linearization is compensated by the slack parameter, and a detailed sensitivity and scalability analysis is conducted.
4) The proposed DF-SRL algorithm ensures the practicality and real-time viability of actions and guarantees safety constraints during both the training and application phases.
Voltage fluctuations in distribution networks are predominantly due to variations in active power, such as those caused by overload conditions or high inflows from PV systems [
The voltage magnitude regulation framework for DSO and aggregators is depicted in

Fig. 1 Voltage magnitude regulation framework for DSO and aggregators.
Each network node is associated with an aggregator that oversees a group of consumers equipped with DERs such as residential PV systems, ESSs, and plug-in EVs. These aggregators are empowered to fully control the DERs of their designated consumers, playing a pivotal role in the dynamic management of the distribution network. Aggregators collect consumer data, build baseline electrical consumption profiles, and share the active power flexibility with the DSO control center. Subsequently, the DSO control center deploys a voltage magnitude regulation algorithm to determine the required active power flexibility that each aggregator must provide.
In this paper, we focus on developing an RL-based algorithm to assist the DSO control center in accurately determining the required flexibility provision of each aggregator to achieve voltage magnitude regulation.
In general, the voltage magnitude regulation problem can be modeled using the non-linear programming (NLP) formulation given by (1)-(9). The objective function in (1) minimizes the flexible active power used from all aggregators within the set , so as to regulate the voltage magnitude over the time horizon .
(1)
s.t.
(2)
(3)
(4)
(5)
(6)
(7)
(8)
(9)
The distribution network is modeled by the power flow formulation in (2)-(5), expressed in terms of the active power , reactive power , and current magnitude of the lines, and the voltage magnitude of the nodes. The expression in (6) keeps the used flexible active power within the boundaries that each aggregator provides, while (7) and (8) enforce the voltage magnitude and line current limits, respectively. Finally, (9) enforces that only one node is connected to the substation. The flexibility available for voltage magnitude regulation at each aggregator can vary across days and time slots [
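For concreteness, the sketch below shows how a formulation in the spirit of (1)-(9) can be prototyped with Pyomo for a single time step on a toy four-node radial feeder. The topology, impedances, loads, and flexibility limits are illustrative assumptions rather than the paper's test system, and the grouping of constraints only mirrors the description above (branch power flow, flexibility bounds, voltage and current limits).

```python
# Minimal Pyomo sketch of an NLP in the spirit of (1)-(9): single time step,
# toy 4-node radial feeder. All data below are assumptions for illustration.
import pyomo.environ as pyo

nodes = [0, 1, 2, 3]                               # node 0: substation (slack)
lines = [(0, 1), (1, 2), (1, 3)]                   # radial topology
r = {(0, 1): 0.01, (1, 2): 0.02, (1, 3): 0.02}     # p.u. resistances (assumed)
x = {(0, 1): 0.03, (1, 2): 0.04, (1, 3): 0.04}     # p.u. reactances (assumed)
p_load = {1: 0.02, 2: 0.05, 3: 0.06}               # p.u. active demand (assumed)
q_load = {1: 0.01, 2: 0.02, 3: 0.02}               # p.u. reactive demand (assumed)
dp_max = {1: 0.02, 2: 0.03, 3: 0.03}               # flexibility offered per node
v_min2, v_max2, i_max2 = 0.95**2, 1.05**2, 1.0**2  # squared limits

m = pyo.ConcreteModel()
m.P = pyo.Var(lines)                               # sending-end active power flow
m.Q = pyo.Var(lines)                               # sending-end reactive power flow
m.l = pyo.Var(lines, within=pyo.NonNegativeReals)  # squared current magnitude
m.v = pyo.Var(nodes, bounds=(v_min2, v_max2))      # squared voltage, cf. (7)
m.dp = pyo.Var(nodes[1:], bounds=lambda _, n: (0.0, dp_max[n]))  # cf. (6)
m.v[0].fix(1.0)                                    # substation voltage held fixed

def downstream(n):
    return [ln for ln in lines if ln[0] == n]

# Branch-flow (DistFlow) power balance, voltage drop, and current definition,
# in the spirit of (2)-(5).
def p_bal(m, f, t):
    return m.P[f, t] - r[f, t] * m.l[f, t] == \
        p_load[t] - m.dp[t] + sum(m.P[c] for c in downstream(t))
m.p_bal = pyo.Constraint(lines, rule=p_bal)

def q_bal(m, f, t):
    return m.Q[f, t] - x[f, t] * m.l[f, t] == \
        q_load[t] + sum(m.Q[c] for c in downstream(t))
m.q_bal = pyo.Constraint(lines, rule=q_bal)

def v_drop(m, f, t):
    return m.v[t] == m.v[f] - 2 * (r[f, t] * m.P[f, t] + x[f, t] * m.Q[f, t]) \
        + (r[f, t] ** 2 + x[f, t] ** 2) * m.l[f, t]
m.v_drop = pyo.Constraint(lines, rule=v_drop)

def i_def(m, f, t):
    return m.l[f, t] * m.v[f] == m.P[f, t] ** 2 + m.Q[f, t] ** 2
m.i_def = pyo.Constraint(lines, rule=i_def)

m.i_lim = pyo.Constraint(lines, rule=lambda m, f, t: m.l[f, t] <= i_max2)  # cf. (8)

# Objective in the spirit of (1): minimize the flexibility used by aggregators.
m.obj = pyo.Objective(expr=sum(m.dp[n] for n in nodes[1:]), sense=pyo.minimize)
# pyo.SolverFactory("ipopt").solve(m)              # requires a local NLP solver
```

Solving such a model repeatedly over a rolling horizon is what makes the model-based route computationally demanding, which motivates the learning-based alternative discussed next.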
The voltage magnitude regulation problem can be modeled as a case of CMDPs, represented by a 6-tuple . Here, denotes a state space encompassing the observable states of the distribution network; denotes an action space representing the possible control actions; is the state transition probability function capturing the system dynamics; is the reward function guiding the optimization; is a discount factor reflecting the importance of future rewards; and is a set of immediate constraint functions ensuring operational safety and feasibility. The decision as to which action is chosen in a certain state is governed by a policy . The agent employs the policy to interact with the formulated CMDP and define a trajectory of states, actions, and rewards: . This trajectory not only aims to maximize the cumulative reward but also adheres to the system constraints, thereby balancing the objectives of operational efficiency and safety.
The state at time encapsulates the current operational status of the distribution network, providing a comprehensive view of the system dynamics, and it is defined by:
(10)
where , which captures the balance among the demand, PV generation, and EV consumption at node .
The action space consists of the set of all possible active power adjustments at each node , defined as .
The DSO seeks to regulate the voltage magnitude within the defined boundaries while minimizing the use of the total active power flexibility provided by aggregators. Thus, the reward function is defined as the negative of the total used flexible active power, which can be expressed as:
(11)
This formulation incentivizes the minimization of the total active power flexibility utilized, thereby promoting energy efficiency and cost effectiveness in voltage magnitude regulation. Given the state and action at time step , the system transitions to the next state according to the transition probability function, which can be expressed as:
(12)
where is the reward distribution under the current state and action . The goal of the RL agent is to find a policy that maximizes the cumulative discounted return while ensuring no constraint is violated during the exploration and exploitation processes. is the expectation over the trajectory distribution under the current policy, and is the cumulative return of the current trajectory. The penalty term induced by the constraint violations denotes the voltage magnitude violation of node at time step , which is defined as:
(13)
This formulation ensures that represents a positive penalty term when the voltage magnitude at node deviates outside the acceptable range defined by and , and is zero otherwise.
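As a concrete illustration, this per-node violation measure can be computed as below with numpy; the 0.95/1.05 p.u. band and the array-based interface are assumptions for illustration only.

```python
# Per-node voltage violation as described around (13): positive outside the
# band [v_min, v_max], zero inside it. Limit values are assumed.
import numpy as np

def voltage_violation(v_nodes, v_min=0.95, v_max=1.05):
    v = np.asarray(v_nodes, dtype=float)
    return np.maximum(v_min - v, 0.0) + np.maximum(v - v_max, 0.0)

# Only the 1.07 p.u. and 0.93 p.u. nodes contribute: [0.0, 0.02, 0.02]
print(voltage_violation([1.01, 1.07, 0.93]))
```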
The voltage magnitude regulation problem formulated as a CMDP can then be expressed using the following constrained optimization formulation:
(14)
In this formulation, serves as a constraint in the CMDP, ensuring that the policy leads to actions that maintain the voltage magnitude within the specified limits. It is indirectly influenced by the policy through its impact on the state and the action .
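The CMDP above can be wrapped in a simple environment interface for training and evaluation. The skeleton below is schematic and not the paper's simulator: the power flow solver is passed in as a callable, the state is reduced to the net nodal injections, and the exogenous dynamics are placeholders.

```python
# Schematic CMDP environment mirroring (10)-(14): state = network operating
# point, action = nodal flexibility, reward = negative flexibility used,
# constraint signal = per-node voltage violation. Interfaces are illustrative.
import numpy as np

class VoltageRegulationEnv:
    def __init__(self, run_power_flow, n_ctrl_nodes, v_min=0.95, v_max=1.05):
        self.run_power_flow = run_power_flow   # callable: injections -> node voltages
        self.action_dim = n_ctrl_nodes
        self.v_min, self.v_max = v_min, v_max

    def reset(self):
        # Placeholder initial operating point; in practice drawn from measured
        # demand/PV/EV profiles.
        self.p_net = -0.05 * np.ones(self.action_dim)
        return self.p_net.copy()

    def step(self, delta_p):
        delta_p = np.asarray(delta_p, dtype=float)
        v = self.run_power_flow(self.p_net + delta_p)          # system response
        reward = -float(np.sum(np.abs(delta_p)))               # cf. (11)
        violation = (np.maximum(self.v_min - v, 0.0)
                     + np.maximum(v - self.v_max, 0.0))        # cf. (13)
        # Placeholder exogenous dynamics for the next time step.
        self.p_net = self.p_net + 0.01 * np.random.randn(self.action_dim)
        return self.p_net.copy(), reward, violation
```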
The proposed DF-SRL algorithm is defined through a parameterized policy network, denoted by . This policy network selects actions based on the current state, performing exploration and exploitation. To enhance safety and ensure that voltage magnitude constraints are met during the exploration, we introduce a safety layer on top of the policy network . A safety layer is designed based on the parameters and topology of the distribution network, enabling a projection of the original action proposed by the RL algorithm onto a safe domain. A more detailed explanation is provided as follows.
Traditional value-based DRL algorithms fail to solve the voltage magnitude regulation problem due to the continuous nature of the state and action spaces [
(15)
(16)
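For orientation, the snippet below gives a generic PyTorch sketch of the clipped double-Q target with target-policy smoothing on which TD3 critic updates of the kind denoted by (15) and (16) are typically built; the network objects, interfaces, and hyperparameters are generic stand-ins rather than the exact quantities used in the paper.

```python
# Generic TD3 critic target: clipped double-Q learning with target policy
# smoothing. Actor/critic networks and hyperparameters are illustrative.
import torch

def td3_critic_target(reward, next_state, actor_tgt, critic1_tgt, critic2_tgt,
                      gamma=0.99, noise_std=0.2, noise_clip=0.5, act_limit=1.0):
    with torch.no_grad():
        a_next = actor_tgt(next_state)
        noise = (noise_std * torch.randn_like(a_next)).clamp(-noise_clip, noise_clip)
        a_next = (a_next + noise).clamp(-act_limit, act_limit)
        q_min = torch.min(critic1_tgt(next_state, a_next),
                          critic2_tgt(next_state, a_next))
        return reward + gamma * q_min
```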
Although the TD3 algorithm effectively handles continuous action space problems, it cannot enforce constraints during training and testing. To solve the CMDP formulation using the TD3 algorithm, the constraint violations must be added as a penalty term to the reward function in (11), defined as:
(17)
where is used to balance the total required flexibility against the penalty incurred by the voltage magnitude violations. In this procedure, the constrained optimization problem is reformulated into an unconstrained one. However, directly applying penalty terms to the reward function cannot strictly guarantee feasibility, leading to infeasible operations and poor performance [
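For the unconstrained reformulation described above, a penalty-shaped reward can be computed as in the sketch below; the weight value is an assumed, tunable hyperparameter, and the need to tune it is precisely the drawback the safety layer avoids.

```python
# Penalty-shaped reward in the spirit of (17): flexibility cost plus a weighted
# voltage violation penalty. The weight sigma and limits are assumed values.
import numpy as np

def shaped_reward(delta_p, v_nodes, sigma=10.0, v_min=0.95, v_max=1.05):
    v = np.asarray(v_nodes, dtype=float)
    flexibility_cost = np.sum(np.abs(delta_p))                       # cf. (11)
    violation = np.sum(np.maximum(v_min - v, 0.0)
                       + np.maximum(v - v_max, 0.0))                 # cf. (13)
    return -flexibility_cost - sigma * violation
```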
Given the topology of a distribution network, the incidence matrix can be defined by:
(18)
(19)
(20)
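For a concrete picture of this bookkeeping, the snippet below builds the line-node incidence matrix of a toy radial feeder and splits off the slack-node column, leaving the reduced matrix used by the linear model; the topology and orientation convention (+1 at the "from" node, -1 at the "to" node) are assumptions for illustration.

```python
# Incidence-matrix construction in the spirit of (18)-(20): the slack column is
# separated from the reduced matrix. The toy topology is assumed.
import numpy as np

nodes = [0, 1, 2, 3]                       # node 0: slack (substation) node
lines = [(0, 1), (1, 2), (1, 3)]           # radial topology: ("from", "to")

M_full = np.zeros((len(lines), len(nodes)))
for k, (f, t) in enumerate(lines):
    M_full[k, f] = 1.0
    M_full[k, t] = -1.0

m_slack = M_full[:, 0]                     # column corresponding to slack node
A_reduced = M_full[:, 1:]                  # invertible for a radial feeder
```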
Given the diagonal matrices and , the relationship between the voltage magnitude of nodes and the net active and reactive power injections and can be expressed as:
(21)
(22)
(23)
The linear power flow formulation presented in (21) involves an approximation that neglects the quadratic term , which represents the line losses in the distribution network. This simplification is based on the findings in [
(24)
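A compact numerical sketch of the linearized relationship is given below: once the loss term is dropped, the squared voltage magnitudes respond linearly to the net injections through sensitivity matrices built from the reduced incidence matrix and the line impedances. The feeder, impedance values, and sign conventions are assumptions carried over from the previous sketch.

```python
# LinDistFlow sensitivities in the spirit of (21)-(23): squared voltages of the
# non-slack nodes as a linear function of net injections, losses neglected.
import numpy as np

A_reduced = np.array([[-1.0,  0.0,  0.0],      # reduced incidence matrix of the
                      [ 1.0, -1.0,  0.0],      # toy 4-node feeder above
                      [ 1.0,  0.0, -1.0]])
r = np.array([0.01, 0.02, 0.02])               # p.u. line resistances (assumed)
x = np.array([0.03, 0.04, 0.04])               # p.u. line reactances (assumed)

A_inv = np.linalg.inv(A_reduced)
R = A_inv @ np.diag(r) @ A_inv.T               # sensitivity of squared v to p
X = A_inv @ np.diag(x) @ A_inv.T               # sensitivity of squared v to q

def lindistflow_voltages(p_inj, q_inj, v0=1.0):
    """Approximate non-slack voltage magnitudes (p.u.) from net injections."""
    v_sq = v0**2 + 2.0 * (R @ p_inj + X @ q_inj)   # quadratic loss term dropped
    return np.sqrt(v_sq)

# Pure demand (negative injections) depresses downstream voltages slightly.
print(lindistflow_voltages(np.array([-0.02, -0.05, -0.06]),
                           np.array([-0.01, -0.02, -0.02])))
```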
The relationship expressed in (21) is utilized to establish a quadratic programming formulation, with linearized constraints, that projects potentially unsafe actions defined by the RL algorithm into a secure operational region. The primary objective of this formulation is to find the nearest safe action that minimizes the Euclidean distance from the original, potentially unsafe action . The projection thereby ensures minimal deviation from the intended control strategy while strictly adhering to operational and safety constraints. The safe action projection is achieved by solving the optimization problem:
(25)
s.t.
(26)
(27)
The slack parameter is introduced to provide a safety margin on the voltage magnitude limits, compensating for the inaccuracies introduced by the linear approximation of the real voltage magnitudes. By incorporating , we allow for a buffer in the operational constraints that accommodates potential deviations between the predicted and actual voltage magnitudes. This ensures that the projected actions remain within safe operational boundaries, even when the linear relationship underestimates or overestimates the effects of control actions on the voltage levels.
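A cvxpy sketch of this projection is shown below: it searches for the feasible action closest in Euclidean distance to the raw policy output, with the LinDistFlow-predicted voltages confined to limits tightened by the slack parameter. The sensitivity matrices follow the previous sketch, and the bounds, sign conventions, and exact constraint set assumed here may differ in detail from (26) and (27) in the paper.

```python
# Safety-layer projection in the spirit of (25)-(27): nearest safe action under
# linearized voltage limits tightened by the slack epsilon. R, X come from the
# LinDistFlow sketch; bounds, signs, and the epsilon value are assumptions.
import cvxpy as cp
import numpy as np

def project_to_safe(a_raw, p_net, q_net, R, X, dp_min, dp_max,
                    v_min=0.95, v_max=1.05, eps=0.002, v0=1.0):
    a = cp.Variable(a_raw.size)
    # Squared voltages predicted by the linear model if action 'a' is applied.
    v_sq = v0**2 + 2.0 * (R @ (p_net + a) + X @ q_net)
    constraints = [
        v_sq >= (v_min + eps) ** 2,        # lower voltage limit with slack
        v_sq <= (v_max - eps) ** 2,        # upper voltage limit with slack
        a >= dp_min, a <= dp_max,          # flexibility offered by aggregators
    ]
    problem = cp.Problem(cp.Minimize(cp.sum_squares(a - a_raw)), constraints)
    problem.solve()
    if problem.status in ("optimal", "optimal_inaccurate"):
        return np.asarray(a.value).ravel()
    return a_raw                           # fall back to the raw action if infeasible
```

In practice, this projection only needs to be solved when the linear model predicts a violation for the raw action, which keeps the online overhead low.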
The proposed safety layer projects actions onto the safe domain during both training and online execution. The proposed DF-SRL algorithm then updates the actor and critic networks based on the safe trajectories collected in the replay buffer . Therefore, the proposed DF-SRL algorithm redefines the actor-network and critic-network iteration rules by (28) and (29), respectively.
(28)
(29)
Note that the proposed DF-SRL algorithm for integrating the safety layer is specifically designed to be compatible with off-policy model-free algorithms. The off-policy nature of the proposed DF-SRL algorithm allows it to learn from experiences generated by a behavior policy that differs from the target policy being learned. This characteristic is crucial for the integration of the safety layer, as it allows the algorithm to handle the mismatched distribution between the original actions and the safe actions without impairing the update performance. Consequently, the safety layer can project potentially unsafe actions into a safe domain, ensuring operational feasibility while maintaining the integrity of the learning process. The proposed DF-SRL algorithm maintains its model-free nature by not explicitly learning the state transition function of the constructed MDP [
In addition to the integration of the safety layer, the proposed DF-SRL algorithm introduces significant novelty in the policy iteration and interaction process. More than just filtering actions, the safety layer actively changes the nature of the interaction data that are fed back into the learning process of the RL agent. By modifying the actions before they are executed (and thus the resulting state transitions and rewards), the safety layer ensures that the data used for training are not only rich in terms of learning opportunities but also aligned with operational safety requirements. This leads to an improvement in both the performance and safety of the learned policy.

Fig. 2 Architecture of proposed DF-SRL algorithm displaying interaction among actor network, critic network, and safety layer.
Algorithm 1: proposed DF-SRL algorithm

Define the maximum training epoch and the epoch length
Initialize the parameters of functions , , and , and the replay buffer
Define the parameters of the safety layer:
for each training epoch do
  Sample an initial state from the initial distribution
  for each time step in the epoch do
    Sample an action with exploration noise ,
    if the safety constraint is not satisfied then
      Project to the safe action by solving {(25), s.t. (26), (27)}
    else
      Keep the original action
    Interact with the distribution network and observe the reward and the new state
    Store the transition tuple in
    Sample a random mini-batch of transitions from
    Update the Q-function parameters by using (29)
    Update the execution policy function parameters by using (28)
    Update the target Q-function parameters using
The training process begins by randomly initializing the parameters of the DNN functions and , as well as defining the parameters of the safety layer, i.e., , , , , and . For each training epoch, at each time step , the policy receives the state and samples an action . The safety layer then assesses whether the action falls within the safe domain. The projection model is activated to project the action to a safe action, denoted as , only if the action could lead to voltage magnitude violations. Next, a transition tuple is compiled and stored in a replay buffer . A subset of these samples is subsequently selected and used to update the parameters of the functions , , and , as detailed in
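Putting the pieces together, the loop below sketches how the safety layer sits between an off-policy actor-critic agent and the environment. The agent interface (select_action/update), the safety_layer callable (e.g., a closure around the projection sketched earlier), and the noise and buffer settings are assumptions rather than the paper's implementation.

```python
# Schematic DF-SRL-style interaction loop (cf. Algorithm 1): raw actions are
# screened/projected by the safety layer, and only executed (safe) transitions
# are stored for the off-policy updates. Interfaces are illustrative.
import random
from collections import deque
import numpy as np

def train_df_srl(env, agent, safety_layer, epochs, steps_per_epoch,
                 batch_size=512, noise_std=0.1, buffer_size=100_000):
    buffer = deque(maxlen=buffer_size)                  # replay buffer
    for _ in range(epochs):
        state = env.reset()
        for _ in range(steps_per_epoch):
            a_raw = agent.select_action(state) \
                + noise_std * np.random.randn(env.action_dim)   # exploration noise
            # The safety layer checks the linearized voltages and solves the
            # projection QP only when the raw action would violate the limits.
            a_safe = safety_layer(a_raw, state)
            next_state, reward, _violation = env.step(a_safe)
            buffer.append((state, a_safe, reward, next_state))  # store safe transition
            if len(buffer) >= batch_size:
                batch = random.sample(list(buffer), batch_size)
                agent.update(batch)         # critic and actor updates, cf. (28)-(29)
            state = next_state
    return agent
```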
To validate the effectiveness of the proposed DF-SRL algorithm, we construct an environment based on a CIGRE residential low-voltage network, as shown in

Fig. 3 Modified CIGRE residential low-voltage network.
To evaluate the performance of the proposed DF-SRL algorithm, we conduct a comparative analysis with several DRL benchmark algorithms, including the state-of-the-art deep deterministic policy gradient (DDPG), proximal policy optimization (PPO), TD3, and soft actor-critic (SAC) algorithms, as well as a centralized model-based algorithm, i.e., an NLP formulation [
Item | Parameter
---|---
DF-SRL | Optimizer: Adam; learning rate: ; batch size: 512; replay buffer size:
SAC | Optimizer: Adam; learning rate: ; batch size: 512; replay buffer size: ; entropy: fixed
PPO | Optimizer: Adam; learning rate: ; batch size:
Aggregator |
Environment | Reward: ; voltage limit: p.u., p.u.

Fig. 4 Comparative analysis results of different algorithms. (a) Average total reward. (b) Summation of negative value of total used active power. (c) Cumulative penalty for voltage magnitude violations.
As depicted in

Fig. 5 Voltage magnitude of different nodes before and after regulation of DF-SRL, safe DDPG, and TD3 algorithms, and NLP formulation. (a) Node 11. (b) Node 15. (c) Node 17. (d) Node 18.
Algorithm | Average total error (%) | Average number of voltage magnitude violations | Average total computational time (s) |
---|---|---|---|
DF-SRL | 0 | ||
Safe DDPG | |||
DDPG | |||
TD3 | |||
SAC | |||
PPO |
The proposed DF-SRL algorithm capitalizes on the linear relationship between the voltage magnitude and the actions. Nevertheless, the linearized power flow formulation can introduce errors due to its approximation assumptions, and the safety layer formulation introduces the slack parameter to overcome this. Primarily, should be determined by the upper bound of the error between the DistFlow model and the actual voltage magnitude. As the final value used for influences the feasibility and optimality of the actions defined by the proposed DF-SRL algorithm, this subsection presents an in-depth sensitivity analysis of the slack parameter .

Fig. 6 Convergence performance of proposed DF-SRL algorithm for different . (a) Total flexible active power. (b) Number of voltage magnitude violations.
Additionally, in the case of , the proposed DF-SRL algorithm fails to ensure the feasibility of the decided solutions during training, whereas in the cases with set at 0.002 or 0.005, all operational constraints can be successfully enforced. In general, a low value of can render the solution of the linear projection model infeasible for the actual system. Consequently, the resolved safe solution may cause voltage magnitude violations during training, leading to sub-optimal performance after projection. If the proposed DF-SRL algorithm is executed with being 0.002 or 0.005, significant performance improvements in optimality and feasibility are observed, as illustrated in
The scalability of the proposed DF-SRL algorithm is fundamentally determined by the effectiveness of the DistFlow linearization process. This linearization approximation is essential for mapping the actions from the DRL to safe operational domains. Substantial linearization errors can cause inaccuracies within the safety layer, misguiding action projection, compromising policy iterations, and ultimately degrading the overall efficacy of the algorithm.

Fig. 7 Voltage magnitude errors of DistFlow on 18-, 34-, 69-, and 124-node distribution networks.
We collect voltage magnitudes from all nodes within distribution networks of 18, 34, 69, and 124 nodes and calculate the deviations between the DistFlow approximations and the actual voltage magnitudes over one year of data. The voltage magnitude error in the 18-node distribution network ranges from 0.00089 to 0.00163. In the 34-node distribution network, the error ranges from 0.00082 to 0.00175. The 69-node distribution network experiences an error range of 0.0011 to 0.00172, and the 124-node distribution network experiences an error range from 0.00073 to 0.00188. Although the largest distribution network exhibits a broader range of error, the maximum error does not exceed 0.002, suggesting that setting an error threshold of effectively accommodates the inaccuracies induced by the linearization across all tested distribution networks. The results demonstrate the robustness of the DistFlow model, which forms a solid foundation for the safety layer, facilitating its application across diverse distribution network configurations. This generalizability ensures that, with precise data on the parameters and topology of the distribution network, the safety layer can be tailored to maintain its accuracy and relevance, regardless of the specific characteristics of the distribution network.
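The same calibration exercise can be scripted directly: compare the LinDistFlow prediction against simulated (or measured) voltages over the historical horizon and take the worst-case gap, plus a small margin, as the slack. The array shapes and the margin value below are assumptions.

```python
# Calibrating the slack from historical linearization errors: the tightening of
# the voltage limits is set just above the worst observed |v_lin - v_true|.
import numpy as np

def calibrate_epsilon(v_lin, v_true, margin=1e-4):
    """v_lin, v_true: arrays of shape (time_steps, nodes), voltages in p.u."""
    errors = np.abs(np.asarray(v_lin) - np.asarray(v_true))
    print(f"error range: {errors.min():.5f} to {errors.max():.5f}")
    return float(errors.max()) + margin
```

An error band like the 18-node case reported above (0.00089 to 0.00163) then yields a slack just above the maximum observed error, consistent with choosing a value of about 0.002.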
The DF-SRL algorithm developed in this paper demonstrated superior performance in enforcing voltage magnitude constraints while maintaining operational efficiency. In the testing phase, the DF-SRL algorithm effectively maintains voltage magnitude constraints even under severe conditions (e.g., an under-voltage problem caused by extreme loading at the marginal node of the network), resulting in an operational cost reduction of 17.7% compared with the benchmark algorithms, while ensuring feasibility throughout the entire operation period. Specifically, the DF-SRL algorithm enforced voltage magnitude constraints without violations, even on unseen data. This is attributable to the safety layer embedded in the DF-SRL algorithm, designed to filter out unsafe actions during the training phase, thus eliminating voltage violations. The sensitivity analysis of the slack parameter found that its value significantly impacts the optimality and feasibility of the DF-SRL algorithm. We found that provides an optimal balance between rigorously enforcing the constraints and achieving the highest performance score. The scalability analysis conducted across various network sizes demonstrated that the DF-SRL algorithm maintains high performance and accuracy in voltage magnitude regulation, substantiating its utility and robustness for practical, large-scale applications. Its versatility allows integration with any off-policy DRL algorithm, facilitating the resolution of continuous control challenges within distribution network operations underpinned by state-wise constraints.
Nomenclature
Symbol | —— | Definition |
---|---|---|
A. | —— | Sets and Indexes |
B | —— | Batch of data collected from replay buffer to update policy and critic models |
—— | Set of lines connecting nodes in distribution network | |
, | —— | Node indexes |
—— | Set of nodes in distribution network | |
—— | Time step index | |
—— | Set of time steps | |
B. | —— | Parameters |
—— | Standard deviation of Gaussian distribution | |
θ, | —— | Parameters of critic network and related gradient |
, | —— | Parameters of trained policy and related gradient |
—— | Small value added to control relaxation condition of voltage magnitude limits | |
—— | Iterative number | |
, | —— | Elements of connection matrices and |
, | —— | “From” and “to” node indexes for lines |
—— | The maximum current limit of line connecting node m and node n | |
—— | Gaussian distribution | |
, | —— | Active and reactive power demands at node m and time step t |
—— | Active power from electric vehicles (EVs) at node m and time step t | |
, | —— | The maximum and minimum active power provided by aggregator at node |
, | —— | The maximum and minimum active power provided by aggregator at node and time step |
, | —— | Active and reactive power demands at node and time step |
—— | Active power generation of photovoltaic (PV) systems at node and time step | |
—— | Net active power of node at time step | |
, | —— | Trained |
, | —— | Resistance and reactance of line connecting nodes and |
, | —— | The maximum and minimum voltage magnitude limits |
, | —— | Upper and lower bounds for squared voltage magnitude |
—— | Voltage magnitude at slack node, which is typically considered constant and known | |
C. | —— | Continuous Variables |
—— | Current magnitude in line connecting nodes and at time step | |
—— | Flexible active power provided by aggregator at node and time step | |
, | —— | Active and reactive power flows from node to node at time step |
, | —— | Active and reactive power injections of slack node at time step |
—— | Active power flexibility provided at node i | |
pi, qi | —— | Active and reactive power of node i |
vi | —— | Voltage magnitude of node i |
—— | Voltage magnitude of node at time step | |
D. | —— | Matrices and Vectors |
, | —— | Original and projected (safe) action vectors |
, | —— | Matrices used in linear power flow formulation |
, | —— | Diagonal matrices constructed from and |
, | —— | Connection matrices representing “from” and “to” nodes of lines |
—— | Unit matrix | |
—— | Full incidence matrix of distribution network |
—— | Incidence matrix of distribution network | |
—— | Column of incidence matrix corresponding to slack node | |
, | —— | Vectors representing net active and reactive power injections |
, | —— | Vectors representing resistance and reactance of lines |
—— | Vector representing squared voltage magnitude of nodes | |
—— | Vector representing voltage magnitude of node m | |
—— | Unit vector with dimension equal to number of lines in network |
References
S. J. Davis, N. S. Lewis, M. Shaner et al., “Net-zero emissions energy systems,” Science, vol. 360, p. 9793, Jun. 2018.
A. G. Trojani, M. S. Moghaddam, and J. M. Baigi, “Stochastic security-constrained unit commitment considering electric vehicles, energy storage systems, and flexible loads with renewable energy resources,” Journal of Modern Power Systems and Clean Energy, vol. 11, no. 5, pp. 1405-1414, Sept. 2023.
A. Fu, M. Cvetkovic, and P. Palensky, “Distributed cooperation for voltage regulation in future distribution networks,” IEEE Transactions on Smart Grid, vol. 13, no. 6, pp. 4483-4493, Nov. 2022.
X. Chen, E. Dall’Anese, C. Zhao et al., “Aggregate power flexibility in unbalanced distribution systems,” IEEE Transactions on Smart Grid, vol. 11, no. 1, pp. 258-269, Jan. 2020.
C. Li, K. Zheng, H. Guo et al., “Intra-day optimal power flow considering flexible workload scheduling of IDCs,” Energy Reports, vol. 9, pp. 1149-1159, Sept. 2023.
Y. Li, Y. Gu, G. He et al., “Optimal dispatch of battery energy storage in distribution network considering electrothermal-aging coupling,” IEEE Transactions on Smart Grid, vol. 14, no. 5, pp. 3744-3758, Sept. 2023.
M. Glavic, “(Deep) Reinforcement learning for electric power system control and related problems: a short review and perspectives,” Annual Reviews in Control, vol. 48, pp. 22-35, Oct. 2019.
S. Hou, E. M. Salazar, P. P. Vergara et al., “Performance comparison of deep RL algorithms for energy systems optimal scheduling,” in Proceedings of 2022 IEEE PES Innovative Smart Grid Technologies Conference Europe, Novi Sad, Serbia, Oct. 2022, pp. 1-6.
M. Xia, F. Chen, Q. Chen et al., “Optimal scheduling of residential heating, ventilation and air conditioning based on deep reinforcement learning,” Journal of Modern Power Systems and Clean Energy, vol. 11, no. 5, pp. 1596-1605, Sept. 2023.
P. P. Vergara, M. Salazar, J. S. Giraldo et al., “Optimal dispatch of PV inverters in unbalanced distribution systems using reinforcement learning,” International Journal of Electrical Power and Energy Systems, vol. 136, p. 107628, Mar. 2022.
S. Wang, J. Duan, D. Shi et al., “A data-driven multi-agent autonomous voltage control framework using deep reinforcement learning,” IEEE Transactions on Power Systems, vol. 35, no. 6, pp. 4644-4654, Nov. 2020.
H. Ding, Y. Xu, B. C. S. Hao et al., “A safe reinforcement learning approach for multi-energy management of smart home,” Electric Power Systems Research, vol. 210, p. 108120, Sept. 2022.
E. M. S. Duque, J. S. Giraldo, P. P. Vergara et al., “Community energy storage operation via reinforcement learning with eligibility traces,” Electric Power Systems Research, vol. 212, p. 108515, Nov. 2022.
J. García and F. Fernández, “A comprehensive survey on safe reinforcement learning,” Journal of Machine Learning Research, vol. 16, pp. 1437-1480, Jul. 2015.
S. Zhang, R. Jia, H. Pan et al., “A safe reinforcement learning-based charging strategy for electric vehicles in residential microgrid,” Applied Energy, vol. 348, p. 121490, Oct. 2023.
H. Ding, Y. Xu, B. C. S. Hao et al., “A safe reinforcement learning approach for multi-energy management of smart home,” Electric Power Systems Research, vol. 210, p. 108120, Sept. 2022.
X. Yang, H. He, Z. Wei et al., “Enabling safety-enhanced fast charging of electric vehicles via soft actor critic-Lagrange DRL algorithm in a cyber-physical system,” Applied Energy, vol. 329, p. 120272, Jan. 2023.
H. Cui, Y. Ye, J. Hu et al., “Online preventive control for transmission overload relief using safe reinforcement learning with enhanced spatial-temporal awareness,” IEEE Transactions on Power Systems, vol. 39, no. 1, pp. 517-532, Jan. 2024.
J. Achiam, D. Held, A. Tamar et al., “Constrained policy optimization,” in Proceedings of International Conference on Machine Learning, Sydney, Australia, Aug. 2017, pp. 22-31.
H. Li and H. He, “Learning to operate distribution networks with safe deep reinforcement learning,” IEEE Transactions on Smart Grid, vol. 13, no. 3, pp. 1860-1872, May 2022.
H. Li, Z. Wan, and H. He, “Constrained EV charging scheduling based on safe deep reinforcement learning,” IEEE Transactions on Smart Grid, vol. 11, no. 3, pp. 2427-2439, May 2020.
W. Zhao, T. He, R. Chen et al. (2023, Feb.). State-wise safe reinforcement learning: a survey. [Online]. Available: https://www.ijcai.org/proceedings/2023/763
S. Hou, E. M. S. Duque, P. Palensky et al. (2023, Jul.). A constraint enforcement deep reinforcement learning framework for optimal energy storage systems dispatch. [Online]. Available: https://arxiv.org/abs/2307.14304
S. Hou, P. P. Vergara, E. M. S. Duque et al., “Optimal energy system scheduling using a constraint-aware reinforcement learning algorithm,” International Journal of Electrical Power & Energy Systems, vol. 152, p. 109230, Oct. 2023.
W. Cui, J. Li, and B. Zhang, “Decentralized safe reinforcement learning for inverter-based voltage control,” Electric Power Systems Research, vol. 211, p. 108609, Oct. 2022.
W. Wang, N. Yu, Y. Gao et al., “Safe off-policy deep reinforcement learning algorithm for volt-var control in power distribution systems,” IEEE Transactions on Smart Grid, vol. 11, no. 4, pp. 3008-3018, Jul. 2020.
M. Zhang, G. Guo, T. Zhao et al., “DNN assisted projection based deep reinforcement learning for safe control of distribution grids,” IEEE Transactions on Power Systems, vol. 39, no. 4, pp. 5687-5698, Jul. 2024.
G. Dalal, K. Dvijotham, M. Vecerik et al. (2018, Jan.). Safe exploration in continuous action spaces. [Online]. Available: https://arxiv.org/pdf/1801.08757v1
M. Eichelbeck, H. Markgraf, and M. Althoff, “Contingency-constrained economic dispatch with safe reinforcement learning,” in Proceedings of 2022 21st IEEE International Conference on Machine Learning and Applications, Nassau, Bahamas, Dec. 2022, pp. 597-602.
P. Kou, D. Liang, C. Wang et al., “Safe deep reinforcement learning-based constrained optimal control scheme for active distribution networks,” Applied Energy, vol. 264, p. 114772, Apr. 2020.
T. H. Pham, G. de Magistris, and R. Tachibana, “OptLayer – practical constrained optimization for deep reinforcement learning in the real world,” in Proceedings of 2018 IEEE International Conference on Robotics and Automation, Brisbane, Australia, May 2018, pp. 6236-6243.
E. D. Klenske and P. Hennig, “Dual control for approximate Bayesian reinforcement learning,” Journal of Machine Learning Research, vol. 17, pp. 1-30, Aug. 2016.
X. Zhang, T. Yu, Z. Pan et al., “Lifelong learning for complementary generation control of interconnected power grids with high-penetration renewables and EVs,” IEEE Transactions on Power Systems, vol. 33, no. 4, pp. 4097-4110, Jul. 2018.
V. Mnih, K. Kavukcuoglu, D. Silver et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, pp. 529-533, Feb. 2015.
T. Lillicrap, J. Hunt, A. Pritzel et al., “Continuous control with deep reinforcement learning,” in Proceedings of International Conference on Learning Representations, San Juan, Puerto Rico, May 2016, pp. 1221-234.
S. Fujimoto, H. Hoof, and D. Meger, “Addressing function approximation error in actor-critic methods,” in Proceedings of International Conference on Machine Learning, Stockholm, Sweden, Jul. 2018, pp. 1587-1596.
E. Schweitzer, S. Saha, A. Scaglione et al., “Lossy DistFlow formulation for single and multiphase radial feeders,” IEEE Transactions on Power Systems, vol. 35, no. 3, pp. 1758-1768, May 2020.
R. S. Sutton and A. G. Barto, “Reinforcement learning: an introduction,” IEEE Transactions on Neural Networks, vol. 9, no. 5, p. 1054, Sept. 1998.