Abstract
The optimal dispatch of energy storage systems (ESSs) in distribution networks poses significant challenges, primarily due to uncertainty in dynamic pricing, fluctuating demand, and the variability inherent in renewable energy sources. By exploiting the generalization capabilities of deep neural networks (DNNs), deep reinforcement learning (DRL) algorithms can learn good-quality control models that adapt to the stochastic nature of distribution networks. Nevertheless, the practical deployment of DRL algorithms is often hampered by their limited capacity to satisfy operational constraints in real time, which is a crucial requirement for ensuring the reliability and feasibility of control actions during online operation. This paper introduces an innovative framework, named mixed-integer programming based deep reinforcement learning (MIP-DRL), to overcome these limitations. The proposed MIP-DRL framework can rigorously enforce operational constraints for the optimal dispatch of ESSs during online execution. The framework involves training a Q-function with DNNs, which is subsequently represented in a mixed-integer programming (MIP) formulation. This combination allows operational constraints to be seamlessly integrated into the decision-making process. The effectiveness of the proposed MIP-DRL framework is validated through numerical simulations, which demonstrate its capability to enforce all operational constraints while achieving high-quality dispatch decisions, and show its advantage over existing DRL algorithms.
Set of actions
Set of nodes with energy storage systems (ESSs)
Set of lines in distribution network
Indices of nodes
Set of nodes in distribution network
Index of time steps
Index used for summations over layers and units
Index of units
Index of layers in deep neural network (DNN)
Total number of layers (excluding the input layer) in DNN
State transition function
Reward function
Set of states
Set of time steps
Total number of units in layer
Charging and discharging efficiencies of ESSs
Discount factor
Parameter of trained policy
Policy network
Electricity price at time step
Penalty factor
Parameter of trained critic networks consisting of weights and biases
Gradients for updating policy parameters
Bias of unit in layer
Objective function cost of unit in layer
Objective function cost for binary activation variable of unit in layer
Activation function, specifically ReLU function for DNN
The maximum squared value of current magnitude for line
The maximum and minimum charging/discharging power of ESS connected to node
Critic network
Resistance and reactance of line
The maximum and minimum states of charge (SOCs) of ESS connected to node
The maximum and minimum squared values of voltage magnitudes
Capacity of ESS connected to node
Active power, reactive power, and current of line at time step
Active power generation of photovoltaic (PV) system at node at time step t
Net power of node at time step t
Charging/discharging power of ESS connected to node at time step
Active and reactive power demands of node at time step
Active and reactive power from slack node at time step
SOC of ESS connected to node at time step
Voltage of node at time step
Slack variable associated with ReLU function for unit in layer
Output of unit in layer
Matrix of biases for layer
Matrix of weights for layer
Output vector of layer
The proliferation of distributed energy resources (DERs) poses various challenges in the control and operation of electrical distribution networks [
Traditional research, e.g., [
Implementing DRL algorithms in a real system typically follows a two-stage process: ① an offline initial training stage utilizing a simulator, and ② online execution of the trained algorithm in the real system [
Several approaches have been developed to improve the constraint enforcement capabilities of DRL algorithms [
Instead, safe DRL algorithms are implemented to directly handle constraints in distribution network operations without adding penalty terms in the reward function. In [
A summary of different constraint enforcement approaches used by safe DRL algorithms in various operational problems of energy systems is presented in the table below.
| Reference | Operational problem | Constraint enforcement approach | Open-access? |
|---|---|---|---|
| [ | Microgrid operation | Penalty function | No |
| [ | Voltage regulation | Penalty function | Yes |
| [ | Optimal power flow | Penalty function | No |
| [ | Energy dispatch | Penalty function | No |
| [ | Optimal energy system dispatch | Penalty function | Yes |
| [ | Home energy management | Primal-dual DDPG | No |
| [ | Electric vehicle (EV) in microgrid | Primal-dual SAC | No |
| [ | Microgrid energy management | Constrained policy optimization | No |
| [ | Cooling system control | Gaussian process | No |
| [ | EV charging/discharging operation | Lagrange SAC | No |
| [ | Distribution network operation | Safe layer | No |
| [ | Voltage regulation | Safe layer | Yes |
| [ | Microgrid operation | Q-network formulated MIP | Yes |
| [ | Energy management | Safe layer | No |
| [ | Energy hub trading | Gaussian process or safe layer | No |
| [ | Microgrid operation | Action projection | No |
| [ | Distribution network operation | Constrained policy optimization | No |
| [ | EV management | | |
The optimal dispatch of ESSs must satisfy strict operational constraints so that safety and feasibility can be guaranteed, especially during online execution [
In our previous work [
1) We propose the MIP-DRL framework to strictly enforce operational constraints during online operation. Utilizing the robust constraint enforcement capabilities of MIP, the proposed MIP-DRL framework ensures compliance with operational constraints, guaranteeing zero constraint violations during online execution. This innovation extends the theoretical underpinnings of DRL applicability and makes its real-time application feasible.
2) The proposed MIP-DRL framework is applicable to any DRL algorithm that employs DNNs for Q-function approximation. We implement and test the proposed MIP-DRL framework with state-of-the-art (SOTA) standard DRL algorithms such as DDPG and SAC, demonstrating its capability to strictly enforce the operational constraints.
3) To demonstrate its practical efficacy, the proposed MIP-DRL framework is used to address the optimal dispatch problem for ESSs in distribution networks. The results illustrate the superiority of the proposed MIP-DRL framework over existing standard and safe DRL algorithms in terms of both performance and action feasibility, even in unseen scenarios.
The optimal dispatch of ESSs in a distribution network can be modeled using the nonlinear programming (NLP) formulation given by (1)-(11). The objective function in (1) aims to minimize the total operational cost over the time horizon, comprising the cost of importing power from the main grid. The operational cost at each time step is settled according to the day-ahead electricity prices in €/kWh.
(1)
s.t.
(2)
(3)
(4)
(5)
(6)
(7)
(8)
(9)
(10)
(11)
The steady-state operation of the distribution network is modeled by the load flow sweep method, as shown in (2)-(5), in terms of the active power, reactive power, and current magnitude of line mn at time step t, and the voltage magnitude of node m at time step t.
In the formulated problem, we assume that only PV panels and ESSs are installed in the distribution network. The active power flexibility provided by the dispatch of ESSs is used to provide economic benefits and to keep voltage magnitudes within safe levels. It should be mentioned that the ESS model can be further refined to include detailed physical dynamics, e.g., efficiency curves, temperature, and degradation. However, since this paper aims to assess the performance of the proposed MIP-DRL framework, the ESS dynamics are simplified using the linear model [
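For orientation, a compact branch-flow sketch of this kind of dispatch model is given below. The notation and the exact constraint set are assumptions made here for illustration and may differ in detail from (1)-(11): the objective prices the power imported from the slack node, the load flow is captured by DistFlow-type constraints, and a linear SOC model with charging/discharging efficiencies represents the ESSs.

```latex
% Illustrative branch-flow dispatch sketch (assumed notation, not the paper's exact (1)-(11))
\begin{align}
\min_{P^{\mathrm{ESS}}_{b,t}}\; & \sum_{t\in\mathcal{T}} \rho_t\, P_{0,t}\, \Delta t \\
\text{s.t.}\;\;
& P_{mn,t} - R_{mn} I^{2}_{mn,t} - \textstyle\sum_{k} P_{nk,t} = P_{n,t}, \qquad
  Q_{mn,t} - X_{mn} I^{2}_{mn,t} - \textstyle\sum_{k} Q_{nk,t} = Q_{n,t} \\
& V^{2}_{n,t} = V^{2}_{m,t} - 2\big(R_{mn}P_{mn,t} + X_{mn}Q_{mn,t}\big)
  + \big(R^{2}_{mn}+X^{2}_{mn}\big)\, I^{2}_{mn,t} \\
& I^{2}_{mn,t}\, V^{2}_{m,t} = P^{2}_{mn,t} + Q^{2}_{mn,t} \\
& SOC_{b,t} = SOC_{b,t-1}
  + \Big(\eta^{\mathrm{ch}} P^{\mathrm{ch}}_{b,t} - P^{\mathrm{dis}}_{b,t}/\eta^{\mathrm{dis}}\Big)\,\Delta t \,/\, \bar{E}_{b} \\
& \underline{SOC}_{b} \le SOC_{b,t} \le \overline{SOC}_{b}, \qquad
  \underline{P}_{b} \le P^{\mathrm{ESS}}_{b,t} \le \overline{P}_{b} \\
& \underline{V}^{2} \le V^{2}_{m,t} \le \overline{V}^{2}, \qquad
  I^{2}_{mn,t} \le \overline{I}^{2}_{mn}
\end{align}
```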
The above mathematical formulation can be modeled as a finite MDP, represented by the 5-tuple $(\mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma)$. The decision of which action is chosen in a particular state is governed by a policy $\pi$. In a standard RL algorithm, an RL agent employs the policy $\pi$ to interact with the formulated MDP, which defines a trajectory of states, actions, and rewards: $\tau=(s_0, a_0, r_0, s_1, a_1, r_1, \ldots)$. Here, the goal of the RL agent is to estimate a policy that maximizes the expected discounted return $J(\pi)=\mathbb{E}_{\tau\sim\pi}[R(\tau)]$, where $\mathbb{E}_{\tau\sim\pi}[\cdot]$ is the expectation over the trajectory distribution under the current policy; and $R(\tau)=\sum_{t\in\mathcal{T}}\gamma^{t} r_t$ is the cumulative return.
Different from the standard RL algorithm, in a constrained MDP, the RL agent aims to estimate a policy confined to a feasible set $\Pi_C=\{\pi: J_C(\pi)\le d\}$, where $J_C(\pi)$ is a cost-based constraint function induced by the constraint violation functions $c_t$; and $C(\tau)=\sum_{t\in\mathcal{T}}\gamma^{t} c_t$ is the cumulative constraint violation. Based on these definitions, a constrained MDP can be formulated as a constrained optimization problem:
(12)
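In standard constrained-MDP notation (assumed here to correspond to (12)), the problem reads as maximizing the expected discounted return subject to a bound $d$ on the expected cumulative constraint violation:

```latex
% Constrained MDP in standard form (assumed to correspond to (12))
\begin{align}
\max_{\pi\in\Pi}\;\; & J(\pi) = \mathbb{E}_{\tau\sim\pi}\Big[\sum_{t\in\mathcal{T}} \gamma^{t} r_t\Big] \\
\text{s.t.}\;\; & J_{C}(\pi) = \mathbb{E}_{\tau\sim\pi}\Big[\sum_{t\in\mathcal{T}} \gamma^{t} c_t\Big] \le d
\end{align}
```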
A more detailed MDP description for the optimal dispatch problem of ESSs is presented below.
The state $s_t$ denotes the operating status of the distribution network that the agent can observe. The PV generation, consumption, day-ahead electricity price, and current time step are exogenous features, which are independent of the agent's actions, while the SOC of the ESSs is an endogenous feature, which depends on the agent's action and the previous state.
The action $a_t$ refers to the charging/discharging dispatch of the ESSs connected to nodes of the distribution network; the action space is continuous.
Given the state $s_t$ and action $a_t$, the transition of the system to the next state $s_{t+1}$ is defined by the transition probability:
(13)
The transition probability function models the endogenous distribution network and ESS dynamics, determined by the physical model of the distribution network and ESSs, as well as the exogenous uncertainty caused by the PV generation, demand consumption, and day-ahead electricity price dynamics. In practice, it is not possible to build an accurate mathematical model for such a transition probability function. Nevertheless, model-free RL algorithms do not require prior knowledge of this function, as it is implicitly learned by interacting with the environment.
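To illustrate how such an environment is typically implemented for training, the sketch below shows a gym-style step function in which the SOC update is the endogenous part and the price/PV/demand profiles are exogenous data. The class name, variable names, default values, and the simplified grid-import proxy are assumptions for illustration only; a full environment would run a power flow and apply the penalty terms of (15).

```python
import numpy as np

class ESSDispatchEnv:
    """Minimal sketch of the dispatch environment (hypothetical names, simplified dynamics)."""

    def __init__(self, price, pv, demand, n_ess=4, capacity=200.0, p_max=50.0,
                 eta_ch=0.95, eta_dis=0.95, dt=1.0):
        self.price, self.pv, self.demand = price, pv, demand    # exogenous profiles (arrays)
        self.n_ess, self.capacity, self.p_max = n_ess, capacity, p_max
        self.eta_ch, self.eta_dis, self.dt = eta_ch, eta_dis, dt
        self.reset()

    def reset(self):
        self.t = 0
        self.soc = np.full(self.n_ess, 0.5)      # endogenous: depends on past actions
        return self._state()

    def _state(self):
        return np.concatenate(([self.price[self.t], self.pv[self.t],
                                self.demand[self.t], self.t], self.soc))

    def step(self, action):
        # Normalized action in [-1, 1]: charging (+) / discharging (-)
        p_ess = np.clip(action, -1.0, 1.0) * self.p_max
        energy = np.where(p_ess >= 0, p_ess * self.eta_ch,
                          p_ess / self.eta_dis) * self.dt / self.capacity
        self.soc = np.clip(self.soc + energy, 0.0, 1.0)
        # Imported grid power as a linear proxy; voltage penalties are omitted here
        p_grid = self.demand[self.t] - self.pv[self.t] + p_ess.sum()
        reward = -self.price[self.t] * max(p_grid, 0.0) * self.dt
        self.t += 1
        done = self.t >= len(self.price) - 1
        return self._state(), reward, done, {}
```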
RL algorithms can learn representative operation strategies from interactions with the environment. To achieve this goal, the environment must provide a reward to quantify the goodness of any action taken during the interaction process. In this case, the raw reward is defined as the negative value of the operational cost for the distribution network, i.e.,
(14)
DRL algorithms optimize the operational costs while adhering to the operational constraints of ESSs and the distribution network. These constraints include the SOC limit (7), the maximum discharging/charging limit (8), and voltage magnitude constraint (9). While constraints on action spaces ((7) and (8)) are straightforward to enforce through action boundaries, the voltage magnitude constraint (9) requires addressing the physical dynamics of the distribution network. To manage these limits, the constraint violation functions are integrated into the reward function (14) as penalties, converting the constrained optimization problem (12) into an unconstrained one, redefined as:
(15)
where the penalty factor balances the operational costs against penalties for constraint violations. The constraint violation functions in (15) can be modeled using different functions, e.g., a ReLU-type function, which is defined as [
(16)
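As a concrete illustration (assumed forms, not copied from the paper), the raw reward in (14) can be written as the negative energy-import cost, and the voltage violation in (16) can be measured with a ReLU-type term weighted by the penalty factor in (15):

```latex
% Assumed concrete forms for the reward and penalty shaping in (14)-(16)
\begin{align}
r_t &= -\,\rho_t\, P_{0,t}\, \Delta t \\
\tilde{r}_t &= r_t - \sigma \sum_{m\in\mathcal{N}} c\!\left(V_{m,t}\right) \\
c\!\left(V_{m,t}\right) &= \max\!\big(0,\; V_{m,t}-\overline{V}\big)
  + \max\!\big(0,\; \underline{V}-V_{m,t}\big)
\end{align}
```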
Nevertheless, it is critical to notice that enforcing operational constraints by only adding a penalty term into the reward function during the training might lead to infeasible operational states during the online execution, as observed in [
The proposed MIP-DRL framework is defined through two main procedures: ① training, where the Q-function is approximated, and ② deployment, which is executed during the online decision-making. Both of these procedures are explained in detail below [
The step-by-step training for the proposed MIP-DRL framework integrates concepts from actor-critic DRL algorithms, including DDPG [

Fig. 1 Training of proposed MIP-DRL framework. (a) Interaction with environment. (b) Environment (distribution network). (c) Policy network.
In general, the main objective of actor-critic algorithms is to approximate a good policy network, while the Q-function is used during exploration to improve the quality of the policy network; after training, the Q-function is discarded. Different from this procedure, the proposed MIP-DRL framework follows the actor-expert definition [
(17)
As a result, the training procedure for the MIP-DRL algorithms, i.e., MIP-DDPG, MIP-TD3, and MIP-SAC, resembles that of their corresponding standard DRL algorithms. Nevertheless, the actions defined using only such a Q-function cannot strictly enforce the operational constraints during the online execution. To overcome this, the proposed MIP-DRL framework leverages the MIP formulation of the trained Q-function to enforce operational constraints during the online execution.
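To make the shared training step concrete, the sketch below shows a DDPG-style critic update on the shaped reward, as it would appear inside MIP-DDPG. The network sizes, names, and use of target networks are assumptions rather than the paper's exact settings, and the actor update is omitted.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """ReLU MLP Q-network: later re-encoded as an MIP for deployment."""
    def __init__(self, state_dim, action_dim, hidden=(256, 256)):
        super().__init__()
        layers, in_dim = [], state_dim + action_dim
        for h in hidden:
            layers += [nn.Linear(in_dim, h), nn.ReLU()]
            in_dim = h
        layers += [nn.Linear(in_dim, 1)]        # linear output layer: Q-value
        self.net = nn.Sequential(*layers)

    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))

def critic_update(q, q_target, policy_target, batch, optimizer, gamma=0.995):
    """One TD update on the shaped reward (reward minus penalty), as in standard DDPG."""
    s, a, r, s_next, done = batch
    with torch.no_grad():
        a_next = policy_target(s_next)                       # target policy action
        y = r + gamma * (1.0 - done) * q_target(s_next, a_next)
    loss = nn.functional.mse_loss(q(s, a), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```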
The trained Q-function obtained from MIP-DRL algorithms with fixed parameters can be transformed into an MIP model, facilitating the operational constraint enforcement during the online execution. This transformation enables the incorporation of system constraints directly into the action decision process, as detailed in [
Based on the definitions in [
(18)
s.t.
(19)
(20)
(21)
Each layer $k$ in the DNN-formulated Q-function has $U_k$ units, with $u$ being the unit index in layer $k$. We denote the output vector of layer $k$ as $x^{k}$, $k=0,1,\ldots,K$. The weights $W^{k}$ and biases $b^{k}$ are fixed (constant) parameters, and the same holds for the objective function costs. The activation function output of each unit is defined by (19), while (20) and (21) define the lower and upper bounds of the unit output and slack variables. For the input layer ($k=0$), the input is the same as the inputs of the Q-function, i.e., state $s_t$ and action $a_t$, and the defined bounds have physical meanings (the same limits as the inputs of the Q-function). For $k\ge 1$, the bounds are defined based on the fixed parameters, as explained in [
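One common way of writing such a ReLU-network MIP, in the style of Fischetti and Jo, is sketched below. The symbols mirror the description above, the output layer is assumed to be linear, and the exact form of (18)-(21) in the paper may differ (e.g., big-M constraints instead of indicator constraints):

```latex
% ReLU-network MIP in the style of Fischetti and Jo (assumed to mirror (18)-(21))
\begin{align}
& W^{k} x^{k-1} + b^{k} = x^{k} - s^{k}, && k = 1, \ldots, K \\
& x^{k}_{u} \ge 0, \quad s^{k}_{u} \ge 0, \quad z^{k}_{u} \in \{0,1\}, && u = 1, \ldots, U_k \\
& z^{k}_{u} = 1 \;\Rightarrow\; x^{k}_{u} \le 0, \qquad
  z^{k}_{u} = 0 \;\Rightarrow\; s^{k}_{u} \le 0 && \\
& \underline{x}^{0} \le x^{0} \le \overline{x}^{0}, \qquad x^{0} = (s_t, a_t)
\end{align}
```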
Then, the max-Q problem for Q-function in (17) is equivalent to solving (18)-(21) [
(22)
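To see how the binary variable selects the active or inactive phase of a unit, consider a single ReLU unit with pre-activation $h=w^{\top}x+b$; the two cases below are a small illustrative example, not taken from the paper:

```latex
% Worked example for one ReLU unit (illustrative only)
\begin{align*}
h=-2:\quad & x - s = -2,\;\; x,s\ge 0 \;\;\Rightarrow\;\; x=0,\; s=2,\; z=1 \;(\text{unit inactive})\\
h=+3:\quad & x - s = +3,\;\; x,s\ge 0 \;\;\Rightarrow\;\; x=3,\; s=0,\; z=0 \;(\text{unit active})
\end{align*}
```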
To better understand the MIP formulation, its structure is visualized in Fig. 2.

Fig. 2 Visual representation of MIP formulation.
The online execution of the MIP-DRL algorithms, i.e., MIP-DDPG, MIP-TD3, and MIP-SAC, is summarized in Algorithm 1.
Algorithm 1: online execution for MIP-DDPG, MIP-TD3, and MIP-SAC
1: Extract the trained parameters of the Q-network
2: Formulate the trained Q-network as an MIP according to (18)-(21) and add the operational constraints (7)-(9)
3: Extract the initial state based on real-time data
4: for each time step do
5: Get the optimal action by solving (22) using commercial MIP solvers for the current state
6: end for
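A minimal gurobipy sketch of steps 2 and 5 is given below: it encodes a small trained ReLU Q-network as an MIP with indicator constraints, fixes the state, and maximizes the Q-value over the action variables under simple box limits. All names are hypothetical; the paper cites OMLT and Gurobi for this step, whereas this sketch hand-codes the constraints, and the voltage magnitude constraint (9) would additionally require a (e.g., linearized) network model inside the MIP, which is omitted here.

```python
import numpy as np
import gurobipy as gp
from gurobipy import GRB

def argmax_q_mip(weights, biases, state, action_lb, action_ub):
    """Maximize Q(state, action) over actions, with the ReLU network encoded as an MIP.

    weights/biases: per-layer arrays of shape (out_dim, in_dim) and (out_dim,);
    all hidden layers use ReLU, the last layer is linear with a scalar output.
    """
    m = gp.Model("max_q")
    m.Params.OutputFlag = 0

    # Input layer: state entries are fixed constants, action entries are decision variables.
    n_a = len(action_lb)
    a = m.addVars(n_a, lb=action_lb, ub=action_ub, name="action")
    x_prev = list(state) + [a[i] for i in range(n_a)]

    for k, (W, b) in enumerate(zip(weights, biases)):
        out_dim, in_dim = W.shape
        pre = [gp.quicksum(float(W[j, i]) * x_prev[i] for i in range(in_dim)) + float(b[j])
               for j in range(out_dim)]
        if k == len(weights) - 1:                 # linear output layer: the Q-value
            q = m.addVar(lb=-GRB.INFINITY, name="q")
            m.addConstr(q == pre[0])
            break
        x = m.addVars(out_dim, lb=0.0, name=f"x{k}")    # ReLU outputs
        s = m.addVars(out_dim, lb=0.0, name=f"s{k}")    # slack variables
        z = m.addVars(out_dim, vtype=GRB.BINARY, name=f"z{k}")
        for j in range(out_dim):
            m.addConstr(pre[j] == x[j] - s[j])
            m.addGenConstrIndicator(z[j], True, x[j] <= 0.0)    # z=1 -> unit inactive
            m.addGenConstrIndicator(z[j], False, s[j] <= 0.0)   # z=0 -> unit active
        x_prev = [x[j] for j in range(out_dim)]

    # Additional operational constraints on the actions (e.g., SOC-aware limits) go here.
    m.setObjective(q, GRB.MAXIMIZE)
    m.optimize()
    return np.array([a[i].X for i in range(n_a)]), q.X
```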
1) Environment Data and Framework Implementation
To demonstrate the effectiveness of the proposed MIP-DRL framework, a modified IEEE 34-node test system is used, as shown in Fig. 3.

Fig. 3 Modified IEEE 34-node test system with distributed PV generation and ESSs.
| Algorithm or environment | Parameter |
|---|---|
| MIP-DDPG | Learning rate: 4; Batch size: ; Replay buffer size: 5 |
| MIP-TD3 | Learning rate: 4; Batch size: ; Replay buffer size: 5 |
| MIP-SAC | Learning rate: 4; Batch size: ; Replay buffer size: 5 |
| Reward | |
| ESSs | |
2) Validation and Benchmarks for Comparison
To demonstrate the superior performance of the MIP-DRL algorithms (MIP-DDPG, MIP-TD3, and MIP-SAC), we compare their dispatch outcomes with those of standard DRL algorithms (DDPG, TD3, and SAC) and a safe DRL algorithm (safe DDPG). The hyperparameters of DDPG, TD3, and SAC are aligned with those of MIP-DDPG, MIP-TD3, and MIP-SAC, respectively. For safe DDPG, we adopt a linear safe layer and follow the default parameter settings as described in [

Fig. 4 Results during training process for MIP-DRL algorithms. (a) Average total reward. (b) Operational cost. (c) Cumulative penalty of voltage magnitude violations.
After the last training episode, the cumulative penalty of voltage magnitude violations of MIP-TD3 is around 1. In contrast, a higher cumulative penalty of voltage magnitude violations for the MIP-DDPG and MIP-SAC is observed at around 2. This result shows that MIP-DRL algorithms can effectively learn from interactions, reducing the cumulative penalty of voltage magnitude violations while minimizing the total operation cost by learning to dispatch the ESSs correctly. However, these trained policies cannot strictly enforce voltage magnitude constraints. If such algorithms are used directly during the online execution, they might lead to infeasible operation, causing voltage violations.

Fig. 5 Voltage magnitude of nodes to which ESSs are connected, SOC of ESSs, and day-ahead electricity price. (a) Voltage magnitude of nodes (without operation of ESSs). (b) Day-ahead electricity price. (c) Voltage magnitude of nodes (MIP-DDPG). (d) SOC of ESSs (MIP-DDPG). (e) Voltage magnitude of nodes (MIP-TD3). (f) SOC of ESSs (MIP-TD3). (g) Voltage magnitude of nodes (MIP-SAC). (h) SOC of ESSs (MIP-SAC).

Fig. 6 Charging/discharging decisions and SOC changes of ESS connected to node 27 provided by different algorithms. (a) NLP formulation. (b) MIP-DDPG. (c) DDPG. (d) Safe DDPG.

Fig. 7 Voltage magnitude of node 27 to which an ESS is connected.
Compared with the optimal solution obtained by solving the NLP formulation, it can be observed that the MIP-DRL algorithms dispatch the ESSs following a more conservative approach (see the charging/discharging behavior in Fig. 6).
This can be considered a sub-optimal decision. In this case, the operational costs resulting from the dispatch decisions provided by MIP-DDPG, MIP-TD3, and MIP-SAC are 9.5%, 12.9%, and 18.4% higher, respectively, than the optimal solution provided by the NLP formulation. This difference in the dispatch decisions can be attributed to the estimated Q-function, which might not be good enough to represent the true Q-function. As the MIP-DRL algorithms choose actions that maximize the estimated Q-value, the action with the largest Q-value might not be the best action for a specific state. Nevertheless, even when executing a sub-optimal decision, the MIP-DRL algorithms enforce all voltage magnitude constraints, guaranteeing operational feasibility. On the other hand, the safe DRL algorithm, i.e., safe DDPG, fails to strictly enforce voltage magnitude constraints, as the safe layer cannot track the dynamics of complex environments.
Algorithm | Error of operational cost (%) | Number of voltage magnitude violations | Computational time (s) |
---|---|---|---|
MIP-TD3 | 13.2±0.5 | 0 | 576.7 |
MIP-DDPG | 10.4±0.7 | 0 | 435.1 |
MIP-SAC | 19.3±1.5 | 0 | 576.3 |
TD3 | 28.5±0.4 | 332 | 160.1 |
DDPG | 34.3±0.7 | 4511 | 160.1 |
SAC | 32.2±0.5 | 4417 | 160.1 |
Safe-DDPG | 39.7±0.8 | 411 | 370.1 |
Node number | Training time (hour) | Computational time (s) | Number of voltage magnitude violations | Error of operational cost (%) |
---|---|---|---|---|
34 | 4.0 | 435.1 | 0 | 10.4±0.7
69 | 4.7 | 496.9 | 0 | 10.1±0.9
123 | 6.5 | 533.4 | 0 | 11.3±0.7
We have successfully combined deep learning and optimization theory to bring constraint enforcement to DRL algorithms. By using the trained Q-network as a surrogate of the optimal operational decisions, the MIP formulation guarantees that the selected action is optimal with respect to the Q-network. Moreover, by integrating the voltage constraints into the MIP formulation, the feasibility of the action is enforced. However, the performance of MIP-DRL algorithms is determined by the approximation quality of the Q-network obtained after the training process. During this training process, the Q-iteration faces the exploration vs. exploitation dilemma, which can impact the approximation quality. For instance, MIP-DDPG outperforms MIP-TD3, while MIP-SAC performs poorly. This discrepancy may be caused by differences between the exploration policies and Q-network update rules, which lead to different exploration efficiencies. The conservative performance of MIP-SAC might be caused by the soft Q-network update rule, which introduces additional assumptions that impact the accuracy of the Q-function approximation.
Formulating a trained Q-network as an MIP problem introduces extra computational time due to the maximization of the Q-value function. Such an MIP formulation is a nondeterministic polynomial-time (NP)-complete problem, whose worst-case computational time grows exponentially with the number of integer variables, which is proportional to the total number of ReLU activation functions used. However, the computational time can be greatly reduced by various techniques such as improved branch-and-bound and customized ReLU function algorithms [
This paper proposes an MIP-DRL framework to define high-quality dispatch decisions (in terms of the total operational cost) for ESSs in a distribution network while ensuring their technical feasibility (related to enforcing voltage magnitude constraints). The proposed MIP-DRL framework consists of a Q-iteration procedure and a deployment procedure. During the Q-iteration procedure, a DNN is trained to accurately represent the state-action value function. Then, during the deployment procedure, this Q-function DNN is transformed into an MIP formulation that can be solved by commercial solvers. Results show that the dispatch decisions defined by the MIP-DRL algorithms ensure zero voltage magnitude violations, while standard DRL algorithms fail to meet such constraints in uncertain scenarios. Additionally, the MIP-DRL algorithms show smaller errors compared with the optimal solution obtained with a perfect forecast of the stochastic variables.
References
Y. Li, Y. Gu, G. He et al., “Optimal dispatch of battery energy storage in distribution network considering electrothermal-aging coupling,” IEEE Transactions on Smart Grid, vol. 14, no. 5, pp. 3744-3758, Sept. 2023.
A. Marot, A. Kelly, M. Naglic et al., “Perspectives on future power system control centers for energy transition,” Journal of Modern Power Systems and Clean Energy, vol. 10, no. 2, pp. 328-344, Mar. 2022.
C. Li, K. Zheng, H. Guo et al., “Intra-day optimal power flow considering flexible workload scheduling of IDCs,” Energy Reports, vol. 9, pp. 1149-1159, Sept. 2023.
P. P. Vergara, J. C. López, M. J. Rider et al., “Optimal operation of unbalanced three-phase islanded droop-based microgrids,” IEEE Transactions on Smart Grid, vol. 10, no. 1, pp. 928-940, Jan. 2019.
D. Cao, W. Hu, J. Zhao et al., “Reinforcement learning and its applications in modern power and energy systems: a review,” Journal of Modern Power Systems and Clean Energy, vol. 8, no. 6, pp. 1029-1042, Nov. 2020.
Z. Yin, S. Wang, and Q. Zhao, “Sequential reconfiguration of unbalanced distribution network with soft open points based on deep reinforcement learning,” Journal of Modern Power Systems and Clean Energy, vol. 11, no. 1, pp. 107-119, Jan. 2023.
C. Huang, H. Zhang, L. Wang et al., “Mixed deep reinforcement learning considering discrete-continuous hybrid action space for smart home energy management,” Journal of Modern Power Systems and Clean Energy, vol. 10, no. 3, pp. 743-754, May 2022.
J. Degrave, F. Felici, J. Buchli et al., “Magnetic control of tokamak plasmas through deep reinforcement learning,” Nature, vol. 602, no. 7897, pp. 414-419, 2022.
Y. Du, F. Li, K. Kurte et al., “Demonstration of intelligent HVAC load management with deep reinforcement learning: real-world experience of machine learning in demand control,” IEEE Power and Energy Magazine, vol. 20, no. 3, pp. 42-53, May 2022.
A. Ray, J. Achiam, and D. Amodei. (2019, Oct.). Benchmarking safe exploration in deep reinforcement learning. [Online]. Available: https://arxiv.org/abs/1910.01708
H. Ding, Y. Xu, B. Chew et al., “A safe reinforcement learning approach for multi-energy management of smart home,” Electric Power Systems Research, vol. 210, p. 108120, Sept. 2022.
E. M. S. Duque, J. S. Giraldo, P. P. Vergara et al., “Community energy storage operation via reinforcement learning with eligibility traces,” Electric Power Systems Research, vol. 212, p. 108515, Nov. 2022.
P. P. Vergara, M. Salazar, J. S. Giraldo et al., “Optimal dispatch of PV inverters in unbalanced distribution systems using reinforcement learning,” International Journal of Electrical Power & Energy Systems, vol. 136, p. 107628, Mar. 2022.
S. Hou, E. M. Salazar, P. P. Vergara et al., “Performance comparison of deep RL algorithms for energy systems optimal scheduling,” in Proceedings of 2022 IEEE PES Innovative Smart Grid Technologies Conference Europe (ISGT-Europe), Novi Sad, Serbia, Oct. 2022, pp. 1-6.
X. Yang, H. He, Z. Wei et al., “Enabling safety-enhanced fast charging of electric vehicles via soft actor critic-Lagrange DRL algorithm in a cyber-physical system,” Applied Energy, vol. 329, p. 120272, Jan. 2023.
H. Cui, Y. Ye, J. Hu et al., “Online preventive control for transmission overload relief using safe reinforcement learning with enhanced spatial-temporal awareness,” IEEE Transactions on Power Systems, vol. 39, no. 1, pp. 517-532, Jan. 2024.
J. Achiam, D. Held, A. Tamar et al., “Constrained policy optimization,” in Proceedings of International Conference on Machine Learning, Sydney, Australia, Aug. 2017, pp. 22-31.
H. Li and H. He, “Learning to operate distribution networks with safe deep reinforcement learning,” IEEE Transactions on Smart Grid, vol. 13, no. 3, pp. 1860-1872, May 2022.
H. Li, Z. Wan, and H. He, “Constrained EV charging scheduling based on safe deep reinforcement learning,” IEEE Transactions on Smart Grid, vol. 11, no. 3, pp. 2427-2439, May 2020.
G. Dalal, K. Dvijotham, M. Vecerik et al. (2018, Jan.). Safe exploration in continuous action spaces. [Online]. Available: https://arxiv.org/abs/1801.08757
G. Ceusters, M. A. Putratama, R. Franke et al., “An adaptive safety layer with hard constraints for safe reinforcement learning in multi-energy management systems,” Sustainable Energy, Grids and Networks, vol. 36, p. 101202, Dec. 2023.
M. Eichelbeck, H. Markgraf, and M. Althoff, “Contingency-constrained economic dispatch with safe reinforcement learning,” in Proceedings of 2022 21st IEEE International Conference on Machine Learning and Applications, Nassau, Bahamas, Dec. 2022, pp. 597-602.
P. Kou, D. Liang, C. Wang et al., “Safe deep reinforcement learning-based constrained optimal control scheme for active distribution networks,” Applied Energy, vol. 264, p. 114772, Apr. 2020.
S. Gros, M. Zanon, and A. Bemporad, “Safe reinforcement learning via projection on a safe set: how to achieve optimality?” IFAC-PapersOnLine, vol. 53, no. 2, pp. 8076-8081, Apr. 2020.
S. Hou, P. P. Vergara, E. M. S. Duque et al., “Optimal energy system scheduling using a constraint-aware reinforcement learning algorithm,” International Journal of Electrical Power & Energy Systems, vol. 152, p. 109230, Oct. 2023.
Y. Ji, J. Wang, J. Xu et al., “Real-time energy management of a microgrid using deep reinforcement learning,” Energies, vol. 12, no. 12, p. 2291, Jun. 2019.
J. Wang, W. Xu, Y. Gu et al., “Multi-agent reinforcement learning for active voltage control on power distribution networks,” Advances in Neural Information Processing Systems, vol. 34, pp. 3271-3284, Dec. 2021.
Y. Zhou, B. Zhang, C. Xu et al., “A data-driven method for fast AC optimal power flow solutions via deep reinforcement learning,” Journal of Modern Power Systems and Clean Energy, vol. 8, no. 6, pp. 1128-1139, Nov. 2020.
L. Liu, J. Zhu, J. Chen et al., “Deep reinforcement learning for stochastic dynamic microgrid energy management,” in Proceedings of 2021 IEEE 4th International Electrical and Energy Conference, Wuhan, China, May 2021, pp. 1-6.
Y. Ji, J. Wang, J. Xu et al., “Data-driven online energy scheduling of a microgrid based on deep reinforcement learning,” Energies, vol. 14, no. 8, p. 2120, Apr. 2021.
S. Zhang, R. Jia, H. Pan et al., “A safe reinforcement learning-based charging strategy for electric vehicles in residential microgrid,” Applied Energy, vol. 348, p. 121490, Oct. 2023.
Y. Ye, H. Wang, P. Chen et al., “Safe deep reinforcement learning for microgrid energy management in distribution networks with leveraged spatial-temporal perception,” IEEE Transactions on Smart Grid, vol. 14, no. 5, pp. 3759-3775, Sept. 2023.
P. Yu, H. Zhang, and Y. Song, “District cooling system control for providing regulation services based on safe reinforcement learning with barrier functions,” Applied Energy, vol. 347, p. 121396, Oct. 2023.
M. M. Hosseini and M. Parvania, “On the feasibility guarantees of deep reinforcement learning solutions for distribution system operation,” IEEE Transactions on Smart Grid, vol. 14, no. 2, pp. 954-964, Mar. 2023.
Y. Shi, G. Qu, S. Low et al., “Stability constrained reinforcement learning for real-time voltage control,” in Proceedings of 2022 American Control Conference, Atlanta, USA, Jun. 2022, pp. 2715-2721.
D. Qiu, Z. Dong, X. Zhang et al., “Safe reinforcement learning for real-time automatic control in a smart energy-hub,” Applied Energy, vol. 309, p. 118403, Mar. 2022.
H. Park, D. Min, J. H. Ryu et al., “DIP-QL: a novel reinforcement learning method for constrained industrial systems,” IEEE Transactions on Industrial Informatics, vol. 18, no. 11, pp. 7494-7503, Nov. 2022.
L. H. Macedo, J. F. Franco, M. J. Rider et al., “Optimal operation of distribution networks considering energy storage devices,” IEEE Transactions on Smart Grid, vol. 6, no. 6, pp. 2825-2836, Nov. 2015.
T. P. Lillicrap, J. J. Hunt, A. Pritzel et al. (2015, Sept.). Continuous control with deep reinforcement learning. [Online]. Available: https://arxiv.org/abs/1509.02971
S. Fujimoto, H. Hoof, and D. Meger, “Addressing function approximation error in actor-critic methods,” in Proceedings of International Conference on Machine Learning, Stockholm, Sweden, Jul. 2018, pp. 1587-1596.
T. Haarnoja, A. Zhou, P. Abbeel et al. (2018, Jan.). Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor. [Online]. Available: https://arxiv.org/abs/1801.01290
S. Lim, A. Joseph, L. Le et al. (2018, Oct.). Actor-expert: a framework for using Q-learning in continuous action spaces. [Online]. Available: https://arxiv.org/abs/1810.09103
M. Fischetti and J. Jo, “Deep neural networks and mixed integer linear optimization,” Constraints, vol. 23, no. 3, pp. 296-309, Jul. 2018.
G. F. Montufar, R. Pascanu, K. Cho et al. (2014, Dec.). On the number of linear regions of deep neural networks. [Online]. Available: https://proceedings.neurips.cc/paperfiles/paper/2014/file/109d2dd3608f669ca17920c511c2a41e-Paper.pdf
F. Ceccon, J. Jalving, J. Haddad et al. (2022, Feb.). OMLT: optimization & machine learning toolkit. [Online]. Available: https://arxiv.org/abs/2202.02414
Gurobi Optimization, LLC. (2022, Jun.). What’s new – Gurobi 10.0. [Online]. Available: https://www.gurobi.com/whats-new-gurobi-10-0/
S. Hou. (2022, Dec.). Energy management MIP deep reinforcement learning. [Online]. Available: https://github.com/ShengrenHou/Energy-management-MIP-Deep-Reinforcement-Learning
P. Vergara. (2022, Dec.). MIP-DRL-framework. [Online]. Available: https://github.com/distributionnetworksTUDelft/MIP-DRL-Framework
T. Wei and C. Liu, “Safe control with neural network dynamic models,” in Proceedings of Learning for Dynamics and Control Conference, Hawaii, USA, Jul. 2022, pp. 739-750.