Abstract
The increasing integration of intermittent renewable energy sources (RESs) poses significant challenges to active distribution networks (ADNs), such as frequent voltage fluctuations. This paper proposes a novel ADN optimization strategy based on multi-agent deep reinforcement learning (MADRL) that harnesses the regulating function of switch state transitions for real-time voltage regulation and loss minimization. After deploying the calculated optimal switch topologies, the distribution network operator dynamically adjusts the distributed energy resources (DERs) to enhance the operation performance of ADNs based on the policies trained by the MADRL algorithm. Owing to the model-free characteristics and generalization capability of deep reinforcement learning, the proposed strategy can still achieve the optimization objectives even when applied to similar but unseen environments. Additionally, integrating the parameter sharing (PS) and prioritized experience replay (PER) mechanisms substantially improves the strategic performance and scalability. The framework is tested on modified IEEE 33-bus, IEEE 118-bus, and three-phase unbalanced 123-bus systems. The results demonstrate the significant real-time regulation capabilities of the proposed strategy.
THE large-scale integration of intermittent distributed generation such as renewable energy sources (RESs) offers opportunities to enhance the decarbonization and flexibility of active distribution networks (ADNs). However, the uncertainty of RES output also challenges the operation of ADNs, including more frequent voltage violations and increased network loss [
Existing cutting-edge approaches in the field of ADN operation optimization mainly include model-based methods such as mathematical programming [
With respect to the dynamic adaptive strategy for real-time control, the deep reinforcement learning (DRL) is a promising alternative algorithm with model-free characteristics [
Notably, the aforementioned literature predominantly utilizes centralized methods that require global system data for decision-making, resulting in possible single-point failures. For centralized single-agent DRL, a high-dimensional action space may incur the curse of dimensionality [
To address the limitations of prior studies, this paper proposes a novel MADRL-based real-time optimization strategy for ADNs that fully harnesses switch state transitions while ensuring scalability. After adopting the optimal switch deployment calculated by the preliminary reconfiguration, the real-time management of DERs based on the MADRL-trained policies can be executed to optimize the ADN operation. Unlike [
The major contributions of this paper are summarized as follows.
1) A novel MADRL-based DER control method that integrates a preliminary model-based switch reconfiguration is proposed to optimize the ADN in real time. To the best of our knowledge, the existing DRL-based ADN control strategies are primarily combined with the day-ahead scheduling of CB and OLTC [
2) The PS mechanism is integrated into the TD3 algorithm to solve the formulated problem. By sharing identical network parameters and samples gathered by all agents, this mechanism considerably enhances the algorithmic scalability in larger systems [
3) The proposed MADRL-based optimization strategy exhibits superior real-time decision-making capability and generalization performance against various unseen scenarios, which has been verified in several test systems.
The remainder of this paper is organized as follows. Section II presents the problem formulation of ADN optimization model. Section III formulates the proposed model within the decentralized partially observable Markov decision process (Dec-POMDP) framework. Section IV discusses the proposed parameter sharing-prioritized experience replay-independent twin delayed deep deterministic policy gradient (PS-PER-ITD3) algorithm. The simulation results and conclusions are presented in Sections V and VI, respectively.
To fully utilize the regulating function of switch state transitions in ADN operation optimization, we first calculate the optimal 24-hour switch states by solving a mixed-integer second-order-cone programming (MISOCP) reconfiguration problem [
(1)
(2)
(3)
(4)
(5)
(6)
(7)
(8)
where is the operation period segment; and are the bus set and branch set of ADN, respectively; is the set of buses linked to substation; is the number of buses linked to substations; is the number of buses of ADN; is the switch status of branch at time (0 represents off and 1 represents on); is the auxiliary variable to ensure connectivity; indicates that is the downstream bus of ; and are the maximum active and reactive power that the RES device installed on bus can output under external weather conditions at time t, respectively; is the active load demand of bus ; is the active power flow on branch at time ; is the reactive power flow on branch at time ; and are the resistance and reactance of branch , respectively; and are the amplitudes of voltage and current phasors, respectively; the subscripts min and max represent the minimum and maximum values, respectively; and is a sufficiently large relaxation coefficient.
Formulas (
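For reference, a generic sketch of the radiality and big-M branch voltage constraints commonly used in such MISOCP reconfiguration models is given below; it uses the simplified LinDistFlow form, with assumed symbols for the switch status, the relaxation coefficient, and the bus and branch counts, and it is illustrative rather than the exact statement of (1)-(8).

```latex
% alpha_{ij,t}: switch status of branch (i,j); M: big-M relaxation coefficient;
% N_bus, N_sub: numbers of buses and substation-connected buses; E: branch set.
\[
\begin{aligned}
& \sum_{(i,j)\in\mathcal{E}} \alpha_{ij,t} = N_{\mathrm{bus}} - N_{\mathrm{sub}}, && \forall t,\\
& \bigl| V_{j,t}^{2} - V_{i,t}^{2} + 2\,( r_{ij}P_{ij,t} + x_{ij}Q_{ij,t} ) \bigr| \le M\,(1-\alpha_{ij,t}), && \forall (i,j)\in\mathcal{E},\ \forall t,\\
& |P_{ij,t}| \le M\,\alpha_{ij,t}, \qquad |Q_{ij,t}| \le M\,\alpha_{ij,t}, && \forall (i,j)\in\mathcal{E},\ \forall t.
\end{aligned}
\]
```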
Subsequently, whether during offline training or online execution, the MADRL-based DER control of the ADN is applied under the topology with the optimal 24-hour switch states, an aspect that has often been neglected in previous studies [
The objective function of the real-time optimization model for ADN operation is composed of the cost of network loss and the voltage violation penalty, as given by:
(9)
(10)
(11)
where F is the objective of ADN optimization in the entire period; is the active power output of DER installed on bus at time that can be adjusted; is the unit cost of active power loss ; is the penalty factor of voltage violation ; and are the target coefficients of loss cost and voltage penalty, respectively; is the function for depicting nodal voltage violation [
(12)
(13)
where is the reactive power output of the DER installed on bus at time . DERs consist of RESs and energy storage systems (ESSs).
Ensuring the secure operation of the ADN necessitates maintaining nodal voltages within predetermined ranges and preventing the RES device from surpassing its maximum power output. As the inverter-based RESs have not yet been widely adopted, we employ the traditional method of curtailing the active power output of RES to optimize the ADN operation.
(14)
(15)
(16)
where and are the active and reactive power outputs of the RES installed on bus i at time t, respectively.
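As an illustration only (the exact constraints are those in (14)-(16)), such curtailment constraints typically bound the dispatchable active output by its weather-dependent maximum and tie the reactive output to a fixed power factor; the notation below is assumed for clarity.

```latex
% Illustrative RES curtailment constraints with an assumed fixed power factor phi_i.
\[
0 \le p^{\mathrm{RES}}_{i,t} \le \bar{p}^{\mathrm{RES}}_{i,t},
\qquad
q^{\mathrm{RES}}_{i,t} = p^{\mathrm{RES}}_{i,t}\,\tan\!\bigl(\arccos \phi_{i}\bigr).
\]
```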
ESSs are introduced into our model to further enhance the regulatory effect in ADN optimization. The mathematical model of ESS can be expressed as:
(17)
(18)
where is the state of charge (SOC) of ESS installed on bus at time ; and are the discharging and charging efficiencies, respectively; is the output power of ESS installed on bus at time ; and and are the minimum and maximum SOCs of ESS, respectively.
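For clarity, a generic discrete-time form of the SOC dynamics behind (17) and (18) is sketched below, assuming that a positive ESS output denotes discharging; the capacity, the time step, and the positive/negative-part operators are notational assumptions rather than symbols used in the paper.

```latex
% E^cap_i: ESS capacity; Delta t: time step; [.]^+/[.]^-: positive/negative parts (assumed notation).
\[
SOC_{i,t+1} = SOC_{i,t} + \frac{\Delta t}{E^{\mathrm{cap}}_{i}}
\left( \eta_{\mathrm{ch}}\,\bigl[P^{\mathrm{ESS}}_{i,t}\bigr]^{-} - \frac{\bigl[P^{\mathrm{ESS}}_{i,t}\bigr]^{+}}{\eta_{\mathrm{dis}}} \right),
\qquad
SOC_{\min} \le SOC_{i,t} \le SOC_{\max}.
\]
```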
In this section, we formulate the proposed optimization strategy within the MADRL framework. First, the basic concepts of the Markov decision process (MDP) and Dec-POMDP are briefly explained to facilitate the modeling procedure. Second, the optimization problem is constructed as a Dec-POMDP model that focuses on determining the state variables, action variables, and other essential factors.
First, the concept of an MDP in single-agent DRL is introduced. An MDP can be modeled as a tuple consisting of state space , action space , state transition probability , reward function , and the discount factor . In the MDP, an agent takes an action at each time step based on the environmental observation ; then, it obtains a reward . Meanwhile, the state transitions to the next state according to the state transition probability .
We define as the trajectory of MDP and as the mapping of the action probability distribution for each state. The objective of the agent is to find a control policy that can maximize the cumulative reward .
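In standard notation, this objective is the expected discounted return:

```latex
\[
J(\pi) = \mathbb{E}_{\tau \sim \pi}\!\left[\sum_{t=0}^{T} \gamma^{t} r_{t}\right],
\qquad
\pi^{*} = \arg\max_{\pi} J(\pi).
\]
```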
Dec-POMDP is a variant of MDP under a multi-agent full cooperation mode, which indicates that each agent shares an identical target and reward. It can be described by a tuple , including agents, global state variable , the local observation of agent , the action of agent , reward , state transition function , and discount factor . The interaction process of POMDP is similar to that of the MDP; thus, it is not repeated here.
A schematic of the complete framework of the proposed optimization strategy is shown in Fig. 1.

Fig. 1 Complete framework of proposed optimization strategy.
The online decentralized process is described as follows. After receiving the policies learned in the offline centralized training process, each regional agent can perform online local DER control without the need for information exchange.
This part describes the construction of the DER control problem as a Dec-POMDP model that contains several fundamental elements.
1) Observation space: the observation of agent is expressed as:
(19)
where represents the buses that belong to region .
2) Action space: is the action of the regional agent at time , which refers to the active power output of DER that regional agent can control:
(20)
where is the active power output of the PV installed on bus ; and is the active power output of the wind turbine (WT) installed on bus .
3) Constraints: the action space and observation space must satisfy the RES output and SOC constraints:
(21)
(22)
where and are the power factors of the PV and WT installed on bus j, respectively.
Formulas (
(23)
(24)
(25)
where is the predicted active power output of the ESS that has not yet been clipped to the safe range.
Formulas (
4) Reward function: as this paper constructs the optimization problem as a Dec-POMDP model under a full cooperative framework, each agent shares an identical reward as:
(26)
The reward function in (26) includes the penalty of the voltage violation and the cost of the network loss, as expressed in (9)-(11).
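To make the cooperative reward concrete, a minimal sketch is given below; the function name, coefficient values, and units are illustrative assumptions rather than the settings used in the experiments, and the reward is simply the negative weighted sum of the loss cost and the voltage violation penalty defined in (9)-(11).

```python
import numpy as np

def shared_reward(p_loss_kw, voltages, c_loss=0.08, c_penalty=1.0,
                  v_min=0.95, v_max=1.05, w_loss=1.0, w_volt=1.0):
    """Sketch of the shared reward in (26) with illustrative coefficients.

    p_loss_kw: total active network loss at the current step.
    voltages:  np.ndarray of per-bus voltage magnitudes (p.u.).
    """
    loss_cost = c_loss * p_loss_kw                        # cost of active power loss, as in (10)
    violation = np.maximum(voltages - v_max, 0.0) \
              + np.maximum(v_min - voltages, 0.0)         # per-bus violation depth, as in (11)
    volt_penalty = c_penalty * violation.sum()
    return -(w_loss * loss_cost + w_volt * volt_penalty)  # identical value returned to every agent
```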

Fig. 2 Centralized training framework of proposed PS-PER-ITD3.
Compared with DDPG, TD3 has been widely recognized for its effectiveness in alleviating the “bootstrapping” (overestimation) phenomenon, owing to the clipped double-Q learning technique [
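As a brief illustration of this technique, the sketch below computes the standard clipped double-Q target in generic PyTorch-style code; the network handles and hyperparameter values are assumptions rather than the implementation used in this paper.

```python
import torch

def clipped_double_q_target(reward, next_obs, done, actor_tgt,
                            critic1_tgt, critic2_tgt,
                            gamma=0.99, noise_std=0.2, noise_clip=0.5):
    """Standard TD3 target: bootstrap from the smaller of the twin target critics."""
    with torch.no_grad():
        next_action = actor_tgt(next_obs)
        noise = (noise_std * torch.randn_like(next_action)).clamp(-noise_clip, noise_clip)
        next_action = (next_action + noise).clamp(-1.0, 1.0)     # target policy smoothing
        q_min = torch.min(critic1_tgt(next_obs, next_action),
                          critic2_tgt(next_obs, next_action))    # clipped double-Q value
        return reward + gamma * (1.0 - done) * q_min
```

Taking the minimum of the two target critics is what curbs the overestimation that a single bootstrapped critic tends to accumulate.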
Under the ITD3 framework, each agent behaves as an independent TD3 learner that is unable to capture transitions gathered by other agents, leading to a nonstationary Markovian environment [
However, as global critics must receive global information, the algorithms such as MADDPG [
PS [
Furthermore, the shared networks can be updated based on the experiences collected by all agents [
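A minimal sketch of the PS mechanism is given below, with illustrative network dimensions and a plain list standing in for the shared replay buffer; it only shows that every regional agent acts through one set of actor parameters and feeds one buffer.

```python
import torch
import torch.nn as nn

# Illustrative dimensions; the real observation/action sizes depend on the region.
obs_dim, act_dim = 8, 3
shared_actor = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(),
                             nn.Linear(64, act_dim), nn.Tanh())   # one set of weights for all agents
shared_buffer = []                                                # transitions gathered by all agents

def act_all(observations):
    """observations: dict mapping agent id -> torch.Tensor of shape (obs_dim,)."""
    with torch.no_grad():
        return {aid: shared_actor(obs) for aid, obs in observations.items()}
```

Because every agent queries the same parameters, adding regional agents does not add networks to train, which is the source of the scalability gain noted above.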
A comparison between the PS-integrated MADRL algorithm and the conventional global critic based algorithm (e.g., MADDPG) is presented in Fig. 3.

Fig. 3 Comparison between PS-integrated MADRL algorithm and conventional global critic based algorithm. (a) PS-integrated MADRL algorithm. (b) Global critic based algorithm.
The TD3 algorithm is an improvement of DDPG [
1) Setting the absolute value of the temporal difference (TD) error of the experience transition as its priority , where represents the index of transition.
(27)
where and are the target networks of twin critic networks and , respectively.
2) Computing the sampling probability P for each experience in the replay buffer based on their priority:
(28)
where NM is the number of transitions stored in the replay buffer; and is the priority exponent that needs to be adjusted.
3) Sampling a minibatch of transitions stored in the replay buffer according to their computed probability P.
4) Computing the importance sampling weights for the sampled transitions as:
(29)
where is the importance sampling exponent that needs to be adjusted. The importance sampling weight and TD error will be used to calculate the critic loss.
The PER mechanism deviates from uniform sampling by assigning higher sampling weights to transitions with higher learning values, as indicated by the absolute TD error. This error is positively correlated with the extent to which the critic has been inadequately trained for the corresponding experience. Therefore, sampling these experiences during the updating process can effectively enhance the model performance.
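The four steps above can be condensed into the following sketch, written in the generic PER form rather than the exact statement of (27)-(29); the exponent values and the small constant eps are illustrative assumptions.

```python
import numpy as np

def per_sample(td_errors, batch_size, alpha=0.6, beta=0.4, eps=1e-6):
    """Sample a minibatch index set and importance-sampling weights from stored TD errors."""
    priorities = np.abs(td_errors) + eps               # priority of each stored transition
    probs = priorities ** alpha
    probs /= probs.sum()                               # sampling probability of each transition
    idx = np.random.choice(len(probs), size=batch_size, p=probs)
    weights = (len(probs) * probs[idx]) ** (-beta)     # importance-sampling weights
    weights /= weights.max()                           # normalized for training stability
    return idx, weights
```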
The complete training framework of the proposed PS-PER-ITD3 is illustrated in Fig. 2.
The regional agent first receives its local observation at time and generates an action according to the local observation and the shared policy network . A random Gaussian noise is then added to the output action to promote exploration. The aggregated action set is then implemented on the virtual ADN, whose topology at time is obtained from the preliminary optimal reconfiguration (1)-(8). Each agent then receives the same reward and its next observation by solving the power flow. Finally, the transitions of all regional agents at time are formulated and sent to the shared experience replay buffer. During each episode of the centralized offline training, the MADRL aggregator selects each agent and samples transitions from the shared buffer to update the model parameters, where is the set of minibatch samples. A more concise training framework of the proposed optimization strategy is shown in Fig. 4.

Fig. 4 Training framework of proposed PS-PER-ITD3.
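To make the interaction loop concrete, a minimal sketch is given below; the environment interface (env.step), the 24-step horizon handling, and all identifiers are illustrative assumptions rather than the authors' implementation.

```python
import torch

def run_training_episode(env, agent_obs, shared_actor, shared_buffer, noise_std=0.1):
    """One interaction episode of the centralized training loop described above."""
    for t in range(24):                                            # hourly decision steps
        actions = {}
        for aid, obs in agent_obs.items():
            with torch.no_grad():
                a = shared_actor(obs)                              # shared policy network
            actions[aid] = (a + noise_std * torch.randn_like(a)).clamp(-1.0, 1.0)  # exploration noise
        # the virtual ADN applies the joint action under the reconfigured topology for hour t
        next_obs, reward = env.step(actions, hour=t)               # reward is identical for all agents
        for aid in agent_obs:
            shared_buffer.append((agent_obs[aid], actions[aid], reward, next_obs[aid]))
        agent_obs = next_obs
```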
After sampling transitions from the buffer, the mean-squared TD error of the sampled transitions is then calculated as the loss of critic :
(30)
The target value is calculated using the predicted next action that is approximated by the target actor network :
(31)
Similar operations are executed again with another critic :
(32)
Subsequently, the two loss functions are utilized to update the weights of corresponding shared twin critics through the gradient descent algorithm:
(33)
where is the learning rate of ; and is the learning rate of .
As for the actor network , the policy gradient for updating can be expressed as:
(34)
Finally, the weights of the three shared target networks are softly updated toward their corresponding online networks at a fixed frequency :
(35)
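Putting the update equations together, a minimal PyTorch-style sketch of the shared-network update is given below; the hyperparameter values, the assumption that the importance-sampling weights broadcast against the TD errors, and the omission of target policy smoothing are simplifications, so this follows the generic TD3 form rather than the exact implementation of (30)-(35).

```python
import torch

def update_shared_networks(batch, is_weights, actor, critic1, critic2,
                           actor_tgt, critic1_tgt, critic2_tgt,
                           actor_opt, critic_opt,
                           gamma=0.99, tau=0.005, policy_delay=2, step=0):
    """One gradient step on the shared twin critics and (periodically) the shared actor."""
    obs, act, rew, next_obs, done = batch
    with torch.no_grad():
        next_act = actor_tgt(next_obs)
        target_q = rew + gamma * (1.0 - done) * torch.min(critic1_tgt(next_obs, next_act),
                                                          critic2_tgt(next_obs, next_act))
    td1 = critic1(obs, act) - target_q
    td2 = critic2(obs, act) - target_q
    critic_loss = (is_weights * (td1.pow(2) + td2.pow(2))).mean()   # IS-weighted mean-squared TD error
    critic_opt.zero_grad()                                          # critic_opt holds both critics' parameters
    critic_loss.backward()
    critic_opt.step()

    if step % policy_delay == 0:                                    # delayed policy update
        actor_loss = -critic1(obs, actor(obs)).mean()               # deterministic policy gradient
        actor_opt.zero_grad()
        actor_loss.backward()
        actor_opt.step()
        for tgt, src in ((actor_tgt, actor), (critic1_tgt, critic1), (critic2_tgt, critic2)):
            for p_t, p in zip(tgt.parameters(), src.parameters()):
                p_t.data.mul_(1.0 - tau).add_(tau * p.data)         # soft target update
    return 0.5 * (td1.abs() + td2.abs()).detach()                   # refreshed PER priorities
```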
In this paper, the modified IEEE 33-bus system [
The function is used as the terminal activation function of the actor network to limit its output to , which can be linearly scaled back to the output power of the DER device. Both the MADRL algorithm and power flow calculations are run in MATLAB 2022b with an Intel Core i9 CPU and Nvidia RTX4090 GPU. The topology, DER installation, and subnetwork division results for each test system are presented in Appendix A. The operation optimization model formulated for the three-phase unbalanced 123-bus system is presented in Appendix B. Detailed parameters such as load ratio and branch impedance can be learned from MATPOWER [
First, the reconfiguration problem of the modified IEEE 33-bus system is solved using GUROBI to obtain the optimal 24-hour switch deployment, which is presented in Appendix A. Subsequent MADRL training and execution occur under the calculated topologies with the optimal 24-hour switch states.
To explore the optimal operation control strategies, several DRL algorithms are applied, i.e., the single-agent DDPG (SADDPG), MADDPG, PS-ITD3, and PS-PER-ITD3. The training performances of these algorithms in the reconfigured IEEE 33-bus system are presented in Fig. 5.

Fig. 5 Training performances of different algorithms in reconfigured IEEE 33-bus system.
1) All the DRL algorithms eventually converge, except for SADDPG. The optimization objective is to adjust the output of the nine DERs in the IEEE 33-bus system to promote security and economy. However, the dimensionality of the action space in SADDPG, which is nine, is too large to learn a stable control strategy, thus resulting in severe fluctuations. Similarly, other single-agent DRL approaches exhibit poor scalability in scenarios with numerous DERs.
2) Comparing the volatility of the three MADRL algorithms during the 30
3) By analyzing the convergence performances of the three MADRL algorithms during the 200
The variations in the average loss cost and voltage violation penalty of the PS-PER-ITD3 are shown in Fig. 6.

Fig. 6 Average loss cost and voltage violation penalty of PS-PER-ITD3 in reconfigured IEEE 33-bus system.
As shown in
Algorithm | Voltage violation rate (%) | Loss cost ($) |
---|---|---
MISOCP | 0 | 117.72 |
SADDPG | 0 | 127.89 |
MADDPG | 0 | 126.77 |
PS-ITD3 | 0 | 124.65 |
PS-PER-ITD3 | 0 | 121.84 |
However, none of the four DRL-based algorithms violates the voltage limits. This illustrates that the optimization task in the IEEE 33-bus system is not difficult; thus, a larger system is required to test the methodological scalability rigorously.
Subsequently, a comparative experiment is conducted to verify the significance of the reconfiguration prior to the proposed optimization strategy. It involves two scenarios: one without reconfiguration (scenario 1) and the other with reconfiguration (scenario 2). The comparisons of average training rewards are presented in Fig. 7.

Fig. 7 Scenario comparisons of training in IEEE 33-bus system.
Scenario | Algorithm | Voltage violation rate (%) | Loss cost ($)
---|---|---|---
1 | MISOCP | 0 | 140.03
1 | PS-ITD3 | 1.340 | 142.87
1 | PS-PER-ITD3 | 1.091 | 143.90
2 | PS-PER-ITD3 | 0 | 121.84
1) The terminal rewards of PS-ITD3 and PS-PER-ITD3 in scenario 1 both converge at , which is 30 less than that of PS-PER-ITD3 in scenario 2.
2) The online execution results indicate that both the PS-ITD3 and the PS-PER-ITD3 in scenario 1 cannot strictly restrict the nodal voltage within a safe range, with approximately 1% of the buses violating the limits over 24 hours. Conversely, an identical PS-PER-ITD3 in scenario 2 rectifies this issue. Furthermore, the loss cost in scenario 1 is $20 more than that in scenario 2. These observations demonstrate the necessity of a preliminary reconfiguration in ADN optimization.
Similar operations are implemented to calculate the optimal reconfiguration deployment for the IEEE 118-bus system. A 24-hour reconfiguration period is stipulated because of the concern that more frequent switch operations could introduce additional risks.
The aforementioned MADRL algorithms are tested in the reconfigured IEEE 118-bus system to evaluate their scalability. The reward convergence curves of different algorithms in the reconfigured IEEE 118-bus system are presented in Fig. 8.

Fig. 8 Algorithmic training comparisons in reconfigured IEEE 118-bus system.
1) MADDPG in the reconfigured IEEE 118-bus system shows considerable volatility, with several collapses and surges during training, and does not converge until the 370th episode.
2) The terminal rewards of PS-PER-ITD3, PS-ITD3, and MADDPG are approximately , , and , respectively. Compared with the IEEE 33-bus system, the PS-PER-ITD3 exhibits significant superiority over MADDPG in terms of training reward and convergence rate in the reconfigured IEEE 118-bus system. This demonstrates that the performance gap between the PS-integrated algorithms and the conventional MADRL algorithms increases with the system scale.
To enable better reward convergence, the target coefficients and in the original IEEE 118-bus system without reconfiguration are set differently from those in the reconfigured IEEE 118-bus system. Owing to this stipulation, a comparison of the comprehensive rewards of the two scenarios, as shown in

Fig. 9 Loss cost and voltage violation penalty of PS-PER-ITD3 in IEEE 118-bus system. (a) Reconfigured IEEE 118-bus system (scenario 2). (b) Original IEEE 118-bus system without reconfiguration (scenario 1).
1) The loss cost of the reconfigured IEEE 118-bus system converges at approximately $475, whereas that of the original IEEE 118-bus system converges at $750.
2) The voltage violation penalty of the reconfigured IEEE 118-bus system converges at 32, whereas that of the original IEEE 118-bus system converges at 185.
3) In summary, the PS-PER-ITD3 in scenario 2 exhibits superior mitigation effects on both the loss cost and voltage violation, which preliminarily confirms the necessity and effectiveness of the integrated reconfiguration.
The online decision-making effects of different algorithms in the reconfigured and original IEEE 118-bus systems are compared in Table III.
System | Algorithm | Voltage violation rate (%) | Loss cost ($)
---|---|---|---
Original IEEE 118-bus | MISOCP | 4.926 | 748.30
Original IEEE 118-bus | PS-PER-ITD3 | 5.525 | 736.43
Reconfigured IEEE 118-bus | MISOCP | 0.812 | 427.39
Reconfigured IEEE 118-bus | MADDPG | 0.983 | 494.87
Reconfigured IEEE 118-bus | PS-ITD3 | 1.017 | 478.12
Reconfigured IEEE 118-bus | PS-PER-ITD3 | 0.983 | 470.70
1) In contrast to the IEEE 33-bus system, the task difficulty of the IEEE 118-bus system increases significantly, as reflected by the higher voltage violation rates.
2) With the observation that both the voltage violation penalty and loss cost in scenario 1 are markedly higher than those in scenario 2, the necessity of the reconfiguration prior to the DRL-based ADN operation optimization is again verified in the larger IEEE 118-bus system.
3) In both scenarios 1 and 2, the voltage violation penalty and loss cost of the proposed PS-PER-ITD3 are close to the values calculated by MISOCP, demonstrating its superiority in approaching the theoretical optimum. Notably, PS-PER-ITD3 even outperforms MISOCP in scenario 1; this anomaly originates from the random noise added to the renewable predicted output during the MADRL training.
The decision-making time for different optimization algorithms in the reconfigured IEEE 118-bus system and reconfigured IEEE 33-bus system are compared in Table IV. In this context, all references to decision-making time refer to single time-step values.
System | Algorithm | Decision-making time (s)
---|---|---
Reconfigured IEEE 33-bus | MISOCP | 1.1070
Reconfigured IEEE 33-bus | PS-ITD3 | 0.0118
Reconfigured IEEE 33-bus | PS-PER-ITD3 | 0.0099
Reconfigured IEEE 118-bus | MISOCP | 1.2320
Reconfigured IEEE 118-bus | PS-ITD3 | 0.0137
Reconfigured IEEE 118-bus | PS-PER-ITD3 | 0.0124
Table IV reveals notable differences in the online decision-making speeds of the algorithms for the two test systems. Specifically, the MADRL algorithms exhibit faster, millisecond-level decision-making by leveraging the experiences extracted during training. This renders them well suited for addressing almost any short-term control requirement involving ADN fluctuation mitigation.
However, with decision-making times of only 1.107 s and 1.232 s for MISOCP in the reconfigured IEEE 33-bus and IEEE 118-bus systems, respectively, the MADRL algorithms show no overwhelming speed advantage for online real-time control. This is likely due to the small scale of these test systems; therefore, further assessment on larger systems is necessary.
Based on the stipulated 10% uncertainty in the renewable predicted output, 50 test days are generated to verify the generalization of the proposed PS-PER-ITD3 against unseen scenarios (unknown renewable predicted output). All the renewable predicted output data of the test days are excluded from the training process. The performance of the proposed PS-PER-ITD3 on the test day set is compared with that of MISOCP, as shown in Fig. 10.

Fig. 10 Generalization validation on test day set in reconfigured IEEE 118-bus system.
1) The proposed PS-PER-ITD3 shows significant decision-making effects on the test day set because the gap from the optimum calculated by MISOCP is small, which confirms its superior generalization in similar but unseen scenarios.
2) Although the proposed PS-PER-ITD3 does not achieve exactly the same results as the MISOCP algorithm, its generalization capability and decision-making speed remain unmatched by model-based methods and are well suited for the real-time control of ADNs. In contrast, the model-based algorithms require re-computation whenever the ADN scenario changes.
The proposed PS-PER-ITD3 is then tested on a three-phase unbalanced 123-bus system to evaluate its scalability and decision-making effects. With three phases per bus, the scale and complexity of the 123-bus system far exceed those of the IEEE 118-bus system. Invalid load nodes and vacant branches are omitted, so only 114 valid buses remain. The corresponding numerical results are presented in Table V and Fig. 11.

Fig. 11 24-hour voltage distribution comparisons in three-phase unbalanced 123-bus system. (a) MISOCP. (b) PS-PER-ITD3.
Algorithm | Decision-making time (s) | Loss (kWh) |
---|---|---
MISOCP | 5.240 | 1197 |
PS-PER-ITD3 | 0.049 | 1334 |
1) Both the MISOCP [
① Existing model-based algorithms (such as MISOCP) for solving three-phase optimal power flow issues commonly assume that the voltage phasors are nearly balanced [
② MISOCP aims to discover a unique optimal solution for a given scenario, whereas DRL explores a policy that can achieve near-optimal decision-making effects in numerous unseen scenarios. This normally incurs a slight sacrifice of solution optimality in the DRL-based algorithm, which is further aggravated by the multiphase coupling nature of the 123-bus system and the multi-agent learning mode of the PS-PER-ITD3.
2) In contrast to the IEEE 33-bus and IEEE 118-bus systems, the decision-making time of MISOCP in the three-phase unbalanced 123-bus system significantly exceeds that of the PS-PER-ITD3. Consequently, for large-scale three-phase unbalanced distribution systems, the conventional MISOCP is limited to day-ahead or intraday optimizations with minute-level decision intervals. However, PS-PER-ITD3 still exhibits online millisecond-level decision-making capabilities.
In conclusion, compared with the conventional model-based MISOCP, the proposed PS-PER-ITD3 offers far superior generalization and decision-making speed in the three-phase unbalanced 123-bus system. In addition, the integrated PS mechanism retains sufficient optimization capability and scalability in the three-phase unbalanced test system, as its gap from MISOCP remains acceptable even with the aforementioned unavoidable inherent sacrifices.
An MADRL-based real-time optimization strategy for ADNs is proposed to mitigate voltage violations and network losses. After adopting the optimal switch deployment calculated by the preliminary reconfiguration, the ADN is partitioned into multiple parallel regional agents and trained by the proposed PS-PER-ITD3. The PS mechanism is integrated into the ITD3-based MADRL algorithm, which significantly enhances its scalability and stability in larger systems by substituting shared network parameters and a shared replay buffer for the conventional global critic mechanism.
In the numerical studies, the PS-PER-ITD3 and several other algorithms are tested on the IEEE 33-bus, IEEE 118-bus, and three-phase unbalanced 123-bus systems. The simulation results confirm the scalability and superiority of the proposed PS-PER-ITD3 for real-time operation control of ADNs. Moreover, a scenario-based comparative experiment demonstrates the necessity and effectiveness of the preliminary reconfiguration in the proposed PS-PER-ITD3. Based on the aforementioned experiments, the proposed PS-PER-ITD3 outperforms the other algorithms in terms of convergence speed, online decision-making rate, generalization, and scalability.
Further studies are required to explore better coordination modes between reconfiguration and DER control under the MADRL-based framework and to improve strategic scalability in large-scale systems.
Appendix
The information of the test distribution systems is shown in Tables AI-AIII and Figs. A1-A4. In Fig. A2, under the original topology without reconfiguration, lines with red triangles are closed and dashed lines are open; under the reconfigured topology, lines with red triangles are open and dashed lines are closed (remaining unchanged for 24 hours).
System | Region 1 | Region 2 | Region 3 | Region 4
---|---|---|---|---
IEEE 33-bus | Buses 1-11 | Buses 12-22 | Buses 23-33 | -
IEEE 118-bus | Buses 1-29 | Buses 30-58 | Buses 59-87 | Buses 88-116
123-bus | Buses 1-28 | Buses 29-56 | Buses 57-85 | Buses 85-112
System | PV | WT | ESS
---|---|---|---
IEEE 33-bus | Buses 6, 13, and 27 | Buses 10, 16, and 30 | Buses 4, 15, and 29
IEEE 118-bus | Buses 9, 55, 80, and 114 | Buses 23, 51, 67, and 104 | Buses 10, 39, 60, and 91
123-bus | Buses 6, 14, 28, 35, 46, 52, 57, 62, 75, 92, 101, 107 (only inverter-based PV) | - | -
Time (hour) | Open branch | Time (hour) | Open branch |
---|---|---|---
1 | 14, 28, 33, 34, 37 | 13 | 33, 34, 35, 36, 37 |
2 | 9, 14, 28, 34, 37 | 14 | 5, 14, 33, 35, 36 |
3 | 9, 14, 28, 34, 37 | 15 | 9, 14, 34, 36, 37 |
4 | 5, 14, 33, 35, 36 | 16 | 9, 34, 35, 36, 37 |
5 | 5, 14, 28, 34, 35 | 17 | 9, 34, 35, 36, 37 |
6 | 14, 28, 33, 34, 35 | 18 | 14, 33, 34, 35, 36 |
7 | 9, 14, 28, 34, 37 | 19 | 9, 34, 35, 36, 37 |
8 | 9, 14, 28, 34, 37 | 20 | 9, 14, 28, 34, 37 |
9 | 9, 28, 34, 35, 37 | 21 | 9, 28, 33, 34, 35 |
10 | 9, 14, 28, 34, 37 | 22 | 5, 9, 14, 28, 34 |
11 | 14, 33, 34, 36, 37 | 23 | 5, 9, 14, 28, 35 |
12 | 9, 19, 28, 34, 35 | 24 | 5, 9, 14, 34, 37 |

Fig. A1 Topology of IEEE 33-bus system.

Fig. A2 Topology of IEEE 118-bus system.

Fig. A3 Topology of three-phase unbalanced 123-bus system.
1) To simplify the problem and reduce risks, the reconfiguration period in the IEEE 118-bus system is 24 hours, whereas that in the IEEE 33-bus system is 1 hour.
2) All the vacant branches and invalid buses of the three-phase unbalanced 123-bus system are omitted, and hence, there are only 114 valid buses.
3) As discussed in [
The operation optimization model of three-phase unbalanced 123-bus network is shown in Fig. A4.

Fig. A4 Operation optimization model of three-phase unbalanced 123-bus system.
1) Objective Function
An optimization model for the three-phase unbalanced ADN is constructed to minimize the comprehensive objective, which is composed of active network loss and voltage violation penalty.
The detailed mathematical form is given as:
(A1)
where , , and represent the multiplying coefficient of voltage violation penalty, voltage violation penalty, and phases in , respectively. The corresponding two components are calculated by:
(A2)
(A3)
where and are the real-time active and reactive power flows of branch at phase , respectively; is the self-resistance of branch at phase ; is the penalty to voltage violation with ; is the acceptable range of voltage at each phase; and is the voltage amplitude of bus at phase .
2) Constraints
1) Three-phase unbalanced power flow equation
(A4)
(A5)
where represents the active power injections on bus ; represents the reactive power injections on bus ; and are the real and imaginary submatrices of branch in the admittance matrix , respectively, is the number of bus set ; denotes the cosine calculation; denotes the sine calculation; ; and is denoted as:
(A6)
(A7)
(A8)
where represents the root bus of the unbalanced distribution system.
2) Operation constraints of inverter-based PV
(A9)
(A10)
where is the maximum output active power of the inverter-based PV installed on phase of bus ; is the nominal capacity of the PV inverter; and is the actual output reactive power of PV inverter installed on phase of bus , which can be adjusted in quadrants I and IV of the P-Q coordinate system dynamically. Note that only the unbalanced test system utilizes inverter-based PV in this paper.
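A generic form of such an inverter capability constraint, restricting operation to quadrants I and IV of the P-Q plane, is shown below; the apparent-power rating symbol is assumed notation rather than a symbol defined in the paper.

```latex
% S^PV: nominal apparent-power capacity of the PV inverter (assumed notation).
\[
0 \le p^{\mathrm{PV}}_{i,\varphi,t} \le \bar{p}^{\mathrm{PV}}_{i,\varphi,t},
\qquad
\bigl(p^{\mathrm{PV}}_{i,\varphi,t}\bigr)^{2} + \bigl(q^{\mathrm{PV}}_{i,\varphi,t}\bigr)^{2} \le \bigl(S^{\mathrm{PV}}_{i,\varphi}\bigr)^{2}.
\]
```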
3) POMDP Modeling
This paper formulates the operation optimization of three-phase unbalanced distribution system as a Dec-POMDP model.
1) Observation of the agent
(A11)
where is the total active load in regional agent ; and is the total reactive load in regional agent .
2) Action of agent
(A12)
3) Constraints of agent
(A13)
(A14)
4) Reward
(A15)
References
G. Švenda, I. Krstić, S. Kanjuh et al., “Volt var watt optimization in distribution network with high penetration of renewable energy sources and electric vehicles,” in Proceedings of 2022 IEEE PES Innovative Smart Grid Technologies Conference Europe (ISGT-Europe), Novi Sad, Serbia, Oct. 2022, pp. 1-5.
S. H. Low, “Convex relaxation of optimal power flow – part I: formulations and equivalence,” IEEE Transactions on Control of Network Systems, vol. 1, no. 1, pp. 15-27, Mar. 2014.
S. H. Low, “Convex relaxation of optimal power flow – part II: exactness,” IEEE Transactions on Control of Network Systems, vol. 1, no. 2, pp. 177-189, Jun. 2014.
Y. Fan, L. Feng, and G. Li, “Dynamic optimal power flow in distribution networks with wind/PV/storage based on second-order cone programming,” in Proceedings of 2020 5th Asia Conference on Power and Electrical Engineering (ACPEE), Chengdu, China, Jun. 2020, pp. 1136-1142.
M. Niu, C. Wan, and Z. Xu, “A review on applications of heuristic optimization algorithms for optimal power flow in modern power systems,” Journal of Modern Power Systems and Clean Energy, vol. 2, no. 4, pp. 289-297, Dec. 2014.
Y. Ai, M. Du, Z. Pan et al., “The optimization of reactive power for distribution network with PV generation based on NSGA-III,” CPSS Transactions on Power Electronics and Applications, vol. 6, no. 3, pp. 193-200, Sept. 2021.
D. Cao, W. Hu, X. Xu et al., “Deep reinforcement learning based approach for optimal power flow of distribution networks embedded with renewable energy and storage devices,” Journal of Modern Power Systems and Clean Energy, vol. 9, no. 5, pp. 1101-1110, Sept. 2021.
H. Liu and W. Wu, “Two-stage deep reinforcement learning for inverter-based volt-var control in active distribution networks,” IEEE Transactions on Smart Grid, vol. 12, no. 3, pp. 2037-2047, May 2021.
D. Cao, W. Hu, J. Zhao et al., “A multi-agent deep reinforcement learning based voltage regulation using coordinated PV inverters,” IEEE Transactions on Power Systems, vol. 35, no. 5, pp. 4120-4123, Sept. 2020.
X. Sun and J. Qiu, “Two-stage volt/var control in active distribution networks with multi-agent deep reinforcement learning method,” IEEE Transactions on Smart Grid, vol. 12, no. 4, pp. 2903-2912, Jul. 2021.
D. Hu, Z. Ye, Y. Gao et al., “Multi-agent deep reinforcement learning for voltage control with coordinated active and reactive power optimization,” IEEE Transactions on Smart Grid, vol. 13, no. 6, pp. 4873-4886, Nov. 2022.
H. Wu, Z. Xu, M. Wang et al., “Two-stage voltage regulation in power distribution system using graph convolutional network-based deep reinforcement learning in real time,” International Journal of Electrical Power & Energy Systems, vol. 151, p. 109158, Sept. 2023.
H. Liu, C. Zhang, Q. Chai et al., “Robust regional coordination of inverter-based volt/var control via multi-agent deep reinforcement learning,” IEEE Transactions on Smart Grid, vol. 12, no. 6, pp. 5420-5433, Nov. 2021.
J. Zhang, Y. Guan, L. Che et al., “EV charging command fast allocation approach based on deep reinforcement learning with safety modules,” IEEE Transactions on Smart Grid, doi: 10.1109/TSG.2023.3281782.
Y. Zhang, X. Wang, J. Wang et al., “Deep reinforcement learning based volt-var optimization in smart distribution systems,” IEEE Transactions on Smart Grid, vol. 12, no. 1, pp. 361-371, Jan. 2021.
Z. Yin, S. Wang, and Q. Zhao, “Sequential reconfiguration of unbalanced distribution network with soft open points based on deep reinforcement learning,” Journal of Modern Power Systems and Clean Energy, vol. 11, no. 1, pp. 107-119, Jan. 2023.
J. Zhang, M. Cui, and Y. He, “Dual timescales voltages regulation in distribution systems using data-driven and physics-based optimization,” IEEE Transactions on Industrial Informatics, doi: 10.1109/TII.2023.3274216.
Y. Pei, J. Zhao, Y. Yao et al., “Multi-task reinforcement learning for distribution system voltage control with topology changes,” IEEE Transactions on Smart Grid, vol. 14, no. 3, pp. 2481-2484, May 2023.
M. R. Dorostkar-Ghamsari, M. Fotuhi-Firuzabad, M. Lehtonen et al., “Value of distribution network reconfiguration in presence of renewable energy resources,” IEEE Transactions on Power Systems, vol. 31, no. 3, pp. 1879-1888, May 2016.
R. A. Jabr, R. Singh, and B. C. Pal, “Minimum loss network reconfiguration using mixed-integer convex programming,” IEEE Transactions on Power Systems, vol. 27, no. 2, pp. 1106-1115, May 2012.
J. K. Gupta, M. Egorov, and M. Kochenderfer, “Cooperative multi-agent control using deep reinforcement learning,” Proceedings of International Conference on Autonomous Agents and Multiagent Systems, vol. 10642, pp. 66-83, Nov. 2017.
S. Fujimoto, H. Hoof, and D. Meger, “Addressing function approximation error in actor-critic methods,” in Proceedings of International Conference on Machine Learning, Stockholm, Sweden, Jun. 2018, pp. 1587-1596.
S. Gronauer and K. Diepold, “Multi-agent deep reinforcement learning: a survey,” Artificial Intelligence Review, vol. 55, no. 2, pp. 895-943, Apr. 2022.
D. Qiu, Y. Ye, D. Papadaskalopoulos et al., “Scalable coordinated management of peer-to-peer energy trading: a multi-cluster deep reinforcement learning approach,” Applied Energy, vol. 292, p. 116940, Jun. 2021.
N. Yang, B. Ding, P. Shi et al., “Improving scalability of multi-agent reinforcement learning with parameters sharing,” in Proceedings of 2022 IEEE International Conference on Joint Cloud Computing (JCC), Fremont, USA, Aug. 2022, pp. 37-42.
H. Liu and W. Wu, “Online multi-agent reinforcement learning for decentralized inverter-based volt-var control,” IEEE Transactions on Smart Grid, vol. 12, no. 4, pp. 2980-2990, Jul. 2021.
X. Liu, S. Li, and J. Zhu, “Optimal coordination for multiple network-constrained VPPs via multi-agent deep reinforcement learning,” IEEE Transactions on Smart Grid, vol. 14, no. 4, pp. 3016-3031, Jul. 2023.
Y. Tao, J. Qiu, S. Lai et al., “A data-driven agent-based planning strategy of fast-charging stations for electric vehicles,” IEEE Transactions on Sustainable Energy, vol. 14, no. 3, pp. 1357-1369, Jul. 2023.
A. Tampuu, T. Matiisen, D. Kodelja et al. (2015, Nov.). Multiagent cooperation and competition with deep reinforcement learning. [Online]. Available: https://arxiv.org/abs/1511.08779
R. Lowe, Y. Wu, A. Tamar et al. (2017, Jun.). Multi-agent actor-critic for mixed cooperative-competitive environments. [Online]. Available: https://arxiv.org/abs/1706.02275
J. Ackermann, V. Gabler, T. Osa et al. (2019, Oct.). Reducing overestimation bias in multi-agent domains using double centralized critics. [Online]. Available: https://arxiv.org/abs/1910.01465
D. Qiu, J. Wang, Z. Dong et al., “Mean-field multi-agent reinforcement learning for peer-to-peer multi-energy trading,” IEEE Transactions on Power Systems, vol. 38, no. 5, pp. 4853-4866, Sept. 2023.
T. Schaul, J. Quan, I. Antonoglou et al. (2015, Nov.). Prioritized experience replay. [Online]. Available: https://arxiv.org/abs/1511.05952
R. D. Zimmerman, C. E. Murillo-Sánchez, and R. J. Thomas, “MATPOWER: steady-state operations, planning, and analysis tools for power systems research and education,” IEEE Transactions on Power Systems, vol. 26, no. 1, pp. 12-19, Feb. 2011.
D. Zhang, Z. Fu, and L. Zhang, “An improved TS algorithm for loss-minimum reconfiguration in large-scale distribution systems,” Electric Power Systems Research, vol. 77, no. 5-6, pp. 685-694, Apr. 2007.
W. H. Kersting, “Radial distribution test feeders,” IEEE Transactions on Power Systems, vol. 6, no. 3, pp. 975-985, Aug. 1991.
R. R. Nejad and W. Sun, “Distributed load restoration in unbalanced active distribution systems,” IEEE Transactions on Smart Grid, vol. 10, no. 5, pp. 5759-5769, Sept. 2019.