Abstract
In the context of large-scale photovoltaic integration, flexibility scheduling is essential to ensure the secure and efficient operation of distribution networks (DNs). Recently, deep reinforcement learning (DRL) has been widely applied to scheduling problems. However, most methods neglect the vulnerability of DRL to state adversarial attacks such as load redistribution attacks, which significantly undermines its security and reliability. To this end, a flexibility scheduling method based on robust graph DRL (RoGDRL) is proposed. A flexibility gain improvement model accounting for temperature-dependent resistance is first formulated, which treats weather factors as additional variables to enhance the precision of flexibility analysis. On this basis, a state-adversarial two-player zero-sum Markov game (SA-TZMG) model is proposed, which converts the robust DRL scheduling problem into a Nash equilibrium problem. The proposed SA-TZMG model considers the physical constraints of state attacks, guaranteeing the maximal flexibility gain for the defender when confronted with the most sophisticated and stealthy attacker. A two-stage RoGDRL algorithm is then proposed, which introduces the graph sample and aggregate (GraphSAGE) driven soft actor-critic to capture the complex features of node neighborhoods and their properties via inductive learning, thereby solving the Nash equilibrium policies more efficiently. Simulations based on the modified IEEE 123-bus system demonstrate the efficacy of the proposed method.
OPERATIONAL flexibility denotes the capacity of the power system to maintain safe and efficient operation, which is critical for distribution networks (DNs) with high penetration rates of photovoltaic (PV) [
Recently, many research studies on flexibility scheduling have emerged [
The primary controllable resources for enhancing operational flexibility include energy storage systems (ESSs) [
Flexibility scheduling is a typical mixed-integer nonlinear programming problem. The most common methods for solving such problems include heuristic algorithms and mathematical programming methods such as mixed-integer second-order cone programming (MISOCP) and linearized approximation programming (LAP). However, these methods face three primary challenges. ① The solution quality of heuristic algorithms cannot be guaranteed, and they typically require extensive computation [
Although the DRL is increasingly being used to address complex DN scheduling challenges, its weaknesses are becoming more apparent. DRL is notably susceptible to disturbances from adversarial noise, with the neural network policies of DRL being highly vulnerable to state adversarial attacks [
To enhance the robustness of DRL models, the adversarial DRL framework has been proposed to identify and adapt to potential adversarial attacks. Specifically, adversarial attacks are modeled as attack agents that participate in the training process of the defense agent (i.e., the robust DRL model). By integrating strategies such as adversarial training and noise injection, this framework strengthens the resistance of the DRL model to input perturbations [
Despite these advancements, the security and robustness of the DRL against state adversarial attacks in the optimal DN scheduling remain underexplored. Furthermore, the current robust adversarial DRL framework typically allows the attack agent to arbitrarily modify state values within a specified range, which is impractical. This is because, in power systems, adversarial attacks that fail to adhere to physical characteristics and constraints are easily detected by bad data detection (BDD) mechanisms and, thereby, would not be considered for further decision-making [
To address the aforementioned challenges, this paper proposes a flexibility scheduling method based on robust GDRL (RoGDRL) to enhance the robustness of the DRL-driven scheduling system against state adversarial attacks. Initially, a mathematical model of flexibility scheduling accounting for temperature-dependent resistance is constructed, which improves the operational flexibility and economic efficiency by coordinating various flexibility resources such as tie switches, ESSs, soft open points (SOPs), PVs, and static var compensators (SVCs). Subsequently, a novel state-adversarial two-player zero-sum Markov game (SA-TZMG) model for flexibility scheduling is proposed. This model frames the challenge of DRL-based scheduling under state adversarial attacks as a Nash equilibrium problem. Then, a two-stage RoGDRL algorithm with an alternating adversarial training framework is developed to solve the game model. The proposed algorithm utilizes a graph sample and aggregate driven soft actor-critic (SAGESAC) agent to extract feature representations from graph-structured states. The graph sample and aggregate (GraphSAGE) [
1) The proposed method integrates weather variables and employs a steady-state thermal balance function to accurately assess temperature-dependent resistance. This enhances the accuracy of flexibility gain evaluation and the reliability of scheduling decisions.
2) The proposed SA-TZMG model incorporates realistic physical constraints of state attacks, enabling an attacker to generate LR samples that evade the BDD mechanism. This ensures that the defender can make informed decisions to mitigate the impact of more stealthy state adversarial attacks.
3) By combining SAGESAC with an alternating adversarial training framework, the proposed two-stage RoGDRL algorithm demonstrates exceptional robustness against state adversarial attacks and is highly competitive in enhancing operational flexibility compared with existing DRL algorithms.
The remainder of this paper is organized as follows. The mathematical model of flexibility gain improvement is formulated and converted into an SA-TZMG in Sections II and III, respectively. A novel two-stage RoGDRL algorithm is formulated in Section IV. Case study results are presented in Section V. The conclusions are shown in Section VI.
To address the flexibility scheduling problem in DNs, a flexibility gain indicator that encompasses node flexibility, branch transfer flexibility, and economic efficiency is first proposed. The flexibility gain during period t can be expressed as:
ft = fN,t + fB,t + fC,t (1)
where fN,t is the node flexibility gain; fB,t is the branch transfer flexibility gain; and fC,t is the cost flexibility gain. The flexibility gain metric evaluates improvements in operational flexibility from multiple perspectives following the implementation of scheduling strategies. The sub-indicators for node, branch transfer, and cost flexibility gains share uniform dimensionality, allowing their aggregation into a comprehensive index through summation. A detailed description is provided below.
Node flexibility, which reflects the local states of flexibility demand and supply [
(2)
where and are the per-unit voltage values at node i during period t before and after the control, respectively; is the node set; and is a small constant, which is set to be 1
Branch transfer flexibility signifies the ability of the DN to relocate local flexibility for spatial-temporal balancing [
(3)
where I and I are the currents of branch ij during period t before and after the control, respectively; is the carrying capacity of branch ij; and is the branch set.
Node and branch transfer flexibilities form the foundational framework for quantitative analysis of DN flexibility [
(4)
where is the electricity price; is the generation revenue of PV; is the switching action cost; and are the resistances of branch ij during period t before and after the control, respectively; is the number of actions for discrete devices; P is the power loss of converter i of the SOP during period t; P and P are the PV power curtailments during period t before and after the control, respectively; and is the SOP set.
Enhancing the cost flexibility gain can ensure low-cost operation while mitigating PV power curtailment, thereby improving the utilization rate of PVs. This strengthens the investment incentives of distribution system operators (DSOs) and PV investors for further large-scale PV deployment and promotes shared interests among multiple stakeholders.
In this study, the relationships between the three sub-indicators of flexibility gain and the operational flexibility are outlined as follows.
1) Node flexibility gain quantitatively reflects the reduction in voltage deviation. A larger fN,t indicates a smaller node voltage deviation after implementing scheduling strategies. Nodes with sufficient flexibility can support the reduction of the voltage deviation to the desired range [
2) The branch transfer flexibility gain quantitatively reflects the proportion of reduction in branch loading rate. A larger fB,t indicates a lower branch loading rate after implementing scheduling strategies. The lower the loading rate, the greater the remaining capacity of the branch available for transfer flexibility [
3) The cost flexibility gain quantitatively reflects the proportion of reduction in the system operation cost. A larger fC,t indicates a lower operation cost after implementing scheduling strategies. In DNs, utilizing or enhancing the operational flexibility incurs certain costs, which represent the value attribute of flexibility and are crucial for evaluating the availability and usability of scheduling strategies [
Moreover, from the perspective of DN operational performance, improving the flexibility gain can reduce the energy loss cost and PV power curtailment, enhance the voltage stability, and mitigate the branch congestion. Traditionally, achieving these objectives simultaneously often requires a multi-objective optimization framework. However, the proposed flexibility gain indicator simplifies this into a single-objective optimization problem, diminishing the complexity inherent in the multi-objective scheduling problem.
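For illustration only, the following sketch computes the three sub-gains as relative reductions in voltage deviation, average branch loading, and operation cost before and after a scheduling action, and sums them as in (1). The normalizations are simplified stand-ins for (2)-(4), and all numerical values are assumed.

```python
import numpy as np

def flexibility_gain(v_before, v_after, i_before, i_after, i_rating,
                     cost_before, cost_after, eps=1e-6):
    """Illustrative stand-ins for (1)-(4): each sub-gain is the normalized
    reduction achieved by the scheduling strategy (larger is better)."""
    # Node flexibility gain: reduction of total voltage deviation from 1.0 p.u.
    dev_before = np.abs(v_before - 1.0).sum()
    dev_after = np.abs(v_after - 1.0).sum()
    f_node = (dev_before - dev_after) / (dev_before + eps)

    # Branch transfer flexibility gain: reduction of the average loading rate.
    load_before = (i_before / i_rating).mean()
    load_after = (i_after / i_rating).mean()
    f_branch = (load_before - load_after) / (load_before + eps)

    # Cost flexibility gain: reduction of the operation cost.
    f_cost = (cost_before - cost_after) / (cost_before + eps)

    return f_node + f_branch + f_cost          # comprehensive index, cf. (1)

v0 = np.array([0.95, 1.04, 0.93]); v1 = np.array([0.98, 1.01, 0.99])
i0 = np.array([300.0, 420.0]);     i1 = np.array([260.0, 380.0])
rating = np.array([500.0, 500.0])
print(flexibility_gain(v0, v1, i0, i1, rating, cost_before=800.0, cost_after=650.0))
```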
The accuracy of flexibility gain calculations depends on the precision of power flow analysis. However, many existing scheduling methods overlook the characteristics of dynamic branch resistance changes during power flow analysis. This is because the resistance of metallic conductors changes with their temperature. Specifically, the relationship is governed by:
(5a)
(5b)
where rij,t is the resistance of branch ij during period t; r is the resistance of branch ij in the reference temperature ; Tij,t is the conductor temperature of branch ij during period t; and is the temperature constant.
The branch temperature is determined by weather factors and branch current, adhering to the steady-state heat balance equation specified in IEEE Std 738-2012 [
(6)
where q and q are the heat dissipations via air convection and surface radiation of branch ij, respectively; q is the calorific value of branch ij from the solar radiation; is the current of branch ij during period t; is the solar radiation level; D is the diameter of the branch; is the solar absorptivity; is the emissivity of the conductor; is the ambient temperature; is the projected area of the conductor per unit length of branch ij; is the solar radiation angle; and q, q, and la are the convection heat loss coefficients. The details can be referred to in [
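As a worked illustration of (5) and (6), the sketch below solves a simplified steady-state heat balance for the conductor temperature with a scalar root finder and then updates the branch resistance through the common linear temperature law. The convection and radiation terms, the coefficient names (wind_h, emissivity, absorptivity), and all parameter values are simplified assumptions rather than the full IEEE Std 738-2012 correlations.

```python
import numpy as np
from scipy.optimize import brentq

STEFAN_BOLTZMANN = 5.670e-8  # W/(m^2*K^4)

def resistance(temp_c, r_ref, t_ref=20.0, alpha=0.00393):
    """Linear temperature dependence of conductor resistance, cf. (5).
    alpha is an assumed temperature coefficient (aluminium/copper range)."""
    return r_ref * (1.0 + alpha * (temp_c - t_ref))

def heat_balance(temp_c, current, r_ref, ambient_c, wind_h, diameter,
                 emissivity, absorptivity, solar_wm2):
    """Residual of a simplified steady-state balance, cf. (6):
    convection + radiation losses - solar gain - Joule heating."""
    area = np.pi * diameter                        # surface per unit length (m^2/m)
    q_conv = wind_h * area * (temp_c - ambient_c)  # convection loss (simplified)
    q_rad = emissivity * STEFAN_BOLTZMANN * area * (
        (temp_c + 273.15) ** 4 - (ambient_c + 273.15) ** 4)   # radiation loss
    q_sun = absorptivity * solar_wm2 * diameter                # solar gain
    q_joule = current ** 2 * resistance(temp_c, r_ref)         # I^2 r(T)
    return q_conv + q_rad - q_sun - q_joule

def branch_temperature(current, r_ref, ambient_c, **kw):
    """Solve (6) for the conductor temperature by bracketing the root."""
    return brentq(heat_balance, ambient_c - 10.0, ambient_c + 200.0,
                  args=(current, r_ref, ambient_c, kw["wind_h"], kw["diameter"],
                        kw["emissivity"], kw["absorptivity"], kw["solar_wm2"]))

# Example: 400 A on a conductor with 0.1 ohm/km (1e-4 ohm/m) reference resistance
t_c = branch_temperature(400.0, 1e-4, ambient_c=30.0, wind_h=20.0, diameter=0.02,
                         emissivity=0.8, absorptivity=0.8, solar_wm2=900.0)
print(t_c, resistance(t_c, 1e-4))
```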
The flexibility gain improvement model considering temperature-dependent resistance can be written as:
(7)
where and are the vectors of equality and inequality constraint equations, encompassing power balance, operational safety, equipment operational limits, and temperature-dependent resistance constraints [
Problem (7) can be transformed into a Markov decision process (MDP) and addressed using DRL algorithms. However, state adversarial attacks pose a substantial risk, as they can disturb the inputs to the DRL model and thereby alter its decisions and degrade the results. Load redistribution (LR) attacks are a common type of state adversarial attack in power systems.
Referring to [
(8a)
(8b)
(8c)
where and are the power injections of false loads and PVs, respectively; is the false branch power measurement injection; SF and KD are the shifting factor and load incidence matrices, respectively; S and S are the load demand and PV output matrices, respectively; and S is the PV capacity matrix.
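To make the LR mechanism in (8) concrete, the sketch below constructs a zero-sum load redistribution bounded by ±30% of each load and regenerates the branch-flow measurements from the same shift-factor and load-incidence matrices, so the forged measurement set remains consistent with the DC model on which a residual-based BDD test relies. The network matrices and load values are illustrative assumptions, not the paper's test system.

```python
import numpy as np

rng = np.random.default_rng(0)

n_bus, n_branch, n_load = 6, 8, 4
SF = rng.normal(size=(n_branch, n_bus))        # shift-factor matrix (illustrative)
KD = np.zeros((n_bus, n_load))                 # load-incidence matrix
for k, bus in enumerate([1, 2, 4, 5]):         # load k connected at bus "bus"
    KD[bus, k] = 1.0

load = np.array([1.2, 0.8, 1.5, 0.9])          # true load measurements (MW, assumed)
true_flow = SF @ (-KD @ load)                  # DC branch flows from load withdrawals

# Load redistribution: zero-sum perturbation bounded by +/-30% of each load, cf. (8).
pattern = rng.uniform(-1.0, 1.0, size=n_load)
pattern -= pattern.mean()                      # sum of false load changes is zero
scale = 0.3 * np.min(load / np.maximum(np.abs(pattern), 1e-9))
false_load = load + scale * pattern
assert np.all(np.abs(false_load - load) <= 0.3 * load + 1e-9)

# Forge branch-flow measurements with the same DC model, so the measurement set
# stays self-consistent and a residual-based BDD test cannot distinguish it.
false_flow = SF @ (-KD @ false_load)
print("net load change:", round(false_load.sum() - load.sum(), 9))   # ~0.0
print("injected flow error:", np.round(false_flow - true_flow, 3))   # nonzero attack
```

Because the forged flows are regenerated from the same linear model used by the state estimator, the measurement residual is unchanged, which is precisely why constraint-compliant LR attacks can evade BDD.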
In the robust adversarial framework [
The MDP is a mathematical framework that provides a formalism for modeling sequential decision-making problems. The TZMG can be considered an extension of the MDP to game-theoretic scenarios. It consists of a six-tuple as follows:
(9)
where S is the game state space;
Notably, in the conventional TZMG model, actions of both defender and attacker influence the environment simultaneously, leading to the generation of rewards [
Fig. 1 Interaction between players and its effects on environment.
As shown in
1) State space: the state of the DN is usually defined in Euclidean space. However, stacking the features of nodes in a specific order may cause a loss of topology information and dependencies between nodes [
Graph data can be represented as , where is the set of edges; and X is the set of features, including the node feature matrix Xbus and the edge feature matrix Xedge. For the DN, Xedge is the adjacency matrix, and the node features are the system operation states. In summary, the state during period t can be represented as graph data, and the node feature matrix during period t can be written as:
(10)
where the operation state mainly comprises the active power injection of the node , reactive power injection of the node , and weather factors, including wind speed , wind angle , temperature , and solar irradiance ; and n and d are the numbers of rows and columns of the feature matrix, respectively.
Remark: currently, the primary method for collecting weather information is the automatic weather station (AWS). It can transmit weather data via wired or wireless communication and is highly automated, accurate, and reliable [
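A minimal sketch of how the graph-structured state in (10) could be assembled is given below, assuming six node features (active and reactive injections plus the four weather variables) and an adjacency matrix built from the branch list; the tensor names and dimensions are illustrative.

```python
import torch

def build_graph_state(p_inj, q_inj, wind_speed, wind_angle, temp, irradiance,
                      edges, n_bus):
    """Stack node operation features into X_bus (n x d) and an adjacency matrix
    X_edge, matching the graph state of (10). Inputs are 1-D tensors of length
    n_bus, except `edges`, a list of (i, j) branch pairs."""
    x_bus = torch.stack([p_inj, q_inj, wind_speed, wind_angle, temp, irradiance], dim=1)
    x_edge = torch.zeros(n_bus, n_bus)
    for i, j in edges:
        x_edge[i, j] = x_edge[j, i] = 1.0      # undirected branch
    return x_bus, x_edge

n = 4
p = torch.tensor([0.0, -0.3, -0.5, 0.2]); q = torch.tensor([0.0, -0.1, -0.2, 0.0])
ws = torch.full((n,), 3.2); wa = torch.full((n,), 90.0)
tc = torch.full((n,), 28.0); ir = torch.full((n,), 650.0)
x_bus, x_edge = build_graph_state(p, q, ws, wa, tc, ir, [(0, 1), (1, 2), (1, 3)], n)
print(x_bus.shape, x_edge.shape)               # torch.Size([4, 6]) torch.Size([4, 4])
```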
2) Defender’s action: the defender’s action
(11)
where a, a, are the control actions for PV, SVC, and ESS, respectively, detailed in [
3) Attacker’s action: the attacker’s action at each time step, which is a continuous variable, is defined as , where aatt,load,t and aatt,pv,t are the attack actions targeting the load and PV state variables, respectively. This study sets to be 0.3 [
(12)
Diverging from existing adversarial attack modeling approaches, this study further constrains the attacker’s actual executed actions to ensure compliance with the LR mechanism as:
(13)
where is the operation of calculating the mean value; and is a Boolean variable. If attacker’s action satisfies the actual physical constraints in (8), ; otherwise, .
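One plausible reading of (12) and (13) is sketched below: the raw attack is clipped to the ±0.3 bound, and the Boolean flag zeroes the executed action whenever the net-load-preserving and PV-capacity conditions of (8) are violated. The tolerance, variable names, and numerical values are assumptions.

```python
import numpy as np

def execute_attack(a_load, a_pv, load, pv, pv_cap, eps=0.3, tol=1e-3):
    """Clip the raw attack to +/-eps and apply the feasibility flag of (13)."""
    a_load = np.clip(a_load, -eps, eps)
    a_pv = np.clip(a_pv, -eps, eps)
    false_load = load * (1.0 + a_load)
    false_pv = pv * (1.0 + a_pv)
    feasible = (
        abs((false_load - load).mean()) < tol   # net (mean) load preserved, cf. (8)
        and np.all(false_pv <= pv_cap)          # forged PV within installed capacity
        and np.all(false_pv >= 0.0)
    )
    flag = 1.0 if feasible else 0.0             # Boolean variable in (13)
    return flag * np.concatenate([false_load - load, false_pv - pv])

load = np.array([1.0, 1.4, 0.8]); pv = np.array([0.5, 0.9, 0.3])
cap = np.array([1.0, 1.2, 0.6])
raw_load_attack = np.array([0.22, -0.1, -0.1])  # zero-net-change load attack
print(execute_attack(raw_load_attack, np.array([0.1, 0.1, 0.1]), load, pv, cap))
```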
4) Reward: the reward in SA-TZMG is one of the critical factors determining the game result. It is represented as the immediate reward provided by the environment when the defender acts
(14)
where if A satisfies the operational security constraints, otherwise, ; and is a negative constant.
Reference [
(15)
where is the expected cumulative reward function; T is the total number of optimization periods; and is the expected value function. Traditional DRL algorithms mainly focus on solving single-agent MDP and are not designed to directly address max-min problems [
In TZMG, once one player’s policy is fixed, the max-min problem becomes a single-agent MDP, and a deterministic policy is sufficient to achieve optimality [
In Stage I, the training of the attacker aims to ascertain the optimal attacker’s policy , keeping the defender’s policy fixed and aiming to minimize the defender’s cumulative reward. It is framed as a constrained minimization problem:
(16)
It is essential to highlight that the defender's policy used at this stage is the pre-trained one. The rationale behind pre-training without the influence of adversarial attacks lies in its effectiveness in establishing an optimal initial exploration strategy for the attacker [
In Stage II, the defender focuses on augmenting its resilience to state attacks by developing a robust defense policy, with the attacker’s policy held constant. This stage is characterized as a constrained maximization problem:
(17)
where Ŝt is the perturbed state after the false data injection attack (FDIA). The policy learning tasks for both stages can be addressed by employing the proposed SAGESAC. The specific algorithm for these two stages will be detailed below.
At this stage, the state attack signals generated by the attacker are continuous variables. Thus, the soft actor-critic (SAC) is adopted as the foundational framework to learn . The objective can be expressed as a maximization problem with a negative reward function:
(18)
where is the policy entropy, which is a measure of the randomness in the action selection of the policy, encouraging exploration by penalizing certainty in action choices; denotes the average calculation under all possible actions ; is the temperature parameter used to balance the relationship between the policy entropy and expected rewards; and is the negative value of the logarithmic probability of action given the state .
To exploit the critical attributes of the observation, this study introduces GraphSAGE as a feature extractor to efficiently encapsulate the characteristics of graph-structured states. The traditional feature extractor, the graph convolutional network (GCN), utilizes a transductive learning approach that requires static graph structures [
1) Sampling from neighboring nodes. For each node i, a fixed number of neighboring nodes are randomly sampled from its neighboring node set , thereby reducing the number of neighboring nodes to be processed and, consequently, the computational complexity of the model.
2) Aggregating features from neighboring nodes. Specific aggregators are used to aggregate features of the sampled neighboring nodes, obtaining a comprehensive representation of neighborhood features.
(19)
where AGG denotes the aggregation operation, and in practice, it can be various aggregators such as mean aggregator; is the
3) Updating feature representation of node. is combined with the feature of the current node (e.g., through concatenation). The feature representation of the node is then updated using a neural network layer (e.g., a fully connected layer).
(20)
where wk and bk are the learnable coefficient matrices; is the ReLU activation function; and denotes the concatenation operation. In the subsequent stage of the GraphSAGE process, the newly generated features will be utilized. Following K aggregation layers, a feature vector , is produced as the output.
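A minimal PyTorch sketch of the sample-aggregate-update steps in (19) and (20) is given below, using a mean aggregator, concatenation, and a ReLU-activated linear layer; the layer sizes, the fixed sample count, and the toy graph are illustrative.

```python
import random
import torch
import torch.nn as nn

class SageLayer(nn.Module):
    """One GraphSAGE layer: sample neighbors, mean-aggregate, concatenate, update."""
    def __init__(self, in_dim, out_dim, num_samples=5):
        super().__init__()
        self.num_samples = num_samples
        self.linear = nn.Linear(2 * in_dim, out_dim)    # acts on [h_i || h_N(i)], cf. (20)

    def forward(self, h, neighbors):
        # h: (n, in_dim) node features; neighbors: list of neighbor-index lists
        agg = torch.zeros_like(h)
        for i, nbrs in enumerate(neighbors):
            if nbrs:                                    # step 1: fixed-size random sample
                sampled = random.sample(nbrs, min(self.num_samples, len(nbrs)))
                agg[i] = h[sampled].mean(dim=0)         # step 2: mean aggregation, cf. (19)
        out = self.linear(torch.cat([h, agg], dim=1))   # step 3: update, cf. (20)
        return torch.relu(out)

h = torch.randn(5, 6)                                   # 5 nodes, 6 features each
nbrs = [[1], [0, 2, 3], [1, 4], [1], [2]]
layer1, layer2 = SageLayer(6, 16), SageLayer(16, 16)    # K = 2 aggregation layers
z = layer2(layer1(h, nbrs), nbrs)                       # final node embeddings
print(z.shape)                                          # torch.Size([5, 16])
```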
As shown in
Fig. 2 Feature extraction process driven by GraphSAGE.
The output of the graph policy network can be parameterized by a Gaussian distribution , which can be denoted as:
(21)
where and are the mean and variance of the action for the policy network, respectively. Then, the mini-batch data from the replay buffer are sampled, and θ is typically updated using gradient descent. The loss function is expressed as:
(22)
(23)
where is the loss function of the policy network; is the differential form of ; is the learning rate; denotes the average calculation under all possible states and actions ; is the replay buffer; and is the Q-function value, which is parameterized by the graph critic network. The Q-function updates the parameters via the following method:
(24)
where is the loss function of the Q-function; denotes the average calculation under all possible states and actions ; denotes the average calculation under all possible actions ; and is the target Q-function value. Periodically, the parameters of the critic network are copied to the target critic network to stabilize learning.
(25)
(26)
where and are the sets of parameters for the critic and target critic networks, respectively; is the differential form of ; and is the soft update coefficient. Following the training framework of the naive SAC, the attacker can learn an attacker’s policy to achieve the worst-case performance under a limited state attack.
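A condensed sketch of the critic update described by (24)-(26) is given below: the soft Bellman target combines the target critic with the entropy term, and the target network tracks the critic by Polyak averaging. The tiny Q-network, the stand-in Gaussian policy, and all hyperparameters are placeholders rather than the paper's SAGESAC networks.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Critic(nn.Module):
    """Tiny Q-network placeholder: Q(s, a) -> scalar."""
    def __init__(self, s_dim, a_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(s_dim + a_dim, 64), nn.ReLU(), nn.Linear(64, 1))
    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))

def sample_action(s, a_dim=2):
    """Stand-in for the graph policy network: Gaussian action and its log-prob."""
    dist = torch.distributions.Normal(torch.zeros(s.shape[0], a_dim),
                                      torch.ones(s.shape[0], a_dim))
    a = dist.sample()
    return a, dist.log_prob(a).sum(-1, keepdim=True)

def critic_step(critic, target_critic, batch, q_optim, gamma=0.99, alpha=0.2, tau=0.005):
    s, a, r, s_next, done = batch
    with torch.no_grad():                                      # soft Bellman target, cf. (24)
        a_next, logp_next = sample_action(s_next)
        target = r + gamma * (1 - done) * (target_critic(s_next, a_next) - alpha * logp_next)
    loss = F.mse_loss(critic(s, a), target)
    q_optim.zero_grad(); loss.backward(); q_optim.step()
    for p, pt in zip(critic.parameters(), target_critic.parameters()):
        pt.data.mul_(1 - tau).add_(tau * p.data)               # Polyak update, cf. (25)-(26)
    return loss.item()

s_dim, a_dim, n = 6, 2, 32
critic, target = Critic(s_dim, a_dim), Critic(s_dim, a_dim)
target.load_state_dict(critic.state_dict())
opt = torch.optim.Adam(critic.parameters(), lr=3e-4)
batch = (torch.randn(n, s_dim), torch.randn(n, a_dim), torch.randn(n, 1),
         torch.randn(n, s_dim), torch.zeros(n, 1))
print(critic_step(critic, target, batch, opt))
```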
This stage addresses a robust policy learning problem in a hybrid discrete-continuous action space. It is worth noting that robust policy learning aims to enhance decision-making resilience under state adversarial attacks. Thus, the perturbed state Ŝt is used to train the neural network, while the reward Rt obtained during the training process is the flexibility gain returned by the DN environment when the defender's action is executed under the clean state St.
The SAGESAC is extended by introducing two parallel graph policy networks to address policy learning issues in mixed discrete-continuous action spaces. One is designated for generating discrete actions, and the other is for generating continuous actions. Its objective function is still to maximize the sum of expected rewards and policy entropy. However, it uniquely accounts for the policy entropy of discrete and continuous actions. The objective function can be expressed as:
(27)
where is the discrete action policy network, employing the Gumbel-Softmax function for selecting discrete actions Ad,t [
(28)
where denotes the average calculation under all possible states and actions ; and is the Q-function value for robust policy learning.
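The two parallel policy heads can be sketched as follows, assuming a Gumbel-Softmax head for the discrete devices and a tanh-squashed Gaussian head for the continuous devices; the shared trunk, dimensions, and temperature are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HybridPolicy(nn.Module):
    """Two parallel heads on a shared embedding: discrete (Gumbel-Softmax) and
    continuous (squashed Gaussian), matching the hybrid action space of (27)."""
    def __init__(self, feat_dim, n_discrete, n_continuous):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(feat_dim, 64), nn.ReLU())
        self.logits = nn.Linear(64, n_discrete)          # discrete head
        self.mu = nn.Linear(64, n_continuous)            # continuous head (mean)
        self.log_std = nn.Linear(64, n_continuous)       # continuous head (log std)

    def forward(self, z, temperature=1.0):
        h = self.trunk(z)
        # Differentiable one-hot sample for discrete devices (e.g., tie switches).
        a_disc = F.gumbel_softmax(self.logits(h), tau=temperature, hard=True)
        # Reparameterized Gaussian for continuous devices (SOP, SVC, ESS, PV).
        std = self.log_std(h).clamp(-5, 2).exp()
        dist = torch.distributions.Normal(self.mu(h), std)
        a_cont = torch.tanh(dist.rsample())
        return a_disc, a_cont

policy = HybridPolicy(feat_dim=16, n_discrete=4, n_continuous=6)
z = torch.randn(8, 16)                                   # GraphSAGE embeddings (batch of 8)
a_d, a_c = policy(z)
print(a_d.shape, a_c.shape)                              # torch.Size([8, 4]) torch.Size([8, 6])
```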
The gradient descent method is also employed to optimize the loss functions of the policy network, aiming to learn the optimal parameters as follows:
(29)
where and are the differential forms of and , respectively.
The critic network is updated by minimizing the following loss function :
(30)
(31)
(32)
where denotes the average calculation under all possible states and actions ; denotes the average calculation under all possible actions ; is the differential form of ; is the target Q-function value for robust policy learning; and and are the sets of parameters for the critic network and target critic network, respectively. A robust scheduling policy under the attacker’s policy can be obtained by optimizing and .
Following the pre-training of the defender's policy, a two-stage alternate training sequence is initiated. In the first stage, the pre-trained defender's parameters are held constant while the attacker's parameters are optimized to learn the attacker's policy. After completing C1 training episodes, the process shifts to optimizing the defender's parameters, keeping the attacker's parameters static, to develop a robust defense policy. After training the defender for C2 episodes, this cycle is repeated. This iterative training strategy ensures continuous improvement and adaptation of both agents. The alternate training process of RoGDRL is shown in Algorithm 1.
Algorithm 1: alternate training process of RoGDRL

Input: number of alternate periods C, and numbers of episodes C1 and C2 for training Stages I and II
Output: parameters of attacker and defender
for alternate period c = 1, 2, …, C do
  Get the optimal defender policy
  for episode e = 1, 2, …, C1 do
    for t = 1, 2, …, T do
      Output the attacker's action with the defender's policy fixed
      Calculate the reward Rt
      Store the transition in the buffer and update the attacker's parameters
    end for
  end for
  for episode e = 1, 2, …, C2 do
    for t = 1, 2, …, T do
      Output the defender's action with the attacker's policy fixed
      Calculate the reward Rt
      Store the transition in the buffer and update the defender's parameters
    end for
  end for
end for
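The skeleton below mirrors Algorithm 1: after pre-training the defender, the two agents are trained alternately for C cycles, with C1 attacker episodes against the frozen defender and C2 defender episodes against the frozen attacker. The environment and agent interfaces (env.perturb, env.step, act, update) are assumed placeholders, not the paper's implementation.

```python
def train_rogdrl(env, attacker, defender, C=10, C1=50, C2=50, T=24):
    """Alternate adversarial training skeleton following Algorithm 1."""
    defender.pretrain(env)                            # pre-training in a clean environment

    for cycle in range(C):
        # Stage I: learn the attacker's policy against the frozen defender.
        for _ in range(C1):
            s = env.reset()
            for t in range(T):
                a_att = attacker.act(s)                       # continuous state attack
                a_def = defender.act(env.perturb(s, a_att))   # defender sees perturbed state
                s_next, reward = env.step(a_def)              # flexibility gain on clean state
                attacker.update(s, a_att, -reward, s_next)    # attacker minimizes the reward
                s = s_next

        # Stage II: learn a robust defender policy against the frozen attacker.
        for _ in range(C2):
            s = env.reset()
            s_hat = env.perturb(s, attacker.act(s))
            for t in range(T):
                a_def = defender.act(s_hat)                   # decision on the perturbed state
                s_next, reward = env.step(a_def)              # reward from the clean state
                s_hat_next = env.perturb(s_next, attacker.act(s_next))
                defender.update(s_hat, a_def, reward, s_hat_next)
                s, s_hat = s_next, s_hat_next
    return attacker, defender
```

In the full algorithm, both agents are SAGESAC instances, and the defender's updates use the perturbed states Ŝt while the rewards are computed from the clean states, as described above.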
This paper uses load and PV data from Jinan, China in 2020 to generate a large number of load and PV profiles. Then, the resultant load and PV instances are normalized to match the scale of power demands in the simulated system to train the proposed method. The weather data of the region in 2020 are from Solcast [
Fig. 3 Modified IEEE 123-bus system.
The proposed algorithm is compared with the existing GDRL algorithms, including graph attention soft actor-critic (GATSAC) [
Fig. 4 Cumulative reward curves for different algorithms.
Furthermore, the cumulative reward curves of the proposed algorithm demonstrate significant oscillations during the adversarial training process. At the stage where the attack strategy is being learned, a decrease in the cumulative reward curve indicates that the attacker’s adversarial strategy has successfully disrupted the defender’s decision-making process. On the contrary, an increase in the cumulative reward curve indicates that the defender is learning how to effectively counteract the attacker’s strategy, thus progressively enhancing the quality of its decision-making. This illustrates that the defender adjusts its strategy in response to adversarial challenges to maximize long-term rewards. As time progresses, the cumulative reward curve tends to stabilize, implying that the proposed algorithm becomes increasingly efficient and robust in counteracting the impacts of state adversarial attacks through adversarial training.
Notably, the final reward performance of the proposed algorithm, which operates in an adversarial training environment, is lower than that of SAGESAC and GATSAC, both of which operate in a clean environment, free from adversarial attacks. This difference arises because the proposed algorithm is designed to address the max-min problem, as described in (15), rather than solely maximizing rewards, in contrast to SAGESAC and GATSAC. By sacrificing some reward optimality, the proposed algorithm enhances its robustness against state attacks. Although its decision outcomes are not optimal, the proposed algorithm maintains commendable performance stability in the face of state adversarial attacks. This aspect will be explored further in subsequent analyses.
In adversarial training, a powerful and stealthy attacker is crucial, ensuring the defender can achieve optimal rewards in worst-case scenarios [
1) Scenario A: without considering the physical constraints and BDD mechanism.
2) Scenario B: without considering the physical constraints.
3) Scenario C: considering the physical constraints.
The attack vectors generated in Scenarios A and B are consistent, and the only difference is whether BDD is performed to eliminate anomalous attack vectors. The test rewards of three algorithms after encountering these three attack scenarios and the perturbation residual statistics for Scenarios A and C are shown in
Fig. 5 Rewards of three algorithms in different attack scenarios and perturbation residual statistics for scenarios A and C. (a) Rewards. (b) Perturbation residual statistics.
As shown in
The impact of state adversarial attacks on system operational performance is further analyzed through the following five cases.
1) Case 1: naive scheduling without considering attack.
2) Case 2: naive scheduling considering attack.
3) Case 3: robust scheduling without considering attack.
4) Case 4: robust scheduling considering attack.
5) Case 5: without control.
In this context, naive scheduling employs the SAGESAC to output scheduling strategies, while robust scheduling utilizes the proposed algorithm for its strategy output. The optimization results in different cases are shown in
| Case | Flexibility gain | Operation cost (¥) | The maximum voltage deviation | The maximum average loading rate |
|---|---|---|---|---|
| 1 | 30.67 | 615.86 | 0.0588 | 0.4889 |
| 2 | 21.66 | 1511.01 | 0.0793 | 0.5311 |
| 3 | 27.38 | 749.66 | 0.0679 | 0.5233 |
| 4 | 28.55 | 650.86 | 0.0651 | 0.5116 |
| 5 | | 3746.50 | 0.0909 | 0.5594 |
In
Comparison between Case 3 and Case 4 reveals that the proposed algorithm exhibits a 4.09% decrease in flexibility gain in scenarios without state attacks. This indicates that while adversarial training enhances the robustness of the proposed algorithm against state adversarial attacks, it may lead to overfitting of the neural network policy to adversarial features. Such overfitting results in a slight degradation of algorithmic decision-making performance when processing clean state data under normal (attack-free) conditions.
In summary, the proposed algorithm significantly enhances the robustness against state adversarial attacks while still maintaining a relatively high operational flexibility. Although this algorithm sacrifices some decision-making performance, it is justified because naive DRL algorithms can be severely compromised in the presence of state adversarial attacks.
To demonstrate the effectiveness of the proposed method, this subsection first analyzes the flexibility gain on the test day and the corresponding changes in node voltage deviation, average branch loading rate, and operation cost. The operation data of PV and load on the test day, with a time resolution of one hour, are shown in
Fig. 6 Operation data of PV and load on test day.
Fig. 7 Operational performance of DN. (a) Flexibility gain. (b) The maximum voltage deviation. (c) Average branch loading rate. (d) Operation cost.
As shown in
As shown in
In
In
To provide a detailed analysis of how the proposed method enhances the flexibility of the DN,
Fig. 8 Scheduling strategies of different controllable resources. (a) Active power of SOPs. (b) Reactive power of SOPs. (c) Reactive power of SVCs. (d) Active power of ESSs.
As shown in
In
In
In
In summary, the proposed method effectively coordinates various controllable resources by maximizing flexibility gain. This alleviates the spatiotemporal mismatch between PV generation and load demand, thereby enhancing the flexibility of the DN.
To further illustrate the comprehensive enhancement of the operational efficiency of the DN through optimizing flexibility gain, three independent objectives, i.e., the maximum voltage deviation, operation cost, and average branch loading rate, are employed to formulate a multi-objective optimization model. The multi-objective particle swarm optimization (MOPSO) is used to generate the Pareto front, and the technique for order of preference by similarity to ideal solution (TOPSIS) is utilized to determine the optimal compromise solution. The maximum number of iterations is 500, with a population size of 100. Optimization results are shown in
| Time | Model | Operation cost (¥) | The maximum voltage deviation (p.u.) | Average branch loading rate (p.u.) | Test time (s) |
|---|---|---|---|---|---|
| 14 | MOPSO | 78.23±8.21 | 0.062±0.005 | 0.54±0.04 | 731.64 |
| | Proposed | 60.98 | 0.059 | 0.51 | 0.05 |
| 22 | MOPSO | 25.91±1.62 | 0.04±0.002 | 0.31±0.01 | 629.71 |
| | Proposed | 22.14 | 0.035 | 0.22 | 0.05 |
Weather factors and power flow are the key determinants of line resistance. Thus, we analyze the impact of dynamic weather and system power flow on the flexibility gain of the system over a year. In 2020, the varying ranges of air temperature, wind speed, wind direction, and solar radiation in Jinan, China were ℃, 0.1-9.3 m/s, 0°-360°, and 0-1002 J/
Fig. 9 Relative error results of flexibility gain with and without considering weather factors.
This study introduces a flexibility scheduling method for DNs based on RoGDRL. A mathematical model for flexibility scheduling with temperature-dependent resistance constraints is initially constructed. Based on this, an SA-TZMG model is proposed, which enhances the safety and robustness of the flexibility scheduling method. Finally, a two-stage RoGDRL algorithm based on SAGESAC is designed to achieve robust DRL-based flexibility scheduling, employing an alternate training method through alternating attack and defense. Numerical analysis indicates that:
1) Compared with the traditional DRL-based optimization methods, the proposed method demonstrates stronger robustness against state adversarial attacks.
2) Enhancing flexibility gain can comprehensively improve the operational performance of the DN, thereby better adapting to the large-scale integration of PV.
3) Considering temperature-dependent resistance is crucial in the optimization process to accurately model the dynamic changes of the line resistance, significantly impacting the accuracy of decision-making.
There are several directions for future work. Firstly, additional flexibility analysis indicators could be integrated into the flexibility gain to further enhance flexibility scheduling performance. Additionally, the constructed state-adversarial model could be extended based on the Stackelberg game with incomplete information to address information asymmetry between the attacker and defender, considering attack resource constraints.
References
[1] S. Zhang, S. Ge, H. Liu et al., “Region-based flexibility quantification in distribution systems: an analytical approach considering spatio-temporal coupling,” Applied Energy, vol. 355, p. 122175, Feb. 2024.
[2] X. Yang, C. Xu, H. He et al., “Flexibility provisions in active distribution networks with uncertainties,” IEEE Transactions on Sustainable Energy, vol. 12, no. 1, pp. 553-567, Jan. 2021.
[3] M. Rayati, M. Bozorg, R. Cherkaoui et al., “Distributionally robust chance constrained optimization for providing flexibility in an active distribution network,” IEEE Transactions on Smart Grid, vol. 13, no. 4, pp. 2920-2934, Jul. 2022.
[4] H. Ji, C. Wang, P. Li et al., “Quantified analysis method for operational flexibility of active distribution networks with high penetration of distributed generators,” Applied Energy, vol. 239, pp. 706-714, Apr. 2019.
[5] J. Jian, P. Li, H. Ji et al., “DLMP-based quantification and analysis method of operational flexibility in flexible distribution networks,” IEEE Transactions on Sustainable Energy, vol. 13, no. 4, pp. 2353-2369, Oct. 2022.
[6] P. Li, Y. Wang, H. Ji et al., “Operational flexibility of active distribution networks: definition, quantified calculation and application,” International Journal of Electrical Power & Energy Systems, vol. 119, p. 105872, Jul. 2020.
[7] Y. Su and J. Teh, “Two-stage optimal dispatching of AC/DC hybrid active distribution systems considering network flexibility,” Journal of Modern Power Systems and Clean Energy, vol. 11, no. 1, pp. 52-65, Jan. 2023.
[8] IEEE Draft Standard for Calculating the Current-temperature Relationship of Bare Overhead Conductors, IEEE Std P738-2012 Draft 09 (Revision of IEEE Std 738-2006), pp. 1-67, 2012.
[9] C. Rakpenthai and S. Uatrongjit, “Temperature-dependent unbalanced three-phase optimal power flow based on alternating optimizations,” IEEE Transactions on Industrial Informatics, vol. 20, no. 3, pp. 3619-3627, Mar. 2024.
[10] Q. Xing, Z. Chen, T. Zhang et al., “Real-time optimal scheduling for active distribution networks: a graph reinforcement learning method,” International Journal of Electrical Power & Energy Systems, vol. 145, p. 108637, Feb. 2023.
[11] Y. Gao, W. Wang, J. Shi et al., “Batch-constrained reinforcement learning for dynamic distribution network reconfiguration,” IEEE Transactions on Smart Grid, vol. 11, no. 6, pp. 5357-5369, Nov. 2020.
[12] L. Zhang, H. Ye, F. Ding et al., “Increasing PV hosting capacity with an adjustable hybrid power flow model,” IEEE Transactions on Sustainable Energy, vol. 14, no. 1, pp. 409-422, Jan. 2023.
[13] Z. Wu, Y. Li, W. Gu et al., “Multi-timescale voltage control for distribution system based on multi-agent deep reinforcement learning,” International Journal of Electrical Power & Energy Systems, vol. 147, p. 108830, May 2023.
[14] D. Cao, J. Zhao, J. Hu et al., “Physics-informed graphical representation-enabled deep reinforcement learning for robust distribution system voltage control,” IEEE Transactions on Smart Grid, vol. 15, no. 1, pp. 233-246, Jan. 2024.
[15] Y. Zhang, M. Yue, J. Wang et al., “Multi-agent graph-attention deep reinforcement learning for post-contingency grid emergency voltage control,” IEEE Transactions on Neural Networks and Learning Systems, vol. 35, no. 3, pp. 3340-3350, Mar. 2024.
[16] R. Wang, X. Bi, and S. Bu, “Real-time coordination of dynamic network reconfiguration and volt-var control in active distribution network: a graph-aware deep reinforcement learning approach,” IEEE Transactions on Smart Grid, vol. 15, no. 3, pp. 3288-3302, May 2024.
[17] T. Liu, A. Jiang, J. Zhou et al., “GraphSAGE-based dynamic spatial-temporal graph convolutional network for traffic prediction,” IEEE Transactions on Intelligent Transportation Systems, vol. 24, no. 10, pp. 11210-11224, Oct. 2023.
[18] Y. Zheng, Z. Yan, K. Chen et al., “Vulnerability assessment of deep reinforcement learning models for power system topology optimization,” IEEE Transactions on Smart Grid, vol. 12, no. 4, pp. 3613-3623, Jul. 2021.
[19] I. Ilahi, M. Usama, J. Qadir et al., “Challenges and countermeasures for adversarial attacks on deep reinforcement learning,” IEEE Transactions on Artificial Intelligence, vol. 3, no. 2, pp. 90-109, Apr. 2022.
[20] P. Zhao, C. Gu, Y. Ding et al., “Cyber-resilience enhancement and protection for uneconomic power dispatch under cyber-attacks,” IEEE Transactions on Power Delivery, vol. 36, no. 4, pp. 2253-2263, Aug. 2021.
[21] X. Wei, J. Lei, J. Shi et al., “A data-driven approach for quantifying and evaluating overloading dependencies among power system branches under load redistribution attacks,” IEEE Transactions on Smart Grid, vol. 15, no. 4, pp. 4050-4062, Jul. 2024.
[22] L. Zeng, M. Sun, X. Wan et al., “Physics-constrained vulnerability assessment of deep reinforcement learning-based SCOPF,” IEEE Transactions on Power Systems, vol. 38, no. 3, pp. 2690-2704, May 2023.
[23] J. Moos, K. Hansel, H. Abdulsamad et al., “Robust reinforcement learning: a review of foundations and recent advances,” Machine Learning and Knowledge Extraction, vol. 4, no. 1, pp. 276-315, Mar. 2022.
[24] L. Pinto, J. Davidson, R. Sukthankar et al., “Robust adversarial reinforcement learning,” in Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, Aug. 2017, pp. 2817-2826.
[25] Z. Ni and S. Paul, “A multistage game in smart grid security: a reinforcement learning solution,” IEEE Transactions on Neural Networks and Learning Systems, vol. 30, no. 9, pp. 2684-2695, Sept. 2019.
[26] H. Zhang, H. Chen, C. Xiao et al., “Robust deep reinforcement learning against adversarial perturbations on state observations,” in Proceedings of International Conference on Learning Representations, Red Hook, USA, Dec. 2020, pp. 21024-21037.
[27] L. Zeng, D. Qiu, and M. Sun, “Resilience enhancement of multi-agent reinforcement learning-based demand response against adversarial attacks,” Applied Energy, vol. 324, p. 119688, Oct. 2022.
[28] W. Hamilton, Z. Ying, and J. Leskovec, “Inductive representation learning on large graphs,” in Proceedings of Advances in Neural Information Processing Systems, Long Beach, USA, Dec. 2017, pp. 1025-1035.
[29] C. Wang, P. Li, and H. Yu, “Development and characteristic analysis of flexibility in smart distribution network,” Automation of Electric Power Systems, vol. 42, no. 10, pp. 13-21, Sept. 2018.
[30] C. Chen, L. Shen, F. Zou et al., “Towards practical Adam: non-convexity, convergence theory, and mini-batch acceleration,” The Journal of Machine Learning Research, vol. 23, no. 1, pp. 10411-10457, Jan. 2022.
[31] Z. Yin, S. Wang, and Q. Zhao, “Sequential reconfiguration of unbalanced distribution network with soft open points based on deep reinforcement learning,” Journal of Modern Power Systems and Clean Energy, vol. 11, no. 1, pp. 107-119, Jan. 2023.
[32] Y. Zhu and D. Zhao, “Online minimax Q network learning for two-player zero-sum Markov games,” IEEE Transactions on Neural Networks and Learning Systems, vol. 33, no. 3, pp. 1228-1241, Mar. 2022.
[33] W. Liao, B. Bak-Jensen, J. R. Pillai et al., “A review of graph neural networks and their applications in power systems,” Journal of Modern Power Systems and Clean Energy, vol. 10, no. 2, pp. 345-360, Mar. 2022.
[34] A. Amin and M. Mourshed, “Weather and climate data for energy applications,” Renewable and Sustainable Energy Reviews, vol. 192, p. 114247, Mar. 2024.
[35] S. Frank, J. Sexauer, and S. Mohagheghi, “Temperature-dependent power flow,” IEEE Transactions on Power Systems, vol. 28, no. 4, pp. 4007-4018, Nov. 2013.
[36] L. S. Shapley, “Stochastic games,” Proceedings of the National Academy of Sciences of the United States of America, vol. 39, no. 10, pp. 1095-1100, Oct. 1953.
[37] K. Shimizu and E. Aiyoshi, “Necessary conditions for min-max problems and algorithms by a relaxation procedure,” IEEE Transactions on Automatic Control, vol. 25, no. 1, pp. 62-66, Feb. 1980.
[38] Solcast. (2023, Jul.). Solar API and solar weather forecasting tool. [Online]. Available: https://solcast.com.au
[39] Google Drive. (2024, Apr.). Parameter settings of algorithms and controllable resources. [Online]. Available: https://drive.google.com/file/d/1ZIW7zBRXtc-9yBuuOw5JjWpPsb57oaTY/view?usp=sharing&usp=embed_facebook