Journal of Modern Power Systems and Clean Energy

ISSN 2196-5625 CN 32-1884/TK


A Flexibility Scheduling Method for Distribution Network Based on Robust Graph DRL Against State Adversarial Attacks

  • Ziyang Yin (Student Member, IEEE)
  • Shouxiang Wang (Senior Member, IEEE)
  • Qianyu Zhao
Key Laboratory of Smart Grid of Ministry of Education, Tianjin University, Tianjin 300072, China

Updated:2025-03-26

DOI:10.35833/MPCE.2024.000409

Abstract

In the context of large-scale photovoltaic integration, flexibility scheduling is essential to ensure the secure and efficient operation of distribution networks (DNs). Recently, deep reinforcement learning (DRL) has been widely applied to scheduling problems. However, most methods neglect the vulnerability of DRL to state adversarial attacks such as load redistribution attacks, which significantly undermines its security and reliability. To this end, a flexibility scheduling method is proposed based on robust graph DRL (RoGDRL). A flexibility gain improvement model considering temperature-dependent resistance is first proposed, which incorporates weather factors as additional variables to enhance the precision of flexibility analysis. Based on this, a state-adversarial two-player zero-sum Markov game (SA-TZMG) model is proposed, which converts the robust DRL scheduling problem into a Nash equilibrium problem. The proposed SA-TZMG model considers the physical constraints of state attacks, which guarantees the maximal flexibility gain for the defender when confronted with the most sophisticated and stealthy attacker. A two-stage RoGDRL algorithm is proposed, which introduces the graph sample and aggregate (GraphSAGE) driven soft actor-critic to capture the complex features of node neighborhoods and their properties via inductive learning, thereby solving the Nash equilibrium policies more efficiently. Simulations based on the modified IEEE 123-bus system demonstrate the efficacy of the proposed method.

I. Introduction

OPERATIONAL flexibility denotes the capacity of the power system to maintain safe and efficient operation, which is critical for distribution networks (DNs) with high penetration rates of photovoltaic (PV) [1]. In DNs, operational flexibility fundamentally reflects the level of coordination and utilization of controllable resources within the system, where the essence of scheduling methods is precisely to enhance and apply flexibility [2]. Consequently, distribution system operators (DSOs) can improve operational flexibility by coordinating diverse controllable resources through optimized scheduling, an approach known as flexibility scheduling.

Recently, many research studies on flexibility scheduling have emerged [2]-[7]. In [2], a flexibility analysis framework is designed to fully exploit the controllability of various resources, thereby achieving the goals of improving operation costs, voltage distribution, and risk control through optimal scheduling. In [4], operational flexibility indices, encompassing node, system, and network transmission flexibility, are developed, offering a flexibility perspective for reinterpreting the scheduling problems of the DN. Based on this, [5] establishes a unified framework for quantifying and enhancing operational flexibility, aiming to achieve a feasible balance of the DSO’s diverse flexibility demands, including reducing operation costs, improving voltage profiles, and alleviating branch congestion. The purpose of flexibility scheduling is to satisfy the DSO’s comprehensive demands for the economical, safe, and clean operation of the DN by fully unleashing the regulation capabilities of controllable resources within the network [6]. Thus, it is necessary to integrate various controllable resources in the DN into a unified optimization framework.

The primary controllable resources for enhancing operational flexibility include energy storage systems (ESSs) [2], PV inverters [3], soft open points (SOPs) [5], and sectionalizing and tie switches [7]. The DSO requires a meticulously designed scheduling model to coordinate these discrete and continuous controllable resources to enhance operational flexibility while satisfying physical and operational constraints. However, most modeling methods for flexibility scheduling assume constant line resistance. In contrast, numerous studies have demonstrated that line resistance is dynamically variable [8], [9]. Specifically, [9] formulates an optimal power flow that considers transmission line conductor temperatures to improve the accuracy of optimal power flow analysis. Thus, including temperature-dependent resistance is crucial for enhancing the precision of flexibility scheduling.

Flexibility scheduling is a typical mixed-integer nonlinear programming problem. The most common methods for solving such problems include heuristic algorithms and mathematical programming methods such as mixed-integer second-order cone programming (MISOCP) and linearized approximation programming (LAP). However, these methods face three primary challenges. ① The solution quality of heuristic algorithms cannot be guaranteed, and they typically require extensive computation [10]. ② Commercial solvers exhibit relatively low efficiency in solving MISOCP problems [11]. ③ While LAP offers higher computational efficiency, it necessitates the imposition of assumptions and simplifications to ensure solvability of the scheduling model. These may deviate the model from reality, thereby reducing the accuracy of the solution [12]. By contrast, deep reinforcement learning (DRL) avoids the need for undue simplifications and assumptions in the DN model. It generates an optimal policy, i.e., decision-making rules for the optimization problem, rather than a singular optimal solution. Thus, the trained DRL algorithm can achieve real-time and online decision-making without iteration based on the learned policy and current state [13]. To further enhance the decision-making performance, some studies have introduced an innovative approach by integrating graph neural networks (GNNs) with DRL, i.e., graph DRL (GDRL), and applied it to DN optimization [14]-[16]. The rationale for this integration is the ability of GNNs to effectively capture the complex topological structure of the DN and the relationships between nodes [17], significantly enhancing the adaptability of the model to dynamic changes.

Although DRL is increasingly being used to address complex DN scheduling challenges, its weaknesses are becoming more apparent. DRL is notably susceptible to disturbances from adversarial noise, with the neural network policies of DRL being highly vulnerable to state adversarial attacks [18]. These attacks introduce slight input perturbations, leading to unpredictable errors and potentially severe security implications [19]. Among the numerous cyber-attacks currently faced by power systems, false data injection attack (FDIA) is considered one of the most serious threats to secure system operation [20]. Load redistribution (LR) is a type of FDIA triggered by false load data, which consequently affects operational actions and leads to economic loss and physical damage to devices due to incorrect operational decisions [21]. Given that DRL makes decisions based on real-time measurements, LR presents an effective strategy for attackers to introduce state adversarial attacks into trained DRL models, thereby significantly undermining their decision-making capabilities. Studies in [18] and [22] explored the vulnerability of the DRL model to data perturbations in power network reconfiguration and optimal power flow. The findings reveal that small disturbances in input data can lead to drastically different control decisions and introduce significant risks. Therefore, it is crucial to improve the defense mechanisms and robustness of DRL models against state attacks before implementing them in actual DNs.

To enhance the robustness of DRL models, the adversarial DRL framework is proposed to identify and adapt to potential adversarial attacks. Specifically, adversarial attacks are defined as attack agents and participate in the training process of defense agents (i.e., robust DRL models). By integrating strategies such as adversarial training and noise injection, this framework strengthens the resistance of the DRL model to input perturbations [23]. In [24], a robust adversarial reinforcement learning method based on a two-player zero-sum Markov game (TZMG) is proposed to improve the robustness against changes in environmental parameters. Similarly, [25] introduces a TZMG model for cybersecurity in power grids, employing reinforcement learning to simulate attacker behaviors and aid defenders in devising superior strategies for relay protection. Nonetheless, reinforcement learning experiences diminished efficiency and convergence in large-scale scenarios. To address this, [26] presents an alternating training robust DRL based on the state-adversarial Markov decision process (SA-MDP), notably enhancing defenders’ capabilities against state adversarial attacks. In [27], an adversary-based robust DRL approach is proposed to strengthen the resilience of DRL-based demand response management systems against cyber-attacks.

Despite these advancements, the security and robustness of DRL against state adversarial attacks in optimal DN scheduling remain underexplored. Furthermore, the current robust adversarial DRL framework typically allows the attack agent to arbitrarily modify state values within a specified range, which is impractical. This is because, in power systems, adversarial attacks that fail to adhere to physical characteristics and constraints are easily detected by bad data detection (BDD) mechanisms and, thereby, would not be considered for further decision-making [22]. This situation causes the absence of decision-making experience with stealthy adversarial samples during the adversarial training stage [27], making the learned robust DRL model sensitive to more realistic state attack signals.

To address the aforementioned challenges, this paper proposes a flexibility scheduling method based on robust GDRL (RoGDRL) to enhance the robustness of the DRL-driven scheduling system against state adversarial attacks. Initially, a mathematical model of flexibility scheduling accounting for temperature-dependent resistance is constructed, which improves the operational flexibility and economic efficiency by coordinating various flexibility resources such as tie switches, ESSs, SOPs, PVs, and static var compensators (SVCs). Subsequently, a novel state-adversarial TZMG (SA-TZMG) model for flexibility scheduling is proposed. This model frames the challenge of DRL-based scheduling under state adversarial attacks as a Nash equilibrium problem. Then, a two-stage RoGDRL algorithm with an alternating adversarial training framework is developed to solve the game model. The proposed algorithm utilizes a graph sample and aggregate driven soft actor-critic (SAGESAC) agent to extract feature representations from graph-structured states. The graph sample and aggregate (GraphSAGE) [28] enhances the ability of the proposed algorithm to capture the operational characteristics of the DN and accelerates the learning process. Experimental results demonstrate the effectiveness of the proposed method. The main contributions are summarized as follows.

1) The proposed method integrates weather variables and employs a steady-state thermal balance function to accurately assess temperature-dependent resistance. This enhances the accuracy of flexibility gain evaluation and the reliability of scheduling decisions.

2) The proposed SA-TZMG model incorporates realistic physical constraints of state attacks, enabling an attacker to generate LR samples that evade the BDD mechanism. This ensures that the defender can make informed decisions to mitigate the impact of more stealthy state adversarial attacks.

3) By combining SAGESAC with an alternating adversarial training framework, the proposed two-stage RoGDRL algorithm demonstrates exceptional robustness against state adversarial attacks and is highly competitive in enhancing operational flexibility compared with existing DRL algorithms.

The remainder of this paper is organized as follows. The mathematical model of flexibility gain improvement is formulated and converted into an SA-TZMG in Sections II and III, respectively. A novel two-stage RoGDRL algorithm is formulated in Section IV. Case study results are presented in Section V. The conclusions are shown in Section VI.

II. Mathematical Model of Flexibility Gain Improvement

A. Definition of Flexibility Gain

To address the flexibility scheduling problem in DNs, a flexibility gain indicator that encompasses node flexibility, branch transfer flexibility, and economic efficiency is first proposed. The flexibility gain during period t can be expressed as:

F_t = f_{N,t} + f_{B,t} + f_{C,t} \quad (1)

where fN,t is the node flexibility gain; fB,t is the branch transfer flexibility gain; and fC,t is the cost flexibility gain. The flexibility gain metric evaluates improvements in operational flexibility from multiple perspectives following the implementation of scheduling strategies. The sub-indicators for node, branch transfer, and cost flexibility gains share uniform dimensionality, allowing their aggregation into a comprehensive index through summation. A detailed description is provided below.

1) Node Flexibility Gain

Node flexibility, which reflects the local states of flexibility demand and supply [29], is fundamental to operational flexibility. Node voltage is a crucial assessment metric for node flexibility, with voltage deviations beyond permissible limits indicating an extreme lack of node flexibility [5]. Hence, the degree of improvement in nodal voltage deviation before and after implementing scheduling strategies is used to quantify the node flexibility gain.

f_{N,t} = \frac{\max_{i \in \Omega_i}\left(\left|1 - U_{i,t}^{o}\right|\right) - \max_{i \in \Omega_i}\left(\left|1 - U_{i,t}^{n}\right|\right)}{\max_{i \in \Omega_i}\left(\left|1 - U_{i,t}^{o}\right|\right) + \lambda} \quad (2)

where U_{i,t}^{o} and U_{i,t}^{n} are the per-unit voltage values at node i during period t before and after the control, respectively; Ω_i is the node set; and λ is a small constant, which is set to be 10^{-8} to avoid a division by zero [30]. This approach ensures that the denominator of the formula does not become zero, even when \max_{i \in \Omega_i}(|1 - U_{i,t}^{o}|) is very small or zero, thus avoiding computational anomalies.

2) Branch Transfer Flexibility Gain

Branch transfer flexibility signifies the ability of the DN to relocate local flexibility for spatial-temporal balancing [4]. The branch capacity directly reflects the ability of the DN to balance flexibility supply and demand, i.e., the branch transfer flexibility [5]. Hence, this paper quantifies the branch transfer flexibility gain by calculating the proportion of the branch loading rate reduction before and after the implementation of the scheduling strategy.

f_{B,t} = \frac{\mathrm{mean}_{ij \in \Omega_B}\left(I_{ij,t}^{o}/I_{ij,\max}\right) - \mathrm{mean}_{ij \in \Omega_B}\left(I_{ij,t}^{n}/I_{ij,\max}\right)}{\mathrm{mean}_{ij \in \Omega_B}\left(I_{ij,t}^{o}/I_{ij,\max}\right) + \lambda} \quad (3)

where Iij,to and Iij,tn are the currents of branch ij during period t before and after the control, respectively; Iij,max is the carrying capacity of branch ij; and ΩB is the branch set.

3) Cost Flexibility Gain

Node and branch transfer flexibilities form the foundational framework for quantitative analysis of DN flexibility [29]. Meanwhile, the economic efficiency often represents a critical consideration in time-series scheduling problems [2]. Hence, the cost flexibility gain is constructed to quantify the rate of change in operation cost, which includes network power loss, device power loss, PV power curtailment loss, and device operation costs.

f_{C,t} = \frac{\gamma_L \sum_{ij \in \Omega_B} \left(I_{ij,t}^{o}\right)^2 r_{ij,t}^{o} + \gamma_{pv} P_{o,t}^{cur} - \left(\gamma_L P_t^{loss} + \gamma_{pv} P_{n,t}^{cur} + \gamma_A \Delta d_t\right)}{\gamma_L \sum_{ij \in \Omega_B} \left(I_{ij,t}^{o}\right)^2 r_{ij,t}^{o} + \gamma_{pv} P_{o,t}^{cur} + \lambda} \quad (4)
\text{s.t.} \quad P_t^{loss} = \sum_{ij \in \Omega_B} \left(I_{ij,t}^{n}\right)^2 r_{ij,t}^{n} + \sum_{i \in \Omega_{sop}} P_{loss,i,t}^{sop}

where γL is the electricity price; γpv is the generation revenue of PV; γA is the switching action cost; rij,to and rij,tn are the resistances of branch ij during period t before and after the control, respectively; Δdt is the number of actions for discrete devices; Ploss,i,tsop is the power loss of converter i of the SOP during period t; Po,tcur and Pn,tcur are the PV power curtailments during period t before and after the control, respectively; and Ωsop is the SOP set.

Enhancing the cost flexibility gain can ensure low-cost operation while mitigating the phenomenon of PV power curtailment, thereby improving the utilization rate of PVs. This encourages the investment enthusiasm of DSOs and PV investors for further large-scale PV deployment in the future, thereby promoting shared interests among multiple stakeholders.
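For readers implementing the indicator, the three sub-indicators in (2)-(4) and their aggregation in (1) can be computed directly from before-and-after power flow snapshots. The following Python sketch illustrates the arithmetic; the array names and the pre-aggregated cost inputs are illustrative assumptions rather than the paper’s implementation, which evaluates the full cost terms in (4).

import numpy as np

LAMBDA = 1e-8  # small constant to avoid division by zero, as in (2)-(4)

def node_flexibility_gain(u_before, u_after):
    """Eq. (2): relative reduction of the worst per-unit voltage deviation."""
    dev_before = np.max(np.abs(1.0 - u_before))
    dev_after = np.max(np.abs(1.0 - u_after))
    return (dev_before - dev_after) / (dev_before + LAMBDA)

def branch_flexibility_gain(i_before, i_after, i_max):
    """Eq. (3): relative reduction of the mean branch loading rate."""
    load_before = np.mean(i_before / i_max)
    load_after = np.mean(i_after / i_max)
    return (load_before - load_after) / (load_before + LAMBDA)

def cost_flexibility_gain(cost_before, cost_after):
    """Eq. (4) in compact form: relative reduction of the operation cost,
    where cost_before and cost_after already aggregate network loss, SOP loss,
    PV curtailment, and switching-action costs."""
    return (cost_before - cost_after) / (cost_before + LAMBDA)

def flexibility_gain(u_b, u_a, i_b, i_a, i_max, c_b, c_a):
    """Eq. (1): sum of the three dimensionless sub-indicators."""
    return (node_flexibility_gain(u_b, u_a)
            + branch_flexibility_gain(i_b, i_a, i_max)
            + cost_flexibility_gain(c_b, c_a))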

4) Discussion on Relationship Between Flexibility Gain and Operational Flexibility

In this study, the relationships between the three sub-indicators of flexibility gain and the operational flexibility are outlined as follows.

1) Node flexibility gain quantitatively reflects the reduction in voltage deviation. A larger f_{N,t} indicates a smaller node voltage deviation after implementing scheduling strategies. Nodes with sufficient flexibility can support the reduction of the voltage deviation to the desired range [5]. Thus, increasing the node flexibility gain can improve the node flexibility.

2) The branch transfer flexibility gain quantitatively reflects the proportion of reduction in branch loading rate. A larger f_{B,t} indicates a lower branch loading rate after implementing scheduling strategies. The lower the loading rate, the greater the remaining capacity of the branch available for transfer flexibility [5]. Thus, enhancing the branch transfer flexibility gain can improve the ability of the DN to support flexible transmission, i.e., branch transfer flexibility.

3) The cost flexibility gain quantitatively reflects the proportion of reduction in the system operation cost. A larger f_{C,t} indicates a lower operation cost after implementing scheduling strategies. In DNs, utilizing or enhancing the operational flexibility incurs certain costs, which represent the value attribute of flexibility and are crucial for evaluating the availability and usability of scheduling strategies [29]. Thus, by enhancing the cost flexibility gain, it is possible to achieve economical operation while increasing the flexibility of the node and branch.

Moreover, from the perspective of DN operational performance, improving the flexibility gain can reduce the energy loss cost and PV power curtailment, enhance the voltage stability, and mitigate the branch congestion. Traditionally, achieving these objectives simultaneously often requires a multi-objective optimization framework. However, the proposed flexibility gain indicator simplifies this into a single-objective optimization problem, diminishing the complexity inherent in the multi-objective scheduling problem.

B. Flexibility Gain Improvement Model Considering Temperature-dependent Resistance

The accuracy of flexibility gain calculations depends on the precision of power flow analysis. However, many existing scheduling methods overlook the characteristics of dynamic branch resistance changes during power flow analysis. This is because the resistance of metallic conductors changes with their temperature. Specifically, the relationship is governed by:

r_{ij,t} = r_{ij}^{c}\left[1 + \varpi\left(T_{ij,t} - T^{c}\right)\right] \quad (5a)
T_{ij,t} = \left(r_{ij,t} - r_{ij}^{c}\right)\varpi^{-1}\left(r_{ij}^{c}\right)^{-1} + T^{c} \quad (5b)

where r_{ij,t} is the resistance of branch ij during period t; r_{ij}^{c} is the resistance of branch ij at the reference temperature T^{c}; T_{ij,t} is the conductor temperature of branch ij during period t; and ϖ is the temperature constant.

The branch temperature is determined by weather factors and branch current, adhering to the steady heat balance function as specified in IEEE Std 738-2012 [8]:

q_{ij,t}^{s} + I_{ij,t}^{2} r_{ij,t} = q_{ij,t}^{c} + q_{ij,t}^{r} \quad (6)
\text{s.t.} \quad q_{ij,t}^{s} = \rho S_Q \left(\sin \delta_t\right) \bar{A}_{ij}
\quad\quad\;\; q_{ij,t}^{r} = 0.0178 D \varepsilon \frac{T_{ij,t}^{4} - T_{a,t}^{4}}{100^{4}}
\quad\quad\;\; q_{ij,t}^{c} = \max\left\{q_{ij}^{c1} l_a \left(T_{ij,t} - T_{a,t}\right), q_{ij}^{c2} l_a \left(T_{ij,t} - T_{a,t}\right)\right\}

where q_{ij,t}^{c} and q_{ij,t}^{r} are the heat dissipations via air convection and surface radiation of branch ij, respectively; q_{ij,t}^{s} is the calorific value of branch ij from the solar radiation; I_{ij,t} is the current of branch ij during period t; S_Q is the solar radiation level; D is the diameter of the branch; ρ is the solar absorptivity; ε is the emissivity of the conductor; T_{a,t} is the ambient temperature; \bar{A}_{ij} is the projected area of the conductor per unit length of branch ij; δ_t is the solar radiation angle; and q_{ij}^{c1}, q_{ij}^{c2}, and l_a are the convection heat loss coefficients. The details can be referred to in [9].
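Because (5) and (6) are coupled (resistance depends on conductor temperature, and the heat balance depends on resistance), the conductor temperature can be obtained with a simple fixed-point iteration. The sketch below illustrates this coupling under simplified, linearized convection and radiation terms; the heat-transfer coefficients and default parameter values are placeholders, not the IEEE Std 738-2012 parameters used in the paper.

def branch_resistance(T_line, r_ref, T_ref=20.0, alpha=0.00393):
    """Eq. (5a): temperature-dependent resistance of a metallic conductor.
    alpha and T_ref are illustrative defaults, not the paper's constants."""
    return r_ref * (1.0 + alpha * (T_line - T_ref))

def solve_conductor_temperature(current, r_ref, T_ambient, q_solar,
                                h_conv=15.0, h_rad=0.8, tol=1e-4, max_iter=100):
    """Fixed-point solution of a simplified steady heat balance, cf. (6):
    solar gain + I^2 * r(T) = convective loss + radiative loss.
    h_conv and h_rad are lumped, linearized heat-transfer coefficients."""
    T = T_ambient
    for _ in range(max_iter):
        r = branch_resistance(T, r_ref)
        heat_in = q_solar + current ** 2 * r
        # linearized losses around the current temperature estimate
        T_new = T_ambient + heat_in / (h_conv + h_rad)
        if abs(T_new - T) < tol:
            break
        T = T_new
    return T, branch_resistance(T, r_ref)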

The flexibility gain improvement model considering temperature-dependent resistance can be written as:

\max_{u_t, s_t} \; F(s_t) = f_{N,t} + f_{B,t} + f_{C,t} \quad t = 1, 2, \ldots, N_t \quad (7)
\text{s.t.} \quad 0 = G(s_t, u_t)
\quad\quad\;\; 0 \geq H(s_t, u_t)

where G(·) and H(·) are the vectors of equality and inequality constraint equations, encompassing power balance, operational safety, equipment operational limits, and temperature-dependent resistance constraints [14], [31]; N_t is the number of scheduling periods; u_t is the control vector for topology, ESSs, SOPs, SVCs, and PVs during period t; and s_t is the vector of operation states, including the load demand, PV output, and weather factors, during period t.

Problem (7) can be transformed into a Markov decision process (MDP) and addressed using DRL algorithms. However, state adversarial attacks pose a substantial risk, as they can disturb the inputs to the DRL model. Such disturbances can alter the decisions of the model, impacting the results. LR attacks represent a common type of state adversarial attack for power systems.

C. Modeling of LR Attacks

Referring to [20], LR attacks are based on the following assumptions. ① The measurement attack on balancing and zero-injection nodes is ignored due to its detectability and correctability. ② When the power output from the PV system is zero, the manipulation of the PV output is infeasible. ③ To conceal the attack from system operators, the false data injections do not exceed the normal data by ξ (percentage). Based on these assumptions, a valid LR attack during period t for the DN with a high penetration of PV can be modeled as:

\Delta S_{Bt} = -S_F K_D \left(\Delta S_t^{pv} - \Delta S_t^{load}\right) \quad (8a)
\mathbf{1}^{T} \Delta S_t^{pv} = 0, \quad \mathbf{1}^{T} \Delta S_t^{load} = 0 \quad (8b)
\Delta S_t^{load} \in \left[-\xi S_t^{load}, \xi S_t^{load}\right], \quad \Delta S_t^{pv} \in \left[-\xi \left(S_{\max}^{pv} - S_t^{pv}\right), \xi \left(S_{\max}^{pv} - S_t^{pv}\right)\right] \quad (8c)

where ΔS_t^{load} and ΔS_t^{pv} are the power injections of false loads and PVs, respectively; ΔS_{Bt} is the false branch power measurement injection; S_F and K_D are the shifting factor and load incidence matrices, respectively; S_t^{load} and S_t^{pv} are the load demand and PV output matrices, respectively; and S_{max}^{pv} is the PV capacity matrix. Equation (8a) ensures that the attacks can bypass the BDD mechanism, while (8b) ensures that the sum of power injections of false loads or PVs is equal to zero [21]. The BDD is typically based on the principle of residual testing, where the residual r_e represents the Euclidean norm of the difference between the measurement vector and its estimation [22]. By comparing the residual r_e with a residual threshold τ, if r_e ≤ τ, the state adversarial attack can bypass the BDD; otherwise, the adversarial attack fails.
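A minimal residual-test sketch is given below, assuming a linear measurement model z = Hx with unit measurement weights (a simplification of the AC state estimation used in practice); it illustrates why an attack of the structured form in (8a) leaves the residual unchanged and therefore passes the BDD check. The function and variable names are illustrative.

import numpy as np

def state_estimate(z, H):
    """Least-squares state estimate (unit weights) for z = H x + e."""
    return np.linalg.lstsq(H, z, rcond=None)[0]

def bdd_residual(z, H):
    """Residual r_e = ||z - H x_hat||_2 used by the bad data detector."""
    x_hat = state_estimate(z, H)
    return np.linalg.norm(z - H @ x_hat)

def attack_bypasses_bdd(z_clean, delta_z, H, tau):
    """Compare the post-attack residual with the threshold tau.
    Attacks constructed as delta_z = H * c (the structure enforced by (8a))
    are absorbed by the state estimate, so the residual is unchanged."""
    return bdd_residual(z_clean + delta_z, H) <= tau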

In the robust adversarial framework [27], the attacker deploys state adversarial attacks, as described in (8), aiming to disrupt the input of the DRL controller to diminish the flexibility gain. In response, the defender implements a robust DRL controller to maximize the flexibility gain, counteracting the attacker’s efforts. This interaction forms the basis of a two-player zero-sum game, where one participant’s gain is equivalent to the other’s loss. To develop a robust DRL model for flexibility scheduling in the face of state attacks, the competition between attackers and defenders is conceptualized as a TZMG within the context of state adversarial attacks.

III. Modeling of SA-TZMG

The MDP is a mathematical framework that provides a formalism for modeling sequential decision-making problems. TZMG is considered an extension of MDP to game-theoretic scenarios. It consists of a six-tuple as follows:

\left\langle S, A^{D}, A^{A}, P, R, \gamma \right\rangle \quad (9)

where S is the game state space; AD and AA are the action spaces for two players; P is the state transition function; R is the reward function; and γ is the discount factor. In TZMG, while one player attempts to maximize the cumulative reward from the game, the other player seeks to minimize this value. In this work, the maximizer is represented as the defender, while the minimizer is defined as the attacker.

Notably, in the conventional TZMG model, actions of both the defender and the attacker influence the environment simultaneously, leading to the generation of rewards [32]. This model, however, does not accurately reflect attack-defense interaction patterns under state adversarial conditions. Consequently, the TZMG is augmented through integration with SA-MDP, culminating in the development of an SA-TZMG. This model more effectively reflects the interaction between the attacker and the defender. This interaction process and its effects on the environment are represented in Fig. 1.

Fig. 1  Interaction between players and its effects on environment.

As shown in Fig. 1, the attacker’s adversarial attack AtA primarily targets the defender’s perception of the state rather than exerting a direct influence on the environment. The modeling process of SA-TZMG is as follows.

1) State space: the state of the DN is usually defined in the Euclidean space. However, stacking the features of nodes in a specific order may cause a loss of topology information and dependencies between nodes [33]. Consequently, there is a critical need to devise a graph data structure that encapsulates spatiotemporal information, thereby more accurately mirroring the operation state of the system.

Graph data can be represented as S=(Ωi,Ωe,X), where Ωe is the set of edges; and X is the set of features, including the node feature matrix Xbus and the feature matrix Xedge. For the DN, Xedge is the adjacency matrix. Moreover, the node feature is the system operation state. In summary, the state during period t can be represented as St=(Ωi,Ωe,Xbus,t,Xedge,t). Xbus,t during period t can be written as:

X_{bus,t} = \left[P_{inj,t}, Q_{inj,t}, T_{a,t}, V_{w,t}, A_{w,t}, Q_{s,t}\right] \in \mathbb{R}^{n \times d} \quad (10)

where the operation state mainly comprises the active power injection of the node Pinj,t, reactive power injection of the node Qinj,t, and weather factors, including wind speed Vw,t, wind angle Aw,t, temperature Ta,t, and solar irradiance Qs,t; and n and d are the numbers of rows and columns of the feature matrix, respectively.

Remark: currently, the primary method for collecting weather information is through the automatic weather station (AWS). It can transmit weather data via wired or wireless communication and is highly automated, accurate, and reliable [34]. Thus, a possible implementation of the proposed method is to deploy AWSs in the DN to collect the required weather data, which are then transmitted to the control center via a communication network. Additionally, to further ensure the accuracy of data collection, it can be linked with geographic information systems. This linkage allows for precisely attributing measured weather information to the specific branches, thereby providing more accurate and timely data support [35].
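The graph-structured state S_t can be assembled from the node feature matrix in (10) and the list of currently closed branches. The sketch below uses the torch_geometric Data container as one possible representation; the feature ordering follows (10), and the function and argument names are illustrative assumptions.

import torch
from torch_geometric.data import Data

def build_graph_state(p_inj, q_inj, temp, wind_speed, wind_angle, irradiance,
                      branch_list):
    """Assemble the graph state S_t = (nodes, edges, X_bus, X_edge).
    Each argument except branch_list is a 1-D tensor with one entry per node;
    branch_list is a list of (i, j) node index pairs for the closed branches."""
    # node feature matrix X_bus,t of shape n x d, following Eq. (10)
    x_bus = torch.stack(
        [p_inj, q_inj, temp, wind_speed, wind_angle, irradiance], dim=1)
    # undirected graph: include both directions of every closed branch
    edges = branch_list + [(j, i) for i, j in branch_list]
    edge_index = torch.tensor(edges, dtype=torch.long).t().contiguous()
    return Data(x=x_bus, edge_index=edge_index)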

2) Defender’s action: the defender’s action A^{D} is decomposed into a Cartesian product of two sub-spaces. The continuous sub-action space A_c is the control strategy of ESSs, SVCs, PVs, and SOPs. The discrete sub-action A_d is the control strategy for topology. The defender’s action during period t can be represented as:

A_t^{D} = \left\{A_{d,t}, A_{c,t}\right\} = \left\{a_t^{dnr}, \left[a_t^{pv}, a_t^{svc}, a_t^{sop}, a_t^{ess}\right]\right\} \quad (11)

where a_t^{pv}, a_t^{svc}, a_t^{ess} ∈ [-1, 1] are the control actions for PV, SVC, and ESS, respectively, detailed in [14]; and a_t^{dnr} and a_t^{sop} are the control actions for topology and SOP, respectively, detailed in [31].

3) Attacker’s action: the attacker’s action at each time step, which is a continuous variable, is defined as a_{att,t} = [a_{att,load,t}, a_{att,pv,t}] ∈ [-ξ1^{T}, ξ1^{T}], where a_{att,load,t} and a_{att,pv,t} are the attack actions targeting the load and PV state variables, respectively. This study sets ξ to be 0.3 [21]. During period t, the attacker can alter the load demand and PV output, thereby modifying the apparent power of the node injection to disrupt the defender’s observed state S_t.

\left[\Delta S_t^{load}, \Delta S_t^{pv}\right] = a_{att,t}\left[S_t^{load}, S_{\max}^{pv} - S_t^{pv}\right]^{T} \quad (12)

Diverging from existing adversarial attack modeling approaches, this study further constrains the attacker’s actual executed actions to ensure compliance with the LR mechanism as:

A_t^{A} = \begin{cases} \left[\Delta S_t^{load}, \Delta S_t^{pv}\right] - \left[M\left(\Delta S_t^{load}\right)\mathbf{1}^{T}, M\left(\Delta S_t^{pv}\right)\mathbf{1}^{T}\right] & \eta_t = 0 \\ 0 & \eta_t = 1 \end{cases} \quad (13)

where M(·) is the operation of calculating the mean value; and η_t is a Boolean variable. If the attacker’s action satisfies the actual physical constraints in (8), η_t = 0; otherwise, η_t = 1.
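The mapping from the raw attacker output to an executable LR perturbation in (12) and (13) amounts to scaling by the attackable ranges and then removing the mean so that the zero-sum condition (8b) holds. A hedged sketch follows, with ξ = 0.3 as set in the paper and illustrative array names:

import numpy as np

XI = 0.3  # maximum relative perturbation magnitude used in the paper

def lr_attack(a_load, a_pv, s_load, s_pv, s_pv_max, satisfies_constraints=True):
    """Eqs. (12)-(13): scale the raw attack action (bounded by +/- XI) to the
    attackable ranges, then remove the mean so that 1^T * delta = 0 as in (8b).
    If the physical constraints in (8) are violated (eta_t = 1), the attack
    is discarded and a zero perturbation is returned."""
    if not satisfies_constraints:
        return np.zeros_like(s_load), np.zeros_like(s_pv)
    d_load = a_load * s_load                 # false load injection, Eq. (12)
    d_pv = a_pv * (s_pv_max - s_pv)          # false PV injection, Eq. (12)
    d_load = d_load - d_load.mean()          # enforce zero-sum load perturbation
    d_pv = d_pv - d_pv.mean()                # enforce zero-sum PV perturbation
    return d_load, d_pv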

4) Reward: the reward in SA-TZMG is one of the critical factors determining the game result. It is represented as the immediate reward provided by the environment when the defender takes action A^{D} and the attacker takes action A^{A}. In this paper, the reward during period t is expressed as the flexibility gain with a penalty term.

R_t\left(S_t, A_t^{D}, A_t^{A}\right) = F_t - \kappa_t \vartheta \quad (14)

where κt=0 if AtD satisfies the operational security constraints, otherwise, κt=1; and ϑ is a negative constant.

Reference [36] has proven that the TZMG has a Nash equilibrium joint policy (π_θ, π_φ), where π_θ and π_φ are the policies of the attacker and the defender, respectively. The Nash equilibrium defines the highest payoff a defender can achieve when facing the most powerful attacker, which is equivalent to the max-min solution. In SA-TZMG, the defender still seeks to maximize its resilience against the most powerful attacker. Thus, the Nash equilibrium joint policy can be written as:

\max_{\pi_\varphi} \min_{\pi_\theta} \mathcal{V}\left(S_t, A_t^{D}, A_t^{A}\right) = \min_{\pi_\theta} \max_{\pi_\varphi} \mathcal{V}\left(S_t, A_t^{D}, A_t^{A}\right) \quad (15)
\text{s.t.} \quad \mathcal{V}\left(S_t, A_t^{D}, A_t^{A}\right) = \mathbb{E}\left[\sum_{\Delta t = 0}^{T} \gamma^{\Delta t} R_{t+\Delta t+1}\left(S_t, A_t^{D}, A_t^{A}\right)\right]
\quad\quad\;\; A_t^{D} \sim \pi_\varphi\left(\cdot \mid S_t\right), \quad A_t^{A} \sim \pi_\theta\left(\cdot \mid S_t\right)

where 𝒱(·) is the expected cumulative reward function; T is the total number of optimization periods; and E[·] is the expected value function. Traditional DRL algorithms mainly focus on solving single-agent MDPs and are not designed to directly address max-min problems [32]. To this end, the following section introduces a two-stage RoGDRL algorithm.

IV. Two-stage RoGDRL Algorithm

A. Framework of Two-stage RoGDRL Algorithm

In TZMG, once one player’s policy is fixed, the max-min problem becomes a single-agent MDP, and a deterministic policy is sufficient to achieve optimality [32]. Thus, inspired by [32] and [37] on solving the max-min problem, this work proposes a two-stage RoGDRL algorithm to tackle problem (15). It consists of Stage I, which focuses on the attacker’s policy learning, and Stage II, which is dedicated to robust policy learning.

In Stage I, the training of the attacker aims to ascertain the optimal attacker’s policy πθ, keeping the defender’s policy πφ fixed and aiming to minimize the defender’s cumulative reward. It is framed as a constrained minimization problem:

\min_{\pi_\theta} \; \mathcal{V}\left(S_t, A_t^{D}, A_t^{A}\right) \quad (16)
\text{s.t.} \quad A_t^{D} \sim \pi_\varphi\left(\cdot \mid S_t + A_t^{A}\right)

It is essential to highlight that π_φ is pre-trained at this stage. The rationale behind pre-training π_φ without the influence of adversarial attacks lies in its effectiveness in establishing an optimal initial exploration strategy for the attacker [27].

In Stage II, the defender focuses on augmenting its resilience to state attacks by developing a robust defense policy, with the attacker’s policy held constant. This stage is characterized as a constrained maximization problem:

\max_{\pi_\varphi} \; \mathcal{V}\left(\hat{S}_t, A_t^{D}, A_t^{A}\right) \quad (17)
\text{s.t.} \quad \hat{S}_t = S_t + A_t^{A}, \quad A_t^{A} \sim \pi_\theta\left(\cdot \mid S_t\right)

where Ŝt is the perturbed state after FDIA. The policy learning tasks for both stages can be addressed by employing the proposed SAGESAC. The specific algorithm for these two stages will be detailed below.

B. Stage I: Attacker’s Policy Learning

At this stage, the state attack signals generated by the attacker are continuous variables. Thus, the soft actor-critic (SAC) is adopted as the foundational framework to learn π_θ. The objective can be expressed as a maximization problem with a negated reward function:

J\left(\pi_\theta\right) = \max \sum_{t=0}^{T} \mathbb{E}\left[-R_t\left(S_t, A_t^{D}, A_t^{A}\right) + \alpha \mathcal{H}\left(\pi_\theta\left(\cdot \mid S_t\right)\right)\right] \quad (18)
\text{s.t.} \quad A_t^{D} \sim \pi_\varphi\left(\cdot \mid S_t + A_t^{A}\right)
\quad\quad\;\; \mathcal{H}\left(\pi_\theta\left(\cdot \mid S_t\right)\right) = \mathbb{E}_{A_t^{A} \sim \pi_\theta}\left[-\ln \pi_\theta\left(A_t^{A} \mid S_t\right)\right]

where H(·) is the policy entropy, which is a measure of the randomness in the action selection of the policy, encouraging exploration by penalizing certainty in action choices; E_{A_t^A∼π_θ} denotes the average calculation under all possible actions A_t^A; α is the temperature parameter used to balance the relationship between the policy entropy and expected rewards; and -ln π_θ(A_t^A|S_t) is the negative value of the logarithmic probability of action A_t^A given the state S_t.

To exploit the critical attributes of the observation, this study introduces GraphSAGE as a feature extractor to efficiently encapsulate the characteristics of graph-structured states. The traditional feature extractor, the graph convolutional network (GCN), utilizes a transductive learning approach that requires static graph structures [17]. Nonetheless, the correlation among neighboring nodes is subject to temporal fluctuations, attributed primarily to DN reconfiguration, a phenomenon that occurs routinely in the DN. In contrast, GraphSAGE, as one of the most advanced GNN frameworks, employs an inductive learning strategy capable of sampling and aggregating features from the neighboring nodes of a node. This methodology makes GraphSAGE particularly advantageous for application within large-scale, dynamic graphs [28]. The critical steps of GraphSAGE are as follows.

1) Sampling from neighboring nodes. For each node i, a fixed number of neighboring nodes are randomly sampled from its neighboring node set N(i), thereby reducing the number of neighboring nodes to be processed and, consequently, the computational complexity of the model.

2) Aggregating features from neighboring nodes. Specific aggregators are used to aggregate features of the sampled neighboring nodes, obtaining a comprehensive representation of neighborhood features.

h_{\mathcal{N}(i)}^{k} = \mathrm{AGG}\left(\left\{h_{m}^{k-1}, \forall m \in \mathcal{N}(i)\right\}\right) \quad (19)

where AGG denotes the aggregation operation, which in practice can be one of various aggregators such as the mean aggregator; h_{N(i)}^{k} is the kth aggregated neighborhood feature; and h_{m}^{k-1} is the (k-1)th aggregated neighborhood feature for node m.

3) Updating feature representation of node. h𝒩(i)k is combined with the feature of the current node hik-1 (e.g., through concatenation). The feature representation of the node is then updated using a neural network layer (e.g., a fully connected layer).

h_{i}^{k} = \mathrm{ReLU}\left(w^{k} \cdot \mathrm{CONCAT}\left(h_{i}^{k-1}, h_{\mathcal{N}(i)}^{k}\right) + b^{k}\right) \quad (20)

where w^{k} and b^{k} are the learnable coefficient matrices; ReLU(·) is the ReLU activation function; and CONCAT(·) denotes the concatenation operation. In the subsequent stage of the GraphSAGE process, the newly generated features h_{i}^{k} will be utilized. Following K aggregation layers, a feature vector H = h_{i}^{K}, i ∈ Ω_i is produced as the output.
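A minimal mean-aggregator GraphSAGE layer matching (19) and (20) is sketched below. Library implementations such as torch_geometric’s SAGEConv additionally perform the neighbor sampling of step 1), which is omitted here for brevity; the layer shown aggregates over all neighbors, and the class name is illustrative.

import torch
import torch.nn as nn

class MeanSAGELayer(nn.Module):
    """One GraphSAGE layer with a mean aggregator, cf. Eqs. (19)-(20)."""

    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(2 * in_dim, out_dim)  # acts on CONCAT(h_i, h_N(i))

    def forward(self, h, edge_index):
        # h: (n, in_dim) node features; edge_index: (2, m) directed edges src -> dst
        src, dst = edge_index
        agg = torch.zeros_like(h)
        agg.index_add_(0, dst, h[src])                        # sum neighbor features
        deg = torch.zeros(h.size(0), device=h.device)
        deg.index_add_(0, dst, torch.ones_like(dst, dtype=h.dtype))
        agg = agg / deg.clamp(min=1).unsqueeze(-1)            # mean aggregation, Eq. (19)
        return torch.relu(self.linear(torch.cat([h, agg], dim=-1)))  # Eq. (20)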

As shown in Fig. 2, the proposed SAGESAC incorporates two GraphSAGE layers (SAGE 1 and SAGE 2) to modify the policy and critic networks within the SAC.

Fig. 2  Feature extraction process driven by GraphSAGE.

The output of the graph policy network can be parameterized by a Gaussian distribution N(), which can be denoted as:

\pi_\theta\left(\cdot \mid S_t\right) = \mathcal{N}\left(\mu\left(S_t\right), \sigma^{2}\left(S_t\right)\right) \quad (21)

where μ(St) and σ(St) are the mean and variance of the action for the policy network, respectively. Then, the mini-batch data from the replay buffer are sampled, and θ is typically updated using gradient descent. The loss function is expressed as:

J\left(\pi_\theta\right) = \mathbb{E}_{S_t \sim \mathcal{D}, A_t^{A} \sim \pi_\theta}\left[\alpha \mathcal{H}\left(\pi_\theta\left(\cdot \mid S_t\right)\right) - Q_\beta\left(S_t, A_t^{A}\right)\right] \quad (22)
\theta \leftarrow \theta + \tau \nabla_\theta J\left(\pi_\theta\right) \quad (23)

where J(π_θ) is the loss function of the policy network; ∇_θ J(π_θ) is the differential form of J(π_θ); τ is the learning rate; E_{S_t∼𝒟, A_t^A∼π_θ} denotes the average calculation under all possible states S_t and actions A_t^A; 𝒟 is the replay buffer; and Q_β(S_t, A_t^A) is the Q-function value, which is parameterized by the graph critic network. The Q-function updates the parameters via the following method:

J\left(Q_\beta\right) = \mathbb{E}_{\left(S_t, A_t^{A}\right) \sim \mathcal{D}}\left[\left(Q_\beta\left(S_t, A_t^{A}\right) - y\right)^{2}\right] \quad (24)
y = -R_t + \gamma \mathbb{E}_{A_{t+1}^{A} \sim \pi_\theta}\left[\tilde{Q}_{\tilde{\beta}}\left(S_{t+1}, A_{t+1}^{A}\right) + \alpha \mathcal{H}\left(\pi_\theta\left(\cdot \mid S_t\right)\right)\right]

where J(Qβ) is the loss function of the Q-function; E(St,AtA)~𝒟 denotes the average calculation under all possible states St and actions AtA; EAt+1A~πθ denotes the average calculation under all possible actions At+1A; and Q˜β˜(St+1,At+1A) is the target Q-function value. Periodically, the parameters of the critic network are copied to the target critic network to stabilize learning.

\beta \leftarrow \beta + \tau \nabla_\beta J\left(Q_\beta\right) \quad (25)
\tilde{\beta} \leftarrow \upsilon \beta + (1 - \upsilon)\tilde{\beta} \quad (26)

where β and β̃ are the sets of parameters for the critic and target critic networks, respectively; ∇_β J(Q_β) is the differential form of J(Q_β); and υ is the soft update coefficient. Following the training framework of the naive SAC, the attacker can learn an attacker’s policy π_θ to achieve the worst-case performance under a limited state attack.
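One gradient step of the attacker’s SAC update, corresponding to (22)-(26), can be organized as follows. This is a schematic of the update rules only: the replay buffer, the GraphSAGE encoders, and the twin critics commonly used with SAC are omitted, and the policy/critic interfaces (e.g., policy.sample) are assumptions rather than code from the paper.

import torch

def sac_attacker_update(batch, policy, critic, critic_target,
                        policy_opt, critic_opt,
                        alpha=0.2, gamma=0.99, upsilon=0.005):
    """One soft actor-critic update for the attacker (Stage I).
    batch holds tensors (s, a, neg_r, s_next) from the replay buffer;
    neg_r is the negated defender reward stored during Stage I."""
    s, a, neg_r, s_next = batch

    # critic update, cf. Eqs. (24)-(25)
    with torch.no_grad():
        a_next, logp_next = policy.sample(s_next)             # reparameterized sample
        q_next = critic_target(s_next, a_next)
        y = neg_r + gamma * (q_next - alpha * logp_next)       # entropy-regularized target
    critic_loss = ((critic(s, a) - y) ** 2).mean()
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # policy update, cf. Eqs. (22)-(23)
    a_new, logp_new = policy.sample(s)
    policy_loss = (alpha * logp_new - critic(s, a_new)).mean()
    policy_opt.zero_grad()
    policy_loss.backward()
    policy_opt.step()

    # soft update of the target critic, cf. Eq. (26)
    with torch.no_grad():
        for p, p_t in zip(critic.parameters(), critic_target.parameters()):
            p_t.mul_(1.0 - upsilon).add_(upsilon * p)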

C. Stage II: Robust Policy Learning

This stage addresses a robust policy learning problem in a hybrid discrete-continuous action space. It is worth noting that the robust policy learning aims to enhance decision-making resilience under state adversarial attacks. Thus, the perturbed state Ŝt is used to train the neural network. Rt obtained during the training process is the flexibility gain for AtD performed by the DN environment under a clean state St.

The SAGESAC is extended by introducing two parallel graph policy networks to address policy learning issues in mixed discrete-continuous action spaces. One is designated for generating discrete actions, and the other is for generating continuous actions. Its objective function is still to maximize the sum of expected rewards and policy entropy. However, it uniquely accounts for the policy entropy of discrete and continuous actions. The objective function can be expressed as:

J\left(\pi_\varphi\right) = \max \sum_{t=0}^{T} \mathbb{E}\left[R_t\left(S_t, A_t^{D}, A_t^{A}\right) + \alpha \mathcal{H}\left(\pi_{\varphi_d}\left(\cdot \mid \hat{S}_t\right)\right) + \alpha \mathcal{H}\left(\pi_{\varphi_c}\left(\cdot \mid \hat{S}_t\right)\right)\right] \quad (27)

where π_{φ_d} is the discrete action policy network, employing the Gumbel-Softmax function for selecting discrete actions A_{d,t} [16]; π_{φ_c} is the continuous action policy network, and its structure for outputting continuous actions A_{c,t} can be referenced in (21); π_φ := {π_{φ_d}, π_{φ_c}} is the robust joint policy; and A_t^D = [A_{d,t}, A_{c,t}] is sampled from the two policies. Regarding both the discrete and continuous action policy networks, their loss functions can be defined as:

J\left(\pi_{\varphi_d}\right) = \mathbb{E}_{\hat{S}_t \sim \mathcal{D}, A_t^{D} \sim \pi_\varphi}\left[\alpha \mathcal{H}\left(\pi_{\varphi_d}\left(\cdot \mid \hat{S}_t\right)\right) - Q_\psi\left(\hat{S}_t, A_t^{D}\right)\right]
J\left(\pi_{\varphi_c}\right) = \mathbb{E}_{\hat{S}_t \sim \mathcal{D}, A_t^{D} \sim \pi_\varphi}\left[\alpha \mathcal{H}\left(\pi_{\varphi_c}\left(\cdot \mid \hat{S}_t\right)\right) - Q_\psi\left(\hat{S}_t, A_t^{D}\right)\right] \quad (28)

where ES^t~𝒟,AtD~πφ denotes the average calculation under all possible states S^t and actions AtD; and Qψ(S^t,AtD) is the Q-function value for robust policy learning.

The gradient descent method is also employed to optimize the loss functions of the policy network, aiming to learn the optimal parameters as follows:

\varphi_d \leftarrow \varphi_d + \tau \nabla_{\varphi_d} J\left(\pi_{\varphi_d}\right), \quad \varphi_c \leftarrow \varphi_c + \tau \nabla_{\varphi_c} J\left(\pi_{\varphi_c}\right) \quad (29)

where ∇_{φ_d} J(π_{φ_d}) and ∇_{φ_c} J(π_{φ_c}) are the differential forms of J(π_{φ_d}) and J(π_{φ_c}), respectively.

The critic network Qψ is updated by minimizing the following loss function J(Qψ):

J\left(Q_\psi\right) = \mathbb{E}_{\left(\hat{S}_t, A_t^{D}\right) \sim \mathcal{D}}\left[\left(Q_\psi\left(\hat{S}_t, A_t^{D}\right) - y\right)^{2}\right] \quad (30)
y = -R_t + \gamma \mathbb{E}_{A_{t+1}^{D} \sim \pi_\varphi}\left[\tilde{Q}_{\tilde{\psi}}\left(\hat{S}_{t+1}, A_{t+1}^{D}\right) + \alpha \mathcal{H}\left(\pi_\varphi\left(\cdot \mid \hat{S}_t\right)\right)\right]
\psi \leftarrow \psi + \tau \nabla_\psi J\left(Q_\psi\right) \quad (31)
\tilde{\psi} \leftarrow \upsilon \psi + (1 - \upsilon)\tilde{\psi} \quad (32)

where E_{(Ŝ_t, A_t^D)∼𝒟} denotes the average calculation under all possible states Ŝ_t and actions A_t^D; E_{A_{t+1}^D∼π_φ} denotes the average calculation under all possible actions A_{t+1}^D; ∇_ψ J(Q_ψ) is the differential form of J(Q_ψ); Q̃_{ψ̃}(Ŝ_{t+1}, A_{t+1}^D) is the target Q-function value for robust policy learning; and ψ and ψ̃ are the sets of parameters for the critic network and target critic network, respectively. A robust scheduling policy under the attacker’s policy π_θ can be obtained by optimizing π_{φ_d} and π_{φ_c}.
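The defender’s joint action in (11) combines a discrete topology action with continuous device set-points, which Stage II handles through two parallel policy heads as in (27). A schematic of such a two-headed policy is given below; the shared GraphSAGE encoder is abstracted as encoder, and the layer sizes and module names are illustrative assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class HybridPolicy(nn.Module):
    """Two parallel policy heads on a shared state embedding:
    a Gumbel-Softmax head for the discrete topology action and a
    Gaussian head for the continuous device set-points."""

    def __init__(self, encoder, embed_dim, n_topologies, n_continuous):
        super().__init__()
        self.encoder = encoder                       # e.g., stacked GraphSAGE layers
        self.discrete_head = nn.Linear(embed_dim, n_topologies)
        self.mu_head = nn.Linear(embed_dim, n_continuous)
        self.log_std_head = nn.Linear(embed_dim, n_continuous)

    def forward(self, state, temperature=1.0):
        h = self.encoder(state)
        # discrete sub-action: differentiable one-hot sample via Gumbel-Softmax
        logits = self.discrete_head(h)
        a_discrete = F.gumbel_softmax(logits, tau=temperature, hard=True)
        # continuous sub-action: tanh-squashed Gaussian sample in [-1, 1]
        mu = self.mu_head(h)
        log_std = self.log_std_head(h).clamp(-5, 2)
        a_continuous = torch.tanh(mu + log_std.exp() * torch.randn_like(mu))
        return a_discrete, a_continuous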

D. Alternate Training of Stage I and Stage II

Following the pre-training of the defender’s policy, a two-stage alternate training sequence is initiated. In the first stage, the pre-trained defender’s parameters are held constant while the attacker’s parameters are optimized to learn the attacker’s policy. After completing C1 training episodes, the process shifts to optimizing the defender’s parameters, keeping the attacker’s parameters static, to develop a robust defense policy. After training the defender for C2 episodes, this cycle is then repeated. This iterative training strategy ensures continuous improvement and adaptation of both agents. The alternate training process of RoGDRL is shown in Algorithm 1.

Algorithm 1  : alternate training process of RoGDRL

Input: number of alternate periods C, and numbers of episodes C1 and C2 for training Stages I and II

Output: parameters of attacker and defender

for alternate periods of 1,2,,C do

 Get the optimal defender policy

for episodes of 1,2,,C1 do

  for t=1,2,,Nt do

   Output the action A_t^A ~ π_θ(·|S_t) with fixed π_φ

   Calculate the reward Rt

   Store {St,AtA,-Rt,St+1} in buffer and update parameters of πθ

  end for

end for

  for episodes of 1,2,,C2 do

   for t=1,2,,Nt do

     Output the action A_t^D ~ π_φ(·|S_t + A_t^A) with fixed π_θ

    Calculate the reward Rt

     Store {Ŝ_t, A_t^D, R_t, Ŝ_{t+1}} in buffer and update parameters of π_φ

   end for

  end for

end for
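For reference, a compact Python rendering of Algorithm 1 is given below; env, attacker, and defender are assumed to expose simple reset/step/act/update interfaces, so this is a structural outline of the alternate training loop rather than the paper’s code.

def alternate_training(env, attacker, defender, C, C1, C2, N_t):
    """Alternate adversarial training of RoGDRL, following Algorithm 1."""
    for _ in range(C):
        # Stage I: learn the attacker's policy with the defender frozen
        for _ in range(C1):
            s = env.reset()
            for _ in range(N_t):
                a_att = attacker.act(s)
                a_def = defender.act(s + a_att)        # defender sees the perturbed state
                s_next, r = env.step(a_def)            # reward evaluated on the clean state
                attacker.update(s, a_att, -r, s_next)  # attacker minimizes the defender reward
                s = s_next
        # Stage II: learn the robust defender policy with the attacker frozen
        for _ in range(C2):
            s = env.reset()
            s_hat = s + attacker.act(s)
            for _ in range(N_t):
                a_def = defender.act(s_hat)
                s_next, r = env.step(a_def)
                s_hat_next = s_next + attacker.act(s_next)
                defender.update(s_hat, a_def, r, s_hat_next)  # perturbed transitions, cf. (17)
                s_hat = s_hat_next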

V. Case Study

This paper uses the load and PV data from Jinan, China in 2020 to generate a large number of load and PV profiles. Then, the resultant load and PV instances are normalized to match the scale of power demands in the simulated system to train the proposed method. The weather data of the region in 2020 are from Solcast [38]. Next, the proposed algorithm is trained using PyTorch on an NVIDIA RTX 3090 GPU with 12 GB RAM. The modified IEEE 123-bus system in Fig. 3 is used to verify the proposed method. The range of the voltage is [0.93 p.u., 1.07 p.u.]. γ_L and γ_pv are set to be 400 ¥/MWh and 800 ¥/MWh, respectively. γ_A is set to be ¥2. Detailed parameter settings of SOPs, SVCs, ESSs, PVs, and hyperparameter settings for RoGDRL can be found in [39].

Fig. 3  Modified IEEE 123-bus system.

A. Analysis of Superiority of Proposed Algorithm

The proposed algorithm is compared with the existing GDRL algorithms, including graph attention soft actor-critic (GATSAC) [10] and graph convolutional network soft actor-critic (GCNSAC) [14]. The cumulative reward curves for different algorithms are shown in Fig. 4. While SAGESAC, GATSAC, and GCNSAC are trained under nominal conditions, the proposed algorithm is specifically trained in environments with state adversarial attacks.

Fig. 4  Cumulative reward curves for different algorithms.

Figure 4 shows that in the final stages of training, the cumulative reward obtained by SAGESAC exhibits minor fluctuations around a fixed value, indicating gradual convergence of the proposed algorithm. This suggests that the algorithm has mastered a scheduling policy capable of improving flexibility gains. Moreover, compared with GATSAC and GCNSAC, the integration of GraphSAGE significantly enhances the feature recognition capability of the proposed algorithm in large-scale complex systems, enabling SAGESAC to achieve higher cumulative rewards.

Furthermore, the cumulative reward curves of the proposed algorithm demonstrate significant oscillations during the adversarial training process. At the stage where the attack strategy is being learned, a decrease in the cumulative reward curve indicates that the attacker’s adversarial strategy has successfully disrupted the defender’s decision-making process. On the contrary, an increase in the cumulative reward curve indicates that the defender is learning how to effectively counteract the attacker’s strategy, thus progressively enhancing the quality of its decision-making. This illustrates that the defender adjusts its strategy in response to adversarial challenges to maximize long-term rewards. As time progresses, the cumulative reward curve tends to stabilize, implying that the proposed algorithm becomes increasingly efficient and robust in counteracting the impacts of state adversarial attacks through adversarial training.

Notably, the final reward performance of the proposed algorithm, which operates in an adversarial training environment, is lower than that of SAGESAC and GATSAC, both of which operate in a clean environment, free from adversarial attacks. This difference arises because the proposed algorithm is designed to address the max-min problem, as described in (15), rather than solely maximizing rewards, in contrast to SAGESAC and GATSAC. By sacrificing some reward optimality, the proposed algorithm enhances its robustness against state attacks. Although its decision outcomes are not optimal, the proposed algorithm maintains commendable performance stability in the face of state adversarial attacks. This aspect will be explored further in subsequent analyses.

B. Analysis of Superiority of Proposed SA-TZMG Model

In adversarial training, a powerful and stealthy attacker is crucial, ensuring the defender can achieve optimal rewards in worst-case scenarios [26]. Thus, to demonstrate the necessity of considering actual physical constraints of state attacks, three types of attack scenarios are established.

1) Scenario A: without considering the physical constraints and BDD mechanism.

2) Scenario B: without considering the physical constraints.

3) Scenario C: considering the physical constraints.

The attack vectors generated in Scenarios A and B are consistent, and the only difference is whether BDD is performed to eliminate anomalous attack vectors. The test rewards of three algorithms after encountering these three attack scenarios and the perturbation residual statistics for Scenarios A and C are shown in Fig. 5.

Fig. 5  Rewards of three algorithms in different attack scenarios and perturbation residual statistics for scenarios A and C. (a) Rewards. (b) Perturbation residual statistics.

As shown in Fig. 5(a), in Scenario A, the lack of the consideration for the physical constraints and BDD mechanism results in relatively larger state disturbances, which leads to a significant reduction in the rewards of the three algorithms. However, as shown in Fig. 5(b), the attack signals generated in Scenario A have a 62.2% probability of exceeding the residual threshold across all load levels. This implies that only 47.8% of the generated attacks can bypass the BDD mechanism. Since it is not possible to guarantee that all generated attacks are effective, this results in the highest reward for Scenario B in Fig. 5(a). In other words, the effect of state adversarial attacks is the weakest when actual physical constraints are not considered. In comparison, Scenario C enables all attack signals to bypass the BDD mechanism, thus ensuring the effectiveness of the attacks. It can be concluded that actual physical constraints enhance the stealth and precision of state adversarial attacks, making them more reflective of real-world attack scenarios.

The impact of state adversarial attacks on system operational performance is further analyzed through the following five cases.

1) Case 1: naive scheduling without considering attack.

2) Case 2: naive scheduling considering attack.

3) Case 3: robust scheduling without considering attack.

4) Case 4: robust scheduling considering attack.

5) Case 5: without control.

In this context, naive scheduling employs the SAGESAC to output scheduling strategies, while robust scheduling utilizes the proposed algorithm for its strategy output. The optimization results in different cases are shown in Table I.

TABLE I  Comparison of Optimization Results in Different Cases

Case    Flexibility gain    Operation cost (¥)    Maximum voltage deviation    Maximum average loading rate
1       30.67               615.86                0.0588                       0.4889
2       21.66               1511.01               0.0793                       0.5311
3       27.38               749.66                0.0679                       0.5233
4       28.55               650.86                0.0651                       0.5116
5       –                   3746.50               0.0909                       0.5594

In Table I, the flexibility gain for Case 2 decreases by 29.37% due to the distortion of original state observations by state adversarial attacks, misleading the SAGESAC. This distortion has the effect of increasing operation costs, raising branch loading rates, and causing potential voltage violations. When comparing Case 4 with Case 1, it can be observed that the proposed algorithm demonstrates effective resistance to state adversarial attacks, with a reduced flexibility gain of only 6.91%. Furthermore, the ability of the proposed algorithm to adapt to unknown external attacks through adversarial training broadens the decision-making experience. For example, Case 3 shows relatively favorable decision outcomes compared with Case 5.

Comparison between Case 3 and Case 4 reveals that the proposed algorithm exhibits a 4.09% decrease in flexibility gain in scenarios without state attacks. This indicates that while adversarial training enhances the robustness of the proposed algorithm against state adversarial attacks, it may lead to overfitting of the neural network policy to adversarial features. Such overfitting results in a slight degradation of algorithmic decision-making performance when processing clean state data under normal (attack-free) conditions.

In summary, the proposed algorithm significantly enhances the robustness against state adversarial attacks while still maintaining a relatively high operational flexibility. Although this algorithm sacrifices some decision-making performance, it is justified because naive DRL algorithms can be severely compromised in the presence of state adversarial attacks.

C. Analysis of Effectiveness of Proposed Flexibility Scheduling Method

To demonstrate the effectiveness of the proposed method, this subsection first analyzes the flexibility gain on the test day and the corresponding changes in node voltage deviation, average branch loading rate, and operation cost. The operation data of PV and load on the test day, with a time resolution of one hour, are shown in Fig. 6. The operational performance of the DN before and after implementing the proposed method is presented in Fig. 7.

Fig. 6  Operation data of PV and load on test day.

Fig. 7  Operational performance of DN. (a) Flexibility gain. (b) The maximum voltage deviation. (c) Average branch loading rate. (d) Operation cost.

As shown in Fig. 7, the proposed method significantly reduces the maximum voltage deviation, average branch loading rate, and operation cost by enhancing the flexibility gain. In Fig. 7(a), the flexibility gain during periods of 19-24 hours exceeds that during periods of 1-7 hours. This is because ESSs, key devices that support system flexibility, implement an effective charging and discharging strategy to shift the PV output during the day to peak load demand periods at night. This strategy balances power supply and demand, thereby enhancing the operational flexibility of the DN. Notably, the flexibility gain significantly increases during periods of 8-9 hours and 17-18 hours. During these periods, the load fully absorbs the high PV output, thereby reducing the maximum voltage deviations, average loading rates, and operation cost. Consequently, the system has an inherent degree of flexibility, which is further enhanced by implementing effective scheduling strategies.

As shown in Fig. 7(b), periods of 11-15 hours exhibit the highest PV output, while periods of 20-24 hours have the highest load demand. In the absence of effective scheduling in the DN, an excess of net power at nodes leads to voltage deviations that exceed acceptable limits, indicating a lack of sufficient node flexibility. Conversely, the proposed method addresses voltage violations by enhancing the flexibility gain, thus endowing the DN with a more adequate node flexibility.

In Fig. 7(c), the average branch loading rate of the system is significantly reduced compared with that before the proposed method is implemented, by an average decrease of 25.1%. This indicates that by enhancing the flexibility gain, the proposed method enables the system to have sufficient branch transfer flexibility, thereby better balancing the power demand and supply across different nodes.

In Fig. 7(d), the operation cost of the system is significantly reduced compared with that before the proposed method is implemented, by an average decrease of 30.1%, especially during period of 11-15 hours when the PV generation is high. During these periods, the DN struggles to accommodate all the PV generation, resulting in curtailments of 0.19 MW, 0.75 MW, 0.89 MW, 0.86 MW, and 0.40 MW, which leads to higher PV curtailment cost. In contrast, the proposed method enhances the flexibility gain, thereby ensuring that the system has sufficient operational safety margins and improving its PV accommodation capacity. Additionally, the proposed method improves node and branch transfer flexibilities while maintaining lower operation cost, thereby comprehensively enhancing the operational level of the DN.

To provide a detailed analysis of how the proposed method enhances the flexibility of the DN, Fig. 8 presents the scheduling strategies of different flexible resources.

Fig. 8  Scheduling strategies of different controllable resources. (a) Active power of SOPs. (b) Reactive power of SOPs. (c) Reactive power of SVCs. (d) Active power of ESSs.

As shown in Fig. 8, the scheduling strategies for flexibility resources are closely related to the PV penetration rate in the DN. This is because variations in the PV penetration exacerbate the net load volatility, leading to mismatches between PV generation and load demand, which in turn results in insufficient flexibility [2].

In Fig. 8(a), due to the high number of PV installations at the end of the DN, the abundant PV output significantly exceeds the load demand during periods with high PV penetration. The SOP transfers active power from node 117 to node 56 to smooth power fluctuations as much as possible. During the early morning and nighttime, the SOP transfers a portion of the power required by end loads from node 56 to node 117. This demonstrates that through differentiated power transfer strategies, the SOP effectively addresses the issue of uneven spatial distribution of PV generation and load demand, thereby enhancing the flexibility of the DN.

In Fig. 8(b) and (c), SVCs and SOPs each provide local reactive power compensation through different mechanisms, thereby obviating the necessity for long-distance transmission of reactive power from the resource. These complementary strategies reduce power losses and improve voltage distributions.

In Fig. 8(d), during periods with high PV penetration, ESSs are charged to mitigate the supply-demand imbalance caused by excessive PV output. At night, ESSs are discharged to smooth the high load demand. This demonstrates that by implementing appropriate charging and discharging strategies for ESSs, the proposed method effectively addresses the temporal distribution imbalance between PV generation and load demand, thereby enhancing the flexibility of the DN. It is worth noting that during period of 3-9 hours, all ESSs reach their state of charge limits and cannot continue discharging, resulting in zero power output.

In summary, the proposed method effectively coordinates various controllable resources by maximizing flexibility gain. This alleviates the spatiotemporal mismatch between PV generation and load demand, thereby enhancing the flexibility of the DN.

To further illustrate the comprehensive enhancement of the operational efficiency of the DN through optimizing flexibility gain, three independent objectives, i.e., the maximum voltage deviation, operation cost, and average branch loading rate, are employed to formulate a multi-objective optimization model. The multi-objective particle swarm optimization (MOPSO) is used to generate the Pareto front, and the technique for order of preference by similarity to ideal solution (TOPSIS) is utilized to determine the optimal compromise solution. The maximum number of iterations is 500, with a population of 100. Optimization results are shown in Table II. The MOPSO results represent the statistical values computed from five independent runs. Time 14 and time 22 are identified as the positive peak and negative peak of the net load on the test day, respectively.

TABLE II  Comparison of Optimization Results from Different Models

Time    Model       Operation cost (¥)    Maximum voltage deviation (p.u.)    Average branch loading rate (p.u.)    Test time (s)
14      MOPSO       78.23±8.21            0.062±0.005                         0.54±0.04                             731.64
14      Proposed    60.98                 0.059                               0.51                                  0.05
22      MOPSO       25.91±1.62            0.04±0.002                          0.31±0.01                             629.71
22      Proposed    22.14                 0.035                               0.22                                  0.05

Table II shows that the operation cost and average branch loading rate achieved with the proposed SA-TZMG model are significantly lower than those obtained using MOPSO. Although the maximum voltage deviation with the proposed SA-TZMG model at time 14 is 3.5% higher than that with the MOPSO, this deviation remains within a safe range. Furthermore, the efficiency of the proposed SA-TZMG model in deriving solutions outperforms that of the MOPSO, as evidenced by a significant decrease in test time. Therefore, it can be concluded that the proposed SA-TZMG model achieves better system performance by enhancing operational flexibility compared with traditional multi-objective optimization models.

C. Impact Analysis of Temperature-dependent Resistance

Weather factors and power flow are the key determinants of line resistance. Thus, we analyze the impact of dynamic weather and system power flow on the flexibility gain of the system over one year. In 2020, the air temperature, wind speed, wind direction, and solar radiation in Jinan, China varied within -13 ℃ to 38 ℃, 0.1 m/s to 9.3 m/s, 0° to 360°, and 0 J/m² to 1002 J/m², respectively. The relative error results of the flexibility gain with and without considering weather factors are shown in Fig. 9.

Fig. 9  Relative error results of flexibility gain with and without considering weather factors.

Figure 9 illustrates that considering the temperature-dependent resistance leads to notable differences in the node, cost, and branch transfer flexibility gains. The relative error ranges for the node flexibility gain, cost flexibility gain, and branch transfer flexibility gain are 0.0129%-29.88%, 0.000412%-26.42%, and 0.000294%-28.59%, respectively. This is primarily because changing weather conditions shift the thermal equilibrium point of the conductor, so that both the conductor temperature and its resistance transition dynamically towards a new equilibrium. If neglected, this dynamic line resistance introduces a significant error into the power flow analysis, ultimately affecting the flexibility gain calculation and the resulting decisions.
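The mechanism described above can be illustrated with a highly simplified steady-state heat balance in which Joule heating and absorbed solar heat are removed by linear convective cooling, and the conductor resistance is corrected linearly with temperature. The full IEEE Std 738 model additionally accounts for radiative cooling and wind-dependent convection; the temperature coefficient, per-metre parameters, and example values below are illustrative assumptions.

```python
def line_resistance(r_ref, t_cond, t_ref=20.0, alpha=0.00403):
    """Linear temperature correction of conductor resistance.

    alpha is a typical value for aluminium conductors; all parameters
    here are illustrative, not taken from the case study.
    """
    return r_ref * (1.0 + alpha * (t_cond - t_ref))

def conductor_equilibrium(i_amp, t_air, q_solar=0.0, h_conv=2.0,
                          r_ref=3e-4, n_iter=50):
    """Fixed-point iteration for the conductor thermal equilibrium.

    Simplified heat balance per metre of conductor:
        I^2 * R(Tc) + q_solar = h_conv * (Tc - t_air)
    Returns the equilibrium conductor temperature and resistance.
    """
    t_cond = t_air
    for _ in range(n_iter):
        r = line_resistance(r_ref, t_cond)
        t_cond = t_air + (i_amp ** 2 * r + q_solar) / h_conv
    return t_cond, line_resistance(r_ref, t_cond)

# Same loading under two weather conditions: the equilibrium resistance shifts.
t_hot, r_hot = conductor_equilibrium(400.0, t_air=38.0, q_solar=10.0)
t_cold, r_cold = conductor_equilibrium(400.0, t_air=-13.0, q_solar=0.0)
print(round(r_hot, 6), round(r_cold, 6))
print(round(abs(r_hot - r_cold) / r_hot * 100, 1), "% relative difference")
```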

VI. Conclusion

This study introduces a flexibility scheduling method for DNs based on RoGDRL. A mathematical model for flexibility scheduling with temperature-dependent resistance constraints is first constructed. Based on this, an SA-TZMG model is proposed, which enhances the safety and robustness of the flexibility scheduling method. Finally, a two-stage RoGDRL algorithm based on SAGESAC is designed to achieve robust DRL-based flexibility scheduling, in which the attacker and defender are trained alternately. Numerical analysis indicates that:

1) Compared with the traditional DRL-based optimization methods, the proposed method demonstrates stronger robustness against state adversarial attacks.

2) Enhancing flexibility gain can comprehensively improve the operational performance of the DN, thereby better adapting to the large-scale integration of PV.

3) Considering temperature-dependent resistance is crucial for accurately capturing the dynamic changes of line resistance during optimization, which significantly impacts the accuracy of decision-making.

There are several directions for future work. First, additional flexibility analysis indicators could be integrated into the flexibility gain to further enhance scheduling performance. Second, the constructed state-adversarial model could be extended to a Stackelberg game with incomplete information to address the information asymmetry between the attacker and defender, while also considering attack resource constraints.
