Abstract
The high proportion of renewable energy integration and the dynamic changes in grid topology require voltage/var control (VVC) to manage voltage fluctuations more rapidly. Traditional model-based control algorithms are becoming increasingly inadequate for VVC because of their heavy model dependence and slow online computation. To alleviate these issues, this paper introduces a graph attention network (GAT) based deep reinforcement learning method for VVC of topologically variable power systems. First, combining the physical information of the actual power grid, a physics-informed GAT is proposed and embedded into the proximal policy optimization (PPO) algorithm. The resulting GAT-PPO algorithm can capture topological and spatial correlations among node features to handle topology changes. To address slow training, the ReliefF-S algorithm identifies critical state variables, significantly reducing the dimensionality of the state space. Then, the training samples retained in the experience buffer are redesigned to mitigate the sparse reward issue. Finally, validation on the modified IEEE 39-bus system and an actual power grid demonstrates the superior performance of the proposed algorithm compared with state-of-the-art algorithms, including the PPO algorithm and the twin delayed deep deterministic policy gradient (TD3) algorithm. The proposed algorithm exhibits better convergence during training, faster solution speed, and improved VVC performance, even in scenarios involving grid topology changes and increased renewable energy integration. Meanwhile, in the adopted cases, the network loss is reduced by 6.9%, 10.8%, and 7.7%, respectively, demonstrating favorable economic outcomes.
WITH the increasing penetrations of renewable energy resources such as wind and solar, the randomness and fluctuation of these resources bring more uncertainty to power systems, resulting in more frequent and severe voltage fluctuations [
Many studies have been conducted to solve the voltage or reactive power optimization and control problem considering the uncertainties of renewable energy generation and load demand [
Unlike SP, RO, and IP, DRL is an artificial intelligence algorithm that does not rely on an accurate physical model. A VVC method based on the deep deterministic policy gradient (DDPG) algorithm is proposed in [
With consideration of the above problems, a graph attention network based PPO (GAT-PPO) algorithm for power systems with a high proportion of wind power is proposed in this paper. Firstly, wind turbine nodes and load nodes with larger active power are given larger weights, and the calculation of the attention coefficient in the GAT is improved based on these weights. Secondly, the improved ReliefF algorithm (referred to as the ReliefF-S algorithm hereafter) extracts the critical features affecting system stability, and these features are used as the state variables of the GAT-PPO algorithm to reduce the dimension of the state space. Finally, the samples retained in the experience buffer are improved to mitigate the sparse reward problem. Voltage optimization and control can be realized based on the above improvement strategy. The major contributions of this paper can be summarized as follows.
1) The proposed GAT-PPO algorithm integrates the physical information of the power system into the GAT, and it can give more attention to the important nodes during voltage regulation. Moreover, compared with the traditional DRL algorithms, the proposed GAT-PPO algorithm exhibits better transfer learning performance and greater adaptability to various grid topologies by integrating with GAT.
2) The critical state variables for DRL training are screened out based on the ReliefF-S algorithm, which can reduce the dimension of state space and improve the training efficiency of the algorithm. Additionally, in light of the practical issues of VVC, a reward function construction method based on the constraints first and objective later is proposed, providing a reference for related research.
3) Training samples retained in the experience buffer are improved to mitigate the sparse reward problem. In the improved training samples, some samples with large temporal difference errors are retained as positive experiences to guide the agent toward the correct training direction. Meanwhile, a small number of samples that violate constraints are also retained as negative experiences, warning the agent against action strategies that violate constraints. In addition, the boundary of grid topologies that the proposed GAT-PPO algorithm can manage is found to be expanded.
The rest of this paper is organized as follows. Section II introduces the DRL model for intraday VVC. The physics-informed GAT is presented in Section III. Section IV elaborates the state space reduction based on the ReliefF-S algorithm. Section V introduces intraday VVC based on GAT-PPO algorithm. Comparative studies are shown and discussed in Section VI. Finally, conclusions are drawn in Section VII.
The PPO algorithm is an improvement upon the trust region policy optimization (TRPO) algorithm, which is capable of handling continuous and discrete action spaces with good convergence [
In this paper, the state space, action space, and reward function of the DRL model for VVC are designed and defined. A concise schematic diagram of the proposed GAT-PPO algorithm is shown in Fig. 1.

Fig. 1 Concise schematic diagram of proposed GAT-PPO algorithm.
The system operator or control program is set as the agent. The agent contains two neural networks: an improved GAT and a fully connected network (FCN). The power system dynamic simulator interacting with the agent is set as the environment.
The grid state information in the model includes the wind turbine output, traditional unit output, load demand, voltage distribution, branch power distribution, reactive power output of dynamic reactive power compensation device, and adjacency matrix. The state space is as follows:
$s_t=\{P_t^{\mathrm{W}},Q_t^{\mathrm{W}},P_t^{\mathrm{G}},Q_t^{\mathrm{G}},P_t^{\mathrm{L}},Q_t^{\mathrm{L}},V_t,S_t,Q_t^{\mathrm{C}},A_t\}$ (1)
where $P_t^{\mathrm{W}}$ and $Q_t^{\mathrm{W}}$ are the sets of active and reactive power outputs of all wind turbines at time $t$, respectively; $P_t^{\mathrm{G}}$ and $Q_t^{\mathrm{G}}$ are the sets of active and reactive power outputs of traditional units at time $t$, respectively; $P_t^{\mathrm{L}}$ and $Q_t^{\mathrm{L}}$ are the sets of active and reactive power demands of all loads at time $t$, respectively; $V_t$ is the set of voltage amplitudes of all nodes at time $t$; $S_t$ is the set of apparent power amplitudes of all branches at time $t$; $Q_t^{\mathrm{C}}$ is the set of reactive power outputs of all dynamic reactive power compensation devices at time $t$; and $A_t$ is the adjacency matrix of the system at time $t$.
The action space represents the solution space. In this paper, the following VVC measures are considered: reactive power regulation of each static var generator (SVG) configured in the wind farm and reactive power regulation of each static var compensator (SVC). The action space is defined as:
$a_t=\{\Delta Q_t^{\mathrm{SVG}},\Delta Q_t^{\mathrm{SVC}}\}$ (2)
where $\Delta Q_t^{\mathrm{SVG}}$ is the set of reactive power variations of all SVGs at time $t$; and $\Delta Q_t^{\mathrm{SVC}}$ is the set of reactive power variations of all SVCs at time $t$.
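To make the notation concrete, a minimal sketch of how the observation and action vectors of (1) and (2) could be assembled is given below. The array names, dimensions, and the flattening of the adjacency matrix are illustrative assumptions rather than the paper's actual implementation.

```python
import numpy as np

def build_state(p_wind, q_wind, p_gen, q_gen, p_load, q_load,
                v_bus, s_branch, q_comp, adjacency):
    """Concatenate the quantities of (1) into a flat observation vector.

    The adjacency matrix is flattened and appended so that topology
    information is part of what the GAT-based agent observes.
    """
    return np.concatenate([p_wind, q_wind, p_gen, q_gen, p_load, q_load,
                           v_bus, s_branch, q_comp, adjacency.ravel()])

def build_action(dq_svg, dq_svc):
    """Stack the SVG and SVC reactive power adjustments of (2)."""
    return np.concatenate([dq_svg, dq_svc])

# Illustrative sizes loosely following the modified IEEE 39-bus case.
state = build_state(np.zeros(3), np.zeros(3), np.zeros(10), np.zeros(10),
                    np.zeros(21), np.zeros(21), np.zeros(39), np.zeros(34),
                    np.zeros(5), np.eye(39))
action = build_action(np.zeros(3), np.zeros(5))
```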
The reward is crucial in guiding the learning direction of the agent. In this paper, a reward function construction method based on the principle of constraints first and objective later is designed. The main idea is to divide the reward function according to the constraint conditions and the objective function: the objective function is considered only when the states satisfy the constraints; otherwise, a penalty value is returned.
In this paper, the reward is designed based on the network loss, and the network loss is the objective function of the model. The smaller the network loss, the larger the reward. The mathematical expression is:
$r_t=\lambda\,(P_{\mathrm{loss},t}^{\mathrm{set}}-P_{\mathrm{loss},t})/S_{\mathrm{B}}$ (3)
where $P_{\mathrm{loss},t}^{\mathrm{set}}$ is the preset network loss at time $t$; $P_{\mathrm{loss},t}$ is the actual network loss at time $t$; $S_{\mathrm{B}}$ is the base capacity, typically 100 MVA; and $\lambda$ is the reward adjustment coefficient.
In this paper, the penalty is designed based on system security and stability constraints. The constraints consist of four parts: branch power constraints, node voltage constraints, generator reactive power constraints, and generator active power constraints. To ensure the safe and stable operation of the power system, the values of these variables need to be within specified ranges. When designing the penalty, the more serious the violation, the larger the penalty. Inspired by [
(4)
where , , and are the penalty factors of branch power violation, node voltage violation, unit active power violation, and unit reactive power violation, respectively; is the number of branches; is the number of nodes; is the number of generators; is the apparent power of branch at time ; is the maximum apparent power allowed by branch ; is the voltage of node at time ; and are the maximum and minimum voltages of node , respectively; is the active power of generator at time ; and are the maximum and minimum active power of generator , respectively; is the reactive power of generator at time ; and and are the maximum and minimum reactive power of generator , respectively. Each item in (4) is normalized to eliminate the inconsistency problem of dimension and magnitude, the values of which range from 0 to 1. Through multiple tests with different cases, the values of all four penalty factors are set to be 0.25 in the paper.
According to the designed reward and penalty, the mathematical expression of the total reward function is given as:
(5)
where is the set of state constraints; is the penalty given for violation of constraints, and its purpose is to enable the agent to make decisions within constraints; and is the reward given when all constraints are satisfied, enabling the agent to find the optimal decision based on the feasible decisions.
According to the mathematical expression of the total reward function, when any constraint is violated, the agent receives a negative reward according to (4). The purpose of this setting is to guide the agent to give an action that satisfies all constraints. When no constraint is violated, it can be observed from (4) and (5) that the closer the node voltage is to 1 p.u., the larger the positive reward value the agent receives under the same other constraints. The purpose of this setting is to guide the agent to learn an action that achieves an ideal voltage distribution.
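The constraints-first, objective-later logic of (3)-(5) can be sketched as follows. The normalization of each violation term, the helper names, and the exact functional form of the loss-based reward are assumptions that only mirror the verbal description above.

```python
import numpy as np

def violation(value, lower, upper):
    """Normalized amount by which value leaves [lower, upper]; 0 if inside."""
    span = upper - lower
    return max(0.0, (lower - value) / span, (value - upper) / span)

def total_reward(limits, preset_loss, actual_loss, s_base=100.0, lam=1.0,
                 factors=(0.25, 0.25, 0.25, 0.25)):
    """Constraints first: return a negative penalty if any limit is violated;
    objective later: otherwise reward a network loss below the preset value."""
    pen = np.dot(factors, [
        sum(violation(s, 0.0, s_max) for s, s_max in limits["branch"]),
        sum(violation(v, v_min, v_max) for v, v_min, v_max in limits["voltage"]),
        sum(violation(p, p_min, p_max) for p, p_min, p_max in limits["p_gen"]),
        sum(violation(q, q_min, q_max) for q, q_min, q_max in limits["q_gen"]),
    ])
    if pen > 0.0:                                        # penalty branch of (5)
        return -pen
    return lam * (preset_loss - actual_loss) / s_base    # reward branch of (5)
```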
The GAT introduces the attention mechanism into the graph neural network, which obtains the overall information of the network from local information by calculating the importance of adjacent nodes to the central node. The advantage of the GAT is that it does not require any kind of costly matrix operation or depend on knowing the graph structure upfront, making it directly applicable to inductive learning issues [
The expression of the attention coefficient is given as [
$\alpha_{ij}=\dfrac{\exp\big(\mathrm{LeakyReLU}\big(a^{\mathrm{T}}[Wh_i\,\|\,Wh_j]\big)\big)}{\sum_{k\in N_i}\exp\big(\mathrm{LeakyReLU}\big(a^{\mathrm{T}}[Wh_i\,\|\,Wh_k]\big)\big)}$ (6)
where $\alpha_{ij}$ is the importance of node $j$ to node $i$; $W$ is the weight matrix; $a$ is the parameter of a single-layer feedforward neural network; LeakyReLU is the activation function; $N_i$ is the set of neighbor nodes of node $i$; $h_i$, $h_j$, and $h_k$ are the characteristics of nodes $i$, $j$, and $k$, respectively; and $\|$ represents the concatenation operation.
Extensive physical knowledge has been developed in power systems, and the application of the GAT in the field of electric power should be combined with the actual situation in the field. In a power system with wind power integration, the wind turbine nodes and load nodes are the key nodes affecting voltage safety and stability. The higher the active power of these nodes, the more likely voltage safety and stability issues are to occur [
The weight coefficients of wind turbine and load nodes are defined as:
(7)
where superscripts and denote the sets of wind turbine nodes and load nodes, respectively; and are the weight coefficients of wind turbine node and load node at time , respectively; and are the active power of wind turbine node and load node at time , respectively; and and are the minimum active power of all wind turbines and the minimum active power of all loads at time , respectively.
According to the above calculation method of the weight coefficient, the larger the active power of a node, the larger its weight coefficient. For each load or wind turbine node whose active power is not equal to 0, the weight coefficient is multiplied by the corresponding attention coefficient $\alpha_{ij}$. For other nodes, $\alpha_{ij}$ is multiplied by 0.9. Therefore, the key wind turbine and load nodes receive more attention through the improved GAT.
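The sketch below illustrates one way the attention coefficients of (6) could be combined with the physics-informed node weights: coefficients toward high-power wind turbine and load nodes are scaled by their weight coefficients, all other coefficients are scaled by 0.9, and the rows are renormalized. The dense tensor layout and the final renormalization are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def physics_informed_attention(e, node_weight, is_key_node):
    """Rescale GAT attention coefficients with physics-informed node weights.

    e:            [N, N] attention logits LeakyReLU(a^T [W h_i || W h_j])
    node_weight:  [N] weight coefficients of the wind turbine / load nodes
    is_key_node:  [N] bool, True for wind turbine / load nodes with nonzero P
    """
    alpha = F.softmax(e, dim=1)                       # standard coefficients of (6)
    scale = torch.where(is_key_node, node_weight,
                        torch.full_like(node_weight, 0.9))
    alpha = alpha * scale.unsqueeze(0)                # scale attention toward node j
    return alpha / alpha.sum(dim=1, keepdim=True)     # renormalize over neighbors

# Example with 4 nodes; node 2 is a heavily loaded bus with weight 1.3.
e = torch.randn(4, 4)
alpha = physics_informed_attention(e, torch.tensor([1.0, 1.0, 1.3, 1.0]),
                                   torch.tensor([False, False, True, False]))
```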
One of the important causes for slow training and non-convergence of reinforcement learning is that the dimension of its state space or action space is too large. To solve the problem, a state space reduction strategy is proposed. The key features that have a great influence on voltage stability are obtained through the method of key feature extraction, and then these key features are taken as state variables of reinforcement learning. The method can reduce the state space dimension and accelerate the convergence of the model.
The ReliefF algorithm is an efficient feature extraction algorithm that assigns weights to features based on their correlation with the labels. The feature whose weight is less than the setting threshold value will be removed, and then the optimal feature subset can be obtained [
To address the above problems, the Spearman correlation coefficient is adopted to improve the ReliefF algorithm, and the improved algorithm is referred to as the ReliefF-S algorithm. The Spearman correlation coefficient is used because it does not require the variables to follow a specific distribution. The specific improvements are described as follows.
Firstly, the Spearman correlation coefficient [
(8)
(9)
where and represent two samples, respectively; and represent the average values of the two samples, respectively; is the total number of features in the sample; is the feature; is the weight of ; and are the near-neighbor homogeneous samples and heterogeneous samples of sample , respectively; or represents the distance between samples and or on , respectively; is the number of iterations of the algorithm; is the number of near-neighbor homogeneous samples; is the probability of label ; is the label of sample ; is the probability that sample belongs to some kind of label; is the weight contribution of sample and near-neighbor homogeneous samples on ; and is the weight contribution of sample and all near-neighbor heterogeneous samples on .
Finally, the key factors affecting the system voltage stability are screened out based on the proposed ReliefF-S algorithm. These key factors are the state variables used in DRL training.
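As an illustration of the screening step, a compact ReliefF-style weighting routine that uses the Spearman rank correlation to pick near neighbors is sketched below (scipy provides spearmanr). Where exactly the Spearman coefficient enters the ReliefF-S algorithm is not spelled out above, so this placement, along with the hit/miss update, is an assumption.

```python
import numpy as np
from scipy.stats import spearmanr

def relieff_s(X, y, n_iter=100, k=5, seed=0):
    """Weight each feature by its relevance to the stable/unstable label.

    X: [n_samples, n_features], y: binary labels. Neighbors are the samples
    most Spearman-correlated with the probe sample (assumed modification).
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    span = X.max(axis=0) - X.min(axis=0) + 1e-12
    w = np.zeros(d)
    for _ in range(n_iter):
        i = rng.integers(n)
        sim = np.array([spearmanr(X[i], X[j])[0] for j in range(n)])
        sim_hit = np.where((y == y[i]) & (np.arange(n) != i), sim, -np.inf)
        sim_miss = np.where(y != y[i], sim, -np.inf)
        for h in np.argsort(-sim_hit)[:k]:     # near-neighbor homogeneous samples
            w -= np.abs(X[i] - X[h]) / span / (n_iter * k)
        for m in np.argsort(-sim_miss)[:k]:    # near-neighbor heterogeneous samples
            w += np.abs(X[i] - X[m]) / span / (n_iter * k)
    return w

# Features whose weight stays below the chosen threshold are dropped from the state.
```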
A lack of effective reward information will lead to slow learning or even failure to learn the optimal strategy. To mitigate the sparse reward problem, the samples retained in the experience buffer are improved in this paper. The designed experience buffer retains both samples with a large temporal difference error and a small number of samples that violate voltage constraints. The former is used as a positive experience to guide the agent to train in the right direction, while the latter is used as a negative experience to warn the agent to avoid action strategies that violate constraints.
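A minimal sketch of such an experience buffer is shown below: transitions whose temporal difference error exceeds a threshold are kept as positive experiences, while a small quota of constraint-violating transitions is kept as negative experiences. The capacity split, the threshold, and the field names are illustrative assumptions.

```python
import random
from collections import deque

class VVCExperienceBuffer:
    """Retain high-TD-error samples plus a small share of constraint-violating ones."""

    def __init__(self, capacity=10000, td_threshold=0.5, violation_share=0.1):
        self.positive = deque(maxlen=int(capacity * (1 - violation_share)))
        self.negative = deque(maxlen=int(capacity * violation_share))
        self.td_threshold = td_threshold

    def add(self, transition, td_error, violates_constraints):
        if violates_constraints:
            self.negative.append(transition)     # negative experience: a warning case
        elif abs(td_error) >= self.td_threshold:
            self.positive.append(transition)     # positive experience: informative case

    def sample(self, batch_size):
        pool = list(self.positive) + list(self.negative)
        return random.sample(pool, min(batch_size, len(pool)))
```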
This subsection introduces the structure and the training process of the proposed GAT-PPO algorithm.
The structure of the proposed GAT-PPO algorithm is shown in

Fig. 2 Structure of proposed GAT-PPO algorithm.
The value network, which contains a state value layer, maps the system state to the expected future cumulative reward. During training, the observed power system state variables are first input into the GAT to generate the node feature set. Then, the node feature set is input into the state value layer for training. Finally, the state value layer outputs the state value function.
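A sketch of this value network, with a single-head GAT layer in the spirit of (6) followed by a state value layer, is given below. The hidden size of 64 matches the neuron count discussed later in the paper, while the mean pooling, layer counts, and the assumption that the adjacency matrix contains self-loops are illustrative choices.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleGATLayer(nn.Module):
    """Single-head graph attention layer following the form of (6)."""

    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim, bias=False)
        self.a = nn.Linear(2 * out_dim, 1, bias=False)

    def forward(self, h, adj):
        z = self.W(h)                                      # [N, out_dim]
        n = z.size(0)
        pairs = torch.cat([z.repeat_interleave(n, dim=0),  # (Wh_i || Wh_j) for all i, j
                           z.repeat(n, 1)], dim=1)
        e = F.leaky_relu(self.a(pairs)).view(n, n)
        e = e.masked_fill(adj == 0, float("-inf"))         # attend to neighbors only
        return torch.softmax(e, dim=1) @ z

class ValueNetwork(nn.Module):
    """GAT feature extractor followed by a state value layer, as in Fig. 2."""

    def __init__(self, node_dim, hidden=64):
        super().__init__()
        self.gat = SimpleGATLayer(node_dim, hidden)
        self.value_head = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                        nn.Linear(hidden, 1))

    def forward(self, node_features, adj):
        g = self.gat(node_features, adj)        # node feature set from the GAT
        return self.value_head(g.mean(dim=0))   # pooled features -> state value V(s)
```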
The objective function of conventional policy gradient based DRL optimization is given as [
$L^{\mathrm{PG}}(\theta)=\hat{E}_t\big[\log \pi_\theta(a_t\,|\,s_t)\,\hat{A}_t\big]$ (10)
where $\theta$ is the policy parameter; $\hat{E}_t$ represents the empirical average over finite samples; $s_t$ and $a_t$ are the state and action at time $t$, respectively; $\pi_\theta$ is a stochastic policy; and $\hat{A}_t$ is an estimator of the advantage function at time $t$.
The objective function of the value function can be formulated as:
$L^{\mathrm{V}}(\phi)=\hat{E}_t\big[\big(V_\phi(s_t)-V_t^{\mathrm{tar}}\big)^2\big]$ (11)
$V_t^{\mathrm{tar}}=r_t+\gamma V_\phi(s_{t+1})$ (12)
where $\phi$ is the value function parameter; $\gamma$ is the discount factor; $V_\phi(s_t)$ is the state value function at time $t$; $r_t$ is the reward at time $t$; and $V_t^{\mathrm{tar}}$ is the target value of the TD error. The parameter $\phi$ can be updated by the stochastic gradient descent algorithm according to the gradient $\nabla_\phi L^{\mathrm{V}}(\phi)$.
The training process of the proposed GAT-PPO algorithm is shown in

Fig. 3 Training process of proposed GAT-PPO algorithm.
The sequences of power system state variables are separately input into two action networks, resulting in two policy distributions: one for the new policy and one for the old policy. According to the new and old policy distributions, the probabilities of selecting each action under both policies are computed separately. Then, the probability of the new policy is divided by the probability of the old policy to obtain the ratio of policy probabilities. The objective function value of the improved GAT-PPO algorithm is calculated using the advantage function and the ratio of policy probabilities. Subsequently, the negative of the objective function is taken as the loss function of the neural network. The parameters of the policy network are updated through backpropagation using this loss function. Finally, a new policy satisfying the clipping requirements is obtained.
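The policy-side update just described could be sketched as follows: the ratio of new to old action probabilities is clipped, the negative of the clipped surrogate objective is used as the loss, and the policy parameters are updated by backpropagation. The clip range of 0.2 and the assumption that the policy network returns a torch distribution are illustrative.

```python
import torch

def ppo_policy_update(policy, states, actions, old_log_prob, advantages,
                      optimizer, clip_eps=0.2):
    """One clipped-surrogate update of the GAT-PPO policy network (sketch)."""
    dist = policy(states)                                       # new policy distribution
    ratio = torch.exp(dist.log_prob(actions) - old_log_prob)    # pi_new / pi_old
    surr1 = ratio * advantages
    surr2 = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    loss = -torch.min(surr1, surr2).mean()                      # negative objective as loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```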
The observed power system state variables are input into the value network to obtain the corresponding value function. The discounted reward is computed with the discount factor, and then the advantage function is calculated. The value network parameters are updated by computing the gradient of the objective function in (11) and performing backward propagation.
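A matching sketch for the value-network side is given below: discounted rewards are accumulated backwards with the discount factor, advantages are the returns minus the predicted state values, and the value network is fitted to the returns by gradient descent, consistent with (11) and (12). Using full-trajectory returns instead of a bootstrapped TD target is a simplifying assumption.

```python
import torch

def discounted_returns(rewards, gamma=0.99):
    """G_t = r_t + gamma * G_{t+1}, accumulated backwards over one trajectory."""
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    return torch.tensor(list(reversed(returns)), dtype=torch.float32)

def value_update(value_net, states, rewards, optimizer, gamma=0.99):
    returns = discounted_returns(rewards, gamma)
    values = value_net(states).squeeze(-1)
    advantages = returns - values.detach()     # advantage estimates for the policy step
    loss = ((values - returns) ** 2).mean()    # squared error against the target, cf. (11)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return advantages, loss.item()
```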
When the loss function values of both the policy network and the value network are stable and close to a small value, and the moving average reward is positive and tends to stabilize, it indicates that the algorithm has converged.
In this paper, the modified IEEE 39-bus system and an actual power grid are analyzed as cases. The single-line diagram of the modified IEEE 39-bus system is shown in Fig. 4.

Fig. 4 Single-line diagram of modified IEEE 39-bus system.
The simulation environment of the power systems is provided by PSSE software. The annual operation data of an actual power system with integrated wind power are scaled to the modified IEEE 39-bus system, generating a large amount of operation data. Meanwhile, the forecasting data of load and wind power are processed as follows. The sample data are generated by adding forecasting errors to the operation data of the modified IEEE 39-bus system, with a maximum forecasting error of 20% for wind power and 15% for load power. To change the operation conditions during algorithm training, different grid topologies are selected in the two systems. The grid topology changes are achieved by disconnecting the following transmission lines one at a time. In the modified IEEE 39-bus system, four different grid topologies are selected: ① no line disconnected; ② the line between bus 5 and bus 6 disconnected; ③ the line between bus 16 and bus 24 disconnected; and ④ the line between bus 22 and bus 23 disconnected. In the actual power grid, eight different grid topologies are randomly selected in the same way. The DRL algorithm is then trained based on these data.
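The scenario-generation step described above can be sketched as follows: forecast values for wind and load are perturbed within the stated maximum errors (20% and 15%), and topology variants are produced by removing one transmission line at a time. The interface to the PSSE simulator is omitted, and the data structures are assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)

def perturb_forecast(wind_mw, load_mw, wind_err=0.20, load_err=0.15):
    """Apply random forecasting errors within the stated maximum bounds."""
    wind = wind_mw * (1 + rng.uniform(-wind_err, wind_err, size=wind_mw.shape))
    load = load_mw * (1 + rng.uniform(-load_err, load_err, size=load_mw.shape))
    return wind, load

def topology_variants(branches, outages):
    """Yield the base topology plus variants with one branch disconnected."""
    yield list(branches)
    for out in outages:
        yield [b for b in branches if b != out]

# Lines named in the text for the modified IEEE 39-bus case.
outages = [(5, 6), (16, 24), (22, 23)]
all_branches = [(5, 6), (16, 24), (22, 23), (3, 18)]   # placeholder branch list
for topo in topology_variants(all_branches, outages):
    pass  # hand each topology and a perturbed forecast to the power flow simulator
```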
The modified IEEE 39-bus system is used as a case to illustrate the screening of state variables. To obtain the sample data, the active power margin of the system is calculated by using the annual operation data in the modified IEEE 39-bus system. When the active power margin of the sample is larger than 10%, the sample is a voltage-stable sample. Otherwise, the sample is a voltage-unstable sample. In this paper, 16529 voltage-stable samples and 15951 voltage-unstable samples are obtained. Since the branch power is allowed to exceed the limit to a certain extent without affecting the voltage stability, the branch power can be screened. The weights of 34 branches are shown in Table I.
From bus | To bus | Weight | From bus | To bus | Weight |
---|---|---|---|---|---|
4 | 5 | 0.2176 | 1 | 2 | 0.1236 |
10 | 13 | 0.2073 | 2 | 25 | 0.1058 |
13 | 14 | 0.2048 | 4 | 14 | 0.1057 |
5 | 6 | 0.2047 | 23 | 24 | 0.1050 |
6 | 7 | 0.2008 | 8 | 9 | 0.1038 |
6 | 11 | 0.1902 | 9 | 39 | 0.1038 |
25 | 26 | 0.1840 | 26 | 29 | 0.1014 |
5 | 8 | 0.1764 | 26 | 28 | 0.0970 |
10 | 11 | 0.1762 | 2 | 3 | 0.0918 |
3 | 4 | 0.1687 | 26 | 27 | 0.0880 |
14 | 15 | 0.1653 | 3 | 18 | 0.0857 |
1 | 39 | 0.1434 | 17 | 18 | 0.0824 |
16 | 21 | 0.1420 | 17 | 27 | 0.0792 |
15 | 16 | 0.1418 | 28 | 29 | 0.0694 |
21 | 22 | 0.1373 | 16 | 19 | 0.0625 |
16 | 24 | 0.1284 | 22 | 23 | 0.0576 |
7 | 8 | 0.1245 | 16 | 17 | 0.0419 |
It can be observed from Table I that the weight values of the 34 branches vary greatly, with the maximum value being 5.2 times the minimum value. It indicates that different branches contribute differently to voltage stability. The average weight of 34 branches is 0.1299. Meanwhile, it is noticeable that the weight values are mostly concentrated above 0.1. Therefore, 0.1014 is selected as the weight threshold. Finally, the apparent power of 24 branches is retained as the state variables based on the weight threshold. Through the above processing, the state space dimension of DRL can be reduced by ten dimensions.
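The retention rule amounts to keeping only the branches whose ReliefF-S weight reaches the threshold; a short sketch, assuming the weights of Table I are stored in a dict keyed by branch, is:

```python
# Weights of the 34 branches from Table I (only two shown here for brevity).
branch_weights = {(4, 5): 0.2176, (16, 17): 0.0419}
threshold = 0.1014
retained = [branch for branch, w in branch_weights.items() if w >= threshold]
# Only the apparent power of the retained branches stays in the DRL state vector.
```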
To illustrate the impact of grid topology changes on DRL training and, in view of the issues studied in this paper, to identify the topological boundary that the GAT-PPO algorithm can handle, the output features of the GAT in the modified IEEE 39-bus system are extracted for comparison.
In this paper, the grid topology is changed by disconnecting branches, and the GAT outputs under different grid topologies are shown in the matrix scatterplots in Fig. 5.

Fig. 5 Matrix scatterplots under different grid topologies. (a) Original graph. (b) Local enlarged graph.
According to
According to
To verify the effectiveness of the proposed GAT-PPO algorithm, comparative analyses are carried out from the perspectives of different algorithms and different voltage scenarios. In terms of algorithm comparison, the proposed GAT-PPO algorithm is compared with the PPO algorithm, the TD3 algorithm, the particle swarm optimization (PSO)-based SP algorithm, and the genetic algorithm (GA)-based SP algorithm. Regarding voltage scenarios, the branch between bus 3 and bus 18 is disconnected, which means that the grid topology is changed. Then, two scenarios of high voltage and low voltage are selected for comparison. In these two scenarios, the prediction data of wind power output and load demand are randomly generated according to their fluctuation ranges. The VVC based on the prediction data is applied to the corresponding actual scenario without random processing, thereby comparing the performance of the proposed algorithm under the uncertainties of wind power output and load demand. The comparative analyses cover the following four aspects: training speed, importance features of nodes, VVC performance, and control performance under different grid topologies.
Training speed is an important index to measure the superiority of the proposed GAT-PPO algorithm. The faster the training speed, the more conducive it is to the online application of the proposed GAT-PPO algorithm. In this paper, the proposed GAT-PPO algorithm is compared with the PPO and TD3 algorithms. Meanwhile, to verify the effectiveness of the dimension reduction strategy, the proposed GAT-PPO algorithm is also compared with the GAT-PPO algorithm without dimension reduction (referred to as “GAT-PPO-wdr”).
The information entropy is used to measure the training speed of four kinds of DRL algorithms [
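The entropy indicator can be computed directly from the current action distribution; a short sketch for an assumed Gaussian policy is given below. A falling entropy that then stabilizes indicates the policy is becoming decisive, which is how convergence speed is compared here.

```python
import torch

def policy_entropy(dist):
    """Mean information entropy of the action distribution at one training step."""
    return dist.entropy().mean().item()

# Example with an assumed Gaussian policy over two reactive power adjustments.
dist = torch.distributions.Normal(torch.zeros(2), 0.1 * torch.ones(2))
print(policy_entropy(dist))
```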

Fig. 6 Information entropy and reward curves of each algorithm in modified IEEE 39-bus system. (a) Information entropy curves. (b) Reward curves.
As depicted in
According to
Combined with the VVC problem in this paper, the nodes with larger active power among the wind turbine nodes and load nodes are more important. These nodes should be given higher priority during voltage regulation, and they also represent the primary nodes in the graph. To compare the importance features of the nodes in the graph, the low-voltage scenario is taken as a case, and the graph convolutional network based PPO (GCN-PPO) algorithm is added for comparison. The first six load nodes and the first three wind turbine nodes, ranked by active power, are selected for analysis in the adopted case. The comparison of node voltages under different algorithms is shown in Table II, where the load nodes and the wind turbine nodes are arranged in descending order of active power.
Node type | Bus | Active power (MW) | GAT-PPO voltage (p.u.) | PPO voltage (p.u.) | TD3 voltage (p.u.) | GCN-PPO voltage (p.u.)
---|---|---|---|---|---|---
Load node | 7 | 833.8 | 1.0092 | 1.0181 | 1.0336 | 1.0163
Load node | 8 | 822.0 | 1.0038 | 1.0184 | 1.0309 | 1.0132
Load node | 20 | 680.0 | 1.0012 | 0.9982 | 0.9792 | 0.9901
Load node | 4 | 600.0 | 1.0068 | 1.0166 | 1.0103 | 1.0120
Load node | 16 | 329.0 | 0.9965 | 0.9893 | 0.9879 | 0.9897
Load node | 3 | 322.0 | 1.0006 | 1.0319 | 1.0028 | 1.0169
Wind turbine node | 37 | 1395.0 | 1.0208 | 1.0329 | 1.0439 | 1.0392
Wind turbine node | 39 | 1000.0 | 1.0250 | 1.0438 | 1.0326 | 1.0403
Wind turbine node | 38 | 830.0 | 1.0019 | 1.0110 | 1.0452 | 1.0104
According to Table II, compared with other algorithms, the proposed GAT-PPO algorithm can make the voltages of the primary nodes closer to 1 p.u., which indicates better voltage distribution. The GCN-PPO algorithm does not ensure that the voltage of all primary nodes is closer to 1 p.u. than those of the PPO and TD3 algorithms. The reason for this phenomenon is that the proposed GAT-PPO algorithm integrates the physical knowledge of power systems in voltage regulation, which gives higher priority to the primary nodes during voltage regulation. Therefore, the voltage of primary nodes is prioritized to be restored to normal.
The comparison of network loss for each algorithm is shown in Table III. The comparison diagrams of node voltages and voltage violation nodes for each algorithm are shown in Figs. 7 and 8, respectively, and the comparison of continuous 6-hour VVC is shown in Fig. 9.

Fig. 7 Comparison of node voltages for each algorithm. (a) High-voltage case. (b) Low-voltage case.

Fig. 8 Comparison of voltage violation nodes for each algorithm. (a) High-voltage case. (b) Low-voltage case.

Fig. 9 Comparison of continuous 6-hour VVC.
Algorithm | Network loss in high-voltage case (MW) | Network loss in low-voltage case (MW)
---|---|---
Original system | 78.127 | 170.689 |
GAT-PPO | 72.702 | 152.264 |
PPO | 75.539 | 157.883 |
TD3 | 76.650 | 156.288 |
PSO | 75.155 | 157.327 |
GA | 76.358 | 157.496 |
It can be observed from
It can be observed from
It can be observed from
Moreover, it can be observed that the proposed GAT-PPO algorithm can effectively cope with the uncertainty of the power system. The primary reason is that the agent has learned the patterns of changes in load demand and wind power output during the training process, and has mastered their probability distribution. Thus, the agent gives optimal control from the perspective of expectation.
To further verify the transfer learning capability of the proposed GAT-PPO algorithm under different grid topologies, the remaining alternating current (AC) branches are disconnected in turn, resulting in a total of 30 new grid topologies. In each grid topology, two cases involving high voltage and low voltage are selected first, and then the VVC performance of the proposed GAT-PPO algorithm and the PPO algorithm is compared. Since the power flow does not converge when the branches of bus 1-bus 39, bus 2-bus 3, bus 3-bus 4, bus 2-bus 25, bus 8-bus 9, bus 9-bus 39, bus 15-bus 16, bus 16-bus 19, and bus 28-bus 29 are disconnected, 21×2 test scenarios are finally generated. The comparison of the network loss difference under different grid topologies is shown in Fig. 10.

Fig. 10 Comparison of network loss difference under different grid topologies.
According to
An actual power grid is adopted to verify the effectiveness of the proposed GAT-PPO algorithm in a large-scale system. The actual power grid contains 222 bus nodes and 285 AC branches. The analyses are conducted from three aspects: training speed, VVC performance, and control performance under different grid topologies.
For the actual power grid, the ReliefF-S algorithm is adopted to remove a total of 61 branches. As a result, the dimension of the state space is reduced by 61 dimensions. The information entropy and reward curves of each algorithm in the actual power grid are shown in Fig. 11.

Fig. 11 Information entropy and reward curves of each algorithm in actual power grid. (a) Information entropy. (b) Reward.
It can be observed from
According to
A high-voltage case is used for comparison and analysis. In this case, one branch is disconnected, indicating that the grid topology is changed. The comparison of node voltages obtained by each algorithm is shown in Fig. 12, and the comparison of network loss is given in Table IV.

Fig. 12 Comparison of node voltage obtained by each algorithm.
Algorithm | Network loss (MW) |
---|---|
Original system | 56.199 |
GAT-PPO | 51.847 |
PPO | 53.596 |
TD3 | 54.055 |
PSO | 55.388 |
GA | 54.163 |
According to
According to Table IV, it can be observed that the proposed GAT-PPO algorithm achieves the lowest network loss, which is 7.7% lower than that of the original system. This indicates that the proposed GAT-PPO algorithm has better economic efficiency. Therefore, the comprehensive performance of the VVC of the proposed GAT-PPO algorithm is superior.
To further validate the adaptability of the proposed GAT-PPO algorithm, the remaining AC branches are sequentially disconnected, and a total of 173 new grid topologies are eventually formed. Under each grid topology, high-voltage cases are selected, and then the VVC performance of the proposed GAT-PPO algorithm and the PPO algorithm is compared. The comparison of the network loss difference under different grid topologies is shown in Fig. 13.

Fig. 13 Comparison of network loss difference under different grid topologies in actual power grid.
According to
The performance of the DRL algorithm in this paper is affected by hyperparameters such as the discount factor $\gamma$ and the number of neurons in the neural network. The value of $\gamma$ usually ranges from 0.9 to 1. The larger the value of $\gamma$, the more the agent weighs long-term returns, and the more difficult the training of the proposed GAT-PPO algorithm becomes. The smaller the value of $\gamma$, the more the agent focuses on immediate gains, and the easier the training becomes. Therefore, it is important to choose an appropriate $\gamma$ when training the agent. The reward curves of different discount factors are shown in Fig. 14.

Fig. 14 Reward curves of different discount factors.
It can be observed from
In addition, the paper sets the number of neurons in the neural network to be 64 based on extensive testing. The number of neurons is also crucial for algorithm training. If the number of neurons is too small, such as 16, it may prevent the neural network from learning correctly. Conversely, if the number of neurons is too large, such as 256, it may lead to an excessive number of parameters that the neural network needs to train, thus increasing the learning difficulty and affecting the network generalization ability.
This paper proposes GAT-based deep reinforcement learning for VVC of topologically variable power systems, which incorporates the voltage stability characteristics of system nodes into the attention mechanism, prioritizing essential nodes during voltage regulation. Furthermore, the challenges of slow training and sparse reward in DRL are effectively mitigated through the ReliefF-S algorithm and the optimization of the experience buffer, respectively.
According to the results of the case studies, the proposed GAT-PPO algorithm not only has a rapid convergence speed and good adaptability to different grid topologies but also possesses a strong ability to cope with uncertainties. The proposed GAT-PPO algorithm reduces the network loss by 6.9%, 10.8%, and 7.7% in the adopted cases, respectively, demonstrating favorable economic outcomes. The proposed GAT-PPO algorithm can obtain an agent with strong transfer learning capability using only a small amount of grid topology data, without requiring data from all different grid topologies. Meanwhile, the design of the reward function, prioritizing constraints first and objective later, closely aligns with practical VVC challenges. Additionally, the proposed GAT-PPO algorithm showcases an expanded boundary in terms of manageable grid topologies. In summary, the proposed GAT-PPO algorithm has better VVC performance, which strongly supports engineering applications.
The proposed GAT-PPO algorithm has two limitations: the curse of dimensionality when facing large-scale power grids, and performance degradation due to the data quality of sensors. To overcome these flaws, future research directions include: ① multi-agent DRL algorithms will be explored to tackle extensive and complex power grids; and ② situations with missing data will be considered during algorithm training, and methods to mitigate the impact of missing data will be adopted to improve the proposed GAT-PPO algorithm, thereby continuously enhancing its robustness. Additionally, measures such as enhancing signal reception strength and improving transmission methods can be adopted to alleviate sensor transmission issues.
References
B. She, F. Li, H. Cui et al., “Fusion of microgrid control with model-free reinforcement learning: review and vision,” IEEE Transactions on Smart Grid, vol. 14, no. 4, pp. 3232-3245, Jul. 2023.
M. Abdelghany, V. Mariani, D. Liuzza et al., “A unified control platform and architecture for the integration of wind-hydrogen systems into the grid,” IEEE Transactions on Automation Science and Engineering, vol. 21, no. 3, pp. 4042-4057, Jul. 2023.
Y. Chi, A. Tao, X. Xu et al., “An adaptive many-objective robust optimization model of dynamic reactive power sources for voltage stability enhancement,” Journal of Modern Power Systems and Clean Energy, vol. 11, no. 6, pp. 1756-1769, Nov. 2023.
M. Savargaonkar, I. Oyewole, A. Chehade et al., “Uncorrelated sparse autoencoder with long short-term memory for state-of-charge estimations in lithium-ion battery cells,” IEEE Transactions on Automation Science and Engineering, vol. 21, no. 1, pp. 15-26, Jan. 2024.
C. Lei, S. Bu, Q. Wang et al., “Look-ahead rolling economic dispatch approach for wind-thermal-bundled power system considering dynamic ramping and flexible load transfer strategy,” IEEE Transactions on Power Systems, vol. 39, no. 1, pp. 186-202, Jan. 2024.
K. Xie, J. Dong, C. Singh et al., “Optimal capacity and type planning of generating units in a bundled wind-thermal generation system,” Applied Energy, vol. 164, pp. 200-210, Feb. 2016.
M. Abdelghany, A. Al-Durra, D. Zhou et al., “Optimal multi-layer economical schedule for coordinated multiple mode operation of wind-solar microgrids with hybrid energy storage systems,” Journal of Power Sources, vol. 591, pp. 1-16, Jan. 2024.
Y. Li, W. Li, W. Yan et al., “Probabilistic optimal power flow considering correlations of wind speeds following different distributions,” IEEE Transactions on Power Systems, vol. 29, no. 4, pp. 1847-1854, Jul. 2014.
Y. Xu, Z. Dong, and R. Zhang, “Multi-timescale coordinated voltage/var control of high renewable-penetrated distribution systems,” IEEE Transactions on Power Systems, vol. 32, no. 6, pp. 4398-4408, Nov. 2017.
M. Lubin, Y. Dvorkin, and S. Backhaus, “A robust approach to chance constrained optimal power flow with renewable generation,” IEEE Transactions on Power Systems, vol. 31, no. 5, pp. 3840-3849, Sept. 2016.
P. Li, C. Zhang, Z. Wu et al., “Distributed adaptive robust voltage-var control with network partition in active distribution networks,” IEEE Transactions on Smart Grid, vol. 11, no. 3, pp. 2245-2256, May 2020.
F. Mráz, “Calculating the exact bounds of optimal values in LP with interval coefficients,” Annals of Operations Research, vol. 81, pp. 51-62, Jun. 1998.
C. Zhang, H. Chen, Z. Liang et al., “Reactive power optimization under interval uncertainty by the linear approximation method and its modified method,” IEEE Transactions on Smart Grid, vol. 9, no. 5, pp. 4587-4600, Sept. 2018.
B. Zhang and Y. Gao, “Data-driven voltage/var optimization control for active distribution network considering PV inverter reliability,” Electric Power Systems Research, vol. 224, pp. 1-14, Nov. 2023.
K. Xiong, D. Cao, G. Zhang et al., “Coordinated volt/var control for photovoltaic inverters: a soft actor-critic enhanced droop control approach,” International Journal of Electrical Power & Energy Systems, vol. 149, pp. 1-13, Jul. 2023.
R. Huang, Y. Chen, T. Yin et al., “Learning and fast adaptation for grid emergency control via deep meta reinforcement learning,” IEEE Transactions on Power Systems, vol. 37, no. 6, pp. 4168-4178, Nov. 2022.
Q. Ma and C. Deng, “Simplified deep reinforcement learning based volt-var control of topologically variable power system,” Journal of Modern Power Systems and Clean Energy, vol. 11, no. 5, pp. 1396-1404, Sept. 2023.
R. Hossain, Q. Huang, and R. Huang, “Graph convolutional network-based topology embedded deep reinforcement learning for voltage stability control,” IEEE Transactions on Power Systems, vol. 36, no. 5, pp. 4848-4851, Sept. 2021.
S. Song, Y. Jung, G. Jang et al., “Proximal policy optimization through a deep reinforcement learning framework for remedial action schemes of VSC-HVDC,” International Journal of Electrical Power & Energy Systems, vol. 150, pp. 1-10, Aug. 2023.
L. Yin, S. Luo, Y. Wang et al., “Coordinated complex-valued encoding dragonfly algorithm and artificial emotional reinforcement learning for coordinated secondary voltage control and automatic voltage regulation in multi-generator power systems,” IEEE Access, vol. 8, pp. 180520-180533, Oct. 2020.
P. Veličković, G. Cucurull, A. Casanova et al., “Graph attention networks,” in Proceedings of 6th International Conference on Learning Representations (ICLR), Vancouver, Canada, May 2018, pp. 1-12.
E. Vittal, M. O’Malley, and A. Keane, “A steady-state voltage stability analysis of power systems with high penetrations of wind,” IEEE Transactions on Power Systems, vol. 25, no. 1, pp. 433-442, Feb. 2010.
O. Reyes, C. Morell, and S. Ventura, “Scalable extensions of the ReliefF algorithm for weighting and selecting features on the multi-label learning context,” Neurocomputing, vol. 161, pp. 168-182, Aug. 2015.
W. Zhang, Z. Wei, B. Wang et al., “Measuring mixing patterns in complex networks by Spearman rank correlation coefficient,” Physica A: Statistical Mechanics and Its Applications, vol. 451, pp. 440-450, Jun. 2016.
G. Calviño, J. Olivares, and F. Estrada, “Information entropy and fragmentation functions,” Nuclear Physics A, vol. 1036, pp. 1-17, Aug. 2023.