Optimal Power Dispatch of Active Distribution Network and P2P Energy Trading Based on Soft Actor-critic Algorithm Incorporating Distributed Trading Control

Yongjun Zhang ¹; Jun Zhang ¹; Guangbin Wu ²; Jiehui Zheng ¹; Dongming Liu ¹; Yuzheng An ¹

1. School of Electric Power, South China University of Technology, Guangdong Key Laboratory of Clean Energy Technology, Guangzhou 510641, China; 2. Customer Service Center of Guangdong Power Grid Corporation, Foshan, China

Updated: 2025-03-26

DOI: 10.35833/MPCE.2024.000471

Abstract

Peer-to-peer (P2P) energy trading in active distribution networks (ADNs) plays a pivotal role in promoting the efficient consumption of renewable energy sources. However, it is challenging to effectively coordinate the power dispatch of ADNs and P2P energy trading while preserving the privacy of the different stakeholders. Hence, this paper proposes a soft actor-critic algorithm incorporating distributed trading control (SAC-DTC) to tackle the optimal power dispatch of ADNs and the P2P energy trading considering privacy preservation among prosumers. First, the soft actor-critic (SAC) algorithm is used to optimize the control strategy of the devices in ADNs to minimize the operation cost, and the primary environmental information of the ADN at this point is published to prosumers. Then, a distributed generalized fast dual ascent method is used to iterate the trading process of prosumers and maximize their revenues. Subsequently, the trading results are encrypted based on the differential privacy technique and returned to the ADN. Finally, the social welfare value, consisting of the ADN operation cost and the P2P market revenue, is used as the reward value to update the network parameters and control strategies of the deep reinforcement learning. Simulation results show that the proposed SAC-DTC algorithm reduces the ADN operation cost, boosts the P2P market revenue, maximizes the social welfare, and exhibits high computational accuracy, demonstrating its practical applicability to the operation of power systems and power markets.

Keywords

Optimal power dispatch; peer-to-peer (P2P) energy trading; active distribution network (ADN); distributed trading; soft actor-critic algorithm; privacy preservation

I. Introduction

With the increasing penetration of distributed energy resources (DERs), battery energy storage (BES), and adjustable loads, distribution networks face operational problems such as overloading, voltage overruns, and network losses. Under the unified management of the distribution system operator (DSO), the active distribution network (ADN) [1], [2] regulates the active and reactive power outputs of various types of discrete devices (such as on-load tap changers (OLTCs) and capacitor banks (CBs)) and continuous devices (such as DERs and static var generators (SVGs)). This is achieved through the implementation of reasonable energy management strategies, in order to ensure the safe and efficient operation of the distribution network [3], [4].

Consequently, the optimal power dispatch problems for ADNs are usually formulated as mixed-integer nonlinear models [4], [5]. However, the solution depends on the accuracy of the network topology model and is not applicable to ADNs with rapidly changing structures [6]. Therefore, some scholars have adopted reinforcement learning methods to solve the ADN optimization problems in a time-efficient and model-free manner. References [7] and [8] employ the deep Q-network (DQN) and proximal policy optimization (PPO), respectively, to explore the optimal control strategies for discrete and continuous devices in ADNs. PPO is able to reduce the variance during the training process more effectively but requires multiple collections of the same data sample. The deep deterministic policy gradient (DDPG) improves the sample utilization efficiency and exploration effects by adopting an actor-critic framework and adding random noise [9], yet it is quite sensitive to hyperparameters [10], [11]. The soft actor-critic (SAC) provides a smoother training process and lower variance through its entropy regularization and dual Q-network structure, facilitating more stable and effective learning in the complex and dynamic environments of ADNs [12].

Some devices such as DERs and BESs in ADNs may belong to independent individuals with different interests [13]. In the case where the DSO centralizes the dispatch of all the energy resources, the optimal outcome from a benefit-optimization perspective for some individuals may not be achieved [14], and it will not be feasible to leverage the individual motivation to participate in the operation regulation. Fortunately, the emerging peer-to-peer (P2P) energy trading provides a solution to this challenge. In a fully incentivized P2P market, the owners of the assets (called prosumers) can actively participate in the energy regulation of the local distribution network by selling electricity or reducing demand, thereby maximizing their revenue and mitigating the peak demand and operating costs of the distribution network [15].

The effectiveness of P2P markets has been extensively studied and validated [16]. Depending on the manner of coordination among participants, the P2P market mechanism can be classified into centralized and decentralized schemes.

In the centralized scheme, a central entity (such as a P2P operator or the DSO) is responsible for coordinating energy trading and benefit distribution, with the advantage of maximizing the social welfare [17]. However, as the number of DERs and prosumers increases, the operator may face problems such as data pressure, the computational curse of dimensionality, and user information leakage [15]. In the decentralized scheme, the prosumers are able to decide the transaction parameters by themselves and complete the information interaction and energy trading process, which has the advantage of decision independence and strong privacy protection [18] but may lead to non-optimal social welfare.

In recent years, there has been an increase in research on P2P markets. At the level of information interaction and market operation, most studies have primarily employed blockchain [19], [20], auctions [21], and game theoretic approaches for pricing and trading energy. Specifically, to explore the competitive relationship between DSOs and prosumers, non-cooperative game models and auction strategies are employed to evaluate the profits of P2P energy trading [22]-[24]. Meanwhile, the research efforts [25]-[27] focus on developing methods to fairly distribute benefits within communities, utilizing cooperative game concepts and predefined rules.

Furthermore, the P2P markets encompass energy trading at the information layer, which requires secure transmission at the physical layer of the distribution network. A fully decentralized two-loop algorithm is proposed in [28] to coordinate P2P energy trading with voltage regulation capability. Similarly, considering the distribution network constraints, [29] proposes a trading strategy based on the alternating direction method of multipliers and a bidding auction.

However, the existing studies generally do not consider the control of the devices governed by DSOs. The lack of transparency regarding the respective behaviors of the DSO and prosumers may result in problems such as voltage overruns and increased network losses in the distribution network [14]. As a result, DSOs may need to take more conservative and stringent measures to maintain grid security, leading to a further reduction in social welfare.

Although these studies provide valuable insights, they are constrained by several limitations, such as difficulties in privacy protection, ignoring distribution network constraints, and an insufficient consideration of the control of devices in ADN, as shown in Table I.

TABLE I Comparison of Factors Considered in Different References

Reference	Privacy protection	Distribution network constraints	Control of devices
[19], [20]	√	-	-
[22], [24]	-	√	-
[23], [25]-[27]	-	-	-
[24], [28], [29]	√	√	-
[4]-[7], [12], [30], [31]	-	√	√
This paper	√	√	√

Note: the symbol √ indicates that the corresponding factor is considered, and the symbol - indicates that it is not considered.

In view of the above, this paper establishes a soft actor-critic algorithm incorporating distributed trading control (SAC-DTC) that combines a data-driven component (a deep reinforcement learning (DRL) algorithm) with physical modeling (an information-driven distributed algorithm) [32]-[34], and that can be applied to coordinate the ADN and P2P markets. The main contributions of this paper are as follows.

1) The coordinated optimization for the power dispatch of ADN and P2P energy trading is constructed as a Markov decision process (MDP) and formulated as a social welfare maximization problem. The agent can explore the dispatch strategy that minimizes the ADN operation cost and creates an environment conducive to conducting P2P energy trading under the stochastic and uncertain conditions.

2) This paper proposes an SAC-DTC algorithm based on data-driven and physical modeling to solve the above problems. This proposed SAC-DTC algorithm utilizes differential privacy noise to protect users’ information and price signals to effectively guide users’ behavior, thus coupling the coordinated optimization process of ADN and P2P markets, and ultimately reducing the ADN operation cost and increasing the P2P market revenue.

3) The proposed SAC-DTC algorithm is superior in real-time optimization and operation processes of power systems because of its fast computation speed and small node voltage error of the obtained results.

The remainder of this paper is organized as follows. Section II introduces the framework of distribution network that contains both ADN and P2P markets. Section III formulates the optimal power dispatch model of ADN and P2P energy trading model. The proposed SAC-DTC algorithm based on data-driven and physical modeling is presented in Section IV to coordinate the ADN and local P2P market. Section V conducts empirical case studies to evaluate the effectiveness of the proposed SAC-DTC algorithm. Finally, Section VI concludes this paper.

II. Framework of Distribution Network Containing Both ADN and P2P Markets

As shown in Fig. 1, the proposed framework is applied to distribution networks in which both DSO management areas and autonomous operation areas of prosumers exist. The distribution network has a node set $N_{Bus}$ and a branch set $N_{Branch}$. At each node i, there are two principal elements: local devices managed by the DSO for network loss reduction and voltage control, and agents of prosumers who have been accredited by the DSO.

Fig.1 Proposed framework applied to distribution network. (a) Overall framework. (b) Control areas of DSO and prosumers.

1) DSO: as shown in the red part of Fig. 1(b), node i in this radial ADN contains various DERs such as wind turbines (WTs), photovoltaics (PVs), BESs, SVGs, CBs, and conventional loads. The DSO is tasked with managing the power equipment of ADN. It regulates the active and reactive power outputs to meet electricity demands at the minimum operation costs while ensuring system security.

2) Prosumers: as shown in the blue part of Fig. 1(b), node $i + 1$ contains distributed prosumer agents. Prosumers are categorized into two distinct non-empty subsets: producers (with a total number of $N_{S}$ ) selling power and consumers (with a total number of $N_{B}$ ) purchasing power. Each prosumer coordinates its energy trading with other market participants in the P2P market of ADN, aiming to fulfill individual objectives, adhere to system security constraints, and maximize profits.

The behavior of both the DSO and prosumers changes the network losses and node voltages of the ADN. Therefore, an efficient coordination and control process between the DSO and prosumers is required to avoid problems such as over-regulation. The optimization process of the whole system seeks to minimize the ADN operation cost and maximize the profits of all individuals in the P2P market, which is ultimately regarded as a social welfare maximization problem.

III. Problem Formulation

In a radial ADN connected to the external grid, the DSO is responsible for regulating the devices in the ADN to ensure that the ADN meets the needs of all users while maintaining a safe and stable operating condition. All two-way users (prosumers) constitute a local P2P energy trading market, where each user can trade electricity and transmit it through the distribution network subject to safety constraints.

A. Optimal Power Dispatch Model of ADN

1) The objective function for the optimal power dispatch of ADN is to minimize the regulation costs of OLTC, CB, and BES, costs of network losses, and cost of wind power and PV power curtailment, which is formulated as:

\begin{array}{l} \min C_{t}^{ADN} = \Big[ C_{DER} \sum_{i = 1}^{N_{DER}} (P_{i, t}^{DER} - P_{i, t}^{DER, pre})^{2} + C_{CB} \sum_{i = 1}^{N_{CB}} T_{i, t}^{CB, loss} + \\ \quad C_{OLTC} \sum_{i = 1}^{N_{OLTC}} T_{i, t}^{OLTC, loss} + C_{NET} \sum_{i = 1}^{N_{line}} P_{i, t}^{NET, loss} + C_{BES} \sum_{i = 1}^{N_{BES}} (P_{i, t}^{BES})^{2} \Big] Δ t \end{array}

(1)

\{\begin{array}{l} T_{i, t}^{C B, l o s s} = | T_{i, t}^{C B} - T_{i, t - 1}^{C B} | \\ T_{i, t}^{O L T C, l o s s} = | T_{i, t}^{O L T C} - T_{i, t - 1}^{O L T C} | \end{array}

(2)

where $C_{t}^{ADN}$ is the total ADN operation cost at time t; $C_{DER}$ is the unit cost of wind power and PV power curtailment; $C_{CB}$ and $C_{OLTC}$ are the unit regulation costs of CB and OLTC, respectively; $C_{NET}$ is the grid electricity price; $C_{BES}$ is the unit loss cost of BES; $T_{i, t}^{CB}$ and $T_{i, t}^{OLTC}$ are the tap positions of CB and OLTC at node i at time t, respectively; $T_{i, t}^{CB, loss}$ and $T_{i, t}^{OLTC, loss}$ are the switching losses of CB and OLTC at node i at time t, respectively [35], [36]; $P_{i, t}^{NET, loss}$ is the loss of ADN at node i at time t; $P_{i, t}^{BES}$ is the active power of BES at node i at time t; $P_{i, t}^{DER}$ and $P_{i, t}^{DER, pre}$ are the active power output of DERs and its predicted value at node i at time t, respectively; and $N_{CB}$, $N_{OLTC}$, $N_{BES}$, $N_{DER}$, and $N_{line}$ are the numbers of CBs, OLTCs, BESs, DERs, and lines, respectively.
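As an illustration of how (1) and (2) combine, the following minimal Python sketch evaluates the operation cost for a single time step. The function name, argument names, and numerical values are hypothetical and chosen only for this example; they are not taken from the case study of this paper.

```python
def adn_operation_cost(p_der, p_der_pre, tap_cb, tap_cb_prev,
                       tap_oltc, tap_oltc_prev, p_net_loss, p_bes,
                       c_der, c_cb, c_oltc, c_net, c_bes, dt=1.0):
    """Evaluate the ADN operation cost of (1) for one time step.

    All list arguments hold one entry per device or line; the c_* values
    are the unit costs. Every name here is illustrative.
    """
    # Squared deviation of DER output from its predicted value (curtailment).
    curtailment = sum((p - pre) ** 2 for p, pre in zip(p_der, p_der_pre))
    # Switching losses of CBs and OLTCs as absolute tap changes, see (2).
    cb_switching = sum(abs(t - tp) for t, tp in zip(tap_cb, tap_cb_prev))
    oltc_switching = sum(abs(t - tp) for t, tp in zip(tap_oltc, tap_oltc_prev))
    network_loss = sum(p_net_loss)
    bes_loss = sum(p ** 2 for p in p_bes)
    return (c_der * curtailment + c_cb * cb_switching
            + c_oltc * oltc_switching + c_net * network_loss
            + c_bes * bes_loss) * dt


# Illustrative values only: one DER, one CB, one OLTC, one line, one BES.
cost = adn_operation_cost(
    p_der=[0.8], p_der_pre=[1.0], tap_cb=[2], tap_cb_prev=[1],
    tap_oltc=[0], tap_oltc_prev=[0], p_net_loss=[0.1], p_bes=[0.5],
    c_der=10.0, c_cb=0.5, c_oltc=1.0, c_net=2.0, c_bes=0.4)
```

Because the tap terms enter through absolute differences, a device that keeps its previous position contributes nothing to the cost, which is what discourages frequent switching.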

2) The following constraints must be included in the optimization model to ensure the safe operation of the ADN with the P2P energy trading process.

P_{b + 1, t}^{N E T} = P_{b, t}^{N E T} + P_{i, t}^{B E S} + P_{i, t}^{D E R} + P_{i, t}^{B S} - P_{i, t}^{l o a d} - P_{b, b + 1, t}^{N E T, l o s s}

(3)

Q_{b + 1, t}^{N E T} = Q_{b, t}^{N E T} + Q_{i, t}^{C B} + Q_{i, t}^{S V G} + Q_{i, t}^{D E R} + Q_{i, t}^{B S} - Q_{i, t}^{l o a d} - Q_{b, b + 1, t}^{N E T, l o s s}

(4)

V_{m i n} \leq V_{i, t}^{N E T} \leq V_{m a x}

(5)

V_{i, t}^{N E T} - V_{i - 1, t}^{N E T} = (r_{b} P_{b, t}^{N E T} + x_{b} Q_{b, t}^{N E T}) / V_{b a s e}

(6)

where $P_{b, t}^{N E T}$ and $Q_{b, t}^{N E T}$ are the inflow active and reactive power of branch b at time t, respectively; $P_{b, b + 1, t}^{N E T, l o s s}$ and $Q_{b, b + 1, t}^{N E T, l o s s}$ are the active and reactive power losses from branches b to $b + 1$ at time t, respectively; $P_{i, t}^{B S}$ and $Q_{i, t}^{B S}$ are the active and reactive power of prosumers at node i at time t, respectively; $Q_{i, t}^{C B}$ , $Q_{i, t}^{S V G}$ , and $Q_{i, t}^{D E R}$ are the reactive power of CB, SVG, and DER at node i at time t, respectively; $P_{i, t}^{l o a d}$ and $Q_{i, t}^{l o a d}$ are the active and reactive power of conventional loads at node i at time t, respectively; $V_{i, t}^{N E T}$ is the voltage amplitude at node i at time t; $V_{m i n}$ and $V_{m a x}$ are the minimum and maximum voltage levels of ADN, respectively; $r_{b}$ and $x_{b}$ are the resistance and reactance of branch b, respectively; and $V_{b a s e}$ is the voltage reference value.

The OLTC, CB, SVG, DER, and BES have their own constraints, which are given in (7)-(15), among which (7)-(9) are the operational constraints for the OLTC and CB; (10) and (11) are the operational constraints for the SVG and DER, respectively; and (12)-(15) are the constraints for the BES.

\{\begin{array}{l} V_{1, t}^{N E T} = V_{b a s e}^{O L T C} + T_{t}^{O L T C} Δ V^{O L T C} \\ T_{m i n}^{O L T C} \leq T_{t}^{O L T C} \leq T_{m a x}^{O L T C} \end{array}

(7)

\{\begin{array}{l} Q_{i, t}^{C B} = T_{i, t}^{C B} Δ Q^{C B} \\ T_{m i n}^{C B} \leq T_{i, t}^{C B} \leq T_{m a x}^{C B} \end{array}

(8)

\{\begin{array}{l} \sum_{t = 1}^{24} | T_{t}^{O L T C} - T_{t - 1}^{O L T C} | \leq N_{m a x}^{O L T C} \\ \sum_{t = 1}^{24} | T_{i, t}^{C B} - T_{i, t - 1}^{C B} | \leq N_{m a x}^{C B} \end{array}

(9)

Q_{m i n}^{S V G} \leq Q_{i, t}^{S V G} \leq Q_{m a x}^{S V G}

(10)

\{\begin{array}{l} 0 \leq P_{i, t}^{D E R} \leq P_{i, t, m a x}^{D E R} \\ 0 \leq Q_{i, t}^{D E R} \leq Q_{i, t, m a x}^{D E R} \end{array}

(11)

P_{i, t}^{B E S} = ω_{i, t}^{B C} P_{i, t}^{B C} + ω_{i, t}^{B D} P_{i, t}^{B D}

(12)

ω_{i, t}^{BC} + ω_{i, t}^{BD} \leq 1, \quad ω_{i, t}^{BC}, ω_{i, t}^{BD} \in \{0, 1\}

(13)

E_{i, t}^{B E S} = E_{i, t - 1}^{B E S} + P_{i, t}^{B C} η - P_{i, t}^{B D} / η

(14)

\{\begin{array}{l} 0 \leq P_{i, t}^{B C} \leq P_{i, m a x}^{B C} \\ 0 \leq P_{i, t}^{B D} \leq P_{i, m a x}^{B D} \\ E_{i, t, m i n}^{B E S} \leq E_{i, t}^{B E S} \leq E_{i, t, m a x}^{B E S} \end{array}

(15)

where $V_{b a s e}^{O L T C}$ is the base voltage of OLTC; $Δ V^{O L T C}$ is the voltage change per tap of OLTC; $N_{m a x}^{O L T C}$ is the maximum number of OLTC operations; $T_{t}^{O L T C}$ is the tap position of OLTC at time t, and $T_{m i n}^{O L T C}$ and $T_{m a x}^{O L T C}$ are its lower and upper bounds, respectively; $Δ Q^{C B}$ is the reactive power change per tap of CB; $N_{m a x}^{C B}$ is the maximum number of CB operations; $T_{i, t}^{C B}$ is the tap position of CB at node i at time t, and $T_{m i n}^{C B}$ and $T_{m a x}^{C B}$ are its lower and upper bounds, respectively; $Q_{m i n}^{S V G}$ and $Q_{m a x}^{S V G}$ are the minimum and maximum reactive power of SVG, respectively; $P_{i, t, m a x}^{D E R}$ and $Q_{i, t, m a x}^{D E R}$ are the maximum active and reactive power of DER at node i at time t, respectively; $E_{i, t}^{B E S}$ is the capacity of BES at node i at time t; $η$ is the charging/discharging efficiency; $P_{i, t}^{B C}$ and $P_{i, t}^{B D}$ are the charging and discharging power of BES at node i at time t, respectively, and $ω_{i, t}^{B C}$ and $ω_{i, t}^{B D}$ are their Boolean variables; $E_{i, t, m a x}^{B E S}$ and $E_{i, t, m i n}^{B E S}$ are the upper and lower bounds of the capacity of BES at node i at time t, respectively; and $P_{i, m a x}^{B C}$ and $P_{i, m a x}^{B D}$ are the maximum charging and discharging power of BES at node i, respectively.
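The BES constraints (13)-(15) can be checked with a few lines of code. The sketch below is a hypothetical helper, not part of the proposed algorithm: it advances the stored energy by one time step according to (14) and rejects operating points that violate (13) or (15). The function name, argument names, and the choice of raising an exception are assumptions made for illustration.

```python
def bes_energy_update(e_prev, p_charge, p_discharge, eta,
                      p_charge_max, p_discharge_max, e_min, e_max):
    """Advance the BES stored energy by one step following (14).

    Raises ValueError if the operating point violates (13) or (15).
    """
    # (13): charging and discharging are mutually exclusive.
    if p_charge > 0 and p_discharge > 0:
        raise ValueError("simultaneous charging and discharging")
    # (15): power limits.
    if not 0 <= p_charge <= p_charge_max:
        raise ValueError("charging power out of range")
    if not 0 <= p_discharge <= p_discharge_max:
        raise ValueError("discharging power out of range")
    # (14): charging is scaled by eta, discharging is divided by eta.
    e_new = e_prev + p_charge * eta - p_discharge / eta
    # (15): energy limits.
    if not e_min <= e_new <= e_max:
        raise ValueError("stored energy out of range")
    return e_new


# Illustrative values: charge 2 units, then discharge 1.8 units, eta = 0.9.
e1 = bes_energy_update(5.0, 2.0, 0.0, 0.9, 3.0, 3.0, 1.0, 10.0)
e2 = bes_energy_update(e1, 0.0, 1.8, 0.9, 3.0, 3.0, 1.0, 10.0)
```

The asymmetric use of the efficiency means that a charge followed by an equal discharge leaves less energy than at the start, which is the round-trip loss that (14) encodes.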

B. P2P Energy Trading Model

P2P energy trading entities need a model for maximizing revenue internally. Prosumers have increasing marginal costs of electricity generation when they act as producers and decreasing marginal benefits of electricity use when they act as consumers. Therefore, the electricity production and consumption behaviors of producers and consumers can be characterized by a quadratic function [37]. The total revenue of prosumers is composed of three terms: the power utility benefit of prosumers, the active electricity cost, and the reactive electricity cost.

\max \sum_{i = 1}^{N_{B} + N_{S}} U_{i, t}^{P2P} = \sum_{i = 1}^{N_{B} + N_{S}} (u_{i, t} - δ_{i, t}^{PLMP} P_{i, t}^{BS} - δ_{i, t}^{QLMP} Q_{i, t}^{BS})

(16)

u_{i, t} = ε_{i, t} (P_{i, t}^{B S})^{2} + β_{i, t} P_{i, t}^{B S} + τ_{i, t} (Q_{i, t}^{B S} - Q_{i, t - 1}^{B S})^{2}

(17)

where $U_{i, t}^{P 2 P}$ is the total revenue of prosumer at node i at time t; $u_{i, t}$ is the function of power utility benefits of prosumer at node i at time t; $ε_{i, t}$ , $β_{i, t}$ , and $τ_{i, t}$ are the power utility parameters of prosumers, which are private information; and $δ_{i, t}^{P L M P}$ and $δ_{i, t}^{Q L M P}$ are the marginal tariffs for active and reactive power at node i at time t, respectively.
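To make the structure of (16) and (17) concrete, the following Python sketch evaluates the revenue of a single prosumer and sums it over the market. All names and numbers are hypothetical; in particular, the negative quadratic coefficients are an assumption that gives a concave utility and are not values from this paper.

```python
def prosumer_revenue(p, q, q_prev, eps, beta, tau, lmp_p, lmp_q):
    """Revenue of one prosumer: utility (17) minus power costs, as in (16)."""
    # (17): quadratic utility in active power plus a reactive adjustment term.
    utility = eps * p ** 2 + beta * p + tau * (q - q_prev) ** 2
    # Summand of (16): utility minus active and reactive electricity costs.
    return utility - lmp_p * p - lmp_q * q


def market_revenue(prosumers):
    """Total P2P market revenue, i.e. the sum in (16)."""
    return sum(prosumer_revenue(**x) for x in prosumers)


# Illustrative prosumer data (assumed values).
example = dict(p=2.0, q=0.5, q_prev=0.5, eps=-0.1, beta=1.0,
               tau=-0.05, lmp_p=0.3, lmp_q=0.1)
single = prosumer_revenue(**example)
total = market_revenue([example, example])
```

Only the marginal tariffs couple a prosumer to the rest of the network in this expression, which is why the private parameters never need to leave the prosumer.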

In addition, the trading results need to satisfy the ADN security constraints as well as the market supply and demand balance constraints, which are shown as:

\sum_{i = 1}^{N_{B} + N_{S}} (P_{i, t}^{B S} + P_{i, t}^{B S, B E S}) - \sum_{b = 1}^{N_{l i n e}} P_{b, b + 1, t}^{P 2 P, l o s s} = 0

(18)

\sum_{i = 1}^{N_{B} + N_{S}} Q_{i, t}^{B S} - \sum_{b = 1}^{N_{l i n e}} Q_{b, b + 1, t}^{P 2 P, l o s s} = 0

(19)

V_{m i n} \leq Δ V_{i, t}^{P 2 P} \leq V_{m a x}

(20)

\left\{ \begin{array}{l} \underline{P} \leq P_{i, t}^{BS} \leq \bar{P} \\ \underline{Q} \leq Q_{i, t}^{BS} \leq \bar{Q} \end{array} \right.

(21)

where $P_{i, t}^{BS, BES}$ is the active power of the prosumer’s own BES; $P_{b, b + 1, t}^{P2P, loss}$ and $Q_{b, b + 1, t}^{P2P, loss}$ are the network active and reactive power losses from branches b to $b + 1$ at time t caused by the P2P energy trading, respectively; $Δ V_{i, t}^{P2P}$ is the amount of voltage amplitude change caused by the P2P energy trading; $\bar{P}$ and $\underline{P}$ are the upper and lower limits of active power regulation for prosumers, respectively; and $\bar{Q}$ and $\underline{Q}$ are the upper and lower limits of reactive power regulation for prosumers, respectively.

The ADN cannot access the specific power consumption information of prosumers for privacy protection and market fairness. Therefore, we decompose the original problem into multiple subproblems, thus facilitating the subsequent solution using a distributed approach.

The changes in active and reactive power of each prosumer affect the network losses and nodal voltages. Consequently, we incorporate all constraints into the electricity utility function $w_{i, t}$ of each prosumer and differentiate it to determine the marginal tariffs for active and reactive power [33].

\begin{array}{l} w_{i, t} = u_{i, t} + μ_{i, t}^{Vmin} (Δ V_{i, t}^{P2P} - V_{min}) + μ_{i, t}^{Vmax} (V_{max} - Δ V_{i, t}^{P2P}) + \\ \quad μ_{i, t}^{P} \Big[ \sum_{i = 1}^{N_{B} + N_{S}} (P_{i, t}^{BS} + P_{i, t}^{BS, BES}) - \sum_{b = 1}^{N_{line}} P_{b, b + 1, t}^{P2P, loss} \Big] + \\ \quad μ_{i, t}^{Q} \Big( \sum_{i = 1}^{N_{B} + N_{S}} Q_{i, t}^{BS} - \sum_{b = 1}^{N_{line}} Q_{b, b + 1, t}^{P2P, loss} \Big) + μ_{i, t}^{Pmax} (P_{i, t}^{BS} - \underline{P}) + \\ \quad μ_{i, t}^{Pmin} (\bar{P} - P_{i, t}^{BS}) + μ_{i, t}^{Qmax} (Q_{i, t}^{BS} - \underline{Q}) + μ_{i, t}^{Qmin} (\bar{Q} - Q_{i, t}^{BS}) \end{array}

(22)

\{\begin{array}{l} δ_{i, t}^{P L M P} = \partial w_{i, t} / \partial P_{i, t}^{B S} \\ δ_{i, t}^{Q L M P} = \partial w_{i, t} / \partial Q_{i, t}^{B S} \end{array}

(23)

where $μ_{i, t}^{V m a x}$ and $μ_{i, t}^{V m i n}$ are the dual variables corresponding to the upper and lower voltage constraints at node i at time t, respectively; $μ_{i, t}^{P}$ and $μ_{i, t}^{Q}$ are the dual variables corresponding to the active and reactive power balance constraints at node i at time t, respectively; $μ_{i, t}^{P m a x}$ and $μ_{i, t}^{P m i n}$ are the dual variables corresponding to the upper and lower active power constraints at node i at time t, respectively; and $μ_{i, t}^{Q m a x}$ and $μ_{i, t}^{Q m i n}$ are the dual variables corresponding to the upper and lower reactive power constraints at node i at time t, respectively.

IV. Proposed SAC-DTC Algorithm

If ADN dispatching and P2P energy trading are carried out without considering their impact on the system, the resulting trading and control decisions may violate the system operation constraints, ultimately leading to device failure or system instability. Therefore, we propose the SAC-DTC algorithm to coordinate the optimization process between the ADN and the P2P market and to achieve the global optimum within a solution space that keeps the voltage levels within safe limits. The objective is to minimize the ADN operation cost (including the regulation costs of devices and the costs of network loss) and maximize the P2P market revenue, while ensuring the safe operation of the system.

The proposed SAC-DTC algorithm combines a DRL algorithm with distributed control computing. The structure of the proposed SAC-DTC algorithm is shown in Fig. 2. It should be noted that the proposed SAC-DTC algorithm continues the learning process during the online operation.

Fig. 2 Structure of proposed SAC-DTC algorithm.

The optimization process of ADN and P2P market can be modeled using the MDP, as shown in Fig. 3.

Fig. 3 Optimization process of ADN and P2P market using MDP.

First, the agent gives the optimal action $a$ of each device in the ADN based on the local state $s$ . Then, it calculates the network loss and node voltage in the ADN and issues the information to the P2P market. Subsequently, the prosumers adjust the output according to their interests and return the profits to ADN after differential privacy encryption processing. Finally, ADN calculates the reward value R based on (1) and (16), and then puts the data into the experience buffer pool $𝒟$ to update the network parameters.

R = \sum_{i = 1}^{N_{B} + N_{S}} U_{i, t}^{P 2 P} - C_{t}^{A D N}

(24)
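The reward computation in (24) can be sketched as follows. The text above only states that the trading results are processed with differential privacy before being returned; the Laplace mechanism used here is an assumption chosen for illustration, and the names `privatize`, `sensitivity`, and `epsilon` are hypothetical rather than taken from this paper.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def privatize(value, sensitivity, epsilon, rng):
    """Add Laplace noise with scale sensitivity / epsilon (assumed mechanism)."""
    return value + laplace_noise(sensitivity / epsilon, rng)


def social_welfare_reward(reported_revenues, adn_cost):
    """Reward of (24): total reported P2P revenue minus ADN operation cost."""
    return sum(reported_revenues) - adn_cost


rng = random.Random(0)
true_revenues = [0.95, 1.05]            # illustrative prosumer revenues
reported = [privatize(u, sensitivity=0.1, epsilon=1.0, rng=rng)
            for u in true_revenues]     # what the ADN actually receives
reward = social_welfare_reward(reported, adn_cost=1.2)
```

A smaller `epsilon` gives stronger privacy but a noisier reward signal, so the agent would see a perturbed version of the true social welfare at every step.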

The MDP consists of five key elements: state space s, action space a, state transition probability $P$ , reward function $R$ , and discount factor $γ$ , represented by $\langle s, a, P, R, γ \rangle$ .

A. Continuous-discrete Hybrid SAC

For the reinforcement learning in a continuous-discrete hybrid action space, assuming that there are n discrete devices and that device i has $m_{i}$ actions, the output action dimension of the state-action value function Q will be $\prod_{i = 1}^{n} m_{i}$ . The action dimension grows exponentially as the number of devices n increases. If a separate Q value is estimated for each possible combination of actions, the data required to be calculated and stored will grow rapidly and suffer from the curse of dimensionality. Therefore, inspired by [38], we use a separate head for each discrete device. The head is only responsible for calculating the Q value associated with the actions of that device, which is expressed as:

Q (s, a) = c_{0} (s) + \sum_{i = 1}^{n} c_{i} (s) Q_{i} (s, a)

(25)

where $Q_{i}$ is the Q value of device i; and $c_{0}$ and $c_{i}$ are the shared base value and the state parameters of device i, respectively.
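The saving obtained from (25) can be seen by counting outputs. The sketch below is a simplified illustration: it treats each head as a plain list of per-action values and assumes that head i is indexed by the action of device i only. The function names and numbers are hypothetical.

```python
from math import prod


def joint_output_size(actions_per_device):
    """Outputs needed when every action combination gets its own Q value."""
    return prod(actions_per_device)


def factored_output_size(actions_per_device):
    """Outputs needed with one head per discrete device."""
    return sum(actions_per_device)


def factored_q(c0, c, q_heads, actions):
    """Combine per-device heads as in (25).

    c0: shared base value, c[i]: state-dependent weight of device i,
    q_heads[i][a]: value of head i for action a (assumed indexing).
    """
    return c0 + sum(ci * head[a] for ci, head, a in zip(c, q_heads, actions))


# Illustrative setup: one 5-tap OLTC and two 3-step CBs.
sizes = [5, 3, 3]
joint = joint_output_size(sizes)        # grows as a product
factored = factored_output_size(sizes)  # grows as a sum

q_value = factored_q(c0=1.0, c=[0.5, 2.0],
                     q_heads=[[0.0, 1.0, 2.0], [3.0, 4.0]],
                     actions=[2, 1])
```

With per-device heads the number of outputs grows linearly in the number of devices, whereas the joint table grows as the product of the individual action counts.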

During the training process, the formula for calculating the network target value $y$ is:

\begin{array}{l} y = R + γ \Big[ c_{0} (s^{'}) + \sum_{i = 1}^{n} c_{i} (s^{'}) Q_{target, i}^{disc} (s^{'}, a_{disc, i}^{'}) + Q_{target}^{cont} (s^{'}, a_{cont}^{'}) - \\ \quad α \Big( \log π_{cont} (a_{cont}^{'} | s^{'}) + \sum_{i = 1}^{n} π_{disc, i} (a_{disc, i}^{'} | s^{'}) \log π_{disc, i} (a_{disc, i}^{'} | s^{'}) \Big) \Big] \end{array}

(26)

where $s^{'}$ is the new state; $α$ is the temperature parameter used to control the contribution of entropy in the policy update; $a_{cont}^{'}$ and $a_{disc, i}^{'}$ are the continuous and discrete actions in the new state, respectively; $Q_{target}^{cont}$ and $Q_{target, i}^{disc}$ are the target state-action value functions for the continuous and discrete actions, respectively; and $π_{cont}$ and $π_{disc, i}$ are the corresponding policy functions.

The parameters of the critic network are updated by minimizing the mean square error $J_{Q}$ between the predicted Q value of the critic network $Q_{φ_{c r i t i c}}$ and the target value y. Then, the parameters of actor network are updated by minimizing the loss function $J_{π}$ :

J_{Q} = \underset{(s, a, R, s^{'}) \sim \mathcal{D}}{E} \Big[ \frac{1}{2} \big( Q_{φ_{critic}} (s, a) - y \big)^{2} \Big]

(27)

J_{π} = \underset{s \sim \mathcal{D}, a \sim π_{φ_{actor}}}{E} \Big[ α \log π_{φ_{actor}} (a | s) - \min Q_{φ_{critic}} (s, a) \Big]

(28)

where $π_{φ_{a c t o r}}$ represents the probability of taking action $a$ given state $s$ under the policy parameterized by $φ_{a c t o r}$ .

Finally, the target network slowly tracks the training network through a soft update:

φ_{t a r g e t} \leftarrow θ φ + (1 - θ) φ_{t a r g e t}

(29)

where $φ = φ_{a c t o r}$ or $φ_{c r i t i c}$ is the training network parameter; $φ_{t a r g e t}$ is the target network parameter; and $θ$ is the soft update rate.
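The soft update (29) is a simple exponential moving average of the training parameters. A minimal sketch, with parameters represented as flat lists of floats and a hypothetical function name, is:

```python
def soft_update(target_params, online_params, theta):
    """Soft update of (29): target <- theta * online + (1 - theta) * target."""
    return [theta * w + (1.0 - theta) * w_t
            for w, w_t in zip(online_params, target_params)]


# Illustrative values: the target moves a fraction theta toward the online net.
target = [0.0, 0.0]
online = [1.0, 2.0]
target = soft_update(target, online, theta=0.1)

# Repeating the update drives the target toward the online parameters.
tracked = [0.0, 0.0]
for _ in range(200):
    tracked = soft_update(tracked, online, theta=0.1)
```

A small soft update rate keeps the target values in (26) nearly stationary between updates, which is the usual reason for using a target network at all.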

B. Distributed Trading Control (DTC)

For the prosumers at each node, adjusting the active and reactive power during the energy trading process will bring changes to their benefits or costs as well as the node voltage and network loss. Therefore, in this paper, based on the dual ascent method of sensitivity calculation [33], we calculate the revenues of prosumers during the energy trading process and assess the impact on the ADN operation cost. The change of state of ADN is linearized as:

Δ Z (s_{t}) = \begin{bmatrix} g_{V} \\ g_{P} \\ g_{Q} \end{bmatrix} \begin{bmatrix} Δ P_{t}^{BS} \\ Δ Q_{t}^{BS} \end{bmatrix}

(30)

g_{V} = \begin{bmatrix} \frac{\partial V_{1, t}}{\partial P_{1, t}} & \cdots & \frac{\partial V_{1, t}}{\partial P_{N_{B} + N_{S}, t}} & \frac{\partial V_{1, t}}{\partial Q_{1, t}} & \cdots & \frac{\partial V_{1, t}}{\partial Q_{N_{B} + N_{S}, t}} \\ \vdots & & \vdots & \vdots & & \vdots \\ \frac{\partial V_{N_{node}, t}}{\partial P_{1, t}} & \cdots & \frac{\partial V_{N_{node}, t}}{\partial P_{N_{B} + N_{S}, t}} & \frac{\partial V_{N_{node}, t}}{\partial Q_{1, t}} & \cdots & \frac{\partial V_{N_{node}, t}}{\partial Q_{N_{B} + N_{S}, t}} \end{bmatrix}

(31)

g_{P} = \begin{bmatrix} \frac{\partial P_{loss, t}}{\partial P_{1, t}} & \cdots & \frac{\partial P_{loss, t}}{\partial P_{N_{B} + N_{S}, t}} & \frac{\partial P_{loss, t}}{\partial Q_{1, t}} & \cdots & \frac{\partial P_{loss, t}}{\partial Q_{N_{B} + N_{S}, t}} \end{bmatrix}

(32)

g_{Q} = \begin{bmatrix} \frac{\partial Q_{loss, t}}{\partial P_{1, t}} & \cdots & \frac{\partial Q_{loss, t}}{\partial P_{N_{B} + N_{S}, t}} & \frac{\partial Q_{loss, t}}{\partial Q_{1, t}} & \cdots & \frac{\partial Q_{loss, t}}{\partial Q_{N_{B} + N_{S}, t}} \end{bmatrix}

(33)

Δ P_{t}^{B S} = {[\begin{matrix} Δ P_{1, t}^{B S} & Δ P_{2, t}^{B S} & \dots & Δ P_{N_{B} + N_{S}, t}^{B S} \end{matrix}]}^{T}

(34)

Δ Q_{t}^{B S} = {[\begin{matrix} Δ Q_{1, t}^{B S} & Δ Q_{2, t}^{B S} & \dots & Δ Q_{N_{B} + N_{S}, t}^{B S} \end{matrix}]}^{T}

(35)

where $Δ Z (s_{t})$ is the state change matrix function; $g_{V} \in \mathbb{R}^{N_{node} \times 2 (N_{B} + N_{S})}$ , $g_{P} \in \mathbb{R}^{1 \times 2 (N_{B} + N_{S})}$ , and $g_{Q} \in \mathbb{R}^{1 \times 2 (N_{B} + N_{S})}$ are the linear mapping functions for the node voltage, active power loss of ADN, and reactive power loss of ADN, respectively; and $Δ P_{t}^{BS}$ and $Δ Q_{t}^{BS}$ are the vectors of active power and reactive power adjustments during energy trading for the prosumers, respectively.
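The linearization in (30) amounts to a matrix-vector product between a sensitivity matrix and the stacked power adjustments. The sketch below applies only the voltage block $g_{V}$; the sensitivity values are made up for illustration, and the function name is hypothetical.

```python
def linearized_voltage_change(g_v, delta_p, delta_q):
    """Voltage rows of (30): delta_V = g_V @ [delta_P; delta_Q].

    g_v is a list of rows, one per node, each with 2 * (N_B + N_S) entries:
    first the sensitivities to active power, then those to reactive power.
    """
    x = list(delta_p) + list(delta_q)   # stacked adjustment vector
    if any(len(row) != len(x) for row in g_v):
        raise ValueError("g_v must have 2 * (N_B + N_S) columns")
    return [sum(g * d for g, d in zip(row, x)) for row in g_v]


# Two nodes, two prosumers; sensitivities are illustrative numbers only.
g_v = [[0.01, 0.02, 0.005, 0.010],
       [0.02, 0.03, 0.010, 0.015]]
delta_v = linearized_voltage_change(g_v, delta_p=[1.0, -1.0],
                                    delta_q=[2.0, 0.0])
```

The rows $g_{P}$ and $g_{Q}$ are applied in the same way and each yields a single scalar, namely the change in active or reactive network loss.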

The mapping function can be fitted based on a neural network, but this requires a separate neural network for each variable and constraint, which will also fall into the curse of dimensionality. Therefore, in this paper, we utilize the sensitivity matrix as an equivalent alternative to the mapping function and validate the accuracy of the solution. The original problem (16)-(23) in the P2P market is transformed into a quadratic programming problem as:

\left\{ \begin{array}{l} \min U_{t}^{P2P} = - \frac{1}{2} x^{T} M x - H^{T} x \\ \mathrm{s.t.} \; A x \leq B \end{array} \right.

(36)

\begin{cases} x = \begin{bmatrix} P_{t}^{BS} + \Delta P_{t}^{BS} \\ Q_{t}^{BS} + \Delta Q_{t}^{BS} \end{bmatrix} \\ M = [m_{P} \ \ m_{Q}]^{T} \\ H = [h_{P} \ \ h_{Q}]^{T} \\ A = [g_{V} \ \ -g_{V} \ \ g_{P} \ \ g_{Q} \ \ 1_{1 \times 4(N_{B}+N_{S})}]^{T} \\ B = [\bar{V} \ \ \underline{V} \ \ 0 \ \ 0 \ \ \bar{P} \ \ \underline{P} \ \ \bar{Q} \ \ \underline{Q}]^{T} \end{cases}

(37)

where the matrix parameters $m_{P}$, $m_{Q}$, $h_{P}$, and $h_{Q}$ are extracted from the objective function for prosumers shown in (17); $\bar{V}$ and $\underline{V}$ are the matrices of upper and lower limits of node voltages, respectively; $\bar{P}$ and $\underline{P}$ are the matrices of upper and lower limits of active power for prosumers, respectively; and $\bar{Q}$ and $\underline{Q}$ are the matrices of upper and lower limits of reactive power for prosumers, respectively.

The dual function is:

\inf_{x \in \mathbb{R}^{n}} \left\{ \frac{1}{2} x^{T} M x + H^{T} x + \mu^{T} (A x - B) \right\}

(38)

The infimum in (38) is attained at $x = -M^{-1}(H + A^{T}\mu)$. By disregarding the constant term and changing the sign of the objective function, the maximization problem is transformed into a minimization problem to obtain the dual problem as:

\begin{cases} \min d = \frac{1}{2} \mu^{T} A M^{-1} A^{T} \mu + (B + A M^{-1} H)^{T} \mu \\ \text{s.t.} \ \ \mu = [\mu_{i,t}^{Vmax}, \mu_{i,t}^{Vmin}, \mu_{i,t}^{P}, \mu_{i,t}^{Q}, \mu_{i,t}^{Pmax}, \mu_{i,t}^{Pmin}, \mu_{i,t}^{Qmax}, \mu_{i,t}^{Qmin}]^{T} \geq 0 \end{cases}

(39)

where $μ$ is the vector of Lagrangian multipliers associated with the constraints.
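The step from (38) to (39) can be checked numerically. The following Python sketch uses randomly generated, purely hypothetical matrices (with $M$ constructed to be symmetric positive definite) to confirm that $x = -M^{-1}(H + A^{T}\mu)$ is the stationary point of the Lagrangian, that the negated infimum equals $d(\mu)$ up to the constant $\frac{1}{2} H^{T} M^{-1} H$, and that the gradient of $d$ equals the negative constraint residual $B - Ax$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 4, 6                          # hypothetical sizes: variables, constraints
G = rng.normal(size=(n, n))
M = G @ G.T + n * np.eye(n)          # symmetric positive definite by construction
H = rng.normal(size=n)
A = rng.normal(size=(m, n))
B = rng.normal(size=m)
M_inv = np.linalg.inv(M)

def primal_minimizer(mu):
    """Minimizer of the Lagrangian over x for fixed multipliers mu."""
    return -M_inv @ (H + A.T @ mu)

def lagrangian(x, mu):
    return 0.5 * x @ M @ x + H @ x + mu @ (A @ x - B)

def dual_objective(mu):
    """d(mu) as in (39), with the constant term dropped."""
    return 0.5 * mu @ (A @ M_inv @ A.T) @ mu + (B + A @ M_inv @ H) @ mu

def dual_gradient(mu):
    return A @ M_inv @ A.T @ mu + B + A @ M_inv @ H

mu = rng.uniform(size=m)             # any nonnegative multiplier vector
x_star = primal_minimizer(mu)

stationarity = M @ x_star + H + A.T @ mu      # should vanish
dropped_constant = 0.5 * H @ M_inv @ H        # the term disregarded in (39)
gap = -lagrangian(x_star, mu) - (dual_objective(mu) + dropped_constant)
```

The identity $\nabla d = B - Ax$ means that each dual update only needs the constraint residual of the current primal solution.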

When the original problem is convex, we can find the gradient of the dual function $d$ and obtain:

\nabla d = A M^{- 1} A^{T} μ + B + A M^{- 1} H

(40)

d(\mu') \leq d(\mu) + (\nabla d(\mu))^{T} (\mu' - \mu) + \frac{L}{2} \| \mu' - \mu \|^{2}

(41)

For any two points $\mu$ and $\mu'$ in the domain of the dual function $d$, (41) states that the value of $d$ is at most its linear approximation plus a quadratic term, which depends on the distance between the two points and the Lipschitz constant $L$. The Lipschitz constant should be the largest eigenvalue of $A M^{-1} A^{T}$, because the largest eigenvalue of the matrix of a quadratic function determines its maximum curvature. According to [33], the generalized Lipschitz constant matrix $L \succcurlyeq A M^{-1} A^{T}$ is used to determine the step size in the dual ascent method, where $L$ is set to be a diagonal matrix. The matrix $L$ is obtained by solving a semidefinite programming problem that minimizes its trace.
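The role of the step-size matrix can be sketched with a projected gradient iteration on the dual problem (39). The Python example below is a simplified illustration and not the generalized fast dual ascent method of [33]: it omits acceleration, and instead of solving the trace-minimizing semidefinite program it uses a diagonal matrix built from absolute row sums, which satisfies $L \succcurlyeq A M^{-1} A^{T}$ by diagonal dominance but is generally not the minimum-trace choice. The toy problem data and all function names are hypothetical.

```python
import numpy as np

def scalar_step_matrix(K):
    """Classical choice: L = lambda_max(K) * I, stored as its diagonal."""
    return np.full(K.shape[0], np.linalg.eigvalsh(K).max())

def diagonal_step_matrix(K):
    """Assumed stand-in for the SDP: L_ii = sum_j |K_ij| gives L - K >= 0."""
    return np.abs(K).sum(axis=1)

def projected_dual_descent(M, H, A, B, make_L, iters=500):
    """Minimize d(mu) over mu >= 0 with steps scaled by the inverse of L."""
    M_inv = np.linalg.inv(M)
    K = A @ M_inv @ A.T                  # curvature of the dual objective
    c = B + A @ M_inv @ H
    L_diag = make_L(K)
    mu = np.zeros(len(B))
    for _ in range(iters):
        grad = K @ mu + c                # gradient (40)
        mu = np.maximum(0.0, mu - grad / L_diag)  # step, then project on mu >= 0
    x = -M_inv @ (H + A.T @ mu)          # recover the primal solution
    return x, mu

# Toy problem: min 0.5*x'x - x1 - x2  s.t.  x1 + x2 <= 1, x1 <= 2.
M = np.eye(2)
H = np.array([-1.0, -1.0])
A = np.array([[1.0, 1.0], [1.0, 0.0]])
B = np.array([1.0, 2.0])

x_scalar, mu_scalar = projected_dual_descent(M, H, A, B, scalar_step_matrix)
x_diag, mu_diag = projected_dual_descent(M, H, A, B, diagonal_step_matrix)
```

In this toy case, both step-size choices converge to $x = [0.5, 0.5]^{T}$ with only the first constraint active. A diagonal $L$ allows each multiplier to use its own step size, which is the motivation for the generalized Lipschitz matrix.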

After solving the dual problem (39), the optimal power for each prosumer is obtained, which is then substituted into (16) to obtain the maximum welfare of each prosumer $U_{i,t}^{P2P*}$. To protect the privacy of prosumers, a differential privacy technique is used, which adds random noise sampled from the Laplace distribution $\mathrm{Lap}(\cdot)$ to the data, as expressed in (42). The noisy result is then returned to the agent for learning as part of the reward.

U_{t}^{P2P} = \sum_{i=1}^{N_{B}+N_{S}} \left( U_{i,t}^{P2P*} + \mathrm{Lap}\left(\frac{\Delta f}{\epsilon}\right) \right)

(42)

where $Δ f$ is the sampling sensitivity, representing the maximum variation that $U_{i, t}^{P 2 P *}$ may experience; and $ϵ$ is the privacy strength parameter, whose value is smaller for stronger privacy protection.

Since the noise is random and its mathematical expectation is 0, the effects of the noise are canceled when aggregating large amounts of data. This ensures that the statistical estimation of total P2P market revenues remains accurate.
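The Laplace mechanism in (42) and the averaging argument can be sketched in a few lines of Python. The welfare values, the sensitivity, and the privacy parameter below are hypothetical and serve only to show that individual reports are perturbed while the mean over many reports stays close to the true total.

```python
import numpy as np

def noisy_total_welfare(welfare, sensitivity, epsilon, rng):
    """Sum of per-prosumer welfare, each perturbed by Lap(sensitivity/epsilon)."""
    scale = sensitivity / epsilon        # smaller epsilon -> larger noise
    noise = rng.laplace(loc=0.0, scale=scale, size=welfare.shape)
    return float(np.sum(welfare + noise))

rng = np.random.default_rng(seed=0)
welfare = np.array([120.0, 95.0, 110.0, 80.0, 95.0])  # hypothetical, sums to 500

# One noisy report, as would be returned to the agent at a single time step.
single_report = noisy_total_welfare(welfare, sensitivity=1.0, epsilon=0.5, rng=rng)

# Many reports: the zero-mean noise averages out.
reports = np.array([
    noisy_total_welfare(welfare, sensitivity=1.0, epsilon=0.5, rng=rng)
    for _ in range(20000)
])
mean_report = reports.mean()
```

A smaller $\epsilon$ widens the noise distribution, so more samples are needed before the average settles near the true total.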

The detailed calculation procedure of the proposed SAC-DTC algorithm is explained in Algorithm 1.

Algorithm 1: detailed calculation procedure of proposed SAC-DTC algorithm
S1: Initialize $φ_{actor}$, $φ_{critic}$, $φ_{target}$, $θ$, $D = \emptyset$, $μ = 0$, time step $Δt = 1$ hour, and the maximum time step $T = 24$ hours
S2: Repeat
S3:   for $t = 1 : Δt : T$ do
S4:     $a \sim π(a \mid s)$
S5:     Calculate power flow
S6:     Release $V_{i,t}^{NET}$, $g_{V}$, $g_{P}$, $g_{Q}$, and locational marginal price (LMP) to prosumers
S7:     Solve (16) for each prosumer
S8:     Update $μ$ and LMP
S9:     Update $a$, $s'$, and $R$, and store $[s, a, R, s']$ in $D$
S10:   end for
S11:   Update $φ_{actor}$ and $φ_{critic}$ using (25)-(27)
S12:   Update $φ_{target}$ using (28)
S13: end

V. Case Studies

A. System Setting

This paper evaluates the proposed SAC-DTC algorithm using the IEEE 33-node system. We assume that five prosumers participate in the P2P energy trading, and the basic parameters of the utility function can be found in [33]. The training process uses base loads at all nodes of this distribution network, with load data originating from a regional grid in southern China over a time span of 1000 randomly selected days. According to [39], the upper and lower limits of node voltage amplitude are set to be 1.06 and 0.94 p.u., respectively.

Three operation models are set up to compare the effectiveness in reducing the ADN operation cost and improving the P2P market revenue.

Model 1: voltage constraints are not considered. The ADN is optimized with the objective of minimizing its operation cost, and the P2P market is optimized with the objective of maximizing its operation revenue.

Model 2: based on Model 1, the system voltage constraints are further considered, and the P2P market is optimized based on the method in [33].

Model 3: as illustrated in Section III, the voltage constraints are considered, and the total social welfare, which combines the P2P market revenue and the ADN operation cost, is taken as the objective function. The joint optimization is run using the proposed SAC-DTC algorithm.

B. Convergence Performance

There have been several studies applying DRL algorithms to the power system domain. In this subsection, we focus on comparing the SAC algorithm with the widely used DDPG and PPO algorithms. All three DRL algorithms utilize an actor-critic architecture. The DDPG algorithm employs a deterministic policy network (actor) to directly predict actions and evaluates the expected returns of these actions through a value network (critic). In contrast, the PPO algorithm ensures the stability and convergence of policy updates by introducing a clipped loss function that limits the magnitude of these updates, while the SAC algorithm encourages broader exploration by increasing policy entropy. The hyperparameters are shown in Tables II-IV. The Ornstein-Uhlenbeck noises are provided in [10] and [41].
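The difference between the PPO and SAC mechanisms described above can be sketched with their standard textbook forms. The Python example below is a generic illustration and not the implementation used in this paper: `ppo_clipped_objective` evaluates the clipped surrogate term with the clip ratio of 0.2 listed in Table IV, and `sac_soft_value` adds the entropy bonus $-\alpha \log \pi(a \mid s)$ to a critic estimate, where the temperature `alpha` is a hypothetical fixed value (Table III lists the entropy coefficient as automatically tuned).

```python
import numpy as np

def ppo_clipped_objective(ratio, advantage, clip_ratio=0.2):
    """PPO surrogate: min(r * A, clip(r, 1 - c, 1 + c) * A)."""
    clipped = np.clip(ratio, 1.0 - clip_ratio, 1.0 + clip_ratio)
    return np.minimum(ratio * advantage, clipped * advantage)

def sac_soft_value(q_value, log_prob, alpha):
    """SAC soft value sample: Q(s, a) - alpha * log pi(a | s)."""
    return q_value - alpha * log_prob

# A large probability ratio with a positive advantage is capped at 1 + c.
capped_gain = ppo_clipped_objective(ratio=1.5, advantage=1.0)
# A small ratio with a negative advantage keeps the more pessimistic value.
capped_loss = ppo_clipped_objective(ratio=0.5, advantage=-1.0)
# An unlikely action (low log-probability) receives an exploration bonus.
soft_value = sac_soft_value(q_value=1.0, log_prob=-2.0, alpha=0.1)
```

Clipping bounds how far a single update can move the policy, whereas the entropy term rewards the policy for keeping low-probability actions available.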

TABLE II Common Hyperparameters for Three DRL Algorithms

Hyperparameter	Value	Hyperparameter	Value
Architecture of actor and critic networks	[256, 256]	Activation function	ReLU
Optimizer	Adam	Discount factor	0.99
Actor learning rate	1×10^-3	$T$	24 hours
Critic learning rate	5×10^-4	$Δ t$	1 hour
Minibatch size	64	Evaluation frequency	3

Figure 4 demonstrates the training performance using SAC-DTC, DDPG-DTC, and PPO-DTC algorithms in the IEEE 33-node system, where the shaded area represents the range of fluctuation of these algorithms over the course of multiple training sessions. It can be observed that the proposed SAC-DTC algorithm performs better in reducing the ADN operation cost. As for the P2P market revenue, all the three algorithms show similar convergence, mainly attributed to the effectiveness of DTC. Overall, the proposed SAC-DTC algorithm outperforms both DDPG-DTC and PPO-DTC algorithms regarding the training speed and final results, indicating its potential advantages in power system optimization.

Fig. 4 Training performance using SAC-DTC, DDPG-DTC, and PPO-DTC algorithms in IEEE 33-node system. (a) Total reward. (b) ADN operation cost. (c) P2P market revenue.

TABLE III Independent Hyperparameters for SAC and DDPG Algorithms

Hyperparameter	SAC algorithm	DDPG algorithm
Target network update rate	0.005	0.005
Replay buffer size	5×10⁵	5×10⁵
Entropy coefficient	Auto	-
Noise type	-	Ornstein-Uhlenbeck

TABLE IV Independent Hyperparameters for PPO Algorithm

Hyperparameter	Value
Value function coefficient	0.5
Generalized advantage estimation Lambda	0.95
Clip ratio	0.2
Number of epochs	3
Gradient clipping	0.1

C. ADN Operation Costs and P2P Market Revenue

The results of the three operation models are presented in Table V.

TABLE V Results of Three Operation Models

Model	ADN operation cost (CNY)	P2P market revenue (CNY)	Number of voltage violations	The maximum voltage difference (p.u.)
Model 1	1054	6491	188	0.12350
Model 2	1615	5561	0	0.07934
Model 3	1481	6283	0	0.07254

Figure 5 shows the node voltage comparisons of the three operation models. Figure 6 illustrates the comparison of ADN operation costs and P2P market revenues with the three operation models.

Fig. 5 Node voltage comparison of three operation models. (a) Model 1. (b) Model 2. (c) Model 3.

Fig. 6 Comparison of ADN operation costs and P2P market revenues. (a) ADN operation cost. (b) P2P market revenue.

From a system security perspective, during hours 8-20, Model 1 exhibits the largest voltage deviation, with several node voltages falling below the lower limit at various time points. During other periods, the system does not experience voltage violations. Models 2 and 3 are able to operate safely throughout all periods because the voltage constraints are considered in the optimization process of the ADN and P2P markets. In Model 2, the optimization processes of the ADN and P2P markets operate independently, and the lower bound of system voltage is generally higher than that in Model 3, but the maximum voltage variation is greater.

Additionally, Fig. 7 shows the comparison results of LMP.

Fig. 7 Comparison results of LMPs. (a) PLMP in Model 1. (b) QLMP in Model 1. (c) PLMP in Model 2. (d) QLMP in Model 2. (e) PLMP in Model 3. (f) QLMP in Model 3.

From Figs. 6, 7(a), and 7(b), it can be observed that when the voltage constraints are not considered in Model 1, the ADN does not need to regulate its actions. Each prosumer only needs to fine-tune its output value according to the active and reactive power balance constraints of the P2P market. Hence, the differences among nodes in the LMPs for the active power and reactive power, i.e., PLMP and QLMP, respectively, are minor, and all PLMPs are positive. Prosumers 1 and 2 have negative active power values, i.e., they act as consumers that absorb energy from the P2P market and pay for the cost of electricity consumption. Prosumers 3-5 have positive active power values, i.e., they act as producers that supply energy to the P2P market and receive revenues from electricity sales. At this point, the ADN has the lowest cost, and the P2P market has the highest revenue.

As shown in Fig. 7(c) and (d), during hours 0-7 and 21-24, the energy trading between producers and consumers is constrained by the voltage limitations in Model 2. The PLMP decreases, which reduces the size of the energy trading between producers and consumers. During hours 8-20, the PLMP and QLMP of consumers increase significantly due to the voltage limitation constraints, leading consumers to reduce their power consumption. The PLMP and QLMP of producers decrease dramatically to negative values due to the reduced power consumption of consumers. To maintain the power balance, the PLMP guides the producers to reduce the amount of electricity sold using a negative price signal. The effectiveness of the LMP mechanism can be illustrated by comparing the changes in PLMP and QLMP in Model 1 and Model 2. The P2P market can utilize economic instruments to efficiently dispatch the active and reactive power for each prosumer, thereby mitigating the violation of the lower voltage limit. The results of Model 1 and Model 2 in Fig. 6(a) and (b) are almost the same during hours 0-7 and 21-24. This is because the network constraints are met in both models. However, in Model 2, when there is a voltage violation during hours 8-20, due to the lack of complete information in the ADN and P2P markets, the two parties can only supervise their internal devices independently to ensure the safe operation. This leads to an increase in the ADN operation cost by 53.2% and an increase in the P2P regulation cost by 14.3% (i.e., the reduction in P2P market revenue shown in Table V) compared with Model 1.

In Model 3, based on the proposed SAC-DTC algorithm, the encrypted information can be shared between the ADN and the prosumers. The system security regulation cost can be effectively shared between the ADN and each prosumer. As can be observed in Fig. 7(e) and (f), the PLMP and QLMP changes of prosumers in Model 3 are much less drastic than those in Model 2, while the two models behave similarly during hours 0-7 and 21-24. However, during hours 8-20, the PLMP and QLMP of consumers in Model 3 are overall lower than those in Model 2, indicating that consumers are able to purchase electricity in the P2P market at a lower cost. Producers, on the other hand, have an overall increase in PLMP and QLMP, indicating that producers can supply electricity to the P2P market at a higher price and make higher profits. The ADN operation costs increase during certain time periods due to the earlier adjustment of devices. On the premise of ensuring the safe operation of the system, compared with Model 2, the ADN operation cost of Model 3 is reduced by 8.3%, the P2P market revenue increases by 12.9%, and the maximum voltage difference is minimized, making the system operate more stably. The accumulated savings in ADN operation costs for the whole year amount to 49000 CNY, and the P2P market revenue increases by 264000 CNY.
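The percentages quoted above can be recomputed directly from Table V. The short Python sketch below does so and also annualizes the Model 3 gains under the assumption, made here only for illustration, that the daily figures in Table V repeat on each of 365 days.

```python
# Daily results copied from Table V (CNY).
table_v = {
    "Model 1": {"adn_cost": 1054, "p2p_revenue": 6491},
    "Model 2": {"adn_cost": 1615, "p2p_revenue": 5561},
    "Model 3": {"adn_cost": 1481, "p2p_revenue": 6283},
}

def pct_change(new, old):
    """Relative change of new with respect to old, in percent."""
    return 100.0 * (new - old) / old

m1, m2, m3 = (table_v[k] for k in ("Model 1", "Model 2", "Model 3"))

# Model 2 versus Model 1: cost of enforcing voltage limits independently.
cost_m2_vs_m1 = pct_change(m2["adn_cost"], m1["adn_cost"])
revenue_m2_vs_m1 = pct_change(m2["p2p_revenue"], m1["p2p_revenue"])

# Model 3 versus Model 2: benefit of joint optimization.
cost_m3_vs_m2 = pct_change(m3["adn_cost"], m2["adn_cost"])
revenue_m3_vs_m2 = pct_change(m3["p2p_revenue"], m2["p2p_revenue"])

# Assumed annualization: the same daily difference on 365 days.
annual_cost_saving = (m2["adn_cost"] - m3["adn_cost"]) * 365
annual_revenue_gain = (m3["p2p_revenue"] - m2["p2p_revenue"]) * 365

print(f"{cost_m2_vs_m1:.1f} {revenue_m2_vs_m1:.1f}")
print(f"{cost_m3_vs_m2:.1f} {revenue_m3_vs_m2:.1f}")
print(annual_cost_saving, annual_revenue_gain)
```

The recomputed values agree with the text up to rounding: the revenue gain of Model 3 over Model 2 evaluates to about 12.98%, and the annualized figures are 48910 CNY and 263530 CNY.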

Overall, the joint optimization of ADN and P2P markets can reduce the feeder voltage drop and avoid violating the voltage constraints. Meanwhile, the economic cost paid by the market members to ensure system security in Model 3 is much smaller than that in Model 2 and close to that in Model 1. For all members in the ADN, system security should be the primary concern. Therefore, this paper concludes that trading a smaller economic cost for safer system operation is reasonable.

D. Comparison of Proposed SAC-DTC Algorithm with Centralized Algorithm

In order to verify the accuracy and scalability of the proposed SAC-DTC algorithm, its computational results are compared with those of the mixed-integer second-order cone programming (MISOCP) based centralized algorithm in IEEE 33-, 69-, and 136-node systems, with the specific settings shown in Table VI. The MISOCP model is a mathematical optimization technique that integrates integer variables into the second-order cone programming (SOCP) model, making it particularly suitable for complex power system applications. The validity and accuracy of the MISOCP model have been widely demonstrated, with a speedup ratio of about six times that of the traditional optimal power flow (OPF) model for small-scale test systems [9]. Since the SOCP model is convex, the global optimal solution can be guaranteed, further enhancing the reliability of the MISOCP model in practical applications. The ADN operation costs and P2P market revenues calculated by the proposed SAC-DTC algorithm and the MISOCP-based centralized algorithm are compared in Table VII. From the perspective of ADN operation costs and P2P market revenues, while the strategy optimization of the proposed SAC-DTC algorithm for complex systems may converge toward the global optimum [10], it exhibits some discrepancies compared with the results from the MISOCP-based centralized algorithm. However, with the increase in the number of devices and prosumers in the P2P market and ADN, the resources and computation time required by the MISOCP-based centralized algorithm increase exponentially, and it cannot meet the requirements of timeliness in the electricity market. In addition, all prosumers must communicate bi-directionally and share sufficient information with the central organization, which places high demands on the communication system and does not guarantee data privacy [28].

TABLE VI Specific Setting of IEEE 33-, 69-, and 136-node Systems

System	Number of prosumers	Number of CBs	Number of SVGs	Number of DERs	Number of ESSs
IEEE 33-node	5	2	2	2	1
IEEE 69-node	28	4	5	2	1
IEEE 136-node	40	6	8	8	2

TABLE VII Comparison in ADN Operation Costs, P2P Market Revenues, and Computation Time

Algorithm	System	ADN operation cost (CNY)	P2P market revenue (CNY)	Computation time (s)
MISOCP-based centralized algorithm	IEEE 33-node	5752	24556	20.90
MISOCP-based centralized algorithm	IEEE 69-node	26248	73784	143.00
MISOCP-based centralized algorithm	IEEE 136-node	45380	207560	501.00
Proposed SAC-DTC algorithm	IEEE 33-node	6072	24508	4.27
Proposed SAC-DTC algorithm	IEEE 69-node	27108	73743	14.50
Proposed SAC-DTC algorithm	IEEE 136-node	47937	207440	33.80

In the IEEE 33-, 69-, and 136-node systems, the ADN operation costs obtained by the proposed SAC-DTC algorithm are slightly higher than those by the MISOCP-based centralized algorithm, while the P2P market revenues are almost the same. Based on the characteristics of distributed computation, the proposed SAC-DTC algorithm can effectively protect the privacy information, and the computation speed is 4.9, 9.8, and 14.8 times faster than that of the MISOCP-based centralized algorithm in the IEEE 33-, 69-, and 136-node systems, respectively.
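The speedup ratios and the optimality gaps can be recomputed from Table VII. The following Python sketch is included only as a cross-check of the reported numbers; the dictionary layout and variable names are illustrative.

```python
# (ADN cost in CNY, P2P revenue in CNY, computation time in s) from Table VII.
misocp = {
    "IEEE 33-node": (5752, 24556, 20.90),
    "IEEE 69-node": (26248, 73784, 143.00),
    "IEEE 136-node": (45380, 207560, 501.00),
}
sac_dtc = {
    "IEEE 33-node": (6072, 24508, 4.27),
    "IEEE 69-node": (27108, 73743, 14.50),
    "IEEE 136-node": (47937, 207440, 33.80),
}

speedup, cost_gap_pct, revenue_gap_pct = {}, {}, {}
for system, (cost_c, rev_c, time_c) in misocp.items():
    cost_d, rev_d, time_d = sac_dtc[system]
    speedup[system] = time_c / time_d                        # centralized / distributed
    cost_gap_pct[system] = 100.0 * (cost_d - cost_c) / cost_c  # extra ADN cost
    revenue_gap_pct[system] = 100.0 * (rev_c - rev_d) / rev_c  # lost P2P revenue

for system in misocp:
    print(system, f"{speedup[system]:.2f}x",
          f"{cost_gap_pct[system]:.2f}%", f"{revenue_gap_pct[system]:.3f}%")
```

The ratios evaluate to about 4.89, 9.86, and 14.82, consistent with the reported 4.9, 9.8, and 14.8 up to rounding. The ADN cost gap stays below 6% and the P2P revenue gap below 0.2% in all three systems.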

In addition, the linearization of voltage mapping in the proposed SAC-DTC algorithm may introduce some errors in the final results. Therefore, we perform power flow calculations using the proposed SAC-DTC algorithm and the MISOCP-based centralized algorithm, and compare the node voltages. As shown in Table VIII and Fig. 8, the maximum error in voltage magnitude is within 0.3% and the average error is not larger than 0.08% for the power flow calculation, which indicates that the proposed SAC-DTC algorithm has high computation accuracy.

TABLE VIII Analysis of Error in Voltage Magnitude

System	Maximum error in voltage magnitude (%)	Minimum error in voltage magnitude (%)	Average error in voltage magnitude (%)
IEEE 33-node	0.249	1.10×10^-4	0.0281
IEEE 69-node	0.269	1.60×10^-4	0.0528
IEEE 136-node	0.258	1.51×10^-3	0.0735

Fig. 8 Error in voltage magnitude. (a) IEEE 33-node system. (b) IEEE 69-node system. (c) IEEE 136-node system.

Therefore, the proposed SAC-DTC algorithm is more suitable for the fast-changing operation of ADN and P2P markets to meet the real-time demand.

VI. Conclusion

In this paper, an SAC-DTC algorithm based on data-driven and physical modeling is proposed to tackle the coordinated optimization problem of ADN and P2P energy trading, which is analyzed via simulation based on the real-world dataset. The results show that the proposed SAC-DTC algorithm can effectively reduce the ADN operation cost and increase the P2P market revenue under the network security constraints. Specifically, the conclusions can be summarized as follows.

1) Compared with the mainstream DDPG and PPO algorithms with the same network structure, the agents trained by the proposed SAC-DTC algorithm perform better in terms of the training speed and convergence results.

2) Considering the network security constraints, the proposed SAC-DTC algorithm for coordinated optimization can reduce the ADN operation cost by 8.3% and increase the P2P market revenue by 12.9% on average.

3) In the IEEE 33-, 69-, and 136-node systems, the proposed SAC-DTC algorithm effectively protects the privacy of prosumers although the ADN operation cost is slightly higher compared with the traditional MISOCP-based centralized algorithm. The computation speed is 4.9, 9.8, and 14.8 times faster, and the voltage magnitude error is no more than 0.08% on average.

Future work will investigate additional scenarios, including the integration of electrical, thermal, and cooling energy systems for consumers. Moreover, efforts will be made to deploy larger-scale networks utilizing multiple agents to manage complex coordination tasks involving both discrete and continuous actions. Additionally, there will be a focus on optimizing the linearization process to further enhance accuracy.

References

[1] K. H. M. Azmi, N. A. M. Radzi, N. A. Azhar et al., “Active electric distribution network: applications, challenges, and opportunities,” IEEE Access, vol. 10, pp. 134655-134689, Dec. 2022.

[2] Z. Yang, H. Li, and H. Zhang, “Dynamic collaborative pricing for managing refueling demand of hydrogen fuel cell vehicles,” IEEE Transactions on Transportation Electrification, vol. PP, no. 99, pp. 1-1, Mar. 2024.

[3] S. Gorbachev, A. Mani, L. Li et al., “Distributed energy resources based two-layer delay-independent voltage coordinated control in active distribution network,” IEEE Transactions on Industrial Informatics, vol. 20, no. 2, pp. 1220-1230, Feb. 2024.

[4] Z. Deng, M. Liu, H. Chen et al., “Optimal scheduling of active distribution networks with limited switching operations using mixed-integer dynamic optimization,” IEEE Transactions on Smart Grid, vol. 10, no. 4, pp. 4221-4234, Jul. 2019.

[5] H. Zhu and H. Liu, “Fast local voltage control under limited reactive power: optimality and stability analysis,” IEEE Transactions on Power Systems, vol. 31, no. 5, pp. 3794-3803, Sept. 2016.

[6] H. Liu and W. Wu, “Online multi-agent reinforcement learning for decentralized inverter-based volt-var control,” IEEE Transactions on Smart Grid, vol. 12, no. 4, pp. 2980-2990, Jul. 2021.

[7] Q. Yang, G. Wang, A. Sadeghi et al., “Two-timescale voltage control in distribution grids using deep reinforcement learning,” IEEE Transactions on Smart Grid, vol. 11, no. 3, pp. 2313-2323, May 2020.

[8] W. Shi, D. Zhang, X. Han et al., “Coordinated operation of active distribution network, networked microgrids, and electric vehicle: a multi-agent PPO optimization method,” CSEE Journal of Power and Energy Systems, doi: 10.17775/CSEEJPES.2022.05640

[9] M. Mansourlakouraj, M. Gautam, H. Livani et al., “Multi-stage volt/var support in distribution grids: risk-aware scheduling with real-time reinforcement learning control,” IEEE Access, vol. 11, pp. 54822-54838, May 2023.

[10] A. R. Sayed, C. Wang, H. I. Anis et al., “Feasibility constrained online calculation for real-time optimal power flow: a convex constrained deep reinforcement learning approach,” IEEE Transactions on Power Systems, vol. 38, no. 6, pp. 5215-5227, Nov. 2023.

[11] D. Cao, W. Hu, X. Xu et al., “Deep reinforcement learning based approach for optimal power flow of distribution networks embedded with renewable energy and storage devices,” Journal of Modern Power Systems and Clean Energy, vol. 9, no. 5, pp. 1101-1110, Sept. 2021.

[12] H. Liu, W. Wu, and Y. Wang, “Bi-level off-policy reinforcement learning for two-timescale volt/var control in active distribution networks,” IEEE Transactions on Power Systems, vol. 38, no. 1, pp. 385-395, Jan. 2023.

[13] K. Schmitt, R. Bhatta, M. Chamana et al., “A review on active customers participation in smart grids,” Journal of Modern Power Systems and Clean Energy, vol. 11, no. 1, pp. 3-16, Jan. 2023.

[14] W. Tushar, T. K. Saha, C. Yuen et al., “Peer-to-peer trading in electricity networks: an overview,” IEEE Transactions on Smart Grid, vol. 11, no. 4, pp. 3185-3200, Jul. 2020.

[15] W. Tushar, C. Yuen, T. K. Saha et al., “Peer-to-peer energy systems for connected communities: a review of recent advances and emerging challenges,” Applied Energy, vol. 282, p. 116131, Jan. 2021.

[16] Y. Zou, Y. Xu, X. Feng et al., “Transactive energy systems in active distribution networks: a comprehensive review,” CSEE Journal of Power and Energy Systems, vol. 8, no. 5, pp. 1302-1317, Sept. 2022.

[17] D. Han, L. Wu, X. Ren et al., “Calculation model and allocation strategy of network usage charge for peer-to-peer and community-based energy transaction market,” Journal of Modern Power Systems and Clean Energy, vol. 11, no. 1, pp. 144-155, Jan. 2023.

[18] T. AlSkaif, J. L. Crespo-Vazquez, M. Sekuloski et al., “Blockchain-based fully peer-to-peer energy trading strategies for residential energy systems,” IEEE Transactions on Industrial Informatics, vol. 18, no. 1, pp. 231-241, Jan. 2022.

[19] F. Luo, Z. Y. Dong, G. Liang et al., “A distributed electricity trading system in active distribution networks based on multi-agent coalition and blockchain,” IEEE Transactions on Power Systems, vol. 34, no. 5, pp. 4097-4108, Sept. 2019.

[20] X. Yang, G. Wang, H. He et al., “Automated demand response framework in ELNs: decentralized scheduling and smart contract,” IEEE Transactions on Systems, Man, and Cybernetics: Systems, vol. 50, no. 1, pp. 58-72, Jan. 2020.

[21] J. Zheng, Z. Liang, Y. Li et al., “Multi-agent reinforcement learning with privacy preservation for continuous double auction-based P2P energy trading,” IEEE Transactions on Industrial Informatics, vol. 20, no. 4, pp. 6582-6590, Apr. 2024.

[22] L. Chen, N. Liu, and J. Wang, “Peer-to-peer energy sharing in distribution networks with multiple sharing regions,” IEEE Transactions on Industrial Informatics, vol. 16, no. 11, pp. 6760-6771, Nov. 2020.

[23] L. Wang, Y. Zhang, W. Song et al., “Stochastic cooperative bidding strategy for multiple microgrids with peer-to-peer energy trading,” IEEE Transactions on Industrial Informatics, vol. 18, no. 3, pp. 1447-1457, Mar. 2022.

[24] J. Li, C. Zhang, Z. Xu et al., “Distributed transactive energy trading framework in distribution networks,” IEEE Transactions on Power Systems, vol. 33, no. 6, pp. 7215-7227, Nov. 2018.

[25] W. Tushar, B. Chai, C. Yuen et al., “Energy storage sharing in smart grid: a modified auction-based approach,” IEEE Transactions on Smart Grid, vol. 7, no. 3, pp. 1462-1475, May 2016.

[26] W. Lee, L. Xiang, R. Schober et al., “Direct electricity trading in smart grid: a coalitional game analysis,” IEEE Journal on Selected Areas in Communications, vol. 32, no. 7, pp. 1398-1411, Jul. 2014.

[27] N. Liu, X. Yu, C. Wang et al., “Energy sharing management for microgrids with PV prosumers: a Stackelberg game approach,” IEEE Transactions on Industrial Informatics, vol. 13, no. 3, pp. 1088-1098, Jun. 2017.

[28] Y. Liu, C. Sun, A. Paudel et al., “Fully decentralized P2P energy trading in active distribution networks with voltage regulation,” IEEE Transactions on Smart Grid, vol. 14, no. 2, pp. 1466-1481, Mar. 2023.

[29] Y. Jia, C. Wan, and B. Li, “Strategic peer-to-peer energy trading framework considering distribution network constraints,” Journal of Modern Power Systems and Clean Energy, vol. 11, no. 3, pp. 770-780, May 2023.

[30] Y. Zhou, B. Zhang, C. Xu et al., “A data-driven method for fast AC optimal power flow solutions via deep reinforcement learning,” Journal of Modern Power Systems and Clean Energy, vol. 8, no. 6, pp. 1128-1139, Nov. 2020.

[31] D. Cao, J. Zhao, W. Hu et al., “Data-driven multi-agent deep reinforcement learning for distribution system decentralized voltage control with high penetration of PVs,” IEEE Transactions on Smart Grid, vol. 12, no. 5, pp. 4137-4150, Sept. 2021.

[32] P. Giselsson, “Improved dual decomposition for distributed model predictive control,” IFAC Proceedings Volumes, vol. 47, no. 3, pp. 1203-1209, Oct. 2014.

[33] C. Feng, B. Liang, Z. Li et al., “Peer-to-peer energy trading under network constraints based on generalized fast dual ascent,” IEEE Transactions on Smart Grid, vol. 14, no. 2, pp. 1441-1453, Mar. 2023.

[34] A. Beck and M. Teboulle, “A fast iterative shrinkage-thresholding algorithm for linear inverse problems,” SIAM Journal on Imaging Sciences, vol. 2, no. 1, pp. 183-202, Jan. 2009.

[35] Y. Zhang and Z. Ren, “Optimal reactive power dispatch considering costs of adjusting the control devices,” IEEE Transactions on Power Systems, vol. 20, no. 3, pp. 1349-1356, Aug. 2005.

[36] Z. Li, L. Wu, and Y. Xu, “Risk-averse coordinated operation of a multi-energy microgrid considering voltage/var control and thermal flow: an adaptive stochastic approach,” IEEE Transactions on Smart Grid, vol. 12, no. 5, pp. 3914-3927, Sept. 2021.

[37] X. Chang, Y. Xu, H. Sun et al., “Privacy-preserving distributed energy transaction in active distribution networks,” IEEE Transactions on Power Systems, vol. 38, no. 4, pp. 3413-3426, Jul. 2023.

[38] P. Sunehag, G. Lever, A. Gruslys et al. (2017, Jun.). Value-decomposition networks for cooperative multi-agent learning. [Online]. Available: https://arxiv.org/abs/1706.05296

[39] Z. Zhang, C. Dou, D. Yue et al., “Regional coordinated voltage regulation in active distribution networks with PV-BESS,” IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 70, no. 2, pp. 596-600, Feb. 2023.

[40] Y. Zhang, Y. Han, D. Liu et al., “Low-carbon economic dispatch of electricity-heat-gas integrated energy systems based on deep reinforcement learning,” Journal of Modern Power Systems and Clean Energy, vol. 11, no. 6, pp. 1827-1841, Nov. 2023.

[41] R. S. Sutton and A. G. Barto. (2024, Apr.). Reinforcement learning: an introduction. [Online]. Available: https://books.google.com/books?hl=en&lr=&id=uWV0DwAAQBAJ&oi=fnd&pg=PR7&dq=info:t8N5xiW9 bXoJ:scholar.google.com&ots=mjoHs_Z0k1&sig=CKvWTrZ0FoBPRCmO4-Yoo4uv5z0
