Journal of Modern Power Systems and Clean Energy

ISSN 2196-5625 CN 32-1884/TK

Dynamic Nonlinear Droop-based Fast Frequency Regulation for Power Systems with Flexible Resources Using Meta-reinforcement Learning Approach

  • Yuxin Ma 1 (Student Member, IEEE)
  • Zechun Hu 1 (Senior Member, IEEE)
  • Yonghua Song 2 (Fellow, IEEE)
1. Department of Electrical Engineering, Tsinghua University, Beijing 100084, China; 2. State Key Laboratory of Internet of Things for Smart City, University of Macau, Macau, China

Updated:2025-03-26

DOI:10.35833/MPCE.2024.000062

Abstract

The increasing penetration of renewable energy resources and reduced system inertia pose risks to frequency security of power systems, necessitating the development of fast frequency regulation (FFR) methods using flexible resources. However, developing effective FFR policies is challenging because different power system operating conditions require distinct regulation logics. Traditional fixed-coefficient linear droop-based control methods are suboptimal for managing the diverse conditions encountered. This paper proposes a dynamic nonlinear P-f droop-based FFR method using a newly established meta-reinforcement learning (meta-RL) approach to enhance control adaptability while ensuring grid stability. First, we model the optimal FFR problem under various operating conditions as a set of Markov decision processes and accordingly formulate the frequency stability-constrained meta-RL problem. To address this, we then construct a novel hierarchical neural network (HNN) structure that incorporates a theoretical frequency stability guarantee, thereby converting the constrained meta-RL problem into a more tractable form. Finally, we propose a two-stage algorithm that leverages the inherent characteristics of the problem, achieving enhanced optimality in solving the HNN-based meta-RL problem. Simulations validate that the proposed FFR method shows superior adaptability across different operating conditions, and achieves better trade-offs between regulation performance and cost than benchmarks.

I. Introduction

WITH the rapid advancement of the global power system transformation, traditional synchronous generators in power systems are gradually being replaced by renewable energy resources such as solar and wind energy. This shift results in lower system inertia and reduced primary frequency regulation (PFR) reserves, which threaten power system frequency security [1]. Additionally, the intermittency and uncertainty associated with wind and solar generation further increase the difficulty of frequency control. Traditional frequency support methods, which rely solely on traditional frequency regulation resources, are insufficient for ensuring the safe operation of power systems with high penetration of renewable energy resources. Consequently, it becomes essential to utilize emerging flexible resources such as wind and solar energy resources [2], battery energy storage [3], hybrid energy storage [4], and electric vehicle aggregators [5] to enhance the frequency support and improve the transient frequency dynamics of power systems.

Due to their mechanical characteristics, synchronous generators primarily achieve PFR through fixed-coefficient linear droop control. In contrast, flexible resources, connected to the grid via inverters, offer faster and more precise frequency response [6]. This enhanced control flexibility enables the development of customized frequency regulation standards for these resources. As a result, many transmission system operators have designed fast frequency regulation (FFR) services that utilize flexible resources to deliver rapid proportional or step frequency responses [7]. For instance, the enhanced frequency response service in the UK requires providers, predominantly storage assets, to respond proportionally to the system frequency within 1 s after the frequency leaves the deadband, while the response time of traditional PFR resources is around 10 s [8]. In the Texas power system, FFR resources provide step responses within 0.25 s once the frequency falls below 59.85 Hz [9]. In addition, existing research has developed modified P-f droop-based control methods for flexible resource-based FFR. For instance, a variable P-f droop-based control is proposed in [10], which consists of two fixed droop coefficients activated at different frequency levels. In [11], the linear P-f droop-based FFR signals are decomposed into low- and high-frequency components and delivered to different flexible resources. In addition to linear and piece-wise linear control methods, some nonlinear FFR strategies have been designed for flexible resources in [12]-[14] to achieve improved control performance.

The above-mentioned FFR services all adopt static control laws with fixed droop curves, which lack adaptability to varying operating conditions. Considering the superior control flexibility of new resources, some dynamic FFR strategies have been proposed to enhance transient frequency dynamics and improve the cost-efficiency of frequency regulation. An asymmetric droop coefficient optimization method is proposed in [15] to realize robust and cost-efficient FFR provided by wind turbines and demand response resources. The droop coefficients can be dynamically updated in a centralized manner but at a limited rate due to heavy communication and computational burdens. Hierarchical FFR schemes proposed in [16]-[18] also require high-quality communication and online optimization.

Some existing studies leverage reinforcement learning (RL) methods to develop dynamic FFR policies for flexible resources. Well-trained RL controllers can avoid online optimization and reduce the computational burden during practical implementation. Reference [19] proposed an RL-based distributed update policy for adjusting the inertia and droop coefficients of multiple virtual synchronous generators to suppress power oscillations under various disturbance sizes. However, this policy still requires communication with adjacent nodes. Reference [20] proposed an RL-based FFR controller for battery energy storage systems that relies solely on local frequency measurements. Although the methods in [19] and [20] enhance control flexibility, they cannot guarantee system stability, which is a common challenge in applying RL methods to power system control problems. Reference [12] developed an RL-based static FFR method that ensures the frequency stability through a single-input-single-output neural network structure. However, over-strict network structure constraints, such as the single-layer requirement and the single-input limit, restrict the generalization of this static method to a dynamic type.

Existing RL-based FFR methods typically assume that system frequency dynamics can be modeled as a single Markov decision process (MDP). However, these dynamics actually vary significantly with the size of load disturbances. Given the randomness and diversity of load disturbances in actual power systems, it is more appropriate to consider the optimal FFR problem as achieving fast adaptation to any MDP sampled from a distribution. Traditional RL algorithms often solve each MDP independently and can hardly realize the rapid adaptation required in the FFR context. Meta-reinforcement learning (meta-RL) is a promising method to solve this problem, whose core idea is to learn data-efficient RL algorithms capable of producing policies that adapt well to various MDPs with minimal data [21]. Various meta-RL algorithms [22], [23] have been proposed and applied across different domains, including power system operation and control. For instance, [24] proposed an optimal load frequency control method for interconnected microgrids using a meta-RL framework, and [25] focused on meta-RL-based grid voltage emergency control. However, these methods often lack theoretical guarantees for frequency or voltage stability. Applying meta-RL to the optimal FFR problem requires careful consideration to ensure frequency stability.

In summary, the research gaps are as follows. Firstly, existing FFR methods are predominantly based on linear static droop control schemes or dynamic approaches burdened by heavy computation or communication demands. These methods fail to fully utilize the potential of flexible resources and lack adaptability to varying sizes of random load disturbances. Secondly, while RL methods offer potential for adaptive FFR with low computational burden during implementation, their effectiveness is limited by imperfect problem formulations in existing literature and concerns about stability guarantees. To address these gaps, this paper develops a dynamic nonlinear P-f droop-based FFR method using a newly established meta-RL approach to ensure both adaptability and stability. The proposed FFR method is applicable to various flexible resources integrated into power systems through power electronic inverters, presenting a possible solution for enhancing frequency stability in future power systems with high penetration of inverter-based generation. The main contributions can be summarized as follows.

1) The dynamic nonlinear FFR optimization problem is formulated as a frequency stability-constrained meta-RL problem, which leverages flexible resources to achieve stable FFR with fast adaptation to randomly varying load disturbances.

2) A hierarchical neural network (HNN) structure is proposed to parameterize dynamic nonlinear droop-based FFR policies with a theoretical frequency stability guarantee, converting the proposed meta-RL problem into a more tractable form.

3) A two-stage algorithm is specifically designed to solve the HNN-based meta-RL problem with enhanced optimality.

4) Simulations demonstrate that the proposed method provides FFR policies with superior adaptability, achieving a better balance between frequency quality and regulation cost compared with benchmark methods.

The rest of this paper is organized as follows. Section II describes the system model for controller optimization and simulation and the system model for theoretical analysis. Section III first models the optimal FFR as a stochastic optimization and then reformulates it into a constrained meta-RL problem. The HNN architecture is proposed in Section IV, and Section V presents the two-stage algorithm to solve the HNN-based meta-RL problem. Numerical simulation results are presented in Section VI. Finally, conclusions are drawn in Section VII.

II. System Model

A. System Model for Controller Optimization and Simulation

Considering that a control area may contain numerous flexible resources, this paper adopts the centralized optimization and distributed execution scheme for convenience of application and supervision in practical power systems. During the optimization stage, we design an aggregated FFR controller, denoted as $u$, based on the system frequency response (SFR) model of the target control area, as illustrated in Fig. 1, where synchronous generators and flexible resources in the target control area are aggregated into equivalent blocks, respectively. The analytical approach for the model aggregation can be found in [26].

Fig. 1  Block diagram of target control area.

All variables in Fig. 1 represent deviations. $\omega$ denotes the center-of-inertia (CoI) frequency. $p_v$, $p_t$, $p_m$, and $p_{\mathrm{inv}}$ denote the governor valve displacement, power deviation during steam reheat, mechanical output of generators, and flexible resource output, respectively. $p_{\mathrm{pfr}}$ denotes the PFR output of synchronous generators. The control flexibility of flexible resources enables the design of a sophisticated logic for $u$ to achieve desired control performance. $l$ denotes the net load disturbance consisting of renewable power generation fluctuations, load variations, and tie-line power deviations. $T_g$, $T_r$, $T_{\mathrm{ch}}$, and $T_{\mathrm{inv}}$ denote the time constants of the equivalent governor, reheater, turbine, and inverter, respectively. $F_{\mathrm{hp}}$ is the fraction of total turbine power. $M$ and $D$ denote the system inertia and load-damping coefficient, respectively. Synchronous generators are required to perform traditional PFR with a fixed linear droop coefficient $1/R$. In addition, a proportional-integral (PI) type automatic generation controller (AGC) is considered, with integral gain $K_i$ and proportional gain $K_p$. The AGC operates in flat frequency control mode, with the area control error (ACE) calculated as $s_{\mathrm{ace}} = \beta\omega$, where $\beta$ denotes the frequency bias parameter. The command generated by AGC is denoted as $s_{\mathrm{agc}}$, which is allocated to generators and flexible resources according to their participation factors $\alpha_g$ and $\alpha_{\mathrm{inv}}$.

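The system dynamics can be represented by the following state-space equations:

$$x = \left[p_v,\ p_t,\ p_m,\ p_{\mathrm{inv}},\ \omega,\ \textstyle\int \omega\,\mathrm{d}t\right] \tag{1a}$$
$$\frac{\mathrm{d}}{\mathrm{d}t}\int \omega\,\mathrm{d}t = \omega, \qquad \dot{\omega} = \frac{1}{M}\left(p_m + p_{\mathrm{inv}} - D\omega - l\right) \tag{1b}$$
$$\dot{p}_t = \frac{F_{\mathrm{hp}}}{T_g}\left(-p_{\mathrm{pfr}} + \alpha_g s_{\mathrm{agc}}\right) + \frac{T_g - F_{\mathrm{hp}} T_r}{T_r T_g}\, p_v - \frac{1}{T_r}\, p_t \tag{1c}$$
$$\dot{p}_{\mathrm{inv}} = \frac{1}{T_{\mathrm{inv}}}\left(-u - p_{\mathrm{inv}} + \alpha_{\mathrm{inv}} s_{\mathrm{agc}}\right) \tag{1d}$$
$$\dot{p}_v = \frac{1}{T_g}\left(-p_{\mathrm{pfr}} + \alpha_g s_{\mathrm{agc}} - p_v\right) \tag{1e}$$
$$\dot{p}_m = \frac{1}{T_{\mathrm{ch}}}\left(p_t - p_m\right), \qquad s_{\mathrm{agc}} = -K_p \beta \omega - K_i \beta \int \omega\,\mathrm{d}t \tag{1f}$$
$$p_{\mathrm{pfr}} = \frac{1}{R}\left(\max\{\omega - \omega_{\mathrm{db}},\ 0\} + \min\{\omega + \omega_{\mathrm{db}},\ 0\}\right) \tag{1g}$$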

where $x$ is the state vector; and $\omega_{\mathrm{db}}$ is the deadband width for generators.
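As a rough illustration of how model (1) can be integrated numerically, the sketch below applies forward Euler to (1b)-(1g) with a linear FFR law $u = k\omega$. All numerical parameter values, the function name `simulate_sfr`, and the choice of a linear law are assumptions made for illustration only; they are not taken from the paper.

```python
def simulate_sfr(load_step, ffr_gain, t_end=20.0, dt=0.005):
    """Forward-Euler integration of model (1) with u = ffr_gain * omega."""
    # Illustrative per-unit parameters (assumed, not from the paper).
    M, D, R = 8.0, 1.0, 0.05
    T_g, T_r, T_ch, T_inv, F_hp = 0.2, 7.0, 0.3, 0.05, 0.3
    K_p, K_i, beta = 0.1, 0.05, 21.0
    alpha_g, alpha_inv = 0.5, 0.5
    w_db = 0.0003  # generator deadband

    p_v = p_t = p_m = p_inv = w = w_int = 0.0
    trace = []
    for _ in range(int(round(t_end / dt))):
        s_agc = -K_p * beta * w - K_i * beta * w_int            # (1f), AGC command
        p_pfr = (max(w - w_db, 0.0) + min(w + w_db, 0.0)) / R   # (1g)
        u = ffr_gain * w                                        # assumed linear FFR law
        gov_in = -p_pfr + alpha_g * s_agc                       # governor input
        d_w = (p_m + p_inv - D * w - load_step) / M             # (1b)
        d_pt = (F_hp / T_g) * gov_in \
            + (T_g - F_hp * T_r) / (T_r * T_g) * p_v - p_t / T_r  # (1c)
        d_pinv = (-u - p_inv + alpha_inv * s_agc) / T_inv       # (1d)
        d_pv = (gov_in - p_v) / T_g                             # (1e)
        d_pm = (p_t - p_m) / T_ch                               # (1f), turbine
        # Update all states from derivatives evaluated at the old state.
        w_int += dt * w
        w += dt * d_w
        p_t += dt * d_pt
        p_inv += dt * d_pinv
        p_v += dt * d_pv
        p_m += dt * d_pm
        trace.append(w)
    return trace


nadir_without_ffr = min(simulate_sfr(0.1, 0.0))
nadir_with_ffr = min(simulate_sfr(0.1, 20.0))
```

Under these assumed parameters, comparing the two nadirs gives a quick sanity check that the flexible resource response in (1d) arrests the frequency drop earlier than PFR alone.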

B. System Model for Theoretical Analysis

In this paper, the aggregated FFR controller designed in subsequent sections takes only local available information as inputs. During the application, the aggregated controller is decomposed into distributed controllers by multiplying different participation factors depending on the regulation capacity of each flexible resource. Distributed controllers work with the locally measured frequency, which can be different with the CoI frequency considered in the SFR model. Consequently, the transient frequency stability analysis should consider the specific network structure and frequency differences across the target control area, such that the frequency stability is guaranteed during the practical operation.

We denote the target control area by an undirected connected graph $(\mathcal{V}, \mathcal{E})$, where $\mathcal{V}$ is the set of lossless buses indexed by $i$ or $j \in \{1, 2, \ldots, n\}$, and $\mathcal{E}$ is the set of transmission lines indexed by $(i,j) \in \{(i,j) \mid i, j \in \mathcal{V},\ i \neq j\}$. Each bus is equipped with an equivalent generator and an equivalent flexible resource unit aggregated from the connected resources. The system dynamics model in [12] is used for theoretical stability analysis, which can be formulated as the following state-space equations:

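$$\dot{\theta}_i = \omega_i \tag{2a}$$
$$\dot{\omega}_i = \frac{1}{M_i}\left(-l_i - \left(D_i + \frac{1}{R_i}\right)\omega_i - u_i - \sum_{j=1}^{n} B_{ij} \sin\left(\theta_i - \theta_j\right)\right) \tag{2b}$$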

where $\omega_i$, $\theta_i$, $u_i$, $l_i$, $M_i$, $D_i$, and $R_i$ are the local frequency, phase angle, distributed FFR control signal, net load disturbance, inertia, load-damping coefficient, and synchronous generator droop coefficient at bus $i$, respectively; and $B_{ij}$ is the susceptance of line $(i,j)$. All variables in (2) represent deviations from their nominal values. Note that the AGC is omitted in (2) because it operates at a slower pace in practical power systems and therefore has limited effect on the transient frequency stability. The generator dynamics are simplified as a classical second-order model widely used in existing literature. The inverter dynamics are omitted because the inverter time constant is much smaller than that of the generator.
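To make the structure of (2) concrete, the following sketch integrates a two-bus instance with forward Euler. The parameter values, the `tanh`-shaped local controller `u_local`, and the function names are hypothetical choices for illustration; the only property borrowed from the text is that each local controller depends on the local frequency alone.

```python
import math


def u_local(w, share):
    """Hypothetical monotone local droop curve scaled by a positive share."""
    return share * (0.2 * math.tanh(300.0 * w) + 5.0 * w)


def simulate_two_bus(loads=(0.1, 0.0), t_end=60.0, dt=0.001):
    """Forward-Euler integration of (2) for two buses joined by one line."""
    M = (8.0, 6.0)          # inertia at each bus (assumed)
    damp = (21.0, 16.0)     # D_i + 1/R_i at each bus (assumed)
    share = (0.6, 0.4)      # positive participation factors (assumed)
    B = 10.0                # line susceptance (assumed)
    theta = [0.0, 0.0]
    w = [0.0, 0.0]
    for _ in range(int(round(t_end / dt))):
        flow = B * math.sin(theta[0] - theta[1])
        # (2b): the coupling term enters the two buses with opposite signs.
        dw0 = (-loads[0] - damp[0] * w[0] - u_local(w[0], share[0]) - flow) / M[0]
        dw1 = (-loads[1] - damp[1] * w[1] - u_local(w[1], share[1]) + flow) / M[1]
        theta[0] += dt * w[0]   # (2a)
        theta[1] += dt * w[1]
        w[0] += dt * dw0
        w[1] += dt * dw1
    return w, theta


final_w, final_theta = simulate_two_bus()
```

In this toy case the two local frequencies settle to a common value and the angle difference stays well inside $\pi/2$. Because (2) has no AGC, the common frequency deviation does not return to zero.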

A static droop controller for flexible resources without linearity requirement can be denoted as $u_i(\omega_i)$, taking only local frequency measurement as input. Theorem 1 gives a sufficient condition for the frequency stability of system (2) under $u_i(\omega_i)$, which will be applied in the subsequent dynamic controller optimization.

Theorem 1 [12]   Suppose the controller $u_i(\omega_i)$, $i \in \{1, 2, \ldots, n\}$, is monotonically increasing with respect to the local frequency $\omega_i$, and the phase angles at the equilibrium satisfy $|\theta_i^* - \theta_j^*| \in [0, \pi/2)$ for all buses $i$ connected to $j$. Then, system (2) has a unique equilibrium that is locally exponentially stable.

Proofs can be found in [12]. According to [12], the phase angle constraint $|\theta_i^* - \theta_j^*| \in [0, \pi/2)$ is satisfied under most of the practical operating conditions. Therefore, the monotonicity of all flexible resource controllers can be considered as a sufficient condition for the system frequency stability, regardless of the power network topology. This topology-independent sufficient condition indicates that it is a practical and scalable method to first optimize an aggregated FFR droop curve based on the SFR model (1), and then decompose the curve by multiplying different positive participation factors. The distributed execution of these decomposed controllers will guarantee the system frequency stability as long as the aggregated FFR droop curve is monotonic w.r.t. the system frequency.
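The decomposition step can be sketched as follows. The aggregated curve, the factor values, and the assumption that the factors sum to one are illustrative choices, not details given in the paper. The point is only that scaling a monotone curve by a positive factor keeps it monotone.

```python
import math


def aggregated_curve(w):
    """Hypothetical monotone nonlinear aggregated droop curve."""
    return 0.2 * math.tanh(300.0 * w) + 5.0 * w


def decompose(curve, factors):
    """Split an aggregated curve into local curves u_i = c_i * u."""
    # Assumption: positive factors that sum to one.
    assert all(c > 0 for c in factors)
    assert abs(sum(factors) - 1.0) < 1e-9
    return [lambda w, c=c: c * curve(w) for c in factors]


def is_increasing(fn, grid):
    """Check strict monotonicity of fn on an increasing grid."""
    values = [fn(w) for w in grid]
    return all(b > a for a, b in zip(values, values[1:]))


grid = [(-50 + i) * 1e-4 for i in range(101)]
local_curves = decompose(aggregated_curve, [0.5, 0.3, 0.2])
all_monotone = all(is_increasing(c, grid) for c in local_curves)
```

Each local curve then takes the locally measured frequency as its only input, which matches the distributed execution described above.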

III. Optimal Control Problem Formulation

In this section, we first describe the optimal FFR problem under random load disturbances from the perspective of stochastic optimization in Section III-A. Then, we show that this classical formulation can be tricky to solve if the control logic is complex. To address this, we reformulate the problem as a set of MDPs in Section III-B. Finally, in Section III-C, we formulate a frequency stability-constrained meta-RL problem to solve these MDPs.

A. Stochastic Optimization of FFR Controller

In this subsection, we formulate the optimal FFR problem as a stochastic optimization. To be specific, the frequency quality and regulation cost are balanced through a weighted sum type objective function, and the controller u is defined as a function of local measurements, including the system frequency, to facilitate distributed execution:

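$$\begin{aligned} \max_{u}\ & \mathbb{E}_{l}\left[J\right], \quad J = -j_1 - j_2 - j_3 \\ \text{s.t.}\ & j_1 = q_1 \sum_{t=0}^{T} |u_t| \\ & j_2 = q_2 \sum_{t=0}^{T} \omega_t^2 \\ & j_3 = q_3 \max_{t \in \{1, 2, \ldots, T\}} \omega_t^2 \\ & \underline{u} \le u \le \bar{u} \\ & \text{system dynamics (1)} \\ & \text{frequency stability guarantee} \end{aligned} \tag{3}$$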

where $J$ is the objective consisting of three terms $j_1$, $j_2$, and $j_3$, which denote the control cost, the sum of squared CoI frequency deviations, and the CoI frequency nadir (or peak), respectively; $q_1$, $q_2$, and $q_3$ are the weight coefficients; $\mathbb{E}_l$ is the expectation taken with respect to the random variable $l$, which follows a given distribution; $T$ is the duration when the frequency is outside the frequency deadband after each disturbance; $t$ is the index of timesteps with small intervals such as 0.1 s; and $\underline{u}$ and $\bar{u}$ are the total upward and downward regulation capacities of flexible resources in the target control area, respectively.

This optimization formulation casts the optimal FFR problem as an infinite-dimensional optimization, making it challenging to solve. Traditional linear droop control methods simplify the problem by assuming that $u$ is a linear function of the system frequency, i.e., $u = k\omega$, where a single coefficient $k$ is tuned to handle all scenarios. This reduction transforms the infinite-dimensional problem into a one-dimensional problem. However, this simplification leads to suboptimal performance for the following reasons. First, the linearity specification restricts the control flexibility. Flexible resources can provide nonlinear frequency responses, which have been shown in [12] to outperform linear approaches. Second, using a static $k$ to handle all scenarios may be insufficient for balancing frequency deviation and regulation cost across different operating conditions. Intuitively, a gentler droop curve is preferable for small load disturbances to avoid unnecessary power output adjustments of flexible resources, thus keeping frequency deviations within an acceptable range at a low cost. When large disturbances occur, however, steeper droop curves are needed to quickly arrest the frequency and ensure system frequency stability. A static control law represents a compromise for all possible scenarios, aiming for high performance on average. However, it may not be optimal for every specific situation, leaving significant room for improvement.
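The intuition that the preferred droop gain depends on the disturbance size can be checked on a toy example. The sketch below evaluates the objective of (3) for $u = k\omega$ on a one-state swing model without governor, inverter lag, or capacity limits. The model, the weights, the candidate gains, and the use of $|u_t|$ as control cost are all assumptions for illustration.

```python
def objective(k, load, q1=1.0, q2=100.0, q3=10.0, steps=100, dt=0.1):
    """J = -(j1 + j2 + j3) as in (3), for u = k * omega on a toy model."""
    M, D = 8.0, 1.0  # hypothetical inertia and damping
    w = 0.0
    cost_u = cost_w = peak = 0.0
    for _ in range(steps):
        u = k * w
        w += dt * (-D * w - u - load) / M  # one-state swing equation
        cost_u += abs(k * w)               # control effort term j1 / q1
        cost_w += w * w                    # squared deviation term j2 / q2
        peak = max(peak, w * w)            # nadir/peak term j3 / q3
    return -(q1 * cost_u + q2 * cost_w + q3 * peak)


def best_gain(load, gains=(0.0, 2.0, 5.0, 10.0, 20.0, 40.0)):
    """Pick the candidate gain with the highest objective for one disturbance."""
    return max(gains, key=lambda k: objective(k, load))


gain_small = best_gain(0.01)
gain_large = best_gain(0.2)
```

In this toy setting the control cost scales linearly with the disturbance while the frequency penalties scale quadratically, so the larger disturbance favors a steeper curve. A single static $k$ cannot be best for both cases.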

B. MDP Formulation

To address the above concerns, this paper removes the static linear type restriction and instead optimizes dynamic nonlinear controllers that can adapt rapidly to each specific disturbance event encountered during operation, although the disturbance sizes cannot be directly observed. To manage the infinite-dimensional challenge, we first reformulate the FFR optimization as a set of MDPs.

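For any fixed load disturbance $l$, the FFR process can be formulated as an MDP denoted as a 5-tuple $(\mathcal{S}, \mathcal{A}, r, P, \gamma)$ [27]. $\mathcal{S}$ is the continuous state space. The state vector at timestep $t$ can be denoted as $s_t = [\omega_t,\ \omega_{t-1},\ \int \omega\,\mathrm{d}t,\ p_{m,t},\ p_{v,t},\ p_{t,t},\ p_{\mathrm{inv},t}]$. $\mathcal{A}$ is the continuous action space. In this problem, the action $a_t \in \mathcal{A}$ taken at timestep $t$ is the FFR signal $u_t \in [\underline{u}, \bar{u}]$. $r: \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ is the reward function as shown in (4), which maps a state-action pair to a real number. $P: \mathcal{S} \times \mathcal{A} \to \Delta(\mathcal{S})$ is the transition kernel, i.e., the system dynamics represented as (1), which maps a state-action pair to a probability distribution over the state space, i.e., an element of $\Delta(\mathcal{S})$. $\gamma \in [0, 1]$ is a discount factor.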

$$r_t = -q_1 |u_t| - q_2 \omega_t^2 - \max\left\{0,\ \omega_t^2 - \omega_{t-1}^2\right\} \tag{4}$$
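One way to read the last term of (4) is as a per-step stand-in for the nadir penalty in (3). Summed over an episode that starts from zero deviation, the positive increments of $\omega_t^2$ add up to the peak of $\omega_t^2$ when the deviation rises to a single extreme and then recovers, and they exceed it if the deviation worsens again later. The sketch below illustrates this, assuming the control cost is $|u_t|$ and using illustrative weights.

```python
def step_reward(u_t, w_t, w_prev, q1=1.0, q2=100.0):
    """Reward (4) for one timestep, with |u_t| assumed as the control cost."""
    return -q1 * abs(u_t) - q2 * w_t ** 2 - max(0.0, w_t ** 2 - w_prev ** 2)


def peak_penalty(freqs):
    """Sum of the last term of (4) over an episode starting from omega = 0."""
    prev, total = 0.0, 0.0
    for w in freqs:
        total += max(0.0, w * w - prev * prev)
        prev = w
    return total


single_dip = [-0.1, -0.3, -0.4, -0.2, -0.1]   # one nadir, then recovery
double_dip = [-0.4, -0.2, -0.3]               # deviation worsens a second time
```

For `single_dip` the accumulated penalty matches the squared nadir, while for `double_dip` it is larger because the second deterioration is penalized as well.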

The FFR controller can be denoted as a policy $u(a \mid s): \mathcal{S} \times \mathcal{A} \to \mathbb{R}_{+}$, which maps states to action probabilities. We consider policies $u_\phi$ parameterized by neural network parameters $\phi$. A policy can interact with the MDP and collect episodes $\tau = \{(s_t, a_t, r_t)\}_{t=0}^{T}$ of length $T$. This paper defines an episode as a duration that starts when a load disturbance $l$ occurs and the system frequency deviates from a specific deadband, i.e., 0.015 Hz, and ends when the frequency is restored within the deadband.

Considering the stochastic load disturbances, the FFR optimization problem is actually a set of MDPs. Assume that the load disturbance $l$ occurring in different episodes follows a given distribution. Then, during each episode, the controller encounters an MDP $M$ sampled from the resulting distribution over MDPs, which share $\mathcal{S}$, $\mathcal{A}$, $r$, and $\gamma$ but differ in the dynamics $P$.

RL algorithms are widely used to find an optimal policy $u$ for an MDP, which maximizes the expected accumulated return within an episode, $\mathbb{E}\left[\sum_{t=0}^{T} \gamma^t r_t\right]$, based on the collected episodes. An RL algorithm can be defined as a function (5) [21], which maps the dataset $\mathcal{D} = \{\tau_h\}^{H}$ consisting of $H$ episodes of the target MDP to policy parameters $\phi \in \Phi$.

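$$f(\mathcal{D}): \left(\left(\mathcal{S} \times \mathcal{A} \times \mathbb{R}\right)^{T}\right)^{H} \to \Phi \tag{5}$$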

In traditional RL, $f$ is typically a classical algorithm, such as deep Q-network (DQN) [28], deep deterministic policy gradient (DDPG) [29], or proximal policy optimization (PPO) [30], used to learn the optimal policy parameters $\phi$. These algorithms solve each MDP independently, requiring the controller to go through numerous episodes with the same $l$ to collect sufficient training data. However, in practical power systems, $l$ is random and non-repetitive, necessitating rapid adaptation within each single episode, a capability that traditional RL algorithms struggle to achieve.

C. Frequency Stability-constrained Meta-RL Problem

To achieve fast adaptation to each disturbance event without destabilizing the system, we formulate a frequency stability-constrained meta-RL problem. Instead of a static policy $u_\phi$, we optimize a parameterized RL algorithm that can quickly learn the optimal $u_\phi$ for each MDP sampled from the MDP distribution, where each sampled MDP lasts for only one episode. With the objective of maximizing the expected return over the whole lifetime of the dynamic policy $u_\phi$, the stability-constrained meta-RL model can be formulated as (6), which includes two simultaneous learning loops.

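$$\begin{aligned} \max_{\theta}\ & \mathbb{E}_{M}\left[\mathbb{E}\left[\sum_{t=0}^{T} \gamma^t r_t \,\middle|\, f_\theta, u_\phi, M\right]\right] \\ \text{s.t.}\ & \text{stability guarantee} \end{aligned} \tag{6}$$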

where $\mathbb{E}_M$ denotes the expectation taken with respect to $M$; and $f_\theta$ is an RL algorithm parameterized by $\theta$. The outer loop learns $f_\theta$, while the inner loop, which shares a similar mechanism with traditional RL algorithms, applies the algorithm $f_\theta$ to dynamically update the control policy $u_\phi$ based on the interaction experience with MDPs. An update at timestep $t$ of an episode can be expressed as:

$$\phi \leftarrow f_\theta\left(\mathcal{D} = \{(s_i, a_i, r_i)\}_{i=0}^{t}\right) \tag{7}$$

where the dataset $\mathcal{D}$ is collected within the current episode under $M$ and is reset at the beginning of a new episode. An ideal $f_\theta$ must be data-efficient to enable effective adaptation within each episode.

Based on this meta-RL framework, we introduce nonlinearity through the neural network-based inner-loop policy $u_\phi$ and achieve dynamic control logic adjustment with the outer-loop RL algorithm $f_\theta$, which is capable of rapid adaptation.

IV. HNN Architecture

Due to the frequency stability constraint in the stability-constrained meta-RL model (6), existing approaches, such as those in [22] and [23], which are aimed at general unconstrained meta-RL problems, are not directly applicable. Representing hard constraints in a form compatible with the RL framework can be challenging. These constraints are often addressed using penalty terms in the reward function, which may not always ensure strict compliance. In this section, we construct an HNN to parameterize $f_\theta$ and $u_\phi$ in (6) as an event-triggered RL algorithm and a nonlinear droop-based control policy, respectively. This construction ensures that a sufficient condition for system frequency stability is always satisfied. By reformulating the frequency stability constraint in (6) as a network constraint and a trigger condition, (6) is made tractable.

A. HNN Structure

In (6), each MDP $M$ differs in load disturbance $l$, leading to different dynamics $P$. However, different dynamics $P$ also share many similarities such as the generator and inverter dynamics, indicating that optimal policies of different $M$ may also share common features. Accordingly, we divide the policy parameters $\phi$ into fixed network parameters $\phi^f$ and variable external parameters $\phi^v$. Specifically, we model the common parts of different policies with the bottom neural network parameterized by $\phi^f$, and represent an RL algorithm $f_\theta$ with another top neural network, which adapts $\phi^v$ as a variable input of the policy network. The two parts form an HNN structure, as illustrated in Fig. 2.

Fig. 2  HNN structure with stability guarantee.

The bottom neural network, named the executor, can be expressed as $u(\omega; \phi)$, which takes the frequency $\omega$ as input and produces the aggregated FFR signal $u$. As common parameters of all policies, $\phi^f$ is optimized during training and then fixed during implementation, while $\phi^v$ is always updated by the top neural network $f_\theta$ during both stages. The executor $u(\omega; \phi)$ is designed as an unconstrained monotonic neural network (UMNN) [31] to introduce monotonicity, which can be expressed as:

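$$f(\omega; \phi) = \frac{\partial u(\omega; \phi)}{\partial \omega} > 0, \qquad u(\omega; \phi) = \int_{0}^{\omega} f(x; \phi)\,\mathrm{d}x \tag{8}$$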

where $f(\omega; \phi)$ is a neural network with the input $\omega$ and parameters $\phi$.

First, the partial derivative of $u$ w.r.t. $\omega$, which is a scalar function, is parameterized as the neural network $f(\omega; \phi)$, whose output is forced to be positive through the exponential linear unit (ELU) increased by 1. The output control signal $u$ is then calculated as the integral of the positive partial derivative. In this way, the parameterized policy $u(\omega; \phi)$ is always monotonically increasing w.r.t. the system frequency $\omega$. Namely, the executor can be considered as a cluster of monotonic droop controllers indexed by $\phi^v$ with zero output at $\omega = 0$. Note that the network constraint (8) poses no limitation on the structure of the bottom neural network with parameters $\phi^f$, which can be arbitrarily complex, as long as we set a positive activation function for the final layer and add an integral layer after that.
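The construction in (8) can be sketched without any deep learning library. Below, a tiny randomly initialized network with an ELU-plus-one output plays the role of $f(\omega; \phi)$, and the droop curve is tabulated by a cumulative midpoint rule. The network size, the random weights, the scalar $\phi^v$, and the midpoint quadrature are assumptions for illustration; they are not the paper's trained model or the quadrature used in UMNN.

```python
import math
import random


def elu_plus_one(z):
    """ELU(z) + 1, which is strictly positive for every real z."""
    return z + 1.0 if z > 0 else math.exp(z)


def make_derivative_net(hidden=8, seed=0):
    """Tiny random MLP f(omega, phi_v) > 0 standing in for the bottom network."""
    rng = random.Random(seed)
    layer1 = [(rng.gauss(0, 1), rng.gauss(0, 1), rng.gauss(0, 1))
              for _ in range(hidden)]
    layer2 = [rng.gauss(0, 1) for _ in range(hidden)]

    def f(omega, phi_v):
        h = [math.tanh(a * omega + b * phi_v + c) for a, b, c in layer1]
        return elu_plus_one(sum(w * x for w, x in zip(layer2, h)))

    return f


def droop_curve(f, phi_v, w_max=1.0, n=50):
    """Tabulate u(omega) = integral_0^omega f(x) dx on a uniform grid."""
    dw = w_max / n
    pos, neg = [0.0], [0.0]
    for i in range(n):
        # Accumulate outward from omega = 0 in both directions.
        pos.append(pos[-1] + f((i + 0.5) * dw, phi_v) * dw)
        neg.append(neg[-1] - f(-(i + 0.5) * dw, phi_v) * dw)
    omegas = [(-n + j) * dw for j in range(2 * n + 1)]
    return omegas, neg[:0:-1] + pos


f_net = make_derivative_net()
omegas, curve_a = droop_curve(f_net, phi_v=0.0)
_, curve_b = droop_curve(f_net, phi_v=1.0)
```

Because every increment is a positive number times a positive step, each tabulated curve is increasing by construction and passes through zero at $\omega = 0$, whatever the weights are. Changing `phi_v` selects a different member of the family.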

Once the top neural network updates its output, the bottom neural network executes a different monotonic droop curve indexed by the new $\phi^v$. Therefore, the top neural network is named the selector. While the executor updates the output at each timestep $t$, the selector works in an event-triggered mode, with the timestep of the $k$th trigger denoted as $t_k$. The detailed explanation is deferred to Section IV-B. The input $o_{t_k}$ of the selector is an observation of the system states at timestep $t_k$, which is chosen as $[\omega_{t_k},\ \omega_{t_k-1},\ \omega_{t_k} - \omega_{t_k-1},\ \max_{0 \le \tau \le t_k} |\omega_\tau|,\ \phi^v_{t_k-1}]$. The top neural network is designed as a recurrent neural network (RNN). The first layer comprises gated recurrent units (GRUs) [32], which introduce recurrence to store historical observation and action information in the hidden state $h_{t_k}$. $h_0$ is initialized as zeros at the beginning of each episode. The following multi-layer perceptron (MLP) learns valuable features from the historical information and produces $\phi^v_{t_k}$ accordingly, selecting the droop curve that best suits the current operating conditions. It is worth noting that the GRU and MLP structures presented here are empirically found to perform well in our case, but are not mandatory. The top neural network can be structured arbitrarily without constraints.

B. Unrolled Structure and Decision Process

Constrained by (8), if we fix the output $\phi^v$ of the top neural network, the proposed HNN degenerates to a static monotonic controller. Based on this characteristic, we set the selector to work in an event-triggered mode with the following triggering condition:

$$t_{k+1} = \min\left\{t \in \{t_k + 1,\ t_k + 2,\ \ldots\} : |\omega_t| > |\omega_{t_k}|\right\} \tag{9}$$

That is to say, the selector is triggered if and only if the frequency deviation gets worse.

Under the triggering condition (9), the selector dynamically adjusts the droop curve selection according to its observations during the frequency arrest stage. Then, the bottom neural network keeps executing the selected static droop curve until the frequency is settled and recovered, or another disturbance occurs, inducing a larger frequency deviation and triggering the selector to update $\phi^v$. In any case, the whole network stays static and monotonic after the system frequency reaches the nadir or peak, which satisfies the sufficient condition for frequency stability described in Theorem 1.
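A minimal sketch of the trigger logic in (9) is given below. It assumes that the comparison is made on the magnitude of the frequency deviation and that the first trigger occurs at the start of the episode; the function name and the sample trajectories are hypothetical.

```python
def trigger_times(freqs):
    """Timesteps at which the selector fires under condition (9)."""
    times = [0]  # assumed: the selector fires once when the episode starts
    for t in range(1, len(freqs)):
        # Fire only if the deviation is worse than at the previous trigger.
        if abs(freqs[t]) > abs(freqs[times[-1]]):
            times.append(t)
    return times


arrest_then_recover = [-0.02, -0.05, -0.08, -0.09, -0.07, -0.04, -0.06, -0.01]
late_deterioration = [-0.05, -0.04, -0.05, -0.07, -0.06]
```

In the first trajectory the selector fires at every step until the nadir and then stays silent, including during the smaller second dip. In the second it fires only at timesteps 0 and 3, so the droop curve is held fixed in between and after the worst deviation.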

The unrolled structure of the proposed HNN is given in Fig. 3 to illustrate the decision process of the top neural network in the event-triggered mode.

Fig. 3  Unrolled structure of proposed HNN.

At each evenly-spaced timestep $t$, $\omega_t$ is measured, and the action $a_t$, i.e., the control signal $u_t$, is updated by the executor based on $\phi^v_t$ provided by the selector. A reward $r^e_t$ for the single timestep $t$ is then obtained from the environment.

As for the selector, Fig. 3 shows the situation where the selector is triggered at $t_0 = 0$ and $t_1 = 3$. The reward for each trigger $r^s$ is defined as the accumulated individual rewards $r^e$ until the next trigger. For example, the first trigger generates a selection $\phi^v_0$ lasting for three timesteps, so the corresponding reward is calculated as $r^s_0 = \sum_{t=0}^{2} \gamma^t r^e_t$. Limited by space, only five timesteps of a certain episode are presented in Fig. 3. In the subsequent time, the selector will still be triggered whenever the frequency deteriorates.
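The accumulation of executor rewards into selector rewards can be sketched as follows. The text only gives the formula for the first trigger; restarting the discount exponent at each trigger is an assumption made here, and the reward values are made up for illustration.

```python
def selector_rewards(step_rewards, triggers, gamma=0.99):
    """Accumulate per-step rewards r^e into one reward r^s per trigger."""
    bounds = list(triggers) + [len(step_rewards)]
    rewards = []
    for start, end in zip(bounds, bounds[1:]):
        # Assumption: discounting restarts at each trigger time.
        rewards.append(sum(gamma ** (t - start) * step_rewards[t]
                           for t in range(start, end)))
    return rewards


example = selector_rewards([-1.0, -2.0, -3.0, -4.0, -5.0], [0, 3], gamma=0.5)
```

With triggers at timesteps 0 and 3, the first selection collects the rewards of timesteps 0 to 2 and the second collects those of timesteps 3 and 4, mirroring the five-step example described for Fig. 3.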

Figure 4 shows the control logic comparison of the proposed method with two benchmark FFR methods, i.e., static linear droop control method (denoted as method 1) and static nonlinear droop control method (denoted as method 2). In Fig. 4(c), the dashed curves in different colors visualize the control logics of the executor under three different values of $\phi^v$. The black and blue curves with arrows show two possible dynamic control logics during load disturbance events with different sizes and directions.

Fig. 4  Control logic comparison of different methods. (a) Method 1. (b) Method 2. (c) Proposed method.

The preceding analysis indicates that the network constraint (8) and the trigger condition (9) constitute a sufficient but not necessary condition for frequency stability. Consequently, the stability-constrained meta-RL problem (6) can be conservatively reformulated as follows.

$$\max_{\theta}\ \mathbb{E}_{M}\left[\mathbb{E}\left[\sum_{t=0}^{T} \gamma^t r_t \,\middle|\, f_\theta, u_\phi, M\right]\right] \tag{10a}$$
$$\text{s.t. (8), (9)} \tag{10b}$$

Compared with (6), the stability constraint is replaced by network shape and trigger condition constraints that are much easier to handle.

V. Solution Algorithm

The HNN-based meta-RL model (10) enables the optimization of a dynamic droop-based controller with a stability guarantee. Next, the goal is to solve the proposed HNN-based meta-RL problem. Inspired by [33], this section proposes an effective two-stage algorithm to solve (10) through any classical RL algorithm. Unlike the algorithm in [33], which targets adaptation over many episodes (e.g., tens of episodes), the proposed algorithm focuses on achieving much faster adaptation within every single episode.

We view the interaction process from different perspectives and reuse the experience collected by the HNN-based controller. From the view of the selector $f_\theta$, the executor actions $a_t$ and rewards $r_t^e$ can be considered as a part of the environment dynamics. The training data collected during an episode for updating $\theta$ include the selector’s observation, action, and the reward for each trigger $k$, which can be denoted as $\mathcal{D}^s=\{(o_{t_k},\phi^v_{t_k},r^s_{t_k})\}_{k=1}^{K}$, where $K$ is the total trigger number of the selector within an episode. Then, from the view of the executor, the decision process of the selector can be treated as environment transitions. The system frequency and the selector’s action constitute the executor’s observation $\sigma_t=(\omega_t,\phi^v_{t_k})$. The training data for the executor can be expressed as $\mathcal{D}^e=\{(\sigma_t,a_t,r_t)\}_{t=0}^{T}$. After collecting the interaction experience of multiple episodes, any off-the-shelf RL algorithm can be used to train the network by mapping the experience buffers $\mathcal{D}^s$ and $\mathcal{D}^e$ to new parameters $\theta$ and $\phi^f$, respectively. However, we observed that simultaneous training of both selector and executor from randomly initialized $\theta$ and $\phi^f$ leads to poor performance.
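The reuse of one episode of experience for both buffers can be illustrated with a short Python sketch. It is only a schematic under stated assumptions: the function `build_buffers`, its argument names, and the plain-tuple layout are hypothetical, the selector rewards are assumed to be computed beforehand, and the selection is taken as 0 before the first trigger, following the initialization in Algorithm 1.

```python
from typing import Any, List, Sequence, Tuple


def build_buffers(observations: Sequence[Any],
                  omegas: Sequence[float],
                  actions: Sequence[float],
                  rewards: Sequence[float],
                  trigger_steps: Sequence[int],
                  selections: Sequence[float],
                  sel_rewards: Sequence[float]) -> Tuple[List, List]:
    """Split one episode into a selector buffer and an executor buffer."""
    # Selector view: one tuple (observation, selection, reward) per trigger k.
    d_s = [(observations[t_k], selections[k], sel_rewards[k])
           for k, t_k in enumerate(trigger_steps)]

    # Executor view: one tuple ((omega, phi_v), action, reward) per timestep,
    # where phi_v is the most recent selection at or before that timestep.
    d_e = []
    k = -1
    for t, (omega, a, r) in enumerate(zip(omegas, actions, rewards)):
        while k + 1 < len(trigger_steps) and trigger_steps[k + 1] <= t:
            k += 1
        phi_v = selections[k] if k >= 0 else 0.0  # assumed default before any trigger
        d_e.append(((omega, phi_v), a, r))
    return d_s, d_e


d_s, d_e = build_buffers(
    observations=["o0", "o1", "o2", "o3", "o4"],
    omegas=[-0.10, -0.20, -0.30, -0.40, -0.35],
    actions=[0.01, 0.02, 0.03, 0.05, 0.05],
    rewards=[-1.0, -1.0, -1.0, -1.0, -1.0],
    trigger_steps=[0, 3],
    selections=[0.2, 0.7],
    sel_rewards=[-3.0, -2.0],
)
```

Here the selector buffer holds two entries (one per trigger), while the executor buffer holds five entries whose observations carry the selection that was in force at each timestep.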

To optimize the training process and achieve high performance, we propose a two-stage algorithm, which is summarized in Algorithm 1, along with the implementation process. Hyper-parameters i and j are the indices for the neural network updates and episodes, respectively, with a total number of I and J. Their superscripts e and u distinguish the executor and united training stages.

Algorithm 1: HNN-based meta-RL for optimal FFR

Initialize: $\theta$, $\phi^f$

Executor training:
  for $i^e=0,1,\dots,I^e$ do
    Initialize an empty executor experience buffer $\mathcal{D}^e$
    for $j^e=0,1,\dots,J^e$ do
      Sample an MDP $\mathcal{M}_l\in\mathcal{M}$, and fix $\phi^v=l$
      Collect $T$ timesteps of experience using $u_\phi$
    end for
    Update $\phi^f$ based on $\mathcal{D}^e$
  end for

United training:
  for $i^u=1,2,\dots,I^u$ do
    Initialize an empty executor experience buffer $\mathcal{D}^e$
    Initialize an empty selector experience buffer $\mathcal{D}^s$
    for $j^u=1,2,\dots,J^u$ do
      Sample an MDP $\mathcal{M}_l\in\mathcal{M}$
      Collect $T$ timesteps of experience using $f_\theta$ and $u_\phi$
    end for
    Update $\phi^f$ based on $\mathcal{D}^e$, and update $\theta$ based on $\mathcal{D}^s$
  end for

Implementation:
  if $|\omega|>\omega_{db}$ then
    Begin an FFR episode, and initialize $\phi^v=0$ and $h_0=0$
    for timestep $t=0,1,\dots$ do
      Get an observation $o$
      if $|\omega|<\omega_{db}$ then
        Break
      else
        if condition (9) is satisfied then
          Select $(\phi^v,h)\leftarrow f_\theta(o,h)$
        end if
        Execute $a=u(\omega;\phi^v,\phi^f)$
      end if
    end for
  end if

1) Executor training stage

At the first stage, only the executor is trained to get a cluster of diversified droop curves. Since the load disturbance $l$ is a key parameter for distinguishing different MDPs, we block the selector and set the selection $\phi^v$ to be $l$. Note that although the disturbance $l$ cannot be measured during the application, it is available during training and is exclusively used at the executor training stage. Only executor experience $\mathcal{D}^e$ is collected at this stage, based on which $\phi^f$ is iteratively updated.

2) United training stage

The selector network $f_\theta$ is activated at this stage, generating $\phi^v$ as the input of the executor trained at the first stage. The whole HNN interacts with the environment. The experience collected at this stage is reused to generate both $\mathcal{D}^s$ and $\mathcal{D}^e$, and parameters $\theta$ and $\phi^f$ are simultaneously updated.
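The two training stages can be summarized as a training-loop skeleton. The sketch below is a hedged outline only: `rollout`, `update_executor`, and `update_selector` are hypothetical callables standing in for the environment interaction and for the chosen RL update (PPO in the case studies), and the loop counts are passed in directly rather than following the exact index ranges of Algorithm 1.

```python
import random
from typing import Callable, Optional, Tuple

# rollout(l, fixed_phi_v) -> (selector_data, executor_data) for one episode.
Rollout = Callable[[float, Optional[float]], Tuple[list, list]]


def two_stage_training(rollout: Rollout,
                       update_executor: Callable[[list], None],
                       update_selector: Callable[[list], None],
                       exec_updates: int, exec_episodes: int,
                       united_updates: int, united_episodes: int,
                       l_range: Tuple[float, float] = (0.01, 0.1),
                       seed: int = 0) -> None:
    rng = random.Random(seed)

    # Stage 1: selector blocked, phi^v fixed to the sampled disturbance l.
    for _ in range(exec_updates):
        d_e: list = []
        for _ in range(exec_episodes):
            l = rng.uniform(*l_range)
            _, exec_data = rollout(l, l)
            d_e.extend(exec_data)
        update_executor(d_e)

    # Stage 2: selector active; each rollout feeds both buffers.
    for _ in range(united_updates):
        d_s: list = []
        d_e = []
        for _ in range(united_episodes):
            l = rng.uniform(*l_range)
            sel_data, exec_data = rollout(l, None)
            d_s.extend(sel_data)
            d_e.extend(exec_data)
        update_executor(d_e)
        update_selector(d_s)
```

Passing `None` as the fixed selection is the convention chosen here to signal that the selector, rather than the disturbance size, provides the selection during united training.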

3) Implementation

The implementation part in Algorithm 1 serves as a summary of the controller decision process introduced in Section IV-B. It is worth noting that, although the two training stages take hours, the time required for control signal calculation during the implementation is only a matter of milliseconds. This makes the method highly suitable for practical online applications in the context of FFR. Detailed time consumption data can be found in Section VI.
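The online decision process can be sketched as a simple loop. This is an illustrative outline under explicit assumptions: `selector` and `executor` are hypothetical callables standing in for the trained networks, the selector observation is reduced to the frequency deviation alone, and the trigger condition (9), which is not reproduced in this section, is replaced by a stand-in that fires whenever the deviation magnitude reaches a new worst value within the episode.

```python
from typing import Callable, List, Sequence, Tuple


def run_ffr_episode(omegas: Sequence[float],
                    selector: Callable[[float, float], Tuple[float, float]],
                    executor: Callable[[float, float], float],
                    deadband: float = 0.015) -> List[float]:
    """Compute control signals for one FFR episode from frequency deviations."""
    phi_v, hidden = 0.0, 0.0   # both initialized to 0, as in Algorithm 1
    worst = 0.0                # largest deviation magnitude seen so far
    signals = []
    for omega in omegas:
        if abs(omega) < deadband:
            break              # frequency is back inside the deadband
        if abs(omega) > worst: # hypothetical stand-in for condition (9)
            worst = abs(omega)
            phi_v, hidden = selector(omega, hidden)
        signals.append(executor(omega, phi_v))
    return signals


# Toy stand-ins: the "selector" returns the deviation magnitude as phi_v,
# and the "executor" applies a droop whose slope grows with phi_v.
toy_selector = lambda omega, h: (abs(omega), h + 1.0)
toy_executor = lambda omega, phi_v: -(1.0 + phi_v) * omega
u = run_ffr_episode([-0.05, -0.10, -0.08, -0.12, -0.01, -0.20],
                    toy_selector, toy_executor)
```

In this toy run the selector is queried at the first, second, and fourth measurements, the third measurement reuses the previous selection because the deviation has not worsened, and the episode ends at the fifth measurement.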

The executor training stage before the united training has been empirically validated to improve the final performance significantly. Through Algorithm 1, we learn a parameterized RL algorithm $f_\theta$ capable of fast adaptation through classical RL algorithms. Detailed simulation results are provided in Section VI to show the effectiveness of Algorithm 1.

VI. Case Studies

A. Simulation Settings

The effectiveness of the proposed HNN-based meta-RL model and the solution algorithm is validated via numerical simulations. The block diagram of the simulation system is shown in Fig. 1. The simulation system is constructed on the Python platform using the OpenAI Gym framework. The system parameters are listed in Table I.

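TABLE I  System Parameters

| Parameter | Value | Parameter | Value | Parameter | Value |
|---|---|---|---|---|---|
| $M$ | 9.2 s | $D$ | 2.0 p.u. | $T_g$ | 0.1 |
| $T_r$ | 12 s | $T_{ch}$ | 0.3 s | $F_{hp}$ | 0.2 |
| $R$ | 0.07 | $T_{inv}$ | 0.2 s | $K_p$ | 0.15 |
| $K_i$ | 0.015 | $\alpha_g$ | 0.5 | $\alpha_{inv}$ | 0.5 |
| $\omega_{db}$ | 0.03 | $\beta$ | 24 | | |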

The control interval of the optimized FFR controller is set to be 0.1 s. For more realistic simulations of practical systems, AGC in Fig. 1 is set to update the control signal every 4 s with a transmission delay of 1.5 s. The frequency deadband for flexible resource-based FFR is set to be ±0.015 Hz. The selector in Fig. 2 is designed as a 16-unit GRU layer and an MLP composed of two fully connected 32-unit layers. The executor is designed as two fully connected 16-unit layers before the integral layer. The parameters required in Algorithm 1 are set to be $I^e=500$, $J^e=15$, $I^u=3000$, $J^u=15$, and $T=2400$. The widely used PPO algorithm [30] is leveraged to update the network parameters. Discount factor $\gamma$ in (6) is set to be 0.999. The disturbance $l$ of different MDPs is uniformly distributed within the range [0.01, 0.1]. The total FFR capacity of flexible resources is ±0.08 p.u., and the total PFR capacity of generators is ±0.07 p.u. The weight coefficients in (3) are chosen as $q_1=0.1$, $q_2=0.125$, and $q_3=5$. A single NVIDIA Quadro P2200 GPU with 5 GB memory is used to train the HNN.
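The timing implied by these settings can be made concrete with a small calculation. The snippet below is only a convenience sketch: the class `SimSettings` and its field names are hypothetical, and it simply converts the stated values into controller timesteps.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SimSettings:
    control_interval_s: float = 0.1   # FFR control interval
    agc_period_s: float = 4.0         # AGC update period
    agc_delay_s: float = 1.5          # AGC transmission delay
    episode_steps: int = 2400         # T in Algorithm 1
    gamma: float = 0.999              # discount factor

    @property
    def episode_duration_s(self) -> float:
        return self.episode_steps * self.control_interval_s

    @property
    def agc_period_steps(self) -> int:
        return round(self.agc_period_s / self.control_interval_s)

    @property
    def agc_delay_steps(self) -> int:
        return round(self.agc_delay_s / self.control_interval_s)

    @property
    def final_discount(self) -> float:
        # Weight applied to the reward at the last timestep of an episode.
        return self.gamma ** self.episode_steps


cfg = SimSettings()
print(cfg.episode_duration_s, cfg.agc_period_steps, cfg.agc_delay_steps)
```

With these values, one episode spans about 240 s, AGC updates once every 40 controller steps with a 15-step delay, and the last timestep of an episode still carries a discount weight of roughly 0.09, so the recovery period is not discounted away.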

B. Result Analysis

The time required for the executor training and united training stages is 2 hours and 10 hours on average, respectively. During the implementation stage, the calculation time for the selector and the executor is 0.3 ms and 0.7 ms on average, respectively, which is fast enough for practical online applications.

Time-domain simulations on the system illustrated in Fig. 1 are performed using the well-trained HNN-based controller. The dynamics of FFR signals u and frequencies ω under step load disturbances l of sizes 0.01 p.u., 0.04 p.u., 0.07 p.u., and 0.1 p.u. are shown in Fig. 5.

Fig. 5  Dynamics of FFR signals and frequencies under step load disturbances of different sizes. (a) FFR signals. (b) Frequencies.

Figure 5(a) shows the dynamics of FFR signals $u$ for flexible resources w.r.t. system frequency. For each disturbance size, the solid line shows the trajectories of $u$ during the frequency arrest period before the system frequency $\omega$ reaches the nadir. The dashed line illustrates the droop curve during the frequency rebound and recovery periods. The frequency nadir is marked by the triangle in Fig. 5(b). Note that the deadband of FFR is not shown in Fig. 5(a) for simplicity and clarity, but it is considered during simulation by resetting $u$ to 0 when $|\omega|<0.015$ Hz. The trajectories of $u$ validate the adaptability of the proposed method. To balance the control cost and frequency deviations, the proposed method executes steeper curves under larger disturbances to arrest the system frequency and avoid a catastrophic frequency nadir. In contrast, gentler curves are applied during relatively minor disturbance events to suppress the frequency deviation within an acceptable range at a moderate control cost. Figure 5(b) shows that the system frequency is quickly arrested within 1 to 4 s and then recovered to the nominal value under the joint action of both primary and secondary frequency regulations.

To further show the adaptability of the proposed method, it is tested under consecutive step disturbances. Specifically, a 0.04 p.u. load disturbance and a 0.06 p.u. load disturbance occur at t=0 and t=30 s, respectively. The dynamics of FFR signals and frequencies under the consecutive step disturbances are shown in Fig. 6.

Fig. 6  Dynamics of FFR signals and frequencies under consecutive step disturbances. (a) FFR signals. (b) Frequencies.

The curves in Fig. 6 are divided into four pieces in different colors. The blue piece depicts the dynamics from the beginning of the first disturbance to the first frequency nadir ω2. During this period, the selector and the executor are both actuated. Then, the orange piece shows the dynamics during the period when the frequency rebounds to ω1 at t=30 s and falls again to ω2 after the occurrence of the second disturbance. According to the triggering condition (9), the selector is deactivated during this period because the frequency has not deteriorated. A fixed nonlinear droop curve is executed as shown in Fig. 6(a). The green piece denotes the frequency arrest period from ω2 to ω3. Here, the selector is actuated again to choose steeper droop curves that can better adapt to the frequency dynamics after the occurrence of the second disturbance. Then, the newly chosen droop curve in red is executed until the frequency is recovered to the nominal value. The piece-wise dynamics in Fig. 6 show that the proposed method can switch working states reasonably based on the triggering condition (9). This switching mode not only ensures the transient frequency stability of the system but also enables the controller to adapt to a wider range of operating conditions.

C. Method Comparison

This subsection compares the performance of the proposed method with the two benchmark FFR methods. Method 1 is static linear droop control with a typical droop value of 1%, whose droop curve is shown in Fig. 7(a). Method 2 is static nonlinear droop control trained by the standard RL algorithm PPO without incorporating meta-learning techniques. It is parameterized by a UMNN that is the same as the executor of the proposed HNN to ensure frequency stability. The same reward function (4) is employed for training. This control method takes frequency $\omega$ as the single input, resulting in a static nonlinear droop curve, as depicted in Fig. 7(b).

Fig. 7  Droop curves of two benchmark FFR methods for flexible resources. (a) Droop curve of method 1. (b) Droop curve of method 2.

The optimal control objective value J in (3) and the proportion of the control cost term j1 under various step load disturbances are listed in Table II. The objective value J is largely affected by the disturbance size l. To better show the relative performance of different methods, we define a performance metric as:

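$$P=\frac{J-J_{m1}}{\left|J_{m1}\right|} \tag{11}$$

where $J_{m1}$ is the objective value of method 1. The denominator is an absolute value because the objective values are all negative, so a positive $P$ indicates an improvement over method 1. The performance of different methods under various load disturbances is plotted in Fig. 8.

TABLE II  Performance and Control Cost Comparisons of Different Methods

| $l$ (p.u.) | Method 1: $J$ | Method 1: $j_1$ (%) | Method 2: $J$ | Method 2: $j_1$ (%) | Proposed: $J$ | Proposed: $j_1$ (%) |
|---|---|---|---|---|---|---|
| 0.01 | -0.22 | 78 | -0.22 | 78 | -0.15 | 40 |
| 0.02 | -0.49 | 68 | -0.48 | 67 | -0.42 | 39 |
| 0.03 | -0.80 | 60 | -0.80 | 60 | -0.76 | 41 |
| 0.04 | -1.16 | 53 | -1.16 | 53 | -1.15 | 43 |
| 0.05 | -1.58 | 48 | -1.57 | 48 | -1.58 | 45 |
| 0.06 | -2.06 | 44 | -2.04 | 44 | -2.04 | 46 |
| 0.07 | -2.61 | 40 | -2.56 | 41 | -2.54 | 46 |
| 0.08 | -3.25 | 36 | -3.16 | 38 | -3.08 | 46 |
| 0.09 | -3.98 | 33 | -3.83 | 35 | -3.67 | 45 |
| 0.10 | -4.80 | 31 | -4.58 | 33 | -4.31 | 44 |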

Fig. 8  Performance of different methods under various load disturbances.
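As a quick numerical check of the performance metric, the following sketch evaluates it for a few rows of Table II. It assumes the metric is read as the improvement of the objective value over method 1, normalized by the magnitude of the method 1 value; the function name `relative_performance` is hypothetical.

```python
def relative_performance(j: float, j_m1: float) -> float:
    """Improvement of objective value j over method 1, relative to |j_m1|."""
    return (j - j_m1) / abs(j_m1)


# (method 1, method 2, proposed) objective values taken from Table II.
table_ii = {
    0.01: (-0.22, -0.22, -0.15),
    0.05: (-1.58, -1.57, -1.58),
    0.10: (-4.80, -4.58, -4.31),
}

for l, (j_m1, j_m2, j_prop) in table_ii.items():
    p2 = relative_performance(j_m2, j_m1)
    pp = relative_performance(j_prop, j_m1)
    print(f"l={l:.2f} p.u.: method 2 {p2:+.1%}, proposed {pp:+.1%}")
```

Under this reading, the proposed method improves on method 1 by roughly 32% at $l=0.01$ p.u. and roughly 10% at $l=0.10$ p.u., while the rounded values at $l=0.05$ p.u. show no difference.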

From Fig. 8, method 2 and the proposed method perform better than method 1 in all cases. As shown in Fig. 7(b), the droop curve of method 2 becomes steeper as the frequency deviations get larger, which can be considered as a generalization of the piece-wise linear droop control method in [10]. However, such bending in the droop curve brings limited performance improvement due to its static feature. The proposed method can dynamically modify the droop curve to realize adaptability to a greater extent. As shown in Fig. 5(a), the dynamics of FFR signals in different cases can be different even at the same frequency deviation level. After a larger disturbance, the frequency response is faster from the beginning of the event instead of accelerating after the frequency deviation reaches a high level. Consequently, the proposed method achieves the best performance in almost all cases.

Compared with other methods, the proportion of j1 obtained by the proposed method is higher under larger disturbances and lower under smaller disturbances, as shown in Table II. Such results indicate that the proposed method can reasonably balance the control cost and frequency deviations case by case to achieve higher control performance.

D. Algorithm Comparison

The proposed algorithm has an executor training stage before the united training. To validate the effectiveness of the proposed algorithm, this subsection compares the performance of the proposed algorithm and another algorithm performing united training only (denoted as algorithm 2). The performance comparison of different algorithms is shown in Fig. 9.

Fig. 9  Performance comparison of different algorithms.

It can be observed from Fig. 9 that the proposed algorithm with the executor training stage outperforms algorithm 2 in most cases. Intuitively, the executor training stage helps the executor acquire a cluster of meaningful skills. In comparison, performing united training from the beginning may cause insufficient or meaningless exploration and lead to poor training results.

E. Sensitivity Analysis

The objective of the optimal control problem is formulated as the weighted sum of different terms in (3) to balance the control cost and frequency deviations. Different values of weight coefficients q1, q2, and q3 in (3) result in different trade-offs. This subsection takes the coefficient q1 as an example to show the impact of weight coefficients on the optimization results of the proposed method. The value of q1 is set to be 0.4, 0.1, and 0.025, respectively. The dynamics of frequencies ω and FFR signals u after step load disturbances with size l=0.1 p.u. and l=0.05 p.u. are plotted in Fig. 10.

Fig. 10  Dynamics of frequencies and FFR signals after step load disturbances with size l=0.1 p.u. and l=0.05 p.u. (a) Dynamics of frequencies with l=0.1 p.u. (b) Dynamics of FFR signals with l=0.1 p.u. (c) Dynamics of frequencies with l=0.05 p.u. (d) Dynamics of FFR signals with l=0.05 p.u.

A larger $q_1$ value denotes a higher cost of flexible resource-based FFR service. As shown in Fig. 10, the proposed method optimized with a higher $q_1$ value tends to utilize fewer frequency regulation resources at the cost of larger frequency deviations. Consequently, transmission system operators should fine-tune the weight coefficients according to the actual regulation cost of flexible resources and the requirements for frequency quality, based on numerical simulations, before practical implementation.

F. Method Applicability in Other System Types

Although the SFR model depicted in Fig. 1 incorporates only two types of frequency regulation resources, the proposed method is applicable to larger load frequency control systems with diverse resource types. To validate such applicability, we modify the SFR model in Fig. 1 and conduct simulations under the same settings as introduced in Section VI-A. This modified SFR model incorporates an additional type of frequency regulation resource, namely an aggregated non-reheat generator, into the original system model by substituting the synchronous generator block in Fig. 1 with Fig. 11. Tg,nr and Tch,nr are the time constants of the equivalent governor and turbine, respectively, for the aggregated non-reheat generator. The proportion of reheat and non-reheat generators can be adjusted by modifying the values of Kr and Knr, respectively. In this case study, we set Kr=Knr=0.5.

Fig. 11  Block diagram of reheat and non-reheat generators.
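For readers without access to Fig. 11, one common textbook form of such a combined generator block is sketched below. The transfer functions are the standard reheat and non-reheat governor-turbine models and are not taken from Fig. 11, so the exact structure (in particular, where the droop gain $1/R$ and the AGC signal enter) is an assumption; the symbols $G_r$, $G_{nr}$, and $\Delta P_m$ are introduced here only for illustration, and only the primary (droop) response is shown.

```latex
\begin{align}
G_{r}(s)  &= \frac{1}{1+sT_{g}}\cdot\frac{1+sF_{hp}T_{r}}{(1+sT_{ch})(1+sT_{r})}, \\
G_{nr}(s) &= \frac{1}{(1+sT_{g,nr})(1+sT_{ch,nr})}, \\
\Delta P_{m}(s) &= -\frac{1}{R}\left[K_{r}G_{r}(s)+K_{nr}G_{nr}(s)\right]\omega(s).
\end{align}
```

With $K_r=K_{nr}=0.5$, the two generator types would contribute equally to the steady-state primary response under this assumed structure, since both $G_r(0)$ and $G_{nr}(0)$ equal 1.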

We also compare the proposed method with the two benchmark methods as detailed in Section VI-C. Method 1 maintains its typical droop value of 1%. Method 2 and the proposed method undergo training using PPO and the proposed algorithm, respectively, under the modified SFR model. The control objective values $J$ under various load disturbances are presented in Table III. Additionally, the relative performance of the three methods, calculated using (11), is illustrated in Fig. 12. Based on the simulation results, the proposed method shows significant superiority over the benchmarks as in Section VI-C, validating its adaptability to different types of power systems.

TABLE III  Control Performance in Modified SFR Model Under Various Load Disturbances

| $l$ (p.u.) | $J$ (Method 1) | $J$ (Method 2) | $J$ (Proposed) |
|---|---|---|---|
| 0.01 | -0.28 | -0.23 | -0.11 |
| 0.02 | -0.57 | -0.49 | -0.33 |
| 0.03 | -0.89 | -0.79 | -0.64 |
| 0.04 | -1.25 | -1.15 | -1.03 |
| 0.05 | -1.66 | -1.56 | -1.47 |
| 0.06 | -2.11 | -2.02 | -1.97 |
| 0.07 | -2.61 | -2.53 | -2.51 |
| 0.08 | -3.15 | -3.10 | -3.09 |
| 0.09 | -3.73 | -3.82 | -3.72 |
| 0.10 | -4.41 | -4.69 | -4.43 |

Fig. 12  Performance comparisons of different methods in modified SFR model.

VII. Conclusion

This paper investigates the flexible resource-based FFR optimization problem considering the guarantee of system frequency stability. A new meta-RL approach is proposed to realize dynamic nonlinear P-f droop-based FFR with rapid adaptability to different operating conditions.

We first formulate a frequency stability-constrained meta-RL problem, then reformulate it into a more tractable HNN-based form with the well-designed network constraint and trigger condition. A two-stage algorithm is proposed to enhance the optimality in solving the HNN-based meta-RL problem. Simulation results validate that the proposed method can adapt rapidly to different operating conditions with the system frequency stability guaranteed. Compared with benchmarks including static linear control and static nonlinear control methods, the proposed method achieves better trade-offs between frequency quality and regulation cost. Future research directions include the coordinated FFR optimization of multiple inter-connected control areas and the differentiated utilization of heterogeneous flexible resources in FFR.

References

[1] R. W. Kenyon, M. Bossart, M. Marković et al., “Stability and control of power systems with high penetrations of inverter-based resources: an accessible review of current knowledge and open questions,” Solar Energy, vol. 210, pp. 149-168, Nov. 2020.

[2] J. Boyle, T. Littler, S. M. Muyeen et al., “An alternative frequency-droop scheme for wind turbines that provide primary frequency regulation via rotor speed control,” International Journal of Electrical Power & Energy Systems, vol. 133, p. 107219, Dec. 2021.

[3] F. Sattar, S. Ghosh, Y. J. Isbeih et al., “A predictive tool for power system operators to ensure frequency stability for power grids with renewable energy integration,” Applied Energy, vol. 353, p. 122226, Jan. 2024.

[4] M. H. Marzebali, M. Mazidi, and M. Mohiti, “An adaptive droop-based control strategy for fuel cell-battery hybrid energy storage system to support primary frequency in stand-alone microgrids,” Journal of Energy Storage, vol. 27, p. 101127, Feb. 2020.

[5] M. Mousavizade, F. Bai, R. Garmabdari et al., “Adaptive control of V2Gs in islanded microgrids incorporating EV owner expectations,” Applied Energy, vol. 341, p. 121118, Jul. 2023.

[6] C. Christiansen and N. Hillmann. (2017, May). Feasibility of fast frequency response obligations of new generators. [Online]. Available: https://www.aemc.gov.au/sites/default/files/content/661d5402-3ce5-4775-bb8a-9965f6d93a94/AECOM-Report-Feasibility-of-FFR-Obligations-of-New-Generators.pdf

[7] L. Meng, J. Zafar, S. K. Khadem et al., “Fast frequency response from energy storage systems – a review of grid standards, projects and technical issues,” IEEE Transactions on Smart Grid, vol. 11, no. 2, pp. 1566-1581, Mar. 2020.

[8] National Grid Group. (2016, Mar.). Enhanced frequency response: frequently asked questions. [Online]. Available: https://www.nationalgrid.com/sites/default/files/documents/Enhanced%20Frequency%20Response%20FAQs%20v5.0_.pdf

[9] P. Du, N. V. Mago, W. Li et al., “New ancillary service market for ERCOT,” IEEE Access, vol. 8, pp. 178391-178401, Sept. 2020.

[10] Y. Yuan, Y. Zhang, J. Wang et al., “Enhanced frequency-constrained unit commitment considering variable-droop frequency control from converter-based generator,” IEEE Transactions on Power Systems, vol. 38, no. 2, pp. 1094-1110, Mar. 2023.

[11] M. F. M. Arani and Y. A.-R. I. Mohamed, “Cooperative control of wind power generator and electric vehicles for microgrid primary frequency regulation,” IEEE Transactions on Smart Grid, vol. 9, no. 6, pp. 5677-5686, Nov. 2018.

[12] W. Cui, Y. Jiang, and B. Zhang, “Reinforcement learning for optimal primary frequency control: a Lyapunov approach,” IEEE Transactions on Power Systems, vol. 38, no. 2, pp. 1676-1688, Mar. 2023.

[13] C. Zhao, U. Topcu, N. Li et al., “Design and stability of load-side primary frequency control in power systems,” IEEE Transactions on Automatic Control, vol. 59, no. 5, pp. 1177-1189, May 2014.

[14] Y. Liu, Y. Song, Z. Wang et al., “Optimal emergency frequency control based on coordinated droop in multi-infeed hybrid AC-DC system,” IEEE Transactions on Power Systems, vol. 36, no. 4, pp. 3305-3316, Jul. 2021.

[15] Z. Ding, K. Yuan, J. Qi et al., “Robust and cost-efficient coordinated primary frequency control of wind power and demand response based on their complementary regulation characteristics,” IEEE Transactions on Smart Grid, vol. 13, no. 6, pp. 4436-4448, Nov. 2022.

[16] E. Ekomwenrenren, J. W. Simpson-Porco, E. Farantatos et al. (2022, Aug.). Data-driven fast frequency control using inverter-based resources. [Online]. Available: https://arxiv.org/abs/2208.01761

[17] E. Ekomwenrenren, Z. Tang, J. W. Simpson-Porco et al., “Hierarchical coordinated fast frequency control using inverter-based resources,” IEEE Transactions on Power Systems, vol. 36, no. 6, pp. 4992-5005, Nov. 2021.

[18] R. Chakraborty, A. Chakrabortty, E. Farantatos et al., “Hierarchical frequency control in multi-area power systems with prioritized utilization of inverter based resources,” in Proceedings of 2020 IEEE PES General Meeting, Montreal, Canada, Aug. 2020, pp. 1-5.

[19] Q. Yang, L. Yan, X. Chen et al., “A distributed dynamic inertia-droop control strategy based on multi-agent deep reinforcement learning for multiple paralleled VSGs,” IEEE Transactions on Power Systems, vol. 38, no. 6, pp. 5598-5612, Nov. 2023.

[20] Z. Yan, Y. Xu, Y. Wang et al., “Deep reinforcement learning-based optimal data-driven control of battery energy storage for power system frequency support,” IET Generation, Transmission & Distribution, vol. 14, no. 25, pp. 6071-6078, Dec. 2020.

[21] J. Beck, R. Vuorio, E. Z. Liu et al. (2023, Jan.). A survey of meta-reinforcement learning. [Online]. Available: https://arxiv.org/abs/2301.08028

[22] C. Finn, P. Abbeel, and S. Levine, “Model-agnostic meta-learning for fast adaptation of deep networks,” in Proceedings of International Conference on Machine Learning, Sydney, Australia, Aug. 2017, pp. 1126-1135.

[23] Y. Duan, J. Schulman, X. Chen et al. (2016, Nov.). RL2: fast reinforcement learning via slow reinforcement learning. [Online]. Available: https://arxiv.org/abs/1611.02779

[24] J. Li, T. Zhou, K. He et al., “Distributed quantum multiagent deep meta reinforcement learning for area autonomy energy management of a multiarea microgrid,” Applied Energy, vol. 343, p. 121181, Aug. 2023.

[25] R. Huang, Y. Chen, T. Yin et al., “Learning and fast adaptation for grid emergency control via deep meta reinforcement learning,” IEEE Transactions on Power Systems, vol. 37, no. 6, pp. 4168-4178, Nov. 2022.

[26] Q. Shi, F. Li, and H. Cui, “Analytical method to aggregate multi-machine SFR model with applications in power system dynamic studies,” IEEE Transactions on Power Systems, vol. 33, no. 6, pp. 6355-6367, Nov. 2018.

[27] D. L. Poole and A. K. Mackworth, Artificial Intelligence. Cambridge, UK: Cambridge University Press, 2010.

[28] V. Mnih, K. Kavukcuoglu, D. Silver et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, pp. 529-533, Feb. 2015.

[29] T. P. Lillicrap, J. J. Hunt, A. Pritzel et al. (2015, Sept.). Continuous control with deep reinforcement learning. [Online]. Available: https://arxiv.org/abs/1509.02971

[30] J. Schulman, F. Wolski, P. Dhariwal et al. (2017, Jul.). Proximal policy optimization algorithms. [Online]. Available: https://arxiv.org/abs/1707.06347

[31] A. Wehenkel and G. Louppe, “Unconstrained monotonic neural networks,” in Proceedings of 33rd Conference on Neural Information Processing Systems, Vancouver, Canada, Jun. 2019, pp. 1545-1555.

[32] K. Cho, B. van Merriënboer, D. Bahdanau et al. (2014, Sept.). On the properties of neural machine translation: encoder-decoder approaches. [Online]. Available: https://arxiv.org/abs/1409.1259

[33] K. Frans, J. Ho, and X. Chen. (2017, Oct.). Meta learning shared hierarchies. [Online]. Available: https://arxiv.org/abs/1710.09767