Policy Gradient in DRL

Reinforcement learning is essentially a process in which an agent interacts with and learns from an environment. This process includes a few important components:

  • Agent: The intelligent agent that interacts with the environment.
  • Environment: The setting the agent interacts with; it responds to the agent’s actions with new states and rewards.
  • Reward Function: The way the environment provides feedback to the agent.

For example, using reinforcement learning to play a game might involve the following process:

  • Initialize an agent.
  • The agent receives the initial state from the environment: \(s_1\)
  • The agent takes an action: \(a_1\)
  • The environment receives \(a_1\) and provides the corresponding state \(s_2\)

This process repeats until the game ends.
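A minimal sketch of this interaction loop in Python, assuming a simplified gym-style environment whose `reset()` returns the initial state and whose `step(action)` returns the next state, a reward, and a done flag, plus a hypothetical `agent.act(state)` method (all of these names are placeholders):

```python
# One episode of agent-environment interaction,
# assuming a simplified gym-style env and a hypothetical agent.act(state).
def run_episode(env, agent):
    trajectory = []                      # collected (state, action, reward) tuples
    state = env.reset()                  # s_1: initial state from the environment
    done = False
    while not done:
        action = agent.act(state)        # a_t: the agent picks an action for s_t
        next_state, reward, done = env.step(action)  # environment responds
        trajectory.append((state, action, reward))
        state = next_state               # move on to s_{t+1}
    return trajectory                    # the full trajectory of the game
```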

We consider the entire period from the beginning to the end of the game as an episode, represented by \(\tau\). Suppose this game is a familiar one like Raiden (a plane shooting game), where the user needs to control the plane to dodge meteors and other obstacles while actively engaging enemies to score points. Our fighter jet can be seen as the agent in RL, while meteors and enemy planes, which we can’t control (and are somewhat random), form the environment.

To help our agent learn the game’s techniques, we need to design a reward function, essentially a feedback mechanism. The game itself has such a feedback mechanism. For example, shooting down an enemy gives points, collecting a supply yields points, and being hit by a bullet costs points. Every step taken by the agent affects the game’s outcome in some way.

The total score obtained in an episode can be represented by the reward function:

\[R(\tau) = r_1 + r_2 + r_3 + ... + r_n\]

In deep reinforcement learning, the agent is essentially a Deep Neural Network (DNN) that outputs an action for a given state input. The learning process involves updating the DNN’s parameters so that the reward function \(R(\tau)\) reaches its maximum value for an episode.

We refer to an agent’s game-playing strategy as a policy, parameterized by \(\theta\). Different values of \(\theta\) correspond to different strategies (different agents). Our goal is to find the \(\theta\) that maximizes the expected reward \(\overline{R_{\theta}}\), which can be done by gradient ascent:

\[\theta \leftarrow \theta + \eta \nabla \overline{R_{\theta}}\]

To update the neural network’s parameters reliably, we need to collect as much game data as possible. Under the same policy we play many games, so we work with the expected reward over episodes:

\[\overline{R_{\theta}} = \sum_{\tau} R(\tau) p_{\theta} (\tau)\]

Taking the gradient with respect to \(\theta\):

\[\begin{align} \nabla \overline{R_{\theta}} & = \sum_{\tau} R(\tau) \nabla p_{\theta} (\tau) \\ & = \sum_{\tau} R(\tau) p_{\theta}(\tau) \frac{\nabla p_{\theta}(\tau)}{p_{\theta}(\tau)} \\ & = \sum_{\tau} R(\tau) p_{\theta}(\tau) \nabla \log p_{\theta}(\tau) \\ & = \mathbb{E}_{\tau \sim p_{\theta}(\tau)} \left [ R(\tau) \nabla \log p_{\theta}(\tau) \right ] \\ & \approx \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_n} R(\tau^{n}) \nabla \log p_{\theta}(a_t^n | s_t^n) \end{align}\]

To simplify the implementation, we write the equation in this form, where \(R(\tau^{n})\) is the total reward for the n-th episode, and \(T_n\) represents the total number of steps in the episode (where a step is defined as an action taken by the agent given a state s).

This equation is intuitive: if an action taken in \(\tau\) is associated with a large \(R(\tau)\), we increase the probability of taking it in that state; if the reward is small (or negative), we reduce that action’s likelihood.
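As a hedged illustration of how this sampled gradient is typically implemented, the PyTorch sketch below turns it into a loss to minimize: maximizing \(\frac{1}{N}\sum_n R(\tau^n) \sum_t \log p_{\theta}(a_t^n | s_t^n)\) is the same as minimizing its negative. Here `policy_net` is assumed to map a batch of states to action logits, and `episodes` is a list of `(states, actions, total_reward)` tuples collected with the current policy; these names are placeholders, not a reference implementation.

```python
import torch
import torch.nn.functional as F

def policy_gradient_loss(policy_net, episodes):
    """REINFORCE-style surrogate loss: -(1/N) sum_n R(tau^n) sum_t log p_theta(a_t|s_t)."""
    losses = []
    for states, actions, total_reward in episodes:   # one entry per sampled episode tau^n
        logits = policy_net(states)                  # shape: (T_n, num_actions)
        log_probs = F.log_softmax(logits, dim=-1)    # log p_theta(. | s_t)
        chosen = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)  # log p_theta(a_t | s_t)
        losses.append(-total_reward * chosen.sum())  # every step weighted by R(tau^n)
    return torch.stack(losses).mean()                # average over the N sampled episodes
```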

Two details of the derivation above deserve a note. First, the expectation is approximated by sampling:

\[\mathbb{E}_{x \sim p} \left [ f(x) \right ] \approx \frac{1}{N} \sum_{i=1}^{N} f(x^i)\]

The larger N is, the closer the sample average gets to the true expectation under \(p(x)\).
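A quick NumPy illustration of this sample average, with an arbitrarily chosen \(f(x) = x^2\) and \(p\) a standard normal (so the true value is 1):

```python
import numpy as np

rng = np.random.default_rng(0)
samples = rng.normal(size=100_000)    # x^i drawn from p(x), here a standard normal
print(np.mean(samples ** 2))          # (1/N) sum_i f(x^i) with f(x) = x^2; close to E[x^2] = 1
```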

Another small trick is:

\[\nabla f(x) = f(x) \nabla \log f(x)\]

Multiplying and dividing by \(p_{\theta}(\tau)\) inside the sum and then applying this identity turns the gradient of the probability into the gradient of its log:

\[\sum_{\tau} R(\tau) p_{\theta}(\tau) \frac{\nabla p_{\theta}(\tau)}{p_{\theta}(\tau)} = \sum_{\tau} R(\tau) p_{\theta}(\tau) \nabla \log p_{\theta}(\tau)\]

A More Refined Reward Function

Improving \(R_{\theta}(\tau)\)

There are some issues with the above formula, particularly with defining our reward function \(R_{\theta}(\tau)\). If we only follow the game’s rules, \(R_{\theta}(\tau)\) is the cumulative reward generated from each step in the game. In the formula:

\[\nabla \overline{R_{\theta}} \approx \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_n} R(\tau^{n}) \nabla \log p_{\theta}(a_t^n | s_t^n)\]

Some actions are beneficial, and some are not, but all action probabilities are weighted by the same factor: \(R(\tau^{n})\). This approach is clearly flawed.

Once the agent outputs \(a_t\) for a given \(s_t\), that action cannot influence anything that happened before \(s_t\): the rewards collected before time \(t\) are unrelated to \(a_t\).

For example, in a simple game, we played two rounds:

Game 1:

  State:  \(s_a\)   \(s_b\)   \(s_c\)
  Action: \(a_1\)   \(a_2\)   \(a_3\)
  Reward: +10       +0        -6

\[R_1 = +4\]

Game 2:

  State:  \(s_a\)   \(s_b\)   \(s_c\)
  Action: \(a_2\)   \(a_2\)   \(a_3\)
  Reward: -5        +0        -6

\[R_2 = -11\]

In the first game, the pair \((s_b, a_2)\) has its log-probability weighted by +4, so its probability is increased; in the second game the very same pair is weighted by -11, so its probability is decreased. This is unfair: the bad outcome of the second game comes from \((s_a, a_2)\), which produced -5 and has nothing to do with \((s_b, a_2)\); the only reward \((s_b, a_2)\) should answer for is the -6 from \((s_c, a_3)\) that follows it.

Thus, we should credit \(a_t\) only with the sum of the rewards that come after it, not with the total reward of the whole episode. To express this we introduce an advantage function \(A^{\theta}(s_t, a_t)\); so far we have implicitly been using \(A^{\theta}(s_t, a_t) = R(\tau^{n})\).

Replacing \(R(\tau^{n})\) with \(A^{\theta}(s_t, a_t)\) in the gradient gives

\[\nabla \overline{R_{\theta}} \approx \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_n} A^{\theta}(s_t, a_t) \nabla \log p_{\theta}(a_t^n | s_t^n)\]

and substituting the improved reward calculation \(A^{\theta}(s_t, a_t) = \sum_{t'=t}^{T_n} r_{t'}^{n}\) yields

\[\nabla \overline{R_{\theta}} \approx \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_n} \sum_{t'=t}^{T_n} r_{t'}^{n} \nabla \log p_{\theta}(a_t^n | s_t^n)\]

Additionally, we can apply a discount factor \(\gamma\) to \(\sum_{t'=t}^{T_n} r_{t'}^{n}\): the further a reward lies in the future relative to step \(t\), the less credit the action \(a_t\) should take for it:

\[A^{\theta}(s_t, a_t) = \sum_{t'=t}^{T_n} \gamma^{t'-t} \cdot r_{t'}^{n} , (0 < \gamma < 1)\]
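A small plain-Python sketch of this discounted reward-to-go computation, where `rewards` is the list \(r_1, \dots, r_{T}\) of one episode:

```python
def discounted_rewards_to_go(rewards, gamma=0.99):
    """A^theta(s_t, a_t) = sum_{t'>=t} gamma^(t'-t) * r_t', computed for every step t."""
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):   # walk backwards so each step reuses the next one
        running = rewards[t] + gamma * running
        advantages[t] = running
    return advantages
```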

Adding a Baseline

In some games the reward is positive no matter which action is taken. In theory this is not a problem, since only the relative size of the weights matters. In practice, however, we only sample a finite number of actions: the probabilities of the sampled actions are all pushed up, and after normalization the probabilities of the unsampled actions go down. But can we conclude that the unsampled actions are bad actions?

Clearly not, so here’s another small trick: subtract a baseline so that \(A^{\theta}(s_t, a_t)\) can take both positive and negative values. Typically, the baseline is the expected reward:

\[b = \mathbb{E}\left [ R(\tau) \right ]\]

Substituting:

\[\begin{align} \nabla \overline{R_{\theta}} & \approx \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_n} \left [ R(\tau^n) - b \right ] \nabla \log p_{\theta}(a_t^n | s_t^n) \\ & = \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_n} \left [ R(\tau^n) - \mathbb{E}\left [ R(\tau) \right ] \right ] \nabla \log p_{\theta}(a_t^n | s_t^n) \end{align}\]

Incorporating the optimization method from above:

\[\nabla \overline{R_{\theta}} \approx \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_n} \left ( \sum_{t'=t}^{T_n} \gamma^{t'-t} \cdot r_{t'}^{n} - \mathbb{E}\left [ R(\tau) \right ] \right ) \nabla \log p_{\theta}(a_t^n | s_t^n)\]
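Continuing the sketch above, one simple (hedged) choice is to estimate the baseline as the mean total reward of the sampled episodes and subtract it from every step’s weight; `episode_rewards` is a list with one reward list per episode, and the helper `discounted_rewards_to_go` is the one from the previous sketch:

```python
import numpy as np

def baselined_weights(episode_rewards, gamma=0.99):
    """Per-step weights (sum_{t'>=t} gamma^(t'-t) r_t') - b, with b = mean total reward."""
    baseline = np.mean([sum(rs) for rs in episode_rewards])    # b = E[R(tau)], estimated from samples
    weights = []
    for rs in episode_rewards:                                 # one list of rewards per episode
        togo = discounted_rewards_to_go(rs, gamma)             # helper from the previous sketch
        weights.append([g - baseline for g in togo])           # weights can now be positive or negative
    return weights
```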

From On-Policy to Off-Policy

On-Policy Learning

With the principles above in place, the next step is to actually update the neural network. On-policy means that the agent interacting with the environment to collect data is the same agent whose policy is being updated. The specific process can be represented as follows:

  • The agent initializes and interacts with the environment.
  • During the interaction, we sample a certain number m of episodes.
  • After accumulating m episodes of \(\tau\) data, we use this data to update the agent’s policy.
  • We discard the used data and continue interacting with the environment to generate new data (since the policy has been updated, the old data is no longer useful).
  • Continue updating the agent’s policy with new data.

On-policy has a few drawbacks. For instance, in a game like Raiden, where the DNN’s state input is an image, the sample → train → resample loop is time-consuming. Worse, once the original policy is updated, the gradient

\[\nabla \overline{R_{\theta}} = \mathbb{E}_{\tau \sim p_{\theta}(\tau)} \left [ R(\tau) \nabla \log p_{\theta}(\tau) \right ]\]

still requires samples from \(p_{\theta}(\tau)\), but that distribution has changed: data sampled under the old \(\tau \sim p_{\theta}(\tau)\) becomes unusable, so each policy update throws away a lot of data and requires considerable time for resampling.

Therefore, to decouple data collection from the policy being trained, researchers developed off-policy methods, chiefly PPO/TRPO and PPO2.

Importance Sampling

Importance sampling isn’t unique to RL. Simply put, it lets us estimate an expectation under a target distribution \(p(x)\) using samples drawn from a different distribution \(q(x)\). In off-policy learning this means we let a different policy \({\theta}'\) interact with the environment and use the data collected by \({\theta}'\) to train the \(\theta\) we actually care about. It is a bit like letting one child learn a game by watching another child play it.

Using this method, the data collected by \({\theta}'\) can be reused for several updates of \(\theta\), and we no longer need to worry about it becoming obsolete every time \(\theta\) changes.

Specifically, estimating an expectation under \(p(x)\) using samples drawn from \(q(x)\) can be written as:

\[\mathbb{E}_{x \sim p} \left [ f(x) \right ] = \int f(x)\, p(x)\, dx = \int \frac{p(x)}{q(x)} f(x)\, q(x)\, dx = \mathbb{E}_{x \sim q} \left [ \frac{p(x)}{q(x)} f(x) \right ]\]
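A quick NumPy illustration: estimating \(\mathbb{E}_{x \sim p}[f(x)]\) from samples drawn from a different distribution \(q\), using \(p(x)/q(x)\) as the correction weight. The particular distributions and \(f\) are arbitrary choices for the demo; the true value here is 1.

```python
import numpy as np

rng = np.random.default_rng(0)

def normal_pdf(x, mu, sigma):
    """Density of a normal distribution, used for both p and q in this demo."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

f = lambda x: x ** 2
x = rng.normal(loc=1.0, scale=2.0, size=200_000)              # samples from q = N(1, 2)
weights = normal_pdf(x, 0.0, 1.0) / normal_pdf(x, 1.0, 2.0)   # importance weights p(x) / q(x)
print(np.mean(weights * f(x)))   # approximates E_{x~p}[f(x)] = 1 for the target p = N(0, 1)
```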

Proximal Policy Optimization (PPO)

Applying importance sampling to the policy gradient, and rewriting it per step in terms of the advantage function as in the previous section, gives us:

\[\begin{align} \nabla \overline{R_{\theta}} & = \mathbb{E}_{\tau \sim p_{\theta}(\tau)} \left [ R(\tau) \nabla \log p_{\theta}(\tau) \right ] \\ & = \mathbb{E}_{\tau \sim p_{\theta'}(\tau)} \left [ \frac{p_{\theta}(\tau)}{p_{\theta'}(\tau)} R(\tau) \nabla \log p_{\theta}(\tau) \right ] \\ & \rightarrow \mathbb{E}_{(s_t, a_t) \sim \pi_{\theta}} \left [ A^{\theta}(s_t, a_t) \nabla \log p_{\theta}(a_t | s_t) \right ] \\ & = \mathbb{E}_{(s_t, a_t) \sim \pi_{\theta'}} \left [ \frac{p_{\theta}(s_t, a_t)}{p_{\theta'}(s_t, a_t)} A^{\theta'}(s_t, a_t) \nabla \log p_{\theta}(a_t | s_t) \right ] \\ & = \mathbb{E}_{(s_t, a_t) \sim \pi_{\theta'}} \left [ \frac{p_{\theta}(a_t | s_t)\, p_{\theta}(s_t)}{p_{\theta'}(a_t | s_t)\, p_{\theta'}(s_t)} A^{\theta'}(s_t, a_t) \nabla \log p_{\theta}(a_t | s_t) \right ] \\ \end{align}\]

Assuming \(p_{\theta'}(s_t) \approx p_{\theta}(s_t)\), the equation simplifies to:

\[\mathbb{E}_{(s_t, a_t) \sim \pi_{\theta'}} \left [ \frac{p_{\theta}(a_t | s_t)}{p_{\theta'}(a_t | s_t)} A^{\theta'}(s_t, a_t) \nabla \log p_{\theta}(a_t | s_t) \right ]\]

Using \(\pi_{\theta'}\) in place of \(\pi_{\theta}\) means letting the agent \(\pi_{\theta'}\) interact with the environment and updating the policy based on its data. Importance sampling has one drawback: the corrected estimator has the same expectation, but its variance is different, and with too few samples the estimate becomes unreliable. We therefore need as much sample data as possible while keeping the two distributions similar.

Reading the identity \(\nabla f(x) = f(x) \nabla \log f(x)\) in reverse, the gradient above is the gradient of an objective function, which we call \(J^{\theta'}(\theta)\):

\[J^{\theta'}(\theta) = \mathbb{E}_{(s_t, a_t) \sim \pi_{\theta'}} \left [ \frac{p_{\theta}(a_t | s_t)}{p_{\theta'}(a_t | s_t)} A^{\theta'}(s_t, a_t) \right ]\]

\(J^{\theta'}(\theta)\) is quite intuitive: we use data demonstrated by \(\theta'\) to optimize the parameters \(\theta\) we actually want. Because of importance sampling, the two distributions must stay close, which leads to the PPO approach: add a penalty term \(\beta KL(\theta, \theta')\) during training (\(\beta\) is a constant), where \(KL(\theta, \theta')\) is the KL divergence between the two policies’ action distributions. This gives:

\[J_{PPO}^{\theta'}(\theta) = J^{\theta'}(\theta) - \beta KL(\theta, \theta')\]

The greater the divergence between \(\theta\) and \(\theta'\), the smaller \(J_{PPO}^{\theta'}(\theta)\) becomes; maximizing this objective therefore improves the expected reward while keeping \(\theta\) close to \(\theta'\).

Notably, the KL divergence \(KL(\theta, \theta')\) measures the difference between the two policies’ action distributions, not the distance between their parameters. PPO’s algorithm can be summarized as:

  • Initialize a policy \(\theta^{0}\)
  • During each iteration:
    • Use \(\theta'\) to interact with the environment, collect data \(\{ s_t, a_t \}\), and compute \(A^{\theta'}(s_t, a_t)\)
    • Find \(\theta\) to optimize \(J_{PPO}(\theta)\), where \(J_{PPO}^{\theta'}(\theta) = J^{\theta'}(\theta) - \beta KL(\theta, \theta')\)

For the penalty term \(\beta KL(\theta, \theta')\), we can adjust \(\beta\) dynamically by setting a maximum and minimum acceptable KL divergence (see the sketch after this list):

  • If \(KL(\theta', \theta) > KL_{max}\), increase \(\beta\)
  • If \(KL(\theta', \theta) < KL_{min}\), decrease \(\beta\)
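A minimal PyTorch sketch of the KL-penalized objective together with the adaptive-\(\beta\) rule above, assuming discrete actions. `new_logits` and `old_logits` are the action logits produced by \(\theta\) and \(\theta'\) on the sampled states, `actions` and `advantages` come from the data collected by \(\theta'\), and the \(\times 2\) / \(\times 0.5\) factors are common heuristic choices rather than fixed parts of the algorithm; all names are placeholders.

```python
import torch
import torch.nn.functional as F

def ppo_kl_loss(new_logits, old_logits, actions, advantages, beta):
    """Returns (-J_PPO, KL): negative KL-penalized surrogate, averaged over sampled (s_t, a_t)."""
    new_logp = F.log_softmax(new_logits, dim=-1)
    old_logp = F.log_softmax(old_logits, dim=-1).detach()            # theta' stays fixed
    ratio = (new_logp - old_logp).gather(1, actions.unsqueeze(1)).squeeze(1).exp()
    surrogate = ratio * advantages                                   # p_theta / p_theta' * A
    kl = F.kl_div(new_logp, old_logp.exp(), reduction="batchmean")   # KL(theta' || theta) over actions
    return -(surrogate.mean() - beta * kl), kl

def adapt_beta(beta, kl, kl_min, kl_max):
    """Grow beta if the policies drift too far apart, shrink it if they stay too close."""
    if kl > kl_max:
        beta *= 2.0      # heuristic factor
    elif kl < kl_min:
        beta *= 0.5      # heuristic factor
    return beta
```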

Trust Region Policy Optimization (TRPO)

Another method is TRPO (Trust Region Policy Optimization), which differs from PPO in its constraint design:

\[J_{TRPO}^{\theta'}(\theta) = \mathbb{E}_{(s_t, a_t) \sim \pi_{\theta'}} \left [ \frac{p_{\theta}(a_t | s_t)}{p_{\theta'}(a_t | s_t)} A^{\theta'}(s_t, a_t) \right ], \quad KL(\theta, \theta') < \delta\]

However, this external constraint is difficult to handle in practice, so TRPO is generally not recommended (PPO and TRPO perform similarly, but PPO is simpler to implement).

PPO2

The PPO2 algorithm is an extension of PPO designed to limit the difference between the two distributions \(\theta\) and \(\theta'\):

\[J_{PPO2}^{\theta^k}(\theta) \approx \sum_{(s_t, a_t)} min \left [ \frac{p_{\theta}(a_t | s_t)}{p_{\theta^k}(a_t | s_t)} A^{\theta^k}(s_t, a_t), clip(\frac{p_{\theta}(a_t | s_t)}{p_{\theta^k}(a_t | s_t)}, 1 - \varepsilon, 1 + \varepsilon) A^{\theta^k}(s_t, a_t) \right ]\]

\(clip \left [f(x), 1 - \varepsilon, 1 + \varepsilon \right ]\) means:

\[clip \left [f(x), 1 - \varepsilon, 1 + \varepsilon \right ] = \begin{cases} f(x) & 1 - \varepsilon < f(x) < 1 + \varepsilon \\ 1 - \varepsilon & f(x) < 1 - \varepsilon \\ 1 + \varepsilon & f(x) > 1 + \varepsilon \end{cases}\]

This keeps the ratio within \([1 - \varepsilon, 1 + \varepsilon]\), i.e. keeps the two distributions close. The min means that when \(A^{\theta^k}(s_t, a_t) > 0\) we want to push the objective up, but once the probability ratio exceeds \(1 + \varepsilon\) there is no further benefit, as that would violate the similarity constraint; symmetrically, when the advantage is negative, pushing the ratio below \(1 - \varepsilon\) brings no further benefit either.
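A hedged PyTorch sketch of the clipped objective, with the same placeholder inputs as the previous sketch (logits from \(\theta\) and \(\theta^k\), the sampled actions, and their advantages):

```python
import torch
import torch.nn.functional as F

def ppo2_clip_loss(new_logits, old_logits, actions, advantages, eps=0.2):
    """-J_PPO2: negative of min(ratio * A, clip(ratio, 1-eps, 1+eps) * A)."""
    new_logp = F.log_softmax(new_logits, dim=-1)
    old_logp = F.log_softmax(old_logits, dim=-1).detach()            # theta^k stays fixed
    ratio = (new_logp - old_logp).gather(1, actions.unsqueeze(1)).squeeze(1).exp()
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return -torch.min(unclipped, clipped).mean()                     # maximize J <=> minimize -J
```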

Q-Learning

Aside from directly learning a policy, we can take a different approach and learn a critic, which is a value-based method. The critic evaluates how good a state (or a state-action pair) is under the current policy.

Here we introduce the state value function \(V^{\pi}(s)\): the expected cumulative reward obtained by following policy \(\pi\) from state \(s\) onward. The higher \(V^{\pi}(s)\), the more reward the agent can expect to collect, which is exactly the kind of guidance the critic provides.

Monte Carlo (MC) and Temporal-Difference (TD)

Two common methods to estimate \(V^{\pi}(s)\) are MC and TD, each with advantages and drawbacks.

MC methods train a DNN that, given a state \(s_a\) as input, outputs the predicted cumulative reward from \(s_a\) to the end of the episode, aiming to match the actually observed return \(G_a\) as closely as possible:

\[s_a \rightarrow network \left [ V^{\pi} \right ] \rightarrow V^{\pi}(s_a) \leftrightarrow G_a\]

TD-based methods, unlike MC, don’t require accumulating the rewards of an entire episode: with TD you don’t need to wait for the game to finish before you can learn, which is helpful when episodes are long.

In TD-based methods, each step provides \(...s_t, a_t, r_t, s_{t + 1}...\), indicating that \(V^{\pi}(s_t)\) satisfies:

\[V^{\pi}(s_t) = V^{\pi}(s_{t + 1}) + r_t\]

We can evaluate two copies of the same network \(V^{\pi}\), one on \(s_t\) and one on \(s_{t+1}\), and train it so that the difference \(V^{\pi}(s_{t}) - V^{\pi}(s_{t + 1})\) approaches the observed reward \(r_t\), keeping the network consistent with the training data.

This approach eliminates the need for the entire episode’s reward sum, estimating \(V^{\pi}\) by using reward differences between consecutive steps.

MC’s main drawback is the high variance of \(G_a\), which accumulates randomness from the environment and from every subsequent action; TD only involves the randomness of a single reward \(r\), at the price of relying on an estimate of \(V^{\pi}(s_{t+1})\) that may itself be inaccurate.
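A small PyTorch sketch contrasting the two regression targets for a value network; `value_net` (a network mapping a state tensor to a scalar), the state tensors, and the observed return are placeholders:

```python
import torch
import torch.nn.functional as F

def mc_value_loss(value_net, state, episode_return):
    """MC: regress V(s_a) toward the observed full return G_a (a tensor)."""
    return F.mse_loss(value_net(state), episode_return)

def td_value_loss(value_net, state, reward, next_state):
    """TD: push V(s_t) toward r_t + V(s_{t+1}); the bootstrap target is treated as a constant."""
    with torch.no_grad():
        target = reward + value_net(next_state)   # r_t + V(s_{t+1}), no gradient through it
    return F.mse_loss(value_net(state), target)
```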

State-action Value Function \(Q^{\pi}(s, a)\)

Now for an upgraded version of \(V^{\pi}(s)\): the Q function at the heart of Q-Learning. Unlike \(V^{\pi}(s)\), which evaluates a state without fixing the action taken there, the Q function takes both the state and the action as input:

\[Q^{\pi}(s_t, a_t)\]

The cumulative reward is then defined just as for \(V^{\pi}\). Q-Learning can be summarized as follows:

  • Initialize an actor \(\pi\)
  • In each iteration:
    • Actor \(\pi\) interacts with the environment, collecting \(s, a, r\) data.
    • Using this data, TD or MC estimates the Q function \(Q^{\pi}(s, a)\).
    • Find a “better” \(\pi'\) than \(\pi\) based on Q values.
    • Replace \(\pi\) with \(\pi'\).

A “better” policy means that for every state s, \(V^{\pi'}(s) \geq V^{\pi}(s)\), where \(\pi'(s) = arg max_{a} Q^{\pi}(s, a)\) selects the action that maximizes \(Q^{\pi}(s, a)\). If the action space is discrete, you can simply evaluate every action; continuous actions are trickier.

Proof that \(\pi'(s) = arg max_{a} Q^{\pi}(s, a)\) implies \(V^{\pi'}(s) \geq V^{\pi}(s)\) for any s:

\[V^{\pi}(s) = Q^{\pi}(s, \pi(s)) \leq max_a Q^{\pi}(s, a) = Q^{\pi}(s, \pi'(s))\]

Given any specific s, the action selected by \(\pi'\) won’t yield a lower expected return than the one selected by \(\pi\), so if every subsequent step follows \(\pi'\)’s actions:

\[\begin{align} V^{\pi}(s) & \leq Q^{\pi}(s, \pi'(s)) \\ & = \mathbb{E} \left [ r_{t+1} + V^{\pi}(s_{t+1})_{|s_t = s, a_t = \pi'(s_t)} \right ] \\ & \leq \mathbb{E} \left [ r_{t+1} + Q^{\pi}(s_{t+1}, \pi'(s_{t+1}))_{|s_t = s, a_t = \pi'(s_t)} \right ] \\ & = \mathbb{E} \left [ r_{t+1} + r_{t+2} + V^{\pi}(s_{t+2})_{| ...} \right ] \\ & \leq \mathbb{E} \left [ r_{t+1} + r_{t+2} + Q^{\pi}(s_{t+2}, \pi'(s_{t+2}))_{| ...} \right ] ... = V^{\pi'}(s) \end{align}\]

Target Network

If using TD to train a neural network for Q, two identical DNNs are required:

\[(s_t, a_t) \rightarrow \left [ Q^{\pi} \right ] \rightarrow Q^{\pi}(s_t, a_t) \leftrightarrow r_t + Q^{\pi}(s_{t+1}, \pi(s_{t+1})) \leftarrow \left [ Q^{\pi} \right ] \leftarrow (s_{t+1}, \pi(s_{t+1}))\]

The difference between the two outputs should be \(r_t\). During training, the network that produces the target value \(r_t + Q^{\pi}(s_{t+1}, \pi(s_{t+1}))\) from the input \((s_{t+1}, \pi(s_{t+1}))\) is kept fixed for stability (the target network); it is only synchronized with the trained network after a certain number of iterations.

Exploitation vs. Exploration

RL always involves balancing exploration (trying new actions) and exploitation (getting the highest reward). This classic problem, called the multi-armed bandit problem, goes as follows:

You’re in a casino with K slot machines, each with an unknown probability of paying out. You have T coins, and each coin buys one pull on one machine. How do you maximize your total earnings?

This is a trade-off between exploration and exploitation. Q-Learning faces the same challenge: if some actions for a given state \(s\) have never been sampled, the action with the highest estimated Q-value will always be chosen, and the unsampled actions will remain unexplored.

One intuitive solution is Epsilon Greedy:

\[a = \begin{cases} arg max_a Q^{\pi}(s, a) & \text{with probability } 1 - \varepsilon \\ \text{random action} & \text{with probability } \varepsilon \end{cases}\]

With probability \(\varepsilon\), exploration takes place; \(\varepsilon\) is usually decayed over time, so exploration shrinks as the Q estimates become more reliable.
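A tiny NumPy sketch of \(\varepsilon\)-greedy selection, where `q_values` holds \(Q(s, \cdot)\) for the current state:

```python
import numpy as np

rng = np.random.default_rng()

def epsilon_greedy(q_values, epsilon):
    """Exploit the best-known action with probability 1 - epsilon, otherwise explore at random."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))   # random exploratory action
    return int(np.argmax(q_values))               # greedy action
```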

Or, to avoid random sampling, use a probability distribution over Q-values, like in policy gradient methods. Actions with higher Q-values have higher probabilities, but all actions have some chance. This method, called Boltzmann Exploration, is:

\[P(a|s) = \frac{exp(Q(s, a))}{\sum_a exp(Q(s, a))}\]

The exponential maps Q-values, which may be positive or negative, to positive numbers that can be normalized into a probability distribution.
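And a corresponding sketch of Boltzmann exploration:

```python
import numpy as np

rng = np.random.default_rng()

def boltzmann_action(q_values):
    """Sample an action with probability proportional to exp(Q(s, a))."""
    z = np.exp(np.asarray(q_values) - np.max(q_values))   # subtract the max for numerical stability
    probs = z / z.sum()
    return int(rng.choice(len(q_values), p=probs))
```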

Replay Buffer

In Q-Learning, the policy interacts with the environment and the resulting transitions \((s_t, a_t, r_t, s_{t+1})\) are stored in a replay buffer. The buffer can hold a large amount of data and only discards the oldest entries when it is full.

With the replay buffer, learning can be considered off-policy, which increases training efficiency. Because the buffer contains data generated by several past policies, it provides diverse samples for estimating \(Q^{\pi}(s, a)\).

In summary, a typical Deep Q-Learning algorithm can be described as follows (a code sketch follows the list):

  • Initialize two Q functions: Q and target Q function \(\hat{Q} = Q\)
  • In each episode:
    • In each iteration:
      • Given an input state \(s_t\), choose an action \(a_t\) based on Q.
      • Obtain reward \(r_t\) and transition to the next state \(s_{t+1}\).
      • Store \((s_t, a_t, r_t, s_{t+1})\) in the replay buffer.
      • Sample data from the replay buffer, usually by batch.
      • Train and update Q’s parameters, ensuring \(Q(s_i, a_i)\) approaches \(y = r_i + max_a \hat{Q}(s_{i+1}, a)\)
    • Every N iterations, update \(\hat{Q} = Q\)
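A condensed PyTorch sketch of the loop above, assuming the same simplified gym-style interface as the first sketch (`reset()` returns a state, `step(a)` returns `(state, reward, done)`) plus a gym-style `action_space.sample()`, and a `q_net` that maps a state to one Q-value per discrete action. The hyperparameters are illustrative placeholders, and a discount factor \(\gamma\) is included in the target.

```python
import copy
import random
from collections import deque

import torch
import torch.nn.functional as F

def train_dqn(env, q_net, episodes=500, gamma=0.99, epsilon=0.1,
              batch_size=32, sync_every=100, buffer_size=10_000):
    target_net = copy.deepcopy(q_net)                  # \hat{Q} = Q
    buffer = deque(maxlen=buffer_size)                 # replay buffer: drops the oldest data when full
    optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
    step = 0
    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            # epsilon-greedy action selection based on Q
            if random.random() < epsilon:
                action = env.action_space.sample()
            else:
                action = int(q_net(torch.as_tensor(state, dtype=torch.float32)).argmax())
            next_state, reward, done = env.step(action)
            buffer.append((state, action, reward, next_state, done))
            state = next_state
            if len(buffer) >= batch_size:
                batch = random.sample(buffer, batch_size)          # sample a minibatch from the buffer
                s, a, r, s2, d = map(lambda xs: torch.as_tensor(xs, dtype=torch.float32), zip(*batch))
                q = q_net(s).gather(1, a.long().unsqueeze(1)).squeeze(1)       # Q(s_i, a_i)
                with torch.no_grad():                                          # target uses frozen \hat{Q}
                    y = r + gamma * (1 - d) * target_net(s2).max(dim=1).values
                loss = F.mse_loss(q, y)                                        # push Q(s_i, a_i) toward y
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
            step += 1
            if step % sync_every == 0:
                target_net.load_state_dict(q_net.state_dict())                 # every N iterations: \hat{Q} = Q
    return q_net
```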