
Policy Gradient

A policy-based reinforcement learning algorithm that directly optimizes the policy using gradient ascent on expected returns.

Family: Reinforcement Learning · Status: 📋 Planned


Overview

Policy Gradient methods are a class of reinforcement learning algorithms that directly optimize the policy function using gradient ascent on the expected return. Unlike value-based methods such as Q-Learning, policy gradient methods learn the policy directly without needing to learn value functions first.

These methods are particularly useful for continuous action spaces and for problems where the optimal policy is stochastic. They work by computing gradients of the expected return with respect to the policy parameters and updating the parameters in the direction that increases expected return.

The most common policy gradient algorithm is REINFORCE, which uses the policy gradient theorem to derive unbiased gradient estimates from sampled trajectories.
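
To make "gradient estimates from sampled trajectories" concrete, here is a tiny self-contained illustration (added as a sketch; the two-armed bandit setup, reward values, and learning rate are assumptions, not part of the original page). The sampled reward weights ∇_θ log π(a;θ), and the action preferences drift toward the better action:

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.zeros(2)                       # action preferences for a 2-armed bandit
true_means = np.array([0.2, 0.8])         # illustrative: action 1 pays more on average

for _ in range(2000):
    prefs = theta - theta.max()           # softmax policy π(a;θ), numerically stabilized
    probs = np.exp(prefs) / np.exp(prefs).sum()
    a = rng.choice(2, p=probs)            # sample an action from the policy
    r = rng.normal(true_means[a], 0.1)    # observe a noisy reward
    grad_log = -probs                     # ∇_θ log π(a;θ) for a softmax policy
    grad_log[a] += 1.0
    theta += 0.05 * r * grad_log          # score-function gradient ascent step

print(probs)  # most probability mass ends up on the better action
```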

Mathematical Formulation


Problem Definition

Given:

  • State space: S
  • Action space: A
  • Policy: π(a|s;θ) parameterized by θ
  • Reward function: R(s,a,s')
  • Discount factor: γ ∈ [0,1]

Find policy parameters θ that maximize expected return:

J(θ) = E_π[∑_{t=0}^∞ γ^t R(s_t, a_t, s_{t+1})], where the expectation is over trajectories generated by following π(·|·;θ).

Using gradient ascent: θ ← θ + α ∇_θ J(θ)
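
As a minimal sketch of what "parameterized policy plus gradient ascent" can look like in code (the linear softmax parameterization and names below are illustrative assumptions, not the page's reference implementation):

```python
import numpy as np

def softmax_policy(theta, phi_s):
    """π(a|s;θ): softmax over linear action preferences θᵀφ(s,a).

    phi_s has shape (num_actions, num_features), one feature row per action.
    """
    prefs = phi_s @ theta
    prefs -= prefs.max()                  # numerical stability
    probs = np.exp(prefs)
    return probs / probs.sum()

def gradient_ascent_step(theta, grad_estimate, alpha=0.01):
    """θ ← θ + α ∇̂_θ J(θ); grad_estimate comes from sampled trajectories."""
    return theta + alpha * grad_estimate
```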

Key Formulas

Policy Gradient Theorem

∇_θ J(θ) = E[∑_{t=0}^∞ γ^t ∇_θ log π(a_t|s_t;θ) A^π(s_t,a_t)]

Gradient of expected return with respect to policy parameters
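
For intuition, the score function ∇_θ log π appearing in the theorem has a simple closed form when the policy is a softmax over linear action preferences (a worked example under that assumed parameterization, not part of the original formulation):

```latex
% Assumed softmax parameterization (illustrative):
\pi(a \mid s; \theta) = \frac{\exp\big(\theta^{\top}\phi(s,a)\big)}{\sum_{b}\exp\big(\theta^{\top}\phi(s,b)\big)}
% Its score function is the chosen action's features minus their policy-weighted mean:
\nabla_{\theta} \log \pi(a \mid s; \theta) = \phi(s,a) - \sum_{b}\pi(b \mid s; \theta)\,\phi(s,b)
```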


REINFORCE Update

θ ← θ + α ∑_{t=0}^T γ^t R_t ∇_θ log π(a_t|s_t;θ)

Policy parameter update using sampled returns
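
Here R_t denotes the discounted return from time step t onward, computed from a sampled episode. A minimal sketch of that computation (my own illustration; the function name is assumed):

```python
def discounted_returns(rewards, gamma=0.99):
    """R_t = r_t + γ·r_{t+1} + γ²·r_{t+2} + ..., computed backwards in O(T)."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns
```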


Baseline Subtraction

θ ← θ + α ∑_{t=0}^T γ^t (R_t - b(s_t)) ∇_θ log π(a_t|s_t;θ)

Reduces variance by subtracting baseline from returns
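
A common concrete choice is b(s_t) = the batch-average return or a learned state-value estimate V(s_t); subtracting it leaves the gradient unbiased while shrinking its variance. A minimal sketch of the (R_t - b(s_t)) computation (illustrative names and defaults):

```python
import numpy as np

def advantages_with_baseline(returns, baseline=None):
    """Compute the (R_t - b(s_t)) terms for the baseline-corrected update.

    If no learned baseline values are supplied, fall back to the mean return,
    a simple variance-reduction default.
    """
    returns = np.asarray(returns, dtype=float)
    if baseline is None:
        baseline = returns.mean()
    return returns - np.asarray(baseline, dtype=float)
```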


Key Properties


  • Policy-Based: Directly optimizes the policy function
  • Continuous Actions: Naturally handles continuous action spaces
  • Stochastic Policies: Can learn stochastic optimal policies
  • High Variance: Gradient estimates have high variance

Implementation Approaches


REINFORCE: the basic policy gradient algorithm, which uses complete-episode Monte Carlo returns as the learning signal (a code sketch follows the lists below).

Complexity:

  • Time: O(episodes × steps_per_episode × policy_forward_pass)
  • Space: O(policy_parameters)

Advantages

  • Directly optimizes policy

  • Handles continuous action spaces naturally

  • Can learn stochastic policies

  • Theoretically sound convergence guarantees

Disadvantages

  • High variance in gradient estimates

  • Sample inefficient

  • Slow convergence

  • Requires complete episodes
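
A minimal, self-contained sketch of this Monte Carlo approach with a tabular softmax policy is shown below. It is illustrative only: the minimal gym-style environment interface (reset() returning a state index, step(a) returning (next_state, reward, done)), the hyperparameters, and all names are assumptions rather than the project's actual source code.

```python
import numpy as np

def softmax(prefs):
    prefs = prefs - prefs.max()              # numerical stability
    e = np.exp(prefs)
    return e / e.sum()

def run_episode(env, theta, gamma=0.99, max_steps=200):
    """Sample one trajectory with the current policy; return states, actions, returns-to-go."""
    states, actions, rewards = [], [], []
    s = env.reset()
    for _ in range(max_steps):
        probs = softmax(theta[:, s])         # π(·|s;θ) for a tabular softmax policy
        a = np.random.choice(len(probs), p=probs)
        s_next, r, done = env.step(a)
        states.append(s); actions.append(a); rewards.append(r)
        s = s_next
        if done:
            break
    returns, running = [0.0] * len(rewards), 0.0
    for t in reversed(range(len(rewards))):  # discounted returns-to-go R_t
        running = rewards[t] + gamma * running
        returns[t] = running
    return states, actions, returns

def reinforce_update(theta, states, actions, returns, alpha=0.05, gamma=0.99):
    """θ ← θ + α Σ_t γ^t R_t ∇_θ log π(a_t|s_t;θ) for the tabular softmax policy."""
    for t, (s, a, R) in enumerate(zip(states, actions, returns)):
        probs = softmax(theta[:, s])
        grad_log = -probs                    # softmax score function: e_a − π(·|s)
        grad_log[a] += 1.0
        theta[:, s] += alpha * (gamma ** t) * R * grad_log
    return theta
```

Training alternates run_episode and reinforce_update over many episodes; because full returns are needed, updates can only be applied once an episode terminates.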

REINFORCE with Baseline: policy gradient with a value-function baseline for variance reduction (a code sketch follows the lists below).

Complexity:

  • Time: O(episodes × steps_per_episode × (policy_forward_pass + value_forward_pass))
  • Space: O(policy_parameters + value_parameters)

Advantages

  • Reduces variance compared to REINFORCE

  • More sample efficient

  • Can learn online (no need for complete episodes)

  • Combines benefits of policy and value methods

Disadvantages

  • More complex implementation

  • Requires tuning of two networks

  • Can be unstable during training

  • Still has higher variance than value-based methods
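
The baseline variant can reuse the trajectory sampling shown above and add a learned state-value baseline; a minimal sketch under the same tabular assumptions (names and learning rates are illustrative, not the project's actual source code):

```python
import numpy as np

def reinforce_baseline_update(theta, V, states, actions, returns,
                              alpha_policy=0.05, alpha_value=0.1, gamma=0.99):
    """θ ← θ + α Σ_t γ^t (R_t − V(s_t)) ∇_θ log π(a_t|s_t;θ), with V trained toward R_t.

    theta: (num_actions, num_states) softmax preferences; V: (num_states,) value baseline.
    """
    for t, (s, a, R) in enumerate(zip(states, actions, returns)):
        advantage = R - V[s]                 # (R_t − b(s_t)) with b = learned state value
        V[s] += alpha_value * advantage      # move the baseline toward the observed return
        prefs = theta[:, s] - theta[:, s].max()
        probs = np.exp(prefs) / np.exp(prefs).sum()
        grad_log = -probs                    # softmax score function: e_a − π(·|s)
        grad_log[a] += 1.0
        theta[:, s] += alpha_policy * (gamma ** t) * advantage * grad_log
    return theta, V
```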

Complete Implementation

The full implementation, with error handling, comprehensive testing, and additional variants, is available in the source code.

Complexity Analysis


Time & Space Complexity Comparison

  • REINFORCE
      • Time: O(episodes × steps_per_episode × policy_forward_pass)
      • Space: O(policy_parameters)
      • Notes: time is dominated by policy-network forward passes; space covers only the policy parameters.
  • REINFORCE with Baseline
      • Time: O(episodes × steps_per_episode × (policy_forward_pass + value_forward_pass))
      • Space: O(policy_parameters + value_parameters)
      • Notes: adds a value-network forward pass per step and stores the value-function parameters alongside the policy.

Use Cases & Applications


Application Categories

Continuous Control

  • Robot Control: Learning continuous control policies for robots

  • Autonomous Vehicles: Learning steering and acceleration policies

  • Game Playing: Learning continuous control in games

  • Physics Simulation: Learning control policies in physics engines

Stochastic Environments

  • Financial Trading: Learning stochastic trading policies

  • Resource Allocation: Learning allocation policies under uncertainty

  • Game Theory: Learning mixed strategies in games

  • Multi-Agent Systems: Learning policies in competitive environments

High-Dimensional Action Spaces

  • Robotics: Learning control for high-DOF robots

  • Animation: Learning motion policies for characters

  • Music Generation: Learning policies for music composition

  • Text Generation: Learning policies for text generation

Real-World Applications

  • Recommendation Systems: Learning stochastic recommendation policies

  • Ad Placement: Learning placement policies with uncertainty

  • Energy Management: Learning energy allocation policies

  • Traffic Control: Learning traffic light control policies


Educational Value

  • Policy-Based Methods: Understanding direct policy optimization approaches

  • Gradient Methods: Learning gradient-based optimization in RL

  • Variance Reduction: Understanding techniques to reduce variance in estimates

  • Continuous Control: Understanding how to handle continuous action spaces

References & Further Reading

Core Textbooks

  • Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd ed.). MIT Press. ISBN 978-0-262-03924-6.
  • Szepesvári, C. (2010). Algorithms for Reinforcement Learning. Morgan & Claypool. ISBN 978-1-60845-492-1.

Policy Gradient

  • Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8, 229-256. (The original REINFORCE paper.)
  • Sutton, R. S., McAllester, D., Singh, S., & Mansour, Y. (2000). Policy gradient methods for reinforcement learning with function approximation. Advances in Neural Information Processing Systems, 12, 1057-1063.

Actor-Critic

  • Barto, A. G., Sutton, R. S., & Anderson, C. W. (1983). Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics, 13, 834-846.
  • Konda, V. R., & Tsitsiklis, J. N. (2000). Actor-critic algorithms. Advances in Neural Information Processing Systems, 12, 1008-1014.

Online Resources

  • Wikipedia article on policy gradient methods
  • OpenAI Spinning Up RL tutorial
  • GeeksforGeeks policy gradient implementation

Implementation & Practice

  • RL environment library for testing algorithms
  • High-quality RL algorithm implementations
  • Scalable RL library for production use

Interactive Learning

Try implementing the different approaches yourself! This progression will give you deep insight into the algorithm's principles and applications.

Pro Tip: Start with the simplest implementation and gradually work your way up to more complex variants.

Related Algorithms in Reinforcement Learning:

  • SARSA - An on-policy temporal difference learning algorithm that learns action-value functions by following the current policy.

  • Actor-Critic - A hybrid reinforcement learning algorithm that combines policy gradient methods with value function approximation for improved learning efficiency.

  • Q-Learning - A model-free reinforcement learning algorithm that learns optimal action-value functions through temporal difference learning.

  • Proximal Policy Optimization (PPO) - A state-of-the-art policy gradient algorithm that uses a clipped surrogate objective to keep policy updates stable while improving sample efficiency.

  • Deep Q-Network (DQN) - A deep reinforcement learning algorithm that uses neural networks to approximate Q-functions for high-dimensional state spaces.