Actor-Critic

A hybrid reinforcement learning algorithm that combines policy gradient methods with value function approximation for improved learning efficiency.

Family: Reinforcement Learning | Status: 📋 Planned

Overview

Actor-Critic is a hybrid reinforcement learning algorithm that combines the benefits of both policy-based and value-based methods. It uses two neural networks: an "actor" that learns the policy and a "critic" that learns the value function. The critic provides a baseline for the actor's policy updates, significantly reducing the variance of gradient estimates compared to pure policy gradient methods like REINFORCE.

The algorithm works by having the actor select actions based on the current policy, while the critic evaluates the quality of the current state or state-action pair. The critic's value estimates are then used to compute advantages, which guide the actor's policy updates. This combination allows for more stable and sample-efficient learning compared to pure policy gradient methods.

Actor-Critic methods can be applied to both discrete and continuous action spaces and are particularly effective in environments with high-dimensional state spaces.
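
To make the two-network structure concrete, the following is a minimal PyTorch sketch of an actor and a critic for a discrete action space. It is an illustrative example rather than this page's reference implementation; the class names, hidden-layer sizes, and choice of a categorical policy are assumptions.

```python
import torch
import torch.nn as nn


class Actor(nn.Module):
    """Maps a state to a categorical distribution over discrete actions: the policy pi(a|s; theta)."""

    def __init__(self, state_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, state: torch.Tensor) -> torch.distributions.Categorical:
        return torch.distributions.Categorical(logits=self.net(state))


class Critic(nn.Module):
    """Maps a state to a scalar value estimate V(s; phi)."""

    def __init__(self, state_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state).squeeze(-1)
```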

Mathematical Formulation

Problem Definition

Given:

  • State space: S
  • Action space: A
  • Policy: π(a|s;θ) parameterized by θ (Actor)
  • Value function: V(s;φ) parameterized by φ (Critic)
  • Reward function: R(s,a,s')
  • Discount factor: γ ∈ [0,1]

Find parameters θ and φ that maximize expected return:

J(θ) = E_π[∑_{t=0}^∞ γ^t R(s_t, a_t, s_{t+1})],

where the expectation is over trajectories generated by following π(·|·;θ).

Both sets of parameters are updated with stochastic gradient steps, using gradient ascent on the policy objective J(θ) and gradient descent on the critic's loss L(φ):

θ ← θ + α_θ ∇_θ J(θ)

φ ← φ - α_φ ∇_φ L(φ)
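
In an automatic-differentiation framework these two updates are usually expressed as two separate optimizers: the actor minimizes an estimate of -J(θ) (which is gradient ascent on J), while the critic minimizes L(φ). The sketch below assumes PyTorch; the optimizer choice and learning rates are illustrative assumptions.

```python
import torch
from torch import nn


def make_optimizers(actor: nn.Module, critic: nn.Module):
    """Separate optimizers for theta (actor) and phi (critic); learning rates are illustrative."""
    actor_opt = torch.optim.Adam(actor.parameters(), lr=3e-4)    # alpha_theta
    critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)  # alpha_phi
    return actor_opt, critic_opt


def apply_updates(actor_opt, critic_opt,
                  actor_loss: torch.Tensor, critic_loss: torch.Tensor) -> None:
    # Gradient ascent on J(theta), implemented as descent on actor_loss = -J_estimate(theta).
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # Gradient descent on the critic's loss L(phi).
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()
```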

Update Rules

Actor Update

θ ← θ + α_θ ∑_{t=0}^T ∇_θ log π(a_t|s_t;θ) A^π(s_t,a_t)

Policy parameters updated using advantage estimates


Critic Update

φ ← φ - α_φ ∑_{t=0}^T ∇_φ (V(s_t;φ) - V^π(s_t))²

Value function parameters updated by gradient descent on the squared prediction error; in practice the unknown target V^π(s_t) is replaced by a sampled return or a bootstrapped TD target


Advantage Estimation

A^π(s_t,a_t) = Q^π(s_t,a_t) - V^π(s_t)

Advantage computed as the difference between the Q-value and the state value; a common one-step estimate is A^π(s_t,a_t) ≈ R(s_t,a_t,s_{t+1}) + γ V(s_{t+1};φ) - V(s_t;φ)
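
A common way to realize all three update rules from a single transition (s_t, a_t, r, s_{t+1}) is to bootstrap the Q-value with the one-step TD target, as in the hedged sketch below. It assumes an actor that returns a torch Categorical distribution and a critic that returns a scalar value, as in the earlier sketch; batching over t recovers the sums above.

```python
import torch


def actor_critic_losses(actor, critic, state, action, reward, next_state, done, gamma=0.99):
    """Actor and critic losses for one transition, using a one-step TD advantage estimate."""
    value = critic(state)                              # V(s_t; phi)
    with torch.no_grad():
        next_value = critic(next_state)                # V(s_{t+1}; phi), not differentiated
        td_target = reward + gamma * (1.0 - done) * next_value
    advantage = td_target - value                      # A ≈ r + gamma * V(s') - V(s)

    # Critic update direction: minimize the squared TD error.
    critic_loss = advantage.pow(2).mean()

    # Actor update direction: ascend log pi(a|s; theta) * A, i.e. minimize its negative.
    # The advantage is detached so this loss only updates the actor's parameters theta.
    log_prob = actor(state).log_prob(action)
    actor_loss = -(log_prob * advantage.detach()).mean()
    return actor_loss, critic_loss
```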


Key Properties

  • Hybrid Approach: Combines policy-based and value-based methods

  • Variance Reduction: Uses the value function as a baseline to reduce variance

  • Online Learning: Can learn from incomplete episodes

  • Sample Efficiency: More sample efficient than pure policy gradient methods

Implementation Approaches

Standard Actor-Critic with separate actor and critic networks (a minimal end-to-end sketch follows the lists below)

Complexity:

  • Time: O(episodes × steps_per_episode × (policy_forward_pass + value_forward_pass))
  • Space: O(policy_parameters + value_parameters)

Advantages

  • Reduces variance compared to pure policy gradient methods

  • More sample efficient than REINFORCE

  • Can learn online without complete episodes

  • Combines benefits of policy and value methods

Disadvantages

  • More complex implementation than pure methods

  • Requires tuning of two networks

  • Can be unstable during training

  • Still has higher variance than pure value-based methods
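
As a concrete reference point for this basic variant, here is a compact end-to-end sketch of one-step Actor-Critic on Gymnasium's CartPole-v1. It is an illustrative, hedged example rather than the repository's implementation; the hyperparameters, network sizes, and episode count are assumptions.

```python
# Minimal one-step Actor-Critic training loop, sketched for Gymnasium's CartPole-v1.
import gymnasium as gym
import torch
import torch.nn as nn

env = gym.make("CartPole-v1")
obs_dim = env.observation_space.shape[0]
n_actions = env.action_space.n

actor = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, n_actions))
critic = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, 1))
actor_opt = torch.optim.Adam(actor.parameters(), lr=3e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)
gamma = 0.99

for episode in range(500):
    obs, _ = env.reset()
    done = False
    while not done:
        state = torch.as_tensor(obs, dtype=torch.float32)
        dist = torch.distributions.Categorical(logits=actor(state))
        action = dist.sample()

        obs, reward, terminated, truncated, _ = env.step(action.item())
        done = terminated or truncated

        # One-step TD target and advantage (bootstrap only if the state was not terminal).
        value = critic(state).squeeze(-1)
        with torch.no_grad():
            next_value = critic(torch.as_tensor(obs, dtype=torch.float32)).squeeze(-1)
            td_target = reward + gamma * (0.0 if terminated else next_value)
        advantage = td_target - value

        critic_loss = advantage.pow(2)                        # squared TD error
        actor_loss = -dist.log_prob(action) * advantage.detach()

        actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
        critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()
```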

Synchronous Actor-Critic with advantage estimation (a sketch of the n-step return computation used by this variant follows the lists below)

Complexity:

  • Time: O(episodes × steps_per_episode × (policy_forward_pass + value_forward_pass))
  • Space: O(policy_parameters + value_parameters)

Advantages

  • More stable than basic Actor-Critic

  • Better sample efficiency

  • Can handle both discrete and continuous actions

  • Synchronous updates are simpler to implement

Disadvantages

  • Still requires careful hyperparameter tuning

  • Can be slower than asynchronous methods

  • Requires more memory for n-step returns

  • May not scale as well to large state spaces
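
The main extra piece in the synchronous variant is computing n-step returns over a short rollout before a single batched update of both networks. Below is a hedged sketch of that computation; the helper's name and tensor shapes are assumptions, and the resulting advantages feed the same actor and critic losses shown earlier.

```python
import torch


def n_step_returns(rewards: torch.Tensor,
                   dones: torch.Tensor,
                   bootstrap_value: torch.Tensor,
                   gamma: float = 0.99) -> torch.Tensor:
    """Discounted n-step returns for a rollout from a single environment.

    rewards, dones: shape (n_steps,); dones holds 0.0 / 1.0 termination flags
    bootstrap_value: V(s_{t+n}) used to bootstrap the tail of the rollout
    """
    returns = torch.zeros_like(rewards)
    running = bootstrap_value
    # Walk the rollout backwards, cutting the bootstrap at episode boundaries.
    for t in reversed(range(rewards.shape[0])):
        running = rewards[t] + gamma * (1.0 - dones[t]) * running
        returns[t] = running
    return returns


# Usage sketch: advantages = n_step_returns(r, d, v_last) - values, then average
# the same actor/critic losses as before over the whole rollout batch.
```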

Complete Implementation

The full implementation with error handling, comprehensive testing, and additional variants is available in the source code.

Complexity Analysis

Time & Space Complexity Comparison

Approach: Basic Actor-Critic
Time Complexity: O(episodes × steps_per_episode × (policy_forward_pass + value_forward_pass))
Space Complexity: O(policy_parameters + value_parameters)
Notes: Time complexity includes both actor and critic network operations; space complexity includes the parameters of both networks

Use Cases & Applications

Application Categories

Continuous Control

  • Robot Control: Learning continuous control policies for robots

  • Autonomous Vehicles: Learning driving policies

  • Game Playing: Learning continuous control in games

  • Physics Simulation: Learning control policies in physics engines

Discrete Control

  • Game Playing: Learning discrete action policies in games

  • Resource Allocation: Learning allocation policies

  • Scheduling: Learning scheduling policies

  • Routing: Learning routing policies in networks

High-Dimensional State Spaces

  • Computer Vision: Learning from image inputs

  • Natural Language Processing: Learning from text inputs

  • Robotics: Learning from sensor data

  • Finance: Learning from market data

Real-Time Applications

  • Trading: Learning trading strategies in real-time

  • Robotics: Learning control policies during operation

  • Gaming: Learning game strategies while playing

  • Resource Management: Learning allocation policies during operation

Educational Value

  • Hybrid Methods: Understanding combination of policy and value methods

  • Variance Reduction: Learning techniques to reduce variance in gradient estimates

  • Online Learning: Understanding online learning capabilities in RL

  • Advantage Estimation: Understanding advantage-based policy updates

References & Further Reading

Core Textbooks

  • Sutton, R. S. & Barto, A. G., Reinforcement Learning: An Introduction (2nd ed.), MIT Press, 2018. ISBN 978-0-262-03924-6
  • Szepesvári, C., Algorithms for Reinforcement Learning, Morgan & Claypool, 2010. ISBN 978-1-60845-492-1

Actor-Critic

  • Barto, A. G., Sutton, R. S. & Anderson, C. W., "Neuronlike adaptive elements that can solve difficult learning control problems", IEEE Transactions on Systems, Man, and Cybernetics, vol. 13, pp. 834-846, 1983
  • Konda, V. R. & Tsitsiklis, J. N., "Actor-critic algorithms", Advances in Neural Information Processing Systems, vol. 12, pp. 1008-1014, 2000

A2C

  • Mnih, V. et al., "Asynchronous Methods for Deep Reinforcement Learning", ICML, 2016 (the A3C paper; A2C is its synchronous version)
  • Wu, Y. et al., "Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation", NeurIPS, 2017 (the ACKTR paper, which uses an A2C baseline)

Online Resources

  • Wikipedia article on Actor-Critic methods
  • OpenAI Spinning Up RL tutorial
  • GeeksforGeeks Actor-Critic implementation

Implementation & Practice

  • RL environment library for testing algorithms
  • High-quality RL algorithm implementations
  • Scalable RL library for production use

Interactive Learning

Try implementing the different approaches yourself! This progression will give you deep insight into the algorithm's principles and applications.

Pro Tip: Start with the simplest implementation and gradually work your way up to more complex variants.

Related Algorithms in Reinforcement Learning:

  • SARSA - An on-policy temporal difference learning algorithm that learns action-value functions by following the current policy.

  • Policy Gradient - A policy-based reinforcement learning algorithm that directly optimizes the policy using gradient ascent on expected returns.

  • Q-Learning - A model-free reinforcement learning algorithm that learns optimal action-value functions through temporal difference learning.

  • Proximal Policy Optimization (PPO) - A state-of-the-art policy gradient algorithm that uses a clipped objective to ensure stable policy updates with improved sample efficiency.

  • Deep Q-Network (DQN) - A deep reinforcement learning algorithm that uses neural networks to approximate Q-functions for high-dimensional state spaces.