Actor-Critic
A hybrid reinforcement learning algorithm that combines policy gradient methods with value function approximation for improved learning efficiency.
Family: Reinforcement Learning · Status: 📋 Planned
Overview
Actor-Critic is a hybrid reinforcement learning algorithm that combines the benefits of both policy-based and value-based methods. It uses two neural networks: an "actor" that learns the policy and a "critic" that learns the value function. The critic provides a baseline for the actor's policy updates, significantly reducing the variance of gradient estimates compared to pure policy gradient methods like REINFORCE.
The algorithm works by having the actor select actions based on the current policy, while the critic evaluates the quality of the current state or state-action pair. The critic's value estimates are then used to compute advantages, which guide the actor's policy updates. This combination allows for more stable and sample-efficient learning compared to pure policy gradient methods.
Actor-Critic methods can be applied to both discrete and continuous action spaces and are particularly effective in environments with high-dimensional state spaces.
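The sketch below makes the two-network structure described above concrete. It assumes PyTorch and a small discrete action space; the class names, layer sizes, and dimensions are illustrative and not taken from the algokit source.

```python
# Illustrative actor and critic networks (PyTorch assumed; sizes are arbitrary).
import torch
import torch.nn as nn
from torch.distributions import Categorical


class Actor(nn.Module):
    """Maps a state to a probability distribution over discrete actions: π(a|s;θ)."""

    def __init__(self, state_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, state: torch.Tensor) -> Categorical:
        return Categorical(logits=self.net(state))


class Critic(nn.Module):
    """Maps a state to a scalar value estimate V(s;φ)."""

    def __init__(self, state_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state).squeeze(-1)


# The actor selects actions; the log-probability is kept for the policy update.
actor = Actor(state_dim=4, n_actions=2)
state = torch.zeros(4)                # placeholder state
dist = actor(state)                   # π(·|s;θ)
action = dist.sample()                # a ~ π(·|s;θ)
log_prob = dist.log_prob(action)      # gradient w.r.t. θ flows through this
```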
Mathematical Formulation¶
Problem Definition
Given:
- State space: S
- Action space: A
- Policy: π(a|s;θ) parameterized by θ (Actor)
- Value function: V(s;φ) parameterized by φ (Critic)
- Reward function: R(s,a,s')
- Discount factor: γ ∈ [0,1]
Find parameters θ and φ that maximize expected return:
J(θ) = E[ ∑_{t=0}^∞ γ^t R(s_t, a_t, s_{t+1}) | π(·|·;θ) ]
Using gradient ascent on the policy objective and gradient descent on the value-function loss L(φ):

θ ← θ + α_θ ∇_θ J(θ)
φ ← φ − α_φ ∇_φ L(φ)
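For reference, the actor's gradient follows from the policy gradient theorem with the critic's value used as a baseline; this standard result is what connects the objective above to the update rules below:

```latex
\nabla_\theta J(\theta)
  = \mathbb{E}_{\pi_\theta}\!\left[\sum_{t=0}^{\infty}
      \nabla_\theta \log \pi(a_t \mid s_t; \theta)\, A^{\pi}(s_t, a_t)\right],
\qquad
A^{\pi}(s_t, a_t) = Q^{\pi}(s_t, a_t) - V^{\pi}(s_t)
```

Subtracting the baseline V^π(s_t) does not change the expected gradient, but it can substantially reduce its variance, which is the core benefit of adding the critic.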
Update Rules
Actor Update
θ ← θ + α_θ ∑_{t=0}^T ∇_θ log π(a_t|s_t;θ) A^π(s_t,a_t)
Policy parameters updated using advantage estimates
Critic Update
φ ← φ − α_φ ∑_{t=0}^T ∇_φ (V(s_t;φ) − V^π(s_t))²
Value function parameters updated by gradient descent on the squared prediction error; since V^π(s_t) is unknown, it is approximated in practice by a sampled return or the bootstrapped TD target r_t + γ V(s_{t+1};φ)
Advantage Estimation
A^π(s_t,a_t) = Q^π(s_t,a_t) - V^π(s_t)
Advantage computed as the difference between the Q-value and the state value; in practice the one-step TD error δ_t = r_t + γ V(s_{t+1};φ) − V(s_t;φ) is a common single-sample estimate of the advantage
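A minimal one-step (temporal-difference) version of these updates might look like the following, reusing the Actor and Critic sketches from the overview. The TD error serves as the single-sample advantage estimate; the function name, hyperparameters, and optimizer choice are illustrative rather than the algokit API.

```python
# One-step actor-critic update (sketch): the TD error δ_t estimates the advantage.
import torch


def actor_critic_step(actor, critic, actor_opt, critic_opt,
                      state, action, reward, next_state, done, gamma=0.99):
    value = critic(state)                                  # V(s_t;φ)
    with torch.no_grad():
        next_value = torch.zeros(()) if done else critic(next_state)
        td_target = reward + gamma * next_value            # r_t + γ V(s_{t+1};φ)
    advantage = (td_target - value).detach()               # δ_t ≈ A^π(s_t, a_t)

    # Critic: gradient descent on the squared prediction (TD) error.
    critic_loss = (td_target - value).pow(2)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor: gradient ascent on log π(a_t|s_t;θ) · advantage
    # (the optimizer minimizes, hence the minus sign).
    log_prob = actor(state).log_prob(action)
    actor_loss = -(log_prob * advantage)
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()
```

Detaching the advantage keeps the critic's gradients out of the actor update, so each network is trained only by its own loss.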
Key Properties¶
- **Hybrid Approach**: Combines policy-based and value-based methods
- **Variance Reduction**: Uses the value function as a baseline to reduce variance
- **Online Learning**: Can learn from incomplete episodes
- **Sample Efficiency**: More sample efficient than pure policy gradient methods
Implementation Approaches¶
Standard Actor-Critic with separate actor and critic networks (see the training-loop sketch after this list)
Complexity:
- Time: O(episodes × steps_per_episode × (policy_forward_pass + value_forward_pass))
- Space: O(policy_parameters + value_parameters)
Advantages
- Reduces variance compared to pure policy gradient methods
- More sample efficient than REINFORCE
- Can learn online without complete episodes
- Combines benefits of policy and value methods
Disadvantages
- More complex implementation than pure methods
- Requires tuning of two networks
- Can be unstable during training
- Still has higher variance than pure value-based methods
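Under the same assumptions as the earlier sketches (PyTorch, a Gymnasium-style environment, and the Actor, Critic, and actor_critic_step definitions above), the training loop for this variant could be wired up roughly as follows; the environment name, learning rates, and episode count are illustrative.

```python
# Sketch of an online training loop for the standard actor-critic variant.
import gymnasium as gym
import torch

env = gym.make("CartPole-v1")                      # example environment
actor = Actor(state_dim=4, n_actions=2)
critic = Critic(state_dim=4)
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-3)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

for episode in range(500):
    obs, _ = env.reset()
    done = False
    while not done:
        state = torch.as_tensor(obs, dtype=torch.float32)
        action = actor(state).sample()
        obs, reward, terminated, truncated, _ = env.step(action.item())
        done = terminated or truncated
        next_state = torch.as_tensor(obs, dtype=torch.float32)
        # Online update after every step: no complete episode is required.
        actor_critic_step(actor, critic, actor_opt, critic_opt,
                          state, action, float(reward), next_state, terminated)
```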
Synchronous Actor-Critic (A2C) with n-step advantage estimation (see the n-step return sketch after this list)
Complexity:
- Time: O(episodes × steps_per_episode × (policy_forward_pass + value_forward_pass))
- Space: O(policy_parameters + value_parameters)
Advantages
- More stable than basic Actor-Critic
- Better sample efficiency
- Can handle both discrete and continuous actions
- Synchronous updates are simpler to implement
Disadvantages
- Still requires careful hyperparameter tuning
- Can be slower than asynchronous methods
- Requires more memory for n-step returns
- May not scale as well to large state spaces
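The n-step return and advantage computation that distinguishes this synchronous (A2C-style) variant could be sketched as follows; this is an illustrative NumPy version operating on one fixed-length rollout, not the algokit implementation.

```python
# n-step returns and advantages for a rollout of length n (illustrative sketch).
import numpy as np


def n_step_advantages(rewards, values, bootstrap_value, dones, gamma=0.99):
    """rewards, values, dones: per-step arrays from one rollout of length n.
    bootstrap_value: the critic's estimate V(s_{t+n}) for the state after the rollout."""
    n = len(rewards)
    returns = np.zeros(n, dtype=np.float64)
    running = bootstrap_value
    for t in reversed(range(n)):
        # Episode boundaries (done flags) cut off the bootstrapped tail.
        running = rewards[t] + gamma * running * (1.0 - float(dones[t]))
        returns[t] = running
    advantages = returns - np.asarray(values, dtype=np.float64)  # Â_t = G_t^(n) − V(s_t;φ)
    return returns, advantages
```

Working backwards through the rollout lets each step reuse the discounted tail already accumulated, so the whole batch of advantages is computed in a single pass before the synchronous update.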
Complete Implementation
The full implementation with error handling, comprehensive testing, and additional variants is available in the source code:
- Main implementation with basic Actor-Critic and A2C variants: src/algokit/reinforcement_learning/actor_critic.py
- Comprehensive test suite including convergence tests: tests/unit/reinforcement_learning/test_actor_critic.py
Complexity Analysis¶
Time & Space Complexity Comparison
| Approach | Time Complexity | Space Complexity | Notes |
|---|---|---|---|
| Basic Actor-Critic | O(episodes × steps_per_episode × (policy_forward_pass + value_forward_pass)) | O(policy_parameters + value_parameters) | Time complexity includes both actor and critic network operations; space complexity includes the parameters of both networks |
Use Cases & Applications¶
Application Categories
Continuous Control

- **Robot Control**: Learning continuous control policies for robots
- **Autonomous Vehicles**: Learning driving policies
- **Game Playing**: Learning continuous control in games
- **Physics Simulation**: Learning control policies in physics engines

Discrete Control

- **Game Playing**: Learning discrete action policies in games
- **Resource Allocation**: Learning allocation policies
- **Scheduling**: Learning scheduling policies
- **Routing**: Learning routing policies in networks

High-Dimensional State Spaces

- **Computer Vision**: Learning from image inputs
- **Natural Language Processing**: Learning from text inputs
- **Robotics**: Learning from sensor data
- **Finance**: Learning from market data

Real-Time Applications

- **Trading**: Learning trading strategies in real-time
- **Robotics**: Learning control policies during operation
- **Gaming**: Learning game strategies while playing
- **Resource Management**: Learning allocation policies during operation
Educational Value

- **Hybrid Methods**: Understanding the combination of policy and value methods
- **Variance Reduction**: Learning techniques to reduce variance in gradient estimates
- **Online Learning**: Understanding online learning capabilities in RL
- **Advantage Estimation**: Understanding advantage-based policy updates
Interactive Learning
Try implementing the different approaches yourself! This progression will give you deep insight into the algorithm's principles and applications.
Pro Tip: Start with the simplest implementation and gradually work your way up to more complex variants.
Navigation¶
Related Algorithms in Reinforcement Learning:
- **SARSA** - An on-policy temporal difference learning algorithm that learns action-value functions by following the current policy.
- **Policy Gradient** - A policy-based reinforcement learning algorithm that directly optimizes the policy using gradient ascent on expected returns.
- **Q-Learning** - A model-free reinforcement learning algorithm that learns optimal action-value functions through temporal difference learning.
- **Proximal Policy Optimization (PPO)** - A state-of-the-art policy gradient algorithm that uses a clipped surrogate objective to ensure stable policy updates with improved sample efficiency.
- **Deep Q-Network (DQN)** - A deep reinforcement learning algorithm that uses neural networks to approximate Q-functions for high-dimensional state spaces.