Actor-Critic
A hybrid reinforcement learning algorithm that combines policy gradient methods with value function approximation for improved learning efficiency.
Family: Reinforcement Learning · Status: 📋 Planned
Overview
Actor-Critic is a hybrid reinforcement learning algorithm that combines the benefits of both policy-based and value-based methods. It uses two neural networks: an "actor" that learns the policy and a "critic" that learns the value function. The critic provides a baseline for the actor's policy updates, significantly reducing the variance of gradient estimates compared to pure policy gradient methods like REINFORCE.
The algorithm works by having the actor select actions based on the current policy, while the critic evaluates the quality of the current state or state-action pair. The critic's value estimates are then used to compute advantages, which guide the actor's policy updates. This combination allows for more stable and sample-efficient learning compared to pure policy gradient methods.
Actor-Critic methods can be applied to both discrete and continuous action spaces and are particularly effective in environments with high-dimensional state spaces.
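The sketch below makes the two-network structure described above concrete. It assumes PyTorch and a small discrete action space; the class names, layer sizes, and dimensions are illustrative and not taken from the algokit source.

```python
# Illustrative actor and critic networks (PyTorch assumed; sizes are arbitrary).
import torch
import torch.nn as nn
from torch.distributions import Categorical


class Actor(nn.Module):
    """Maps a state to a probability distribution over discrete actions: π(a|s;θ)."""

    def __init__(self, state_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, state: torch.Tensor) -> Categorical:
        return Categorical(logits=self.net(state))


class Critic(nn.Module):
    """Maps a state to a scalar value estimate V(s;φ)."""

    def __init__(self, state_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state).squeeze(-1)


# The actor selects actions; the log-probability is kept for the policy update.
actor = Actor(state_dim=4, n_actions=2)
state = torch.zeros(4)                # placeholder state
dist = actor(state)                   # π(·|s;θ)
action = dist.sample()                # a ~ π(·|s;θ)
log_prob = dist.log_prob(action)      # gradient w.r.t. θ flows through this
```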
Mathematical Formulation¶
Problem Definition
Given:
- State space: S
- Action space: A
- Policy: π(a|s;θ) parameterized by θ (Actor)
- Value function: V(s;φ) parameterized by φ (Critic)
- Reward function: R(s,a,s')
- Discount factor: γ ∈ [0,1]
Find parameters θ and φ that maximize expected return:
J(θ) = E[ ∑_{t=0}^∞ γ^t R(s_t, a_t, s_{t+1}) | π(·|·;θ) ]
Using gradient ascent on the policy objective and gradient descent on the value-function loss L(φ):

θ ← θ + α_θ ∇_θ J(θ)
φ ← φ − α_φ ∇_φ L(φ)
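For reference, the actor's gradient follows from the policy gradient theorem with the critic's value used as a baseline; this standard result is what connects the objective above to the update rules below:

```latex
\nabla_\theta J(\theta)
  = \mathbb{E}_{\pi_\theta}\!\left[\sum_{t=0}^{\infty}
      \nabla_\theta \log \pi(a_t \mid s_t; \theta)\, A^{\pi}(s_t, a_t)\right],
\qquad
A^{\pi}(s_t, a_t) = Q^{\pi}(s_t, a_t) - V^{\pi}(s_t)
```

Subtracting the baseline V^π(s_t) does not change the expected gradient, but it can substantially reduce its variance, which is the core benefit of adding the critic.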
Update Rules
Actor Update
θ ← θ + α_θ ∑_{t=0}^T ∇_θ log π(a_t|s_t;θ) A^π(s_t,a_t)
Policy parameters updated using advantage estimates
Critic Update
φ ← φ − α_φ ∑_{t=0}^T ∇_φ (V(s_t;φ) − V^π(s_t))²
Value function parameters updated by gradient descent on the squared prediction error; since V^π(s_t) is unknown, it is approximated in practice by a sampled return or the bootstrapped TD target r_t + γ V(s_{t+1};φ)
Advantage Estimation
A^π(s_t,a_t) = Q^π(s_t,a_t) - V^π(s_t)
Advantage computed as the difference between the Q-value and the state value; in practice the one-step TD error δ_t = r_t + γ V(s_{t+1};φ) − V(s_t;φ) is a common single-sample estimate of the advantage
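A minimal one-step (temporal-difference) version of these updates might look like the following, reusing the Actor and Critic sketches from the overview. The TD error serves as the single-sample advantage estimate; the function name, hyperparameters, and optimizer choice are illustrative rather than the algokit API.

```python
# One-step actor-critic update (sketch): the TD error δ_t estimates the advantage.
import torch


def actor_critic_step(actor, critic, actor_opt, critic_opt,
                      state, action, reward, next_state, done, gamma=0.99):
    value = critic(state)                                  # V(s_t;φ)
    with torch.no_grad():
        next_value = torch.zeros(()) if done else critic(next_state)
        td_target = reward + gamma * next_value            # r_t + γ V(s_{t+1};φ)
    advantage = (td_target - value).detach()               # δ_t ≈ A^π(s_t, a_t)

    # Critic: gradient descent on the squared prediction (TD) error.
    critic_loss = (td_target - value).pow(2)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor: gradient ascent on log π(a_t|s_t;θ) · advantage
    # (the optimizer minimizes, hence the minus sign).
    log_prob = actor(state).log_prob(action)
    actor_loss = -(log_prob * advantage)
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()
```

Detaching the advantage keeps the critic's gradients out of the actor update, so each network is trained only by its own loss.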
Key Properties¶
- **Hybrid Approach**: Combines policy-based and value-based methods
- **Variance Reduction**: Uses the value function as a baseline to reduce variance
- **Online Learning**: Can learn from incomplete episodes
- **Sample Efficiency**: More sample efficient than pure policy gradient methods
Implementation Approaches¶
Standard Actor-Critic with separate actor and critic networks (see the training-loop sketch after this list)
Complexity:
- Time: O(episodes × steps_per_episode × (policy_forward_pass + value_forward_pass))
- Space: O(policy_parameters + value_parameters)
Advantages
- Reduces variance compared to pure policy gradient methods
- More sample efficient than REINFORCE
- Can learn online without complete episodes
- Combines benefits of policy and value methods
Disadvantages
- More complex implementation than pure methods
- Requires tuning of two networks
- Can be unstable during training
- Still has higher variance than pure value-based methods
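Under the same assumptions as the earlier sketches (PyTorch, a Gymnasium-style environment, and the Actor, Critic, and actor_critic_step definitions above), the training loop for this variant could be wired up roughly as follows; the environment name, learning rates, and episode count are illustrative.

```python
# Sketch of an online training loop for the standard actor-critic variant.
import gymnasium as gym
import torch

env = gym.make("CartPole-v1")                      # example environment
actor = Actor(state_dim=4, n_actions=2)
critic = Critic(state_dim=4)
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-3)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

for episode in range(500):
    obs, _ = env.reset()
    done = False
    while not done:
        state = torch.as_tensor(obs, dtype=torch.float32)
        action = actor(state).sample()
        obs, reward, terminated, truncated, _ = env.step(action.item())
        done = terminated or truncated
        next_state = torch.as_tensor(obs, dtype=torch.float32)
        # Online update after every step: no complete episode is required.
        actor_critic_step(actor, critic, actor_opt, critic_opt,
                          state, action, float(reward), next_state, terminated)
```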
Synchronous Actor-Critic (A2C) with n-step advantage estimation (see the n-step return sketch after this list)
Complexity:
- Time: O(episodes × steps_per_episode × (policy_forward_pass + value_forward_pass))
- Space: O(policy_parameters + value_parameters)
Advantages
- More stable than basic Actor-Critic
- Better sample efficiency
- Can handle both discrete and continuous actions
- Synchronous updates are simpler to implement
Disadvantages
- Still requires careful hyperparameter tuning
- Can be slower than asynchronous methods
- Requires more memory for n-step returns
- May not scale as well to large state spaces
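The n-step return and advantage computation that distinguishes this synchronous (A2C-style) variant could be sketched as follows; this is an illustrative NumPy version operating on one fixed-length rollout, not the algokit implementation.

```python
# n-step returns and advantages for a rollout of length n (illustrative sketch).
import numpy as np


def n_step_advantages(rewards, values, bootstrap_value, dones, gamma=0.99):
    """rewards, values, dones: per-step arrays from one rollout of length n.
    bootstrap_value: the critic's estimate V(s_{t+n}) for the state after the rollout."""
    n = len(rewards)
    returns = np.zeros(n, dtype=np.float64)
    running = bootstrap_value
    for t in reversed(range(n)):
        # Episode boundaries (done flags) cut off the bootstrapped tail.
        running = rewards[t] + gamma * running * (1.0 - float(dones[t]))
        returns[t] = running
    advantages = returns - np.asarray(values, dtype=np.float64)  # Â_t = G_t^(n) − V(s_t;φ)
    return returns, advantages
```

Working backwards through the rollout lets each step reuse the discounted tail already accumulated, so the whole batch of advantages is computed in a single pass before the synchronous update.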
Complete Implementation
The full implementation with error handling, comprehensive testing, and additional variants is available in the source code:
- Main implementation with basic Actor-Critic and A2C variants: src/algokit/reinforcement_learning/actor_critic.py
- Comprehensive test suite including convergence tests: tests/unit/reinforcement_learning/test_actor_critic.py
Complexity Analysis¶
Time & Space Complexity Comparison
| Approach | Time Complexity | Space Complexity | Notes |
|---|---|---|---|
| Basic Actor-Critic | O(episodes × steps_per_episode × (policy_forward_pass + value_forward_pass)) | O(policy_parameters + value_parameters) | Time complexity includes both actor and critic network operations; space complexity includes the parameters of both networks |
Use Cases & Applications¶
Application Categories
Continuous Control

- **Robot Control**: Learning continuous control policies for robots
- **Autonomous Vehicles**: Learning driving policies
- **Game Playing**: Learning continuous control in games
- **Physics Simulation**: Learning control policies in physics engines

Discrete Control

- **Game Playing**: Learning discrete action policies in games
- **Resource Allocation**: Learning allocation policies
- **Scheduling**: Learning scheduling policies
- **Routing**: Learning routing policies in networks

High-Dimensional State Spaces

- **Computer Vision**: Learning from image inputs
- **Natural Language Processing**: Learning from text inputs
- **Robotics**: Learning from sensor data
- **Finance**: Learning from market data

Real-Time Applications

- **Trading**: Learning trading strategies in real-time
- **Robotics**: Learning control policies during operation
- **Gaming**: Learning game strategies while playing
- **Resource Management**: Learning allocation policies during operation
Educational Value

- **Hybrid Methods**: Understanding the combination of policy and value methods
- **Variance Reduction**: Learning techniques to reduce variance in gradient estimates
- **Online Learning**: Understanding online learning capabilities in RL
- **Advantage Estimation**: Understanding advantage-based policy updates
Interactive Learning
Try implementing the different approaches yourself! This progression will give you deep insight into the algorithm's principles and applications.
Pro Tip: Start with the simplest implementation and gradually work your way up to more complex variants.
Navigation¶
Related Algorithms in Reinforcement Learning:
- **SARSA** - An on-policy temporal difference learning algorithm that learns action-value functions by following the current policy.
- **Policy Gradient** - A policy-based reinforcement learning algorithm that directly optimizes the policy using gradient ascent on expected returns.
- **Q-Learning** - A model-free reinforcement learning algorithm that learns optimal action-value functions through temporal difference learning.
- **Proximal Policy Optimization (PPO)** - A state-of-the-art policy gradient algorithm that uses a clipped surrogate objective to ensure stable policy updates with improved sample efficiency.
- **Deep Q-Network (DQN)** - A deep reinforcement learning algorithm that uses neural networks to approximate Q-functions for high-dimensional state spaces.