Option-Critic
A hierarchical reinforcement learning algorithm that learns options (temporally extended actions) end-to-end using policy gradient methods.
Family: Hierarchical Reinforcement Learning | Status: 📋 Planned
Overview
Option-Critic is a hierarchical reinforcement learning algorithm that learns options (temporally extended actions) end-to-end using policy gradient methods. Rather than relying on hand-designed options, it discovers useful ones during learning, and those options can be reused across tasks, enabling temporal abstraction and improved sample efficiency.

The approach learns three components simultaneously: intra-option policies π_ω(a|s) that select actions while an option is active, a policy over options π_Ω(ω|s) that chooses which option to execute, and termination functions β_ω(s) that decide when an option ends. Option-Critic is particularly effective in domains with natural temporal structure, such as robotic manipulation, navigation, and game playing.
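A minimal sketch of how these three components might share one network, assuming PyTorch and discrete options and actions. The class name `OptionCriticNet` and all layer sizes are illustrative, not the repository's API:

```python
import torch
import torch.nn as nn

class OptionCriticNet(nn.Module):
    """Illustrative three-component Option-Critic network (not the repo's API)."""

    def __init__(self, state_dim: int, num_options: int, num_actions: int, hidden: int = 64):
        super().__init__()
        # Shared state encoder feeding all three heads.
        self.encoder = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        # Q_Omega(s, .): value of each option; many implementations select
        # options epsilon-greedily from this instead of learning a softmax pi_Omega.
        self.q_option = nn.Linear(hidden, num_options)
        # pi_omega(a|s): one action head per option (intra-option policies).
        self.intra_option = nn.Linear(hidden, num_options * num_actions)
        # beta_omega(s): termination probability per option.
        self.termination = nn.Linear(hidden, num_options)
        self.num_options, self.num_actions = num_options, num_actions

    def forward(self, state: torch.Tensor):
        h = self.encoder(state)
        q = self.q_option(h)                                    # (batch, num_options)
        pi = self.intra_option(h).view(-1, self.num_options, self.num_actions)
        pi = torch.softmax(pi, dim=-1)                          # (batch, num_options, num_actions)
        beta = torch.sigmoid(self.termination(h))               # (batch, num_options)
        return q, pi, beta
```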
Mathematical Formulation
Problem Definition
Given:
- State space: S
- Option space: Ω
- Action space: A
- Intra-option policies: π_ω(a|s) for each option ω
- Policy over options: π_Ω(ω|s)
- Termination function: β_ω(s) for option ω
- Reward function: R(s,a,s')
Find the option components that maximize expected cumulative reward. When an option is (re)selected at state s_t, the induced flat action distribution is:

π(a_t|s_t) = ∑_ω π_Ω(ω|s_t) · π_ω(a_t|s_t)
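As a quick check of this decomposition, a toy NumPy example with made-up probabilities for two options and three actions:

```python
import numpy as np

# Toy example: 2 options, 3 actions, evaluated at a single state s_t.
pi_Omega = np.array([0.7, 0.3])              # pi_Omega(omega|s_t)
pi_omega = np.array([[0.8, 0.1, 0.1],        # pi_{omega_1}(a|s_t)
                     [0.2, 0.2, 0.6]])       # pi_{omega_2}(a|s_t)

# Marginal action distribution: sum over options of pi_Omega * pi_omega.
pi_flat = pi_Omega @ pi_omega
print(pi_flat)   # [0.62 0.13 0.25], sums to 1
```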
Key Equations
Option-Critic Policy Gradient Theorem
∇_θ J(θ) = E_{τ ~ π_θ}[ ∑_{t=0}^{T} ∇_θ log π_Ω(ω_t|s_t) · A_Ω(s_t, ω_t) + ∇_θ log π_{ω_t}(a_t|s_t) · A_ω(s_t, ω_t, a_t) ]

The policy gradient decomposes into an option-selection term and an action-selection term
Option Advantage Function
A_Ω(s_t, ω_t) = Q_Ω(s_t, ω_t) - V_Ω(s_t)
Advantage function for option selection
Action Advantage Function
A_ω(s_t, ω_t, a_t) = Q_ω(s_t, ω_t, a_t) - V_ω(s_t, ω_t)
Advantage function for action selection within options
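The two policy-gradient terms above leave out how the termination functions are trained. In the original Option-Critic paper (Bacon, Harb & Precup, 2017), a separate termination gradient theorem supplies that update, stated here up to discounting:

Termination Gradient Theorem

∇_ν J ∝ -E_{τ ~ π_θ}[ ∑_t ∇_ν β_{ω_t}(s_{t+1}) · A_Ω(s_{t+1}, ω_t) ]

When the current option still has positive advantage, the gradient lowers its termination probability, so useful options run longer.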
Key Properties

- End-to-End Learning: All components (intra-option policies, policy over options, termination functions) are learned simultaneously
- Automatic Option Discovery: Useful options emerge from learning without manual design
- Temporal Abstraction: Options operate over extended time horizons (see the execution-loop sketch below)
- Reusability: Learned options can be applied to new tasks
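To make the temporal abstraction concrete, below is a minimal sketch of the option-execution loop, reusing the illustrative `OptionCriticNet` from the overview. Following the original paper, epsilon-greedy selection over Q_Ω stands in for a separately learned π_Ω:

```python
import torch

def run_episode_step(net, state, current_option, epsilon=0.05):
    """Illustrative option-execution step: terminate-or-continue, then act."""
    q, pi, beta = net(state.unsqueeze(0))
    q, pi, beta = q[0], pi[0], beta[0]
    # The current option terminates with probability beta_omega(s);
    # only then is a new option chosen, so options persist across steps.
    if current_option is None or torch.bernoulli(beta[current_option]).item() == 1:
        if torch.rand(1).item() < epsilon:
            current_option = torch.randint(net.num_options, (1,)).item()
        else:
            current_option = q.argmax().item()
    # Sample an action from the intra-option policy of the (possibly new) option.
    action = torch.multinomial(pi[current_option], 1).item()
    return action, current_option
```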
Implementation Approaches
Standard Option-Critic implementation with intra-option policy, option-selection, and termination networks.

Complexity:

- Time: O(batch_size × (option_policy_params + option_selection_params + termination_params)) per gradient update
- Space: O(batch_size × (state_size + option_size))
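A hedged sketch of the per-transition losses that implement the gradients above. The advantage estimators and the absence of a baseline or regularizers are simplifying assumptions; the function is illustrative, not the repository's implementation:

```python
import torch

def option_critic_losses(q, pi, beta_next, q_next,
                         omega, action, reward, done, gamma=0.99):
    """One-transition Option-Critic losses (illustrative, not the repo's API).

    q, q_next : (num_options,) option values Q_Omega at s_t and s_{t+1}
    pi        : (num_options, num_actions) intra-option action probs at s_t
    beta_next : (num_options,) termination probabilities at s_{t+1}
    omega, action : indices of the executed option and action
    """
    with torch.no_grad():
        # One-step target: continue the option unless it terminates,
        # in which case a new option is selected greedily.
        u_next = (1.0 - beta_next[omega]) * q_next[omega] + beta_next[omega] * q_next.max()
        target = reward + gamma * (1.0 - done) * u_next
        adv_action = target - q[omega]               # estimate of A_omega(s_t, omega_t, a_t)
        adv_option = q_next[omega] - q_next.max()    # estimate of A_Omega(s_{t+1}, omega_t)

    critic_loss = 0.5 * (q[omega] - target).pow(2)             # fit Q_Omega toward the target
    actor_loss = -torch.log(pi[omega, action]) * adv_action    # intra-option policy gradient
    # Termination gradient: a positive option advantage pushes beta down,
    # extending options that are still worth running.
    termination_loss = beta_next[omega] * adv_option
    return critic_loss + actor_loss + termination_loss
```

In a full agent these terms are averaged over a batch, and an entropy bonus on π plus a small termination regularizer (a "deliberation cost") are commonly added to discourage options that terminate at every step.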
Advantages

- End-to-end learning of all option components
- Automatic discovery of useful options
- Temporal abstraction enables learning at different time scales
- Options can be reused across different tasks
Disadvantages

- Requires careful coordination between three networks
- Option discovery can be challenging and slow
- Three networks increase complexity and training time
- Termination function learning can be unstable
Complete Implementation

The full implementation with error handling, comprehensive testing, and additional variants is available in the source code:

- Main implementation with option policy, selection, and termination networks: src/algokit/hierarchical_rl/option_critic.py
- Comprehensive test suite including convergence tests: tests/unit/hierarchical_rl/test_option_critic.py
Complexity Analysis
Time & Space Complexity Comparison
| Approach | Time Complexity | Space Complexity | Notes |
|---|---|---|---|
| Basic Option-Critic | O(batch_size × (option_policy_params + option_selection_params + termination_params)) | O(batch_size × (state_size + option_size)) | Three-network architecture requires careful coordination during training |
Use Cases & Applications
Application Categories
Robotics and Control

- Robot Manipulation: Complex manipulation tasks with reusable options
- Autonomous Navigation: Multi-level navigation with temporal abstraction
- Industrial Automation: Process control with learned options
- Swarm Robotics: Coordinated behavior with shared options
Game AI and Strategy

- Strategy Games: Multi-level decision making with learned strategies
- Puzzle Games: Complex puzzles with reusable solution patterns
- Adventure Games: Quest completion with learned option sequences
- Simulation Games: Resource management with learned option policies
Real-World Applications

- Autonomous Vehicles: Multi-level driving with learned driving options
- Healthcare: Treatment planning with learned medical options
- Finance: Portfolio management with learned investment options
- Network Control: Traffic management with learned routing options
Educational Value

- Option Learning: Perfect introduction to temporally extended actions
- Automatic Discovery: Shows how useful behaviors can emerge from learning
- Temporal Abstraction: Demonstrates learning at different time scales
- Transfer Learning: Illustrates how options can be reused across tasks
References & Further Reading
Core Papers

- Bacon, P.-L., Harb, J., & Precup, D. (2017). The Option-Critic Architecture. AAAI.
- Sutton, R. S., Precup, D., & Singh, S. (1999). Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1-2), 181-211.
Interactive Learning
Try implementing the different approaches yourself! This progression will give you deep insight into the algorithm's principles and applications.
Pro Tip: Start with the simplest implementation and gradually work your way up to more complex variants.
Navigation
Related Algorithms in Hierarchical Reinforcement Learning:

- Hierarchical Q-Learning - Extends traditional Q-Learning to handle temporal abstraction and hierarchical task decomposition with multi-level Q-functions.
- Hierarchical Task Networks (HTNs) - Decomposes complex tasks into hierarchical structures of subtasks for planning and execution.
- Hierarchical Actor-Critic (HAC) - Extends the actor-critic framework with temporal abstraction and hierarchical structure.
- Hierarchical Policy Gradient - Extends traditional policy gradient methods to handle temporal abstraction and hierarchical task decomposition with multi-level policies.
- Feudal Networks (FuN) - Implements a manager-worker architecture for temporal abstraction and goal-based learning.