SARSA
An on-policy temporal difference learning algorithm that learns action-value functions by following the current policy.
Family: Reinforcement Learning | Status: 📋 Planned
Overview
SARSA (State-Action-Reward-State-Action) is an on-policy temporal difference learning algorithm
that learns the action-value function Q(s,a) by following the current policy. Unlike Q-Learning, SARSA updates its Q-values based on the action actually taken in the next state, making it more conservative and safer for online learning scenarios.
The algorithm is particularly useful when the agent must learn while interacting with the environment in real-time, as it considers the exploration strategy in its value updates. This makes SARSA more suitable for scenarios where safety is important and the agent cannot afford to take potentially dangerous exploratory actions.
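The difference from Q-Learning is easiest to see in the bootstrapped update targets. The fragment below is a schematic comparison only; `Q` stands for any tabular action-value array and the function names are illustrative, not part of a specific library:

```python
import numpy as np

# For a transition (s, a, r, s'):
# SARSA (on-policy) bootstraps from a', the action the current policy
# actually selects in s'; Q-Learning bootstraps from the greedy action.
def sarsa_target(Q, r, s_next, a_next, gamma=0.99):
    return r + gamma * Q[s_next, a_next]

def q_learning_target(Q, r, s_next, gamma=0.99):
    return r + gamma * np.max(Q[s_next])
```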
Mathematical Formulation
Problem Definition
Given:
- State space: S
- Action space: A
- Reward function: R(s,a,s')
- Discount factor: γ ∈ [0,1]
- Learning rate: α ∈ (0,1]
- Current policy: π
Find the action-value function Q^π(s,a) for the current policy π:
Q^π(s,a) = E[∑_{t=0}^∞ γ^t R_{t+1} | S_0 = s, A_0 = a, π]
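As a concrete illustration of the discounted sum inside this expectation, here is a minimal Python sketch; the reward sequence and γ are hypothetical values chosen only to show the arithmetic:

```python
def discounted_return(rewards, gamma):
    """Return sum_{t=0}^{T-1} gamma**t * rewards[t] for one finite trajectory."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# Example: rewards 1, 0, 5 over three steps with gamma = 0.9
# gives 1 + 0.9*0 + 0.81*5 = 5.05; Q^π(s,a) is the expectation of such
# returns over trajectories generated by following π after taking a in s.
print(discounted_return([1.0, 0.0, 5.0], gamma=0.9))
```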
Key Properties

- On-Policy Learning: Q(s,a) ← Q(s,a) + α[r + γ Q(s',a') − Q(s,a)], where a' is the action actually taken in the next state s' under the current policy (a Python sketch of this update follows this list).
- Policy Following: learns Q^π while following π; updates are based on the current policy, not the optimal policy.
- Conservative Updates: considers exploration in its value updates, making SARSA more conservative than off-policy methods like Q-Learning.
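A minimal sketch of this update in Python, assuming a tabular Q stored in a `defaultdict` and that (s, a, r, s', a') come from one interaction step under the current (for example ε-greedy) policy; the names are illustrative, not the project's API:

```python
from collections import defaultdict

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99, done=False):
    """One on-policy TD update: Q(s,a) += alpha * [r + gamma*Q(s',a') - Q(s,a)].

    a_next must be the action actually selected in s_next by the current
    policy; bootstrapping from it (rather than the greedy action) is what
    distinguishes SARSA from Q-Learning.
    """
    target = r if done else r + gamma * Q[(s_next, a_next)]
    td_error = target - Q[(s, a)]
    Q[(s, a)] += alpha * td_error
    return td_error

Q = defaultdict(float)                # unseen (state, action) pairs default to 0.0
sarsa_update(Q, s=0, a=1, r=1.0, s_next=2, a_next=0)
```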
Key Properties
- On-Policy: Learns the value of the policy being followed
- Model-Free: Does not require knowledge of environment dynamics
- Temporal Difference: Updates estimates based on other estimates (bootstrapping)
- Conservative Learning: Considers the exploration strategy in value updates
Implementation Approaches
Standard SARSA with a Q-table for discrete state-action spaces (a training-loop sketch follows the advantage/disadvantage lists below).
Complexity:
- Time: O(|S| × |A| × episodes)
- Space: O(|S| × |A|)
Advantages
- Safer for online learning scenarios
- Considers exploration in value updates
- More stable learning in some environments
- No model of the environment required
Disadvantages
- May converge to suboptimal policies
- Requires discrete state-action spaces
- Memory requirements grow with state-space size
- Slower convergence than off-policy methods
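To make the tabular approach concrete, here is a sketch of a complete training loop. It assumes a Gymnasium-style environment with discrete observation and action spaces and an ε-greedy behaviour policy; it is an illustrative outline, not the algokit implementation referenced below:

```python
import numpy as np

def epsilon_greedy(Q, state, n_actions, epsilon, rng):
    """Sample an action from the current epsilon-greedy policy."""
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))
    return int(np.argmax(Q[state]))

def train_sarsa(env, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1, seed=0):
    """Tabular SARSA: bootstrap from the action actually taken in the next state."""
    rng = np.random.default_rng(seed)
    n_states = env.observation_space.n          # assumes discrete spaces
    n_actions = env.action_space.n
    Q = np.zeros((n_states, n_actions))         # O(|S| x |A|) memory

    for _ in range(episodes):
        state, _ = env.reset()
        action = epsilon_greedy(Q, state, n_actions, epsilon, rng)
        done = False
        while not done:
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            # Select a' with the *same* policy before updating (on-policy).
            next_action = epsilon_greedy(Q, next_state, n_actions, epsilon, rng)
            target = reward if terminated else reward + gamma * Q[next_state, next_action]
            Q[state, action] += alpha * (target - Q[state, action])
            state, action = next_state, next_action
    return Q
```

For a small discrete environment such as Gymnasium's FrozenLake-v1, a loop like this fills the Q-table for the ε-greedy policy being followed; because exploration is baked into the targets, the learned values reflect the conservative behaviour described above.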
SARSA(λ) with eligibility traces for faster learning (an eligibility-trace update sketch follows the lists below).
Complexity:
- Time: O(|S| × |A| × episodes)
- Space: O(|S| × |A|)
Advantages
- Faster learning through eligibility traces
- Better sample efficiency
- Maintains on-policy learning benefits
- Can handle delayed rewards more effectively
Disadvantages
- More complex implementation
- Additional memory for eligibility traces
- Requires tuning of the lambda parameter
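The eligibility-trace variant keeps a trace array the same shape as the Q-table and spreads each TD error backwards over recently visited state-action pairs. The per-step update with accumulating traces might look like the following sketch (names, defaults, and the accumulating-trace choice are assumptions, not the project's API):

```python
import numpy as np

def sarsa_lambda_step(Q, E, s, a, r, s_next, a_next, terminated,
                      alpha=0.1, gamma=0.99, lam=0.9):
    """One SARSA(lambda) step with accumulating eligibility traces.

    Q and E are arrays of shape (n_states, n_actions); E should be reset
    to zeros at the start of every episode.
    """
    target = r if terminated else r + gamma * Q[s_next, a_next]
    td_error = target - Q[s, a]
    E[s, a] += 1.0                 # accumulate credit for the visited pair
    Q += alpha * td_error * E      # spread the TD error along the trace
    E *= gamma * lam               # decay every trace toward zero
    return td_error
```

Setting lam=0 recovers one-step SARSA; values closer to 1 propagate credit further back along the episode, which is where both the faster learning and the extra tuning burden come from.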
Complete Implementation
The full implementation with error handling, comprehensive testing, and additional variants is available in the source code:
- Main implementation with tabular and function approximation variants: src/algokit/reinforcement_learning/sarsa.py
- Comprehensive test suite including convergence tests: tests/unit/reinforcement_learning/test_sarsa.py
Complexity Analysis
Time & Space Complexity Comparison
Approach | Time Complexity | Space Complexity | Notes
---|---|---|---
Tabular SARSA | O(\|S\| × \|A\| × episodes) | O(\|S\| × \|A\|) | Q-table for discrete state-action spaces
SARSA(λ) with eligibility traces | O(\|S\| × \|A\| × episodes) | O(\|S\| × \|A\|) | Extra O(\|S\| × \|A\|) trace table; requires tuning λ
Use Cases & Applications
Application Categories
Online Learning
- Real-time Trading: Learning trading strategies while trading
- Robot Navigation: Learning navigation while moving
- Game Playing: Learning strategies while playing
- Resource Management: Learning allocation while managing resources

Safety-Critical Applications
- Autonomous Vehicles: Learning driving policies safely
- Medical Systems: Learning treatment protocols
- Industrial Control: Learning control policies for safety
- Financial Systems: Learning investment strategies conservatively

Exploration Scenarios
- Scientific Discovery: Learning experimental protocols
- Drug Discovery: Learning molecular design strategies
- Materials Science: Learning synthesis protocols
- Environmental Monitoring: Learning sampling strategies
Educational Value
- On-Policy vs Off-Policy Learning: Understanding the difference between on-policy and off-policy methods
- Temporal Difference Learning: Understanding policy-dependent value updates
- Exploration vs Exploitation: Understanding conservative, safer learning approaches
- Policy Evaluation: Understanding how value functions are used to evaluate the current policy
Interactive Learning
Try implementing the different approaches yourself! This progression will give you deep insight into the algorithm's principles and applications.
Pro Tip: Start with the simplest implementation and gradually work your way up to more complex variants.
Navigation
Related Algorithms in Reinforcement Learning:
- Actor-Critic - A hybrid reinforcement learning algorithm that combines policy gradient methods with value function approximation for improved learning efficiency.
- Policy Gradient - A policy-based reinforcement learning algorithm that directly optimizes the policy using gradient ascent on expected returns.
- Q-Learning - A model-free reinforcement learning algorithm that learns optimal action-value functions through temporal difference learning.
- Proximal Policy Optimization (PPO) - A state-of-the-art policy gradient algorithm that uses a clipped objective to ensure stable policy updates with improved sample efficiency.
- Deep Q-Network (DQN) - A deep reinforcement learning algorithm that uses neural networks to approximate Q-functions for high-dimensional state spaces.