SARSA

An on-policy temporal difference learning algorithm that learns action-value functions by following the current policy.

Family: Reinforcement Learning | Status: 📋 Planned

Overview

SARSA (State-Action-Reward-State-Action) is an on-policy temporal difference learning algorithm that learns the action-value function Q(s,a) while following the current policy. Unlike Q-Learning, SARSA updates its Q-values using the action actually taken in the next state rather than the greedy action, which makes it more conservative and better suited to online learning.

The algorithm is particularly useful when the agent must learn while interacting with the environment in real-time, as it considers the exploration strategy in its value updates. This makes SARSA more suitable for scenarios where safety is important and the agent cannot afford to take potentially dangerous exploratory actions.

Mathematical Formulation

Problem Definition

Given:

  • State space: S
  • Action space: A
  • Reward function: R(s,a,s')
  • Discount factor: γ ∈ [0,1]
  • Learning rate: α ∈ (0,1]
  • Current policy: π

Find the action-value function Q^π(s,a) for the current policy π:

Q^π(s,a) = E_π[ ∑_{t=0}^∞ γ^t R_{t+1} | S_0 = s, A_0 = a ]
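
Unrolling this expectation one step gives the Bellman expectation equation for Q^π, the fixed point that the SARSA update below approximates by sampling (P denotes the transition probabilities, which SARSA, being model-free, never needs explicitly):

```latex
% Bellman expectation equation for Q^pi: expected immediate reward plus
% the discounted value of the next state-action pair, with a' drawn
% from the current policy pi.
Q^{\pi}(s,a) = \sum_{s'} P(s' \mid s,a)\Big[ R(s,a,s')
    + \gamma \sum_{a'} \pi(a' \mid s')\, Q^{\pi}(s',a') \Big]
```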

Update Rule

On-Policy Learning

Q(s,a) ← Q(s,a) + α[r + γ Q(s',a') - Q(s,a)]

Updates Q-values using the action a' actually selected by the current policy in the next state s'


Policy Following

Learns Q^π while following policy π

Updates are based on the current policy, not the optimal policy


Conservative Updates

Considers exploration in value updates

More conservative than off-policy methods like Q-Learning
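
To make the on-policy/off-policy distinction concrete, here is a minimal Python sketch contrasting the SARSA target with the Q-Learning target on a single transition. The tiny Q-table and the transition values are purely illustrative:

```python
import numpy as np

# Illustrative Q-table for a toy problem: 3 states x 2 actions.
Q = np.array([[0.5, 0.2],
              [0.1, 0.9],
              [0.0, 0.0]])

alpha, gamma = 0.1, 0.99
s, a, r, s_next = 0, 0, 1.0, 1   # one observed transition
a_next = 0                       # action the policy ACTUALLY chose in s_next
                                 # (e.g., an exploratory, non-greedy choice)

# SARSA (on-policy): bootstrap from the action actually taken.
sarsa_target = r + gamma * Q[s_next, a_next]

# Q-Learning (off-policy): bootstrap from the greedy action,
# regardless of what the behavior policy actually did.
q_learning_target = r + gamma * np.max(Q[s_next])

Q[s, a] += alpha * (sarsa_target - Q[s, a])

print(sarsa_target, q_learning_target)  # 1.099 vs 1.891 here
```

Because the exploratory action had a lower value, the SARSA target is smaller than the Q-Learning target: SARSA's estimates price in the cost of its own exploration.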


Key Properties

  • On-Policy: Learns the value of the policy being followed

  • Model-Free: Does not require knowledge of environment dynamics

  • Temporal Difference: Updates estimates based on other estimates (bootstrapping)

  • Conservative Learning: Considers the exploration strategy in value updates

Implementation Approaches

Tabular SARSA: the standard algorithm, storing Q-values in a table indexed by discrete states and actions (a sketch follows the lists below).

Complexity:

  • Time: O(|S| × |A| × episodes)
  • Space: O(|S| × |A|)

Advantages

  • Safer for online learning scenarios

  • Considers exploration in value updates

  • More stable learning in some environments

  • No model of environment required

Disadvantages

  • May converge to suboptimal policies

  • Requires discrete state-action spaces

  • Memory requirements grow with state space size

  • Slower convergence than off-policy methods
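
A minimal tabular SARSA sketch in Python, assuming a Gymnasium-style discrete environment interface (env.reset() returning (obs, info), env.step() returning a 5-tuple, and observation_space.n / action_space.n); the function names and hyperparameter defaults are illustrative, not a definitive implementation:

```python
import numpy as np

def epsilon_greedy(Q, state, n_actions, epsilon, rng):
    """With probability epsilon explore uniformly; otherwise act greedily."""
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))
    return int(np.argmax(Q[state]))

def sarsa(env, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1, seed=0):
    """Tabular SARSA for discrete Gymnasium-style environments."""
    rng = np.random.default_rng(seed)
    n_states, n_actions = env.observation_space.n, env.action_space.n
    Q = np.zeros((n_states, n_actions))

    for _ in range(episodes):
        state, _ = env.reset()
        action = epsilon_greedy(Q, state, n_actions, epsilon, rng)
        done = False
        while not done:
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            # On-policy: a' is chosen by the same epsilon-greedy policy
            # being evaluated, exploratory actions included.
            next_action = epsilon_greedy(Q, next_state, n_actions, epsilon, rng)
            # Bootstrap from Q(s', a'), not max_a Q(s', a); skip the
            # bootstrap only when s' is truly terminal (truncation cuts
            # the episode short, but the value of s' still exists).
            target = reward if terminated else reward + gamma * Q[next_state, next_action]
            Q[state, action] += alpha * (target - Q[state, action])
            state, action = next_state, next_action
    return Q
```

On Gymnasium's CliffWalking-v0, for example, SARSA trained this way typically learns the safer path away from the cliff edge, while Q-Learning converges to the shorter but riskier route along it (the classic comparison from Sutton & Barto).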

SARSA(λ): SARSA augmented with eligibility traces for faster credit assignment (a sketch follows the lists below).

Complexity:

  • Time: O(|S| × |A| × episodes)
  • Space: O(|S| × |A|)

Advantages

  • Faster learning through eligibility traces

  • Better sample efficiency

  • Maintains on-policy learning benefits

  • Can handle delayed rewards more effectively

Disadvantages

  • More complex implementation

  • Additional memory for eligibility traces

  • Requires tuning of lambda parameter
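
A sketch of SARSA(λ) with accumulating eligibility traces, reusing the epsilon_greedy helper and environment assumptions from the tabular sketch above; lambda_ close to 1 spreads credit further back along the episode, lambda_ = 0 recovers plain SARSA:

```python
import numpy as np

def sarsa_lambda(env, episodes=500, alpha=0.1, gamma=0.99,
                 epsilon=0.1, lambda_=0.9, seed=0):
    """SARSA(lambda) with accumulating eligibility traces."""
    rng = np.random.default_rng(seed)
    n_states, n_actions = env.observation_space.n, env.action_space.n
    Q = np.zeros((n_states, n_actions))

    for _ in range(episodes):
        E = np.zeros_like(Q)  # eligibility traces, reset each episode
        state, _ = env.reset()
        action = epsilon_greedy(Q, state, n_actions, epsilon, rng)
        done = False
        while not done:
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            next_action = epsilon_greedy(Q, next_state, n_actions, epsilon, rng)
            # Same one-step TD error as plain SARSA.
            bootstrap = 0.0 if terminated else gamma * Q[next_state, next_action]
            delta = reward + bootstrap - Q[state, action]
            E[state, action] += 1.0        # mark the visited pair
            # Broadcast the TD error to ALL recently visited pairs in
            # proportion to their decayed traces, so credit flows back
            # to earlier steps of the episode in a single update.
            Q += alpha * delta * E
            E *= gamma * lambda_           # decay every trace
            state, action = next_state, next_action
    return Q
```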

Complete Implementation

The full implementation, with error handling, comprehensive testing, and additional variants, is available in the source code.

Complexity Analysis

Time & Space Complexity Comparison

Approach        Time Complexity            Space Complexity   Notes
Tabular SARSA   O(|S| × |A| × episodes)    O(|S| × |A|)       Q-table only
SARSA(λ)        O(|S| × |A| × episodes)    O(|S| × |A|)       Adds a trace matrix of the same size

Use Cases & Applications

Application Categories

Online Learning

  • Real-time Trading: Learning trading strategies while trading

  • Robot Navigation: Learning navigation while moving

  • Game Playing: Learning strategies while playing

  • Resource Management: Learning allocation while managing resources

Safety-Critical Applications

  • Autonomous Vehicles: Learning driving policies safely

  • Medical Systems: Learning treatment protocols

  • Industrial Control: Learning control policies for safety

  • Financial Systems: Learning investment strategies conservatively

Exploration Scenarios

  • Scientific Discovery: Learning experimental protocols

  • Drug Discovery: Learning molecular design strategies

  • Materials Science: Learning synthesis protocols

  • Environmental Monitoring: Learning sampling strategies

Educational Value

  • On-Policy vs Off-Policy Learning: Understanding the difference between the two families of temporal difference methods

  • Temporal Difference Learning: Understanding policy-dependent value updates

  • Exploration vs Exploitation: Understanding conservative, exploration-aware learning

  • Policy Evaluation: Understanding how to evaluate the current policy

References & Further Reading

Core Textbooks

  • Sutton, R. S. & Barto, A. G. Reinforcement Learning: An Introduction. MIT Press, 2018. ISBN 978-0-262-03924-6

  • Szepesvári, C. Algorithms for Reinforcement Learning. Morgan & Claypool, 2010. ISBN 978-1-60845-492-1

Key SARSA Papers

  • Rummery, G. A. & Niranjan, M. On-line Q-learning using connectionist systems. Cambridge University Engineering Department, 1994. Technical Report CUED/F-INFENG/TR 166

  • Sutton, R. S. Generalization in Reinforcement Learning: Successful Examples Using Sparse Coarse Coding. Advances in Neural Information Processing Systems 8, 1996, pages 1038-1044

Online Resources

  • Wikipedia article on SARSA

  • OpenAI Spinning Up RL tutorial

  • GeeksforGeeks SARSA implementation

Implementation & Practice

  • RL environment library for testing algorithms

  • High-quality RL algorithm implementations

  • Scalable RL library for production use

Interactive Learning

Try implementing the different approaches yourself! This progression will give you deep insight into the algorithm's principles and applications.

Pro Tip: Start with the simplest implementation and gradually work your way up to more complex variants.

Related Algorithms in Reinforcement Learning:

  • Actor-Critic - A hybrid reinforcement learning algorithm that combines policy gradient methods with value function approximation for improved learning efficiency.

  • Policy Gradient - A policy-based reinforcement learning algorithm that directly optimizes the policy using gradient ascent on expected returns.

  • Q-Learning - A model-free reinforcement learning algorithm that learns optimal action-value functions through temporal difference learning.

  • Proximal Policy Optimization (PPO) - A state-of-the-art policy gradient algorithm that uses a clipped objective to ensure stable policy updates with improved sample efficiency.

  • Deep Q-Network (DQN) - A deep reinforcement learning algorithm that uses neural networks to approximate Q-functions for high-dimensional state spaces.