SARSA

An on-policy temporal difference learning algorithm that learns action-value functions by following the current policy.

Family: Reinforcement Learning | Status: 📋 Planned

Overview

SARSA (State-Action-Reward-State-Action) is an on-policy temporal difference learning algorithm that learns the action-value function Q(s,a) while following the current policy. Unlike Q-Learning, SARSA updates its Q-values using the action actually taken in the next state rather than the greedy action, which makes it more conservative and better suited to online learning.

The algorithm is particularly useful when the agent must learn while interacting with the environment in real-time, as it considers the exploration strategy in its value updates. This makes SARSA more suitable for scenarios where safety is important and the agent cannot afford to take potentially dangerous exploratory actions.

Mathematical Formulation

Problem Definition

Given:

  • State space: S
  • Action space: A
  • Reward function: R(s,a,s')
  • Discount factor: γ ∈ [0,1]
  • Learning rate: α ∈ (0,1]
  • Current policy: π

Find the action-value function Q^π(s,a) for the current policy π:

Q^π(s,a) = E_π[ ∑_{t=0}^∞ γ^t R_{t+1} | S_0 = s, A_0 = a ]
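
Unrolling this expectation one step gives the Bellman expectation equation for Q^π, the fixed point that the SARSA update below approximates by sampling (P denotes the transition probabilities, which SARSA, being model-free, never needs explicitly):

```latex
% Bellman expectation equation for Q^pi: expected immediate reward plus
% the discounted value of the next state-action pair, with a' drawn
% from the current policy pi.
Q^{\pi}(s,a) = \sum_{s'} P(s' \mid s,a)\Big[ R(s,a,s')
    + \gamma \sum_{a'} \pi(a' \mid s')\, Q^{\pi}(s',a') \Big]
```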

Update Rule

On-Policy Learning

Q(s,a) ← Q(s,a) + α[r + γ Q(s',a') - Q(s,a)]

Updates Q-values using the action a' actually selected by the current policy in the next state s'


Policy Following

Learns Q^π while following policy π

Updates are based on the current policy, not the optimal policy


Conservative Updates

Considers exploration in value updates

More conservative than off-policy methods like Q-Learning
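
To make the on-policy/off-policy distinction concrete, here is a minimal Python sketch contrasting the SARSA target with the Q-Learning target on a single transition. The tiny Q-table and the transition values are purely illustrative:

```python
import numpy as np

# Illustrative Q-table for a toy problem: 3 states x 2 actions.
Q = np.array([[0.5, 0.2],
              [0.1, 0.9],
              [0.0, 0.0]])

alpha, gamma = 0.1, 0.99
s, a, r, s_next = 0, 0, 1.0, 1   # one observed transition
a_next = 0                       # action the policy ACTUALLY chose in s_next
                                 # (e.g., an exploratory, non-greedy choice)

# SARSA (on-policy): bootstrap from the action actually taken.
sarsa_target = r + gamma * Q[s_next, a_next]

# Q-Learning (off-policy): bootstrap from the greedy action,
# regardless of what the behavior policy actually did.
q_learning_target = r + gamma * np.max(Q[s_next])

Q[s, a] += alpha * (sarsa_target - Q[s, a])

print(sarsa_target, q_learning_target)  # 1.099 vs 1.891 here
```

Because the exploratory action had a lower value, the SARSA target is smaller than the Q-Learning target: SARSA's estimates price in the cost of its own exploration.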


Key Properties

  • On-Policy: Learns the value of the policy being followed

  • Model-Free: Does not require knowledge of environment dynamics

  • Temporal Difference: Updates estimates based on other estimates (bootstrapping)

  • Conservative Learning: Considers the exploration strategy in value updates

Implementation Approaches

Tabular SARSA: the standard algorithm, storing Q-values in a table indexed by discrete states and actions (a sketch follows the lists below).

Complexity:

  • Time: O(|S| × |A| × episodes)
  • Space: O(|S| × |A|)

Advantages

  • Safer for online learning scenarios

  • Considers exploration in value updates

  • More stable learning in some environments

  • No model of environment required

Disadvantages

  • May converge to suboptimal policies

  • Requires discrete state-action spaces

  • Memory requirements grow with state space size

  • Slower convergence than off-policy methods
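
A minimal tabular SARSA sketch in Python, assuming a Gymnasium-style discrete environment interface (env.reset() returning (obs, info), env.step() returning a 5-tuple, and observation_space.n / action_space.n); the function names and hyperparameter defaults are illustrative, not a definitive implementation:

```python
import numpy as np

def epsilon_greedy(Q, state, n_actions, epsilon, rng):
    """With probability epsilon explore uniformly; otherwise act greedily."""
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))
    return int(np.argmax(Q[state]))

def sarsa(env, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1, seed=0):
    """Tabular SARSA for discrete Gymnasium-style environments."""
    rng = np.random.default_rng(seed)
    n_states, n_actions = env.observation_space.n, env.action_space.n
    Q = np.zeros((n_states, n_actions))

    for _ in range(episodes):
        state, _ = env.reset()
        action = epsilon_greedy(Q, state, n_actions, epsilon, rng)
        done = False
        while not done:
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            # On-policy: a' is chosen by the same epsilon-greedy policy
            # being evaluated, exploratory actions included.
            next_action = epsilon_greedy(Q, next_state, n_actions, epsilon, rng)
            # Bootstrap from Q(s', a'), not max_a Q(s', a); skip the
            # bootstrap only when s' is truly terminal (truncation cuts
            # the episode short, but the value of s' still exists).
            target = reward if terminated else reward + gamma * Q[next_state, next_action]
            Q[state, action] += alpha * (target - Q[state, action])
            state, action = next_state, next_action
    return Q
```

On Gymnasium's CliffWalking-v0, for example, SARSA trained this way typically learns the safer path away from the cliff edge, while Q-Learning converges to the shorter but riskier route along it (the classic comparison from Sutton & Barto).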

SARSA(λ): SARSA augmented with eligibility traces for faster credit assignment (a sketch follows the lists below).

Complexity:

  • Time: O(|S| × |A| × episodes)
  • Space: O(|S| × |A|)

Advantages

  • Faster learning through eligibility traces

  • Better sample efficiency

  • Maintains on-policy learning benefits

  • Can handle delayed rewards more effectively

Disadvantages

  • More complex implementation

  • Additional memory for eligibility traces

  • Requires tuning of lambda parameter
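
A sketch of SARSA(λ) with accumulating eligibility traces, reusing the epsilon_greedy helper and environment assumptions from the tabular sketch above; lambda_ close to 1 spreads credit further back along the episode, lambda_ = 0 recovers plain SARSA:

```python
import numpy as np

def sarsa_lambda(env, episodes=500, alpha=0.1, gamma=0.99,
                 epsilon=0.1, lambda_=0.9, seed=0):
    """SARSA(lambda) with accumulating eligibility traces."""
    rng = np.random.default_rng(seed)
    n_states, n_actions = env.observation_space.n, env.action_space.n
    Q = np.zeros((n_states, n_actions))

    for _ in range(episodes):
        E = np.zeros_like(Q)  # eligibility traces, reset each episode
        state, _ = env.reset()
        action = epsilon_greedy(Q, state, n_actions, epsilon, rng)
        done = False
        while not done:
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            next_action = epsilon_greedy(Q, next_state, n_actions, epsilon, rng)
            # Same one-step TD error as plain SARSA.
            bootstrap = 0.0 if terminated else gamma * Q[next_state, next_action]
            delta = reward + bootstrap - Q[state, action]
            E[state, action] += 1.0        # mark the visited pair
            # Broadcast the TD error to ALL recently visited pairs in
            # proportion to their decayed traces, so credit flows back
            # to earlier steps of the episode in a single update.
            Q += alpha * delta * E
            E *= gamma * lambda_           # decay every trace
            state, action = next_state, next_action
    return Q
```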

Complete Implementation

The full implementation, with error handling, comprehensive testing, and additional variants, is available in the source code.

Complexity Analysis

Time & Space Complexity Comparison

Approach        Time Complexity            Space Complexity   Notes
Tabular SARSA   O(|S| × |A| × episodes)    O(|S| × |A|)       Q-table only
SARSA(λ)        O(|S| × |A| × episodes)    O(|S| × |A|)       Adds a trace matrix of the same size

Use Cases & Applications

Application Categories

Online Learning

  • Real-time Trading: Learning trading strategies while trading

  • Robot Navigation: Learning navigation while moving

  • Game Playing: Learning strategies while playing

  • Resource Management: Learning allocation while managing resources

Safety-Critical Applications

  • Autonomous Vehicles: Learning driving policies safely

  • Medical Systems: Learning treatment protocols

  • Industrial Control: Learning control policies for safety

  • Financial Systems: Learning investment strategies conservatively

Exploration Scenarios

  • Scientific Discovery: Learning experimental protocols

  • Drug Discovery: Learning molecular design strategies

  • Materials Science: Learning synthesis protocols

  • Environmental Monitoring: Learning sampling strategies

Educational Value

  • On-Policy vs Off-Policy Learning: Understanding the difference between the two families of temporal difference methods

  • Temporal Difference Learning: Understanding policy-dependent value updates

  • Exploration vs Exploitation: Understanding conservative, exploration-aware learning

  • Policy Evaluation: Understanding how to evaluate the current policy

References & Further Reading

Core Textbooks

  • Sutton, R. S. & Barto, A. G. Reinforcement Learning: An Introduction. MIT Press, 2018. ISBN 978-0-262-03924-6

  • Szepesvári, C. Algorithms for Reinforcement Learning. Morgan & Claypool, 2010. ISBN 978-1-60845-492-1

Key SARSA Papers

  • Rummery, G. A. & Niranjan, M. On-line Q-learning using connectionist systems. Cambridge University Engineering Department, 1994. Technical Report CUED/F-INFENG/TR 166

  • Sutton, R. S. Generalization in Reinforcement Learning: Successful Examples Using Sparse Coarse Coding. Advances in Neural Information Processing Systems 8, 1996, pages 1038-1044

Online Resources

  • Wikipedia article on SARSA

  • OpenAI Spinning Up RL tutorial

  • GeeksforGeeks SARSA implementation

Implementation & Practice

  • RL environment library for testing algorithms

  • High-quality RL algorithm implementations

  • Scalable RL library for production use

Interactive Learning

Try implementing the different approaches yourself! This progression will give you deep insight into the algorithm's principles and applications.

Pro Tip: Start with the simplest implementation and gradually work your way up to more complex variants.

Related Algorithms in Reinforcement Learning:

  • Actor-Critic - A hybrid reinforcement learning algorithm that combines policy gradient methods with value function approximation for improved learning efficiency.

  • Policy Gradient - A policy-based reinforcement learning algorithm that directly optimizes the policy using gradient ascent on expected returns.

  • Q-Learning - A model-free reinforcement learning algorithm that learns optimal action-value functions through temporal difference learning.

  • Proximal Policy Optimization (PPO) - A state-of-the-art policy gradient algorithm that uses a clipped objective to ensure stable policy updates with improved sample efficiency.

  • Deep Q-Network (DQN) - A deep reinforcement learning algorithm that uses neural networks to approximate Q-functions for high-dimensional state spaces.