
SARSA (State-Action-Reward-State-Action)


A model-free, on-policy reinforcement learning algorithm that learns action-value functions while following the current policy.

Family: Reinforcement Learning Status: ✅ Complete


Overview

SARSA (State-Action-Reward-State-Action) is a model-free, on-policy reinforcement learning algorithm that learns the action-value function while following the current policy. Unlike Q-Learning, SARSA updates Q-values based on the action actually taken in the next state, making it more conservative and well suited to online learning scenarios.

The key difference between SARSA and Q-Learning is that SARSA uses the action actually taken in the next state (on-policy), while Q-Learning uses the action with the highest Q-value (off-policy). This makes SARSA more conservative and safer for online learning, but potentially less sample-efficient than Q-Learning.
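The difference shows up directly in the bootstrap target of the two updates. A minimal sketch (the Q-table shape, state indices, and γ value here are illustrative, not from the source):

```python
import numpy as np

# Hypothetical tabular action-value function: Q[state, action]
Q = np.zeros((5, 2))
gamma = 0.99  # discount factor

def sarsa_target(r, s_next, a_next):
    # On-policy: bootstrap from the action a' actually selected in s'
    return r + gamma * Q[s_next, a_next]

def q_learning_target(r, s_next):
    # Off-policy: bootstrap from the greedy (highest-value) action in s'
    return r + gamma * np.max(Q[s_next])
```

When the policy's actual next action happens to be the greedy one, the two targets coincide; they differ only on exploratory actions, which is why SARSA's estimates account for the cost of exploration while Q-Learning's do not.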

SARSA is particularly useful in scenarios where exploration is costly or dangerous, such as robotics applications or financial trading, where taking suboptimal actions can have significant consequences.

Mathematical Formulation


Problem Definition

Given:

  • State space: S
  • Action space: A
  • Reward function: R(s,a,s')
  • Discount factor: γ ∈ [0,1]
  • Learning rate: α ∈ (0,1]

Find the action-value function Q^π(s,a) of the current policy π, which satisfies:

Q^π(s,a) = E[R_{t+1} + γ Q^π(s_{t+1}, a_{t+1}) | s_t = s, a_t = a]

SARSA update rule: Q(s_t, a_t) ← Q(s_t, a_t) + α[r_{t+1} + γ Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)]
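The update rule translates directly into code. A minimal tabular sketch (the Q container and the default hyperparameter values are illustrative assumptions, not from the source):

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    """One SARSA temporal-difference update on a tabular Q-function.

    Q is any nested indexable (dict of lists, 2-D array) with Q[s][a].
    """
    # TD target bootstraps from the action a' actually selected in s'
    td_target = r + gamma * Q[s_next][a_next]
    td_error = td_target - Q[s][a]
    Q[s][a] += alpha * td_error
    return td_error
```

Note that `a_next` must be chosen by the behavior policy *before* the update; feeding in the greedy action instead would turn this into Q-Learning.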

Key Properties

  • On-Policy Learning: learns the value of the policy being followed.

  • Conservative Updates: uses the action actually taken in the next state, not the optimal action.

  • Online Learning: can learn while interacting with the environment.

  • Exploration-Sensitive: the learned values depend on the exploration strategy.


Complete Implementation

The full implementation, with error handling, comprehensive testing, and additional variants, is available in the source code.
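Until you consult the full source, here is a compact sketch of one SARSA training episode with ε-greedy action selection. The `env_step`/`env_reset` interface is an assumption of this sketch, not an API from the source:

```python
import random

def epsilon_greedy(Q, s, n_actions, eps=0.1):
    # Explore with probability eps; otherwise act greedily, breaking ties at random
    if random.random() < eps:
        return random.randrange(n_actions)
    best = max(Q[s])
    return random.choice([a for a in range(n_actions) if Q[s][a] == best])

def sarsa_episode(env_step, env_reset, Q, n_actions, alpha=0.1, gamma=0.99, eps=0.1):
    """Run one SARSA episode. Assumed interface: env_step(s, a) -> (s', r, done)."""
    s = env_reset()
    a = epsilon_greedy(Q, s, n_actions, eps)  # choose a from the current policy
    done = False
    while not done:
        s_next, r, done = env_step(s, a)
        # Choose a' with the SAME policy BEFORE updating: this is what makes it on-policy
        a_next = epsilon_greedy(Q, s_next, n_actions, eps)
        target = r if done else r + gamma * Q[s_next][a_next]
        Q[s][a] += alpha * (target - Q[s][a])
        s, a = s_next, a_next
```

On a toy chain environment (move left/right, reward at the rightmost state), repeated calls to `sarsa_episode` drive the Q-values toward preferring the rewarding direction.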

Use Cases & Applications


Application Categories

Robotics

  • navigation

  • manipulation

  • locomotion

Finance

  • trading

  • portfolio management

  • risk management

Gaming

  • strategy games

  • real-time games

  • multiplayer games

Control Systems

  • process control

  • autonomous systems

  • adaptive control

References & Further Reading

:material-book: Core Textbooks

:material-book:
Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd ed.). MIT Press. ISBN 978-0-262-03924-6
:material-book:
Szepesvári, C. (2010). Algorithms for Reinforcement Learning. Morgan & Claypool. ISBN 978-1-60845-492-1

:material-library: SARSA Algorithm

:material-book:
Rummery, G. A., & Niranjan, M. (1994). On-line Q-learning using connectionist systems. Cambridge University Engineering Department, Technical Report CUED/F-INFENG/TR 166
:material-book:
Sutton, R. S. (1996). Generalization in Reinforcement Learning: Successful Examples Using Sparse Coarse Coding. Advances in Neural Information Processing Systems 8, pp. 1038-1044


Interactive Learning

Try implementing the different approaches yourself! This progression will give you deep insight into the algorithm's principles and applications.

Pro Tip: Start with the simplest implementation and gradually work your way up to more complex variants.

Related Algorithms in Reinforcement Learning:

  • Actor-Critic - A hybrid reinforcement learning algorithm that combines policy gradient methods with value function approximation for improved learning efficiency.

  • Deep Q-Network (DQN) - A deep reinforcement learning algorithm that uses neural networks to approximate Q-functions for high-dimensional state spaces.

  • Proximal Policy Optimization (PPO) - A state-of-the-art policy gradient algorithm that uses clipped objective to ensure stable policy updates with improved sample efficiency.

  • Q-Learning - A model-free reinforcement learning algorithm that learns optimal action-value functions through temporal difference learning.

  • Policy Gradient - A policy-based reinforcement learning algorithm that directly optimizes the policy using gradient ascent on expected returns.