SARSA (State-Action-Reward-State-Action)
A model-free, on-policy reinforcement learning algorithm that learns action-value functions while following the current policy.
Family: Reinforcement Learning | Status: ✅ Complete
Overview
SARSA (State-Action-Reward-State-Action) is a model-free, on-policy reinforcement learning
algorithm that learns the action-value function while following the current policy. Unlike Q-Learning, SARSA updates Q-values based on the action actually taken in the next state, making it more conservative and suitable for online learning scenarios.
The key difference between SARSA and Q-Learning is that SARSA uses the action actually taken in the next state (on-policy), while Q-Learning uses the action with the highest Q-value (off-policy). This makes SARSA more conservative and safer for online learning, but potentially less sample-efficient than Q-Learning.
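To make the contrast concrete, here is a minimal sketch comparing the two bootstrap targets on a toy Q-table (the table values and transition here are invented purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
Q = rng.random((4, 2))  # toy Q-table: 4 states, 2 actions
alpha, gamma = 0.1, 0.99
s, a, r, s_next, a_next = 0, 1, 1.0, 2, 0  # a_next: action actually taken in s'

# SARSA (on-policy): bootstrap from the action actually taken in s'
sarsa_target = r + gamma * Q[s_next, a_next]

# Q-Learning (off-policy): bootstrap from the greedy action in s'
q_learning_target = r + gamma * Q[s_next].max()

Q[s, a] += alpha * (sarsa_target - Q[s, a])
```

Since the greedy action's value is by definition at least as large as the value of the action actually taken, the SARSA target is never larger than the Q-Learning target, which is the source of SARSA's conservatism.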
SARSA is particularly useful in scenarios where exploration is costly or dangerous, such as robotics applications or financial trading, where taking suboptimal actions can have significant consequences.
Mathematical Formulation
Problem Definition

Given:

- State space: S
- Action space: A
- Reward function: R(s, a, s')
- Discount factor: γ ∈ [0, 1]
- Learning rate: α ∈ (0, 1]

SARSA estimates the action-value function of the policy π being followed:

Q^π(s, a) = E[r_{t+1} + γ Q^π(s_{t+1}, a_{t+1}) | s_t = s, a_t = a]

using the SARSA update rule:

Q(s_t, a_t) ← Q(s_t, a_t) + α [r_{t+1} + γ Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)]

With a GLIE exploration schedule (greedy in the limit, e.g. decaying ε-greedy), the estimates converge to the optimal Q*(s, a).
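The update rule translates directly into code on a tabular Q. This helper is an illustrative sketch (the function name and defaults are not taken from the library):

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    """Apply one SARSA temporal-difference update in place; return the TD error.

    Q may be any indexable table, e.g. a nested list or dict of lists.
    """
    td_target = r + gamma * Q[s_next][a_next]  # bootstraps from the action actually taken
    td_error = td_target - Q[s][a]
    Q[s][a] += alpha * td_error
    return td_error
```

Note that `a_next` must be sampled from the current behavior policy before the update is applied; that single detail is what makes the method on-policy.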
Key Properties

- On-Policy Learning: learns the value of the policy being followed
- Conservative Updates: uses the actual next action, not the optimal action
- Online Learning: can learn while interacting with the environment
- Exploration-Sensitive: learning depends on the exploration strategy
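Because the learned values depend on the behavior policy, the exploration schedule is effectively part of the algorithm. A common pairing is ε-greedy selection with a decaying ε; the sketch below is generic (function names and decay constants are illustrative, not from the library):

```python
import random

def epsilon_greedy(q_values, epsilon):
    """With probability epsilon take a uniformly random action, else the greedy one."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

def decayed_epsilon(episode, start=1.0, end=0.05, decay=0.995):
    """Exponentially decay exploration from `start` toward a floor of `end`."""
    return max(end, start * decay ** episode)
```

Under a fixed ε, SARSA converges to the value of the ε-greedy policy, not of the greedy one; decaying ε toward zero is what recovers optimal behavior in the limit.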
Complete Implementation

The full implementation, with error handling, comprehensive testing, and additional variants, is available in the source code:

- Main implementation with tabular SARSA and epsilon-greedy exploration: `src/algokit/algorithms/reinforcement_learning/sarsa.py`
- Comprehensive test suite including convergence tests: `tests/reinforcement_learning/test_sarsa.py`
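As a self-contained illustration of how the pieces fit together (this is a sketch, not the library code linked above; the chain environment and all names are invented for the example), here is tabular SARSA with ε-greedy exploration:

```python
import numpy as np

def epsilon_greedy(Q, s, epsilon, rng):
    """Epsilon-greedy with random tie-breaking among greedy actions."""
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))
    best = np.flatnonzero(Q[s] == Q[s].max())
    return int(rng.choice(best))

def train_sarsa(n_states=6, n_episodes=500, alpha=0.1, gamma=0.95,
                epsilon=0.1, seed=0):
    """Tabular SARSA on a toy chain: action 0 moves left, action 1 moves right,
    and reaching the rightmost (terminal) state yields reward 1.0."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, 2))
    for _ in range(n_episodes):
        s = 0
        a = epsilon_greedy(Q, s, epsilon, rng)
        for _ in range(1000):  # step cap so an unlucky episode cannot run forever
            s_next = max(s - 1, 0) if a == 0 else min(s + 1, n_states - 1)
            done = s_next == n_states - 1
            r = 1.0 if done else 0.0
            a_next = epsilon_greedy(Q, s_next, epsilon, rng)  # chosen BEFORE the update
            target = r + (0.0 if done else gamma * Q[s_next, a_next])
            Q[s, a] += alpha * (target - Q[s, a])
            if done:
                break
            s, a = s_next, a_next
    return Q

Q = train_sarsa()
```

After training, the greedy policy is read off with an argmax over actions per state. Because SARSA is on-policy, the learned values already account for the exploration noise the behavior policy actually injects.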
Use Cases & Applications
Application Categories

Robotics

- navigation
- manipulation
- locomotion

Finance

- trading
- portfolio management
- risk management

Gaming

- strategy games
- real-time games
- multiplayer games

Control Systems

- process control
- autonomous systems
- adaptive control
Interactive Learning
Try implementing the different approaches yourself! This progression will give you deep insight into the algorithm's principles and applications.
Pro Tip: Start with the simplest implementation and gradually work your way up to more complex variants.
Navigation

Related Algorithms in Reinforcement Learning:

- Actor-Critic - A hybrid reinforcement learning algorithm that combines policy gradient methods with value function approximation for improved learning efficiency.
- Deep Q-Network (DQN) - A deep reinforcement learning algorithm that uses neural networks to approximate Q-functions for high-dimensional state spaces.
- Proximal Policy Optimization (PPO) - A state-of-the-art policy gradient algorithm that uses a clipped objective to ensure stable policy updates with improved sample efficiency.
- Q-Learning - A model-free reinforcement learning algorithm that learns optimal action-value functions through temporal difference learning.
- Policy Gradient - A policy-based reinforcement learning algorithm that directly optimizes the policy using gradient ascent on expected returns.