Reinforcement Learning Algorithms¶
Reinforcement Learning enables agents to learn optimal behavior through interaction with an environment using rewards and penalties.
Reinforcement Learning (RL) is a machine learning paradigm where an agent learns to make decisions
by interacting with an environment. The agent receives rewards or penalties based on its actions and learns to maximize cumulative reward over time through trial and error.
Unlike supervised learning, RL doesn't require labeled training data. Instead, the agent learns through exploration and exploitation, discovering optimal policies through interaction with the environment. This makes RL particularly powerful for sequential decision-making problems where the consequences of actions unfold over time.
The RL Framework¶
┌─────────┐                     ┌──────────────┐
│         │◄────── State ───────┤              │
│  Agent  │◄────── Reward ──────┤  Environment │
│         ├────── Action ──────►│              │
└─────────┘                     └──────────────┘
The agent perceives the environment state, selects an action, receives a reward, and observes the next state. This cycle repeats, allowing the agent to learn which actions lead to higher cumulative rewards over time.
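This interaction loop can be sketched in a few lines of Python. The `LineWorld` environment below is a made-up toy, not part of this repository: the agent walks along positions 0 to 3 and receives a reward only for reaching position 3.

```python
import random

class LineWorld:
    """Hypothetical toy environment: positions 0..3, goal at position 3.
    Actions: 0 = step left, 1 = step right."""

    def reset(self):
        self.pos = 0
        return self.pos

    def step(self, action):
        self.pos = max(0, min(3, self.pos + (1 if action == 1 else -1)))
        reward = 1.0 if self.pos == 3 else 0.0
        done = self.pos == 3
        return self.pos, reward, done  # next state, reward, episode over?

env = LineWorld()
state = env.reset()
total_reward = 0.0
for _ in range(50):                 # the agent-environment cycle
    action = random.choice([0, 1])  # placeholder policy: act at random
    state, reward, done = env.step(action)
    total_reward += reward
    if done:
        break
```

A real agent would replace the random action choice with a learned policy; everything else in the loop stays the same.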
Overview¶
Key Characteristics:
- Trial and Error Learning: the agent learns through interaction and feedback from the environment
- Delayed Rewards: actions may have consequences that are only revealed later
- Exploration vs Exploitation: balance between trying new actions and using known good actions
- Policy Learning: learn a mapping from states to actions that maximizes reward
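The exploration-exploitation trade-off is commonly handled with an epsilon-greedy rule; a minimal sketch (the function name is illustrative, not from this repository):

```python
import random

def epsilon_greedy(q_values, epsilon):
    """With probability epsilon pick a random action (explore);
    otherwise pick the action with the highest estimated value (exploit)."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

# epsilon=0.0 always exploits; epsilon=1.0 always explores
epsilon_greedy([0.1, 0.5, 0.2], epsilon=0.0)  # returns 1, the best-known action
```

In practice epsilon is often decayed over training, so the agent explores heavily at first and exploits more as its value estimates improve.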
Common Applications:
- Games: chess, Go, poker, video games
- Robotics: navigation, manipulation, autonomous vehicles
- Finance: trading, portfolio management, risk management
- Recommendation: content recommendation, ad placement, personalization
- Resource management: energy management, traffic control, cloud computing
Key Concepts¶
- Agent: the learner or decision maker that interacts with the environment
- Environment: the world in which the agent operates and from which it receives feedback
- State: the current situation or configuration of the environment
- Action: a choice made by the agent that affects the environment
- Reward: an immediate feedback signal indicating the quality of an action
- Policy: a strategy or mapping from states to actions
- Value Function: the expected cumulative reward from a state or state-action pair
- Q-Function: the expected cumulative reward for taking a given action in a given state
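Value and Q-functions are defined in terms of the discounted return G = r₀ + γ·r₁ + γ²·r₂ + …, where γ is the discount factor. A small helper makes this concrete:

```python
def discounted_return(rewards, gamma=0.9):
    """Discounted return G = r0 + gamma*r1 + gamma^2*r2 + ...
    Computed backwards: each pass folds one more reward into the total."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

discounted_return([1.0, 1.0, 1.0], gamma=0.5)  # 1 + 0.5 + 0.25 = 1.75
```

A value function estimates the expectation of this quantity over the randomness in the policy and the environment.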
Complexity Analysis¶
Complexity Overview
Time: O(|S||A|) to O(|S|²|A|) Space: O(|S|) to O(|S||A|)
Complexity depends on the state space size |S|, the action space size |A|, and the algorithm chosen
🎨 Our Implementations¶
Value-Based Methods¶
- Q-Learning (97% coverage) - Off-policy tabular RL
- SARSA (96% coverage) - On-policy tabular RL
- DQN (94% coverage) - Deep Q-Network for high-dimensional states
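The core of tabular Q-Learning is the off-policy temporal-difference update Q(s,a) ← Q(s,a) + α·(r + γ·max Q(s',·) − Q(s,a)). A sketch using a plain dict-of-dicts Q-table (an illustration, not the repository's implementation):

```python
def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """One off-policy TD update: the target bootstraps from the best
    next action, regardless of which action the agent actually takes."""
    best_next = max(Q[s_next].values(), default=0.0)
    td_target = r + gamma * best_next
    Q[s][a] += alpha * (td_target - Q[s][a])

Q = {0: {0: 0.0, 1: 0.0}, 1: {0: 0.0, 1: 0.0}}
q_learning_update(Q, s=0, a=1, r=1.0, s_next=1)
# Q[0][1] is now 0.1 * (1.0 + 0.9 * 0.0) = 0.1
```

Repeating this update over many transitions, with sufficient exploration, drives the table toward the optimal action values.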
Policy-Based Methods¶
- Policy Gradient (100% coverage) - REINFORCE algorithm
- PPO (91% coverage) - Proximal Policy Optimization, state-of-the-art
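For intuition about policy-based methods, here is a minimal REINFORCE sketch for a two-armed bandit with a softmax policy (a simplified illustration, not the repository's implementation). The gradient of log π(a) with respect to preference hᵢ is (1[a=i] − πᵢ):

```python
import math
import random

def softmax(prefs):
    """Turn preferences into action probabilities."""
    exps = [math.exp(h) for h in prefs]
    z = sum(exps)
    return [e / z for e in exps]

def reinforce_step(prefs, reward_of, lr=0.1):
    """Sample an action from the softmax policy, then move each
    preference along reward * d(log pi)/d(h_i) = reward * (1[a=i] - pi_i)."""
    probs = softmax(prefs)
    a = random.choices(range(len(prefs)), weights=probs)[0]
    r = reward_of[a]
    return [h + lr * r * ((1.0 if i == a else 0.0) - p)
            for i, (h, p) in enumerate(zip(prefs, probs))]

random.seed(0)
prefs = [0.0, 0.0]
for _ in range(500):  # arm 1 pays 1.0, arm 0 pays nothing
    prefs = reinforce_step(prefs, reward_of=[0.0, 1.0])
# the policy shifts most of its probability mass onto the rewarding arm
```

Full REINFORCE replaces the bandit reward with the discounted return of a whole episode, but the gradient-ascent structure is the same.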
Hybrid Methods¶
- Actor-Critic (90% coverage) - Combines value and policy learning
Model-Based vs Model-Free
Model-Based RL:
- Learn a model of the environment
- Use the model to plan optimal actions
- More sample efficient
- Requires accurate environment modeling
Model-Free RL:
- Learn directly from experience
- No explicit environment model
- More robust to model errors
- May require more samples
Learning Approaches
- Value-Based: Learn value functions, derive policy (Q-Learning, SARSA)
- Policy-Based: Learn policy directly (REINFORCE, Policy Gradient)
- Actor-Critic: Combine value and policy learning
- Model-Based: Learn environment model, plan with it
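The on-policy/off-policy distinction between SARSA and Q-Learning shows up only in the update target. A SARSA sketch using a flat dict keyed by (state, action) pairs, for contrast (illustrative only):

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """On-policy TD update: the target uses the action the agent will
    actually take next (a_next), not the greedy maximum as in Q-Learning."""
    td_target = r + gamma * Q.get((s_next, a_next), 0.0)
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (td_target - Q.get((s, a), 0.0))

Q = {}
sarsa_update(Q, s=0, a=1, r=1.0, s_next=1, a_next=0)
# Q[(0, 1)] is now 0.1 * 1.0 = 0.1
```

Because the target follows the behavior policy, SARSA learns the value of the policy it is actually executing, exploration included.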
Algorithm Selection Guide
Use Tabular Methods (Q-Learning, SARSA) when:
- Small, discrete state/action spaces
- Need guaranteed convergence
- Want interpretable results
Use Deep RL (DQN, PPO) when:
- High-dimensional observations (images, sensors)
- Complex state spaces
- Need function approximation
- Have GPU compute available
Use Policy Gradient (REINFORCE, PPO) when:
- Continuous action spaces
- Stochastic policies needed
- Direct policy optimization preferred
Comparison Table¶
| Algorithm | Status | Time Complexity | Space Complexity | Difficulty | Applications |
|---|---|---|---|---|---|
| Actor-Critic | ✅ Complete | O(episodes × steps_per_episode × (policy_forward_pass + value_forward_pass)) | O(policy_parameters + value_parameters) | Medium | Continuous Control, Discrete Control |
| Deep Q-Network (DQN) | ✅ Complete | O(batch_size × network_forward_pass) | O(replay_buffer_size + network_parameters) | Medium | Game Playing, Robotics |
| Proximal Policy Optimization (PPO) | ✅ Complete | O(episodes × steps_per_episode × n_epochs × (policy_forward_pass + value_forward_pass)) | O(policy_parameters + value_parameters) | Medium | Continuous Control, Discrete Control |
| Q-Learning | ✅ Complete | O(\|S\| × \|A\|) | O(\|S\| × \|A\|) | Medium | |
| SARSA (State-Action-Reward-State-Action) | ✅ Complete | Varies | Varies | Medium | Robotics, Finance |
| Policy Gradient | ✅ Complete | O(episodes × steps_per_episode × policy_forward_pass) | O(policy_parameters) | Medium | Continuous Control, Stochastic Environments |
Algorithms in This Family¶
- Actor-Critic - A hybrid reinforcement learning algorithm that combines policy gradient methods with value function approximation for improved learning efficiency.
- Deep Q-Network (DQN) - A deep reinforcement learning algorithm that uses neural networks to approximate Q-functions for high-dimensional state spaces.
- Proximal Policy Optimization (PPO) - A state-of-the-art policy gradient algorithm that uses a clipped objective to ensure stable policy updates with improved sample efficiency.
- Q-Learning - A model-free reinforcement learning algorithm that learns optimal action-value functions through temporal difference learning.
- SARSA (State-Action-Reward-State-Action) - A model-free, on-policy reinforcement learning algorithm that learns action-value functions while following the current policy.
- Policy Gradient - A policy-based reinforcement learning algorithm that directly optimizes the policy using gradient ascent on expected returns.
Implementation Status¶
- Complete: 6/6 algorithms (100%)
- Planned: 0/6 algorithms (0%)
Related Algorithm Families¶
- Dynamic-Programming: Many RL algorithms are based on dynamic programming principles
- Optimization: RL can be viewed as an optimization problem for finding optimal policies
- Neural-Networks: Deep RL combines RL with neural networks for function approximation
- Control-Theory: RL extends control theory to unknown environments
References¶
- Sutton, Richard S. and Barto, Andrew G. (2018). Reinforcement Learning: An Introduction. MIT Press
Tags¶
- Optimization: algorithms that find optimal solutions to problems
- Algorithms: general algorithmic concepts and implementations