
Reinforcement Learning Algorithms

Reinforcement Learning enables agents to learn optimal behavior through interaction with an environment using rewards and penalties.

Reinforcement Learning (RL) is a machine learning paradigm where an agent learns to make decisions by interacting with an environment. The agent receives rewards or penalties based on its actions and learns to maximize cumulative reward over time through trial and error.

Unlike supervised learning, RL doesn't require labeled training data. Instead, the agent learns through exploration and exploitation, discovering optimal policies through interaction with the environment. This makes RL particularly powerful for sequential decision-making problems where the consequences of actions unfold over time.

The RL Framework

┌─────────┐                  ┌──────────────┐
│         │◄───── State ─────┤              │
│  Agent  │                  │ Environment  │
│         ├───── Action ────►│              │
└─────────┘◄───── Reward ────┤              │
                             └──────────────┘

The agent perceives the environment state, selects an action, receives a reward, and observes the next state. This cycle repeats, allowing the agent to learn which actions lead to higher cumulative rewards over time.
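The interaction cycle above can be sketched in a few lines. The environment and agent below are hypothetical stand-ins (a toy `CoinFlipEnv` and a random agent), not any real RL library API:

```python
import random

class CoinFlipEnv:
    """Toy environment: state is a step counter; action 1 pays off with prob 0.7."""
    def __init__(self):
        self.state = 0

    def step(self, action):
        reward = 1.0 if (action == 1 and random.random() < 0.7) else 0.0
        self.state += 1
        done = self.state >= 10          # episode ends after 10 steps
        return self.state, reward, done

def random_agent(state):
    """A placeholder policy: ignores the state and acts randomly."""
    return random.choice([0, 1])

env = CoinFlipEnv()
state, total_reward, done = 0, 0.0, False
while not done:
    action = random_agent(state)             # agent selects an action
    state, reward, done = env.step(action)   # environment returns next state and reward
    total_reward += reward                   # cumulative reward the agent tries to maximize
print(total_reward)
```

A learning agent would replace `random_agent` with a policy that improves from the observed rewards; the loop itself stays the same.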

Overview

Key Characteristics:

  • Trial and Error Learning


    Agent learns through interaction and feedback from the environment

  • Delayed Rewards


    Actions may have consequences that are only revealed later

  • Exploration vs Exploitation


    Balance between trying new actions and using known good actions

  • Policy Learning


    Learn a mapping from states to actions that maximizes reward
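The exploration-vs-exploitation trade-off listed above is most often handled with epsilon-greedy selection. A minimal sketch, where `q_values` and its numbers are made-up estimates for illustration:

```python
import random

def epsilon_greedy(q_values, epsilon):
    """Explore with probability epsilon, otherwise exploit the best estimate."""
    if random.random() < epsilon:                 # explore: try a random action
        return random.choice(list(q_values))
    return max(q_values, key=q_values.get)        # exploit: pick the highest estimate

q_values = {"left": 0.2, "right": 0.8}            # illustrative action-value estimates
action = epsilon_greedy(q_values, epsilon=0.0)    # epsilon=0 always exploits
print(action)  # → right
```

In practice epsilon is often decayed over training, so the agent explores broadly early on and exploits its learned estimates later.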

Common Applications:

  • Games: chess, Go, poker, video games

  • Robotics: navigation, manipulation, autonomous vehicles

  • Finance: trading, portfolio management, risk management

  • Recommendation: content recommendation, ad placement, personalization

  • Resource management: energy management, traffic control, cloud computing

Key Concepts

  • Agent


    The learner or decision maker that interacts with the environment

  • Environment


    The world in which the agent operates and receives feedback

  • State


    Current situation or configuration of the environment

  • Action


    Choice made by the agent that affects the environment

  • Reward


    Immediate feedback signal indicating the quality of an action

  • Policy


    Strategy or mapping from states to actions

  • Value Function


    Expected cumulative reward from a state or state-action pair

  • Q-Function


    Expected cumulative reward for taking an action in a state

Complexity Analysis

Complexity Overview

Time: O(|S||A|) to O(|S|²|A|)

Space: O(|S|) to O(|S||A|)

Complexity depends on state space size |S|, action space size |A|, and algorithm choice
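To make the O(|S||A|) space bound concrete, here is a quick back-of-the-envelope calculation with illustrative sizes (the 100-state grid world is an assumption, not from the text):

```python
# A tabular Q-table stores one value per (state, action) pair.
n_states, n_actions = 100, 4          # e.g. a 10x10 grid world with 4 moves
q_table_entries = n_states * n_actions
print(q_table_entries)  # → 400
```

This is why tabular methods stop scaling once the state space grows: an Atari-style image observation has far too many states to enumerate, which motivates the function-approximation methods below.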

🎨 Our Implementations

Value-Based Methods

  • Q-Learning (97% coverage) - Off-policy tabular RL
  • SARSA (96% coverage) - On-policy tabular RL
  • DQN (94% coverage) - Deep Q-Network for high-dimensional states
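The core of the value-based methods above is the temporal-difference update. A minimal sketch of the off-policy Q-learning rule, Q(s,a) ← Q(s,a) + α[r + γ·max_a' Q(s',a') − Q(s,a)], with an illustrative transition (the states, actions, and reward are made up):

```python
from collections import defaultdict

def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Off-policy: bootstrap from the greedy action in the next state."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

Q = defaultdict(float)  # unseen (state, action) pairs default to 0.0
q_learning_update(Q, s=0, a="right", r=1.0, s_next=1, actions=["left", "right"])
print(Q[(0, "right")])  # 0.1 * (1.0 + 0.9*0 - 0) = 0.1
```

DQN keeps this same target but replaces the table with a neural network and stabilizes training with a replay buffer and a target network.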

Policy-Based Methods

  • Policy Gradient (100% coverage) - REINFORCE algorithm
  • PPO (91% coverage) - Proximal Policy Optimization, state-of-the-art
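To illustrate the policy-based idea, here is a hedged REINFORCE sketch on a two-armed bandit with a softmax policy over preferences `theta`. The update θ ← θ + α·G·∇log π(a) uses the return G of the sampled action; the bandit, step size, and step count are all assumptions for demonstration, not part of the implementations listed above:

```python
import math
import random

def softmax(theta):
    z = [math.exp(t) for t in theta]
    s = sum(z)
    return [v / s for v in z]

def reinforce_step(theta, alpha=0.5):
    probs = softmax(theta)
    a = random.choices([0, 1], weights=probs)[0]  # sample an action from the policy
    G = 1.0 if a == 1 else 0.0                    # toy reward: arm 1 always pays off
    # grad of log softmax: 1 - pi(a) for the chosen action, -pi(i) for the others
    for i in range(2):
        grad = (1.0 if i == a else 0.0) - probs[i]
        theta[i] += alpha * G * grad
    return theta

theta = [0.0, 0.0]
for _ in range(200):
    reinforce_step(theta)
# the policy shifts probability mass toward the rewarding arm
print(round(softmax(theta)[1], 2))
```

PPO keeps the same gradient-ascent idea but clips the policy ratio per update, which is what makes its updates stable enough to reuse each batch for several epochs.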

Hybrid Methods

  • Actor-Critic (90% coverage) - Combines value and policy learning
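The hybrid idea can be sketched as follows: the critic's TD error, rather than a full episode return, drives the actor's update. Tabular values, a single made-up transition, and `prefs` as a simplified stand-in for the actor's action preferences (real actors update per-action parameters):

```python
def actor_critic_step(V, prefs, s, r, s_next, alpha_v=0.1, alpha_pi=0.1, gamma=0.9):
    td_error = r + gamma * V[s_next] - V[s]   # critic evaluates the transition
    V[s] += alpha_v * td_error                # critic update (value learning)
    prefs[s] += alpha_pi * td_error           # actor update scaled by the critic's signal
    return td_error

V = {0: 0.0, 1: 0.0}
prefs = {0: 0.0}
delta = actor_critic_step(V, prefs, s=0, r=1.0, s_next=1)
print(delta, V[0])  # td_error = 1.0, so V[0] becomes 0.1
```

Using the TD error as the learning signal lowers the variance of the policy update compared with plain REINFORCE, at the cost of some bias from the bootstrapped value estimate.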

Model-Based vs Model-Free

Model-Based RL:

  • Learn a model of the environment
  • Use the model to plan optimal actions
  • More sample efficient
  • Requires accurate environment modeling

Model-Free RL:

  • Learn directly from experience
  • No explicit environment model
  • More robust to model errors
  • May require more samples

Learning Approaches

  1. Value-Based: Learn value functions, derive policy (Q-Learning, SARSA)
  2. Policy-Based: Learn policy directly (REINFORCE, Policy Gradient)
  3. Actor-Critic: Combine value and policy learning
  4. Model-Based: Learn environment model, plan with it

Algorithm Selection Guide

Use Tabular Methods (Q-Learning, SARSA) when:

  • Small, discrete state/action spaces
  • Need guaranteed convergence
  • Want interpretable results

Use Deep RL (DQN, PPO) when:

  • High-dimensional observations (images, sensors)
  • Complex state spaces
  • Need function approximation
  • Have GPU compute available

Use Policy Gradient (REINFORCE, PPO) when:

  • Continuous action spaces
  • Stochastic policies needed
  • Direct policy optimization preferred

Comparison Table

Algorithm Status Time Complexity Space Complexity Difficulty Applications
Actor-Critic ✅ Complete O(episodes × steps_per_episode × (policy_forward_pass + value_forward_pass)) O(policy_parameters + value_parameters) Medium Continuous Control, Discrete Control
Deep Q-Network (DQN) ✅ Complete O(batch_size × network_forward_pass) O(replay_buffer_size + network_parameters) Medium Game Playing, Robotics
Proximal Policy Optimization (PPO) ✅ Complete O(episodes × steps_per_episode × n_epochs × (policy_forward_pass + value_forward_pass)) O(policy_parameters + value_parameters) Medium Continuous Control, Discrete Control
Q-Learning ✅ Complete O(|S| × |A|) O(|S| × |A|) Medium Game Playing, Robotics
SARSA (State-Action-Reward-State-Action) ✅ Complete Varies Varies Medium Robotics, Finance
Policy Gradient ✅ Complete O(episodes × steps_per_episode × policy_forward_pass) O(policy_parameters) Medium Continuous Control, Stochastic Environments

Algorithms in This Family

  • Actor-Critic - A hybrid reinforcement learning algorithm that combines policy gradient methods with value function approximation for improved learning efficiency.

  • Deep Q-Network (DQN) - A deep reinforcement learning algorithm that uses neural networks to approximate Q-functions for high-dimensional state spaces.

  • Proximal Policy Optimization (PPO) - A state-of-the-art policy gradient algorithm that uses clipped objective to ensure stable policy updates with improved sample efficiency.

  • Q-Learning - A model-free reinforcement learning algorithm that learns optimal action-value functions through temporal difference learning.

  • SARSA (State-Action-Reward-State-Action) - A model-free, on-policy reinforcement learning algorithm that learns action-value functions while following the current policy.

  • Policy Gradient - A policy-based reinforcement learning algorithm that directly optimizes the policy using gradient ascent on expected returns.
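The on-policy/off-policy distinction drawn in the SARSA and Q-Learning entries above comes down to one term in the update: SARSA bootstraps from the action actually taken next, Q(s',a'), instead of the greedy max. A minimal sketch with illustrative values:

```python
from collections import defaultdict

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """On-policy: bootstrap from the next action the current policy chose."""
    Q[(s, a)] += alpha * (r + gamma * Q[(s_next, a_next)] - Q[(s, a)])

Q = defaultdict(float)
Q[(1, "left")] = 0.5   # pretend the agent already has an estimate for the next pair
sarsa_update(Q, s=0, a="right", r=1.0, s_next=1, a_next="left")
print(round(Q[(0, "right")], 4))  # 0.1 * (1.0 + 0.9*0.5) = 0.145
```

Because the target follows the behavior policy (including its exploratory actions), SARSA tends to learn more conservative policies than Q-learning near risky states.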

Implementation Status

  • Complete


    6/6 algorithms (100%)

  • Planned


    0/6 algorithms (0%)

Related Algorithm Families

  • Dynamic-Programming: Many RL algorithms are based on dynamic programming principles

  • Optimization: RL can be viewed as an optimization problem for finding optimal policies

  • Neural-Networks: Deep RL combines RL with neural networks for function approximation

  • Control-Theory: RL extends control theory to unknown environments

References

  1. Sutton, Richard S. and Barto, Andrew G. (2018). Reinforcement Learning: An Introduction. MIT Press

Tags

  • Optimization - Algorithms that find optimal solutions to problems

  • Algorithms - General algorithmic concepts and implementations