On-Policy Learning In Reinforcement Learning (RL)
On-policy methods learn from what the agent is currently doing. Imagine you’re teaching a robot to navigate a maze: in on-policy learning, the robot learns from the actions it actually takes, much like learning to cook by trying out recipes yourself. More precisely, on-policy learning estimates the value of the policy the agent is following, exploration steps included. That same policy directs the agent’s actions in every state while it learns. The agent evaluates the outcomes of its own actions and refines its strategy incrementally, improving its decision-making through direct, real-time interaction with the environment.
SARSA for On-Policy Learning
A prominent example of an on-policy method is SARSA, which stands for State-Action-Reward-State-Action. The name lists the five quantities used in each update: the current state (S) and action (A), the reward (R) received, and the next state (S') and next action (A'). The next action is chosen by the same policy the agent is following, which is what makes SARSA on-policy. The update uses only the observed transition, so no model of the environment’s dynamics is needed. This approach is like learning on the job, where every step you take informs your next decision.
Mathematically, the update can be represented as:
$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \, Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right]$$
where,
- $Q(s_t, a_t)$ represents the action-value function, denoting the expected cumulative future reward of taking action $a_t$ in state $s_t$.
- $\alpha$ is the learning rate, determining the step size for the update.
- $r_{t+1}$ is the reward received after taking action $a_t$ in state $s_t$ and transitioning to state $s_{t+1}$.
- $\gamma$ is the discount factor, weighing the importance of future rewards.
- $Q(s_{t+1}, a_{t+1})$ is the Q-value for the next state-action pair.
The SARSA algorithm updates its Q-values based on the observed reward and the estimate of future rewards, promoting the learning of an optimal policy over successive iterations.
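To make the update rule concrete, here is a minimal sketch of a single SARSA update on a tiny Q-table. The function name `sarsa_update`, the two-state table, and the numbers are all made up for illustration; they are not part of the tutorial's FrozenLake code.

```python
def sarsa_update(q, s, a, r, s_next, a_next, alpha, gamma):
    """Apply one SARSA update to q[s][a] in place and return the new value."""
    # TD target: observed reward plus discounted value of the next state-action pair
    td_target = r + gamma * q[s_next][a_next]
    # TD error: how far the current estimate is from the target
    td_error = td_target - q[s][a]
    q[s][a] += alpha * td_error
    return q[s][a]


# Hypothetical Q-table with 2 states (rows) and 2 actions (columns)
q = [[0.0, 0.5],
     [1.0, 0.0]]

# Transition: state 0, action 1, reward 1.0, then state 1 with next action 0
new_value = sarsa_update(q, s=0, a=1, r=1.0, s_next=1, a_next=0,
                         alpha=0.1, gamma=0.9)
print(new_value)  # approximately 0.64
```

Here the target is 1.0 + 0.9 × 1.0 = 1.9, the error is 1.9 − 0.5 = 1.4, and the estimate moves a tenth of the way toward the target, from 0.5 to about 0.64.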
On-Policy Learning Implementation
Let’s use the OpenAI Gym library, which provides various environments for testing RL algorithms. We will demonstrate both on-policy (using SARSA) and off-policy (using Q-Learning) methods.
Install necessary Python package
!pip install gym
Step 1: Import necessary packages
Python3
import gym
import numpy as np
import matplotlib.pyplot as plt
Step 2: Initialize Environment
Python3
env = gym.make('FrozenLake-v1')
env.reset()
Step 3: Initialize Q-table and Setting Hyperparameters
The Q-table is a matrix where rows correspond to states in the environment and columns to possible actions. Initially, it’s filled with zeros.
Learning Process
- The agent learns through episodes. In each episode, it starts in an initial state and continues until a terminal state is reached. During each step:
  - The agent selects an action based on the current policy, typically using an epsilon-greedy strategy (a mix of exploration and exploitation).
  - After performing the action, the agent observes the reward and the new state.
Python3
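# Q-table: one row per state, one column per action, initialized to zero
Q = np.zeros([env.observation_space.n, env.action_space.n])

# Hyperparameters
alpha = 0.1          # learning rate
gamma = 0.99         # discount factor
epsilon = 0.1        # probability of taking a random (exploratory) action
num_episodes = 1000  # number of training episodes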
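The epsilon-greedy strategy mentioned in the learning process can be sketched in isolation as follows. This is an illustrative helper, not code from the tutorial: the name `epsilon_greedy` and the optional `rng` parameter are assumptions, and it uses only the standard library so it runs without Gym. It takes one row of a Q-table (the values of all actions in the current state) and returns an action index.

```python
import random


def epsilon_greedy(q_row, epsilon, rng=random):
    """Pick an action index given the Q-values of one state."""
    if rng.random() < epsilon:
        # Explore: choose uniformly among all actions
        return rng.randrange(len(q_row))
    # Exploit: choose the action with the highest Q-value
    # (max returns the first such index when several actions tie)
    return max(range(len(q_row)), key=lambda a: q_row[a])


# With epsilon = 0 the choice is purely greedy
print(epsilon_greedy([0.1, 0.7, 0.2], epsilon=0.0))  # 1

# With epsilon = 0.1, roughly one choice in ten is random
rng = random.Random(0)  # seeded generator for repeatable runs
choices = [epsilon_greedy([0.1, 0.7, 0.2], 0.1, rng) for _ in range(1000)]
print(choices.count(1) / len(choices))
```

Passing a row such as `Q[state, :]` from the NumPy Q-table should work the same way, since the helper only indexes into the row and compares values.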
Step 4: On-Policy Method (SARSA) Algorithm Implementation
The epsilon-greedy strategy is employed for exploration, and the Q-values are updated using the SARSA formula which considers the reward received and the estimated value of the next action according to the current policy. The code tracks rewards and steps per episode for analysis.
Policy Improvement: Over time, as the agent explores the environment and receives feedback (rewards), the Q-table is refined. Because the policy is derived from the Q-table, it improves along with it, ideally approaching an optimal policy.
Python
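rewards_sarsa = []
steps_per_episode = []

def choose_action(state):
    # Epsilon-greedy: explore with probability epsilon, otherwise act greedily
    if np.random.rand() < epsilon:
        return env.action_space.sample()
    return np.argmax(Q[state, :])

for i in range(num_episodes):
    # Assumes the classic Gym API (gym < 0.26): reset() returns only the state
    # and step() returns four values
    state = env.reset()
    action = choose_action(state)
    done = False
    total_reward = 0
    step_count = 0
    while not done:
        new_state, reward, done, _ = env.step(action)
        # SARSA picks the next action with the same epsilon-greedy policy;
        # using argmax here would turn the update into Q-learning
        new_action = choose_action(new_state)
        Q[state, action] += alpha * (reward + gamma * Q[new_state, new_action] - Q[state, action])
        # The action used in the update is the one actually taken next
        state, action = new_state, new_action
        total_reward += reward
        step_count += 1
    rewards_sarsa.append(total_reward)
    steps_per_episode.append(step_count)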
Step 5: Visualization
Python3
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
plt.plot(rewards_sarsa)
plt.title("Rewards per Episode - SARSA")
plt.xlabel("Episode")
plt.ylabel("Total Reward")

plt.subplot(1, 2, 2)
plt.plot(steps_per_episode)
plt.title("Steps per Episode - SARSA")
plt.xlabel("Episode")
plt.ylabel("Steps")

plt.tight_layout()
plt.show()
print("Training complete with SARSA")
Output:
Training complete with SARSA
- The rewards plot shows how the agent’s ability to accumulate rewards evolves over episodes. A rising trend indicates that the agent is learning a better strategy.
- The steps plot shows how many steps each episode takes. A decreasing trend can indicate that the agent reaches a terminal state more directly as its decisions improve.
Because SARSA is on-policy, the agent evaluates and improves the same policy it uses to make decisions, exploration steps included.
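A per-episode reward curve can be hard to read, because FrozenLake gives a reward of 1 only when the goal is reached and 0 otherwise. One way to make a trend easier to see is to smooth the curve with a moving average before plotting. The helper below is a sketch, not part of the original code; the name `moving_average` and the window size are illustrative choices.

```python
def moving_average(values, window):
    """Return the mean of every consecutive run of `window` values."""
    if window <= 0:
        raise ValueError("window must be positive")
    if len(values) < window:
        return []
    running = sum(values[:window])
    averages = [running / window]
    for i in range(window, len(values)):
        # Slide the window: add the newest value, drop the oldest
        running += values[i] - values[i - window]
        averages.append(running / window)
    return averages


# Toy reward history: two failed episodes followed by three successes
print(moving_average([0, 0, 1, 1, 1], window=2))  # [0.0, 0.5, 1.0, 1.0]
```

With the training code above, something like `plt.plot(moving_average(rewards_sarsa, 50))` would plot the success rate over the last 50 episodes instead of the raw 0/1 values.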
On-Policy vs Off-Policy Methods in Reinforcement Learning
In the world of Reinforcement Learning (RL), two primary approaches dictate how an agent (like a robot or a software program) learns from its environment: On-policy methods and Off-policy methods. Understanding the difference between these two is crucial for grasping the fundamentals of RL. This tutorial aims to demystify the concepts, providing a solid foundation for understanding the nuances between on-policy and off-policy strategies.
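One compact way to see the difference is to compare the update targets of SARSA (on-policy) and Q-learning (off-policy) for the same transition. The sketch below uses made-up function names and a hypothetical two-state Q-table; it only illustrates the two targets and is not a full training loop.

```python
def sarsa_target(q, r, s_next, a_next, gamma):
    """On-policy target: uses the action the agent actually takes next."""
    return r + gamma * q[s_next][a_next]


def q_learning_target(q, r, s_next, gamma):
    """Off-policy target: uses the best next action, whatever the agent does."""
    return r + gamma * max(q[s_next])


# Hypothetical Q-table with 2 states and 2 actions
q = [[0.0, 0.0],
     [0.2, 0.8]]

# Suppose the agent lands in state 1 and, because it is exploring,
# picks action 0 (value 0.2) instead of the greedy action 1 (value 0.8)
on_policy = sarsa_target(q, r=0.0, s_next=1, a_next=0, gamma=0.5)
off_policy = q_learning_target(q, r=0.0, s_next=1, gamma=0.5)

print(on_policy)   # approximately 0.1
print(off_policy)  # approximately 0.4
```

The SARSA target reflects the exploratory action that was really chosen, so the learned values account for the cost of exploration. The Q-learning target ignores that choice and assumes the greedy action, which is why it learns about a different policy than the one generating the data.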