Implementation of Q-Learning
Defining the Environment and Parameters
import numpy as np
# Define the environment
n_states = 16 # Number of states in the grid world
n_actions = 4 # Number of possible actions (up, down, left, right)
goal_state = 15 # Goal state
# Initialize Q-table with zeros
Q_table = np.zeros((n_states, n_actions))
# Define parameters
learning_rate = 0.8
discount_factor = 0.95
exploration_prob = 0.2
epochs = 1000
In this Q-learning implementation, a grid world environment is defined with 16 states, and the agent can take 4 possible actions: up, down, left, and right. The goal is to reach state 15. The Q-table, initialized with zeros, serves as the agent's memory, storing a Q-value for every state-action pair.
The learning parameters include a learning rate of 0.8, a discount factor of 0.95, an exploration probability of 0.2, and a total of 1000 training epochs. The learning rate influences the weight given to new information, the discount factor adjusts the importance of future rewards, and the exploration probability determines the likelihood of the agent exploring new actions versus exploiting known actions.
Throughout the training epochs, the agent explores the environment, updating Q-values based on received rewards and future expectations, ultimately learning a strategy to navigate the grid world towards the goal state.
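The update applied at every step is the standard Q-learning rule, which the code in the next section implements directly. Writing s for the current state, a for the chosen action, r for the received reward, s' for the next state, α for the learning rate, and γ for the discount factor:

$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$$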
Implementing the Q-Learning Algorithm
# Q-learning algorithm
for epoch in range(epochs):
    current_state = np.random.randint(0, n_states)  # Start from a random state
    while current_state != goal_state:
        # Choose an action with the epsilon-greedy strategy
        if np.random.rand() < exploration_prob:
            action = np.random.randint(0, n_actions)  # Explore
        else:
            action = np.argmax(Q_table[current_state])  # Exploit
        # Simulate the environment: for simplicity, always move to the
        # next state regardless of the chosen action
        next_state = (current_state + 1) % n_states
        # Simple reward function: 1 if the goal state is reached, 0 otherwise
        reward = 1 if next_state == goal_state else 0
        # Update the Q-value using the Q-learning update rule
        Q_table[current_state, action] += learning_rate * (
            reward
            + discount_factor * np.max(Q_table[next_state])
            - Q_table[current_state, action]
        )
        current_state = next_state  # Move to the next state

# After training, the Q-table holds the learned Q-values
print("Learned Q-table:")
print(Q_table)
Output:
Learned Q-table:
[[0.48767498 0.48377358 0.48751874 0.48377357]
 [0.51252074 0.51317781 0.51334071 0.51334208]
 [0.54036009 0.5403255  0.54018713 0.54036009]
 [0.56880009 0.56880009 0.56880008 0.56880009]
 [0.59873694 0.59873694 0.59873694 0.59873694]
 [0.63024941 0.63024941 0.63024941 0.63024941]
 [0.66342043 0.66342043 0.66342043 0.66342043]
 [0.6983373  0.6983373  0.6983373  0.6983373 ]
 [0.73509189 0.73509189 0.73509189 0.73509189]
 [0.77378094 0.77378094 0.77378094 0.77378094]
 [0.81450625 0.81450625 0.81450625 0.81450625]
 [0.857375   0.857375   0.857375   0.857375  ]
 [0.9025     0.9025     0.9025     0.9025    ]
 [0.95       0.95       0.95       0.95      ]
 [1.         1.         1.         1.        ]
 [0.         0.         0.         0.        ]]
The Q-learning algorithm trains iteratively: the agent starts each epoch in a random state, selects actions with the epsilon-greedy strategy, and simulates a move to the next state. The reward function grants 1 for reaching the goal state and 0 otherwise. Each Q-value is updated with the Q-learning rule, combining the immediate reward with the discounted maximum Q-value of the next state. After the configured number of epochs, the final Q-table holds the learned state-action values.
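As a quick usage example, the greedy policy can be read off the learned Q-table by taking the argmax over actions in each state. A minimal sketch, assuming the Q_table and n_states from the code above; the action-name list follows the up/down/left/right ordering described earlier:

# Read the greedy policy off the learned Q-table.
action_names = ["up", "down", "left", "right"]
greedy_policy = np.argmax(Q_table, axis=1)  # best action index per state
for state in range(n_states):
    print(f"State {state:2d}: {action_names[greedy_policy[state]]}")

Note that in this simplified environment every action leads to the same next state, so the Q-values within each row are nearly identical and the extracted actions are essentially arbitrary; the extraction step becomes meaningful once actions have distinct effects.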
Q-Learning in Python
Reinforcement Learning is a learning paradigm in which an agent learns, over time, to behave optimally in a certain environment through continuous interaction with it. During its course of learning, the agent experiences various situations in the environment; these are called states. While in a state, the agent may choose from a set of allowable actions, each of which may fetch different rewards (or penalties). Over time, the learning agent learns to maximize these rewards so as to behave optimally in whatever state it finds itself. Q-learning is a basic form of Reinforcement Learning that uses Q-values (also called action values) to iteratively improve the behavior of the learning agent.
This example helps us to better understand reinforcement learning.
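One limitation worth noting: in the loop above, the agent always moves to current_state + 1, so the four actions have no distinct effect. A natural next step is a transition function in which up, down, left, and right actually move the agent on the 4x4 grid. The sketch below is one illustrative way to write it; the step function and its action ordering are assumptions, not part of the original example:

import numpy as np

grid_size = 4  # the 16 states form a 4x4 grid, numbered row by row

def step(state, action):
    # Illustrative transition: actions 0-3 mean up, down, left, right;
    # moves that would leave the grid keep the agent in place.
    row, col = divmod(state, grid_size)
    if action == 0:                        # up
        row = max(row - 1, 0)
    elif action == 1:                      # down
        row = min(row + 1, grid_size - 1)
    elif action == 2:                      # left
        col = max(col - 1, 0)
    else:                                  # right
        col = min(col + 1, grid_size - 1)
    return row * grid_size + col

# Drop-in replacement for the transition line in the training loop:
# next_state = step(current_state, action)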