Defining Environment and Parameters


import numpy as np

# Define the environment
n_states = 16    # Number of states in the grid world
n_actions = 4    # Number of possible actions (up, down, left, right)
goal_state = 15  # Goal state

# Initialize Q-table with zeros
Q_table = np.zeros((n_states, n_actions))

# Define parameters
learning_rate = 0.8
discount_factor = 0.95
exploration_prob = 0.2
epochs = 1000

In this Q-learning implementation, a grid world environment is defined with 16 states, and the agent can take 4 possible actions: up, down, left, and right. The goal is to reach state 15. The Q-table, initialized with zeros, serves as a memory that stores a Q-value for every state-action pair.

The learning parameters include a learning rate of 0.8, a discount factor of 0.95, an exploration probability of 0.2, and a total of 1000 training epochs. The learning rate influences the weight given to new information, the discount factor adjusts the importance of future rewards, and the exploration probability determines the likelihood of the agent exploring new actions versus exploiting known actions.
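To make the role of the exploration probability concrete, here is a minimal sketch of epsilon-greedy action selection. The helper name `epsilon_greedy`, the sample Q-values in `q_row`, and the fixed random seed are assumptions chosen for illustration; they are not part of the original code.

```python
import numpy as np

def epsilon_greedy(q_row, exploration_prob, rng):
    """Pick an action index from one row of the Q-table (hypothetical helper)."""
    if rng.random() < exploration_prob:
        # Explore: choose uniformly among all actions
        return int(rng.integers(0, len(q_row)))
    # Exploit: choose the action with the highest current Q-value
    return int(np.argmax(q_row))

rng = np.random.default_rng(0)            # seed chosen arbitrarily
q_row = np.array([0.1, 0.5, 0.2, 0.0])    # made-up Q-values; action 1 is greedy
choices = [epsilon_greedy(q_row, 0.2, rng) for _ in range(10_000)]
greedy_share = choices.count(1) / len(choices)
print(f"Share of greedy picks: {greedy_share:.2f}")
```

With an exploration probability of 0.2 and 4 actions, the greedy action should be picked roughly 85% of the time: 80% from exploitation plus a quarter of the 20% exploration steps.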

Throughout the training epochs, the agent explores the environment, updating Q-values based on received rewards and future expectations, ultimately learning a strategy to navigate the grid world towards the goal state.

Implementing the Q-learning Algorithm


# Q-learning algorithm
for epoch in range(epochs):
    current_state = np.random.randint(0, n_states)  # Start from a random state

    while current_state != goal_state:
        # Choose action with epsilon-greedy strategy
        if np.random.rand() < exploration_prob:
            action = np.random.randint(0, n_actions)  # Explore
        else:
            action = np.argmax(Q_table[current_state])  # Exploit

        # Simulate the environment (move to the next state)
        # For simplicity, move to the next state regardless of the action
        next_state = (current_state + 1) % n_states

        # Define a simple reward function (1 if the goal state is reached, 0 otherwise)
        reward = 1 if next_state == goal_state else 0

        # Update Q-value using the Q-learning update rule
        Q_table[current_state, action] += learning_rate * (
            reward + discount_factor * np.max(Q_table[next_state])
            - Q_table[current_state, action]
        )

        current_state = next_state  # Move to the next state

# After training, the Q-table represents the learned Q-values
print("Learned Q-table:")
print(Q_table)


Learned Q-table:
[[0.48767498 0.48377358 0.48751874 0.48377357]
 [0.51252074 0.51317781 0.51334071 0.51334208]
 [0.54036009 0.5403255  0.54018713 0.54036009]
 [0.56880009 0.56880009 0.56880008 0.56880009]
 [0.59873694 0.59873694 0.59873694 0.59873694]
 [0.63024941 0.63024941 0.63024941 0.63024941]
 [0.66342043 0.66342043 0.66342043 0.66342043]
 [0.6983373  0.6983373  0.6983373  0.6983373 ]
 [0.73509189 0.73509189 0.73509189 0.73509189]
 [0.77378094 0.77378094 0.77378094 0.77378094]
 [0.81450625 0.81450625 0.81450625 0.81450625]
 [0.857375   0.857375   0.857375   0.857375  ]
 [0.9025     0.9025     0.9025     0.9025    ]
 [0.95       0.95       0.95       0.95      ]
 [1.         1.         1.         1.        ]
 [0.         0.         0.         0.        ]]
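The printed values follow a simple pattern that can be checked by hand. Because every action moves the agent from state s to state s + 1 and the only reward is 1 on entering state 15, the converged Q-value of any action in state s should be the discount factor raised to the number of steps left before that rewarded transition, that is 0.95^(14 - s). The sketch below recomputes those values; the list name `expected` is an assumption for illustration.

```python
discount_factor = 0.95  # same value as in the training code
goal_state = 15

# From state s, the rewarded transition (14 -> 15) happens after 14 - s more steps,
# so the reward of 1 is discounted 14 - s times.
expected = [discount_factor ** (goal_state - 1 - s) for s in range(goal_state)]
expected.append(0.0)  # the goal state is terminal and its row is never updated

for state, value in enumerate(expected):
    print(state, round(value, 8))
```

Rows whose four entries differ slightly (states 0 to 3 in the output above) are likely state-action pairs that had not fully converged within 1000 epochs.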

The Q-learning algorithm trains iteratively. In each epoch the agent starts from a random state, selects actions with an epsilon-greedy strategy, and moves through the environment until it reaches the goal state. The environment here is deliberately simplified: whichever action is chosen, the agent moves to the next state in sequence. The reward function grants 1 for reaching the goal state and 0 otherwise. Each Q-value is updated with the Q-learning rule, which combines the received reward with the discounted best Q-value of the next state. This process repeats for all 1000 epochs, and the final Q-table holds the learned state-action values.
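In the training loop above the chosen action does not influence the next state, which is why all four actions end up with nearly the same Q-value. As a rough sketch of how actions could matter, the function below models the 16 states as a 4x4 grid. The row-major state numbering, the action order (0 = up, 1 = down, 2 = left, 3 = right), the rule that walls leave the agent in place, and the name `step` are all assumptions for illustration, not part of the original code.

```python
GRID_SIZE = 4    # assumed 4x4 layout for the 16 states
GOAL_STATE = 15  # bottom-right cell under row-major numbering

def step(state, action):
    """Return (next_state, reward) for a hypothetical 4x4 grid world."""
    row, col = divmod(state, GRID_SIZE)
    if action == 0:      # up
        row = max(row - 1, 0)
    elif action == 1:    # down
        row = min(row + 1, GRID_SIZE - 1)
    elif action == 2:    # left
        col = max(col - 1, 0)
    elif action == 3:    # right
        col = min(col + 1, GRID_SIZE - 1)
    next_state = row * GRID_SIZE + col
    reward = 1 if next_state == GOAL_STATE else 0
    return next_state, reward

print(step(0, 3))   # (1, 0): moving right from the top-left corner
print(step(11, 1))  # (15, 1): moving down into the goal
print(step(0, 0))   # (0, 0): moving up into a wall leaves the agent in place
```

Replacing the line `next_state = (current_state + 1) % n_states` and the reward line in the training loop with a call such as `next_state, reward = step(current_state, action)` would make the learned Q-values depend on the action taken.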

Reinforcement Learning is a learning paradigm in which an agent learns, over time, to behave optimally in an environment by interacting with it continuously. During learning, the agent experiences various situations in the environment. These are called states. In each state the agent may choose from a set of allowable actions, which may fetch different rewards (or penalties). Over time, the agent learns to maximize these rewards so that it behaves optimally in any given state. Q-learning is a basic form of Reinforcement Learning that uses Q-values (also called action values) to iteratively improve the behavior of the learning agent.

This example helps us to better understand reinforcement learning.


Q-learning in Reinforcement Learning

Q-learning is a popular model-free reinforcement learning algorithm used in machine learning and artificial intelligence applications. It falls under the category of temporal difference learning techniques, in which an agent picks up new information by observing results, interacting with the environment, and getting feedback in the form of rewards....

Key Components of Q-learning

Q-Values or Action-Values: Q-values are defined for states and actions. [Tex]Q(S, A)[/Tex] is an estimate of how good it is to take action A in state S. This estimate of [Tex]Q(S, A)[/Tex] is computed iteratively using the TD-Update rule, which we will see in the upcoming sections.

Rewards and Episodes: Throughout its lifetime, an agent starts from a start state and makes several transitions from its current state to a next state, based on its choice of action and on the environment it is interacting with. At every step, the agent takes an action from a state, observes a reward from the environment, and then transitions to another state. If at any point the agent ends up in one of the terminating states, no further transitions are possible. This marks the completion of an episode.

Temporal Difference or TD-Update: The Temporal Difference or TD-Update rule can be represented as follows:

[Tex]Q(S,A) \leftarrow Q(S,A) + \alpha (R + \gamma Q(S',A') - Q(S,A))[/Tex]

This update rule for estimating the value of Q is applied at every time step of the agent's interaction with the environment. The terms used are explained below:

S – Current state of the agent.
A – Current action, picked according to some policy.
S' – Next state, where the agent ends up.
A' – Next best action according to the current Q-value estimates, i.e. the action with the maximum Q-value in the next state.
R – Current reward observed from the environment in response to the current action.
[Tex]\gamma[/Tex] (>0 and <=1): Discounting factor for future rewards. Future rewards are less valuable than current rewards, so they must be discounted. Since a Q-value is an estimate of expected rewards from a state, the discounting rule applies here as well.
[Tex]\alpha[/Tex]: Step length taken to update the estimate of Q(S, A).

Selecting the Course of Action with ϵ-greedy policy: A simple method for selecting an action to take based on the current estimates of the Q-value is the ϵ-greedy policy. This is how it operates:...
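As a rough illustration of the TD-Update rule given above, the sketch below applies one update to a small Q-table. The function name `td_update`, the table shape, and the sample numbers are assumptions chosen for illustration. Following the definition of A' as the action with the maximum Q-value in the next state, Q(S', A') is computed as the maximum over the next state's row.

```python
import numpy as np

def td_update(Q, s, a, r, s_next, alpha, gamma):
    """Apply one TD update to Q[s, a] in place and return the new value."""
    td_target = r + gamma * np.max(Q[s_next])  # R + gamma * Q(S', A')
    td_error = td_target - Q[s, a]             # difference from the current estimate
    Q[s, a] += alpha * td_error                # move a fraction alpha toward the target
    return Q[s, a]

Q = np.zeros((16, 4))  # 16 states, 4 actions, all estimates start at 0
Q[14] = 1.0            # pretend state 14 is already known to be worth 1

new_value = td_update(Q, s=13, a=0, r=0, s_next=14, alpha=0.8, gamma=0.95)
print(new_value)
```

Here the target is 0 + 0.95 * 1 = 0.95 and the old estimate is 0, so a step length of 0.8 moves the estimate to approximately 0.76. Repeating the same update would bring it closer to 0.95.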

How Does Q-Learning Work?

Q-learning models engage in an iterative process where various components collaborate to train the model. This iterative procedure encompasses the agent exploring the environment and continuously updating the model based on this exploration. The key components of Q-learning include:...

Q-learning Advantages and Disadvantages


Q-learning Applications

Applications for Q-learning, a reinforcement learning algorithm, can be found in many different fields. Here are a few noteworthy instances:...

