How the Actor-Critic Algorithm Works
Actor Critic Algorithm Objective Function
- The objective function for the Actor-Critic algorithm combines a policy-gradient objective (for the actor) with a value-function loss (for the critic).
- The overall objective is typically expressed through two components:
Policy Gradient (Actor)
[Tex]\nabla_\theta J(\theta)\approx \frac{1}{N} \sum_{i=1}^{N} \nabla_\theta \log\pi_\theta (a_i|s_i)\cdot A(s_i,a_i) [/Tex]
Here,
- [Tex]J(\theta)[/Tex] represents the expected return under the policy parameterized by [Tex]\theta[/Tex]
- [Tex]\pi_\theta(a|s)[/Tex] is the policy function
- N is the number of sampled experiences.
- [Tex]A(s,a)[/Tex] is the advantage function representing the advantage of taking action a in state s.
- i represents the index of the sample
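As a concrete illustration, here is a minimal sketch of this actor objective in Python (using PyTorch), assuming a discrete-action policy network `policy_net` and precomputed advantages; the names and function signature are illustrative, not part of the algorithm's definition.

```python
import torch
from torch.distributions import Categorical

def actor_loss(policy_net, states, actions, advantages):
    """Negated sample estimate of the policy-gradient objective."""
    logits = policy_net(states)           # action logits for each sampled state
    dist = Categorical(logits=logits)
    log_probs = dist.log_prob(actions)    # log pi_theta(a_i | s_i)
    # Minimizing the negative mean performs gradient ascent on J(theta);
    # advantages are treated as constants (no gradient flows through them).
    return -(log_probs * advantages.detach()).mean()
```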
Value Function Update (Critic)
[Tex]\nabla_w J(w) \approx \frac{1}{N}\sum_{i=1}^{N} \nabla_w (V_{w}(s_i)- Q_{w}(s_i , a_i))^2 [/Tex]
Here,
- [Tex]\nabla_w J(w)[/Tex] is the gradient of the loss function with respect to the critic’s parameters w.
- N is the number of samples
- [Tex]V_w(s_i)[/Tex] is the critic's estimate of the value of state [Tex]s_i[/Tex] with parameters w
- [Tex]Q_w(s_i, a_i)[/Tex] is the critic's estimate of the action-value of taking action [Tex]a_i[/Tex] in state [Tex]s_i[/Tex]
- i represents the index of the sample
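A matching sketch of the critic's squared-error loss, assuming a value network `value_net` that outputs [Tex]V_w(s)[/Tex] and a tensor of action-value targets standing in for [Tex]Q_w(s_i, a_i)[/Tex]; again, all names are illustrative.

```python
import torch
import torch.nn.functional as F

def critic_loss(value_net, states, action_values):
    """Mean squared error between V_w(s_i) and the action-value targets."""
    values = value_net(states).squeeze(-1)   # V_w(s_i)
    # Targets are treated as fixed; in practice they are often sampled returns
    # or TD targets rather than a separately learned Q function.
    return F.mse_loss(values, action_values.detach())
```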
Update Rules
The update rules for the actor and critic involve adjusting their respective parameters using gradient ascent (for the actor) and gradient descent (for the critic).
Actor Update
[Tex] \theta_{t+1}= \theta_t + \alpha \nabla_\theta J(\theta_t) [/Tex]
Here,
- [Tex]\alpha[/Tex]: learning rate for the actor
- t is the time step within an episode
Critic Update
[Tex]w_{t+1} = w_t -\beta \nabla_w J(w_t) [/Tex]
Here,
- w represents the parameters of the critic network
- [Tex]\beta[/Tex] is the learning rate for the critic
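The two update rules can be wired together as shown below. This is a sketch only: it reuses the hypothetical `actor_loss` and `critic_loss` helpers from the earlier snippets, the network sizes and learning rates are arbitrary examples, and gradient ascent on the actor is realized by minimizing the negated objective.

```python
import torch
import torch.nn as nn

# Illustrative actor and critic networks (4-dimensional states, 2 discrete actions assumed).
policy_net = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 2))
value_net = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 1))

alpha, beta = 3e-4, 1e-3                     # example learning rates
actor_opt = torch.optim.Adam(policy_net.parameters(), lr=alpha)
critic_opt = torch.optim.Adam(value_net.parameters(), lr=beta)

def update(states, actions, advantages, action_values):
    # Actor: theta_{t+1} = theta_t + alpha * grad J(theta_t),
    # implemented as a descent step on the negated objective.
    a_loss = actor_loss(policy_net, states, actions, advantages)
    actor_opt.zero_grad()
    a_loss.backward()
    actor_opt.step()

    # Critic: w_{t+1} = w_t - beta * grad J(w_t), a plain descent step on the squared error.
    c_loss = critic_loss(value_net, states, action_values)
    critic_opt.zero_grad()
    c_loss.backward()
    critic_opt.step()
```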
Advantage Function
The advantage function, [Tex]A(s,a) [/Tex], measures how much better taking action a in state s is than the expected value of that state under the current policy.
[Tex]A(s,a) = Q(s,a) - V(s) [/Tex]
The advantage function, then, provides a measure of how much better or worse an action is compared to the average action.
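In practice the advantage is usually estimated from sampled transitions rather than from a separate Q network; a common choice is the one-step TD error, since [Tex]r + \gamma V(s')[/Tex] is an estimate of [Tex]Q(s,a)[/Tex]. The sketch below assumes a `value_net` like the one above and is only one of several possible estimators.

```python
import torch

def estimate_advantage(value_net, states, rewards, next_states, dones, gamma=0.99):
    """One-step TD-error estimate of A(s, a) = Q(s, a) - V(s)."""
    with torch.no_grad():
        v_s = value_net(states).squeeze(-1)          # V(s)
        v_next = value_net(next_states).squeeze(-1)  # V(s')
        # r + gamma * V(s') approximates Q(s, a); terminal states get no bootstrap.
        q_estimate = rewards + gamma * (1.0 - dones) * v_next
    return q_estimate - v_s
```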
These mathematical expressions highlight the essential computations involved in the Actor-Critic method. The actor is updated based on the policy gradient, encouraging actions with higher advantages, while the critic is updated to minimize the difference between the estimated value and the action-value.
Actor-Critic Algorithm in Reinforcement Learning
Reinforcement learning (RL) stands as a pivotal component in the realm of artificial intelligence, enabling agents to learn optimal decision-making strategies through interaction with their environments.
Let’s dive into the actor-critic algorithm, a key concept in reinforcement learning, and learn how it can improve your machine learning models.
Table of Contents
- What is the Actor-Critic Algorithm?
- How the Actor-Critic Algorithm Works
- A2C (Advantage Actor-Critic)
- Training Agent: Actor-Critic Algorithm
- Advantages of Actor Critic Algorithm
- Advantage Actor Critic (A2C) vs. Asynchronous Advantage Actor Critic (A3C)
- Conclusion