How the Actor-Critic Algorithm Works
Actor Critic Algorithm Objective Function
- The objective function for the Actor-Critic algorithm combines a policy-gradient objective (for the actor) with a value-function loss (for the critic).
- The overall objective is typically expressed through two components:
Policy Gradient (Actor)
[Tex]\nabla_\theta J(\theta)\approx \frac{1}{N} \sum_{i=1}^{N} \nabla_\theta \log\pi_\theta (a_i|s_i)\cdot A(s_i,a_i) [/Tex]
Here,
- [Tex]J(\theta)[/Tex] represents the expected return under the policy parameterized by [Tex]\theta[/Tex]
- [Tex]\pi_\theta(a|s)[/Tex] is the policy function
- N is the number of sampled experiences.
- [Tex]A(s,a)[/Tex] is the advantage function representing the advantage of taking action a in state s.
- i represents the index of the sample
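As a concrete illustration, here is a minimal sketch of this actor objective in Python (using PyTorch), assuming a discrete-action policy network `policy_net` and precomputed advantages; the names and function signature are illustrative, not part of the algorithm's definition.

```python
import torch
from torch.distributions import Categorical

def actor_loss(policy_net, states, actions, advantages):
    """Negated sample estimate of the policy-gradient objective."""
    logits = policy_net(states)           # action logits for each sampled state
    dist = Categorical(logits=logits)
    log_probs = dist.log_prob(actions)    # log pi_theta(a_i | s_i)
    # Minimizing the negative mean performs gradient ascent on J(theta);
    # advantages are treated as constants (no gradient flows through them).
    return -(log_probs * advantages.detach()).mean()
```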
Value Function Update (Critic)
[Tex]\nabla_w J(w) \approx \frac{1}{N}\sum_{i=1}^{N} \nabla_w (V_{w}(s_i)- Q_{w}(s_i , a_i))^2 [/Tex]
Here,
- [Tex]\nabla_w J(w)[/Tex] is the gradient of the loss function with respect to the critic’s parameters w.
- N is the number of samples
- [Tex]V_w(s_i)[/Tex] is the critic's estimate of the value of state [Tex]s_i[/Tex] with parameters w
- [Tex]Q_w(s_i, a_i)[/Tex] is the critic's estimate of the action-value of taking action [Tex]a_i[/Tex] in state [Tex]s_i[/Tex]
- i represents the index of the sample
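A matching sketch of the critic's squared-error loss, assuming a value network `value_net` that outputs [Tex]V_w(s)[/Tex] and a tensor of action-value targets standing in for [Tex]Q_w(s_i, a_i)[/Tex]; again, all names are illustrative.

```python
import torch
import torch.nn.functional as F

def critic_loss(value_net, states, action_values):
    """Mean squared error between V_w(s_i) and the action-value targets."""
    values = value_net(states).squeeze(-1)   # V_w(s_i)
    # Targets are treated as fixed; in practice they are often sampled returns
    # or TD targets rather than a separately learned Q function.
    return F.mse_loss(values, action_values.detach())
```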
Update Rules
The update rules for the actor and critic involve adjusting their respective parameters using gradient ascent (for the actor) and gradient descent (for the critic).
Actor Update
[Tex] \theta_{t+1}= \theta_t + \alpha \nabla_\theta J(\theta_t) [/Tex]
Here,
- [Tex]\alpha[/Tex]: learning rate for the actor
- t is the time step within an episode
Critic Update
[Tex]w_{t+1} = w_t -\beta \nabla_w J(w_t) [/Tex]
Here,
- w represents the parameters of the critic network
- [Tex]\beta[/Tex] is the learning rate for the critic
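The two update rules can be wired together as shown below. This is a sketch only: it reuses the hypothetical `actor_loss` and `critic_loss` helpers from the earlier snippets, the network sizes and learning rates are arbitrary examples, and gradient ascent on the actor is realized by minimizing the negated objective.

```python
import torch
import torch.nn as nn

# Illustrative actor and critic networks (4-dimensional states, 2 discrete actions assumed).
policy_net = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 2))
value_net = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 1))

alpha, beta = 3e-4, 1e-3                     # example learning rates
actor_opt = torch.optim.Adam(policy_net.parameters(), lr=alpha)
critic_opt = torch.optim.Adam(value_net.parameters(), lr=beta)

def update(states, actions, advantages, action_values):
    # Actor: theta_{t+1} = theta_t + alpha * grad J(theta_t),
    # implemented as a descent step on the negated objective.
    a_loss = actor_loss(policy_net, states, actions, advantages)
    actor_opt.zero_grad()
    a_loss.backward()
    actor_opt.step()

    # Critic: w_{t+1} = w_t - beta * grad J(w_t), a plain descent step on the squared error.
    c_loss = critic_loss(value_net, states, action_values)
    critic_opt.zero_grad()
    c_loss.backward()
    critic_opt.step()
```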
Advantage Function
The advantage function, [Tex]A(s,a) [/Tex], measures how much better taking action a in state s is than the expected value of that state under the current policy.
[Tex]A(s,a) = Q(s,a) - V(s) [/Tex]
The advantage function, then, provides a measure of how much better or worse an action is compared to the average action.
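In practice the advantage is usually estimated from sampled transitions rather than from a separate Q network; a common choice is the one-step TD error, since [Tex]r + \gamma V(s')[/Tex] is an estimate of [Tex]Q(s,a)[/Tex]. The sketch below assumes a `value_net` like the one above and is only one of several possible estimators.

```python
import torch

def estimate_advantage(value_net, states, rewards, next_states, dones, gamma=0.99):
    """One-step TD-error estimate of A(s, a) = Q(s, a) - V(s)."""
    with torch.no_grad():
        v_s = value_net(states).squeeze(-1)          # V(s)
        v_next = value_net(next_states).squeeze(-1)  # V(s')
        # r + gamma * V(s') approximates Q(s, a); terminal states get no bootstrap.
        q_estimate = rewards + gamma * (1.0 - dones) * v_next
    return q_estimate - v_s
```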
These mathematical expressions highlight the essential computations involved in the Actor-Critic method. The actor is updated based on the policy gradient, encouraging actions with higher advantages, while the critic is updated to minimize the difference between the estimated value and the action-value.
Actor-Critic Algorithm in Reinforcement Learning
Reinforcement learning (RL) stands as a pivotal component in the realm of artificial intelligence, enabling agents to learn optimal decision-making strategies through interaction with their environments.
Let’s dive into the actor-critic algorithm, a key concept in reinforcement learning, and learn how it can improve your machine learning models.
Table of Contents
- What is the Actor-Critic Algorithm?
- How the Actor-Critic Algorithm Works
- A2C (Advantage Actor-Critic)
- Training Agent: Actor-Critic Algorithm
- Advantages of Actor Critic Algorithm
- Advantage Actor Critic (A2C) vs. Asynchronous Advantage Actor Critic (A3C)
- Conclusion