(Optional) the Policy Gradient Theorem
In this optional section, we're going to study how we differentiate the objective function that we will use to approximate the policy gradient.
Let’s first recap our different formulas:
The objective function:
$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)] = \sum_{\tau} P(\tau;\theta) R(\tau)$$
The probability of a trajectory (given that the actions come from $\pi_\theta$):
$$P(\tau;\theta) = \mu(s_0) \prod_{t=0}^{H} P(s_{t+1} \mid s_t, a_t) \, \pi_\theta(a_t \mid s_t)$$
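To make these two quantities concrete, here is a minimal sketch that enumerates every trajectory of a made-up one-step problem and computes $J(\theta)$ exactly. The environment, the rewards, and the softmax parameterization are assumptions for illustration, not part of the course material.

```python
import numpy as np

# Toy one-step problem (all numbers made up for illustration): a single state s0,
# two possible actions, and a softmax policy pi_theta(a | s0) with one logit per action.
theta = np.array([0.5, -0.2])       # policy parameters (logits)
rewards = np.array([1.0, 0.0])      # R(tau) for the trajectory that takes action 0 or action 1

def policy(theta):
    """pi_theta(a | s0): softmax over the action logits."""
    z = np.exp(theta - theta.max())
    return z / z.sum()

# With a single step there is one trajectory per action, so P(tau; theta) = pi_theta(a | s0)
# and the objective is the expected return J(theta) = sum_tau P(tau; theta) * R(tau).
J = np.sum(policy(theta) * rewards)
print(J)
```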
So we have:
$$\nabla_\theta J(\theta) = \nabla_\theta \sum_{\tau} P(\tau;\theta) R(\tau)$$
We can rewrite the gradient of the sum as the sum of the gradients:
$$= \sum_{\tau} \nabla_\theta \big( P(\tau;\theta) R(\tau) \big) = \sum_{\tau} \nabla_\theta P(\tau;\theta) R(\tau)$$
as $R(\tau)$ is not dependent on $\theta$.
We then multiply every term in the sum by $\frac{P(\tau;\theta)}{P(\tau;\theta)}$ (which is possible since it equals 1):
$$= \sum_{\tau} \frac{P(\tau;\theta)}{P(\tau;\theta)} \nabla_\theta P(\tau;\theta) R(\tau)$$
We can simplify this further since $\frac{P(\tau;\theta)}{P(\tau;\theta)} \nabla_\theta P(\tau;\theta) = P(\tau;\theta) \frac{\nabla_\theta P(\tau;\theta)}{P(\tau;\theta)}$.
Thus we can rewrite the sum as:
$$= \sum_{\tau} P(\tau;\theta) \frac{\nabla_\theta P(\tau;\theta)}{P(\tau;\theta)} R(\tau)$$
We can then use the derivative log trick (also called the likelihood ratio trick or REINFORCE trick), a simple rule in calculus that implies that $\nabla_x \log f(x) = \frac{\nabla_x f(x)}{f(x)}$.
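As a quick sanity check of this identity (not part of the original derivation), the sketch below compares a finite-difference estimate of $\nabla_x \log f(x)$ with $\frac{\nabla_x f(x)}{f(x)}$ for an arbitrarily chosen function $f$:

```python
import numpy as np

# Check the identity d/dx log f(x) = f'(x) / f(x) numerically
# for a positive function, here f(x) = x**2 + 1 (chosen just for illustration).
f = lambda x: x**2 + 1.0
x, eps = 1.3, 1e-6

grad_log_f = (np.log(f(x + eps)) - np.log(f(x - eps))) / (2 * eps)   # finite difference of log f
grad_f_over_f = ((f(x + eps) - f(x - eps)) / (2 * eps)) / f(x)        # f'(x) / f(x)

print(grad_log_f, grad_f_over_f)   # both are approximately 0.9665, confirming the identity
```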
So given we have $\frac{\nabla_\theta P(\tau;\theta)}{P(\tau;\theta)}$, we transform it into $\nabla_\theta \log P(\tau;\theta)$.
So this is our likelihood policy gradient:
$$\nabla_\theta J(\theta) = \sum_{\tau} P(\tau;\theta) \nabla_\theta \log P(\tau;\theta) R(\tau)$$
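If you want to verify this equality numerically, here is a small sketch on a toy one-step softmax policy (all numbers and the closed-form softmax log-gradient are assumptions for illustration): it compares a finite-difference gradient of $J(\theta)$ with the likelihood ratio form.

```python
import numpy as np

theta = np.array([0.5, -0.2])       # toy policy logits (made up)
rewards = np.array([1.0, 0.0])

def policy(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

def J(theta):
    # J(theta) = sum_tau P(tau; theta) * R(tau); here P(tau; theta) = pi_theta(a | s0)
    return np.sum(policy(theta) * rewards)

# Exact gradient of J via central finite differences
eps = 1e-6
fd_grad = np.array([
    (J(theta + eps * np.eye(2)[k]) - J(theta - eps * np.eye(2)[k])) / (2 * eps)
    for k in range(2)
])

# Likelihood ratio form: sum_tau P(tau; theta) * grad_theta log P(tau; theta) * R(tau)
p = policy(theta)
grad_log_p = np.eye(2) - p          # for a softmax: d log p_a / d theta_k = 1{a == k} - p_k
lr_grad = (p * rewards) @ grad_log_p

print(fd_grad, lr_grad)             # the two gradients match
```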
Thanks to this new formula, we can estimate the gradient using trajectory samples (we can approximate the likelihood ratio policy gradient with a sample-based estimate, if you prefer):
$$\nabla_\theta J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \nabla_\theta \log P(\tau^{(i)};\theta) R(\tau^{(i)})$$
where each $\tau^{(i)}$ is a sampled trajectory.
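In practice, an autodiff framework can compute this estimate for us. Below is a minimal PyTorch-style sketch with hypothetical tensors `log_prob_tau` and `returns` built from three made-up one-step trajectories; it anticipates the simplification derived below, where only the $\pi_\theta$ terms of $\log P(\tau;\theta)$ depend on $\theta$.

```python
import torch

# Minimal sketch of the sample-based estimate (all tensors and names are made up for illustration).
theta = torch.tensor([0.5, -0.2], requires_grad=True)   # toy policy parameters (logits)
log_pi = torch.log_softmax(theta, dim=0)                # stand-in for log pi_theta(a | s)
actions = torch.tensor([0, 1, 0])                       # actions of m = 3 sampled one-step trajectories
log_prob_tau = log_pi[actions]                          # differentiable log-prob of each trajectory
returns = torch.tensor([1.0, 0.0, 1.0])                 # R(tau^(i)) for each sampled trajectory

# (1/m) * sum_i  log P(tau^(i); theta) * R(tau^(i)); calling backward() then leaves
# the sample-based estimate of grad_theta J(theta) in theta.grad
surrogate = (log_prob_tau * returns).mean()
surrogate.backward()
print(theta.grad)
```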
But we still have some mathematical work to do: we need to simplify $\nabla_\theta \log P(\tau;\theta)$.
We know that:
$$\nabla_\theta \log P(\tau^{(i)};\theta) = \nabla_\theta \log \left[ \mu(s_0) \prod_{t=0}^{H} P(s_{t+1}^{(i)} \mid s_t^{(i)}, a_t^{(i)}) \, \pi_\theta(a_t^{(i)} \mid s_t^{(i)}) \right]$$
where $\mu(s_0)$ is the initial state distribution and $P(s_{t+1}^{(i)} \mid s_t^{(i)}, a_t^{(i)})$ is the state transition dynamics of the MDP.
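To see this factorization in action, here is a small sketch on a made-up two-state MDP (the initial distribution, dynamics, and policy tables are all invented for illustration) that multiplies the terms together for one sampled trajectory:

```python
import numpy as np

# Toy 2-state MDP to illustrate P(tau; theta) = mu(s0) * prod_t P(s_{t+1} | s_t, a_t) * pi_theta(a_t | s_t)
mu = np.array([1.0, 0.0])              # initial state distribution over {s0, s1}
P = np.array([[[0.8, 0.2],             # P[s, a, s'] : state transition dynamics
               [0.1, 0.9]],
              [[0.5, 0.5],
               [0.3, 0.7]]])
pi = np.array([[0.6, 0.4],             # pi[s, a] : a fixed toy policy pi_theta(a | s)
               [0.2, 0.8]])

# One sampled trajectory: (s0, a0, s1, a1, s2) = (0, 1, 1, 0, 0)
states, actions = [0, 1, 0], [1, 0]

prob = mu[states[0]]
for t, a in enumerate(actions):
    s, s_next = states[t], states[t + 1]
    prob *= pi[s, a] * P[s, a, s_next]  # policy term * dynamics term
print(prob)                              # P(tau; theta) for this trajectory
```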
We know that the log of a product is equal to the sum of the logs:
$$\nabla_\theta \log P(\tau^{(i)};\theta) = \nabla_\theta \left[ \log \mu(s_0) + \sum_{t=0}^{H} \log P(s_{t+1}^{(i)} \mid s_t^{(i)}, a_t^{(i)}) + \sum_{t=0}^{H} \log \pi_\theta(a_t^{(i)} \mid s_t^{(i)}) \right]$$
We also know that the gradient of the sum is equal to the sum of the gradients:
$$\nabla_\theta \log P(\tau^{(i)};\theta) = \nabla_\theta \log \mu(s_0) + \nabla_\theta \sum_{t=0}^{H} \log P(s_{t+1}^{(i)} \mid s_t^{(i)}, a_t^{(i)}) + \nabla_\theta \sum_{t=0}^{H} \log \pi_\theta(a_t^{(i)} \mid s_t^{(i)})$$
Since neither the initial state distribution nor the state transition dynamics of the MDP depend on $\theta$, the derivatives of both terms are 0. So we can remove them: