I want to implement the following algorithm, taken from this book, section 13.6:
I don't understand how to implement the update rule in pytorch (the rule for w is quite similar to that of theta).
As far as I know, torch requires a loss for loss.backwward()
.
This form does not seem to apply for the quoted algorithm.
I'm still certain there is a correct way of implementing such update rules in pytorch.
Would greatly appreciate a code snippet of how the w weights should be updated, given that V(s,w) is the output of the neural net, parameterized by w.
EDIT: Chris Holland suggested a way to implement, and I implemented it. It does not converge on Cartpole, and I wonder if I did something wrong.
The critic does converge on the solution to the function gamma*f(n)=f(n)-1
which happens to be the sum of the series gamma+gamma^2+...+gamma^inf
meaning, gamma=1 diverges. gamma=0.99 converges on 100, gamma=0.5 converges on 2 and so on. Regardless of the actor or policy.
The code:
def _update_grads_with_eligibility(self, is_critic, delta, discount, ep_t):
gamma = self.args.gamma
if is_critic:
params = list(self.critic_nn.parameters())
lamb = self.critic_lambda
eligibilities = self.critic_eligibilities
else:
params = list(self.actor_nn.parameters())
lamb = self.actor_lambda
eligibilities = self.actor_eligibilities
is_episode_just_started = (ep_t == 0)
if is_episode_just_started:
eligibilities.clear()
for i, p in enumerate(params):
if not p.requires_grad:
continue
eligibilities.append(torch.zeros_like(p.grad, requires_grad=False))
# eligibility traces
for i, p in enumerate(params):
if not p.requires_grad:
continue
eligibilities[i][:] = (gamma * lamb * eligibilities[i]) + (discount * p.grad)
p.grad[:] = delta.squeeze() * eligibilities[i]
and
expected_reward_from_t = self.critic_nn(s_t)
probs_t = self.actor_nn(s_t)
expected_reward_from_t1 = torch.tensor([[0]], dtype=torch.float)
if s_t1 is not None: # s_t is not a terminal state, s_t1 exists.
expected_reward_from_t1 = self.critic_nn(s_t1)
delta = r_t + gamma * expected_reward_from_t1.data - expected_reward_from_t.data
negative_expected_reward_from_t = -expected_reward_from_t
self.critic_optimizer.zero_grad()
negative_expected_reward_from_t.backward()
self._update_grads_with_eligibility(is_critic=True,
delta=delta,
discount=discount,
ep_t=ep_t)
self.critic_optimizer.step()
EDIT 2: Chris Holland's solution works. The problem originated from a bug in my code that caused the line
if s_t1 is not None:
expected_reward_from_t1 = self.critic_nn(s_t1)
to always get called, thus expected_reward_from_t1
was never zero, and thus no stopping condition was specified for the bellman equation recursion.
With no reward engineering, gamma=1
, lambda=0.6
, and a single hidden layer of size 128 for both actor and critic, this converged on a rather stable optimal policy within 500 episodes.
Even faster with gamma=0.99
, as the graph shows (best discounted episode reward is about 86.6).
BIG thank you to @Chris Holland, who "gave this a try"
I am gonna give this a try.
.backward()
does not need a loss function, it just needs a differentiable scalar output. It approximates a gradient with respect to the model parameters. Let's just look at the first case the update for the value function.
We have one gradient appearing for v, we can approximate this gradient by
v = model(s)
v.backward()
This gives us a gradient of v
which has the dimension of your model parameters. Assuming we already calculated the other parameter updates, we can calculate the actual optimizer update:
for i, p in enumerate(model.parameters()):
z_theta[i][:] = gamma * lamda * z_theta[i] + l * p.grad
p.grad[:] = alpha * delta * z_theta[i]
We can then use opt.step()
to update the model parameters with the adjusted gradient.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With