I've seen definitions such as:
A policy defines the learning agent's way of behaving at a given time. Roughly speaking, a policy is a mapping from perceived states of the environment to actions to be taken when in those states.
But I still don't fully understand it. What exactly is a policy in reinforcement learning?
A policy is a strategy that an agent uses in pursuit of its goals. It dictates the actions that the agent takes as a function of the agent's state and the environment.
Policy. A policy defines how an agent acts from a specific state. For a deterministic policy, it is the action taken at a specific state. For a stochastic policy, it is the probability of taking an action a given the state s.
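To make the deterministic/stochastic distinction concrete, here is a minimal sketch in Python. The state names, action names, and probabilities are made up for illustration, and the helper functions are hypothetical, not part of any library.

```python
import random

# Hypothetical toy setup: states and actions are plain strings.
ACTIONS = ["left", "right"]

# Deterministic policy: each state maps to exactly one action.
deterministic_policy = {"s0": "right", "s1": "right", "s2": "left"}

# Stochastic policy: each state maps to a probability for every action.
stochastic_policy = {
    "s0": {"left": 0.1, "right": 0.9},
    "s1": {"left": 0.5, "right": 0.5},
    "s2": {"left": 0.8, "right": 0.2},
}

def act_deterministic(state):
    """Return the single action the policy prescribes in `state`."""
    return deterministic_policy[state]

def act_stochastic(state, rng=random):
    """Sample an action according to the probabilities pi(a | s)."""
    probs = stochastic_policy[state]
    actions = list(probs)
    weights = [probs[a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]
```

Calling act_deterministic("s0") always gives the same action, while repeated calls to act_stochastic("s1") can return either action.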
On-policy methods attempt to evaluate or improve the policy that is used to make decisions. In contrast, off-policy methods evaluate or improve a policy different from that used to generate the data.
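One common way to illustrate this difference is to compare the update targets of SARSA (on-policy) and Q-learning (off-policy). The sketch below assumes a tabular action-value estimate Q stored as nested dicts; the function names and numbers are hypothetical.

```python
def sarsa_target(Q, reward, next_state, next_action, gamma):
    """On-policy target: uses the action the agent actually takes next."""
    return reward + gamma * Q[next_state][next_action]

def q_learning_target(Q, reward, next_state, gamma):
    """Off-policy target: uses the greedy action, whatever the agent does next."""
    return reward + gamma * max(Q[next_state].values())

# Made-up action-value estimates for a single state.
Q = {"s1": {"left": 1.0, "right": 3.0}}

# Suppose the exploring agent actually picked "left" in s1.
on_policy = sarsa_target(Q, reward=0.0, next_state="s1", next_action="left", gamma=0.9)
off_policy = q_learning_target(Q, reward=0.0, next_state="s1", gamma=0.9)
```

Here the on-policy target follows the exploratory choice ("left"), while the off-policy target evaluates the greedy policy ("right") even though the data came from a different behaviour.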
The definition is correct, though not instantly obvious if you see it for the first time. Let me put it this way: a policy is an agent's strategy.
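For example, imagine a world where a robot moves across a room and the task is to reach the target point (x, y), where it gets a reward. Here, the room is the environment and the robot's current position is the state.
A policy is what the agent does to accomplish this task. One robot might wander randomly until it happens to reach the target, while another might plan a route and head straight for it.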
Obviously, some policies are better than others, and there are several ways to assess them, namely the state-value function and the action-value function. The goal of RL is to learn the best policy. Now the definition should make more sense (note that in this context "time" is better understood as "state"):
A policy defines the learning agent's way of behaving at a given time.
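As a rough illustration of how two policies for the robot example can be compared, the sketch below estimates the value of the start state by averaging discounted returns over simulated episodes. The 5x5 room, the start cell, the discount of 0.9, and both policies are assumptions made for this example only.

```python
import random

SIZE = 5            # hypothetical 5x5 room
TARGET = (4, 4)     # target point (x, y)
MOVES = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}

def step(state, action):
    """Move one cell, staying inside the room; reward 1 on reaching the target."""
    dx, dy = MOVES[action]
    x = min(max(state[0] + dx, 0), SIZE - 1)
    y = min(max(state[1] + dy, 0), SIZE - 1)
    next_state = (x, y)
    return next_state, (1.0 if next_state == TARGET else 0.0)

def random_policy(state, rng):
    """Wander: pick any move with equal probability."""
    return rng.choice(list(MOVES))

def direct_policy(state, rng):
    """Head right until x matches the target, then go up."""
    return "right" if state[0] < TARGET[0] else "up"

def discounted_return(policy, rng, gamma=0.9, max_steps=200):
    """Run one episode from (0, 0) and return its discounted reward sum."""
    state, total, discount = (0, 0), 0.0, 1.0
    for _ in range(max_steps):
        state, reward = step(state, policy(state, rng))
        total += discount * reward
        discount *= gamma
        if state == TARGET:
            break
    return total

def average_return(policy, episodes=500, seed=0):
    """Monte Carlo estimate of the start state's value under `policy`."""
    rng = random.Random(seed)
    return sum(discounted_return(policy, rng) for _ in range(episodes)) / episodes
```

Comparing average_return(random_policy) with average_return(direct_policy) gives a simple estimate of which policy is better from the start state: the direct policy reaches the target in the fewest possible steps, so its reward is discounted the least.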
More formally, we should first define a Markov Decision Process (MDP) as a tuple (S, A, P, R, γ), where:
- S is a finite set of states
- A is a finite set of actions
- P is a state transition probability matrix (the probability of ending up in each state, for each current state and each action)
- R is a reward function, given a state and an action
- γ is a discount factor, between 0 and 1
Then, a policy π is a probability distribution over actions given states. That is, the likelihood of every action when an agent is in a particular state (of course, I'm skipping a lot of details here). This definition corresponds to the second part of your definition.
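The pieces of the tuple and the policy can be written down directly as data. The following is a minimal sketch with a made-up two-state MDP; every number in it is an assumption for illustration. It also evaluates the policy with iterative policy evaluation, which is one standard way to compute its state-value function.

```python
# Hypothetical two-state MDP (S, A, P, R, gamma); all numbers are made up.
S = ["cold", "warm"]
A = ["wait", "heat"]

# P[s][a][s2] = probability of landing in s2 after taking action a in state s
P = {
    "cold": {"wait": {"cold": 1.0, "warm": 0.0}, "heat": {"cold": 0.2, "warm": 0.8}},
    "warm": {"wait": {"cold": 0.5, "warm": 0.5}, "heat": {"cold": 0.0, "warm": 1.0}},
}

# R[s][a] = expected immediate reward for taking action a in state s
R = {
    "cold": {"wait": 0.0, "heat": -1.0},
    "warm": {"wait": 1.0, "heat": 0.0},
}

gamma = 0.9  # discount factor

# pi[s][a] = probability of choosing action a in state s
pi = {
    "cold": {"wait": 0.1, "heat": 0.9},
    "warm": {"wait": 0.7, "heat": 0.3},
}

def evaluate_policy(pi, sweeps=500):
    """Iterative policy evaluation using the Bellman expectation equation."""
    V = {s: 0.0 for s in S}
    for _ in range(sweeps):
        # Build the new estimates from the previous sweep's values.
        V = {
            s: sum(
                pi[s][a] * (R[s][a] + gamma * sum(P[s][a][s2] * V[s2] for s2 in S))
                for a in A
            )
            for s in S
        }
    return V
```

evaluate_policy(pi) returns a dict with one value per state. Changing the probabilities in pi and re-running it shows how the same MDP yields different values under different policies.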
I highly recommend David Silver's RL course available on YouTube. The first two lectures focus particularly on MDPs and policies.