Dr. Mark Humphrys

School of Computing. Dublin City University.


RL - The task

Deterministic v. Stochastic Policy

The agent acts according to a policy π which tells it what action to take in each state x. A policy that specifies a unique action to be performed, π(x) = a, is called a deterministic policy - as opposed to a stochastic policy, where an action a is chosen from a distribution, with probability π(x,a).

a       a1   a2    a3
Q(x,a)  0    5.1   5
Deterministic policy - When in x, always take action a2.

Stochastic policy - When in x, 80% of the time take a2, 20% of the time take a3.

Typically - Start stochastic. Become more deterministic as we learn.
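A minimal sketch of the two kinds of policy for the Q-values in the table above (the function names and the fixed 80/20 distribution are illustrative, not from the notes):

```python
import random

# Q-values for state x, from the table above.
Q = {"a1": 0.0, "a2": 5.1, "a3": 5.0}

def deterministic_policy(Q):
    # Always pick the action with the highest Q-value (a2 here).
    return max(Q, key=Q.get)

def stochastic_policy(Q):
    # Illustrative fixed distribution: 80% a2, 20% a3.
    return random.choices(["a2", "a3"], weights=[0.8, 0.2])[0]
```

A learning agent would typically start with something like `stochastic_policy` (to explore) and drift towards `deterministic_policy` (to exploit what it has learned).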

A policy that has no concept of time is called a stationary or memory-less policy - as opposed to a non-stationary policy (e.g. "do actions a and b alternately"). A non-stationary policy requires the agent to possess memory. Note that stochastic does not imply non-stationary.

Following a stationary deterministic policy π, at time t, the agent observes state x_t, takes action a_t = π(x_t), observes new state x_{t+1}, and receives reward r_t with expected value:

E(r_t) = Σ_r r P_{x_t a_t}(r)

The task - Go for short-term gain or long-term gain?

Do we go for short-term gain or long-term gain?
Can we take Short-term loss for Long-term gain?

Instead of deciding here for all problems, we make this a parameter that we can adjust.

Not just maximise immediate E(r) (i.e. simply deal with probability).

Could maximise long-term sum of expected rewards. This may involve taking a short-term loss for long-term gain (equivalent to allowing a move downhill in hill-climbing).

Discounted Reward - Using the discounting factor γ (gamma)

The task is for the agent to optimise not necessarily immediate rewards, but the total discounted reward.
In this measure, rewards received n steps into the future are worth less than rewards received now, by a factor of γ^n, where 0 ≤ γ < 1:

R = r_t + γ r_{t+1} + γ^2 r_{t+2} + ... = Σ_{n=0}^∞ γ^n r_{t+n}

The discounting factor γ defines how much expected future rewards affect decisions now.

0 ≤ γ < 1

(e.g. γ = 0.9)

If rewards can be positive or negative, R may be a sum of positive and negative terms.

Special case: γ = 0
Doesn't work: γ = 1 (see later)

The interesting situation is in the middle (0 < γ < 1)

Genuine immediate reinforcement learning is the special case γ = 0, where we only try to maximize immediate reward. Low γ means pay little attention to the future. High γ means that potential future rewards have a major influence on decisions now - we may be willing to trade short-term loss for long-term gain. Note that for γ < 1, the value of future rewards always eventually becomes negligible.
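A quick illustration of how γ changes what the agent optimises (the reward sequence is made up):

```python
def discounted_return(rewards, gamma):
    # R = r_0 + γ·r_1 + γ²·r_2 + ...
    return sum((gamma ** n) * r for n, r in enumerate(rewards))

# γ = 0: only the immediate reward counts.
print(discounted_return([1, 100, 100], 0.0))   # 1.0
# γ = 0.9: the later rewards still carry most of the weight (≈ 172).
print(discounted_return([1, 100, 100], 0.9))
```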

Short-term loss for Long-term gain

r in the future can even override immediate punishment

If we decide what to do based on which path has highest R, then [punishment now, r in the future] can beat [reward now, but not in future] if:

  1. r is big enough
  2. r is near enough - not too heavily discounted by γ^n
  3. both of the above
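The same comparison in code (the two reward streams are made up for illustration):

```python
def discounted_return(rewards, gamma):
    # R = r_0 + γ·r_1 + γ²·r_2 + ...
    return sum((gamma ** n) * r for n, r in enumerate(rewards))

loss_then_gain = [-5, 0, 10]   # punishment now, big r in the future
gain_only      = [2, 0, 0]     # small reward now, nothing after

# High γ (0.9): the future r wins despite the punishment:
#   -5 + 0 + 0.81·10 = 3.1  >  2
# Low γ (0.1): the future r is discounted away:
#   -5 + 0 + 0.01·10 = -4.9  <  2
```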


If the maximum reward on any one step is r_max, what is the maximum possible value of R?
The most it could be would be to get the maximum reward every step:

R ≤ r_max ( 1 + γ + γ^2 + γ^3 + ... )

What is this quantity on the right? Let:

X = 1 + γ + γ^2 + γ^3 + ...

Multiply by γ:

γX = γ + γ^2 + γ^3 + ...

Subtract the 2 equations:

X - γX = 1

And we get a value for X:

X = 1 / (1 - γ)

That is:

1 + γ + γ^2 + γ^3 + ... = 1 / (1 - γ)

This is just the Negative Binomial Series in the Binomial Theorem.
(x+1)^(-1) = 1 - x + x^2 - x^3 + x^4 - ...
Let x = -γ to get 1/(1-γ) = 1 + γ + γ^2 + γ^3 + ...
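A quick numerical check of the closed form:

```python
gamma = 0.9
# Partial sum of 1 + γ + γ² + ... over many terms.
X = sum(gamma ** n for n in range(1000))
# It matches the closed form 1/(1-γ) = 10 to within floating-point error.
print(X, 1 / (1 - gamma))
```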

Exercise - long-term reward

The task - Find the optimal policy

The expected total discounted reward if we follow policy π forever, starting from state x_t, is:

V^π(x_t) = E( r_t + γ r_{t+1} + γ^2 r_{t+2} + ... ) = E( Σ_{n=0}^∞ γ^n r_{t+n} )

V^π(x) is called the value of state x under policy π. The problem for the agent is to find an optimal policy - one that maximizes the total discounted expected reward (there may be more than one policy that does this).

It is known (not shown here) that in an MDP, there is a stationary and deterministic optimal policy. Briefly, in an MDP one does not need memory to behave optimally, essentially because the current state contains all the information about what is going to happen next. A stationary deterministic optimal policy π* satisfies, for all x:

V^{π*}(x) = max_π V^π(x)

For any state x, there is a unique value V*(x), which is the best that an agent can do from x. Optimal policies π* may be non-unique, but the value V*(x) is unique. All optimal policies π* will have:

V^{π*}(x) = V*(x)

Dynamic Programming (DP)

If the transition probabilities P_xa(y), P_xa(r) are explicitly known, Dynamic Programming finds an optimal policy by starting with random V(x) and random Q(x,a) and repeating forever (or at least until the policy is considered good enough):

Q(x,a) := Σ_r r P_xa(r) + γ Σ_y P_xa(y) V(y)
V(x) := max_a Q(x,a)
π(x) := argmax_a Q(x,a)

This systematic iterative DP process is obviously an off-line process. The agent must cease interacting with the world while it runs through this loop until a satisfactory policy is found.

Note that initially-random V and Q are on RHS of equation, so updates must be repeated to factor in more accurate values later.

The exception is if γ = 0 (only look at immediate reward). Then you only need one iteration of updates and you are done.
You have worked out immediate E(r) completely.
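The DP loop above can be sketched as follows on a toy MDP. The two states, two actions, and all the numbers here are made up for illustration, and expected rewards are given directly as R(x,a) = Σ_r r P_xa(r):

```python
# P[x][a][y] = probability of moving x -> y under action a (assumed known).
P = {
    0: {"stay": {0: 1.0, 1: 0.0}, "go": {0: 0.2, 1: 0.8}},
    1: {"stay": {0: 0.0, 1: 1.0}, "go": {0: 0.8, 1: 0.2}},
}
# R[x][a] = expected immediate reward E(r) for taking a in x.
R = {
    0: {"stay": 0.0, "go": -1.0},
    1: {"stay": 2.0, "go": 0.0},
}
gamma = 0.9
V = {0: 0.0, 1: 0.0}           # start with arbitrary values

for _ in range(1000):          # repeat the DP updates until they settle
    Q = {x: {a: R[x][a] + gamma * sum(P[x][a][y] * V[y] for y in V)
             for a in P[x]} for x in V}
    V = {x: max(Q[x].values()) for x in V}

policy = {x: max(Q[x], key=Q[x].get) for x in V}
print(policy)   # in state 0 take the short-term loss ("go") to reach state 1
```

Note state 0 illustrates the earlier point: the optimal policy accepts the immediate punishment of "go" because the discounted future rewards in state 1 outweigh it.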

Trying out states and actions haphazardly

The problem for the Q-learning agent is to find an optimal policy when P_xa(y), P_xa(r) are initially unknown (i.e. we are assuming, when modelling the world, that some such transition probabilities exist). The agent must interact with the world to learn these probabilities. Hence we cannot try out states and actions systematically as in Dynamic Programming. We may not know how to return to x to try out a different action (x,b) - we just have to wait until x happens to present itself again. We will experience states and actions rather haphazardly.

Fortunately, we can still learn from this. In the DP process above, the updates of V(x) can in fact occur in any order, provided that each x is visited infinitely often over an infinite run. So we can interleave updates with acting in the world, with a modest amount of computation per timestep. In Q-learning we cannot update directly from the transition probabilities - we can only update from individual experiences.
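An update from a single experience (x, a, r, y) can be sketched like this. The learning rate α is an assumption here - it is not introduced in this section:

```python
def q_update(Q, x, a, r, y, alpha=0.1, gamma=0.9):
    # Learn from one experience (x, a, r, y): nudge Q(x,a) towards
    # the sample estimate r + γ max_b Q(y,b). No P_xa(y) needed.
    Q[x][a] += alpha * (r + gamma * max(Q[y].values()) - Q[x][a])

Q = {"x": {"a": 0.0, "b": 0.0}, "y": {"a": 0.0, "b": 0.0}}
q_update(Q, "x", "a", 1.0, "y")   # Q["x"]["a"] moves from 0 towards 1
```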

How do I return to state x?

In the real world, we may not know how to get back to state x to try another action:
x = "I am at party"
a = "I say chat-up line c1"
result - punishment

I might want to say "Go back to x, try c2" but I may not know how to get back to x. All I can do is try out actions in the states in which I happen to find myself. I have to wait until state x happens to present itself again.

DP v. Q-learning

Q-learning is for when you don't know P_xa(y) and P_xa(r).

But one approach is just:
Interact with the world a large number of times to build up a table of estimates for P_xa(y) and P_xa(r).
Then just do offline DP.
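A minimal sketch of building up that table of estimates (the state and action names are made up):

```python
from collections import defaultdict

counts = defaultdict(lambda: defaultdict(int))   # counts[(x, a)][y]

def record(x, a, y):
    # One observed transition: took a in x, ended up in y.
    counts[(x, a)][y] += 1

def estimate_P(x, a, y):
    # Estimate P_xa(y) as the observed frequency of x --a--> y.
    total = sum(counts[(x, a)].values())
    return counts[(x, a)][y] / total if total else 0.0

record("x", "go", "y1"); record("x", "go", "y1"); record("x", "go", "y2")
print(estimate_P("x", "go", "y1"))   # 2 of 3 observations
```

Once the estimates are good enough, the DP loop from earlier can be run offline against them instead of the true probabilities.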
