Dr. Mark Humphrys

School of Computing. Dublin City University.

Home      Blog      Teaching      Research      Contact

Search:

CA216      CA249      CA318

CA400      CA651      CA668


Exercise - long-term reward


What is wrong with γ = 1?

i.e. Just sum all future rewards.
Consider this situation:


State-space is not geographic-space

Note state-space is not geographic-space.
Transitions could be one-way, not 2-way. No easy way back to the state you left forever.
e.g. In chess, state = your queen is captured.

or e.g. You are Hitler. It is 1941. You have conquered Europe and are unthreatened.
So you decide to invade the Soviet Union and declare war on the US (neither of whom were at war with you so far).
There is no way back ever to that state before invasion. You have gone through a one-way door.



r=0, γ = anything

If take action a, expected long-term reward:
Ea(R) = 5 + γ 0 + γ2 0 + γ3 0 + ..
    = 5

Eb(R) = 0 + γ 100 + γ2 0 + γ3 0 + ..
    = 100 γ

  1. If γ = 0, it picks a.

  2. If γ = 1, it takes action b.

  3. Exercise - What is the smallest γ for which it picks b?



r=1, γ = anything

Imagine if the infinite loops scored 1 each time round instead of 0.

Ea(R) = 5 + γ + γ2 + γ3 + ..

Eb(R) = 0 + γ 100 + γ2 + γ3 + ..

  1. If γ = 0, it still picks a.

  2. If γ = 1
    Ea(R) = 5 + 1 + 1 + ..
    Eb(R) = 0 + 100 + 1 + 1 + ...
    Both infinite, can't tell the difference, will pick either one. Not necessarily b.

    Notes on infinity:
    If γ = 1, then even if one infinite loop gives 1 forever, and the other gives 100 forever, it can't tell the difference. Both infinite. In fact, all the following are the same:

    1 + 1 + 1 + 1 + ...
    -1000 -1000 -1000 -1000 -1000 + 1 + 1 + 1 + ...
    10 + 10 + 10 + 10 + ...
    For any γ < 1 we do not have this problem.

  3. Exercise - If γ < 1, for what γ will it pick b?



r=anything, γ = anything

If the infinite loops give reward r each time round:

Ea(R) = 5 + γ r + γ2 r + γ3 r + ..
    = 5 + γ r + r (γ2 + γ3 + .. )

Eb(R) = 0 + γ 100 + γ2 r + γ3 r + ..
    = 100 γ + r (γ2 + γ3 + .. )



Feeds      w2mind.org

On Internet since 1987.