Dr. Mark Humphrys School of Computing. Dublin City University. My big idea: Ancient Brain Search:

# Q-learning

Recall we are trying to maximise long-term discounted reward.
Taking action a in state x yields long-term discounted reward:

```

```

# Q(x,a) values

The Q-learning agent builds up Quality-values (Q-values) Q(x,a) for each pair (x,a).

We tend to take the action with the highest Q-value:

maxb Q(x,b)

The Q-value will express the discounted reward if you take that action in that state:

Can be built up recursively:

```

```

# Naive attempt at a learning algorithm

You are in state x, take action a, get reward r, new state y. Then update:

 Q(x,a) := r + γ maxb Q(y,b)

Recall Assignment Notation

```
```
Problem: We allow a probabilistic world (MDP), where taking the same action a in same state x can have different results.
r may vary. y may vary.
Want to average these results instead of Q-value just expressing the most recent one.

With the above rule, the Q-value will bounce back and forth:

Q(x,a) := 5
Q(x,a) := 10
Q(x,a) := 6
...

More sensible to build up an average:

Q(x,a) := 5
Q(x,a) := 1/2 (5 + 10)
Q(x,a) := 1/3 (5 + 10 + 6)
...

How do we do this? Do we need an ever-growing memory of all past outcomes?

```
```

```
```

```
```

```
```

# Core Q-learning algorithm

 In 1-step Q-learning, after each experience, we observe state y, receive reward r, and update: (remember notation). The Q-value is updated in the direction of the current sample. Note that we are updating from the current estimate of Q(y,b) - this too is constantly changing. If we store each Q(x,a) explicitly in lookup tables, we update:

```
```

# Q-values are bounded

Since the rewards are bounded, it follows that the Q-values are bounded.

Proof: In the discrete case, Q is updated by:

so by Theorem B.1:

This can also be viewed in terms of temporal discounting:

Similarly:

For example, if , then . And (assuming ) as , .

Note that since , it follows that .

```
```

# Qmax and Qmin

i.e. If take this action, you will get immediate reward rmax
and you can get next reward rmax (it is the best you can get in next state, other actions may be worse)
and you can get next reward rmax ..
....

i.e. If take this action, you will get immediate reward rmin
and you have to get next reward rmin (all actions in next state lead to the worst reward rmin, otherwise the Q-value would be higher)
and you have to get next reward rmin ..
....

i.e. You are trapped in an attractor from which you cannot escape.
You cannot ever again get out and go to a state with rewards greater than rmin (and such states must exist).
Not necessarily a single absorbing state, but a family of states outside of which you can never leave.

```
```

```
```
Note if expected reward E(r) = rmin, then you will definitely get rmin on this action (r deterministic, though y can still be probabilistic).
If any reward r   >   rmin has non-zero probability, then E(r)   >   rmin

E(r) = p rmin + (1-p) r
> p rmin + (1-p) rmin
= rmin
```

```
```
```

Feeds      w2mind.org      ancientbrain.com

On the Internet since 1987.