We tend to take the action with the highest Q-value:
max_{b} Q(x,b)
The Q-value expresses the discounted reward if you take that action in that state.
It can be built up recursively:
Q(x,a) := r + γ max_{b} Q(y,b)
Recall Assignment Notation
Problem: We allow a probabilistic world (MDP), where taking the same action a in the same state x can have different results.
With the above rule, the Q-value will bounce back and forth:
Q(x,a) := 5
Q(x,a) := 10
Q(x,a) := 6
...
More sensible to build up an average:
Q(x,a) := 5
Q(x,a) := 1/2 (5 + 10)
Q(x,a) := 1/3 (5 + 10 + 6)
...
How do we do this? Do we need an ever-growing memory of all past outcomes?
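No: the average can be maintained incrementally, keeping only the current mean and a count. A minimal sketch in Python (the function name is illustrative):

```python
# Running average: mean_n = mean_{n-1} + (1/n) * (x_n - mean_{n-1}).
# Only the current mean and a sample count are stored, no history.
def running_average(samples):
    mean = 0.0
    for n, x in enumerate(samples, start=1):
        mean += (x - mean) / n
    return mean

print(running_average([5, 10, 6]))  # 7.0, same as (5 + 10 + 6) / 3
```

Each step nudges the mean toward the newest sample by a shrinking fraction 1/n, which is exactly the sequence of averages written above.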
In 1-step Q-learning, after each experience, we observe state y, receive reward r, and update:
Q(x,a) := (1-α) Q(x,a) + α ( r + γ max_{b} Q(y,b) )
(remember := notation). The Q-value is updated in the direction of the current sample. Note that we are updating from the current estimate of Q(y,b) - this too is constantly changing. If we store each Q(x,a) explicitly in lookup tables, we update with a learning rate that decays with the number of visits, which builds up exactly the running average above:
Q(x,a) := (1-α) Q(x,a) + α ( r + γ max_{b} Q(y,b) ),  where α = 1/(1 + visits(x,a))
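The lookup-table update can be sketched as follows (assuming, as one common choice, the decaying learning rate α = 1/(1 + visits(x,a)), so the update reproduces the running average exactly; names are illustrative):

```python
from collections import defaultdict

gamma = 0.9
Q = defaultdict(float)        # Q[(state, action)], defaults to 0
visits = defaultdict(int)     # visit counts per (state, action)

def update(x, a, r, y, actions_in_y):
    """One-step Q-learning update after experiencing (x, a) -> reward r, state y."""
    alpha = 1.0 / (1 + visits[(x, a)])   # decaying learning rate
    target = r + gamma * max(Q[(y, b)] for b in actions_in_y)
    Q[(x, a)] += alpha * (target - Q[(x, a)])
    visits[(x, a)] += 1

# The bouncing samples 5, 10, 6 now blend into an average
# (here y is a terminal state whose Q-values stay 0):
for r in [5, 10, 6]:
    update("x", "a", r, "terminal", ["noop"])
print(Q[("x", "a")])  # 7.0 = (5 + 10 + 6) / 3
```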

Since the rewards are bounded, it follows that the Q-values are bounded.
Proof: In the discrete case, Q is updated by:
Q := (1-α) Q + α ( r + γ max_{b} Q(y,b) )
so by Theorem B.1:
r_{min} / (1-γ)  ≤  Q(x,a)  ≤  r_{max} / (1-γ)
This can also be viewed in terms of temporal discounting:
Q_{max} = r_{max} + γ r_{max} + γ^2 r_{max} + ... = r_{max} / (1-γ)
Similarly:
Q_{min} = r_{min} / (1-γ)
For example, if γ = 0.9, then Q_{max} = 10 r_{max}. And (assuming r_{max} > 0) as γ → 1, Q_{max} → ∞.
Note that since r_{min} ≤ r ≤ r_{max}, it follows that Q_{min} ≤ Q(x,a) ≤ Q_{max}.
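The bound Q_{max} = r_{max} + γ r_{max} + γ^2 r_{max} + ... = r_{max} / (1-γ) is easy to check numerically by truncating the geometric series (values here are illustrative):

```python
# Compare the truncated discounted sum against the closed form r_max / (1 - gamma).
gamma, r_max = 0.9, 1.0
partial = sum(r_max * gamma**k for k in range(1000))  # terms beyond k=1000 are negligible
closed_form = r_max / (1 - gamma)
print(partial, closed_form)  # both ~10.0
```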
i.e. If you take this action, you will get immediate reward r_{max},
and you can get next reward r_{max}
(it is the best you can get in the next state; other actions may be worse),
and you can get next reward r_{max},
and so on.
i.e. If you take this action,
you will get immediate reward r_{min},
and you have to get next reward r_{min}
(all actions in the next state lead to
the worst reward r_{min},
otherwise the Q-value would be higher),
and you have to get next reward r_{min},
and so on.
i.e. You are trapped in an attractor
from which you cannot escape.
You cannot ever again get out and go to a state
with rewards greater than r_{min} (and such states must exist).
Not necessarily a single absorbing state,
but a family of states that you can never leave.
Note that if the expected reward E(r) = r_{min}, then (since r ≥ r_{min} always) you will definitely get r_{min} on this action (r is deterministic, though y can still be probabilistic).
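The attractor can be seen in a hypothetical one-state sketch: a "trap" where every action yields r_{min} and leads back to the trap. Iterating the recursion Q := r + γ max_{b} Q(y,b) then converges to r_{min} / (1-γ) (values here are illustrative):

```python
# Hypothetical trap state: every action gives r_min and returns to the trap,
# so the fixed point of Q = r_min + gamma * Q is Q = r_min / (1 - gamma).
gamma, r_min = 0.5, -1.0
Q = 0.0
for _ in range(100):          # iterate the recursion to its fixed point
    Q = r_min + gamma * Q
print(Q)  # -2.0 = r_min / (1 - gamma)
```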