In the discrete case, we store each Q(x,a) explicitly, and update:

Q(x,a) := (1 - α) Q(x,a) + α (r + γ max_{b} Q(y,b))

for some learning rate α, which controls how much weight we give to the reward just experienced, as opposed to the old Q-estimate. We typically start with α = 1, i.e. give full weight to the new experience. As α decreases, the Q-value is building up an average of all experiences, and the odd new unusual experience won't disturb the established Q-value much. As time goes to infinity, α -> 0, which would mean no learning at all, with the Q-value fixed.
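In code, the update can be sketched roughly as follows (a minimal sketch; `Q` as a dictionary over (state, action) pairs, and the variable names are assumptions, not from the original):

```python
from collections import defaultdict

gamma = 0.9             # discount factor (illustrative value)
alpha = 1.0             # learning rate: start at 1, full weight to new experience
Q = defaultdict(float)  # Q[(x, a)], initially 0 for every pair

def update(Q, x, a, r, y, actions, alpha, gamma):
    """One Q-learning update after taking a in x, receiving r, landing in y."""
    target = r + gamma * max(Q[(y, b)] for b in actions)
    Q[(x, a)] = (1 - alpha) * Q[(x, a)] + alpha * target

update(Q, "x", "a", 1.0, "y", ["a", "b"], alpha, gamma)
```

With α = 1 the old estimate is discarded entirely; as α shrinks, the update behaves more like a running average over all experiences.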
Remember that because in the short term we have inaccurate Q(y,b) estimates, and because (more importantly) in the long term we have probabilistic transitions anyway, the return from the same (x,a) will vary.
We use a different α for each pair (x,a), depending on how often we have visited state x and tried action a there. When we are in a new area of state-space (i.e. we have poor knowledge), the learning rate is high. We will have taken very few samples, so the average of them will be highly influenced by the current sample. When we are in a well-visited area of state-space, the learning rate is low. We have taken lots of samples, so the single current sample can't make much difference to the average of all of them (the Q-value).
[Watkins and Dayan, 1992] proved that the discrete case of Q-learning will converge to an optimal policy under the following conditions.
The learning rate α, where 0 ≤ α < 1, should take decreasing (with each update) successive values α_{1}, α_{2}, α_{3}, ..., such that Σ α_{i} = ∞ and Σ α_{i}^{2} < ∞.
The typical scheme, where n is the number of times Q(x,a) has been visited, is:

α_{n}(x,a) = 1/n
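A per-pair learning rate can be kept with a visit counter, roughly like this (a sketch; the names are assumptions):

```python
from collections import defaultdict

n = defaultdict(int)  # n[(x, a)] = number of updates of Q(x, a) so far

def alpha(x, a):
    """Learning rate 1/n for the n-th update of Q(x, a): 1 on the first visit."""
    n[(x, a)] += 1
    return 1.0 / n[(x, a)]
```

The first visit to a pair returns 1 (wiping out the initial Q-value); later visits return 1/2, 1/3, ..., so well-visited pairs barely move.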
If each pair (x,a) is visited an infinite number of times, then Watkins and Dayan showed that for lookup tables Q-learning converges to a unique set of values which define a stationary deterministic optimal policy.
Notes:
Note that the harmonic series 1 + 1/2 + 1/3 + ... is an infinite sum, the series 1 + (1/2)^{2} + (1/3)^{2} + ... is a finite sum, and the geometric series 1 + r + r^{2} + r^{3} + ... with 0 < r < 1 is a finite sum. Proof.
Consider: α_{n} = 1, (1/2)^{p}, (1/3)^{p}, ...
This will satisfy the conditions for any 1/2 < p ≤ 1.
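The two conditions can be checked numerically for α_{n} = 1/n (the p = 1 case): partial sums of 1/n keep growing without bound, while partial sums of (1/n)^{2} level off near π^{2}/6 ≈ 1.645 (a quick illustrative check, not part of the proof):

```python
import math

# Partial sums: sum of alpha_n grows without bound (diverges), while the
# sum of alpha_n^2 levels off near pi^2/6 (converges) -- the two conditions.
for N in (10, 1000, 100000):
    s1 = sum(1.0 / n for n in range(1, N + 1))       # partial sum of alpha_n
    s2 = sum(1.0 / n ** 2 for n in range(1, N + 1))  # partial sum of alpha_n^2
    print(N, round(s1, 3), round(s2, 5))
```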
Picture a map of α over the (x,a) space. At the start, the whole map is 1. After a time, unvisited (x,a) pairs still have α(x,a) = 1, while visited ones are down to 1/n (which means visited how many times?). Unvisited states x have α(x,b) = 1 for all b.
After convergence, the agent will then maximize its total discounted expected reward if it always takes the action with the highest Q-value. That is, the optimal policy is to take in each state x the action a^{*} where:

Q^{*}(x,a^{*}) = max_{b} Q^{*}(x,b)

Then: a^{*} may be non-unique, but the value Q^{*}(x,a^{*}) is unique.
Initial Q-values can be random. The visit count n starts at 0, so the first update of Q(x,a) uses α = 1, wiping out the initial Q-value. Although the γ max_{b} Q(y,b) term may factor in an inaccurate initial Q-value from elsewhere, these Q-values will themselves be wiped out by their own first update, and poor initial samples have no effect if in the long run samples are accurate (Theorem A.2 later).
Each Q-value Q(x,a) converges to Q^{*}(x,a)
where the optimum action to take in x is the action a^{*} s.t.
Q^{*}(x,a^{*}) = max_{b} Q^{*}(x,b)
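Extracting the greedy action from a learned Q-table is just an argmax per state; a sketch (the table values here are made up):

```python
def greedy_action(Q, x, actions):
    """The action with the highest Q-value in state x (ties: first listed wins)."""
    return max(actions, key=lambda b: Q[(x, b)])

# Hypothetical Q-table for one state with two actions:
Q = {("x", "left"): 0.2, ("x", "right"): 0.7}
best = greedy_action(Q, "x", ["left", "right"])   # "right"
```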
But what normally happens is:
Long before the Q-values finalise,
we will have fixated on the best action.
Often surprisingly early,
we find that:
max_{b} Q(x,b) = Q(x,a^{*})
(note Q, not Q^{*})
- that is, the leader now will still be the leader when the Q-values converge
- the leader now is actually a^{*}, though we don't know it yet.
In practice, the competition rapidly settles down into a contest between 1 or 2 actions.
We usually find that a greedy policy (execute the action with the highest Q-value) is an optimal policy long before the Q-values have converged to their final values.
"Infinite no. of samples" seems like a very weak convergence result. But it is only demanded because in a probabilistic world there is always a non-zero probability of any pattern of unlucky samples - e.g. Could get 1000 heads in a row, even though the coin is not only not biased towards heads, but in fact is biased towards tails. In reality, we would be very unlucky not to be operating according to an optimal policy quite quickly.
Q. If the coin is biased 75-25 towards tails, what is the prob. of 1000 heads in a row?
Q. If you did one 1000-toss experiment a day, how long would you wait to see a single 1000-head event?
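A back-of-envelope answer to the first question, worked in log10 because the probability itself underflows double precision (0.25**1000 evaluates to 0.0):

```python
import math

# log10 of P(1000 heads in a row) with P(heads) = 0.25:
log10_p = 1000 * math.log10(0.25)   # about -602.06, i.e. p ~ 10**-602
# Expected number of one-per-day 1000-toss experiments before one success
# is 1/p, i.e. about 10**602 days:
log10_wait_days = -log10_p
print(round(log10_p, 2), round(log10_wait_days, 2))
```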
Another way of looking at it:
As we learn, we have a policy. Eventually this policy is optimal. We can definitely say:
As time goes to infinity, p(policy is optimal) -> 1.
In practice, need to be very unlucky not to have optimal policy after short finite time.
e.g. After short finite time, p(policy is optimal) = 0.99999.
Consider:
Q(y,b) = 0 + γ E(V(next state))
V(z) is very high, but y,b only leads there with prob. 0.01.
V is very low for the states that y,b leads to with combined prob. 0.99.
(0.99) low + (0.01) high
can still be high if "high" is high enough.
Hence Q(y,b) can be high if the r for c is high enough.
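With made-up numbers for illustration (low value 1, high value 1000, and the probabilities above):

```python
gamma = 0.9  # discount factor (illustrative)

# (0.99) low + (0.01) high: the rare high-value state dominates if high enough.
expected_V = 0.99 * 1 + 0.01 * 1000   # = 10.99
Q_yb = 0 + gamma * expected_V          # immediate reward is 0, as above
print(Q_yb)
```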
Optimal policy is to take actions a in x, b in y, c in z.
But will take a long time to learn this.
Because it's hard to get to y at all.
And even harder to get to z at all.
And you have to get to both many times in order to find b and c.
In finite time: Will learn sub-optimal policy of maximising the other rewards.
Will think we've done well.
Never find out about higher rewards available at z.
Given that a real-world experiment will end (has to end) after n timesteps, for some finite n, it is easy to design a MDP such that it is highly unlikely that you will learn an optimal policy in n steps. Can do this no matter what n is, so long as it is finite.
How do you know your problem world is not like this? You don't. You just have to hope your problem world is not like this. You just have to hope your n is big enough.
Q-learning is attractive because of the simplicity of the computational demands per timestep, and because of this proof of convergence to a global optimum, avoiding all local optima.
Maximising r's may not solve the problem:
An undergrad student of mine, Eoin Murphy, used RL to play Super Mario in 2016. His agent maximises r's by jumping up and down. (Starts at 1:08.)
RL - Have to actually be in state x
and have to actually execute action a to find out what happens.
Problem - Can't systematically say "for all x,a, go to x, take a"
because don't know how to get to x.
That's the whole problem.
If we knew how to get to any x, no need to learn anything!
Just say "start chess game. go to checkmate" and you're done!
DP doesn't know how to get to x either, but it doesn't have to actually get there. RL does have to actually get to x.
If we don't know P_{xa}(y), then Q-learning can repeatedly sample the world to construct an optimal policy, if it is a Markov world.
But of course if we don't know P_{xa}(y), how do we know it's a Markov world!
If we know it's an MDP, we don't need Q-learning!
If we don't know it's an MDP, we don't know if Q-learning works!
In practice, no one ever applies Q-learning to an MDP!
Q-learning is only ever applied to unknown worlds, that we hope can be approximated by an MDP (i.e. applying Q-learning will produce an alright policy).