School of Computing, Dublin City University.


In the discrete case, we store each Q(x,a) explicitly, and update:

Q(x,a) := (1-α) Q(x,a) + α ( r + γ max_{b} Q(y,b) )

for some learning rate α which controls how much weight we give to the reward just experienced, as opposed to the old Q-estimate. We typically start with α = 1, i.e. give full weight to the new experience. As α decreases, the Q-value is building up an average of all experiences, and the odd new unusual experience won't disturb the established Q-value much. As time goes to infinity, α → 0, which would mean no learning at all, with the Q-value fixed.
Remember that because in the short term we have inaccurate Q(y,b) estimates, and because (more importantly) in the long term we have probabilistic transitions anyway, the return from the same (x,a) will vary.
We use a different α(x,a) for each pair (x,a), depending on how often we have visited state x and tried action a there. When we are in a new area of state-space (i.e. we have poor knowledge), the learning rate is high. We will have taken very few samples, so the average of them will be highly influenced by the current sample. When we are in a well-visited area of state-space, the learning rate is low. We have taken lots of samples, so the single current sample can't make much difference to the average of all of them (the Q-value).
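As a minimal sketch of the above (names like `q_update`, `Q`, `N` are illustrative, not from the notes), we can keep a visit count n(x,a) per pair and use α(x,a) = 1/n, so the learning rate is high in new areas of state-space and low in well-visited ones:

```python
from collections import defaultdict

Q = defaultdict(float)   # Q[(x, a)], initial Q-values default to 0
N = defaultdict(int)     # n(x, a): how many times (x, a) has been updated
GAMMA = 0.9              # discount factor (value assumed for illustration)

def q_update(x, a, r, y, actions):
    """One Q-learning update after taking a in x, receiving r, landing in y."""
    N[(x, a)] += 1
    alpha = 1.0 / N[(x, a)]          # 1 on first visit, decreasing afterwards
    target = r + GAMMA * max(Q[(y, b)] for b in actions)
    Q[(x, a)] = (1 - alpha) * Q[(x, a)] + alpha * target

# First update has alpha = 1: the new experience fully overwrites
# the initial Q-value.
q_update("x", "a", 5.0, "y", actions=["a", "b"])
print(Q[("x", "a")])   # 5.0
```

Note how later samples move the established Q-value less and less: the second update is weighted 1/2, the hundredth only 1/100.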
[Watkins and Dayan, 1992] proved that the discrete case of Q-learning will converge to an optimal policy under the following conditions.
The learning rate α, where 0 ≤ α < 1, should take decreasing (with each update) successive values α_{i}, such that Σ α_{i} = ∞ and Σ α_{i}^{2} < ∞.
The typical scheme is α = 1/n, where n is the number of times Q(x,a) has been visited.
If each pair (x,a) is visited an infinite number of times, then Watkins and Dayan showed that for lookup tables Q-learning converges to a unique set of values which define a stationary deterministic optimal policy.
Notes:
The harmonic series 1 + 1/2 + 1/3 + ... is an infinite sum, and the geometric series 1 + r + r^{2} + r^{3} + ... with 0 < r < 1 is a finite sum. Proof.
Consider: α_{i} = 1, (1/2)^{p}, (1/3)^{p}, ...
This will satisfy the conditions for any 1/2 < p ≤ 1.
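A numeric illustration of why the conditions hold for p = 1 (this is not a proof, just partial sums): Σ α_{i} keeps growing without levelling off, while Σ α_{i}^{2} settles near π²/6 ≈ 1.645.

```python
# Partial sums of alpha_i = (1/i)^p and alpha_i^2 = (1/i)^(2p), here with p = 1.
def partial_sums(p, n):
    s1 = sum((1.0 / i) ** p for i in range(1, n + 1))
    s2 = sum((1.0 / i) ** (2 * p) for i in range(1, n + 1))
    return s1, s2

for n in (10, 1000, 100000):
    s1, s2 = partial_sums(1, n)
    # s1 grows (slowly, like log n); s2 levels off near pi^2/6
    print(n, round(s1, 3), round(s2, 3))
```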
At the start, the whole map of α values is 1. After a time, unvisited (x,a) pairs still have α(x,a) = 1. Visited ones are down to 1/n (which means visited how many times?). Unvisited states x have α(x,b) = 1 for all b.
After convergence, the agent will then maximize its total discounted expected reward if it always takes the action with the highest Q-value. That is, the optimal policy is to take in state x the action a^{*} where:
Q^{*}(x,a^{*}) = max_{b} Q^{*}(x,b)
Then: a^{*} may be non-unique, but the value Q^{*}(x,a^{*}) is unique.
Initial Q-values can be random. α typically starts at 1, wiping out the initial Q-value in the first update. Although the γ max_{b} Q(y,b) term may factor in an inaccurate initial Q-value from elsewhere, these Q-values will themselves be wiped out by their own first update, and poor initial samples have no effect if in the long run samples are accurate (Theorem A.2 later).
Each Q-value Q(x,a) converges to Q^{*}(x,a), where the optimum action to take in x is the action a^{*} s.t. Q^{*}(x,a^{*}) = max_{b} Q^{*}(x,b).
But what normally happens is: long before the Q-values finalise, we will have fixated on the best action. Often surprisingly early, we find that:
max_{b} Q(x,b) = Q(x,a^{*})
(note Q, not Q^{*}). That is, the leader now will still be the leader when the Q-values converge: the leader now is actually a^{*}, though we don't know it yet.
In practice, the competition rapidly settles down into a contest between 1 or 2 actions.
We usually find that a greedy policy (execute the action with the highest Q-value) is an optimal policy long before the Q-values have converged to their final values.
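Reading off the greedy policy from a Q-table mid-learning can be sketched as follows (the table contents are made-up illustrative numbers, not real learned values):

```python
# A partially-learned Q-table: values are still far from Q*, but the
# leader in each state may already be the optimal action a*.
Q = {
    ("x", "a"): 4.1, ("x", "b"): 3.7,
    ("y", "a"): 0.2, ("y", "b"): 1.9,
}

def greedy(Q, x, actions):
    """Return the action with the highest current Q-value in state x."""
    return max(actions, key=lambda a: Q[(x, a)])

print(greedy(Q, "x", ["a", "b"]))   # a
print(greedy(Q, "y", ["a", "b"]))   # b
```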
"Infinite no. of samples" seems like a very weak convergence result. But it is only demanded because in a probabilistic world there is always a non-zero probability of any pattern of unlucky samples. E.g. you could get 1000 heads in a row, even though the coin is not only not biased towards heads, but in fact is biased towards tails. In reality, we would be very unlucky not to be operating according to an optimal policy quite quickly.
Q. If the coin is biased 75-25 towards tails, what is the prob. of 1000 heads in a row?
Q. If you did one 1000-toss experiment a day, how long would you wait to see a single 1000-head event?
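For a sense of scale on the first question, the arithmetic has to be done in logs, since 0.25^{1000} underflows an ordinary float:

```python
import math

# Coin biased 75-25 towards tails, so p(heads) = 0.25 per toss.
# p(1000 heads in a row) = 0.25 ** 1000, which underflows to 0.0
# as a float, so work with log10 instead.
log10_p = 1000 * math.log10(0.25)
print(round(log10_p, 1))   # -602.1, i.e. p is about 10^-602

# One 1000-toss experiment per day: expected wait is about 1/p days,
# i.e. around 10^602 days. (The age of the universe is ~10^13 days.)
```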
Another way of looking at it:
As we learn, we have a policy. Eventually this policy is optimal. We can definitely say:
As time goes to infinity, p(policy is optimal) → 1. In practice, you need to be very unlucky not to have an optimal policy after a short finite time.
e.g. After short finite time, p(policy is optimal) = 0.99999.
Consider:
Q(y,b) = 0 + γ E(V(next state))
V(z) is very high, but (y,b) only leads there with prob. 0.01. V(state) is very low for the states that (y,b) leads to with combined prob. 0.99. But:
(0.99) low + (0.01) high
can still be high if "high" is high enough. Hence Q(y,b) can be high if the r for c is high enough.
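Plugging hypothetical numbers into that expected-value arithmetic (the specific values 1, 1000 and γ = 0.9 are made up for illustration):

```python
# Q(y,b) = 0 + gamma * E(V(next state)), where (y,b) leads to a
# low-value state with prob. 0.99 and a very high-value state (z)
# with prob. 0.01.
gamma = 0.9
v_low, v_high = 1.0, 1000.0
q_yb = gamma * (0.99 * v_low + 0.01 * v_high)
print(round(q_yb, 3))   # 9.891
```

So even though (y,b) almost always leads somewhere bad, the rare jackpot at z makes Q(y,b) nearly ten times the low value.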
The optimal policy is to take actions a in x, b in y, c in z. But it will take a long time to learn this, because it's hard to get to y at all, and even harder to get to z at all. And you have to get to both many times in order to find b and c. In finite time, we will learn the suboptimal policy of maximising the other rewards. We will think we've done well, and never find out about the higher rewards available at z.
Given that a real-world experiment will end (has to end) after n timesteps, for some finite n, it is easy to design an MDP such that it is highly unlikely that you will learn an optimal policy in n steps. We can do this no matter what n is, so long as it is finite.
How do we know your problem world is not like this? We can't know. You just have to hope your problem world is not like this, and that your n is big enough.
Q-learning is attractive because of the simplicity of the computational demands per timestep, and because of this proof of convergence to a global optimum, avoiding all local optima.
Maximising r's may not solve the problem: an undergrad student of mine, Eoin Murphy, used RL to play Super Mario in 2016. His agent maximises r's by jumping up and down. (Starts at 1:08.)
RL: you have to actually be in state x, and have to actually execute action a, to find out what happens.
Problem: you can't systematically say "for all (x,a), go to x, take a", because you don't know how to get to x. That's the whole problem. If we knew how to get to any x, there would be no need to learn anything! Just say "start chess game, go to checkmate" and you're done!
DP doesn't know how to get to x either, but it doesn't have to actually get there. RL does have to actually get to x.
If we don't know P_{xa}(y), then Q-learning can repeatedly sample the world to construct an optimal policy, if it is a Markov world.
But of course if we don't know P_{xa}(y), how do we know it's a Markov world!
If we know it's an MDP, we don't need Q-learning!
If we don't know it's an MDP, we don't know if Q-learning works!
In practice, no one ever applies Q-learning to an MDP!
Q-learning is only ever applied to unknown worlds, that we hope can be approximated by an MDP (i.e. applying Q-learning will produce an alright policy).