Dr. Mark Humphrys

School of Computing. Dublin City University.

Home      Blog      Teaching      Research      Contact

Search:

CA216      CA249      CA318

CA400      CA651      CA668


Help on displaying equations


Mark Humphrys - Research - PhD - Chapter 5 - Chapter 6



6 W-learning (Minimize the Worst Unhappiness)

We do not assume in general that other agents' actions are meaningful to an agent. We do not assume that we have a handy estimate of tex2html_wrap_inline7415 . If we don't, then all we can do is observe what happened when we were not obeyed and the unrecognised action was taken. We can observe the tex2html_wrap_inline7019 and y it led to, and so build up a substitute for tex2html_wrap_inline7415 .

Now there might be many different unrecognised actions being taken by the leaders of different states. Rather than adding them all to our action set and learning a huge tex2html_wrap_inline7017 for every new action for every state, we only learn the minimum that we actually need to compete. In each state x, we simply learn how bad it is not to be obeyed in this state. We learn tex2html_wrap_inline7131 , which will require less memory than learning tex2html_wrap_inline7017 for more than one new action. A change of leader in the state does not give us a new tex2html_wrap_inline7017 quantity to learn (i.e. we have to expand our memory dynamically). Instead we just change the quantity of the single W-value for the state.

Another complication avoided by this strategy is that the leading agent may be executing a confusing, non-deterministic policy from our viewpoint, perhaps because it perceives a different state to the state that we perceive. In this case, we could not find a single fixed action tex2html_wrap_inline6865 with which to learn tex2html_wrap_inline7415 . Instead we would have to learn tex2html_wrap_inline7437 , where action k means "do whatever agent tex2html_wrap_inline7075 would do now". We will look more closely in §6.6 at the confusion caused by agents perceiving different subspaces. For now we assume that all agents build up W-values for the full state x. But note in passing that the W-value again avoids such complication by its strategy of simply building up an averaged estimate of how bad it is not to be obeyed.

We will need more than one sample to build up a proper estimate. W-learning (introduced in [Humphrys, 1995]) is a way of building up such an estimate when agents do not share the same suite of actions. When agent tex2html_wrap_inline7075 is the winner and has its action executed, all agents tex2html_wrap_inline6859 except tex2html_wrap_inline7075 update:

displaymath7411

for tex2html_wrap_inline7451 , but note that the reward tex2html_wrap_inline7019 and next state y were caused by tex2html_wrap_inline7075 rather than by agent tex2html_wrap_inline6859 itself. The reason why we do not update tex2html_wrap_inline6305 is explained later. When we see a difference term between predicted and actual, we expect that this "error term" will go to zero, but note that here it goes to a positive number.

For example, in the discrete case of W-learning, where we store each tex2html_wrap_inline7131 explicitly in lookup tables, we update:

displaymath7412

where tex2html_wrap_inline6315 takes successive values tex2html_wrap_inline7467 . Since the rewards and Q-values are bounded, it follows that the W-values are bounded (Theorem B.3).

Using such a measure of W, an agent will not need explicit knowledge about who it is competing with. Similar to how Q-learning works, a W-learning agent need only have local knowledge - what state x we were in, what action a it suggested, whether it was obeyed or not, what state y we went to, and what reward r that gave it. It will be aware of its competition only indirectly, by the interference they cause. It will be aware of them when they stop its action being obeyed, and will be aware of the y and r caused as a result. The agent will learn W by experience - by actually experiencing what the other agents want to do.

In fact, I considered not even telling the agent whether it was obeyed, but it seems there is no satisfactory way of doing this. §6.3 shows that the obeyed agent must be treated differently from the unobeyed.

In pseudocode, the W-learning method is, every time step:


singlespace1064


The change of W's mean that next time round in state x there may be a different winner, and so on. Note that by reward for Ai we really mean the combination of immediate reward and next state.



6.1 Comparison of Q-learning and W-learning

The important thing to remember in comparing Q-learning and W-learning is that they are solving different problems. Q-learning is solving the RL problem (choosing the action in pursuit of one goal). W-learning is solving the Action Selection problem (choosing the agent in timeslicing of multiple goals).

W-learning is not another method of RL. It is not a competitor of Q-learning.

Consider Q-learning as the process:

displaymath6319

where we are learning D, and d is caused by the execution of our action. Then W-learning is:

displaymath7484

where D is already learnt, and f is caused by the execution of another agent's action. In the discrete case, Q-learning would be:

displaymath6320

and W-learning would be:

displaymath7486

this is confusing because it looks like a standard way [Sutton, 1988] of writing the Q-learning update:

displaymath7487

where the expected value of the error term (d-D) goes to zero as we learn. But this is not the same error term as in W-learning:

displaymath7488



6.2 Progress of competition

In general, we assume that Q is learnt before W. Either we delay the learning of W (see [Humphrys, 1995, §3.1]) or, alternatively, imagine a dynamically changing collection with agents being continually created and destroyed over time, and the surviving agents adjusting their W-values as the nature of their competition changes. Q is only learnt once, right from the start of the life of the agent, whereas W is relearnt again and again. The skill that tex2html_wrap_inline6859 learns, expressed in its converged Q-values, remains intact through subsequent competitions for x. Once it learns its action tex2html_wrap_inline7519 it will promote it in all competitions, only varying the strength with which it is promoted, as its competition varies (this is why we only need to keep tex2html_wrap_inline7131 values, not tex2html_wrap_inline7523 values). In the long term, the single-step update for tex2html_wrap_inline6859 is approximated by:

displaymath7501

where tex2html_wrap_inline7019 and y are caused by the leader tex2html_wrap_inline7075 . Note that even though tex2html_wrap_inline7075 keeps suggesting the same action, we have probabilistic transitions so tex2html_wrap_inline7019 and y are not constant but are random variables. We can write this as:

displaymath7502

where the deviation tex2html_wrap_inline7319 (the loss that tex2html_wrap_inline7075 causes for tex2html_wrap_inline6859 by being obeyed in state x) is now not a fixed quantity as in §5.5.1 but a random variable.

The function tex2html_wrap_inline7547 is fixed, so any variation in tex2html_wrap_inline7319 is caused only by the variation in tex2html_wrap_inline7019 and y, which is caused by the variation in tex2html_wrap_inline7043 and tex2html_wrap_inline6279 (where tex2html_wrap_inline7559 ). We already know that tex2html_wrap_inline6279 is a stationary distribution. We will assume that, even though tex2html_wrap_inline7075 's action a is unrecognised by tex2html_wrap_inline6859 , we still have some stationary distribution tex2html_wrap_inline7043 . If so, then tex2html_wrap_inline7319 will be stationary, with expected value:

displaymath7503

We expect:

displaymath7504

though we do not actually update tex2html_wrap_inline6305 , and we expect for tex2html_wrap_inline7451 :

displaymath7505

That is, if obeyed, we expect f = D. If not obeyed, we expect .[1]

We cannot calculate tex2html_wrap_inline7595 analytically unless we know tex2html_wrap_inline7043 and tex2html_wrap_inline6279 . So instead, as in Q-learning, we calculate it by sampling it. If tex2html_wrap_inline7075 leads forever in state x, then by Theorem A.1:

displaymath7506

This convergence will be interrupted if some new agent takes the lead. tex2html_wrap_inline7131 itself might increase so that tex2html_wrap_inline7607 and tex2html_wrap_inline6859 then takes the lead in state x. If it does, W-learning stops for it until (if ever) it loses the lead. If another agent tex2html_wrap_inline7613 takes the lead by building up a high tex2html_wrap_inline7615 , then tex2html_wrap_inline6859 will suddenly be taking samples from the distribution tex2html_wrap_inline7619 . By Theorem A.2, if we update forever from this point, tex2html_wrap_inline7131 converges to the expected value of the new distribution:

displaymath7507

and so on.



6.2.1 Convergence

W-learning is an approximation of the walk through the matrix that we saw in §5.6. Instead of direct assignments to the expected loss, we have to take samples of the distribution of losses.

W-learning does this walk because we cannot exhaustively search all combinations k,i to find the highest tex2html_wrap_inline7595 in the matrix (or whatever we decide would be the fairest winner). It would be impractical to let every agent experience what it is like with every other agent in the lead for long enough to build up an expected value for each one. And it would also require memory to build up a map of the matrix and return to a past leader. W-learning tries to get down to a winner quicker than that. We just put someone into the lead, and it's up to the others to raise their W-values to pass it out. Anyone who can manage a higher W-value is allowed overtake.

Just as in Q-learning - where we don't actually have to wait for an infinite time for it to be useful, it will fixate on one or two actions fairly quickly - so in W-learning we don't actually require Q-learning to have converged and we don't have to wait to get to the expected value of tex2html_wrap_inline7319 for there to be a switch of leader. W-learning rapidly gets down to a competition involving only one or two agents. The switches of leader trail off fairly quickly as W rises.

In each state x, competition will be resolved when some agent tex2html_wrap_inline7075 , as a result of the deviations it suffers in the earlier stages of W-learning, accumulates a high enough W-value tex2html_wrap_inline6305 such that:

displaymath7623

As in the walk, the reason why the competition converges is that for the leader to keep changing, W must keep rising. While there may be statistical variations[2] of tex2html_wrap_inline7319 in any finite sample, in the long run the expected values tex2html_wrap_inline7595 must emerge, and the walk of §5.6 will take place.

tex2html_wrap_inline7075 wins because it has suffered a greater deviation in the past than any expected deviation it is now causing for the other agents. The agent that wins is the agent that would suffer the most if it did not win. Think of it as perhaps many agents "wanting" the state, but tex2html_wrap_inline7075 wanting it the most. Since all tex2html_wrap_inline7649 , we normally expect tex2html_wrap_inline7651 for it to win. W-learning resolves competition without resorting to devices such as killing off agents that are disobeyed for time t, without any tex2html_wrap_inline7655 , and in fact with normally most tex2html_wrap_inline7657 .

There will be a different competition in each state x, each being resolved at different times. Eventually, the entire statespace will have been divided up among the agents, with a winner for each state x.



6.3 Scoring Wk(x)

So why don't we update the leader's W-value as well? The answer is that if we do, we would be updating:

displaymath7663

The leader's W is converging to zero, while the other agents' W's are converging to tex2html_wrap_inline7649 . They are guaranteed to catch up with it. We might think it would be nice to try and reduce all weights to the minimum possible, so as soon as you are obeyed, you start reducing your weight. But you can only find the minimum by reducing so far that someone else takes over. As soon as an agent gets into the lead, its W-value would start dropping until it loses the lead. We will have back and forth competition forever under any such system, whereas we want someone to actually win the state.

If we do for all i including k:

displaymath7664

then tex2html_wrap_inline7677 stays the same, while (unless they are suffering zero) the other tex2html_wrap_inline7679 and overtake it. If we do for all i including k:

displaymath7484

then tex2html_wrap_inline7685 , while (unless they are suffering zero) the other tex2html_wrap_inline7687 some quantity > 0 and overtake it. So we can't have the same rule for the leader as for the others. The leader does nothing - it's up to the others to catch up with it. If they can't, we have a resolved competition.



6.4 Strict highest W

Not scoring the leader tex2html_wrap_inline6305 leads to a potential problem however. If tex2html_wrap_inline6305 somehow gets set to an unfairly large value, then it will never get corrected, since the other agents will be unable to catch up to it.

This can happen if it gets initialised randomly to some large value. Note that (Theorem B.3). Any W-value will eventually be challenged since we expect tex2html_wrap_inline7649 . However, a W-value in the range tex2html_wrap_inline7701 may never be challenged.

The solution is that we initialise all tex2html_wrap_inline7131 to be in the range tex2html_wrap_inline7705 . They can be random within that range since tex2html_wrap_inline6791 will wipe out the initial value in the first update.



6.5 Stochastic highest W

But an unfairly high tex2html_wrap_inline6305 can also happen because of unlucky samples. Say one time tex2html_wrap_inline7075 wasn't in the lead in state x, and it experienced:

displaymath7709

where tex2html_wrap_inline7719 is a sample from the very high end of the distribution with expected value tex2html_wrap_inline7721 . tex2html_wrap_inline7075 normally won't suffer this when tex2html_wrap_inline7725 is in the lead, but this single unlucky update puts tex2html_wrap_inline7075 into the lead and its W-value is never challenged after that. The other W's all converge to their expected values tex2html_wrap_inline7595 , but tex2html_wrap_inline6305 does not converge to anything. It remains representing this single unlucky sample.

The solution is still not to score the leader's W-value, at least not while it is in the lead. Rather, we make sure that its W-value is updated a few times before it can take the lead forever. It can't win based on just one sample.

The solution is to pick the highest W with a Boltzmann distribution that starts at a moderately high temperature (pick a stochastic winner centred on the highest W) and declines over time. We still only score the ones that aren't obeyed. We reduce the temperature until we end up with one winner, the others converging to the deviation it causes them. As before, we aim to have the finished creature end up with just a low temperature Boltzmann rather than brittle strict-determinism.

To be precise, when we observe state x, we obey agent tex2html_wrap_inline7075 with probability:

displaymath7710

Note that tex2html_wrap_inline7737 . The advantage of this is that if we start with a high temperature (pick random winners), it allows W-values to start totally random. We don't need an initialisation strategy any more.



6.6 W-learning with subspaces

The analysis in §6.2 showed that if agent tex2html_wrap_inline7075 leads in state x, then:

displaymath7739

But this assumes that the x in tex2html_wrap_inline7131 refers to the full state. What if our agents, which are already only learning Q-values in subspaces, were to only build up their W-values in subspaces too?

The problem with this is that there may be many different leaders for the many different full states which tex2html_wrap_inline6859 sees as all the one state. And even if there was just one leader, what tex2html_wrap_inline6859 sees as one state is seen by the leader as a number of different states so it will be executing different actions, causing quite different deviations for tex2html_wrap_inline6859 . From tex2html_wrap_inline6859 's point of view, sometimes the leader tex2html_wrap_inline7075 executes a good action, sometimes a bad one, for no apparent reason. But instead of this simply being samples from different ends of a distribution, we are actually taking samples from different distributions. In short, we are not repeatedly updating:

displaymath7502

for a single distribution tex2html_wrap_inline7319 with expected value tex2html_wrap_inline7595 . Rather we are updating:

displaymath7741

for many different distributions tex2html_wrap_inline7769 , representing different k's and different x's. By Theorem A.4:

displaymath7742

where p(j) is the probability of taking a sample from .[3] So we coalesce the expected values of multiple distributions into one W-value. This is a weighted mean since tex2html_wrap_inline7779 .

It is a rather crude form of competition. While Q-learning with subspaces is just as good for learning the policy as Q-learning with full space, W-learning with subspaces might not be as "intelligent" a form of competition as W-learning with full space. The agent doesn't need the full space to learn its own policy, but it may need it to talk to other agents.

We will return to this in §10. For now, we use both Q-learning and W-learning with subspaces for the sake of the tiny memory requirements this method will have. tex2html_wrap_inline6859 's W-value is a weighted mean of all the separate W-values that it would have if it was able to distinguish the states. While all these W-values have been compressed into one W-value, it is done fairly. The more likely it is to experience a deviation tex2html_wrap_inline7769 , the more biased tex2html_wrap_inline7131 will be in the direction of tex2html_wrap_inline7787 .



6.6.1 Agents with heterogenous sensory worlds

With W-learning with subspaces, we can add new sensors dynamically, with new agents to make use of them, and the already-existing agents will just react to it as more competition, of the strange sort where sometimes in the same state (as they see it) they are opposed and sometimes not (Figure 6.1).


Figure 6.1: Agents may compete with each other despite sharing no common sensors. The abstract "full" state here is tex2html_wrap_inline6307 , though nothing in the creature works with this full state.

Recall the "map" of the statespace that we suggested drawing in §5, showing who wins each state. When agents work in subspaces only, the creature as a whole can still draw up a map of the full state-space. A W-value built up by an agent in a sub-state will typically be enough to win only some of the full states that the sub-state maps to, and lose others. The agent itself could not draw up a map of its statespace, since it could not classify its x as either "won" or "lost".

Note how concepts are becoming distributed in this model. For example, in our House Robot problem, there is now no central Perception area where the concepts of both "dirt" and "smoke" coexist. One agent solves the perception problem of distinguishing dirt from non-dirt. Another agent solves the different problem of separating smoke from non-smoke. Each solves the problem of vision to the level of sophistication necessary for its own purposes - nobody directly solves the problem of separating dirt from smoke. Agents existing in different sensory worlds are competing against each other.

Hierarchical Q-learning with subspaces looks like this too except that the switch must know about the full state-space.



[1] Actually, although has learnt the optimal action given its statespace, action set and reward distribution, an agent that is connected to different senses or actions of the body may suggest things that are actually better for , things that is not able to learn itself because it is denied access to some senses or actions. In such a case it would positively pay not to be obeyed (we will show such a case later). Our model can actually cope with this - will simply develop negative W-values and not compete. We will return to this whole issue in §18. For the moment we will assume that agents know best what is good for themselves, that is, that they will suffer if not obeyed.

[2] For example, the one that takes the lead may not actually be the one with the worst expected loss. This is what I allowed for in [Humphrys, 1995, §4], where the longer walk will terminate within steps. In fact, the one that takes the lead mightn't even have an expected loss worse than , it might just be an unlucky sample.

[3] Assuming this probability is stationary. What we are talking about here is, given that the agent observes a certain subspace state, what is the probability that this is really a certain full space state?



Chapter 7

Return to Contents page.



Feeds      w2mind.org

On Internet since 1987.