Dr. Mark Humphrys

School of Computing. Dublin City University.

Home      Blog      Teaching      Research      Contact


CA216      CA249      CA318

CA400      CA651      CA668

Help on displaying equations

Mark Humphrys - Research - PhD - Chapter 7 - Chapter 8

8 W-learning with subspaces (test in House Robot)

Now we return to the House Robot problem to test W-learning. There being no global (x,i) statespace to worry about, we can expand the number of agents. To the previous five we add three more agents. The collection of agents is as follows:

Rewards are in the range tex2html_wrap_inline7995 . Both the tex2html_wrap_inline7017 values and tex2html_wrap_inline7131 values refer to x in the subspaces, for which lookup tables can be used.

tex2html_wrap_inline8003 should head for the centre of an open area while tex2html_wrap_inline8005 should engage in wall-following. In fact, as we shall see, tex2html_wrap_inline8005 turned out to be useless since its preferred action of course is to stay still. Perhaps its reward should just have been if (wall visible) ..

We can add more agents than probably needed - if they're not useful they just won't win any W-competitions and won't be expressed. In Hierarchical Q-learning, you can add extra agents that aren't chosen by the switch but you pay the price of a larger (x,i) statespace.

8.1 Searching for well-balanced collections

8.1.1 Making agents weaker or stronger

Since the exact values of these rewards tex2html_wrap_inline8015 don't matter for the policy learned by Q-learning, why do we not set them all to 1 as we did before?

The answer is that the size of the rewards affect the size of the W-values. Theorem C.2 shows that if we have an agent of the form:

tex2html_wrap_inline6859 	reward: if (good event) r else s 

then tex2html_wrap_inline6859 will present W-values:


where tex2html_wrap_inline8025 is a constant independent of the particular rewards. W is simply proportional to the subtraction (r-s), so in particular we lose nothing by fixing s = 0 here and just looking at different values of r.[1] Then we have simply:


Increasing the size of tex2html_wrap_inline7019 will cause tex2html_wrap_inline6859 to have the same policy (the same disagreements with the other agents about what action to take), but higher W-values (an increased ability to compete). tex2html_wrap_inline7019 is the parameter by which we make agents with the same logic stronger or weaker within the creature as a whole. We do not want all agents to be of equal importance within the creature. Rather, adaptive collections are likely to involve well-chosen combinations of weak and strong agents. For instance, a creature containing a strong version of the predator-avoiding agent: (if predator visible r = -10 else r = 0) will behave differently from one containing a weak version of the same thing: (if predator visible r = -0.1 else r = 0). The predator-avoiding agent in both will be suggesting the same actions, but in the former creature it is more likely to actually win its competitions.

The set of agents above define a vast range of creature behaviors, depending on how strong each agent is. For example, if tex2html_wrap_inline8047 and all other rewards are 0.001, then tex2html_wrap_inline8005 will beat all other agents in competition since its absolute (D-f) differences will be so large compared to theirs. In almost all states x it will build up a higher W-value tex2html_wrap_inline8055 than any of the competitor tex2html_wrap_inline7131 's. The house robot will be completely dominated by tex2html_wrap_inline8005 , and will spend all its time wall-following - in fact, spend all its time stationary, with a wall in sight, since that is tex2html_wrap_inline8005 's preferred position.

In conclusion, the kind of scaling we suggested in §5.4 is not of any use to us. We're quite happy to have unequal agents, to use the absolute differences between Q-values, not relative ones. This demands that their Q-values are all judged on the same scale, but this is no restriction - we have to adjust their rewards relative to each other anyway to make an adaptive collection.

8.1.2 Making agents weaker or stronger without re-learning Q

So we try out different combinations of the tex2html_wrap_inline7019 's, either by hand or by an automated search, looking for adaptive collections. A trick that was employed to speed up our search is that we only have to learn the agents' Q-values once at the start. We learn the Q-values for reward 1:

tex2html_wrap_inline6859 	reward: if (condition) 1 else 0
Then to generate the Q-values for this:
tex2html_wrap_inline6859 	reward: if (condition)  tex2html_wrap_inline7019  else 0
we just multiply the base Q-values by tex2html_wrap_inline7019 (Theorem C.3).

Figure 8.1 shows how multiplying the base Q-values by a factor does not change the Q-learning policy (the suggested action) but it does change the progress of W-learning (the differences D - f ) by exaggerating (making the agent stronger) or levelling out (making it weaker) the Q-values, while still preserving their basic contour. This trick allows us make agents weaker or stronger without re-learning Q.

Figure 8.1: Multiplying the reward by a factor multiplies the Q-values by that factor, which either exaggerates or levels out the contour tex2html_wrap_inline6309 . The agent will still rank the actions in the same order, and still suggest the same preferred action, but its W-values will be different.

A scaled measure of W, however, would be indifferent to multiplication by a constant. Say we were using a standard deviation-based measure of W. Instead of the static, z-score version, we would probably use a dynamic version depending on tex2html_wrap_inline6865 :


Multiplying the base Q-values by a constant c would multiply both the mean and the standard deviation by c. But then our W-value would be:


that is, unchanged. So we would have no easy way of making agents weaker or stronger. Scaling does not make agents all equal in strength to each other. But the problem is that it makes their inequalities fixed.

Finally, if we can just multiply all the Q-values by a constant, can't we just multiply the W-values by the constant instead of re-learning them? The answer is that a W-value depends on who the current leader is. If we increase tex2html_wrap_inline7019 in:


then tex2html_wrap_inline7131 increases, but as soon as it does, everything may change. tex2html_wrap_inline6859 may go into the lead itself, causing such a huge loss to another agent that it increases its W-value and then takes over the lead, and then tex2html_wrap_inline6859 's W-value will reflect a completely different loss tex2html_wrap_inline8101 . Once a W-value changes, we have to follow the whole re-organisation to its conclusion.

8.1.3 No Global reward function

The agents learn their Q-values from their local reward functions and then organise their action selection by W-learning, all without any reference to the global reward function of §4.3.1. In this work, for purposes of comparison with the other action selection methods, we will now want to test the fitness of the creature.

Obviously, if the global reward function still defines what we are looking for, we still need to use it to score the fitness of the collection. But it no longer need be available to the agents as an explicit function they can learn against. It is only used to test them. Hence the fitness function could be just implicit in the environment, as in the best Artificial Life research [Ray, 1991, Todd et al., 1994].

As has been said many times in contrasting natural evolution with the standard Genetic Algorithm, living things don't have an explicit global reward defined or available to learn from. Their fitness test is only implicit in their environment - whether they manage to live or die. How exactly they replicate is up to them, and what persists over time is often a surprise to the observer.

Similarly, as we give our artificial creature more complex multi-goal tasks, the global reward functions become much harder to design than the local ones (as we argued above).

Imagine that we know what behavior we want when we see it, but we're having trouble designing a suitable multi-reward global reward function. So we adopt the strategy of tweaking the tex2html_wrap_inline7019 's by hand, letting the agents re-organise themselves, and seeing the result. Say agent tex2html_wrap_inline6859 is not being expressed in the creature's behavior. We slowly increase tex2html_wrap_inline7019 until it first starts to win the one or two absolutely crucial states that it needs. As we increase tex2html_wrap_inline7019 further, it will win more and more extra states until eventually it would dominate the creature. And so on.

We don't want to design all the explicit behavior, but at the same time we do have some ideas as to what is suitable behavior and what is not, so we do not want to simply design an abstract set of global values as in §4.3.1 and hope for the best. What our decomposition into multiple reward functions defended by competing agents gives us is effectively an adjustable global reward function. We increase the strength in general of particular agents, while still letting the agents sort out the details.

8.2 Analysis of best solution

So it would not necessarily be difficult to design the collection of tex2html_wrap_inline7019 's by hand, increasing ones that aren't being expressed, and so on. To test a combination of tex2html_wrap_inline7019 's, we multiply the base Q-values by them, and then re-run the W-competition.

In fact, in this work I use a simple Genetic Algorithm to automate the search. The genotype encodes a set of tex2html_wrap_inline7019 's in chromosomes of length 4 (quite a coarse-grained evolutionary search). We have a population of size 60, initially randomised, evolving for 30 generations. The advantage of an automated search is that I can be confident that I have put the same amount of effort into finding good solutions for each method.

For this test (W-learning with subspaces on the House Robot problem) the best combination of tex2html_wrap_inline7019 's found by GA search was:


which averages 13.446 per 100 steps, slightly less than we got with Hierarchical Q-learning, but achieved with a reduction in memory requirements from 9.6 million to 1600.

We have solved the problem not with one complex entity, but via the complex interactions of multiple simple entities.

Looking closer at our solution, note that tex2html_wrap_inline8005 , as predicted, is useless. With such a tiny reward it will not be expressed at all. Neither, interestingly, will tex2html_wrap_inline7045 - obviously other agents are managing to take the creature back to the plug often enough for it to not be needed. The creature as a whole works by interleaving all its goals. Here are the strongest few W-values:


We can see that there is a complex intermingling of tex2html_wrap_inline8141 , tex2html_wrap_inline8143 and tex2html_wrap_inline8145 . The states (*,2) (as seen by tex2html_wrap_inline8149 ) are those where a human has been identified as a stranger. These are the crucial states for tex2html_wrap_inline8149 since it can pick up a continuous reward if it keeps the human in sight. The states (*,0) (as seen by tex2html_wrap_inline7815 ) are those where fire is visible without a wall in the way. These build up higher W-values ( tex2html_wrap_inline7815 is more confident about what to do) than when there is a wall in the way, where tex2html_wrap_inline7815 will need some kind of stochastic policy.

Here are the probabilities of each action being suggested by tex2html_wrap_inline7815 when the fire is in direction 4, behind a wall. The higher probability actions under a soft max control policy are highlighted.


We can see that tex2html_wrap_inline7815 builds up a broad front in approach to the wall. Moving at right angles to the direction of the fire (directions 2 and 6) is good because it is more likely to see the end of the wall. In any case, when the route to the fire is blocked by a wall, tex2html_wrap_inline7815 is amenable to suggestions by other agents, in particular by the combination of tex2html_wrap_inline8003 and tex2html_wrap_inline7115 , who drive the house robot in a strong wandering behavior otherwise.

tex2html_wrap_inline7045 with its tiny reward is irrelevant - its job is done for it by tex2html_wrap_inline8003 bringing the creature towards the centre and hence quite often past the plug. tex2html_wrap_inline8175 has a not insignificant reward, but finds its job is done for it by tex2html_wrap_inline8149 , which also wants to investigate unidentified humans (in case they turn out to be strangers). So tex2html_wrap_inline8175 lets tex2html_wrap_inline8149 do all the work for it, and as long as tex2html_wrap_inline8149 is being obeyed by the creature, tex2html_wrap_inline8175 is happy:


The sole exception is state (7,0), which for some reason fell to tex2html_wrap_inline8175 to be responsible for, while tex2html_wrap_inline8149 took its turn at dropping out of the competition:


[1] This is something I neglected to exploit in [Humphrys, 1995, §5.4], where I used 2-reward functions with 2 non-zero rewards and needlessly evolved both rewards.

Chapter 9

Return to Contents page.

Feeds      w2mind.org

On Internet since 1987.