Dr. Mark Humphrys

School of Computing. Dublin City University.






5 An abstract decentralised model of mind

Looking at exactly what Q(x,i) values are learnt by the previous method, the switch learns things like: if dirt is visible, the dirt-seeking agent wins, because it expects to make a good contribution to the global reward; otherwise, if the plug is visible, the plug-seeking agent wins, because it expects to make a (smaller) contribution. But the agents could have told us this themselves, with reference only to their personal rewards. In other words, the agents could have organised themselves like this in the absence of a global reward.

In this chapter I consider how agents might organise themselves sensibly in the absence of a global reward. Watkins in his PhD [Watkins, 1989] was interested in learning methods that might plausibly take place within an animal, involving a small number of simple calculations per timestep, and so on. Similarly, I am looking for more biologically-plausible action selection, something that could plausibly self-organize in the development of a juvenile, that would not require an all-knowing homunculus. Like Watkins, I will propose methods deriving from this motivation rather than trying to copy anything seen in nature.

The starting point for this exploration of decentralised minds is Rodney Brooks' contrast in [Brooks, 1986] between the traditional, vertical AI architecture (Figure 5.1) where representations come together in a central "reasoning" area, and his horizontal subsumption architecture (Figure 5.2) where there are multiple paths from sensing to acting, and representations used by one path may not be shared by another.

Figure 5.1: The traditional, vertical AI architecture. The central intelligence operates at a symbolic level.

Figure 5.2: Brooks' horizontal subsumption architecture.



Brooks' method is to build full working robots at each stage. He builds in layers: layer 1 is a simple complete working system, layers 1-2 together form a complete, more sophisticated system, layers 1-3 together form a complete, even more sophisticated system, and so on. Lower layers do not depend on the existence of higher layers, which may be removed without problem. Higher layers may interfere with the data flow in layers below them - they may "use" the layers below them for their own purposes. To be precise, each layer continuously generates output unless inhibited by a higher layer. Subsumption is where a higher layer inhibits output by replacing it with its own signal. If a higher layer doesn't interfere, lower layers just run as if it wasn't there.
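The inhibition scheme just described can be sketched in a few lines. This is a minimal illustration, not Brooks' implementation: the layer names and the explicit subsume flag are assumptions for the example.

```python
# Sketch of subsumption-style layering (illustrative, not Brooks' code):
# every layer always produces an output, and a higher layer may subsume
# (replace) the output of the layers below it.

def select_output(layers, state):
    """layers[0] is lowest. Each layer maps state -> (output, subsume_flag).
    The highest layer asserting its subsume_flag wins; otherwise the
    lowest layer's output stands, as if the higher layers weren't there."""
    output = None
    for layer in layers:                 # walk from lowest to highest
        signal, subsume = layer(state)
        if output is None or subsume:    # higher layer replaces the output
            output = signal
    return output

# Layer 1: wander -- always active, never subsumes.
def wander(state):
    return ("wander", False)

# Layer 2: avoid obstacles -- subsumes only when an obstacle is seen.
def avoid(state):
    return ("avoid", state.get("obstacle", False))
```

When no obstacle is in view, layer 2 does not interfere and the creature simply wanders, exactly the "lower layers run as if it wasn't there" behaviour described above.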

Brooks' model develops some interesting ideas.

The logical development of this kind of decentralisation is the model in Figure 5.3. Here, the layers (which we now call agents) have become peers, not ranked in any hierarchy. Each can function in the absence of all the others. Each agent is connected directly from senses to action and will drive the body in a pattern of behavior if left alone in it. Each is frustrated by the presence of other agents, who are trying to use the body to implement their plans. Brooks' hierarchy would be a special case of this where higher-level agents happen to always win when they compete against lower-level ones.

Figure 5.3: Competition among peer agents in a horizontal architecture. Each agent suggests an action, but only one action is executed. Which agent is obeyed changes dynamically.



This is actually the model that Lin's Hierarchical Q-learning used. Each agent must send its action to a switch, which will either obey or refuse it. In Hierarchical Q-learning, the switch is complex and what is sent is simple (a constant action a_i). Now we ask if we can make the switch simple, and what is sent more complex. Rather than having a smart switch organise the agents, can the agents organise themselves off a dumb switch?



5.1 The weight W

To answer this, I tried to look at it from an agent's point of view. Agents can't know about the global system, or the mission of the creature as a whole. Each must live in a local world. Here agents have no explicit knowledge of the existence of any other agents. Each, if you like, believes itself to be the entire nervous system.[1]

The basic model is that when the creature observes state x, the agents suggest their actions with numeric strengths or Weights W_i(x) (I call these W-values). The switch in Figure 5.3 becomes a simple gate letting through the highest W-value. To be precise, the creature contains agents A_1, ..., A_n. In state x, each agent A_i suggests an action a_i. The switch finds the winner k such that:

W_k(x) = \max_{i \in \{1,\ldots,n\}} W_i(x)

and the creature executes a_k. We call A_k the leader in the competition for state x at the moment, or the owner of x at the moment. The next time x is visited, there might be a different winner.
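The gate itself is trivial and can be sketched directly. The tuple representation of a suggestion is an assumption for illustration; the point is that the switch knows nothing about the agents, it just passes through the highest W-value.

```python
# The "dumb switch": each agent i suggests an action a_i promoted with
# weight W_i(x); the switch is nothing but an argmax gate.

def switch(suggestions):
    """suggestions: list of (agent_id, action, w_value) tuples.
    Returns (winner_id, action) -- the current owner of this state
    and the action the creature will execute."""
    winner_id, action, _ = max(suggestions, key=lambda s: s[2])
    return winner_id, action
```

For example, `switch([("hide", "go_home", 0.4), ("food", "approach", 1.2)])` obeys the "food" agent, because its W-value is higher.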

This model will work if we can find some automatic way of resolving competition so that the "right" agent wins. The basic idea is that an agent will always have an action to suggest (being a complete sensing-and-acting machine), but it will "care" some times more than others. When no predators are in sight, the "Avoid Predator" agent will continue to generate perhaps random actions, but it will make no difference to its purposes whether these actions are actually executed or not. When a predator does come into sight however, the "Avoid Predator" agent must be listened to, and given priority over the default, background agent, "Wander Around Eating Food".

For example, in Figure 5.4, when the creature is not carrying food, the "Food" agent tends to be obeyed. When the creature is carrying food, the "Hide" agent tends to be obeyed. The result is the adaptive food-foraging creature of Figure 5.5.

Figure 5.4: Competition is decided on the basis of W-values. The action with the highest W-value is executed by the creature.



Figure 5.5: Internal conflict, adaptive external behavior: the conflict between the agents "Hide" and "Food" results in an adaptive food-forager. On the left is the kind of global, serial, control program one would normally associate with such behavior. On the right, the two competing agents produce the same effect without global control.



Note that the "Hide" agent goes on suggesting the same action with the same W in all situations. It just wants to hide all the time - it has no idea why sometimes it is obeyed, and other times it isn't.

We can draw a map of the state-space, showing for each state x, which agent succeeds in getting its action executed. Clearly, a creature in which one agent achieves total victory (wins all states) is not very interesting. It will be no surprise what the creature's behavior will be then - it will be the behavior of the agent alone. Rather, interesting creatures are ones where the state-space is fragmented among different agents (Figure 5.6).



Figure 5.6: We expect competition to result in fragmentation of the state-space among the different agents. In each state x, the "Hide" agent suggests some action with weight W_h(x), and the "Food" agent suggests an action with weight W_f(x). The grey area is the area where W_h(x) > W_f(x). The "Hide" agent wins all the states in this area. The black area shows the states that the "Food" agent wins.



In fact, with the agents we will be using, creatures entirely under the control of one agent will be quite unadaptive. Only a collection of agents will be adaptive, not any one on its own.

One of the advantages of Brooks' model is that layer 2, for example, does not need to re-implement its own version of the wandering skill implemented in layer 1 - it can just allow layer 1 to be executed when it wants to use it. Similarly here, even though agents have their own entire plan for driving the creature (for every x they can suggest an a), they might still come to depend on each other. An agent can use another agent by learning to cede control of appropriate areas of state-space (which the other agent will take over).



5.2 W = Drive Strength

Our abstract decentralised model of mind is a form of the drives model commonly used in ethology (dating back to Hull's work in the 1940s), where the "drive strength" or "importance variable" [Tyrrell, 1993] is equivalent to the W-value, and the highest one wins (the exact action of that agent is executed). Tyrrell's rule [Tyrrell, 1993, §8.1], under which the behaviour whose drive strength is currently highest is the one expressed, is equivalent to saying that the winner k satisfies:

W_k(x) = \max_{i \in \{1,\ldots,n\}} W_i(x)

It is drive strength relative to the other drives that matters, rather than absolute drive strength.

The big question, of course, is where these drive strengths or W-values come from. Do they all have to be designed by hand? Or can they somehow be learned? Ethologists have many ideas about what the weights should represent, but no actual algorithms that are guaranteed to produce such weights automatically. They tend to leave it as a problem for evolution or, in our case, a designer. See [Tyrrell, 1993, §4.2, §9.3.1] for the difficult design problems in actually implementing a drives model. Tyrrell has the problem of designing sensible drive strengths and relative drive strengths, and further has to design the appropriate action for each agent to execute if it wins.

It should be clear by now that we are going to use Reinforcement Learning to automatically generate these numbers. RL methods, in contrast to many forms of machine learning, build up value functions for actions. That is, an agent not only knows "what" it wants to do, it also knows "how much" it wants to do it. Traditionally, the latter are used to produce the former and are then ignored, since the agent is assumed to act alone. But the latter numbers contain useful information - they tell us how much the agent will suffer if its action is not executed (perhaps not much). They tell us which actions the agent can compromise on and which it cannot.

Systems that only know "What I want to do" find it difficult to compromise. Reinforcement Learning systems, that know "How much I want to do it", are all set up for negotiating compromises. In fact, "how much" it wants to take an action is about all the simple Q-learning agent knows - it may not even be able to explain why it wants to take it.

Our Reinforcement Learning model is also useful because given any state x at any time a state-space agent can suggest an action a - as opposed to programmed agents which will demand preconditions and also demand control for some time until they terminate. State-space agents are completely interruptable. In the terminology of [Watkins, 1989, §9], the Action Selection methods I propose in this thesis are supervisory methods, where there is a continuous stream of commands interrupting each other, rather than delegatory methods, where a top level sends a command and then is blocked waiting for some lower level loop to terminate. Watkins argues that the latter methods characterise machines (or at least, machines as traditionally conceived) while the former characterise what animals actually do.

Recently there has been an emphasis on autonomous agents as dynamical systems (e.g. [Steels, 1994]), emphasising the continuous interaction between the agent and its world such that the two cannot be meaningfully separated. It is not absolutely clear what the defining features of a dynamical system are (over and above just any self-modifying agent), but complete interruptability seems to be one part of it.

To formalise, each agent in our system will be a Q-learning agent, with its own set of Q-values Q_i(x,a) and, more importantly, its own reward function. Each agent A_i will receive rewards r_i drawn from its personal reward distribution. The transition distribution P_{xa}(y) is a property of the world - it is common across all agents. Each agent A_i also maintains personal W-values W_i(x). Given a state x, agent A_i suggests an action a_i according to some control policy, which is most likely, as time goes on, to be such that:

Q_i(x, a_i) = \max_{a} Q_i(x, a)

The creature works as follows. Each time step:

- observe state x
- each agent A_i suggests an action a_i, promoted with strength W_i(x)
- the switch lets through the action a_k of the agent A_k with the highest W-value
- the creature executes a_k, and the world returns a new state y
- each agent A_i receives its own reward r_i, and updates Q_i(x, a_k) and W_i(x)

Note that the transition will generate a different reward r_i for each agent. For updating Q, we use normal Q-learning. For updating W, we want somehow to make use of the numerical Q-values. There are a number of possibilities.
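The per-timestep loop can be sketched as follows. This is a minimal sketch under stated assumptions: agents are plain dicts with their own Q-tables, W-tables, and reward functions (illustrative structures, not the thesis' implementation), the world is a deterministic `execute` function, and the W update is deliberately left out, since the following sections consider different choices for it.

```python
def best_action(agent, x, actions):
    """The action a_i the agent suggests: argmax of its own Q-values."""
    return max(actions, key=lambda a: agent["Q"].get((x, a), 0.0))

def timestep(agents, x, execute, actions, alpha=0.2, gamma=0.9):
    # 1. each agent A_i suggests its best action a_i with strength W_i(x)
    suggestions = [(ag, best_action(ag, x, actions), ag["W"].get(x, 0.0))
                   for ag in agents]
    # 2. dumb switch: the highest W-value gets through
    leader, a_k, _ = max(suggestions, key=lambda s: s[2])
    # 3. execute the winning action; the world returns the next state y
    y = execute(x, a_k)
    # 4. the one transition yields a different reward r_i for each agent,
    #    and each agent updates its own Q-values by normal Q-learning
    for ag, a_i, _ in suggestions:
        r_i = ag["reward"](x, a_k, y)
        best_next = max(ag["Q"].get((y, b), 0.0) for b in actions)
        q = ag["Q"].get((x, a_k), 0.0)
        ag["Q"][(x, a_k)] = (1 - alpha) * q + alpha * (r_i + gamma * best_next)
    return y, leader, a_k
```

Note that every agent updates Q_i(x, a_k) for the action actually executed, whether or not it was the agent's own suggestion.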



5.3 Static W=Q

A static measure of W is one where the agent promotes its action with the same strength no matter what (if any) its competition is. For example:

W_i(x) = Q_i(x, a_i)

This simple method will be the fourth method we test.



5.4 Static W = importance

A more sophisticated static W would represent the difference between taking the action and taking other actions, i.e. how important this state is for the agent. If the agent does not win the state, some other agent will, and some other action will be taken. In an unimportant state, we may not mind if we are not obeyed. The concept of importance is illustrated in Figure 5.7.

Figure 5.7: The concept of importance. State x is a relatively unimportant state for the agent (no matter what action is taken, the discounted reward will be much the same). State y is a relatively important state (the action taken matters considerably to the discounted reward).



For example, W could be the difference between the Q-values of a_i and of the worst possible action:

W_i(x) = Q_i(x, a_i) - \min_{a} Q_i(x, a)

Or W could be a Boltzmann function, for example:

W_i(x) = \frac{ e^{Q_i(x, a_i) / T} }{ \sum_{a} e^{Q_i(x, a) / T} }

The idea of this kind of scaling is that what might be a high Q-value in one agent might not be a high Q-value to another.
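The static schemes of Sections 5.3 and 5.4 can be sketched together. The Boltzmann form here (the best action's share of the total exp(Q/T) mass) is one plausible reading of the scaling described above, not necessarily the thesis' exact formula; `qs` mapping action to Q_i(x, a) is an illustrative structure.

```python
import math

def w_static_q(qs):
    """Section 5.3: W = the Q-value of the agent's best action."""
    return max(qs.values())

def w_importance(qs):
    """Section 5.4: W = gap between the best and worst actions."""
    return max(qs.values()) - min(qs.values())

def w_boltzmann(qs, temperature=1.0):
    """Boltzmann scaling (assumed form): the best action's share of the
    total exp(Q/T) mass, so W lies in (0, 1] whatever the raw Q scale."""
    total = sum(math.exp(q / temperature) for q in qs.values())
    return math.exp(max(qs.values()) / temperature) / total
```

Note that `w_importance` and `w_boltzmann` are unchanged if a constant is added to all of an agent's Q-values, whereas `w_static_q` is not: that is exactly the sense in which the scaled measures let agents with different reward scales compete fairly.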



5.4.1 Static W=z

A scaled measure of W could be how many standard deviations a_i's Q-value is from the mean Q-value. That is, W is the z-score. To be precise, the mean would be:

\bar{Q} = \frac{1}{|A|} \sum_{a \in A} Q_i(x, a)

the standard deviation would be:

\sigma = \sqrt{ \frac{1}{|A|} \sum_{a \in A} \left( Q_i(x, a) - \bar{Q} \right)^2 }

and this scheme is:

W_i(x) = \frac{ Q_i(x, a_i) - \bar{Q} }{ \sigma }

If the Q-values are all the same, \sigma = 0 and we define W_i(x) = 0. Otherwise \sigma > 0, and remember that a_i is the best action, so W_i(x) > 0.

But even this measure may still be a rather crude form of competition. Consider the case where there are just two actions. Either their Q-values are the same (so W_i(x) = 0), or else the mean is halfway between them and the larger one is precisely 1 standard deviation above the mean. So to compete in a state where we care what happens, we have W_i(x) = 1 always, independent of the values of, or the gap between, the Q-values.
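A sketch of the z-score scheme, using the population standard deviation and the zero-variance convention W = 0; the dict representation of one agent's Q-values in a state is an assumption for illustration. The two-action case demonstrates the crudeness just noted: any two unequal Q-values give exactly W = 1.

```python
import math

def w_zscore(qs):
    """W = how many (population) standard deviations the best action's
    Q-value lies above the mean Q-value for this state."""
    values = list(qs.values())
    mean = sum(values) / len(values)
    var = sum((q - mean) ** 2 for q in values) / len(values)
    if var == 0.0:               # all Q-values equal: define W = 0
        return 0.0
    return (max(values) - mean) / math.sqrt(var)
```

With two actions, the mean is halfway between the two Q-values and the larger one is one standard deviation above it, so the gap between the Q-values cancels out of the ratio entirely.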



5.5 Dynamic (learnt) W-values

The real problem with a static measure of W is that it fails to take into account what the other agents are doing. If agent A_i is not obeyed, the actions chosen will not be random - they will be actions desirable to other agents. What these actions are will depend on the particular collection of agents, but they may overlap in places with A_i's own suggested actions. If another agent happens to be promoting the same action as A_i, then A_i does not need to be obeyed. Or more subtly, the other agent might be suggesting an action which is almost perfect for A_i, while if A_i's exact action succeeded, it would be disastrous for the other agent, which would fight it all the way.

We have two types of states that A_i need not compete for:

Type 1
- A state which is relatively unimportant to it. It doesn't matter much to A_i's discounted reward what action is taken here. In particular, it doesn't matter if some other agent A_k takes an action instead of it.

Type 2
- A state in which it does matter to A_i what action is taken, but where some agent A_k just happens to be suggesting an action which is good for A_i. This may or may not be the action A_i would itself have suggested.

With dynamic W-values, the agents observe what happened when they were not obeyed, and then modify their W_i(x) values based on how bad it was, so that the next time round in state x there may be a different winner.

Imagine W as the physical strength with which agents transmit their signals, each transmission using up a certain amount of energy. It would be more efficient for agents to have low W if possible, spending energy only on the states they have to fight for. The first naive idea was: if an agent is suffering a loss, it raises its W-value to try to get obeyed. But this will result in an "arms race", with all W's gradually going to infinity. The second naive idea was then to let the agent that is being obeyed decrease its W-value. But then its W-value will head back down until it is not obeyed any more. There will be no resolution of competition. W-values will go up and down forever.

Clearly, we want W to head to some stable value. And not just the same value for every agent, such as having all unobeyed agents end up at the same W_i(x). We want the agents to not fight equally for all states. We need something akin to an economy, where agents have finite spending power and must choose what to spend it on. We will have a resource (state-space), agents competing for ownership of the resource, caring more about some parts than others, and with a limited ability to get their way.



5.5.1 Dynamic W = D - F

What we really need to express in W is the difference between predicted reward (what is predicted if the agent is obeyed) and actual reward (what actually happened because we were not obeyed). What happens when we are not listened to depends on what the other agents are doing.

The predicted reward is D = E(d), the expected value of the reward distribution. If we are not obeyed, we will receive reward f, with expected value F = E(f). We can learn what this loss is by updating:

W_i(x) \leftarrow (1 - \alpha) W_i(x) + \alpha (D - f)

After repeated such updates:

W_i(x) \approx D - F

Using a (D-f) algorithm, the agent learns what W is best without needing to know why. It does not need to know whether it has ended up with low W because this is Type 1 or because this is Type 2. It just concentrates on learning the right W.
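The (D - f) update can be sketched and its convergence checked numerically. This is a minimal sketch with illustrative names and learning rate: `d_predicted` plays the role of D, the samples `f` are the rewards actually received under other leaders, and with stationary rewards W_i(x) converges to D - F.

```python
def update_w(w, d_predicted, f_actual, alpha=0.1):
    """One (D - f) update: nudge W toward the loss just observed,
    i.e. predicted reward minus the reward actually received."""
    return (1 - alpha) * w + alpha * (d_predicted - f_actual)

def run(d_predicted, f_samples, w=0.0, alpha=0.1):
    """Apply repeated updates; with stationary f, W heads to D - E(f)."""
    for f in f_samples:
        w = update_w(w, d_predicted, f, alpha)
    return w
```

The fixed point is D - F whatever the learning rate, so the agent settles on a stable W without ever needing to know why it was not obeyed.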

Consider the case where our Q-learning agents all share the same suite of actions, so that if another agent is taking action a_k, we have already got an estimate of the return in Q_i(x, a_k). Then we can directly assign the difference:

W_i(x) := D - F

where D = Q_i(x, a_i) and F = Q_i(x, a_k):

W_i(x) := Q_i(x, a_i) - Q_i(x, a_k)

That is:

W_i(x) = d_{ik}(x)

where d_{ik}(x) is the deviation (expected difference between predicted and actual reward), or expected loss, that A_k causes for A_i if it is obeyed in state x. We can show the losses that agents will cause to other agents in an n \times n matrix:

\begin{pmatrix}
d_{11}(x) & d_{12}(x) & \cdots & d_{1n}(x) \\
d_{21}(x) & d_{22}(x) & \cdots & d_{2n}(x) \\
\vdots    & \vdots    & \ddots & \vdots    \\
d_{n1}(x) & d_{n2}(x) & \cdots & d_{nn}(x)
\end{pmatrix}

where all d_{ik}(x) \geq 0. Note that all d_{kk}(x) = 0 (the leader itself suffers an expected loss of zero).
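When all agents share one action suite, the deviation matrix can be read straight off the Q-values. A sketch, assuming each agent's Q-values for the fixed state x are supplied as a dict from action to Q_i(x, a); the entry at row i, column k is the loss leader k causes agent i.

```python
def deviation_matrix(q_tables):
    """q_tables[i]: dict mapping action -> Q_i(x, action) for one state x.
    Returns the n x n matrix with entries
    d_ik(x) = Q_i(x, a_i) - Q_i(x, a_k)."""
    best = [max(q, key=q.get) for q in q_tables]   # each agent's a_i
    return [[qi[best[i]] - qi[best[k]] for k in range(len(q_tables))]
            for i, qi in enumerate(q_tables)]
```

Since a_i is agent i's best action, every entry is non-negative, and the diagonal is zero: an agent suffers no loss when it is itself obeyed.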



5.6 A walk through the matrix

Given such a matrix, can we search in some way for the "right" winner? Consider first a situation where those agents not in the lead have to raise their W-values. If one of them gets a higher W-value than the leader then it goes into the lead, and so on.

Theorem: If each agent not in the lead takes over whenever its deviation under the current leader exceeds the leading W-value, then this process terminates, leaving some agent A_k the winner.

That is, there must exist at least one k such that all the values in the k'th column are less than or equal to a value in the k'th row.

Proof: The process goes: agent k_1 leads; agent k_2 takes the lead with W = d_{k_2 k_1}(x); agent k_3 takes the lead with W = d_{k_3 k_2}(x); and so on, where:

d_{k_2 k_1}(x) < d_{k_3 k_2}(x) < d_{k_4 k_3}(x) < \cdots

We have a strictly increasing sequence here. Each element is the highest in its column, so we are using up one column at a time. So the sequence must terminate within n steps. \Box



5.6.1 Multiple possible winners

For a given matrix, there may be more than one possible winner. For example, consider the matrix of deviations d_{ik}(x), with rows i and columns k:

\begin{pmatrix}
0 & 1 & 3 \\
2 & 0 & 1 \\
1 & 1 & 0
\end{pmatrix}

Start with all W_i(x) = 0. Choose at random agent A_1's action for execution. Then:

W_2(x) = d_{21}(x) = 2, \quad W_3(x) = d_{31}(x) = 1

Now agent A_2 is in the lead, and:

W_1(x) = d_{12}(x) = 1, \quad W_3(x) = d_{32}(x) = 1

Agent A_2 is the winner. However, if we had started by choosing agent A_3's action, then:

W_1(x) = d_{13}(x) = 3, \quad W_2(x) = d_{23}(x) = 1

Now agent A_1 is in the lead, and:

W_2(x) = d_{21}(x) = 2, \quad W_3(x) = d_{31}(x) = 1

Agent A_1 is the winner. We have a winner when some agent finds a deviation it suffers somewhere that is worse than any deviation it causes everyone else.

We do not examine all n^2 combinations and make some kind of global decision. Rather, we "walk" through the matrix and see where we stop. Where the walk stops may depend on where it starts. Later we will ask if we can make a global decision. But first we consider the case where the only possible strategy is to make a walk.
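The walk itself is only a few lines. A sketch, assuming `d[i][k]` holds the loss leader k causes agent i (diagonal zero) and that the caller chooses the starting leader, so that different starts can reach different winners.

```python
def walk(d, start):
    """Walk through the deviation matrix d from a given starting leader.
    The current leader keeps the W-value it won with; every other agent
    re-advertises W_i = d[i][leader], and the strongest challenger takes
    over only if it strictly exceeds the leading W. The leading W-values
    form a strictly increasing sequence, so this stops within n steps."""
    n = len(d)
    leader, lead_w = start, 0.0
    while True:
        challenger = max(range(n), key=lambda i: d[i][leader])
        if d[challenger][leader] <= lead_w:
            return leader          # nobody can outbid the leader: a winner
        leader, lead_w = challenger, d[challenger][leader]
```

On a matrix such as `[[0, 1, 3], [2, 0, 1], [1, 1, 0]]` (an illustrative example), starting from agent 0 the walk ends with agent 1 as winner, while starting from agent 2 it ends with agent 0: the same matrix, two possible winners.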



[1] Perhaps there is some evolutionary plausibility in this. Consider that the next step after getting a simple sensor-to-effector link working is not a hierarchy or a co-operative architecture but rather a mutation where, by accident, two links are built, and each of course tries to work the body as if it were alone.
