School of Computing, Dublin City University.


If other agents' actions are meaningless to it, all an agent can do is observe what r and y they generate, as W-learning does. It could perhaps assume that unknown actions have the effect of "do nothing" or "stay still", if it has a Q-value for such an action (§2.1.2), but it might be unwise to assume this without observing.
However, if agents share the same suite of actions, and the other agent's action a_k is recognised by the agent, it has already built up an estimate of the expected reward in the value Q_i(x, a_k). So rather than learning a W-value from samples, it can assign it directly if the successful action is communicated to it. We can do this in the House Robot problem, since all agents share the same suite of actions ("move" 0-8). In other words, we can follow the walk in §5.6 exactly; we do not have to approximate it.
In Negotiated W-learning, the creature observes a state x, and then its agents engage in repeated rounds of negotiation before resolving competition and producing a winning action a_k. Clearly we would prefer this "instant" competition to be very short. In pseudocode, the Negotiated W-learning method is, each time step:
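The negotiation loop can be sketched in Python as follows. Here each agent's Q-values for the observed state x are represented as a dict mapping action to Q-value; this representation, the function name `negotiate`, and the rule that the first objecting agent takes the lead are illustrative assumptions, not taken from the original pseudocode:

```python
import random

def negotiate(q_values, rng=random):
    """One 'instant' competition for the observed state x.

    q_values: one dict per agent, mapping action -> that agent's
    Q-value in state x (illustrative representation).
    Returns (winning agent k, winning action a_k, competition length).
    """
    k = rng.randrange(len(q_values))             # start with a random leader
    a_k = max(q_values[k], key=q_values[k].get)  # leader's preferred action
    W_k, length = 0.0, 1
    changed = True
    while changed:
        changed = False
        for i, Q in enumerate(q_values):
            if i == k:
                continue
            a_i = max(Q, key=Q.get)
            W_i = Q[a_i] - Q[a_k]   # what agent i loses by obeying the leader
            if W_i > W_k:           # i objects more strongly: it takes the lead
                k, a_k, W_k = i, a_i, W_i
                length += 1
                changed = True
                break
    return k, a_k, length
```

W_k strictly increases at every change of leader and can only take finitely many values, so the loop terminates. The W_i here are only temporary variables; nothing is stored between timesteps.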
This algorithm discovers explicitly in one timestep what W-learning only learns over time.
It also gives us the high accuracy of telling states apart, as in W-learning with full state-space, yet without needing to store such a space in memory. In fact its accuracy will be better than W-learning with full state-space, since the latter method has to use a generalisation, whereas Negotiated W-learning can tell states apart perfectly. We deal with each full state x as it comes in at runtime. The agents recognise different components of x and compete based on this one-off event.
We have no memory requirements at all for W. The W_i are just n temporary variables used at runtime. In fact, note that "Negotiated W-learning" is actually not learning at all, since nothing permanent is learnt.
The best combination found, scoring 18.212, was:
As noted before, the Negotiated W-learning walk may have a different winner depending on which random agent the walk starts with. Since we run a new competition every timestep, this means that Negotiated W-learning has a stochastic control policy. Note that W-learning may have multiple possible winners too, depending on who takes an early lead. But once a winner is found, it has a deterministic control policy.
We could make Negotiated W-learning have a deterministic control policy by recording the winner k(x) for each full state x, so that we don't have to run the competition again. On the other hand, we might have dynamic creation and destruction of agents, in which case we would want to re-run the competition every time in case the collection of agents has changed.
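Recording winners to get a deterministic policy could look like the following sketch. The wrapper class and its names are hypothetical; `resolve_competition` stands for any callable that runs one negotiation for state x:

```python
class CachedWinners:
    """Memoise the winning action k(x) per full state x (hypothetical sketch)."""

    def __init__(self, resolve_competition):
        self.resolve = resolve_competition
        self.winner = {}                    # x -> winning action

    def action(self, x):
        if x not in self.winner:            # first visit: run the competition
            self.winner[x] = self.resolve(x)
        return self.winner[x]               # later visits: deterministic

    def agents_changed(self):
        # On dynamic creation or destruction of agents, the cached winners
        # may be stale, so forget them and re-run competitions as needed.
        self.winner.clear()
```

The trade-off is the one noted above: the cache re-introduces a memory requirement indexed by full state x, which Negotiated W-learning otherwise avoids entirely.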
Negotiated W-learning could also be used during the learning of Q itself, in which case we will want to re-run the competition each time round with better Q estimates.
Theorem 5.1 shows that the instant competition (or walk through the matrix) is bounded. To be precise, in the algorithm described above, the shortest possible loop would be: Start with a random leader. All other W_i ≤ 0, e.g. all agents agree about what action to take, a_i = a_k. Loop terminates, length 1.
The longest possible loop would be: Start with a random leader, whose W-value is zero. Some other agent has W_i > 0 and takes the lead. Try out all other agents, only to come back to the original leader, whose W-value is now non-zero, and it wins. We tried out the original agent, then (n-1) other agents, then the original agent again. Total length (n+1).
So the competition length is bounded between 1 and (n+1). Here n = 8, so competitions will be of length 1 to 9. With the combination above, the competition lengths seen over a run of 40000 steps[1] were:
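The distribution of competition lengths can be measured empirically. Here is a self-contained sketch with made-up random Q-tables for n = 8 agents and nine actions; it re-states the negotiation walk inline so it runs on its own, and everything about it (names, random tables) is illustrative rather than the run reported above:

```python
import random
from statistics import mean

def competition_length(Q, rng):
    """Number of leaders tried in one negotiation walk over Q[i][a]."""
    k = rng.randrange(len(Q))
    a_k = max(Q[k], key=Q[k].get)
    W_k, length = 0.0, 1
    changed = True
    while changed:
        changed = False
        for i, q in enumerate(Q):
            if i == k:
                continue
            a_i = max(q, key=q.get)
            if q[a_i] - q[a_k] > W_k:        # agent i takes the lead
                k, a_k, W_k = i, a_i, q[a_i] - q[a_k]
                length += 1
                changed = True
                break
    return length

rng = random.Random(42)
n, actions = 8, range(9)                     # "move" 0-8, as in the House Robot
lengths = [competition_length(
               [{a: rng.random() for a in actions} for _ in range(n)], rng)
           for _ in range(10_000)]
print(min(lengths), max(lengths), round(mean(lengths), 2))
```

Over a run on the real creature one would feed in the actual per-state Q-values rather than random tables; the sketch only shows the shape of the measurement.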
This gives a (reasonably reactive) average competition length of 2.3, as illustrated in Figure 11.1.
Figure 11.1: The "reactiveness" of Negotiated Wlearning.
This is a typical snapshot of 200 timesteps,
showing how long it took to resolve competition at each timestep.
The theoretical maximum competition length here is 9.
Figure 11.1 also gives us some idea of how quickly our original method of W-learning actually gets down to a competition between only one or two agents.