School of Computing. Dublin City University.
Help on displaying equations
By Action Selection we do not mean the low-level problem of choice of action in pursuit of a single coherent goal. Rather we mean the higher-level problem of choice between conflicting and heterogenous goals. These goals are pursued in parallel. They may sometimes combine to achieve larger-scale goals, but in general they simply interfere with each other. They may not have any terminating conditions.
These two problems have often been confused in the literature - this thesis shall argue in favour of carefully separating them. All systems solve the first problem in some form. The second one has been given far less thought.
Action Selection is emerging as a central problem in the simulation of whole creatures (as opposed to the solution of isolated uninterrupted tasks). It is clear that many interesting systems possess multiple parallel and conflicting goals, among which attention must endlessly switch. Animals are the prime example of such systems.
Typically, the action selection models proposed in ethology are not detailed enough to specify an algorithmic implementation (see [Tyrrell, 1993] for a survey, and for many difficulties in translating the conceptual models into computational ones). The action selection models that do lend themselves to algorithmic implementation (e.g. see [Brooks, 1991, Rosenblatt, 1995, Blumberg, 1994, Sahota, 1994, Aylett, 1995]) then typically require a considerable design effort. In the literature, one sees formulas taking weighted sums of various quantities in an attempt to estimate the utility of actions. There is much hand-coding and tuning of parameters (e.g. see [Tyrrell, 1993, §9]) until the designer is satisfied that the formulas deliver utility estimates that are fair.
In fact, there may be a way that these utility values can come for free. Learning methods that automatically assign values to actions are common in the field of Reinforcement Learning (RL), or learning from rewards. Reinforcement Learning propagates numeric rewards into behavior patterns. The rewards may be objective, external value judgements, or just internally generated numbers. This thesis compares a number of different methods of further propagating these numbers to solve the action selection problem.
The low-level problem of pursuing a single goal can be solved by straightforward RL, which assumes such a single goal. For the high-level problem of choice between conflicting goals we try various methods exploiting the low-level RL numbers. The methods range from centralised and cooperative to decentralised and selfish.
Instead of presenting yet another domain-specific program, this thesis will be searching for general principles across domains. It will be trying to categorize all the different ways in which the action selection of behaviors can be organized automatically.
Terminology in this area is a problem. And one I'm not necessarily going to solve here. These are the crucial terms I use and their explanation:
Behavior (e.g. [Sahota, 1994, Mataric, 1994]) is probably the term in most common use, but is often used for modules that are not autonomous actors. For somewhat-autonomous, somewhat-competing modules within a single body, [Brooks, 1986] uses layer (though [Brooks, 1994] also uses process), [Blumberg, 1994] uses activity (as does the ethologist McFarland), [Tyrrell, 1993] uses system and subsystem (after the ethologist Baerends) and [Selfridge and Neisser, 1960] use demon. Other names, from psychology, include simpleton and homunculus (of the non-intelligent kind).
The word agent is still not altogether satisfactory since it is being claimed by other fields. Currently multi-agent learning implies multi-bodies (e.g. [Mataric, 1994, Tan, 1993, Grefenstette, 1992]). Symbolic AI agents imply communication, where agents negotiate among themselves (this is not altogether against the meaning of the term here). And in the long term, agent may mean Internet alter egos acting as an "agency" on behalf of the user.
The usage is from Brooks [Brooks, 1991]. The creature may be inhabited by a single monolithic agent (a unitary mind) or a family of agents (a decentralised mind).
Throughout this thesis, the notation:
means we adjust the estimate D once in the direction of the sample d. For example, if we store D explicitly we may update:
where is the learning rate. See appendix §A to understand how sampling with a learning rate works. Alternatively, if we are using a neural network to return an estimate D, we may backpropagate the error D - d .
means that over many such updates the estimate D converges to the expected value of the samples d.
This dissertation can be divided into the following parts:
The reader familiar with RL may wish to skip Chapters 2 and 3.
First we introduce Reinforcement Learning.
Return to Contents page.