Dr. Mark Humphrys, School of Computing, Dublin City University.

# Basic Idea: Learning from Rewards

Unlike supervised learning (learning from exemplars), we do not tell the learner the correct "class" / action. Instead we give sporadic, indirect feedback (a bit like "this classification was good/bad").

e.g. Move your muscles to play basketball. I can't articulate what instructions to send to your muscles / robot motors and in what order. But a child could sit there and tell you when you have scored a basket. In fact, even a machine could detect and automatically reward you when a basket is scored.


Rebound Rumble robot basketball competition (part autonomous, part remote-controlled).


# Clocksin and Moore - Traffic Junction

Clocksin, William F. and Moore, Andrew W. (1989), Experiments in Adaptive State-Space Robotics, Proceedings of the 7th Conference of the Society for Artificial Intelligence and Simulation of Behaviour (AISB-89).


Translated into the terms we will be using:

1. Observe the state of the world x = (p, s):
the position and speed of the car on the main road.
p - 21 values
s - 20 values
So x has 21 × 20 = 420 possible values.

2. Take action a = (c, n):
c - which pedal - 2 values (accelerate, brake)
n - how much (press the pedal this hard) - 5 values
So there are 2 × 5 = 10 possible actions a.

3. Observe the outcome: not crossed, crossed, or collision.

1. Many more states than actions.
2. Both are multi-dimensional.
3. The definition of x and a is very much under our control. We could make it more coarse-grained or fine-grained.

If we tried out every possible action in every possible state, there would be 420 × 10 = 4200 experiments to carry out.
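The size of this search space can be sketched directly. The variable names and ranges below just follow the counts above; none of this code is from the paper.

```python
# Sketch of the traffic-junction state/action space described above.
# Names (P_VALUES etc.) are illustrative, not from Clocksin and Moore.
from itertools import product

P_VALUES = 21   # discretised positions p of the car on the main road
S_VALUES = 20   # discretised speeds s
C_VALUES = 2    # which pedal c: accelerate or brake
N_VALUES = 5    # how hard n to press the pedal

states  = list(product(range(P_VALUES), range(S_VALUES)))   # x = (p, s)
actions = list(product(range(C_VALUES), range(N_VALUES)))   # a = (c, n)

print(len(states))                  # 420 possible states
print(len(actions))                 # 10 possible actions
print(len(states) * len(actions))   # 4200 state-action pairs to try exhaustively
```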


The alternative: build a model of the physics.
Take the distance (p - junction).
Compute the time for the car to cover that distance given speed s.
Compare it with the time it takes the agent to cross the road.
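A minimal sketch of this model-based decision, assuming made-up units and a made-up crossing time; the function name and all numbers are illustrative.

```python
# Sketch of the "build a model of the physics" approach.
# All numbers, units, and the function name are illustrative assumptions.

def safe_to_cross(p, s, junction=0.0, crossing_time=4.0):
    """Cross only if the car takes longer to reach the junction
    than the agent takes to cross the road."""
    distance = abs(p - junction)   # distance from car to junction
    if s <= 0:
        return True                # car is not approaching
    time_for_car = distance / s    # time for car to cover distance at speed s
    return time_for_car > crossing_time

print(safe_to_cross(p=100.0, s=10.0))  # car arrives in 10 s, we need 4 s -> True
print(safe_to_cross(p=20.0, s=10.0))   # car arrives in 2 s -> False
```

Note that every quantity here (the distance, the speed, the crossing time) must be known and accurate in advance, which is exactly the restriction discussed next.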

Problems / Restrictions:

1. We need the model in the first place. This needs a controlled world, e.g. a factory environment.

2. The model must be accurate, e.g. the dynamics of a robot arm.

3. The world changes (e.g. arm friction increases) - we have to re-program.
But the programmer is long gone.


# State-Space Approach

Look at consequences of actions.
"Let the world be its own model"
If action a worked, keep it.
If not, explore another action a2.
After many iterations, we learn the correct action patterns to any level of granularity.
And we never had to understand how the world worked!

We learn the mapping:

x, a -> y
initial state, action -> new state
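The loop above can be sketched as a trial-and-error learner. The environment function passed in is a stand-in assumption; the point is that the learner only records observed transitions and never models the physics.

```python
# Sketch of the state-space idea: try actions, keep what worked,
# never model the physics. The `world` function is a stand-in assumption.
import random

policy = {}     # learned mapping x -> a
outcomes = {}   # learned mapping (x, a) -> y, filled in by experience

def learn_step(x, actions, world):
    """Pick the remembered action if one worked before, else explore."""
    a = policy[x] if x in policy else random.choice(actions)
    y, success = world(x, a)   # "let the world be its own model"
    outcomes[(x, a)] = y       # record the observed transition x, a -> y
    if success:
        policy[x] = a          # action worked: keep it
    else:
        policy.pop(x, None)    # action failed: explore others next time
    return y
```

After many calls, `policy` converges on actions that worked, with no physics model anywhere.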
1. This approach will work whether we cross the road using wings or fins, or view the world through reversing glasses.

2. Can adjust (re-learn) as world changes.

3. More plausible that evolution could have worked this way (fill in the "boxes") rather than building physics models.

4. Another reason to use state-space (or other) learning is simply that the task is tedious to program, which may mean expensive to program (programmers aren't free).


Learning adapts to the actual laws of physics and of the body:
Faith, a dog born with no front legs, learned to walk on two legs.


# Can you do exhaustive search?

If you can do exhaustive search, you don't need RL or any other complex learning.

• This is from: Abstraction of State-Action Space Utilizing Properties of the Body and the Environment, Kazuyuki Ito, So Kuroe and Toshiharu Kobayashi, 6th IEEE International Conference on Intelligent Systems (IS'12), Sofia, Bulgaria, September 6-8, 2012.
• This research tried to get a robot to learn to navigate towards a light over randomly placed rubble.
• Their problem with learning what action a to take in each state x was that the robot never saw exactly the same state twice.
• Their solution (perhaps dodging the issue!) was to solve the climbing problem in the hardware. The hardware allows the robot to move over any obstacle.
• Software problem becomes just: x = (direction of light), a = (go to light).
• Only 25 states x. It could now be done by exhaustive search: try every action a in every state x. We don't even need learning.
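With a space this small, "try everything and keep the best" fits in a few lines. The toy reward function below is an illustrative assumption (it simply favours heading toward the light), not the paper's setup.

```python
# Sketch of exhaustive search over a tiny state-action space.
# The states, actions, and reward function are illustrative assumptions.

def exhaustive_policy(states, actions, reward):
    """Try every action in every state; keep the best. No learning needed."""
    return {x: max(actions, key=lambda a: reward(x, a)) for x in states}

states = list(range(25))    # 25 discretised directions of the light
actions = list(range(25))   # a = heading to take

def toy_reward(x, a):
    # stand-in for running the robot: best when heading matches the light
    return -abs(x - a)

policy = exhaustive_policy(states, actions, toy_reward)
print(policy[12])   # 12: head straight at the light
```

This is only 25 × 25 = 625 trials, which is why no learning machinery is needed.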

More usual: we only have time to try some actions in some states.


# x,a -> y

Multiple y's:
e.g. If you are in state x and take action a
50% of the time you will end up in state y1
and 50% of the time you will end up in state y2

e.g. x = (7,5)
a = (1,5)
y1 = (6,5)
y2 = (7,5)
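The stochastic transition above can be sketched by sampling; the 50/50 split and the state values come straight from the example, while the function name is an assumption.

```python
# Sketch of the stochastic transition above: from x = (7, 5), action a = (1, 5)
# leads to y1 = (6, 5) half the time and y2 = (7, 5) the other half.
import random

def step(x, a):
    return (6, 5) if random.random() < 0.5 else (7, 5)

# sampling many transitions shows roughly a 50/50 split
samples = [step((7, 5), (1, 5)) for _ in range(10000)]
frac_y1 = samples.count((6, 5)) / len(samples)
print(frac_y1)   # roughly 0.5
```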


# E(r) exists, E(y) doesn't exist

We can average rewards over many runs to get the "expected reward" (the average reward you will get over many events).

"Expected next state = ½ (y1 + y2)"

In the example above, ½ (y1 + y2) = (6 ½, 5)
Expected state?
If you take action a, do you ever go to this state?
Does this state even exist?
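This contrast can be made concrete by sampling. The rewards attached to each outcome below are made-up illustrative numbers; the two states are from the example above.

```python
# Expected reward is meaningful: it is an average of numbers.
# "Expected state" may not be: averaging (6,5) and (7,5) gives (6.5, 5),
# a "state" the system never actually visits and which may not exist.
# The reward values (1.0 / 0.0) are illustrative assumptions.
import random

def step():
    return ((6, 5), 1.0) if random.random() < 0.5 else ((7, 5), 0.0)

runs = [step() for _ in range(10000)]

expected_reward = sum(r for _, r in runs) / len(runs)   # a real quantity
avg_state = tuple(sum(c) / len(runs) for c in zip(*(y for y, _ in runs)))

print(expected_reward)   # close to 0.5
print(avg_state)         # close to (6.5, 5.0): never an actual state
```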


# Clocksin and Moore paper

They mention the following without noting that both of these may be difficult:
1. Find "neighbouring" states x.
2. Get "average" of multiple actions a.
They make connections to what animals, children, and adults do.

# Writing a program to write a program

The machine can write a program x -> a only if we can think of a program that will write this program.

This may require restricting the domain. e.g. Below we will restrict ourselves to writing a stimulus-response program - a well-understood model, for which our program will provably write an optimal solution.

Genetic Programming is a program to write any general-purpose program - too far, too fast?


# Sample applications of Reinforcement Learning


Robot playing air hockey by Reinforcement Learning.
From Darrin C. Bentivegna.

Robot learning to flip pancakes by RL.
From Petar Kormushev.

Google DeepMind's Deep Q-learning (RL) playing Atari Breakout.

