Dr. Mark Humphrys, School of Computing, Dublin City University.


# The "Back-propagation" learning algorithm

Remember notation:

We are actually going to get rid of the thresholds for the moment (explained later), so that we have:

xj = Σi wij Ii ,   yj = 1/(1+e^-xj)

and:

xk = Σj wjk yj ,   yk = 1/(1+e^-xk)

# Error term

We sent in an input, and it generated, in the output nodes, a vector of outputs yk.
The correct answer is the vector of numbers Ok.
The error term is:

E = 1/2 Σk (yk - Ok)^2

We take the squares of the errors, otherwise positive and negative errors may cancel each other out.

## Q. Show an example where the error term fails if we don't take the squares.

There are other possible measures of error (recall Distance in n-dimensions), but we can agree that if this measure -> 0 then all other measures of error -> 0.

Q. Prove that if this measure of E = 0 then yk = Ok for all k.
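To see why the squares matter, here is a minimal numeric sketch (the output and target values are invented): without squaring, equal and opposite errors cancel, and the total "error" looks like zero even though both outputs are wrong.

```python
# Hypothetical outputs and targets, chosen so the errors are +0.5 and -0.5
y = [0.75, 0.25]   # network outputs yk
O = [0.25, 0.75]   # correct outputs Ok

# Without squaring, the +0.5 and -0.5 errors cancel out
unsquared = sum(yk - Ok for yk, Ok in zip(y, O))

# The squared error E = 1/2 * sum (yk - Ok)^2 cannot cancel
E = 0.5 * sum((yk - Ok) ** 2 for yk, Ok in zip(y, O))

print(unsquared)   # 0.0 - looks like no error, but both outputs are wrong
print(E)           # 0.25 - the real error is visible
```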


# The Learning algorithm - Back-propagation

To look at how to reduce the error, we look at how the error changes as we change the weights. We start at the layer immediately before the output. Working out the effects of earlier layers will be more complex.

First we can write total error as a sum of the errors at each node k:

E = Σk Ek
where Ek = 1/2 (yk - Ok)^2

Now note that yk, xk and wjk each only affect the error at one particular output node k (they only affect Ek).
So from the point of view of these 3 variables, total error:

E = (a constant) + (error at node k)
hence:
(derivative of total error E with respect to any of these 3 variables) = 0 + (derivative of error at node k)
e.g.
∂E/∂yk = 0 + ∂Ek/∂yk
We can see how the error changes as yk changes, or as xk changes. But note we can't change yk or xk - at least not directly. They follow in a predetermined way from the previous inputs and weights.

But we can change wjk


# Partial derivatives

E is a function of many variables.
We will be repeatedly holding all except one constant and seeing how E changes as just one variable is changed.
This is what we mean by using partial derivatives.
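As a concrete sketch (the function E here is invented purely for illustration), a partial derivative can be estimated by nudging one variable while holding the other constant:

```python
# Made-up error function of two weights, just to illustrate partial derivatives
def E(w1, w2):
    return w1 ** 2 + 3 * w1 * w2

h = 1e-6
w1, w2 = 2.0, 5.0

# dE/dw1: vary w1 only, holding w2 constant (analytically 2*w1 + 3*w2 = 19)
dE_dw1 = (E(w1 + h, w2) - E(w1, w2)) / h

# dE/dw2: vary w2 only, holding w1 constant (analytically 3*w1 = 6)
dE_dw2 = (E(w1, w2 + h) - E(w1, w2)) / h

print(dE_dw1, dE_dw2)
```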

# Derivatives of E - output layer:

∂E/∂yk = (yk - Ok)
∂yk/∂xk = yk (1 - yk)
∂xk/∂wjk = yj

hence:

∂E/∂wjk = yj yk (1 - yk) (yk - Ok)

As we work backwards, the situation changes. yj feeds forward into all of the output nodes. Since:

E = (sum of errors at k)
we get:
(derivative of E) = (sum of derivatives of error at k)
xj and wij then only affect yj (though yj affects many things).

We can't (directly) change yj or xj

But we can change wij


# Derivatives of E - previous layer:

To spell it out:
∂E/∂yj   = Σk ∂Ek/∂yj
= Σk (∂Ek/∂xk) (∂xk/∂yj)
= Σk (∂E/∂xk) (∂xk/∂yj)

= Σk wjk yk (1 - yk) (yk - Ok)

and hence:

∂E/∂wij = Ii yj (1 - yj) Σk wjk yk (1 - yk) (yk - Ok)
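A numeric check of this chain rule on an invented mini-example (one hidden value yj feeding two output nodes; weights and targets made up), using ∂xk/∂yj = wjk and ∂E/∂xk = yk(1-yk)(yk-Ok):

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

# Invented mini-example: one hidden value yj feeding 2 output nodes k
wjk = [0.5, -1.0]   # weights from hidden node j to the output nodes
O   = [1.0, 0.0]    # correct outputs Ok

def error(yj):
    E = 0.0
    for k in range(2):
        xk = wjk[k] * yj          # xk = sum over j of wjk*yj (only one j here)
        yk = sigmoid(xk)
        E += 0.5 * (yk - O[k]) ** 2
    return E

yj = 0.8

# Analytic chain rule: dE/dyj = sum_k (yk - Ok) * yk(1 - yk) * wjk
analytic = sum(
    (sigmoid(wjk[k] * yj) - O[k])
    * sigmoid(wjk[k] * yj) * (1 - sigmoid(wjk[k] * yj))
    * wjk[k]
    for k in range(2))

# Numerical check: perturb yj slightly and watch how E changes
h = 1e-6
numeric = (error(yj + h) - error(yj)) / h

print(analytic, numeric)   # the two agree closely
```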

# Changing the weights to reduce the error

Now we have an equation for each weight - how the error changes as you change that weight.

Note some things:

1.   E >= 0   - it can't be negative.
2. So we can't have the line just going down forever. It must level out, reaching slope = 0 at some point (perhaps many points), although in theory it could be asymptotic to the x-axis (but see infinite weights).
3. In fact, might not be able to get E = 0 for all exemplars given limited net architecture, so best E might still be   > 0.
4. W can be negative. In fact, best W might be.
5. As   W -> infinity   (or minus infinity) error almost certainly gets very bad. See infinite weights. Best W will be something finite.
Now, to reduce error:
On the RHS, the slope is positive:   ∂E/∂W > 0.
Move left to reduce error:
W := W - C ∂E/∂W
where C is a positive constant.

On the LHS, the slope is negative:   ∂E/∂W < 0.
Move right to reduce error:
W := W - C ∂E/∂W
= W - C (negative quantity)
= W + (positive quantity)

Hence the same update rule works for both positive and negative slopes:

W := W - C ∂E/∂W

The constant C is the LEARNING RATE   0 < C < 1.
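A sketch of the rule on a made-up one-weight error curve E(W) = (W - 3)^2, whose slope is dE/dW = 2(W - 3): the same update W := W - C (dE/dW) walks downhill from either side of the minimum.

```python
# Made-up error curve E(W) = (W - 3)^2 with its minimum at W = 3
def slope(W):
    return 2 * (W - 3)          # dE/dW

C = 0.1                          # learning rate, 0 < C < 1

def descend(W):
    # repeatedly apply W := W - C * dE/dW
    for _ in range(100):
        W = W - C * slope(W)
    return W

W_right = descend(10.0)   # positive slope: W decreases, moving towards 3
W_left = descend(-4.0)    # negative slope: W increases, moving towards 3
print(W_right, W_left)    # both converge close to 3
```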


# The Back-propagation algorithm [REFERENCE]

The weight update rules are:

wjk := wjk - C yj yk (1 - yk) (yk - Ok)

wij := wij - C Ii yj (1 - yj) Σk wjk yk (1 - yk) (yk - Ok)

Compare Perceptron Learning Rule.

1. Note that   yj is positive, and yk(1-yk) is positive, so if (yk-Ok) is positive (i.e. output is too big), then ∂E/∂wjk is positive, and our learning rule reduces the weights. This will reduce the output.

2. Similarly you can see that if the output yk is too small, the learning rule increases the weights, which will increase the output.

3. Note that if E = 0, the update rule will cause no change in weights wij or weights wjk. Why?

4. Note that if Ii = 0, none of the weights wij (for all j) will be changed. Why? Why is this a good thing?

5. Note that if yj = 0, none of the weights wjk (for all k) will be changed. Why? Why is this a good thing?
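The whole algorithm can be sketched in a few lines. Everything here (network size, input, target, learning rate, a single exemplar) is an invented toy example; the update rules are the ones given above.

```python
import math, random

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

# Toy network: 3 inputs, 4 hidden, 2 outputs, no thresholds, one exemplar
NI, NH, NO = 3, 4, 2
random.seed(1)
wij = [[random.uniform(-1, 1) for _ in range(NH)] for _ in range(NI)]
wjk = [[random.uniform(-1, 1) for _ in range(NO)] for _ in range(NH)]

I = [0.2, 0.9, 0.5]     # input exemplar (invented)
O = [1.0, 0.0]          # correct output (invented)
C = 0.5                 # learning rate

def forward():
    yj = [sigmoid(sum(I[i] * wij[i][j] for i in range(NI))) for j in range(NH)]
    yk = [sigmoid(sum(yj[j] * wjk[j][k] for j in range(NH))) for k in range(NO)]
    return yj, yk

def error(yk):
    return 0.5 * sum((yk[k] - O[k]) ** 2 for k in range(NO))

_, yk = forward()
E_before = error(yk)

for _ in range(100):
    yj, yk = forward()
    # output layer: dE/dwjk = yj * yk(1-yk) * (yk - Ok)
    dk = [yk[k] * (1 - yk[k]) * (yk[k] - O[k]) for k in range(NO)]
    # hidden layer: dE/dwij = Ii * yj(1-yj) * sum_k wjk * dk  (uses old wjk)
    dj = [yj[j] * (1 - yj[j]) * sum(wjk[j][k] * dk[k] for k in range(NO))
          for j in range(NH)]
    for j in range(NH):
        for k in range(NO):
            wjk[j][k] -= C * yj[j] * dk[k]
    for i in range(NI):
        for j in range(NH):
            wij[i][j] -= C * I[i] * dj[j]

_, yk = forward()
E_after = error(yk)
print(E_before, E_after)   # the error has dropped
```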


# Hill-climbing/descending

1. If slope is large then our rule causes us to make large jumps in the direction that seems to be downhill.

2. We do not know this will be downhill. We do not see the whole landscape. All we do is change the x value and hope the y value is smaller. It may turn out to have been a jump uphill.

3. As we approach a minimum, the slope must approach zero and hence we make smaller jumps. As we get closer to the minimum, the slope is even closer to zero and so we make even smaller jumps. Hence the system converges to the minimum. Change in weights slows down to 0.

4. If slope is zero we do not change the weight.

5. As long as slope is not zero we will keep changing w. We will never stop until slope is zero.

## This is Hill-climbing/descending (aka "gradient descent/ascent"). How do we know this won't get stuck in local minima? Answer - we don't.

Note that ∂E/∂wjk is large when:

• yk(1-yk) is large (in the middle of the sigmoid curve), or:
• (yk-Ok) is large.
That is, we have the greatest change in weights when:
1. the output is least certain (as to whether it should fire or not fire).
2. the error E is large (we are high up the mountain). For large errors, we have a greater change in weights, so we get down the mountain quickly, and only slow down on the lower slopes. If we get stuck in a local minimum, it will at least be lower down.
3. the slope is large.


# Multiple descents

1. Remember this is just the landscape for this weight. There are other landscapes for the other weights.

2. Also, for this weight, this is just the landscape for this exemplar. On the next time step we will be adjusting this weight in relation to the error landscape for the next exemplar. We mix up the exemplars. Rather than single-mindedly descending one mountain, we are trying to descend multiple mountains at once. Different weights do eventually end up specialised on different zones of the input space, i.e. specialised on one particular descent (or rather a family of similar descents). But this takes time. At the start, things are much more noisy. Forward to How does specialisation happen?

3. Since all the other weights are constantly changing, even when, next time round, the same exemplar is shown to the same weight, the error landscape will be different.

4. We don't actually have an equation for the landscape. All we have is a single point on the x axis and a corresponding value for E and the slope (at this point only). We have to make a decision based on this. Next time round there will be a different landscape and a different point.

What is also often done is that the learning rate C starts high and declines as learning goes on.


# "Noisy" descent

Let us consider why a "noisy" gradient descent (large jumps, large weight changes, including possible jumps uphill) - where the noise dies down as time goes on (approach area of global minimum, smaller jumps, smaller weight changes, settle in minimum) - is often a better idea than a strict downhill descent. Consider a landscape like the following:

If the ball strictly moves downhill, and it starts at a random point, it is more or less equally likely to end up in the local minimum as in the global one.

If we shake the whole system, however, we are more likely to shake it from local to global than vice versa. If we shake violently, getting it out of minima becomes very probable, but at the cost of making it more likely to actually shake it out of the global minimum into a local one. If we shake gently, getting it out of minima is harder but when it does it is more likely to go from local to global.

So what we do is start by shaking violently, and gradually shake less as time goes on.
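One minimal way to sketch this idea (the landscape and all constants are invented): random jumps whose maximum size starts large and shrinks over time. This simplified variant only accepts downhill moves, but its large early jumps can still leap over the barrier between the two minima; the small late jumps settle into whichever basin we are in, which by then is very probably the global one.

```python
import random

random.seed(0)

# Made-up landscape: local minimum near w = +1, lower global minimum near w = -1
def E(w):
    return (w * w - 1) ** 2 + 0.3 * w

w = 1.0            # start in the basin of the LOCAL minimum
shake = 2.0        # initial violent shaking (maximum jump size)
for step in range(2000):
    candidate = w + random.uniform(-shake, shake)
    if E(candidate) < E(w):   # keep the jump if it landed somewhere lower
        w = candidate
    shake *= 0.999            # shake less and less as time goes on

print(w, E(w))   # settles near the global minimum around w = -1
```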


# No method of finite sampling can guarantee to find the global minimum (Blindness)

It seems unsatisfactory that we have a method that cannot guarantee to find the global minimum.

The problem comes from the blindness of our exploration. Below is the actual landscape. But we are never actually shown the overall shape of the landscape. All we do is suggest points on the X-axis (weights), and are told the point on the Y-axis. It's like we have a torch that shows us just the point on which we are standing, and everything else is in darkness.

The problem is we do not actually roll down hill (continuously). Rather we make a finite number of jumps, and we only ever explore a finite number of points.

We can repeat this for lots of points, but we need an infinite number of points to be sure that we have explored the whole curve.

If the curve is sufficiently fine-detailed (recall discussion of chaotic function), then we are not guaranteed to find that global minimum without an infinite amount of exploration:

X-axis - weight parameter.
Y-axis - error. Change weight parameter to find minimum error.
Hard to find minimum error unless program can see the whole function (as human can here).
If explore by finite sampling of a few points, may never find the pit.
Note the pit cannot have vertical sides. To be a function (one y for each x) its sides must have slope, no matter how steep.
Image thanks to Victor Terron, replacing my poorly hand-drawn version.

Note that if by chance we land in the pit, at a point on one of its steep walls, we will find a very steep slope, hence a big change in w. Hence we jump out of the pit!

1. Should we then modify the rule so that for the steepest slopes we make a small w change instead of a large one?
• But you need a small w change for small slopes too, otherwise it will never converge to slope 0. So your algorithm is a small w change for all slopes?
• The problem is that here, all things being otherwise equal, high E means high slope. Hence a small change in w for high E!

2. Could you remember altitude? If a jump brings you to a higher altitude, go back.
• Danger (if the jump range is restricted) of this being the same as a strict move downhill - ending up in a local minimum. We want to be able to jump over barriers to get to the global minimum.
• Could go back and try another jump, and repeatedly try jumps (over the entire range?) until we get lower.

In any case, we can make the mouth of the pit arbitrarily small, reducing to near zero the probability of ever finding it, no matter what method you use. The point remains that no algorithm of finite sampling can ever guarantee to find the global optimum.
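A sketch of the blindness argument (landscape invented; the pit is idealised with a flat floor and vertical sides for simplicity, which a true function's pit cannot quite have): even a fine grid of 10,000 samples steps right over a sufficiently narrow pit.

```python
# Made-up landscape: a gentle bowl, plus a very narrow pit holding the
# true global minimum. Finite sampling on a grid can miss it entirely.
def E(w):
    if 0.61 < w < 0.6101:     # arbitrarily narrow mouth of the pit
        return -100.0          # the hidden global minimum
    return w * w               # gentle bowl with its minimum at w = 0

# Sample 10,000 evenly spaced points across [-10, 10]
samples = [-10 + 20 * i / 9999 for i in range(10000)]
best = min(samples, key=E)
print(best, E(best))   # finds the bowl's bottom, never the pit
```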


We will return to this issue with evolutionary methods, which are even more blind. At least here, we have a measure of E and a measure of the slope ∂E/∂w.

In a genetic algorithm we do not even have this information. But of course that is the whole point of these methods - to be able to learn despite the fact that we have less and less information / feedback.


# Convergence Proof [DOES NOT EXIST, HEURISTIC]

With single-layer nets, for a function that is linearly separable - where we know that some configuration of the network can separate the exemplars - there can be a Convergence Proof, in the sense that we can prove that the network will learn to separate the exemplars.

For a multi-layer net, what does a Convergence Proof mean? That we can approximate the function f? To what level of accuracy? Anyway, which function? A more complex function requires a larger network. Maybe with the current network architecture we can only represent f crudely. Convergence Proof can prove that it converges to something, i.e. stops changing. But perhaps to local optimum rather than global optimum.

Rumelhart et al, 1986, do not have a proof like Rosenblatt, but can only argue for back-prop as a heuristic, which normally (but not always) works well.

The situation is summed up in Back Propagation is Sensitive to Initial Conditions, John F. Kolen, Jordan B. Pollack, 1990:

Rumelhart et al. made a confident statement: for many tasks, "the network rarely gets stuck in poor local minima that are significantly worse than the global minima."(p. 536)

The convergence claim was based solely upon their empirical experience with the back propagation technique. Since then, Minsky and Papert (1988) have argued that there exists no proof of convergence for the technique, and several researchers (Judd 1988; Blum and Rivest 1988; Kolen 1988) have found that the convergence time must be related to the difficulty of the problem, otherwise an unsolved computer science question (P=NP) would finally be answered. We do not wish to make claims about convergence of the technique in the limit (with vanishing step-size), or the relationship between task and performance, but wish to talk about a pervasive behavior of the technique which has gone unnoticed for several years: the sensitivity of back propagation to initial conditions.

Kolen and Pollack basically point out that, as a heuristic, it will find some local optimum. Given the same training set, which local optimum it finds is decided entirely by the initial weights. In general it is not feasible to perform a global search to obtain the optimal set of weights.


## Notes on Exhaustive search

More on that point. If we did not have a value for ∂E/∂w, we could in theory search for the correct network by just doing an exhaustive search on all possible sets of weights. But the search space is too big.

∂E/∂w constrains our search, but does so by limiting from the start the solutions we can find.


Also worth reading is No Harm Intended: A Review of the Perceptrons expanded edition, Pollack, 1989, where he points out that this is just normal in AI (to have a heuristic that cannot search the whole space):

Minsky and Papert point out, rightly, that Rumelhart et al have no result, like the Perceptron Convergence Theorem (Rosenblatt, 1962), that the generalized delta rule is guaranteed to find solutions when they exist, and that "little has been proved about the range of problems upon which [the generalized delta rule] works both efficiently and dependably." (p. 260.)

On the other hand, we also know that for certain classes of very hard problems, there are no algorithms which scale significantly better than exhaustive search, unless they are approximations. But there are many such impractical and approximate methods which are in practical use worldwide. Despite the fact that the Traveling Salesman Problem is NP-complete, salespersons do travel, and often optimally.


# Bounds

Note that yk and Ok are both in [0, 1].

So (yk - Ok) is from -1 to 1.

yj and yk(1 - yk) are also in [0, 1], so ∂E/∂wjk = yj yk (1 - yk) (yk - Ok) is from -1 to 1.
So our actual update of the final layer weights wjk ranges from:
wjk := wjk - C
to:
wjk := wjk + C
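A brute-force check of the bound over a grid of values (the grid resolution is arbitrary): the output-layer gradient yj yk(1-yk)(yk-Ok) never leaves [-1, 1], so each weight changes by at most C per update.

```python
# yj, yk, Ok all range over [0, 1]
grid = [i / 100 for i in range(101)]
worst = max(abs(yj * yk * (1 - yk) * (yk - Ok))
            for yj in grid for yk in grid for Ok in grid)
print(worst)   # comfortably inside the bound of 1
```

In fact the check shows the true bound is much tighter, since yk(1 - yk) is at most 1/4; the -1 to 1 range is a safe outer bound.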


# That's how we learn the weights. How do we learn thresholds?

Remember the definition of the threshold t:

x = Σi wi Ii - t ,   y = 1/(1+e^-x)

Biasing: Just make the thresholds into weights, on a link with constant input -1. We change it to:

x = Σi wi Ii + t·(-1) ,   y = 1/(1+e^-x)

and we just learn the thresholds like any other weights.
See again Perceptron Learning Rule for thresholds.
See full Sample code for backprop.

i.e. Every single hidden unit (and every single output unit) has an extra input line coming apparently from nowhere with constant input -1.
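A sketch of the biasing trick with invented numbers: subtracting the threshold t is exactly the same as giving t its own input line with constant value -1.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

# Original node with an explicit threshold: y = sigmoid(sum_i wi*Ii - t)
w = [0.7, -0.3]   # invented weights
t = 0.4           # invented threshold
I = [1.0, 0.5]    # invented inputs
y_threshold = sigmoid(sum(wi * Ii for wi, Ii in zip(w, I)) - t)

# Biased version: the threshold becomes a weight on a constant -1 input
w_biased = w + [t]
I_biased = I + [-1.0]
y_biased = sigmoid(sum(wi * Ii for wi, Ii in zip(w_biased, I_biased)))

print(y_threshold == y_biased)   # identical outputs
```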


# We need thresholds

Remember we need thresholds, otherwise the sigmoid function is centred on zero. e.g. If there are no thresholds then, no matter what the exemplars are, no matter what the I/O relationship is, and no matter what the weights are, if all inputs are 0, then the output of every single hidden node is ..

Q. Is what?

Similarly, for a node that specialises on n of the inputs (weight = 0 for the others), if that subset of inputs are all 0, then that node's output must be ..


# Designing the Network

We said we don't have to design the network, and this is true, as regards weights and thresholds.

However, we still had to make sure that there were enough input nodes to actually receive the input vector, and same with output. And there are further design decisions we made before learning even started:

• Number of hidden nodes.
• Coding (separation) of inputs.


# Designing the Network

A neural net is an approximation of a non-linear continuous function.
How good that approximation is depends crucially on how you designed it in the first place.

Backprop can't do everything.
e.g. For the function:

```f(x) = sin(x) + sin(2x) + sin(5x) + cos(x)
```
you can't design a network with 1 input, 1 hidden unit, 1 output, and expect backprop to finish the job. There is only so much such a network can represent.

With any neural net, you should design it with the approximation in mind. Design it with some a priori estimate of what the approximation will look like.

Design is part of the approximation process.
Backprop finishes the details.

And it's not simply a matter of having thousands of hidden units. That would tend towards a lookup table with limited generalisation properties. It is an empirical loop:
```
repeat
    design network architecture
    use backprop to fill in the weights

    if too much interference (can't form representation):
        increase the number of hidden units

    if too little interference (can't predict new input):
        reduce the number of hidden units
```
Further reading - Designing the Inputs.


# Test - Function approximator

See Using net as function approximator.
1 real input, n hidden, 1 real output.
Never sees the same exemplar twice!

Consider the function:

```f(x) = sin(x) + sin(2x) + sin(5x) + cos(x)
```
that we talked about earlier. We can now reveal that the learner of this function was an initially-random neural network with 1 input, 12 hidden units, and 1 output. We saw its performance after 1 million exemplars, and how it had improved by the time it had seen 5 million exemplars.

It has difficulty representing f because there are basically too few hidden units, so it can only form a crude representation.

Remember the network has not increased in size to store this more accurate representation. It still has just 12 hidden units. It has merely adjusted its weights.
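The experiment can be sketched as follows. The 1-12-1 architecture and the "fresh exemplar every time" regime come from the text; the learning rate, the input/target scaling, and the random seed are assumptions of the sketch.

```python
import math, random

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def f(x):
    return math.sin(x) + math.sin(2 * x) + math.sin(5 * x) + math.cos(x)

random.seed(42)
NH = 12       # hidden units, as in the text
C = 0.5       # learning rate (assumed)
# each hidden unit: [weight from input, threshold weight on constant -1 input]
wij = [[random.uniform(-1, 1), random.uniform(-1, 1)] for _ in range(NH)]
wjk = [random.uniform(-1, 1) for _ in range(NH + 1)]  # last = output threshold

def forward(a):
    yj = [sigmoid(w[0] * a + w[1] * -1.0) for w in wij]
    xk = sum(wjk[j] * yj[j] for j in range(NH)) + wjk[NH] * -1.0
    return yj, sigmoid(xk)

def exemplar():
    x = random.uniform(0, 2 * math.pi)             # never the same one twice
    return x / (2 * math.pi), (f(x) + 4.0) / 8.0   # input, target in (0,1)

def avg_error(n=1000):
    total = 0.0
    for _ in range(n):
        a, O = exemplar()
        _, yk = forward(a)
        total += 0.5 * (yk - O) ** 2
    return total / n

E_start = avg_error()
for step in range(50000):
    a, O = exemplar()
    yj, yk = forward(a)
    dk = yk * (1 - yk) * (yk - O)                 # output-layer delta
    for j in range(NH):
        dj = yj[j] * (1 - yj[j]) * wjk[j] * dk    # hidden delta (old wjk)
        wjk[j] -= C * yj[j] * dk
        wij[j][0] -= C * a * dj
        wij[j][1] -= C * -1.0 * dj
    wjk[NH] -= C * -1.0 * dk
E_end = avg_error()

print(E_start, E_end)   # average error falls as learning proceeds
```

With only 12 hidden units the final fit is still crude, as the text says: the net has not grown to store a better representation, it has merely adjusted its weights.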

