School of Computing. Dublin City University.
It must be more than just memory:
Human: The capital of Ireland is Dublin. Machine: OK. Human: The capital of France is Paris. Machine: OK. Human: What is the capital of Ireland? Machine: Dublin. Human: Brilliant.The machine should learn something new.
The problem of induction:
We have experience of a finite number of situations. We must generalise our experience to a new situation, where we have never been told the answer, and make a good guess, where we are not simply recovering a past answer from memory.
Human: In situation x1, answer is a1. Human: In situation x2, answer is a2. Human: In situation x3, answer is a3. Human: In situation x4, what is answer? Machine: f(x4).where the machine is "learning" the function f.
Each Input-Output pair (xi, ai) that we show to the machine is called an exemplar. Learning from exemplars like this is called supervised learning.
f(x4) might be an interpolation (if x4 is between 2 of the points x1,..,x3) or an extrapolation (if x4 is outside the previous points).
Human: In situation x1, answer is a1. Machine: OK. I am building a function f(x). Currently, f(x) = a1 for all x. Human: In situation x2, answer is a2. Machine: f is becoming a more complex mapping, that gives different results for different x. e.g. Perhaps currently f((x1+x2)/2) = (a1+a2)/2.
Human: In situation x3, answer is a3. Machine: f is getting more complex.Maybe now f is no longer a straight line. We still may estimate f((x1+x2)/2) = (a1+a2)/2, but from now we restrict ourselves to interpolations rather than extrapolations.
Human: In situation x4, what is answer? Machine: Using my current definition of f, my answer is f(x4).
If x=0, answer is 0. If x=100, answer is 100. If x=50, answer is 95. If x=25, what is answer?
The idea of Representing a function from exemplars is that "nearby" Input should generate "nearby" Output.
But some functions defeat this simple idea:
We do not expect that the learning method from exemplars will be able to learn
a chaotic function
a function in a region where it is behaving chaotically).
To represent the chaotic area the function would have to be learnt to an impractically (perhaps infinitely) high level of granularity.
However we do hope that our unknown function is not actually perverse enough to be chaotic, at least in some local regions.
We expect that the learning method can cope with any reasonably well behaved non-linear, continuous function. For instance, the function:
f(x) = sin(x) + sin(2x) + sin(5x) + cos(x)Here is the kind of thing we want - a learning method approximating the above function after having seen 1 million exemplars (top) and 5 million (bottom):
See details later.
Fractals - The same pattern repeats at different scales. Repeatedly magnify it and it looks the same. Here we could possibly represent it to reasonable accuracy (not perfect) using a finite data structure, because the range decreases. A small change does not bounce across the global range, as in the chaotic functions above.e.g. Mandelbrot set:
Note that if we have an explicit equation for f, as above, then we don't need supervised learning at all! - Represent f by equation.
Supervised learning from exemplars is entirely for situations where we have an unknown f - we simply postulate that some mapping from our inputs to our outputs must exist. e.g. Hand-writing recognition. - Represent f by adaptive data structure.
So why do I use explicit f? Because they make nice examples. Use them to imagine what an unknown f might look like. With explicit f, we can see how well the machine is doing. Use explicit f to debug your learning machine when first write it.
A typical application for which there is a learnable Input-Output relationship:
Equation for f, but can't represent it:
Whereas with the Chaos Theory demo function, we actually know explicitly what the function is, and it is completely deterministic, yet we will not be able to learn to represent it.
It is not whether a mapping is knowable explicitly or not that makes it learnable or not.
"Pattern recognition" can mean lots of things. It could mean learning a mapping from Input "state" to Output "behaviour".
e.g. Input state = (wall at certain angle, velocity at certain value), Output behaviour = one of (turn left, turn right, stop, keep going).
It means recognising a dangerous situation on the road. Medical diagnosis. Financial prediction. It means learning to walk, balance, run, jump, climb. Speech (both hearing and production). Vision. Anything where explicit rules are often hard to come by, but correct feedback is at least possible to construct.
A typically inaccurate OCR transcript of the above.
Note the points where it has trouble.
Some mappings of Input to Output - where there is only a disjoint, discrete mapping of certain inputs to certain outputs, and in no areas any continuous (or even approximation of continuous) mapping - do not lend themselves to learning, only memorisation.
Human: The capital of Ireland is Dublin. Machine: OK. Human: The capital of France is Paris. Machine: OK. Human: What is the capital of the UK (the country between Ireland and France)? Machine: Emm, my guess is "Dublaris". Human: What is the capital of Iceland? Machine: Emm, "Dublin"?There may be a function in that every argument has a unique value generated for it. But it is not continuous - no interpolation is possible.
Do we know a priori if our function f is of this sort? i.e. Is there anything to learn?
Remember f might be simply a postulated function - a function that is assumed to exist, but whose properties are unknown (and possibly unknowable):
With jockey of weight 8 stone, our horse won the race. With jockey of weight 8.5 stone, our horse lost the race. With jockey of weight 9 stone, our horse won the race. With jockey of weight 8.25 stone, will our horse win or lose? With jockey of weight 8.75 stone, will our horse win or lose?
Are these the correct inputs? Are we missing some inputs? How do we know?
If x=0, answer is 0. If x=100, answer is 100. If x=0, answer is 10. If x=100, answer is 30. If x=100, answer is 100.May be caused by missing inputs.
In fact, must be caused by missing inputs, unless the universe as a whole is probabilistic.
If input x is multi-dimensional, how do you build up this function f? How do you represent f?
f could be a function of the series of inputs x1,x2,...,xn that led up to the current input xn, rather than just a function of the immediate input xn.
No interpolation. No extrapolation. Just memory.
Given a new input x, find the past exemplar with the smallest distance from x. Return the output for that past exemplar.
If x is multi-dimensional - Then what is "nearby"?
Here is one possibility. Let x = (x1,..,xn) and y = (y1,..,yn). Define the Hamming distance between x and y as:
If x and y are vectors of binary digits, then the Hamming distance is just equal to the number of points at which x and y differ.
Now, our algorithm could be:
Given a new input x, find the past exemplar with the smallest Hamming distance from x. Return the output for that past exemplar.
The problem with this can be seen in the following. Let the exemplars be:
Temperature 22 degrees, pressure 1000 millibars. No rain. Temperature 15 degrees, pressure 990 mb. Rain.Then the question is:
Temperature 17 degrees, pressure 1010 mb. Will there be rain?The Hamming distance from the first exemplar is 5 + 10, the distance from the second is 2 + 20. So we say it is nearest to the first exemplar.
However, we have just added together two different quantities - degrees and millibars! We could just as easily have said the exemplars were:
Temperature 22 degrees, pressure 1 bar. No rain. Temperature 15 degrees, pressure 0.990 bars. Rain.and the question:
Temperature 17 degrees, pressure 1.010 bars. Will there be rain?Then the distance is 5 + 0.010 from the first, 2 + 0.020 from the second. The second exemplar is closer.
Which is correct? Which one are we closer to?
The answer is really: If it rains, we must have been closer to the second exemplar. If it doesn't rain, we must be closer to the first. "Closer" is defined by observing the output, rather than pre-defined as in a distance metric.
The same applies to any other distance metric, such as the "real" distance metric - the distance between 2 points x = (x1,..,xn) and y = (y1,..,yn) in n-dimensional Euclidean space:
(e.g. consider n=2 or n=3). Again, we can change the scale of one axis without the other, and so change the definition of the "nearest" exemplar. It is just easier to show this with the Hamming distance.
Even if we do not have the above problem of heterogenous inputs on different scales, a distance metric is still a very primitive form of predictor.
e.g. Consider classifying a bitmap of 100x100 pixel images. The components of the input are all the same type of thing - pixels that are, say, 0 (black) or 1 (white). The distance metric is the number of pixels that are different. This strategy of looking at distance from exemplars is called template matching.
The basic problem with it is that it weighs every dimension (every pixel) as of equal importance.
Consider bitmaps of faces. We are trying to distinguish happy faces from sad faces. Say we introduce a new, happy face with a similar haircut to one of the exemplars with a sad face (i.e. we were told: "This face is sad"). The haircut area of the bitmap scores so many hits under the distance metric that this becomes the exemplar that matches, and we guess that the new input is a sad face.
In fact, matching the exemplar with the same haircut / background / ears is a reasonable first guess - after all, how does the machine know in advance that it should be looking at the mouth and eyes, and ignoring the other parts?
The problem is the system cannot learn this. The distance metric is fixed. It cannot learn that some parts of the input space are irrelevant.
It would be more useful, and could be applied to much more problems, if the machine could learn which are the most important inputs, and learn to weigh some higher than others because they have a stronger affect on output. We just feed inputs into it. If they are irrelevant it will learn to ignore them.
Another issue, in the previous, is does our algorithm have to compare current input to all previous exemplars? It would be very laborious, and take a long time, to compare every pixel of the input with every pixel of a large number of previous exemplars (and it must be compared to them all, every time). And the list of exemplars to compare to keeps getting larger.
Neural Networks are a type of data structure that:
You may have heard of Neural Networks because they have been advertised as "a type of program that can learn". But they are far more than that. Traditional programs can learn too. Neural Networks have some very specific features (above) that mean they can be used for this problem.
Example - How do you play chess?
You can easily recognise a 10 euro note (no.3 recogniser machine).
But you probably cannot draw a 10 euro note (no no.2 explicit memory).
The neural network will, when finished, actually embody such a function.IF pixel 0 = white AND pixel 1 = black AND .... OR if pixel 0 = white AND pixel 1 = white AND .... OR if ... then return "3" ELSE if pixel 0 = white AND ... OR if .... then return "5" ELSE if ...