That is, multiple layers of links. Must involve what are called "hidden" nodes - nothing to do with security, this just means they are not "visible" from the Input or Output sides - rather they are inside the network somewhere.
These allow more complex classifications. Consider the network:
The 3 hidden nodes each draw a line and fire if the input point is on one side of the line.
The output node could be a 3-dimensional AND gate - fire if all 3 hidden nodes fire.
Consider the 3-d cube defined by the points: (0,0,0), (0,1,0), (0,0,1), (0,1,1), (1,0,0), (1,1,0), (1,0,1), (1,1,1)
A 3-dimensional AND gate perceptron needs to separate with a 2-d plane the corner point (1,1,1) from the other points in the cube - this is possible.
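A minimal sketch of such a 3-input AND perceptron. The weights (1, 1, 1) and threshold 2.5 are one assumed choice, not taken from the notes; they define the plane x + y + z = 2.5, which cuts the corner (1,1,1) off from the other 7 corners. The names `perceptron` and `and3` are made up for illustration.

```python
from itertools import product

def perceptron(inputs, weights, threshold):
    """Fire (return 1) if the weighted sum of the inputs exceeds the threshold."""
    total = sum(w * x for w, x in zip(weights, inputs))
    return 1 if total > threshold else 0

# Assumed values: any plane separating (1,1,1) from the rest would do.
AND3_WEIGHTS = (1, 1, 1)
AND3_THRESHOLD = 2.5

def and3(point):
    """3-dimensional AND gate: fires only if all 3 inputs are 1."""
    return perceptron(point, AND3_WEIGHTS, AND3_THRESHOLD)

if __name__ == "__main__":
    # Walk all 8 corners of the cube and show which ones make the node fire.
    for corner in product((0, 1), repeat=3):
        print(corner, and3(corner))
```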
Construct a triangular area from 3 intersecting lines in the 2-dimensional plane.
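As a sketch of this construction: 3 hidden nodes, one per line, feeding an AND output node. The triangle used here (corners (0,0), (4,0), (0,4)) and all weights and thresholds are hypothetical values chosen for illustration.

```python
def perceptron(inputs, weights, threshold):
    """Fire (return 1) if the weighted sum of the inputs exceeds the threshold."""
    total = sum(w * x for w, x in zip(weights, inputs))
    return 1 if total > threshold else 0

# Each hidden node is (weights, threshold) and draws one line in the plane.
# Hypothetical triangle with corners (0,0), (4,0), (0,4).
HIDDEN = [
    ((1, 0), 0),     # fires when x > 0
    ((0, 1), 0),     # fires when y > 0
    ((-1, -1), -4),  # fires when x + y < 4
]

def in_triangle(point):
    """Output node is a 3-input AND gate over the hidden nodes."""
    hidden_out = [perceptron(point, w, t) for w, t in HIDDEN]
    return perceptron(hidden_out, (1, 1, 1), 2.5)

if __name__ == "__main__":
    for p in [(1, 1), (3, 3), (-1, 1), (1, -1)]:
        print(p, in_triangle(p))
```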
To only fire when the point is in one of the 2 disjoint areas:
we use the following net (Just Hidden and Output layers shown. Weights shown on connections. Thresholds circled on nodes.):
Q. Do an alternative version using 4 perceptrons, 2 AND gates and a final OR gate.
2-layer network can classify points inside any n arbitrary lines (n hidden units plus an AND function).
i.e. Can classify any convex region (e.g. a convex polygon).
To classify a concave polygon (e.g. a concave star-shaped polygon), compose it out of adjacent disjoint convex shapes and an OR function. 3-layer network can do this.
3-layer network can classify any number of disjoint convex or concave shapes. Use 2-layer networks to classify each convex region to any level of granularity required (just add more lines, and more disjoint areas), and an OR gate.
Then, like the bed/table/chair network above, we can have a net that fires one output for one complex shape, another output for another arbitrary complex shape.
And we can do this with shapes in n dimensions, not just 2 or 3.
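The OR-of-ANDs idea can be sketched as follows. This is an illustration under assumed values, not a network from the notes: the two disjoint regions are hypothetical axis-aligned boxes, each built from 4 lines and an AND node, and a final OR node fires if either AND node fires.

```python
def perceptron(inputs, weights, threshold):
    """Fire (return 1) if the weighted sum of the inputs exceeds the threshold."""
    total = sum(w * x for w, x in zip(weights, inputs))
    return 1 if total > threshold else 0

def box_lines(x_lo, x_hi, y_lo, y_hi):
    """Four (weights, threshold) pairs, one per side of an axis-aligned box."""
    return [
        ((1, 0), x_lo),    # fires when x > x_lo
        ((-1, 0), -x_hi),  # fires when x < x_hi
        ((0, 1), y_lo),    # fires when y > y_lo
        ((0, -1), -y_hi),  # fires when y < y_hi
    ]

# Hypothetical disjoint convex regions.
REGIONS = [box_lines(0, 1, 0, 1), box_lines(2, 3, 2, 3)]

def in_any_region(point):
    and_out = []
    for lines in REGIONS:
        # Layer 1: one node per line.
        line_out = [perceptron(point, w, t) for w, t in lines]
        # Layer 2: AND gate, fires only if every line node fires.
        and_out.append(perceptron(line_out, [1] * len(lines), len(lines) - 0.5))
    # Layer 3: OR gate, fires if at least one AND node fires.
    return perceptron(and_out, [1] * len(and_out), 0.5)

if __name__ == "__main__":
    for p in [(0.5, 0.5), (2.5, 2.5), (1.5, 1.5), (0.5, 2.5)]:
        print(p, in_any_region(p))
```

A second output node with its own set of regions would give the "one output per shape" arrangement.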
2 connections in first layer not shown (weight = 0).
We have multiple divisions. Basically, we use the 1.5 node to divide (1,1) from the others. We use the 0.5 nodes to split off (0,0) from the others. And then we combine the outputs to split off (1,1) from (1,0) and (0,1).
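A sketch of a network matching this description. The figure is not reproduced here, so the exact weights are assumptions: two hidden nodes with threshold 0.5 that each see only one input (the other connection has weight 0), one hidden node with threshold 1.5 that sees both, and an output node with assumed weights (1, -2, 1) and threshold 0.5.

```python
def perceptron(inputs, weights, threshold):
    """Fire (return 1) if the weighted sum of the inputs exceeds the threshold."""
    total = sum(w * x for w, x in zip(weights, inputs))
    return 1 if total > threshold else 0

# Hidden layer as (weights, threshold). Values are assumed, see lead-in.
HIDDEN = [
    ((1, 0), 0.5),  # 0.5 node: fires if first input is 1 (second weight = 0)
    ((1, 1), 1.5),  # 1.5 node: fires only for (1,1)
    ((0, 1), 0.5),  # 0.5 node: fires if second input is 1 (first weight = 0)
]

# Output: the -2 lets the 1.5 node cancel both 0.5 nodes when the input is (1,1).
OUTPUT_WEIGHTS = (1, -2, 1)
OUTPUT_THRESHOLD = 0.5

def xor_net(point):
    hidden_out = [perceptron(point, w, t) for w, t in HIDDEN]
    return perceptron(hidden_out, OUTPUT_WEIGHTS, OUTPUT_THRESHOLD)

if __name__ == "__main__":
    for p in [(0, 0), (1, 0), (0, 1), (1, 1)]:
        print(p, xor_net(p))
```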
Question - How did we design this?
Answer - We don't want to. Nets wouldn't be popular if you had to.
We want to learn these weights.
Also network is to represent unknown f, not known f.
Q. Do an alternative XOR using 2 perceptrons and an AND gate.
Also interesting to note that HLLs (high-level languages) were primitive or non-existent. Computer models often focused on raw hardware/brain/network models.
Meanwhile, HLLs had been invented, and people were excited about them, seeing them as possibly the language of the mind. Connectionism went into decline. Explicit, symbolic approaches dominant in AI. Logic. HLLs.
Also HLLs no longer so exciting. Seen as part of Computer Engineering, little to do with brain. Increased interest in numeric and statistical models.
Computer Science may still turn out to be the language for describing what we are (*), just not necessarily HLL-like Computer Science.
(*) See the gung-ho talk by Minsky at ALife V, 1996 - "Computer Science is not about computers. It's the first time in 5000 years that we've begun to have ways to describe the kinds of machinery that we are."
Note: Not a great drawing. Can actually have multiple output nodes.
where σ is the sigmoid function, σ(x) = 1 / (1 + e^{-x}).
Input can be a vector.
There may be any number of hidden nodes.
Output can be a vector too.
Typically fully-connected. But remember that if a weight becomes zero, then that connection may as well not exist. Learning algorithm may learn to set one of the connection weights to zero. i.e. We start fully-connected, and learning algorithm learns to drop some connections.
To be precise, by making some of its input weights w_{ij} zero or near-zero, the hidden node decides to specialise only on certain inputs. The hidden node is then said to "represent" this set of inputs.
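A minimal sketch of a forward pass through such a network, assuming each node computes the sigmoid of (weighted sum of inputs minus threshold). The names `forward`, `t`, `W`, `T` and all numeric values are assumptions for illustration; only w_{ij} (weight from input i to hidden node j) comes from the text. Hidden node 2 has a zero weight from input 0, so changing input 0 does not change its activation.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(x, w, t, W, T):
    """x: input vector; w[i][j]: weight from input i to hidden node j;
    t[j]: hidden thresholds; W[j][k]: weight from hidden j to output k;
    T[k]: output thresholds. Returns (hidden activations, output vector)."""
    hidden = [sigmoid(sum(x[i] * w[i][j] for i in range(len(x))) - t[j])
              for j in range(len(t))]
    output = [sigmoid(sum(hidden[j] * W[j][k] for j in range(len(hidden))) - T[k])
              for k in range(len(T))]
    return hidden, output

# Illustrative values only: 2 inputs, 3 hidden nodes, 2 outputs.
w = [[0.8, -0.4, 0.0],   # from input 0; hidden node 2 ignores input 0
     [0.3, 0.9, 1.2]]    # from input 1
t = [0.1, 0.2, 0.3]
W = [[1.0, -1.0],
     [0.5, 0.5],
     [-0.7, 0.2]]
T = [0.0, 0.1]

# Same input 1, very different input 0.
hidden_a, out_a = forward([1.0, 0.5], w, t, W, T)
hidden_b, out_b = forward([9.0, 0.5], w, t, W, T)

if __name__ == "__main__":
    print("hidden node 2:", hidden_a[2], hidden_b[2])  # unaffected by input 0
    print("hidden node 0:", hidden_a[0], hidden_b[0])  # affected by input 0
```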
We need interference so we can generate a "good guess" for unseen data.
But it does seem strange that, having been told the correct answer for x, we do not simply return this answer exactly anytime we see x in the future. Why "forget" anything that we once knew? Surely forgetting things is simply a disadvantage.
We could have an extra lookup table on the side, to store the results of every input about which we knew the exact output, and only consult the network for new, unseen input.
However, this may be of limited use since inputs may never actually be seen twice. e.g. Input of continuous real numbers in robot senses. Angle = 250.432 degrees, Angle = 250.441 degrees, etc. Consider when the input has n dimensions: every dimension would need to be the same, to 3 decimal places.
If exact same input never seen twice, our lookup-table grows forever (not finite-size data structure) and is never used. Even if it is (rarely) used, consider computational cost of searching it.
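The lookup-table idea can be sketched like this. `CachedPredictor` is a hypothetical wrapper, and the "network" here is a stand-in callable, not a real trained net. A Python dict keeps each individual lookup cheap, but the table still gains an entry for every new input and only helps on an exact repeat.

```python
class CachedPredictor:
    """Hypothetical wrapper: exact-match lookup table in front of a network."""

    def __init__(self, network):
        self.network = network  # any callable mapping an input tuple to an output
        self.table = {}         # exact input -> known correct output
        self.hits = 0           # how often the table was actually used

    def teach(self, x, correct_output):
        """Store the exact answer we were told for input x."""
        self.table[tuple(x)] = correct_output

    def predict(self, x):
        key = tuple(x)
        if key in self.table:
            self.hits += 1
            return self.table[key]
        # Unseen input: fall back on the network's guess.
        return self.network(key)

if __name__ == "__main__":
    p = CachedPredictor(lambda x: 0.0)  # stand-in network, always answers 0.0
    p.teach((250.432,), 1.0)
    print(p.predict((250.432,)))  # exact repeat: answered from the table
    print(p.predict((250.441,)))  # near miss: the table is no help
```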
In the above, if the feedback the neural net gets is the same in the area 240 - 260 degrees, then it will develop weights and thresholds so that any continuous value in this zone generates roughly the same output.
On the other hand, if it receives different feedback in the zone around 245 - 255 degrees than outside that zone, then it will develop weights that lead to a (perhaps steep) threshold being crossed at 245, and one type of output generated, and another threshold being crossed at 255, and another type of output generated.
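To illustrate the second case, here is a hand-built (not learned) sketch of what such weights could look like: two sigmoid hidden units with steep thresholds at 245 and 255 degrees, combined linearly. The steepness value and the function names are assumptions.

```python
import math

def sigmoid(x):
    """Numerically stable sigmoid (avoids overflow for large negative x)."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

STEEPNESS = 5.0  # assumed; larger values give a sharper threshold

def zone_response(angle):
    """Close to 1 inside the 245 - 255 zone, close to 0 outside it."""
    above_245 = sigmoid(STEEPNESS * (angle - 245.0))  # threshold crossed at 245
    above_255 = sigmoid(STEEPNESS * (angle - 255.0))  # threshold crossed at 255
    # Linear combination of the two hidden units.
    return above_245 - above_255

if __name__ == "__main__":
    for angle in [240.0, 250.432, 250.441, 260.0]:
        print(angle, round(zone_response(angle), 4))
```

Nearby inputs such as 250.432 and 250.441 get practically the same output, which is the generalisation the lookup table could not provide.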
The network can learn to classify any area of the multi-dimensional input space in this way. This is especially useful for:
We asked: How did we design the XOR network? In fact, we don't have to design it. We can repeatedly present the network with exemplars:
Input 0 0 Output 0
Input 1 0 Output 1
Input 0 1 Output 1
Input 1 1 Output 0

and it will learn those weights and thresholds! (or at least, some set of weights and thresholds that implement XOR)
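A rough sketch of what "repeatedly presenting exemplars" could look like, assuming a standard gradient-descent (backpropagation-style) update on a small sigmoid network. The notes do not specify the learning rule at this point, so the update rule, the 3 hidden nodes, the learning rate and the random seed are all assumptions. Learning is not guaranteed to reach a perfect XOR from every starting point; the sketch only shows the error being driven down.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

EXEMPLARS = [((0, 0), 0), ((1, 0), 1), ((0, 1), 1), ((1, 1), 0)]
N_HIDDEN = 3  # assumed size of hidden layer
RATE = 0.5    # assumed learning rate

def make_net(rng):
    """Random initial weights and thresholds."""
    return {
        "w": [[rng.uniform(-1, 1) for _ in range(N_HIDDEN)] for _ in range(2)],
        "t": [rng.uniform(-1, 1) for _ in range(N_HIDDEN)],
        "W": [rng.uniform(-1, 1) for _ in range(N_HIDDEN)],
        "T": rng.uniform(-1, 1),
    }

def forward(net, x):
    hidden = [sigmoid(sum(x[i] * net["w"][i][j] for i in range(2)) - net["t"][j])
              for j in range(N_HIDDEN)]
    out = sigmoid(sum(hidden[j] * net["W"][j] for j in range(N_HIDDEN)) - net["T"])
    return hidden, out

def total_error(net):
    """Sum of squared errors over the 4 exemplars."""
    return sum((target - forward(net, x)[1]) ** 2 for x, target in EXEMPLARS)

def train(net, epochs):
    for _ in range(epochs):
        for x, target in EXEMPLARS:
            hidden, out = forward(net, x)
            # Error signal at the output node (squared error through sigmoid).
            d_out = (target - out) * out * (1 - out)
            # Error signals passed back to the hidden nodes (uses old W).
            d_hidden = [d_out * net["W"][j] * hidden[j] * (1 - hidden[j])
                        for j in range(N_HIDDEN)]
            for j in range(N_HIDDEN):
                net["W"][j] += RATE * d_out * hidden[j]
                net["t"][j] -= RATE * d_hidden[j]  # threshold enters with a minus sign
                for i in range(2):
                    net["w"][i][j] += RATE * d_hidden[j] * x[i]
            net["T"] -= RATE * d_out

if __name__ == "__main__":
    net = make_net(random.Random(0))
    print("error before:", total_error(net))
    train(net, 5000)
    print("error after:", total_error(net))
    for x, target in EXEMPLARS:
        print(x, target, round(forward(net, x)[1], 3))
```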