Dr. Mark Humphrys School of Computing. Dublin City University. My big idea: Ancient Brain Search:

# Knuth–Morris–Pratt string search algorithm

```
```

## How to speed up brute force string search

Suppose pattern = 1000011
string = ... 1010001000011 ...
Let's say in our brute force search from LHS of string, we currently have the green bits aligned and have just discovered the mismatch at the red bit = string[i]
We note that pattern[0] is different from pattern[1] .. pattern[4]

but we know that currently:
string[i-4] == pattern[0]
string[i-3] == pattern[1]
string[i-2] == pattern[2]
string[i-1] == pattern[3]

So no point moving right by 1 position and trying to match pattern[0] with string[i-3]
Instead can jump straight to string[i] and start trying to match pattern[0] from there.

To be precise, the prefix "1" of the pattern cannot be found in any of the bit we skip over "000".
Obviously a lot of choices for prefixes of the pattern we could be looking at.

```
```

## A rule that won't work

A first attempt at a rule to speed up brute force search:

"After a mismatch at string[i], restart comparing the pattern from string[i] onwards."

Why won't it work?
This will fail if beginning of pattern is in the bit we skip over.
Example 1:

pattern = 101001
string = ... 1010100111111
Let B = beginning of pattern that we lose.
B = "10"
Jumps over the match and misses it.

Obviously a lot of choices for prefixes of the pattern we could be looking at.
In above case, we should reset string index to i-2, but is it possible to leave string[i] alone and just change pattern[j]?

Example 2:

pattern = 10101001
string = ... 101010100111111
B = "1010"

Rule fails if beginning of pattern B is in the bit we skip over - This must mean that a beginning part of the pattern C (not necessarily the same beginning part as B!) must repeat twice at start of the pattern (since we matched all of the bit we skipped over).
That is, pattern = C C ..
In both above C = "10"

In Example 1, pattern starts with a repeat:
pattern[0] = pattern[2]
pattern[1] = pattern[3]
and:
string[i-4] == pattern[0]
string[i-3] == pattern[1]
string[i-2] == pattern[2]
string[i-1] == pattern[3]
that is:
string[i-2] == pattern[0]
string[i-1] == pattern[1]
So we can keep i as it is, and reset j from 4 to 2 and try matching from there

```
```

## No backing up

Leaving i the same means no backing up in the (huge) string to be searched (e.g. a file).
Only changing j position in the (small) pattern.
Don't need an ever-changing buffer of the last m characters of the string.
Just store an array of the pattern at startup, and proceed through the string always forward with no buffering.
```
```

## Knuth–Morris–Pratt string search algorithm

Start at LHS of string, string[0], trying to match pattern, working right.
Trying to match string[i] == pattern[j].

Given a search pattern, pre-build a table, next[j], showing, when there is a mismatch at pattern position j, where to reset j to.
If match fails, keep i same, reset j to position next[j].

```
```

# How to build the table

Everything else below is just how to build the table.
```
```

## Construct a table showing where to reset j to

1. If mismatch string[i] != pattern[0], just move string to i+1, j = 0
2. If mismatch string[i] != pattern[1], we leave i the same, j = 0

pattern = 10
string = ... 1100000

3. If mismatch string[i] != pattern[2], we leave i the same, and change j, but we need to consider repeats in pattern[0] .. pattern[1]

pattern = 110
string = ... 11100000
i stays same, j goes from 2 back to 1

pattern = 100
string = ... 10100000
i stays same, j goes from 2 back to 0

4. If mismatch string[i] != pattern[j], we leave i the same, and change j, but we need to consider repeats in pattern[0] .. pattern[j-1]
Given a certain pattern, construct a table showing where to reset j to.

```
```

## Construct a table of next[j]

For each j, figure out:
next[j] = length of longest prefix in "pattern[0] .. pattern[j]" that matches the suffix of "pattern[1] .. pattern[j]"
That is:
1. prefix must include pattern[0]
2. suffix must include pattern[j]
3. prefix and suffix are different

Example for pattern "ABABAC":

 j 0 1 2 3 4 5 substring 0 to j A AB ABA ABAB ABABA ABABAC longest prefix-suffix match none none A AB ABA none next[j] 0 0 1 2 3 0 notes no prefix and suffix that are different i.e. next[0]=0 for all patterns

Given j, let n = next[j]
"pattern[0] .. pattern[n-1]" = "pattern[j-(n-1)] .. pattern[j]"

e.g. j = 4, n = 3,
"pattern[0] .. pattern[2]" = "pattern[2] .. pattern[4]"

If match fails at position j+1, keep i same, reset pattern to position n.
Have already matched pattern[0] .. pattern[n-1]

e.g. We have matched ABABA so far.
If next one fails, say we have matched ABA so far and then see if next one matches.
That is, keep i same, just reset j to 3 (= precisely length of longest prefix-suffix match)
Then, if match after ABA fails too, by the same rule we say we have matched A so far, reset to j = 1, and try again from there.
In other words, it starts by trying to match the longest prefix-suffix, but if that fails it works down to the shorter ones until exhausted (no prefix-suffix matches left).

```
```

## Algorithm to construct table of next[j]

Do this once, when the pattern comes in.
pattern[0] ... pattern[m-1]
Here, i and j both index pattern.

 ``` i = 0 next[i] = 0 i++ j = 0 while ( i < m ) { if ( pattern[j] == pattern[i] ) { next[i] = j+1 i++ j++ } else ( pattern[j] != pattern[i] ) { if ( j > 0 ) j = next[j-1] else ( j == 0 ) { next[i] = 0 i++ j = 0 // redundant, just to make it clear what we are looping with } } } ```

Outputs next[0], next[1], next[2] , .... in that order.

```
```

## Step-through of table-building algorithm for pattern "ABABAC"

 step i j k = (i - j) calculating substring notes output output j 1 0 0 0 next[0] A always = 0 next[0] = 0 0 2 1 0 1 next[1] AB pattern[0] != pattern[1] failed to match pattern[0] == pattern[1] next[1] = 0 0 3 2 0 2 next[2] ABA pattern[0] == pattern[2] this is our first match, pattern[0] == pattern[2] pattern can be shifted right by k = i = 2 j = 0 means can't be shifted right by anything less than k = i = 2 let's see how long we can do this for i.e. how long we can start at pattern[0], with ever growing length, and shift right by k next[2] = 1 1 4 3 1 2 next[3] ABAB pattern[1] == pattern[3] we have already matched pattern[0] == pattern[2] so if we can match pattern[1] == pattern[3] then we get a match of length 2 note that the difference between j and i encodes a memory of: (a) when this process started (step 3) and (b) how much we are shifting right by (k) next[3] = 2 2 5 4 2 2 next[4] ABABA pattern[2] == pattern[4] we have already matched pattern[0] == pattern[2] and pattern[1] == pattern[3] so if we can match pattern[2] == pattern[4] then we get a match of length 3 next[4] = 3 3 6 5 3 2 next[5] ABABAC meaning of j: output j = 3 from last round tells us that: pattern[0]..pattern[j-1] can be shifted right by k = (i-j) to match pattern[i-j]..pattern[i-1] pattern[0]..pattern[2] = ABA can be shifted right (by 2) to match pattern[2]..pattern[4] so only need to check if pattern[3] == pattern[5] pattern[3] != pattern[5] so now our shifting right of the ever-growing pattern since step 3 fails maybe there is another pattern we can shift right though but this will be a smaller pattern (i.e. we were correct to look for larger one first) look for smaller pattern than ABA to shift right, ending in pattern[4] we know ABA itself shifts right to end in pattern[4] so the smaller pattern must be a prefix of ABA and a suffix of ABA find the longest one (in j = 2) j = next[2] = 1 7 5 1 4 output j = 1 from last round tells us that: pattern[0] can be shifted right by k = (i-j) to match pattern[i-1] pattern[0] = A can be shifted right (by 4) to match pattern[4] so check if pattern[1] == pattern[5] pattern[1] != pattern[5] j = next[0] = 0 8 5 0 5 no history back to start check if pattern[0] == pattern[5] pattern[0] != pattern[5] next[5] = 0 0

```
```

• Knuth–Morris–Pratt string search algorithm

• Once table is built, this algorithm has complexity O(n) even for binary alphabet (compare with O(mn) for brute force).
• Table is built very quickly at start since pattern is usually very small.
• Table-building part is O(m). Search part is O(n). Total complexity O(m+n) even on binary alphabet.
• Compare with O(mn) for brute force on binary alphabet.
```
```
```
```

Feeds      w2mind.org      ancientbrain.com

On the Internet since 1987.