Dr. Mark Humphrys

School of Computing. Dublin City University.

Home      Blog      Teaching      Research      Contact

Search:

CA216      CA249      CA318

CA400      CA651      CA668


Knuth–Morris–Pratt string search algorithm


How to speed up brute force string search

Suppose pattern = 1000011
string = ... 1010001000011 ...
Let's say in our brute force search from LHS of string, we currently have the green bits aligned and have just discovered the mismatch at the red bit = string[i]
We note that pattern[0] is different from pattern[1] .. pattern[4]

but we know that currently:
string[i-4] == pattern[0]
string[i-3] == pattern[1]
string[i-2] == pattern[2]
string[i-1] == pattern[3]

So no point moving right by 1 position and trying to match pattern[0] with string[i-3]
Instead can jump straight to string[i] and start trying to match pattern[0] from there.

To be precise, the prefix "1" of the pattern cannot be found in any of the bit we skip over "000".
Obviously a lot of choices for prefixes of the pattern we could be looking at.



A rule that won't work

A first attempt at a rule to speed up brute force search:

"After a mismatch at string[i], restart comparing the pattern from string[i] onwards."

Why won't it work?
This will fail if beginning of pattern is in the bit we skip over.
Example 1:

pattern = 101001
string = ... 1010100111111
Let B = beginning of pattern that we lose.
B = "10"
Jumps over the match and misses it.

Obviously a lot of choices for prefixes of the pattern we could be looking at.
In above case, we should reset string index to i-2, but is it possible to leave string[i] alone and just change pattern[j]?

Example 2:

pattern = 10101001
string = ... 101010100111111
B = "1010"

Rule fails if beginning of pattern B is in the bit we skip over - This must mean that a beginning part of the pattern C (not necessarily the same beginning part as B!) must repeat twice at start of the pattern (since we matched all of the bit we skipped over).
That is, pattern = C C ..
In both above C = "10"

In Example 1, pattern starts with a repeat:
pattern[0] = pattern[2]
pattern[1] = pattern[3]
and:
string[i-4] == pattern[0]
string[i-3] == pattern[1]
string[i-2] == pattern[2]
string[i-1] == pattern[3]
that is:
string[i-2] == pattern[0]
string[i-1] == pattern[1]
So we can keep i as it is, and reset j from 4 to 2 and try matching from there



No backing up

Leaving i the same means no backing up in the (huge) string to be searched (e.g. a file).
Only changing j position in the (small) pattern.
Don't need an ever-changing buffer of the last m characters of the string.
Just store an array of the pattern at startup, and proceed through the string always forward with no buffering.


Knuth–Morris–Pratt string search algorithm

Start at LHS of string, string[0], trying to match pattern, working right.
Trying to match string[i] == pattern[j].

Given a search pattern, pre-build a table, next[j], showing, when there is a mismatch at pattern position j, where to reset j to.
If match fails, keep i same, reset j to position next[j].



How to build the table

Everything else below is just how to build the table.

Construct a table showing where to reset j to

  1. If mismatch string[i] != pattern[0], just move string to i+1, j = 0
  2. If mismatch string[i] != pattern[1], we leave i the same, j = 0

    pattern = 10
    string = ... 1100000

  3. If mismatch string[i] != pattern[2], we leave i the same, and change j, but we need to consider repeats in pattern[0] .. pattern[1]

    pattern = 110
    string = ... 11100000
    i stays same, j goes from 2 back to 1

    pattern = 100
    string = ... 10100000
    i stays same, j goes from 2 back to 0

  4. If mismatch string[i] != pattern[j], we leave i the same, and change j, but we need to consider repeats in pattern[0] .. pattern[j-1]
Given a certain pattern, construct a table showing where to reset j to.



Construct a table of next[j]

For each j, figure out:
next[j] = length of longest prefix in "pattern[0] .. pattern[j]" that matches the suffix of "pattern[1] .. pattern[j]"
That is:
  1. prefix must include pattern[0]
  2. suffix must include pattern[j]
  3. prefix and suffix are different

Example for pattern "ABABAC":

j 0 1 2 3 4 5
substring 0 to j A AB ABA ABAB ABABA ABABAC
longest prefix-suffix match none none A AB ABA none
next[j] 0 0 1 2 3 0
notes no prefix and suffix that are different
i.e. next[0]=0 for all patterns
         


Given j, let n = next[j]
"pattern[0] .. pattern[n-1]" = "pattern[j-(n-1)] .. pattern[j]"

e.g. j = 4, n = 3,
"pattern[0] .. pattern[2]" = "pattern[2] .. pattern[4]"

If match fails at position j+1, keep i same, reset pattern to position n.
Have already matched pattern[0] .. pattern[n-1]

e.g. We have matched ABABA so far.
If next one fails, say we have matched ABA so far and then see if next one matches.
That is, keep i same, just reset j to 3 (= precisely length of longest prefix-suffix match)
Then, if match after ABA fails too, by the same rule we say we have matched A so far, reset to j = 1, and try again from there.
In other words, it starts by trying to match the longest prefix-suffix, but if that fails it works down to the shorter ones until exhausted (no prefix-suffix matches left).



Algorithm to construct table of next[j]

Do this once, when the pattern comes in.
pattern[0] ... pattern[m-1]
Here, i and j both index pattern.



i = 0

next[i] = 0
i++
j = 0


while ( i < m )
{
 if ( pattern[j] == pattern[i] )
 {
	  next[i] = j+1
	  i++
	  j++
 }
 else ( pattern[j] != pattern[i] )
 {
	 if ( j > 0 )
	  j = next[j-1]
	 
	 else ( j == 0 )
	 {
	  next[i] = 0
	  i++
	  j = 0  			// redundant, just to make it clear what we are looping with
	 }
 }
}


Outputs next[0], next[1], next[2] , .... in that order.



Step-through of table-building algorithm for pattern "ABABAC"

step i j k = (i - j) calculating substring notes output output j
1 0 0 0 next[0] A always = 0 next[0] = 0 0
2 1 0 1 next[1] AB pattern[0] != pattern[1]
failed to match pattern[0] == pattern[1]
next[1] = 0 0
3 2 0 2 next[2] ABA pattern[0] == pattern[2]
this is our first match, pattern[0] == pattern[2]
pattern can be shifted right by k = i = 2
j = 0 means can't be shifted right by anything less than k = i = 2
let's see how long we can do this for
i.e. how long we can start at pattern[0], with ever growing length, and shift right by k
next[2] = 1 1
4 3 1 2 next[3] ABAB pattern[1] == pattern[3]
we have already matched pattern[0] == pattern[2]
so if we can match pattern[1] == pattern[3]
then we get a match of length 2
note that the difference between j and i encodes a memory of:
(a) when this process started (step 3) and (b) how much we are shifting right by (k)
next[3] = 2 2
5 4 2 2 next[4] ABABA pattern[2] == pattern[4]
we have already matched pattern[0] == pattern[2] and pattern[1] == pattern[3]
so if we can match pattern[2] == pattern[4]
then we get a match of length 3
next[4] = 3 3
6 5 3 2 next[5] ABABAC meaning of j:
output j = 3 from last round tells us that:
pattern[0]..pattern[j-1] can be shifted right by k = (i-j) to match pattern[i-j]..pattern[i-1]
pattern[0]..pattern[2] = ABA can be shifted right (by 2) to match pattern[2]..pattern[4]
so only need to check if pattern[3] == pattern[5]
pattern[3] != pattern[5]
so now our shifting right of the ever-growing pattern since step 3 fails
maybe there is another pattern we can shift right though
but this will be a smaller pattern (i.e. we were correct to look for larger one first)

look for smaller pattern than ABA to shift right, ending in pattern[4]
we know ABA itself shifts right to end in pattern[4]
so the smaller pattern must be a prefix of ABA and a suffix of ABA
find the longest one (in j = 2)

  j = next[2] = 1
7 5 1 4     output j = 1 from last round tells us that:
pattern[0] can be shifted right by k = (i-j) to match pattern[i-1]
pattern[0] = A can be shifted right (by 4) to match pattern[4]
so check if pattern[1] == pattern[5]
pattern[1] != pattern[5]
  j = next[0] = 0
8 5 0 5     no history
back to start
check if pattern[0] == pattern[5]
pattern[0] != pattern[5]
next[5] = 0 0






Feeds      w2mind.org

On Internet since 1987.