Dr. Mark Humphrys

School of Computing. Dublin City University.

Home      Blog      Teaching      Research      Contact

Search:

CA216      CA249      CA318

CA400      CA651      CA668


String searching



grep

grep defines a worldview on UNIX:
  1. Files in text format, not binary.
  2. Not forced to go through other people's applications.
  3. Can directly search the files yourself on the command line.
My search engine based on grep

find on DOS.

But how does grep (or other string-finding programs) work?



Introduction



Naive (brute force) algorithm

string[0] ... string[n-1]
search for:
pattern[0] ... pattern[m-1]


i = 0  	// start by trying to match from LHS of string
j = 0

repeat
{
 if ( string[i] == pattern[j] )
 {
  i++
  j++
 }
 else
 {
  i = i - j + 1   	// (*)
  j = 0    			// reset pattern to start
 }
}
until ( (j >= m) or (i >= n) )

if ( j == m ) return match  		// matches string[i-m] .. string[i-1]
 else return no match

(*) say i = 50 and we have been successfully matching pattern for a bit
started at i = 50, j = 0
now i = 57, j = 7, when suddenly it fails
go to i = 57 - 7 + 1 = 51 and try with j = 0 from there


Performance of naive algorithm

  1. Large alphabet (e.g. Alphanumeric)
    Imagine if pattern = "STING"
    string = "A STRING SEARCH CONSISTING OF SOME TEXT"
    S gets triggered 3 times before start of the match
    ST gets triggered 1 time before the match
    j++ executed 4 times before match
    in typical text search, the if condition above is true only rarely
    pointless j++'s (for aborted matches) are few
    running time is O(m+n)

  2. Small alphabet (e.g. Binary)
    However, if alphabet is small, this scales worse.
    e.g. pattern = 000000001
    string = 000000000000000000000000000000000000000000000000000000000000000100
    worst case is O(mn)



Knuth–Morris–Pratt string search algorithm




Boyer-Moore string search algorithm



What does grep use?


Boyer-Moore is used, for example, in programming language libraries for methods without regular expressions, things like:
substring(pattern,string) - return if pattern is found within string



Links



Feeds      w2mind.org

On Internet since 1987.