Dr. Mark Humphrys

School of Computing. Dublin City University.


How to write a search engine in 9 lines of Shell

The following is a fully working search engine for your web pages in 9 lines of Shell:



#!/bin/sh

echo "Content-type: text/html"
echo

echo '<html> <head> <title> Search results </title> </head> <body>'

argument=`echo "$QUERY_STRING" | sed "s|q=||"`

cd /users/homes/me/public_html

echo '<pre>'
grep -i "$argument" *html */*html | sed -e 's|<|\&lt;|g' | sed -e 's|>|\&gt;|g'
echo '</pre>'


Notes:

  1. This is an online program: a server-side CGI script that accepts input through an HTML form.

  2. "q=" assumes that your input variable is called "q" in the HTML form.

  3. Your web directories need to be readable by the user that the script runs as, or the wildcard will not expand to your files.

  4. We pipe the result of grep into an ugly-looking sed command. This sed command is needed because there are HTML tags in the results returned by grep. These will be interpreted by your browser, displaying a mess.
    To just print the HTML tags without interpreting them, we need to pipe the results through a sed command that:

    1. converts all   < characters to   &lt;
    2. converts all   > characters to   &gt;

    The command is tricky to write because "&" has special meaning to sed (in the replacement text it stands for whatever was matched) and must be escaped as "\&".
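To see why the escape matters: without the backslash, `echo '<' | sed 's|<|&lt;|'` prints `<lt;`, because the bare "&" is replaced by the matched "<". The following is a minimal sketch of the same pipeline wrapped in a function. The name `escape_html` is hypothetical, and the extra rule for "&" itself is an addition that the 9-line script does not have.

```shell
#!/bin/sh
# escape_html (hypothetical name): read text on stdin and write it to stdout
# with HTML metacharacters escaped, so a browser shows tags instead of
# interpreting them.
escape_html() {
    # In a sed replacement a bare "&" means "the text that matched",
    # so a literal ampersand has to be written as "\&".
    # "&" is handled first; otherwise the "&" in "&lt;" and "&gt;"
    # would itself be escaped by a later rule.
    sed -e 's|&|\&amp;|g' \
        -e 's|<|\&lt;|g' \
        -e 's|>|\&gt;|g'
}
```

With this defined, the grep line in the script could end in `| escape_html` instead of the two sed commands.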




Some rather essential enhancements

  1. Some extra security would be wise, e.g. process the argument with a C++ program before passing it to grep, check your PATH, etc.

  2. Consider also the case where there are spaces in the argument (multiple search words), etc. A form submitted with GET sends each space as "+".

  3. Change the output so the user can actually click on the pages returned.
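The three enhancements above might be sketched roughly as follows. This is only one possible approach, not the author's own script: the helper names `clean_query` and `link_results` are hypothetical, the input check is a crude character whitelist done in Shell rather than a separate pre-processor, and the links assume the script's output is served from the same directory as the pages.

```shell
#!/bin/sh
# clean_query (hypothetical): turn a QUERY_STRING such as "q=foo+bar"
# into a plain search term.
# Assumption: only letters, digits and spaces are allowed through.
# Percent-escapes (%XX) are not decoded; the "%" is simply dropped.
clean_query() {
    printf '%s' "$1" |
        sed 's|^q=||' |           # strip the form variable name
        tr '+' ' ' |              # forms encode a space as "+"
        tr -cd 'A-Za-z0-9 '       # delete every other character
}

# link_results (hypothetical): print one clickable link per matching file.
# $1 is the search term; the remaining arguments are the files to search.
link_results() {
    term=$1
    shift
    # -l prints only the names of files that contain a match.
    # -F treats the term as a fixed string, not a regular expression.
    # "--" stops a term beginning with "-" being read as an option.
    grep -i -l -F -- "$term" "$@" | while IFS= read -r file
    do
        printf '<a href="%s">%s</a><br>\n' "$file" "$file"
    done
}
```

In the 9-line script, `argument=$(clean_query "$QUERY_STRING")` would replace the sed line, and `link_results "$argument" *html */*html` would replace the grep line (the `<pre>` tags are then no longer needed). Note that an empty term matches every line under `-F`, so a real script would probably reject it first.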




Some further enhancements

  1. If you have more than 2 levels of web pages you may write them out explicitly as   */*/*html etc., or use a grep that can recurse into directories (e.g. grep -r, where your grep supports it), or use a recursive find first to build the filespec:
    cd /users/homes/me/public_html
    
    filespec=`find . -type f -name "*html" | tr '\n' ' '`
    
    grep -i "$argument" $filespec
    
    Since each search uses the same file list, it is more efficient to build the list once, cache it in a file (here filelist.txt), and then read it back for each search:
    read filespec < filelist.txt
    
    grep -i "$argument" $filespec
    
    (I hope you realise that a heavy-duty search engine would go further and pre-index all the files in advance, rather than grep-ing them on the spot. But simple grep is alright for a personal website.)

  2. The pages are not ranked in order of relevance, but only in the order in which grep finds them.
    Q. How would you solve this?
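One possible answer to the ranking question, offered only as a sketch: treat the number of matching lines in a file as a crude relevance score and sort on it. The helper name `rank_results` is hypothetical, and the code assumes that file names contain no ":" characters.

```shell
#!/bin/sh
# rank_results (hypothetical): list the files that match, with the file
# that has the most matching lines first.
# $1 is the search term; the remaining arguments are the files to search.
rank_results() {
    term=$1
    shift
    # grep -c prints "file:count" for each file. /dev/null is added so
    # that the file name is printed even when only one file is given.
    grep -i -c -F -- "$term" "$@" /dev/null |
        grep -v ':0$' |           # drop files with no matching lines
        sort -t: -k2,2nr |        # numeric sort on the count, highest first
        cut -d: -f1               # keep only the file name
}
```

Counting lines is a very rough measure of relevance; it says nothing about whether the term appears in a title or heading, which is where a pre-built index would do better.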



But the principle is that in Shell you can rustle up a quick search engine for your personal pages, or any subset of them, in a few lines.
For example, my own search engine, in about 55 lines of Shell (with a C++ input pre-processor for security), has most of the above enhancements.




On Internet since 1987.