Dr. Mark Humphrys

School of Computing. Dublin City University.


Search engine


Overview

  1. Write an offline search engine that searches web pages stored in your file system and produces an offline output web page with clickable links.
  2. The search engine will be an offline version of my online search engine.
  3. You may not have web pages to search, so we will test it on my sample corpus of the works of Shakespeare.
  4. Call it gweb ("grep web"). Usage:
    gweb string
  5. It searches the test corpus for the input string.
  6. It sends its output into an (offline) output web page:
    $HOME/tmp/gweb.output.html

Start with this

Start with this template:

# When testing, comment or uncomment the exec line below:
#   commented out - output goes to the screen
#   uncommented   - output goes to the file

exec > OUTPUTFILE

cd SHAKESPEAREDIR

echo '<pre>'
grep -i "$1" */*html
echo '</pre>'
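
To see what the exec line in the template does, here is a minimal sketch. It assumes a throwaway temporary file rather than the real output file, and demo_exec_redirect is a made-up name used only for illustration. After "exec > FILE", every later command in the same shell writes to FILE instead of the screen.

```shell
#!/bin/sh
# Sketch only: demo_exec_redirect is a hypothetical helper, not part of gweb.

demo_exec_redirect() {
    outfile=$1
    (
        # From this point on, standard output of this subshell goes to $outfile.
        exec > "$outfile"
        echo '<pre>'
        echo 'sample hit'
        echo '</pre>'
    )
}

tmp=$(mktemp)
demo_exec_redirect "$tmp"
cat "$tmp"      # the three lines ended up in the file, not on the screen
rm -f "$tmp"
```

The subshell (the parentheses) keeps the redirection local to the demo. In the template there is no subshell, so the redirection lasts until the script ends.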



40%

  1. Change OUTPUTFILE to the desired output file.
  2. Change SHAKESPEAREDIR to the location of my Shakespeare corpus.
  3. Test that it works with a sample search.

  4. When the above is working, pipe the grep output into a sed command, like the one in my online search engine, so that the HTML tags in the hits are printed literally instead of being interpreted by the browser.
  5. Test that it works with a sample search.
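
One possible shape for that sed step is sketched below. This is an assumption about how such a filter could look, not necessarily the command used in the online search engine. escape_html is a made-up name, and the sample line is invented.

```shell
#!/bin/sh
# Sketch only: escape_html is a hypothetical helper name.

escape_html() {
    # Escape & first; otherwise the & inside &lt; and &gt; would be escaped again.
    sed -e 's/&/\&amp;/g' -e 's/</\&lt;/g' -e 's/>/\&gt;/g'
}

# Made-up sample of one grep output line:
printf '%s\n' 'a/b.html:<p>Fish & chips</p>' | escape_html
# prints: a/b.html:&lt;p&gt;Fish &amp; chips&lt;/p&gt;
```

In the template this would sit after the grep, as in: grep ... | escape_html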


100%

  1. When the above is working:
    Make the files clickable.

  2. The basic grep above gives output like this:

    file.html: hit

  3. Pipe the grep output to a second script called "clickable", which constructs links to the files.
    "clickable" looks like this:
     
    while read line
    do
      file=`echo "$line" | [CUT BEFORE THE COLON]`
      hit=`echo "$line" | [CUT AFTER THE COLON]`

      echo "<a href=[URL] > [FILE]</a>: [HIT] <br>"
    done
    
    The bits in [SQUARE BRACKETS] you need to work out yourself.
    See how to use cut with grep output.

  4. You can now click on hits in the output page to see them (offline).
    Check the link works! You may need to adjust the path.
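
As a hint for the splitting step, here is a minimal sketch of cut applied to one line of grep output. The helper names grep_file and grep_hit and the sample line are made up for illustration. Building the URL is left out, since the right path depends on where your corpus lives.

```shell
#!/bin/sh
# Sketch only: grep_file and grep_hit are hypothetical helper names.

# Everything before the first colon (the file name).
grep_file() {
    printf '%s\n' "$1" | cut -d: -f1
}

# Everything after the first colon (the hit), keeping any later colons.
grep_hit() {
    printf '%s\n' "$1" | cut -d: -f2-
}

line='play/scene.html:Hello: world'   # made-up grep output line
grep_file "$line"    # prints: play/scene.html
grep_hit "$line"     # prints: Hello: world
```

Note the "-f2-" rather than "-f2": the hit itself may contain colons, and "-f2" alone would cut it short. If hits contain backslashes or leading spaces, "while IFS= read -r line" reads them unchanged, where plain "read line" would not.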

Test

  1. For example:
    gweb northumberland
    will show all lines in the corpus that contain "northumberland", matched case-insensitively.
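
If you want to check the grep behaviour without the full corpus, the following is a minimal sketch using a tiny throwaway corpus. The directory layout, file names and contents are invented for the demo, and demo_search is a made-up name.

```shell
#!/bin/sh
# Sketch only: demo_search is a hypothetical helper; the corpus here is invented.

demo_search() {
    term=$1
    dir=$(mktemp -d)
    mkdir "$dir/play"
    printf '%s\n' 'Enter NORTHUMBERLAND' > "$dir/play/one.html"
    printf '%s\n' 'Exit Northumberland' 'Nothing here' > "$dir/play/two.html"
    # Same grep as in the template: -i ignores case, and with several
    # files grep prefixes each hit with "file:".
    ( cd "$dir" && grep -i -- "$term" */*html )
    rm -rf "$dir"
}

demo_search northumberland
# prints:
# play/one.html:Enter NORTHUMBERLAND
# play/two.html:Exit Northumberland
```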




On Internet since 1987.