Dr. Mark Humphrys

School of Computing. Dublin City University.


Search engine


Overview

  1. Write an offline search engine to search your web pages and produce an offline output web page where you can click on links.
  2. The search engine will be an offline version of my online search engine.
  3. You may not have web pages, so we will test it on a sample corpus of the works of Shakespeare in one of my directories.

Test corpus

  1. We will test it on the works of Shakespeare from the University of Adelaide.
  2. I have a copy here:
    file:///users/gdf1/mhtest09/share/shakespeare/home.html

  3. Note you may have to paste the file:// link into your browser address bar.
    Firefox and Chrome do not allow links from http:// to file://
    Once you are in file:// mode however, you can follow links from file:// to file://

  4. Note I am sharing this through the shared file system, not through http.
    Permissions need to be:
    drwx--x--x    /users/gdf1/mhtest09
    drwxr-xr-x    /users/gdf1/mhtest09/share
    (Q. Why?)
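
    The two permission strings above correspond to octal modes 711 and 755. A minimal sketch of setting and checking them, assuming a throwaway directory from mktemp stands in for the real home directory (the variable name demo_home is illustrative only):

```shell
#!/bin/sh
# Sketch only: a temporary directory stands in for /users/gdf1/mhtest09.
demo_home=$(mktemp -d)
mkdir "$demo_home/share"

chmod 711 "$demo_home"          # drwx--x--x
chmod 755 "$demo_home/share"    # drwxr-xr-x

# Read back the mode strings (first 10 characters of the ls -ld output).
home_mode=$(ls -ld "$demo_home" | cut -c1-10)
share_mode=$(ls -ld "$demo_home/share" | cut -c1-10)
echo "$home_mode  (home)"
echo "$share_mode  (share)"

# Remove the throwaway directory.
rm -rf "$demo_home"
```

    The sketch only shows how to produce the modes; the question of why these particular modes are needed is still left open.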


For pass mark

  1. Call it gweb ("grep web").
    gweb string
  2. It searches the test corpus for the input string:
     
       cd /users/gdf1/mhtest09/share/shakespeare
       grep -i string  */*html    
    

  3. Use <pre> and sed as in the online version.

  4. N.B. You must find and delete the parts of the online version that are irrelevant to the offline version.

  5. The script sends its final output into an (offline) output web page:
    $HOME/tmp/gweb.output.html
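
The steps above can be sketched roughly as follows. This is an illustration under assumptions, not the code of the online version: a two-file temporary corpus stands in for the Shakespeare directory, the output goes to a temporary file rather than $HOME/tmp/gweb.output.html, and the sed expressions are one plausible way to escape HTML-special characters so that hits display literally inside <pre>.

```shell
#!/bin/sh
# Sketch only: tiny stand-in corpus in a temporary directory.
corpus=$(mktemp -d)
mkdir "$corpus/plays"
printf '%s\n' '<p>My lord of Northumberland</p>' > "$corpus/plays/a.html"
printf '%s\n' '<p>Nothing to see here</p>'       > "$corpus/plays/b.html"

out="$corpus/gweb.output.html"   # hypothetical output location for this demo
string="northumberland"          # in a real script this would be "$1"

cd "$corpus" || exit 1
{
  echo "<html><body><pre>"
  # Escape &, < and > so the matched HTML source is shown as text.
  grep -i "$string" */*html |
    sed -e 's/&/\&amp;/g' -e 's/</\&lt;/g' -e 's/>/\&gt;/g'
  echo "</pre></body></html>"
} > "$out"

cat "$out"
```

The ampersand is escaped first so that the &lt; and &gt; entities produced by the later expressions are not escaped a second time.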

For full marks

The above is for a pass mark. For full marks, make the files clickable.
  1. The basic grep above gives output like this:
    file.html: hit

  2. To make the files clickable, pipe the output to a second script, which does this:
     
    
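    # Read the grep output one line at a time (-r keeps backslashes as they are).
    while read -r line
    do
      file=$(echo "$line" | [CUT BEFORE THE COLON])
      hit=$(echo "$line" | [CUT AFTER THE COLON])

      # Print the file name as a link, followed by the matching text.
      echo "[LINKABLE FILENAME]: $hit <br>"
    done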
      
    

    The bits in capital letters inside square brackets you need to work out yourself!
    See cut.

  3. You can now click on hits in the output page to see them (offline).
  4. Your output page is in your directory but the hits are in mine, so the file name that grep prints will not work as a link on its own. You will have to adjust the href address so that it points to my files.
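
As a pointer for the cut step, here is how cut splits a line on a delimiter in general. The sample record and the comma delimiter are made up for illustration and deliberately differ from the grep output, so the placeholders in the loop are still left to work out.

```shell
#!/bin/sh
# Sketch only: a made-up comma-separated record, not grep output.
record="apple,red,round,sweet"

first=$(echo "$record" | cut -d, -f1)    # field 1 only
rest=$(echo "$record" | cut -d, -f2-)    # field 2 through the end of the line

echo "first: $first"
echo "rest:  $rest"
```

The -f2- form matters when the delimiter can occur more than once on a line: it keeps everything after the first delimiter rather than just one field.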


Test

  1. For example:
    gweb northumberland
    will show all lines in the corpus where "northumberland" appears, whether in upper, lower or mixed case.




On Internet since 1987.