Search engine
Overview
- Write an offline search engine to search your web pages and produce an offline output web page where you can click on links.
- The search engine will be an offline version of my online search engine.
- You may not have web pages, so we will test it on a sample corpus of the works of Shakespeare in one of my directories.
Test corpus
- We will test it on the works of Shakespeare from the University of Adelaide.
- I have a copy here:
file:///users/gdf1/mhtest09/share/shakespeare/home.html
- Note you may have to paste the file:// link into your browser address bar. Firefox and Chrome do not allow links from http:// to file://. Once you are in file:// mode, however, you can follow links from file:// to file://.
- Note I am sharing this through the shared file system, not through http. Permissions need to be:
drwx--x--x /users/gdf1/mhtest09
drwxr-xr-x /users/gdf1/mhtest09/share
(Q. Why?)
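If you want to share files of your own in the same way, the two permission strings above correspond to the octal modes 711 and 755. A minimal sketch, using a throwaway directory as a stand-in (the `demo` variable and its `share` subdirectory are assumptions for illustration, not paths from this page):

```shell
# Hypothetical stand-in for a home directory; a real path would differ.
demo=$(mktemp -d)
mkdir -p "$demo/share"

chmod 711 "$demo"         # drwx--x--x
chmod 755 "$demo/share"   # drwxr-xr-x

# List the two directories themselves (not their contents) to check the modes.
ls -ld "$demo" "$demo/share"
```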
For pass mark
- Call it gweb ("grep web").
gweb string
- It searches the test corpus for the input string:
cd /users/gdf1/mhtest09/share/shakespeare
grep -i string */*html
- Use <pre> and sed as in the online version.
- N.B. You must find and delete the parts of the online version that are irrelevant to the offline version.
- The script sends its final output into an (offline) output web page:
$HOME/tmp/gweb.output.html
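The steps above can be sketched roughly as follows. This is not the online version's code. It assumes that the job of sed is to escape HTML special characters so that matching lines display as text inside <pre>, and it uses a tiny throwaway corpus and output file (the `corpus` and `out` variables and the sample files are hypothetical) so that it runs anywhere:

```shell
# Assumption: a throwaway two-file corpus instead of the Shakespeare directory.
corpus=$(mktemp -d)
mkdir -p "$corpus/plays"
printf '%s\n' '<p>Enter NORTHUMBERLAND</p>' > "$corpus/plays/henry.html"
printf '%s\n' '<p>Exeunt</p>'               > "$corpus/plays/other.html"
out="$corpus/gweb.output.html"

{
  echo "<pre>"
  # Escape &, < and > so the browser shows the matching HTML source as text.
  ( cd "$corpus" && grep -i "northumberland" */*html ) |
    sed -e 's/&/\&amp;/g' -e 's/</\&lt;/g' -e 's/>/\&gt;/g'
  echo "</pre>"
} > "$out"

cat "$out"
```

In the real script the search string would come from the command-line argument and the output would go to the file named above.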
For full marks
The above is for a pass mark.
For full marks, make the files clickable.
- The basic grep above gives output like this:
file.html: hit
- To make the files clickable, pipe the output to a second script, which does this:
while read line
do
file=`echo "$line" | [CUT BEFORE THE COLON]`
hit=`echo "$line" | [CUT AFTER THE COLON]`
echo "[LINKABLE FILENAME]: $hit <br>"
done
The bits in capital letters inside square brackets you need to work out yourself!
See cut.
- You can now click on hits in the output page to see them (offline).
- You will have to adjust the href address if you are to click on links to my files from an output file in your directory.
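As a hint on how cut selects fields, here is a small sketch on an unrelated, made-up record that uses a comma as the delimiter. The `record`, `name` and `rest` variables are purely illustrative. Note the difference between asking for one field and asking for "this field to the end of the line", which matters when the delimiter can also appear inside the text:

```shell
# Hypothetical record, unrelated to the corpus: a name, then free text.
record='hamlet,to be, or not to be'

name=$(echo "$record" | cut -d, -f1)     # first field only
rest=$(echo "$record" | cut -d, -f2-)    # second field through end of line

echo "name=$name"
echo "rest=$rest"
```

Working out the equivalent for the grep output is still up to you.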
Test
- For example:
gweb northumberland
will show all lines in the corpus where "northumberland" appears, in upper, lower or mixed case.
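To see what the underlying grep does before wiring it into gweb, you can try it on a tiny corpus. A minimal sketch, with made-up directories and files (everything under `corpus` is hypothetical). Note that grep only prefixes each hit with the file name when it is given more than one file, and that it prints no space after the colon:

```shell
# Assumption: a three-file toy corpus laid out like dir/file.html.
corpus=$(mktemp -d)
mkdir -p "$corpus/histories" "$corpus/comedies"
printf '%s\n' 'Enter NORTHUMBERLAND' 'Exit' > "$corpus/histories/a.html"
printf '%s\n' 'My Lord Northumberland'      > "$corpus/histories/b.html"
printf '%s\n' 'No match here'               > "$corpus/comedies/c.html"

# -i ignores case, so both spellings above are found.
hits=$(cd "$corpus" && grep -i northumberland */*html)
echo "$hits"
```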