Practical


Link checker

Write a Java program to:
  1. Take a URL as a command-line argument.
  2. Work out automatically (from your IP address) whether you are inside DCU and, if so, switch to using the DCU proxy.
  3. Report whether the URL exists.
  4. If it exists, download its content, examine all the pages it links to, and display broken links.
  5. Pages it links to can be found by searching for: <a .. href

  6. See Parsing XML / HTML

  7. We will define a "broken" link as any link with an HTTP return code other than 200, or a link that times out.
  8. For timeout settings, see setConnectTimeout and setReadTimeout in java.net.URLConnection.

  9. Produce an output file listing:
    1. All the broken links.
    2. Their HTTP return codes.
    3. For every link that your program claims is broken, test the link manually in a browser. If you can view the link, explain (in your printout) why your program thinks it is broken.
  10. Do not bother listing the links that worked (return code 200).
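
Steps 2, 3, 7 and 8 above can be sketched roughly as follows. This is only one possible approach, not a model answer. The class name LinkStatus, the address prefix 136.206., the proxy host and port, the 10-second timeout and the value -1 for "no response" are all assumptions made for illustration; check the real DCU address range and proxy settings yourself.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.Proxy;
import java.net.URI;

class LinkStatus {
    // ASSUMPTION: addresses inside DCU start with this prefix. Verify before relying on it.
    static final String DCU_PREFIX = "136.206.";
    // ASSUMPTION: placeholder proxy host and port, not the real DCU proxy.
    static final String PROXY_HOST = "proxy.example.invalid";
    static final int PROXY_PORT = 8080;
    // ASSUMPTION: 10 seconds for both connecting and reading.
    static final int TIMEOUT_MS = 10_000;
    // Returned when there is no usable HTTP answer (timeout, I/O error, bad URL).
    static final int NO_RESPONSE = -1;

    static boolean isDcuAddress(String ip) {
        return ip.startsWith(DCU_PREFIX);
    }

    // Use the proxy only when the local address looks like a DCU address.
    static Proxy chooseProxy(String localIp) {
        if (isDcuAddress(localIp)) {
            return new Proxy(Proxy.Type.HTTP,
                    InetSocketAddress.createUnresolved(PROXY_HOST, PROXY_PORT));
        }
        return Proxy.NO_PROXY;
    }

    // Returns the HTTP return code, or NO_RESPONSE if the request fails or times out.
    // HttpURLConnection follows same-protocol redirects by default, so the code
    // reported is that of the final response.
    static int statusOf(String url, Proxy proxy) {
        try {
            HttpURLConnection conn =
                    (HttpURLConnection) URI.create(url).toURL().openConnection(proxy);
            conn.setConnectTimeout(TIMEOUT_MS);
            conn.setReadTimeout(TIMEOUT_MS);
            try {
                return conn.getResponseCode();
            } finally {
                conn.disconnect();
            }
        } catch (IOException | IllegalArgumentException | ClassCastException e) {
            // Timeouts are IOExceptions; malformed or non-HTTP URLs end up here too.
            return NO_RESPONSE;
        }
    }

    // "Broken" as defined in step 7: anything other than 200, including timeouts.
    static boolean isBroken(int status) {
        return status != 200;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java LinkStatus URL");
            return;
        }
        // May report a loopback address on some machines; treat this as a first attempt.
        String localIp = InetAddress.getLocalHost().getHostAddress();
        int status = statusOf(args[0], chooseProxy(localIp));
        System.out.println(args[0] + " " + status + (isBroken(status) ? " BROKEN" : " OK"));
    }
}
```

Keeping the proxy choice and the "broken" test as small separate methods makes them easy to check without touching the network.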


Test on these URLs:

Your final output should demonstrate your program working on these URLs:

http://computing.dcu.ie/~humphrys/publications.html
http://computing.dcu.ie/~humphrys/ai.links.html
http://computing.dcu.ie/~humphrys/robot.links.html
http://computing.dcu.ie/~humphrys/evolution.links.html
http://computing.dcu.ie/~humphrys/computers.internet.links.html
http://computing.dcu.ie/~humphrys/news.links.html
http://humphrysfamilytree.com/links.html
http://humphrysfamilytree.com/sources.html

Hint: Get your program working on smaller pages first, before testing it on larger pages. Some of these pages are huge!


Check for these:

These pages may contain:
  1. Relative links, like:
    <a href="subdir/file.html">
    <a href="../index.html">

  2. Links to a label on a page, like:
    <a href="#label">
    <a href="file.html#label">
    Check if the label exists on that page.

  3. href links to files that are not web pages, like:
    <a href="pic.jpg">
    Check if the file exists.

  4. Embedded image src links, like:
    <img src="pic.jpg">
    Check if the image file exists.

  5. href links that start with mailto, ftp, telnet, news or gopher rather than http.
  6. Forms (check the ACTION= link)
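
The cases above can be handled along the following lines. This is a minimal sketch under stated assumptions, not a full HTML parser: the class name LinkExtractor and its method names are made up for illustration, the regular expression only sees quoted attribute values, and a label is assumed to be declared with id= or name=. What to do with non-HTTP links (case 5) is left to you; the sketch only lets you tell them apart.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LinkExtractor {
    // Finds a quoted href=, src= or action= value inside an <a>, <img> or <form> tag.
    static final Pattern LINK = Pattern.compile(
            "<(?:a|img|form)\\b[^>]*?\\b(?:href|src|action)\\s*=\\s*([\"'])(.*?)\\1",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the raw link values in the order they appear in the page.
    static List<String> extract(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = LINK.matcher(html);
        while (m.find()) {
            links.add(m.group(2).trim());
        }
        return links;
    }

    // Resolves a relative link against the URL of the page it was found on.
    // Returns null if the raw value is not a legal URI (for example, it contains spaces).
    static URI resolve(URI page, String raw) {
        try {
            return page.resolve(raw);
        } catch (IllegalArgumentException e) {
            return null;
        }
    }

    // True only for links that can be tested with an HTTP request.
    static boolean isHttp(URI u) {
        String scheme = u.getScheme();
        return "http".equalsIgnoreCase(scheme) || "https".equalsIgnoreCase(scheme);
    }

    // Drops the #label part, giving the URL to actually request.
    static URI withoutLabel(URI u) {
        String s = u.toString();
        int hash = s.indexOf('#');
        return hash < 0 ? u : URI.create(s.substring(0, hash));
    }

    // ASSUMPTION: a label is declared as id="label" or name="label" in the target page.
    static boolean hasLabel(String html, String label) {
        Pattern p = Pattern.compile(
                "\\b(?:id|name)\\s*=\\s*([\"'])" + Pattern.quote(label) + "\\1",
                Pattern.CASE_INSENSITIVE);
        return p.matcher(html).find();
    }
}
```

For a link such as file.html#label, one option is to request withoutLabel(...) of the resolved URI, then pass the downloaded text and the value of getFragment() to hasLabel.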


Skip Google searches

  1. Google doesn't allow scripting of search results.

  2. So ignore all Google searches:
      http://www.google.DOMAIN/search?ARGS
    
    These searches never break anyway. If this search link once worked (i.e. is formatted correctly), it will always work.

  3. Do test other links to Google, such as links to its directory:
      http://www.google.DOMAIN/Top/PATH
    
    You can script these. And they need to be tested, since sometimes they break.
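
One way to recognise the searches to skip is a small test on the host and path of each link. This is a sketch only: the class name GoogleFilter and the method name isGoogleSearch are made up for illustration, and it assumes the host always begins with www.google. as in the pattern above.

```java
import java.net.URI;
import java.util.Locale;

class GoogleFilter {
    // True for links of the form http://www.google.DOMAIN/search?ARGS.
    // Other Google links, such as /Top/PATH, return false and should still be tested.
    static boolean isGoogleSearch(URI link) {
        String host = link.getHost();
        String path = link.getPath();
        return host != null
                && path != null
                && host.toLowerCase(Locale.ROOT).startsWith("www.google.")
                && path.equals("/search");
    }
}
```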

To hand up:

What to hand up (Include a printout of the output when run on the URLs above.)