Practical
Link checker
Write a Java program to:
- Take a URL as a command-line argument.
- Figure out automatically (by looking at your IP address) if you are in DCU, and if so automatically switch to using the DCU proxy.
- Report whether the URL exists or not.
- If it exists, download its content, examine all the pages it links to,
and display broken links.
- Pages it links to can be found by searching for:
<a .. href
- See Parsing XML / HTML
- We will define a "broken" link as any link with an HTTP return code other than 200, or a link that times out.
- For timeout settings try here.
- Produce an output file listing:
- All the broken links.
- Their HTTP return codes.
- For every link that your program claims is broken,
test the link manually in a browser.
If you can view the link, explain (in your printout)
why your program thinks it is broken.
- Do not bother listing the links that worked (return code 200).
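The steps above could be sketched roughly as follows. This is only one possible starting point, not a model solution. The class and method names are invented for illustration, the "136.206." address prefix for DCU is an assumption you should confirm, and the proxy host and port are placeholders (the real DCU proxy details are not given here). The regular expression is a simple search for <a .. href and is not a full HTML parser.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.Proxy;
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LinkCheckSketch {
    // ASSUMPTION: DCU addresses start with this prefix. Confirm the real range.
    static final String DCU_PREFIX = "136.206.";
    // PLACEHOLDERS: substitute the real DCU proxy host and port.
    static final String PROXY_HOST = "proxy.example.invalid";
    static final int PROXY_PORT = 8080;

    // Finds href values inside <a ...> tags, quoted with " or ', in any letter case.
    static final Pattern HREF = Pattern.compile(
        "<a\\s[^>]*?href\\s*=\\s*[\"']([^\"']*)[\"']", Pattern.CASE_INSENSITIVE);

    static boolean inDcu(String ip) {
        return ip.startsWith(DCU_PREFIX);
    }

    static List<String> extractHrefs(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }

    // Returns the HTTP return code, or -1 if the request timed out or failed.
    static int statusOf(String url, Proxy proxy, int timeoutMs) {
        try {
            HttpURLConnection c =
                (HttpURLConnection) URI.create(url).toURL().openConnection(proxy);
            c.setConnectTimeout(timeoutMs);
            c.setReadTimeout(timeoutMs);
            int code = c.getResponseCode(); // sends a GET request
            c.disconnect();
            return code;
        } catch (IOException | IllegalArgumentException | ClassCastException e) {
            // Timeout, unreachable host, malformed URL, or a non-HTTP URL.
            return -1;
        }
    }

    // "Broken" as defined above: anything other than 200, including timeouts.
    static boolean isBroken(int status) {
        return status != 200;
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: java LinkCheckSketch URL");
            return;
        }
        String ip = InetAddress.getLocalHost().getHostAddress();
        Proxy proxy = inDcu(ip)
            ? new Proxy(Proxy.Type.HTTP, new InetSocketAddress(PROXY_HOST, PROXY_PORT))
            : Proxy.NO_PROXY;
        int status = statusOf(args[0], proxy, 10_000);
        System.out.println(args[0] + " -> " + status
            + (isBroken(status) ? " (broken)" : ""));
    }
}
```

Note that InetAddress.getLocalHost() may return a loopback address on some machines, so the DCU test may need a more careful look at the network interfaces. Downloading the page content and writing the output file are left out of this sketch.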
Test on these URLs:
Your final output should demonstrate your
program working on these URLs:
http://computing.dcu.ie/~humphrys/publications.html
http://computing.dcu.ie/~humphrys/ai.links.html
http://computing.dcu.ie/~humphrys/robot.links.html
http://computing.dcu.ie/~humphrys/evolution.links.html
http://computing.dcu.ie/~humphrys/computers.internet.links.html
http://computing.dcu.ie/~humphrys/news.links.html
http://humphrysfamilytree.com/links.html
http://humphrysfamilytree.com/sources.html
Hint: Get your program working on smaller pages first,
before testing it on larger pages.
Some of these pages are huge!
Check for these:
These pages may contain:
- Relative links, like:
<a href="subdir/file.html">
<a href="../index.html">
- Links to a label on a page, like:
<a href="#label">
<a href="file.html#label">
Check if the label exists on that page.
- href links to files that are not web pages, like:
<a href="pic.jpg">
Check if the file exists.
- Embedded image src links, like:
<img src="pic.jpg">
Check if the image file exists.
- href followed by mailto, ftp, telnet, news or gopher.
- Forms (check the ACTION= link)
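Several of the cases above come down to turning the raw attribute value into an absolute URL before testing it. A minimal sketch, assuming java.net.URI is acceptable for this (the class and method names are invented for illustration): relative links and label-only links are resolved against the URL of the page they appear on, non-HTTP schemes such as mailto are detected so they can be handled separately, and a simple pattern search checks whether a label is defined on a downloaded page. The same resolve step would apply to img src and form ACTION values.

```java
import java.net.URI;
import java.util.regex.Pattern;

class LinkResolver {
    // Resolves an href, src or action value against the URL of the containing page.
    // URI.create throws IllegalArgumentException for values with illegal characters
    // (for example unescaped spaces), so real pages will need extra care here.
    static URI resolve(String pageUrl, String link) {
        return URI.create(pageUrl).resolve(link.trim());
    }

    // True only for links that can be tested with an HTTP request.
    static boolean isHttpCheckable(URI u) {
        String scheme = u.getScheme();
        return scheme != null
            && (scheme.equalsIgnoreCase("http") || scheme.equalsIgnoreCase("https"));
    }

    // The URL to fetch: everything before "#label".
    static String withoutFragment(URI u) {
        String s = u.toString();
        int i = s.indexOf('#');
        return i < 0 ? s : s.substring(0, i);
    }

    // Rough check that a label is defined via id="label" or name="label".
    // The attribute name is matched in any case; the label itself is case-sensitive.
    static boolean hasLabel(String html, String label) {
        Pattern p = Pattern.compile(
            "\\s(?i:id|name)\\s*=\\s*[\"']" + Pattern.quote(label) + "[\"']");
        return p.matcher(html).find();
    }

    public static void main(String[] args) {
        String page = "http://example.org/dir/page.html"; // illustrative page URL
        String[] links = {
            "subdir/file.html", "../index.html", "#label",
            "file.html#label", "pic.jpg", "mailto:someone@example.org"
        };
        for (String link : links) {
            URI u = resolve(page, link);
            System.out.println(link + " -> " + u
                + (isHttpCheckable(u) ? "" : " (not an HTTP link)"));
        }
    }
}
```

For a link with a label, one approach is to fetch withoutFragment(u), confirm the return code is 200, and then call hasLabel on the downloaded content with u.getFragment().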
Skip Google searches
- Google doesn't allow scripting of search results.
- So ignore all Google searches:
http://www.google.DOMAIN/search?ARGS
These searches never break anyway.
If this search link once worked (i.e. is formatted correctly), it will always work.
- Do test other links to Google, such as links to its directory:
http://www.google.DOMAIN/Top/PATH
You can script these.
And they need to be tested, since sometimes they break.
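One way to recognise the searches to skip is to look at the host and path of each link. The sketch below is an assumption about how the pattern http://www.google.DOMAIN/search?ARGS might be matched (the class and method names are invented for illustration): it treats any host of the form www.google.DOMAIN with the exact path /search as a Google search, and lets everything else, including /Top/PATH directory links, go through to normal testing.

```java
import java.net.URI;
import java.util.regex.Pattern;

class GoogleFilter {
    // Hosts such as www.google.com, www.google.ie, www.google.co.uk.
    static final Pattern GOOGLE_HOST =
        Pattern.compile("(?i)^www\\.google\\.[a-z.]+$");

    // True if the link is a Google search and should be skipped.
    static boolean isGoogleSearch(String url) {
        URI u;
        try {
            u = URI.create(url);
        } catch (IllegalArgumentException e) {
            return false; // unparseable: leave it to the normal link test
        }
        String host = u.getHost();
        return host != null
            && GOOGLE_HOST.matcher(host).matches()
            && "/search".equals(u.getPath());
    }

    public static void main(String[] args) {
        String[] samples = {
            "http://www.google.com/search?q=java",
            "http://www.google.com/Top/Computers/"
        };
        for (String s : samples) {
            System.out.println(s + (isGoogleSearch(s) ? " : skip" : " : test"));
        }
    }
}
```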
To hand up:
What to hand up
(Include a printout of the output when run on the URLs above.)