Dr. Mark Humphrys

School of Computing. Dublin City University.



XML and HTML

Machine-readable v. human-readable content.




XML



XML example

A pure data file (not a web page), listing the US states:

 
<?xml version="1.0"?>
<choices xml:lang="EN">
        <item><label>Alabama</label><value>AL</value></item>
...
        <item><label>Wyoming</label><value>WY</value></item>
</choices>
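A file like this can be read with any standard XML library. As a sketch, using Python's xml.etree (the data is inlined here for illustration, rather than read from a file):

```python
import xml.etree.ElementTree as ET

# The states file from above, inlined (middle entries elided).
xml_data = """<?xml version="1.0"?>
<choices xml:lang="EN">
        <item><label>Alabama</label><value>AL</value></item>
        <item><label>Wyoming</label><value>WY</value></item>
</choices>"""

root = ET.fromstring(xml_data)

# Walk the <item> elements and pull out the label/value pairs.
states = [(item.findtext("label"), item.findtext("value"))
          for item in root.findall("item")]

print(states)   # [('Alabama', 'AL'), ('Wyoming', 'WY')]
```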





XML example - Microsoft Office XML


Part of a Word file in XML format (DOCX):


See explanation: Microsoft Office XML file formats.






XML example - Flickr XML feeds




Note how the feed invents its own tags:

...
<entry>
        <title>by Nathan Coley</title>
        <link rel="alternate" type="text/html" href="http://www.flickr.com/photos/jeremydp/3992143711/"/>
        <id>tag:flickr.com,2005:/photo/3992143711</id>
        <published>2009-10-08T12:00:32Z</published>
        <updated>2009-10-08T12:00:32Z</updated>
        <dc:date.Taken>2009-10-03T21:59:32-08:00</dc:date.Taken>
        <content type="html"> .... </content>
        <author>
                <name>jeremyDP</name>
                <uri>http://www.flickr.com/people/jeremydp/</uri>
        </author>
        <link rel="enclosure" type="image/jpeg" href="http://farm3.static.flickr.com/2479/3992143711_1353c6f932_m.jpg" />
        <category term="paris" scheme="http://www.flickr.com/photos/tags/" />
        <category term="2009" scheme="http://www.flickr.com/photos/tags/" />
        <category term="butteschaumont" scheme="http://www.flickr.com/photos/tags/" />
        <category term="nuitblanche" scheme="http://www.flickr.com/photos/tags/" />
        <category term="nathancoley" scheme="http://www.flickr.com/photos/tags/" />
</entry>
...
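The excerpt omits the namespace declarations that appear on the feed's root element: the unprefixed tags belong to the Atom format, and dc: is Dublin Core. A namespace-aware parse might look like this sketch (the inlined entry is a minimal reconstruction for illustration, not the real feed):

```python
import xml.etree.ElementTree as ET

# A minimal hand-made entry carrying the namespace declarations that the
# real Flickr feed has on its root element (reconstructed for illustration).
entry_xml = """<entry xmlns="http://www.w3.org/2005/Atom"
       xmlns:dc="http://purl.org/dc/elements/1.1/">
    <title>by Nathan Coley</title>
    <dc:date.Taken>2009-10-03T21:59:32-08:00</dc:date.Taken>
</entry>"""

ns = {"atom": "http://www.w3.org/2005/Atom",
      "dc": "http://purl.org/dc/elements/1.1/"}

entry = ET.fromstring(entry_xml)
title = entry.findtext("atom:title", namespaces=ns)

# "date.Taken" contains a dot, which the ElementPath mini-language can
# trip over, so match the fully-qualified tag name directly.
taken = next(child.text for child in entry
             if child.tag == "{http://purl.org/dc/elements/1.1/}date.Taken")

print(title, taken)
```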
  


Machine-readable web






RSS (XML web feeds)






Parsing XML / HTML

Many programming languages have support for parsing XML / HTML.
The problem is that strict parsers may fail on badly-formed XML / HTML (i.e. a lot of the HTML found in the wild).
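To see the problem concretely, here is a sketch in Python: a strict XML parser rejects a fragment that every browser renders happily.

```python
import xml.etree.ElementTree as ET

# Typical "tag soup": unclosed <p>, bare <br> - browsers render this fine.
bad_html = "<html><body><p>Hello<br>world</body></html>"

try:
    ET.fromstring(bad_html)
    parsed = True
except ET.ParseError as e:
    parsed = False
    print("strict parser failed:", e)

print(parsed)   # False
```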


  1. Javascript


  2. Java - Parsing HTML in Swing


  3. Shell - Command-line tools that you can use in shell scripts
    • xpath - Parse well-formed XML
    • example XML file
      # extract nodes delimited by <choices>
      cat test.ajax.xml | xpath //choices

      # extract nodes delimited by <item> within those
      cat test.ajax.xml | xpath //choices//item

      # get first node only
      cat test.ajax.xml | xpath "(//choices//item)[1]"
      cat test.ajax.xml | xpath "//item[1]"

      # get text inside tags
      cat test.ajax.xml | xpath "//item[1]" | xpath "//label[1]"
      cat test.ajax.xml | xpath "//item[1]" | xpath "//label[1]/text()" > outputfile


    • Exercise - XML parsing in Shell


  4. List of HTML parsers
    • Many in this list are designed to be error-tolerant and able to parse the badly-formed HTML found "in the wild". See the ones with names like "soup" and "tidy".
    • Some are program libraries (e.g. Java libraries). Some are stand-alone command-line tools.
    • The "tag soup" concept.
    • The TagSoup Java library by John Cowan

Strategy for parsing HTML:
  1. Use an error-tolerant reader to convert badly-formed HTML to well-formed HTML.
  2. Then parse the well-formed HTML with pickier programs like xpath.
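The two-step strategy can be sketched in Python with only the standard library: the error-tolerant html.parser reads the tag soup, a small re-serializer emits well-formed XML, and the strict parser then accepts it. (This is a toy cleaner for illustration; a real project would use a tool like tidy or an error-tolerant library, and would also handle attributes, entities, and stray end tags.)

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

VOID = {"br", "img", "hr", "meta", "link", "input"}

class Cleaner(HTMLParser):
    """Error-tolerant reader: re-emits tag soup as well-formed XML."""
    def __init__(self):
        super().__init__()
        self.out = []
        self.stack = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            self.out.append(f"<{tag}/>")   # self-close void elements
        else:
            self.out.append(f"<{tag}>")    # attributes dropped in this toy sketch
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Close any unclosed inner tags first (e.g. an open <p>).
        while self.stack and self.stack[-1] != tag:
            self.out.append(f"</{self.stack.pop()}>")
        if self.stack:
            self.stack.pop()
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def result(self):
        # Close anything still open at end of input.
        while self.stack:
            self.out.append(f"</{self.stack.pop()}>")
        return "".join(self.out)

bad_html = "<html><body><p>Hello<br>world</body></html>"

# Step 1: error-tolerant pass produces well-formed markup.
cleaner = Cleaner()
cleaner.feed(bad_html)
good_xml = cleaner.result()
print(good_xml)   # <html><body><p>Hello<br/>world</p></body></html>

# Step 2: the strict parser now accepts it.
root = ET.fromstring(good_xml)
print(root.find("body/p").text)   # Hello
```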



XHTML




A screenshot from an XHTML ebook shows the utopian vision of XHTML.


  

Re-write the Web?

 


My first decent mobile Internet device: XDA Exec = HTC Universal (2005).
This had no problem rendering malformed HTML.




Strict, well-formed data is good (so long as not compulsory)

There is a good rule: "Be conservative in what you send, be liberal in what you accept." For a new project, why not output strict, well-formed data (XHTML, or validated HTML)? It will make it easier for your team to re-purpose the content in the future.

I am only pointing out that strictness cannot cover the entire web (hence "be liberal in what you accept"). One must also consider:

  1. Old pages.
  2. New pages written by people who do not conform to standards. (You might say "amateurs". Or you might say "people with other jobs".) There are millions of such pages and sites. There are new such pages and sites created every day. Consider even just the web pages of all computer lecturers at DCU. How many validate their HTML?

It is unlikely the web will ever be well-formed. And maybe it doesn't matter.



A 2009 post tries to validate the HTML of major websites, and suggests that the majority of the web is malformed. And it doesn't seem to matter.


  

HTML5 instead of XHTML



Human-readable web and machine-readable web stay separate


