BROWSING TECHNICAL DOCUMENTS: DOCUMENT MODELLING AND USER INTERFACE DESIGN

Donal Fitzpatrick & A. I. C. Monaghan
School of Computer Applications, Dublin City University
dfitzpatrick@ncirl.ie , alex@compapp.dcu.ie


Abstract

Although non-technical publications are now readily available to blind people, the degree of access diminishes as the amount of technical content increases. This makes obtaining such information both difficult and time-consuming. Moreover, once technical documents have been acquired, reading them is a struggle and far from a pleasurable experience. This paper presents research currently in progress at Dublin City University which aims to alleviate many of the problems encountered by blind readers who need to access highly technical material.

Keywords: speech synthesis; document modelling; prosody; blind users; Braille.

Summary

A large number of non-technical texts are recorded on audio cassette or printed in Braille, and are therefore accessible to blind people; but the more technical a text is, the less readily it can be expressed by these conventional means. Access to technical documents therefore demands considerable effort from blind readers. Once the documents have been obtained, reading them still poses major problems in terms of the time required and the difficulty of extracting the desired information. At Dublin City University, we are currently developing an encoding system for technical documents which sits between the document and a conventional synthesiser, in order to make technical information easier for blind readers to access.


1. Introduction

    1.1. Underlying concepts

      The emergence of the personal computer has ensured that many people have gained access to vastly increased amounts of information. This is especially true in the case of blind computer users who, before the technological advances of the last two decades, relied on such media as Braille or cassette tape to access all information. These modes of reading, though slow to produce, were effective and ensured that at least some information was available.

      Nowadays, information is more widely available to blind people. However, much of the research effort in the area of Access Technology to date has been directed towards the literary rather than the technical area. As a result, educational prospects have been quite limited. For example, the realms of higher mathematics have been unattainable for most blind students, while scientific courses have not been readily accessible. The main reason is that the vendors of access technology have been more interested in giving blind people access to the more commonly used applications such as word processors and spreadsheets, and have placed less emphasis on applications like programming tools or mathematical software.

      A commonly asked question, which should be dealt with at this point, is "How do blind people access computers?". There are several means of doing so. The first mode of interaction involves the use of synthetic speech. Here, a speech synthesiser is plugged into the computer and software communicates with it via a serial port. This software, known as a screenreader, is responsible for relaying what appears on screen to the blind user. It relies on user input to determine what is actually sent to the synthesiser: if a certain key sequence is pressed by the user, the current line could be read, whereas if another were chosen then the entire contents of a document could be spoken. The problem with this means of communication is that the focus can be quite narrow, since the user is confined to hearing what is being spoken at the current time. Thus, if a blind user is on the MS Windows 95 desktop and has highlighted "My Computer", they have no idea where other icons are located in relation to the one on which they are currently focused.

      The use of a Refreshable Braille Display goes some way towards alleviating this problem. A Refreshable Braille Display consists of a line of eighty cells, each containing solenoid-driven pins which may be raised. When raised, the pins take the shape of the dot patterns used to form the 64 Braille characters. Here, the user has access to a single line of Braille on which a line of the screen is displayed. Using this type of appliance the user could see where the "My Computer" icon was on screen, and if there was no other icon on the display then they would be aware that there were no other icons in a direct horizontal line from the one on which they were focused. The disadvantage of such a hardware device is the cost. Speech synthesisers are sold at less than 10% of the cost of Braille displays, and with the emergence of software speech synthesisers, the cost of this option is set to fall even further.

      There is thus a need for software which will allow speech synthesisers to produce intelligent-sounding output from technical documents. The TechRead system is intended as a prototype of such a system.

      1.2. TechRead

        TechRead aims to alleviate some of the problems encountered by blind people when attempting to gain access to technical material. The system is currently under development at Dublin City University. It is based on several key concepts. The system will take documents marked up in the TeX (Knuth 1986) family of mark-up languages and derive an internal representation of the structure and content of the document. Once this has been achieved, the material can be transformed into either Braille or spoken output. At present, the Braille version of the document is to be produced using the established rules for both transcription and formatting, as specified by the Braille Authority of the United Kingdom (BAUK). This organisation governs such aspects as the positioning of various document components on the page, and the circumstances in which a given Braille symbol can be used. The spoken output, on the other hand, is the main thrust of our research. Existing screen access products do not make use of the formatting commands found in technical documents to produce intelligent-sounding speech, and as a consequence the resulting output can often be very monotonous. TechRead will interpret document mark-up, and utilise the prosodic capabilities of current commercial synthesisers to express this mark-up in speech. The spoken output will thus reflect changes in the visual characteristics of the input.

        An example of where the characteristics of a voice could be changed is as follows. Let us assume that this article is being read by a blind user of the TechRead system. Perusal of this document reveals that the start of sectional units is denoted by a portion of text which is visually more striking than the surrounding text. When reading this article using a standard screenreader this fact is not apparent, and it is only through the intuition of the listener that the headings can be identified. TechRead will alter the characteristics of the reading voice to indicate to the listener that they have reached a portion of text which in some way stands out from the rest of the material in the document. This alteration of the spoken output will depend on the nature of the visually enhanced text. Thus, a section heading will be spoken differently from a portion of highlighted text, though the visual appearance of both may be identical.
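
        One way to picture this behaviour is as a lookup from the logical role of a span of text to a set of voice settings, so that two spans which look alike on the page can still be spoken differently. The following is a minimal sketch under stated assumptions: the role names, the fields of `Prosody` and every numeric value are invented for illustration, and do not describe TechRead's actual settings or the parameters of any particular synthesiser.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Prosody:
    """Hypothetical voice settings; fields and units are illustrative only."""
    pitch_shift: float    # relative change to the baseline pitch
    rate_scale: float     # multiplier applied to the speaking rate
    pause_before_ms: int  # silence inserted before the span is spoken


# Assumed table: keyed on what the text *is*, not on how it looks.
PROSODY_BY_ROLE = {
    "body": Prosody(0.0, 1.0, 0),
    "section_heading": Prosody(-0.15, 0.85, 600),
    "emphasis": Prosody(0.10, 0.95, 0),
}


def prosody_for(role: str) -> Prosody:
    """Return the settings for a role, falling back to plain body text."""
    return PROSODY_BY_ROLE.get(role, PROSODY_BY_ROLE["body"])


def annotate(spans):
    """Pair each (role, text) span with the voice settings used to speak it."""
    return [(text, prosody_for(role)) for role, text in spans]
```

        Keying the table on role rather than on font attributes is what allows a bold heading and a bold phrase within a paragraph to receive different treatment.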

        Another feature of TechRead is the incorporation of facilities to browse mathematical equations. Mathematics is a highly visual area, and to date the presentation of this type of material in both spoken and Braille translations has been less than perfect (Raman 1994, Stevens 1996). TechRead will use pauses, changes in the speaking rate and alterations in the pitch of the voice to indicate mathematical content (Monaghan 1991, Monaghan 1998). Also, as discussed below, the user interface will incorporate the features necessary to browse through highly complex mathematical content.

        The rest of this paper concentrates on the interaction between blind users and technical documents. It describes the interface used in the TechRead system, and outlines the off-screen model used to represent the underlying document structure. The interface uses this model to give rapid access to all parts of the document.


        2. Representing document structure

        The first occurrence of the off-screen model in the area of screen access for blind computer users is to be found in the IBM Screenreader for OS/2. Here, a complex database of the screen attributes is built up, and from this information a model is built. Using this model, the information pertinent to a blind user can be easily and efficiently conveyed to an output device. We decided to follow this approach when designing the internal representation of the hierarchical components of a document. It was therefore essential that this model incorporate all the structures of document design (sectional units, font attributes, etc.) and be easily accessible to an interface for the purposes of browsing. Before describing this model, it is first necessary to describe the problems it was designed to solve.

        When sighted readers read, they can scan a document at a rate of thousands of words per minute. Their eyes can move rapidly over a page of printed text, focusing on the important pieces of material which stand out by virtue of visual attributes. These characteristics are manipulated by the author or publisher to reflect the nature of the information they wish to convey. Blind readers, on the other hand, are not currently provided with these distinctions. The information is conveyed to the user of a speech synthesiser at a far slower rate than to sighted readers: usually at 300 words per minute, although experienced users of this technology can read at speeds of up to 450 words per minute. Existing software does not harness the prosodic capabilities of speech synthesisers, thus impoverishing the audio version of the text and rendering it tedious and often incomprehensible. An example will suffice to illustrate the differences between the two modes of reading.

        Let us assume that both a sighted and a blind reader wish to peruse a daily newspaper. The sighted person can pick up the paper and rapidly scan down the page to see which article they wish to read. Each article has one distinguishing feature to indicate its starting point, namely a headline. This piece of formatted text stands out from the surrounding text by virtue of its larger font size and bold typeface. As a result, the newspaper can be rapidly scanned and the desired information read. The blind reader, on the other hand, does not have this facility, and their means of locating important information depends primarily on the medium in which the newspaper is presented. The most primitive means of reading a document for a blind person is by use of a standard audio cassette. The information flows past the blind reader (or listener) and the book (or tape) is the active partner in the process. This is in direct contrast to the nature of sighted reading, where the roles revert to their more normal state in which the reader is active and the book passive.

        In order to locate a headline, the reader must know what the text is, or the document structure must be indicated in some way. The most common means of accomplishing this is to incorporate tone indexing into audio recordings. These high-pitched beeps are audible on machines with "cue" or "review" modes. However, the nature of an audio cassette is that it can only be advanced or rewound along a time line, so the tape must be moved forward or backward until the required passage is found. This is both very time-consuming and frustrating. An alternative is to access the material electronically. This is a slight improvement on the standard audio cassette since, through the use of search utilities, the text can be rapidly scanned for the required information. However, there are disadvantages to this form of presentation. Suppose that "Section 5" were being sought: the user executes a "find" and may retrieve the text "see Section 5". If the document is large, there can be any number of such occurrences before the actual material of Section 5 is encountered.

        The TechRead system overcomes this problem by deriving an internal model of the document. This is constructed by examining the mark-up of the input and determining both the hierarchical and logical structure contained therein. Even the simplest ASCII text file contains a degree of mark-up: this is to be found in the punctuation, quoted passages of text, paragraphs and other devices which are commonly used in the preparation of documents. However, the more extensive the mark-up the more useful the model of the document can be.
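
        The contrast between searching the text and examining the mark-up can be illustrated with a few lines of code. The sketch below is not TechRead's parser: it assumes LaTeX input and looks only for the standard `\section`, `\subsection` and `\subsubsection` commands with simple titles containing no nested braces. The function name `sectional_units` and the sample text are our own.

```python
import re

# Matches \section{...}, \subsection{...}, \subsubsection{...} and their
# starred forms. Titles containing nested braces are deliberately not handled.
SECTION_RE = re.compile(r"\\(section|subsection|subsubsection)\*?\{([^{}]*)\}")

LEVELS = {"section": 1, "subsection": 2, "subsubsection": 3}


def sectional_units(latex_source):
    """Return (level, title, offset) for every sectioning command found."""
    return [
        (LEVELS[m.group(1)], m.group(2), m.start())
        for m in SECTION_RE.finditer(latex_source)
    ]


sample = r"""
\section{Introduction}
For details see Section 2.
\section{Results}
\subsection{Timing}
"""

# A plain text search for "Section 2" lands on the cross-reference in the
# introduction; the mark-up based list contains only genuine headings.
headings = [(level, title) for level, title, _ in sectional_units(sample)]
```

        Because the headings are identified from the commands that create them, a phrase such as "see Section 2" in running text can never be mistaken for the start of a section.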

        The means that TechRead uses to represent the structure is to produce a graph of the document. Traditionally, it has been considered the norm to use a tree-based structure to represent any document, and TechRead to some extent adheres to this principle. However, the essential difference is that the tree is cross-linked to form an interconnected network comprising both the structure and visual attributes of the document. The root node of this derived model consists of all global settings for the document. This would include such items as would ordinarily go in the preamble of a LaTeX (Lamport 1994) document. Below this are the internal nodes of the tree-based architecture. These nodes are used to contain the structural elements which comprise the document. Thus, such items as sectional units, paragraph units, or actual mathematical objects (such as equations) would be contained in this type of node. The actual text of the document would be found in the terminal, or leaf nodes of the tree. As was stated above, though the concept of a tree-based hierarchy is used in the TechRead system, it also includes non-hierarchic links. Thus, all nodes on a given branch are linked together, while the last node in one branch is linked to the first in the next. This strategy applies at all levels of the structure and is used to link sectional units of all types. Our intention is to capture non-hierarchic relations between diverse document elements such as sequential ordering, cross-referencing and footnotes.
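
        The structure just described can be sketched as a tree whose nodes also carry sideways links. The code below is an illustration under our own assumptions rather than the TechRead data structure: the class name `Node`, its field names and the function `cross_link` are hypothetical, and nodes are linked level by level, which presumes that units of the same type sit at the same depth.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)  # identity comparison; the graph contains cycles
class Node:
    kind: str                      # e.g. "root", "section", "paragraph", "math"
    content: str = ""              # text is held in the leaf nodes
    parent: Optional["Node"] = field(default=None, repr=False)
    children: List["Node"] = field(default_factory=list, repr=False)
    prev: Optional["Node"] = field(default=None, repr=False)  # same-level links
    next: Optional["Node"] = field(default=None, repr=False)

    def add_child(self, child: "Node") -> "Node":
        """Attach a child and record the upward link as well."""
        child.parent = self
        self.children.append(child)
        return child


def cross_link(root: Node) -> None:
    """Link the nodes at each depth in document order, across branches."""
    level = [root]
    while level:
        for left, right in zip(level, level[1:]):
            left.next, right.prev = right, left
        level = [child for node in level for child in node.children]
```

        After `cross_link` has run, the last paragraph of one section points directly at the first paragraph of the next, which is the non-hierarchic link described above.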

        This model has been designed to incorporate the formatting changes which authors use to indicate passages of their text which differ in importance from the main body of the text. This model is able to represent such structural elements as lists, tables and passages of centred text and many other complex entities in a compact and flexible manner. This is perfectly illustrated by the storage of complex mathematical equations. The internal nodes can be further sub-divided into two main categories; namely textual or mathematical objects. The mathematical objects are those which comprise both the actual mathematical content of the given equations, and their spoken or Braille equivalent. The only step necessary to incorporate the mathematical objects into the document is to mark them as mathematically based rather than textual in nature. As discussed below, the interface uses this information to enable the user to rapidly and easily peruse this type of data.

        A brief example will illustrate how this model works. Let us return to the analogy of a daily newspaper. There are several obvious hierarchical elements to be found in this type of document; sections, articles, paragraphs and words. Let us also assume that the paper has been prepared in the LaTeX mark-up language. The portion of the TechRead system responsible for the preparation of the internal model would examine the mark-up of the newspaper and verify that the structural elements described above were present. It would then construct a cross-linked tree (or graph structure) of the newspaper. At the root of the tree would be the global formatting of the document, including the font predominantly used throughout the document, the title and other information found at the beginning of documents. The mark-up would then be examined, and the first section encountered. A new node would be created at the level below the root, and bi-directionally linked to it. The various articles would be found and placed at a lower hierarchic level than the overall sectional unit. Each node would be bi-directionally linked to its siblings and its parent and children. As was stated previously, once the next sectional unit were encountered, a link would be formed with its predecessor. The reason for the cross-linking is to avoid complex navigation algorithms. Thus it is possible to proceed directly from the last paragraph in the Business Section to the first in the Sports Section, without having to ascend until a common parent is found, move across to the next branch and descend once again to the level from which one originally started.
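
        To make the saving concrete, the sketch below builds a small newspaper and compares the two routes from the last paragraph of the Business section to the first paragraph of the Sports section. It is illustrative only: the class `Node` and the functions `link_levels` and `next_by_climbing` are hypothetical names, and the level-by-level linking assumes that like units sit at the same depth.

```python
class Node:
    def __init__(self, label, parent=None):
        self.label = label
        self.parent = parent
        self.children = []
        self.next = None  # cross-link to the following node at the same depth
        if parent is not None:
            parent.children.append(self)


def link_levels(root):
    """Add the cross-links, one depth at a time, in document order."""
    level = [root]
    while level:
        for left, right in zip(level, level[1:]):
            left.next = right
        level = [child for node in level for child in node.children]


def leftmost_at_depth(node, depth):
    """Return the first descendant found exactly `depth` levels below node."""
    if depth == 0:
        return node
    for child in node.children:
        found = leftmost_at_depth(child, depth - 1)
        if found is not None:
            return found
    return None


def next_by_climbing(node):
    """The route needed without cross-links: ascend, move across, descend."""
    climbed, current = 0, node
    while current.parent is not None:
        siblings = current.parent.children
        for candidate in siblings[siblings.index(current) + 1:]:
            found = leftmost_at_depth(candidate, climbed)
            if found is not None:
                return found
        current, climbed = current.parent, climbed + 1
    return None


paper = Node("newspaper")
business, sports = Node("Business", paper), Node("Sports", paper)
article_b, article_s = Node("Markets fall", business), Node("Cup final", sports)
last_business = Node("Business closing paragraph", article_b)
first_sports = Node("Sports opening paragraph", article_s)
link_levels(paper)
```

        Both routes arrive at the same node, but with the cross-links in place the interface simply follows `last_business.next` instead of searching the tree.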

        In the next section we discuss how the interface used in the TechRead system uses the features of this model to make document browsing an intuitive and efficient process.


        3. Human interaction with spoken documents

          3.1. Current trends

          As was stated previously, much of the effort of screenreader developers has been aimed at the literary rather than technical areas. As a result, although access to simple documents is now of a high standard, technical material is generally inaccessible to a blind reader. The interpretation of mathematical, tabular and graphical information is largely beyond the abilities of current screenreaders, and therefore the resulting synthetic speech is garbled at best and frequently incomprehensible.

          Even when the information can be sensibly rendered by the screenreader, the interface design may make finding and extracting this information extremely difficult. In recent years the trend in interface design has been to make the interaction between the blind person and the screen reading software as simple as possible. Developers have therefore tended towards a centralised rather than a scattered interface, in which most of the keystrokes needed to obtain information about the contents of a screen are located in roughly the same portion of the keyboard. Where this is not possible, because the number of commands required exceeds the number of available keys, the extra commands should be related to the rest in some logical and intuitive way. As a consequence, the screenreaders employing this strategy have tended to be less functionally rich, but more user friendly. Two contrasting examples will serve to illustrate this point.

          The two most popular screenreaders currently available are JAWS for Windows and Slimware WindowBridge. The former is very simple to operate, but is functionally limited, whereas the latter is much less intuitive to use but is functionally very rich. JAWS primarily relies on the numeric keypad of a standard 101-key keyboard: it uses the "Insert" key located at the bottom of this keypad as its control key, in much the same way as MS Windows uses the "Alt" key in its shortcuts. WindowBridge, on the other hand, is not so well organised: this program uses several seemingly unrelated key-combinations to perform similar tasks.

          Empirical testing has shown that blind students within Dublin City University could master JAWS more rapidly than WindowBridge, and that many of the more advanced features offered by the latter screenreader were never learned and remained unused.

          3.2. The Interface to TechRead

          The interface used in the TechRead system was designed with three main criteria in mind (Fitzpatrick & Monaghan 1998). Firstly, it was necessary to offer the user a simple means of browsing complex information. Secondly, it had to be flexible enough to enable future expansion, and thirdly it needed to be capable of navigating the off-screen model used to represent the document structure. In order to keep the interface as simple as possible it was decided to follow the trends in modern screenreader design outlined above and to employ the numeric keypad for the primary reading and orientation functions. The reasons why the interface is based around this portion of the keyboard are twofold. Firstly, it ensures that most commands are located in the same area of the keyboard. This makes the various commands easy both to remember and to execute. Secondly, this particular interface is easy to expand. The number of overlays which can be placed on the keypad is theoretically unlimited, though in practice it is limited to approximately four by considerations of learnability and recall. The following paragraphs outline the nature of the TechRead interface, and explain how it will assist the blind reader to browse a document in an efficient and intuitive manner.

          There are two aspects to the TechRead interface. The first is the process of displaying the document on the screen, and the second is the manner in which a blind user of the system will access the document. It should be stated here that the reason for our emphasis on the visual display is that blind computer users do not work in a vacuum. There is no point, therefore, in the blind reader being able to browse the document in an efficient manner, while their sighted colleagues are totally unable to use the system. Thus, it was decided to employ the standard "look and feel" principles used in the design of Windows-based products. The document will be displayed in two panels on the screen. The leftmost panel will display the structural elements which make up the document. However, this will only go down to the lowest level of sub-sectional units, as the actual paragraphs of text will be displayed in the right-hand panel. To return to our newspaper analogy, the left-hand panel would consist of the sections and article headlines while the other would contain the actual text of the current article. The interface will make use of all the standard Windows techniques for file access, and will also contain buttons which the sighted user can activate using the mouse.

          However, the more important aspect of this system will be the manner in which the blind user will be able to browse the document. As was described above, the model used to represent the document internally is primarily tree-based. Consequently, the use of a standard tree-control display to represent the underlying structure of the document seems sensible both from an implementation perspective and from the user's viewpoint. This structure is also conducive to interaction with the numeric keypad.

          Two alternative key-mappings are proposed. The first approach would be to use the cursor movement keys (keys 2, 4, 6 and 8 respectively) to navigate through the structure. The left-arrow would move horizontally through the hierarchy in one direction, while the right-arrow would move in the opposite direction: the down-arrow would expand the sectional unit to reveal those lower-level units contained within it, while the up-arrow would collapse this section and return up the tree to the preceding level. The alternative mapping also revolves around the cursor movement keys outlined above: however, this mapping would utilise these keys in the manner specified in the MS Windows 95 Look and Feel Guide. In this mapping, down-arrow would proceed from one sectional unit to the next at the same hierarchic level while right-arrow would expand the section to reveal those sub-sections or other structural elements contained within it: conversely, left-arrow would collapse the sectional unit once again and return the user to the level above.

          Both these key-mappings have advantages and disadvantages. The former mapping is logically simple. It is obvious what each key does, and intuitive that the direction of the cursor movement causes navigation in that direction. The disadvantage, however, is that it may confuse those users of the system who are already familiar with the keyboard shortcuts for the Windows 95 operating system. This is precisely where the second mapping scores over the first. It follows the precepts laid down by Microsoft and again is logically easy to follow. However, the disadvantage is that it is not intuitive. It is not obvious, for example, why down-arrow should go to the next section while right-arrow moves to the first sub-section.
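
          Since the two mappings differ only in which key triggers which movement, they can share a single navigation routine. The sketch below illustrates this under our own assumptions: the document is held as nested dictionaries, the names `Cursor`, `press`, `MAPPING_A` and `MAPPING_B` are hypothetical, the direction assigned to left-arrow and right-arrow in the first mapping is a guess (the description above says only that they move in opposite directions), and the role of up-arrow in the second mapping is likewise assumed.

```python
# First mapping: arrows follow the direction of travel (directions assumed).
MAPPING_A = {"left": "previous", "right": "next",
             "down": "expand", "up": "collapse"}
# Second mapping: Windows-style tree control ("up" as "previous" is assumed).
MAPPING_B = {"up": "previous", "down": "next",
             "right": "expand", "left": "collapse"}


class Cursor:
    """Tracks the current unit as a path of child indices from the root."""

    def __init__(self, root):
        self.root = root
        self.path = [0]  # start on the first top-level unit

    def _siblings(self):
        node = self.root
        for index in self.path[:-1]:
            node = node["children"][index]
        return node["children"]

    @property
    def current(self):
        return self._siblings()[self.path[-1]]

    def apply(self, action):
        """Perform one movement; an impossible move leaves the cursor alone."""
        siblings, index = self._siblings(), self.path[-1]
        if action == "next" and index + 1 < len(siblings):
            self.path[-1] = index + 1
        elif action == "previous" and index > 0:
            self.path[-1] = index - 1
        elif action == "expand" and self.current["children"]:
            self.path.append(0)
        elif action == "collapse" and len(self.path) > 1:
            self.path.pop()
        return self.current["title"]


def press(cursor, mapping, key):
    """Translate a key into a movement using the chosen mapping."""
    return cursor.apply(mapping[key])
```

          Keeping the mapping in a table of this kind would allow the choice between the two schemes to be left to the user rather than fixed in the program.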

          We described above how blind people read documents. Using the interface described in this section, it will be both easy and efficient to navigate through documents of arbitrary complexity such as technical papers and web pages. The user need only press the cursor movement keys (irrespective of which mapping is chosen) to move to the portion of the document they wish to read. However, one issue which has not yet been addressed is the means by which a user can actually "read" the document. The standard means of doing this is to use overlays on the numeric keypad. One solution to the reading question would be to use "Control+num-pad 5" (i.e., the centre key on the numeric keypad) to read the current paragraph, while the key pressed on its own would read the current word. The disadvantage of such a system is that it could not be used by mobility-impaired computer users. An alternative approach would be to follow the lead taken by Henter-Joyce in their JAWS for Windows screenreader. As was mentioned above, they use the "Insert" key found at the bottom of the numeric keypad (marked also with a 0) as their control key. Thus, if "Insert+5" were pressed, then the current sentence would be read, while once again if the "5" key were pressed on its own then the current word would be read.
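
          An overlay of this kind amounts to a table from (modifier, key) pairs to reading units. The following minimal sketch follows the "Insert+5" scheme; the function names, the use of a character offset as the reading position and the deliberately naive word and sentence rules are all assumptions made for the example.

```python
import re


def current_word(text, offset):
    """Return the whitespace-delimited word covering the given offset."""
    for match in re.finditer(r"\S+", text):
        if match.start() <= offset < match.end():
            return match.group()
    return ""


def current_sentence(text, offset):
    """Return the sentence covering the offset (naive split on . ! ?)."""
    for match in re.finditer(r"[^.!?]+[.!?]?", text):
        if match.start() <= offset < match.end():
            return match.group().strip()
    return ""


# Hypothetical overlay: the same key reads a larger unit when a modifier
# is held down.
READ_COMMANDS = {
    ("", "5"): current_word,
    ("insert", "5"): current_sentence,
}


def read(text, offset, modifier, key):
    """Look up the reading command and return the text to be spoken."""
    return READ_COMMANDS[(modifier, key)](text, offset)
```

          Further overlays could be added as extra entries in the same table, for example a second modifier for reading the current paragraph.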

          The use of the numeric keypad lends itself extremely well to the browsing of both mathematical equations and tabular based information. If one assumes that a designated key is the "read current" key, and others are assigned to "read next" and "read previous", then the mathematical expressions can be broken up into terms and sub-terms in the same way that the document is decomposed into sections and sub-sections. Hence all the problems associated with the various key mappings outlined earlier in the context of textual information apply in a similar manner to mathematically oriented data. When looking at tabular material it is best to consider the numeric keypad as a matrix-based arrangement of keys. Thus, the numbers 1-9 on this portion of the keyboard can be used to easily manipulate the information contained in the various rows and columns of a table. For example, if one places the middle finger of the right hand on the "5" key, then the other keys fall easily and naturally under the remaining fingers. Thus, the adjacent rows and columns of the tabular information can be readily accessed.
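
          The matrix view of the keypad can be expressed as a table of row and column offsets centred on the "5" key. The sketch below is illustrative: the layout is that of a standard numeric keypad (7, 8, 9 on the top row), while the name `read_cell` and the decision that a key press moves the focus as well as reading the cell are our assumptions.

```python
# Offsets (row change, column change) laid out as on the keypad itself.
KEYPAD_OFFSETS = {
    "7": (-1, -1), "8": (-1, 0), "9": (-1, 1),
    "4": (0, -1),  "5": (0, 0),  "6": (0, 1),
    "1": (1, -1),  "2": (1, 0),  "3": (1, 1),
}


def read_cell(table, position, key):
    """Move from `position` as directed by `key` and return the new
    (row, column) position together with the text of that cell.
    A move which would leave the table keeps the focus where it is."""
    row, col = position
    d_row, d_col = KEYPAD_OFFSETS[key]
    new_row, new_col = row + d_row, col + d_col
    if 0 <= new_row < len(table) and 0 <= new_col < len(table[new_row]):
        row, col = new_row, new_col
    return (row, col), table[row][col]
```

          For example, with the focus on the top-left cell of a table, "5" re-reads that cell, "6" reads the cell to its right and "2" reads the cell below it.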


          4. Conclusions

          Work to date has shown that there is a significant need to improve the accessibility of technical documents to those who cannot read them using ordinary methods. However, what is still not clear is the best means to do this. The TechRead system aims in some measure to alleviate the problems encountered by blind users when navigating the treacherous waters of this type of document. It is hoped that the system will not only deal successfully with documents marked up using the TeX family of languages but will be expanded to cater for other mark-up languages such as SGML or Rich Text Format. It is not inconceivable that the TechRead system could form the basis for a revolutionary new approach to providing cross-platform access to various operating systems. We do not intend to suggest here that the provision of accessible technical documents is the core issue in designing such a system, but we believe that the techniques used in the TechRead system could be applied to form the backbone of new forms of screen access technology.


          5. References

          FITZPATRICK, D. & MONAGHAN, A. I. C. (1998), "TechRead: A system for the derivation of Braille and spoken output from LaTeX Documents", Proceedings of ICCHP 1998, Vienna-Budapest.

          KNUTH, D. E. (1986), "The TeXbook", Addison-Wesley, Reading.

          LAMPORT, L. (1994), "LaTeX: A Document Preparation System: User's Guide and Reference Manual", Addison-Wesley, Reading.

          MONAGHAN, A. I. C. (1998), "Des Gestes Ecrits aux Gestes Parlés", in Santi et al. (eds), Oralité et Gestualité, L'Harmattan, Paris.

          MONAGHAN, A. I. C. (1991), "Intonation in a Text-to-Speech Conversion System", Ph.D thesis, University of Edinburgh.

          RAMAN, T. V. (1994), "AsTeR: Audio System for Technical Readings", Ph.D thesis, Cornell University.

          STEVENS, R. D. (1996), "Principles for the Design of Auditory Interfaces to Present Complex Information to Blind People", Ph.D thesis, University of York.