TechRead: A System for Deriving Braille and Spoken Output from LaTeX Documents
Donal Fitzpatrick and Alex Monaghan
Computer Applications, Dublin City University
One of the most difficult aspects of research for a blind student is the unavailability of technical material in an accessible format. To date, much of the effort of transforming documents into either Braille or spoken output has concentrated on literary rather than technical or scientific material. For example, most of the spoken text produced by existing screen-access technology does not harness the capabilities of synthetic speech devices, but instead outputs the material in a monotone. TechRead, on the other hand, is a system which, it is hoped, will render technical documents more accessible to blind people. This software will take LaTeX documents as input and produce Braille or spoken output from them.
The main aims of TechRead are as follows:
This paper discusses the fundamental principles underlying this system. It aims to show how the LaTeX document is transformed into an internal representation of the document, and from this to either Braille or spoken output. The final section discusses how the system will be expanded to cater for mathematical material, and our beliefs that the ideas contained in the system can be used to improve screen access technology.
Keywords: LaTeX, Spoken Documents, Braille
For many years, those writing software to translate material into Braille have focused on non-technical documents: much of the effort has gone into translating literary material, while more technical documents have been largely ignored. As can be imagined, therefore, the procurement of technical material by those who cannot read the printed version is both a time-consuming and tedious affair.
TechRead aims to solve this problem. The purpose of the system is to take a file in LaTeX [2,3] and to produce an output file in some medium accessible to blind people. A two-fold approach is taken. Firstly, the system will take the input source file and derive a Braille file from it, from which the user can obtain a hard copy using a standard Braille embosser. The reason for using Braille is simple. Many people think it an outdated, archaic system which has no place in the modern age of advanced technology; however, for many people Braille has been their standard means of reading for most of their lives, and in our view they must be catered for by TechRead. The translator will represent the structure of the document as closely as possible, i.e., the Brailled document will be as close a replica of the printed version as it is possible or sensible to be. By the very nature of Braille, there will be some discrepancy here, particularly in the area of layout.
While Braille is to many the only means of reading a document, it is limited. By its very nature as a tactile system, Braille cannot convey many aspects of documents which sighted people find so important, and which they take for granted. For example, it is only possible to show emphasis in Braille in one way: the emphasised text is italicised, no matter what the printed font might be. Speech, on the other hand, allows far wider scope. It is possible, for example, to have different voice characteristics for emboldened and italicised text. Another major advantage of speech is its ability to convey both the structure and the layout characteristics of a document through the use of prosody.
TechRead's second mode therefore combines existing speech synthesis technology with an analysis of document markup information to produce a "speaking document browser". Our current strategy is to take LaTeX as an input source and to produce an off-screen model of the document from this. Using this model, the blind person will be able to read a document in as similar a manner to their sighted colleagues as possible. An example of how this system will work is as follows.
Let us assume that the document being browsed is a newspaper, with sections, headlines and articles. The sighted person will read the section if it interests them. However, there might be headlines in this section that they wish to skip, or paragraphs in the articles which they do not wish to read. The document browser aims to provide the blind user with exactly the same functionality. The browser will allow the blind user to skip sections, paragraphs etc. Also, to return to the analogy of the newspaper, if the sighted person wishes to find the next headline, then they can simply scan down the page to see it. The document browser will allow the blind user to go directly to the next/previous section or sub-section of the document.
The advantages of such a system would be many. Unlike the current situation, the blind reader would not have to read superfluous information: they could "scan" the document using the browser until the relevant material has been reached, and then read it.
One of the key underlying ideas of the TechRead system is that of conveying the structure of a document to the blind user. To achieve this goal, an off-screen model (OSM) of the document must be constructed. The strategy employed in producing the accessible documents is based on a three-level architecture, as shown below (Fig. 1). As can be seen, it consists of an input or source file (LaTeX) which is passed to a pre-processor; the pre-processor converts the raw LaTeX material into an internal representation of the document, which can then be passed on to either the Braille translator or the system for producing the documents used by the browser. Before embarking on this discussion, however, it is worth outlining the reasons for selecting this type of architecture. Firstly, such a system lends itself to a very modular design: the layered structure means that one component can be changed without altering the overall structure or logic of the entire system. Secondly, though at the time of writing the input source file will be in LaTeX, there is no reason why other file formats cannot be added at a later date. All that will be necessary is to write the conversion routines to transform the input file into the internal format used by the translator and the generator for browsable documents.
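This layered design can be sketched as follows. The function and format names below are illustrative assumptions, not the actual TechRead interfaces: the point is simply that new front ends can be registered without touching either back end.

```python
# A minimal sketch of the three-level architecture: a pluggable front end
# parses the source into the internal representation, which either back
# end (Braille or speech) then consumes. All names are illustrative.

def latex_front_end(source):
    # Stand-in for the real pre-processor.
    return {"format": "internal", "body": source}

def braille_back_end(model):
    return "braille:" + model["body"]

def speech_back_end(model):
    return "speech:" + model["body"]

# Other input formats can be added later by registering a new front end.
FRONT_ENDS = {"latex": latex_front_end}

def process(source, source_format, back_end):
    model = FRONT_ENDS[source_format](source)
    return back_end(model)

print(process(r"\section{Intro}", "latex", braille_back_end))
```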
2. Representing structure.
The off-screen model of the document is constructed at the input stage. At this point the structure of the document is encoded. For example, to return to the analogy of the newspaper, it is at the input stage of translation that the structural information such as the whereabouts of the starting points of sections or headlines would be deduced. This system will enable the blind user to browse a document in as close a manner to a sighted colleague as possible. The off-screen model will enable the reader to do this. The following paragraphs will describe the model used by TechRead, and the manner in which this will influence the design of an interface to the document.
Figure 1: The Three Layer Architecture.
Previous systems used a tree-based architecture to represent the structure of a document. TechRead uses a more complex hierarchical structure, best described as a cross-linked tree. We assume two types of node in the system: terminal nodes, which hold the actual text of the document coupled with any formatting information associated with that text, and internal nodes, which hold the material relating to headings, sub-headings etc. The root is a node containing all global formatting for the document. Below this are the first-level headings (if they are present), or simple terminal nodes containing the text of the document otherwise. At all levels below the root, the nodes are inter-linked both downwards and across the same level. For example, each section node is linked to the preceding and following sections as well as dominating the sub-sections contained within it: this allows the user to browse any chosen level of the document. In addition, the left-most terminal node on any branch of the tree is linked to the left-most terminals on the preceding and following branches: this allows the user to skip forward or back a paragraph. All terminal nodes on any given branch are linked to each other in the form of a list. Finally, the right-most or final terminal on any given branch is linked to the left-most terminal on the next branch for smooth continuous reading. This combination of links directly models a range of different reading strategies.
During construction of the OSM, any formatting changes are passed up the hierarchy to enable rapid processing of the document. For example, if a portion of emphasised text appeared in paragraph 4 of section 3, a flag would be set in both the section and paragraph nodes. Thereafter, if the browser encounters that section it will examine the paragraph nodes to find the one which contains a formatting command. Similarly, browsing the paragraph level would lead to an examination of the terminal nodes to discover where the formatting change occurs. The algorithm which ultimately produces the spoken version of the document would then alter the characteristics of the voice appropriately, instead of simply outputting the text in the normal reading voice. As a consequence of this model, the interface to the document can be very flexible.
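The cross-linked node structure and the upward propagation of formatting flags might be sketched as follows. The class layout and attribute names are our own illustrative assumptions, not the actual TechRead implementation:

```python
# Sketch of the cross-linked tree: terminal nodes carry text and
# formatting; internal nodes carry a flag recording whether any
# formatting change occurs somewhere below them.
class Node:
    def __init__(self, text=None, formatting=None):
        self.text = text                  # set only on terminal nodes
        self.formatting = formatting or {}
        self.children = []                # dominated sub-nodes
        self.prev = self.next = None      # links across the same level
        self.parent = None
        self.has_format_change = False    # flag propagated up from below

    def add_child(self, child):
        if self.children:                 # cross-link siblings
            self.children[-1].next = child
            child.prev = self.children[-1]
        child.parent = self
        self.children.append(child)
        if child.formatting or child.has_format_change:
            child.flag_ancestors()

    def flag_ancestors(self):
        # Pass a formatting change up the hierarchy, so the browser can
        # later skip whole sections containing no change of voice.
        node = self.parent
        while node is not None and not node.has_format_change:
            node.has_format_change = True
            node = node.parent

# A section whose fourth paragraph contains emphasised text:
root = Node()
section = Node()
root.add_child(section)
for i in range(4):
    fmt = {"emph": True} if i == 3 else {}
    section.add_child(Node(text=f"paragraph {i + 1}", formatting=fmt))

print(section.has_format_change)  # True: the flag was passed up
```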
It was initially decided to display the LaTeX on screen in an un-interpreted form, and to design the interface to the spoken version of the document around the numeric keypad of a standard IBM-compatible computer. This is in keeping with trends in the design of modern screen-access technology, where developers attempt to ensure that the time taken by users to learn the system is kept to a minimum. However, use of the numeric keypad also has an inherent logical basis. Firstly, navigation through the document is intuitively related to the directional keys (up = 8, down = 2, etc.). Secondly, the use of meta-keys in combination with the numeric keypad allows functions at different levels of the document. Let us assume, for example, that the "5" key on this keypad reads the current character. When pressed in conjunction with the "shift" key it could read the current word, and in conjunction with the "control" key could read the current paragraph. The flexibility of such an interface leaves wide scope for expansion or customisation. The number of overlays which can be placed on the numeric keypad is (theoretically) infinite, while the fact that only a small number of keys are at the core of the system means that, should the user desire to alter the key mappings, it will be relatively straightforward to do so.
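Such a keypad overlay could be modelled as a simple lookup from (key, modifier) pairs to actions. Apart from the "5"-key behaviour given in the text, the mappings below are hypothetical:

```python
# Illustrative keypad overlay: a table from (key, modifier) to action.
# Only the "5" assignments follow the example in the text; the rest are
# invented, and users could remap any entry.
READ_ACTIONS = {
    ("5", None):      "read current character",
    ("5", "shift"):   "read current word",
    ("5", "control"): "read current paragraph",
    ("8", None):      "move up one line",
    ("2", None):      "move down one line",
}

def dispatch(key, modifier=None):
    return READ_ACTIONS.get((key, modifier), "unmapped")

print(dispatch("5", "shift"))   # read current word
```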
3. Translation Algorithms.
We have seen how the TechRead system takes the raw LaTeX documents and produces an off-screen model from them. The next phase of producing accessible documents is to transform this model into either speech or Braille. The following sections detail how this will be achieved.
3.1 Braille Translation
For many years, the means for producing Braille material from various types of document have been known. However, as was stated in section 1, the material produced to date has been of a literary rather than a technical nature. There are still very few translation packages which can take technical documents with embedded mathematical formulae and render accurate Braille. This is one of TechRead's main aims.
The translation process simply involves a character substitution algorithm, consisting of a rule based engine which replaces patterns of characters found in the input document with their grade II Braille equivalents. The rules are of the form:
"input_string" => "Braille symbol"
"input_string" => "braille_symbol"+ "remaining_input"
Either the entire string has a Braille equivalent, or the first part has a Braille equivalent.
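A minimal sketch of such a rule engine is given below, using longest-match-first substitution so that whole-string rules take precedence over prefix rules. The rule table is a toy fragment written in North American Braille ASCII-style notation, not a full Grade II rule set:

```python
# Minimal rule-based substitution engine: at each position the longest
# matching input pattern is replaced by its Braille equivalent, and
# translation continues on the remaining input. Characters with no
# matching rule are copied through unchanged.
RULES = {
    "and": "&",   # toy contractions in ASCII-Braille style
    "the": "!",
    "er":  ">",
}

def translate(text):
    out = []
    i = 0
    while i < len(text):
        for length in range(len(text) - i, 0, -1):   # longest match first
            chunk = text[i:i + length]
            if chunk in RULES:
                out.append(RULES[chunk])
                i += length
                break
        else:
            out.append(text[i])                      # no rule: copy as-is
            i += 1
    return "".join(out)

print(translate("the reader"))   # ! read>
```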
However, more important than the translation process is the actual material which is translated. How, for example, should emphasis be conveyed to the Braille reader? Traditionally, Braille has used only one form of emphasis, namely italics, and this denotes emphasis irrespective of the form of visual enhancement in the printed document. This notion of "what" rather than "how" to translate is particularly important when dealing with mathematical material. Unlike the spoken version of the document, where an almost infinitely varied set of alterations in the characteristics of a voice can convey much of the semantics of the formulae, Braille has only one way to translate mathematical material. One way in which we hope this translator will improve on those currently in existence is its use of the spatial location of formulae on the page. As any student using the British mathematical notation will know, it is customary simply to write equations in a line across the page (as though they were literary text) instead of using the conventions adopted by typesetters for displaying printed mathematics. While not all of these conventions are relevant to Braille mathematics, some of them, such as the use of vertical as well as horizontal orientation to display formulae, will improve readability.
3.2 Producing Spoken Documents
By far the more interesting portion of the system is that concerned with the production of spoken documents. While the derivation of Braille from the LaTeX input may indeed be very useful there is far more variety in the output which can be obtained by actually using alterations in the characteristics of the voice to convey the material to the user.
To begin with, the many formatting commands which are available in LaTeX and similar document preparation systems actually correspond to a much smaller number of linguistic categories which can be realised prosodically. These categories include subordination, aside, change of topic, list, and emphasis. For example, an aside may be encoded in a document as a footnote, a parenthesised passage, a margin note or simply some text between commas: all of these might receive the same prosodic treatment in a spoken rendition. Similarly, in the spoken version it may not be desirable to distinguish between bold, italic, underlined, capitalised and quoted text: it seems unrealistic to expect the listener to keep all these emphasis types distinct. Moreover, TechRead is limited by the possibilities of the synthesis devices which will produce the spoken version: not all synthesisers offer the same degree of control over the prosodic realisation, and the granularity of control also varies. However, previous work [4-6] has shown that a small number of prosodic categories allows the construction of quite complex hierarchies which should be sufficient to express all the relations which sighted users extract from formatted documents.
Starting with a core set of LaTeX commands, we will derive a model of the possible functions performed by different formatting commands. Each of these functions will relate a set of formatting commands to a unique combination of prosodic symbols. These symbols will be given a translation in terms of the control sequences for each output device. The acoustic-phonetic details of the spoken output will therefore depend on the particular synthesiser in use. The core set of LaTeX commands will then be expanded to include almost all the standard commands, although this will always be a subset of full LaTeX: we cannot cope with user-defined macros.
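The two-stage mapping just described (formatting commands onto prosodic categories, categories onto device control sequences) might look like this in outline. Both tables are invented for illustration; in particular, the control codes are placeholders, not real synthesiser sequences:

```python
# Many formatting commands collapse onto a few prosodic categories...
PROSODIC_CATEGORY = {
    "\\footnote": "aside",
    "\\emph":     "emphasis",
    "\\textbf":   "emphasis",
    "\\section":  "change_of_topic",
    "\\itemize":  "list",
}

# ...and each category is realised per device (codes are invented).
DEVICE_CODES = {
    "aside":           "[:volume down]",
    "emphasis":        "[:voice 2]",
    "change_of_topic": "[:pause long]",
    "list":            "[:pause short]",
}

def realise(latex_command, device_codes=DEVICE_CODES):
    category = PROSODIC_CATEGORY.get(latex_command)
    # Unknown commands fall back to the default reading voice (no code).
    return device_codes.get(category, "")

print(realise("\\textbf"))   # [:voice 2]
```

Swapping in a different `DEVICE_CODES` table is all that would be needed to target another synthesiser, which is the point of keeping the prosodic symbols device-independent.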
It is hoped to construct a formal model for the alterations in the voice characteristics to show proper semantic interpretation of mathematical equations in particular. It should be noted that to date much of the work relating to this portion of the system has gone in the direction of enhancing the spoken text, as opposed to mathematical equations. (For a discussion of our future work in the area of mathematics see section 4.)
In order to translate efficiently from the format used in the OSM to a synthesiser-independent format for spoken output, we have devised an algorithm based on the off-screen model generated for each specific document. As was said in section 2, flags are stored in the internal nodes to indicate whether changes of formatting occur at a lower level. The algorithm simply checks the formatting of each level as it goes and, if no change occurs at that level, the text at the levels below is output to the browser with no additional control codes. However, if a formatting change is detected, the translator drops down a level and goes through the same process, until the text contained in the terminal nodes is reached. At this juncture the algorithm scrutinises the formatting of the text and, where necessary, computes the prosodic changes needed to convey the visual appearance of the material to the blind user. An example will suffice to explain this algorithm.
Let us assume that a default reading voice (V1) has been chosen on a DECtalk synthesiser, and that we are translating a document of two sections with no sub-sections. The browser encounters the starting point for "section 1" in the OSM, and checks the information stored there to determine whether there is a need to alter the voice characteristics within that section. Let us assume that in section one no such alteration is needed: the material contained in this section can simply be output for use by the document browser. However, if in "section 2" there is some emphasised text in the first paragraph, the translator will not simply output the text: it will discover that a change occurs in the section, and so will examine the paragraph level. Here it will become apparent that there is a change in the first paragraph, so the translator will examine each of the words within this paragraph to deduce where the alteration in voice occurs. When the start of the emphasised text is found, V2 replaces V1 as the reading voice, until the attributes change again, when the voice is returned to V1 for normal reading.
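This traversal can be sketched as follows, using plain dictionaries for nodes and placeholder voice codes; the node layout and the codes are illustrative assumptions. A section whose flag is clear is output wholesale; otherwise the translator descends until the terminal text is reached and switches voice around the emphasised span:

```python
# Placeholder voice-change codes for a default and an emphasis voice.
V1, V2 = "[V1]", "[V2]"

def render(node, out):
    if node.get("text") is not None:                 # terminal node
        if node.get("emph"):
            out.extend([V2, node["text"], V1])       # switch, then restore
        else:
            out.append(node["text"])
    elif not node.get("has_change"):                 # no change below:
        out.extend(collect_text(node))               # output wholesale
    else:
        for child in node.get("children", []):       # drop down a level
            render(child, out)

def collect_text(node):
    if node.get("text") is not None:
        return [node["text"]]
    return [t for c in node.get("children", []) for t in collect_text(c)]

# Two sections; only the second contains emphasised text.
doc = {"has_change": True, "children": [
    {"has_change": False, "children": [{"text": "Section one text."}]},
    {"has_change": True, "children": [
        {"text": "Some "},
        {"text": "emphasised", "emph": True},
        {"text": " text."},
    ]},
]}
out = [V1]
render(doc, out)
print("".join(out))
```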
Though it is intended to incorporate mathematical translation into the TechRead system the majority of the work done thus far has been in the realms of conveying the structure and visual enhancement of a document to the listener. Therefore, as can be seen from previous sections, the algorithms devised to date have been for the production of both Brailled and spoken text. However, these algorithms have been designed with mathematics in mind, and it is our belief that minor modifications will ensure that this type of material will be translated successfully.
It is our intention to conduct a study over the coming months to determine what information sighted people extract when reading equations or other formula-based material. We intend to show them a series of mathematical expressions for various fixed lengths of time, and get them to write down what they recall. The use of varied lengths of time is intended to simulate different types of reader. For example, it is hoped that the replies we get after the subjects have seen the equations for the shortest time will indicate what a sighted person sees when they glance at an equation, while those we obtain after the longest period of observation will indicate what they recall after examining the mathematical material in depth. It is then hoped to analyse these results to determine the best and most effective way to give a listener a "glance" at an equation.
Much work has been done in this area to date. The method used in The Maths Project, for instance, was to convey the "glance" using musical tones. It is hoped to find a more natural alternative to this.
We currently envisage several reading modes for equations: verbose, overview and glance. In "glance" mode the system will announce the presence of each equation, followed by an indication of its components, e.g. "Equation: summation followed by integral followed by fraction". This information should be available from the formatting commands. TechRead is intended to process equations of the type and complexity encountered in pre-university examinations.
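Under the assumption that each equation's LaTeX source is available from the OSM, "glance" mode might be sketched like this; the component table is a small illustrative fragment, not TechRead's actual command coverage:

```python
import re

# Map top-level LaTeX commands to spoken component names (a fragment).
COMPONENT_NAMES = {
    "sum":  "summation",
    "int":  "integral",
    "frac": "fraction",
    "sqrt": "root",
}

def glance(latex_equation):
    # Scan the LaTeX source for commands and announce the known ones
    # in order, in the format given in the text above.
    components = [COMPONENT_NAMES[c]
                  for c in re.findall(r"\\([a-zA-Z]+)", latex_equation)
                  if c in COMPONENT_NAMES]
    return "Equation: " + " followed by ".join(components)

print(glance(r"\sum_{i=1}^{n} \int_0^1 \frac{x}{y}"))
# Equation: summation followed by integral followed by fraction
```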
We also foresee uses for this system in the realms of screen access technology. The traditional approach to designing screen reading software has been to start from the operating system, and then design add-ins which will cope with various types of package. It is our belief that TechRead could be adapted to cope with many different types of documents. For example, a spreadsheet is simply a table, and a document produced by a word processor is simply a document marked up in different ways. Accordingly, we believe that instead of designing screenreaders to cope with various operating systems, it may be possible to incorporate "style sheets" into the TechRead system, thus rendering many different types of document accessible. Finally, though LaTeX is being used at present, there is no reason why the input source could not be amended to SGML at a later date.
[1] DECtalk is a trademark of Digital Equipment Corporation. http://www.ultranet.com/~rongemma/
[2] KNUTH, Donald E., The TeXbook, Addison-Wesley, 1993.
[3] LAMPORT, Leslie, LaTeX: A Document Preparation System, Addison-Wesley, 1986.
[4] MONAGHAN, A. I. C., Intonation in a Text-to-Speech Conversion System, Edinburgh University Ph.D. thesis, 1991.
[5] MONAGHAN, A. I. C., Intonation Accent Placement in a Concept-to-Dialogue System, Proceedings of the AAAI/ESCA/IEEE Conference on Speech Synthesis, New York, September 1994, pp. 171-174.
[6] MONAGHAN, A. I. C. & LADD, D. R., Manipulating Synthetic Intonation for Speaker Characterisation, ICASSP 1991, vol. 1, pp. 453-456.
[7] RAMAN, T. V., Audio System for Technical Readings, Cornell University Ph.D. thesis, 1994.
[8] The Maths Project.