### School of Computing - Research - Working Papers - 1998




# Working Papers for 1998




CA-0198:
String Processing Algorithms using Reconfigurable Computing,
Paul Jordan, David Sinclair.

The use of dedicated hardware to increase the performance of computationally intensive applications, such as string processing has it's drawbacks, mainly that the logic is dedicated to just that application, and can be costly. Using FPGA-based reconfigurable hardware, it is possible to implement the specific logic required for a given task without having to construct new hardware for each application. FPGA-based reconfigurable computing is quickly gaining acceptance due to the obvious performance advantages of implementing complex algorithms directly in hardware. Systems created with SRAM-based FPGAs can combine the flexibility of a programmable solution with the performance of dedicated hardware.
This paper covers the design and implementation of various string processing algorithms, using a Xilinx XC6216 reconfigurable FPGA based demo board. A performance comparison between the general-purpose computing implementation (the Software model) and the Reconfigurable Computing implementation is made. The analysis and comparison of these two implementations is used to determine what factors contributed to the hardware's performance advantage over the software and what the relative contributions of these factors were. This thesis also provides some insight into which kinds of algorithms can be effectively accelerated using FPGAs. The problems associated with Reconfigurable Computing machines are also discussed as well as the challenges and tools that need to be developed to enable Reconfigurable Computing to be used in a wider variety of applications.

CA-0298:
The Semantics of a Design Methodology For Hybrid Systems,
Aboud Hussein, David Sinclair.

The work presented here is part of a bigger effort to produce an integrated methodology for the development of hybrid systems which exploits benefits of object-orientation and techniques for designing real-time embedded systems. In the proposed design methodology for hybrid systems in [HS97] the syntax for declaration of objects' behaviours and of writing different identifiers have been presented. The work presented here is an attempt to describe the static semantics of object design model (ODM) using setwise notations and describe the dynamic semantics of an ODM specifications by its translation into hybrid automata specifications.

CA-0398:
Ad hoc Retrieval Using Thresholds, WSTs for French Mono-lingual Retrieval, Document- at-a-Glance for High Precision and Triphone Windows for Spoken Documents ,
Alan Smeaton, Fergus Kelledy, Gerard Quinn.

This paper describes work done by a team from Dublin City University as part of TREC-6. In this TREC exercise we completed series of runs in 4 categories.

• The first was the mainline ad hoc retrieval task in which we repeated our entry for TREC-5, without modification. This is based on applying various thresholds to processing a query including query term and posting list thresholds, in order to improve retrieval efficiency. As our previous work has shown, this can be done without any loss in retrieval effectiveness.
• Our second set of submitted runs were as part of the cross-lingual retrieval track where we ran French topics against French texts, effectively mono-lingual retrieval. What is novel about our approach is that it is based upon matching word shape tokens derived from character shape codes, rather than matching word stems or base forms. This technique is useful for retrieving from scanned document images rather than full texts and is something we are currently refining for English texts (and English queries). With those other experiments we have obtained surprisingly effective retrieval and this venture in TREC-6 was to see how effective WST-based retrieval could be for French.
• The third series of experiments we submitted were based on the high precision track in which we used a graphical representation of a ranked list of documents and the positional occurrences of search terms within those top-ranked documents, relative to each other.
• Our final experiments were as part of the spoken document retrieval track in which we removed the tags used for story bounds, turned transcripts and topics into a phonetic representation using a phoneme dictionary and we then retrieved story identifiers based on a triphone match between topic and fixed-width windows of triphones in the transcripts. We also applied a weighting function to triphones as they occurred in story windows based on their offset within those windows.

CA-0498:
Key Exchange using Lucas exponentiation,
Mike Scott.

By using Lucas exponentiation we improve a key distribution scheme first proposed by McCurley into one which is secure against passive attack assuming the intractability of integer factorisation and a discrete logarithm problem of equal complexity.

CA-0598:
Providing Views and Closure for the ODMG Object Model,
Jessie B. Kennedy, Peter J. Barclay, Mark Roantree.

The ODMG Object Model offers a standard for object-oriented database designers while attempting to address some issues of interoperability. Our research is focused on the viability of using the ODMG data model as a canonical data model in a multidatabase environment, and where weaknesses are identified we have proposed enhancements to enable the model to suit the specific needs of this type of distributed database system. This paper describes our efforts to extend its relational style algebra, and to provide query closure and a viewing mechanism for OQL to construct multidatabase schemas

CA-0698:
Urban Network Performance Under Simple CA Microsimulation,
A. Hammad, M. O'hEigeartaigh, Heather Ruskin.

In this paper we use a CA model to simulate traffic flow in Urban Networks. Different network sizes have been tested, where all nodes are controlled and the signal operating condition is assumed to be a pre-timed signal (Fixed Time Control). Various traffic conditions are considered at each intersection. We present two types of simulations; the first considers the transient movement of the cars while in the second cars are allowed to park inside the network, using underground parks. A stochastic feeding mechanism, in which car arrivals follow a Poisson Process, has been implemented throughout the simulation using different rates of arrivals. It appears that there is a critical arrival rate above which the transportation through the network is not efficient any more and this rate is independent of the network size. Long -term events, such as the blocking of some network exits are modelled in our simulation, which successfully reproduces the jamming effect typically seen in highly congested networks at peak- times. We demonstrate that cellular automata provide the ability to perform microscopic simulation of realistic urban networks, given that the computational requirements do not depend on the number of cars inside the networks but only on network size.

CA-0798:
A Generative Communication Service for Database Interoperability,
W. Hasselbring, Mark Roantree.

Parallel and distributed programming is conceptually harder to undertake and to understand than sequential programming, because a programmer often has to manage the coexistence and coordination of multiple concurrent activities. The model of Generative Communication' in Linda --- a paradigm that has been developed for parallel computing --- emphasizes the decoupling of cooperating parallel processes; thus, relieving the programmer from the burden of having to consider all process inter-relations explicitly. In many application areas, data is distributed over a multitude of heterogeneous, autonomous information systems. These systems are often isolated and an exchange of data among them is not easy. On the other hand, support for dynamic exchange of data is required to improve the business processes. Cooperative information systems enable such autonomous systems to interoperate. They are complex systems of systems which require a well designed and flexible software architecture. The Linda model had a great influence on research in parallel programming languages. Stimulated by this success, a Generative Communication Service, which offers a very flexible associative addressing mechanism based on metadata matching, has been developed for supporting interoperability of cooperative information systems. Some design patterns guided the construction of the resulting communication service that has been implemented on top of CORBA for an ODMG canonical data model.

CA-0898:
User-Chosen Phrases in Interactive Query Formulation for Information Retrieval,
Alan Smeaton, Fergus Kelledy.

The impact of using phrases as content representation for documents and for queries has generally been accepted as a desirable feature in information retrieval systems because phrases are generally regarded as being more content-bearing than their constituent words. This has been borne by experiments in which the impact of phrases on retrieval performance has usually been found to be positive. However, most of the experimental results reported have derived phrases from documents and from queries in a fully automatic way. While this is acceptable for document indexing it is less acceptable for query formulation which is increasingly heading towards being an iterative process with users investing time in browsing the term space to choose appropriate search terms. In this paper we report a series of experiments in which two users, one experienced and the other a novice, formulate their queries by browsing the term space in advance of issuing a retrieval request. For these users we analyse the relative contributions and the impact of single words and multi-word phrases as search terms, on overall retrieval performance. Our results have implications for how choosing phrases as search terms should be presented to novice and to experienced searchers.

CA-0998:
Developing Online Virtual Lectures for Course Delivery: A Case Study and an Argument In Favour,
Alan Smeaton.

In this paper we briefly describe how we replaced traditional lectures by an online counterpart which had audio and synchronised visuals for students to "play" in an undergraduate course driven by student-directed learning. Previous analysis of usage and exam performance has shown that our mode of virtual lectures behaves just like traditional lectures with good study habits or technical know-how not necessarily being rewarded by better exam performance. We present details of an end-of-course survey we carried out on 115 students who have completed the course and the exam, and we present results of that survey. We also discuss the costs associated with creating our virtual lectures and we argue that the virtual lecture mode of online delivery of material, as opposed to a hypertextual one, or one based on concept maps, is at least as attractive as other organisations because of the comparative ease of the task on the lecturer yet we still provide numerous benefits to the student in terms of access. Our organisation of material uses the principles of deconstruction of material into components but not the re-construction into a concept map or network and the argument of this paper is that the comparative ease with which this can be done on the part of the lecturer may encourage more teaching staff to become involved in making course material available online.

CA-1098:
Cryptographic Techniques for Databases,
Gary McCabe, Mike Scott.

Unlike encryption of data for transmission or for straight file storage, the application of cryptography to databases presents us with its very own set of problems. For example, we must consider such issues as the "unit of encryption/decryption": will it be at the Field or Record level? As the data is highly structured, database encryption reveals a most distinctive plethora of peculiarities. The diverse range of techniques for database encryption includes: Field Value Manipulation, Privacy Homomorphisms, Membership Lists, Schemes with Subkeys, Integrity Locks and Secure Multilevel Schemes. I will discuss the benefits of these various approaches and some of their limitations. Another trend that warrants discussion deals with "State Profiling" and the counteracting "Anonymous and Verifiable Databases". Both sides of this battle make use of Smart Cards.

CA-1198:
A Distributed Framework for Implementing a Multi-agent Assistant System,
Rory O'Connor, Eamon Gaffney

In this paper we report on our experiences of developing a CORBA based distributed multi-agent system as part of the Prompter project. The Prompter project aims to develop a decision-support tool to assist in the planning and managing of a software development project. This paper focuses on the architecture and design of a framework for the development of a set of distributed expert agents, which will provide the decision support and guidance features of the Prompter tool.

CA-1298:
Using Example-Based Reasoning for Selective Move Generation in Two Player Adversarial Games,
David Sinclair.

Given the processing speed of the best chess computer is 200 million times faster in terms of positions evaluated per second than a human chess expert, a grandmaster, the question is not how can a computer beat a chess grandmaster but rather how do chess grandmasters beat computers? In computing terms, the human expert's strength lies in the ability to significantly prune the search tree and to correctly evaluate the resulting positions. This paper addresses the first of these strengths and proposes an example-based reasoning mechanism to select candidate moves from a given chess position. The mechanism automatically generates an example-base from a database of grandmaster chess games using Principal Component Analysis to characterise the positions that occurred in the games database. Given a new position, its characterisation is compared to those in the example-base and a ranked list of n "similar" moves are returned. This forms the basis of an effective forward pruning mechanism in a two player adversarial game.

CA-1398:
MIPSMARK -- Software for Automatic Assessment of Programming Assignments for Teaching Computer Architecture and Assembly Language Programming,
John Waldron.

This paper discusses a method of teaching introductory computer architecture with a strong emphasis on assembly language programming tests. It then describes the design of the particular software used to automatically correct assembly language programs, together with software used by a lecturer to administer the course.

CA-1498:
Collection Management Using Various Object Models,
Adrian Skehill, Mark Roantree.

This paper introduces object-oriented databases, and explains the need for the capability to define and maintain collections of objects within these object oriented databases. An evaluation of using five different object models define and maintain collections is presented, together with a commentary of the strengths and weaknesses of each of these object models is presented

CA-1598:
Implementing Predictive Combinator Parsers in Standard ML,
James Power.

In this paper we present and discuss an implementation of combinator parsers in Standard ML. Several parsers are presented, ranging from a conventional backtracking parser to a predicative parser based on dynamically calculated lookahead sets. All of these parsers are based on the same principle of re-interpreting the standard union and concatenation operators of a context-free grammar, with the necessary overloading of these operators being provided via ML's module language.

CA-1698:
User-Mediated Word Shape Tokens for Querying Document Images,
Alan Smeaton, Jerh O'Connor.

Word Shape Tokens (WSTs) are tokens used to represent words based on the overall shape or contour of a word as it appears in printed text. A character shape code (CSC) mapping function is used to aggregate similarly shaped letters such as g" and y" into one single code to represent those letters. The rationale behind this is that it is far easier and more accurate to map a scanned image of a word or letter into its WST representation than it is to map into full ASCII. WSTs were initially applied to the task of language recognition and have proved useful in implementing a computationally lightweight form of OCR. In previous work, we have applied WST representations to information retrieval based on automatically deriving query WSTs from topic descriptions. In the work reported here we extend this to allow a user to judiciously select WSTs as search terms based on the number of surface forms of words which share that WST. We also factor into our experiments for the first time, the WST recognition errors found from an implementation of the WST recognition process. Our results encourage us to further develop the idea of using WSTs for retrieving scanned images of text documents.

CA-1798:
Retrieving Images of Scanned Text Documents,
Alan Smeaton.

Information retrieval is the task of finding documents, usually text, which are relevant to a user's information need. A conventional approach to information management of paper documents is normally based on classifying them into a hierarchical classification structure. More recently we have seen electronic document management systems which manage scanned images of documents in the same way as paper, or which do some OCR to determine machine-readable document content which may then be used for content-based information retrieval. In this paper we present an alternative to full-scale OCR of scanned document images in which words in scanned images of documents are represented as word shape tokens (WSTs) representing the approximate {\it shape} of words in text. We describe an approach to WST-based information retrieval that can be entirely automatic or can involve the user in refining the retrieval process. To measure the effectiveness of WST-based retrieval we have implemented and refined it on two collections of documents/queries/relevance judgments, one in English, the other in French. Each of these has over 250 Mbytes of ASCII text and in indexing documents we have faithfully re-created the kind and frequency of errors expected in a WST recognition process. When compared to word-based retrieval, the effectiveness of WST-based retrieval is surprisingly good, though erratic across different queries. We outline some of the directions we are pursuing in order to address this inconsistency of performance across queries and our results show real potential for WST-based retrieval of scanned document images.

CA-1898:
Taiscealai: Information Retrieval from an Archive of Spoken Radio News,
Alan Smeaton, Mike Morony, Gerard Quinn, Ronan Scaife.

In this paper we describe Taiscealai, a web-based system which provides content-based retrieval on an up-to-date archive of RTE radio news bulletins. Taiscealai automatically records and indexes news bulletins twice daily using a stream of phones recognised from the raw audio data. A user's typed query is matched against fixed length windows from the broadcasts. A user interface allows the news bulletins most likely to be relevant to be presented and the user to select sub-parts of the bulletins to be played. Many of the parameters we have chosen to use such as the size and amount of overlap of windows and the weighting of phones within those windows, have been determined within the framework of the TREC Spoken Document Retrieval track and are thus well-founded. We conclude the paper with a walkthrough of a worked example retrieval and an outline of our plans for extending Taiscealai into an integrated digital library for news.


`