

Working Papers for 1995



CA-0195:
Database Rules and Time: Some Extensions to the SQL Standard,
Liam O'Neill and A.F. Smeaton.

The subject of this paper is the incorporation of temporal semantics into database rules and how the syntax needed to support this might be reconciled with the evolving SQL standard. A working syntax, free of any limitations of the SQL standard, is defined and presented, and an operational definition for it is developed through application of the working syntax. A graphical representation of the domain is given using an adapted object-oriented modelling notation. When an attempt was made to re-cast our working syntax into SQL, a satisfactory mapping that preserved the semantics of the original could not be achieved, which suggests a shortcoming in the current standards. Support for time-based triggers, cyclic operations, delayed actions and rule lifetimes necessitated appropriate modifications to the basic SQL3 draft syntax. The proposed extensions capture all of the semantics required for the specification of time-based rules.

CA-0295:
Experiments on the Automatic Construction of Hypertext from Text,
A.F. Smeaton and Patrick J. Morrissey

The problem of (semi-)automatically turning text into hypertext is one that has been identified as important to the growth and development of hypertext as a way of organising information. In this paper we describe an approach we have developed to semi-automatically generate a hypertext from linear texts. This is based on initially creating nodes and composite nodes composed of mini-hypertexts. We then compute node-node similarity values using standard information retrieval techniques, and these similarity measures are used to selectively create node-node links based on the strength of similarity between nodes. What makes our process novel is that the link creation process also uses values from a dynamically computed metric which measures the topological compactness of the overall hypertext being generated. Thus link creation is a selective process based not only on node-node similarity but also on the overall layout of the hypertext. Experiments on generating a hypertext from a collection of 846 software product descriptions comprising 8.5 Mbytes of text are described. Our experiments with a variety of IR techniques and link creation approaches yield some guidelines on how the process should be automated. Finally, this text-to-hypertext conversion method is put into the context of an overall hypertext authoring tool currently under development.
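
As a rough, self-contained illustration of the similarity-driven part of this process (the dynamically computed compactness feedback is not shown), the sketch below builds term-frequency vectors for node texts, computes cosine similarities and proposes links above a cut-off. The tokenisation, weighting scheme and threshold value are assumptions for illustration only, not the IR techniques or parameters used in the experiments.

    # Illustrative sketch only: node-node similarity for selective link creation.
    import math
    import re
    from collections import Counter

    def term_vector(text):
        """Build a simple term-frequency vector for a node's text."""
        return Counter(re.findall(r"[a-z]+", text.lower()))

    def cosine(v1, v2):
        """Cosine similarity between two term-frequency vectors."""
        dot = sum(v1[t] * v2[t] for t in set(v1) & set(v2))
        norm = math.sqrt(sum(c * c for c in v1.values())) * \
               math.sqrt(sum(c * c for c in v2.values()))
        return dot / norm if norm else 0.0

    def candidate_links(nodes, threshold=0.3):
        """Propose node-node links whose similarity exceeds an assumed cut-off."""
        vecs = {name: term_vector(text) for name, text in nodes.items()}
        names = sorted(vecs)
        return [(a, b, cosine(vecs[a], vecs[b]))
                for i, a in enumerate(names) for b in names[i + 1:]
                if cosine(vecs[a], vecs[b]) >= threshold]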

CA-0395:
Using WordNet in a Knowledge-Based Approach to Information Retrieval,
R. Richardson and A.F. Smeaton

The application of natural language processing tools and techniques to information retrieval tasks has long been identified as potentially useful for improving the quality of information retrieval. Traditionally, IR has been based on matching words or terms in a query with words or terms in a document. In this paper we introduce an approach to IR based on computing a semantic distance measure between concepts or words and using this word distance to compute a similarity between a query and a document. Two such semantic distance measures are presented in this paper and both are benchmarked on queries and documents from the TREC collection. Although our results in terms of precision and recall are disappointing, we rationalise this in terms of our experimental setup, and our results show promise for future work in this area.
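
As a minimal sketch of what a path-based semantic distance can look like (the two measures used in the paper are not reproduced here), the fragment below counts the number of is-a edges between two concepts in a toy hierarchy and converts that distance into a similarity. The hierarchy fragment and the 1/(1+distance) conversion are assumptions made purely for illustration.

    # Minimal, illustrative path-length style semantic distance over an is-a hierarchy.
    from collections import deque

    # hypothetical fragment of a concept hierarchy (child -> parent)
    PARENT = {
        "car": "vehicle", "bicycle": "vehicle", "vehicle": "artifact",
        "dog": "animal", "cat": "animal", "animal": "entity",
        "artifact": "entity",
    }

    def neighbours(concept):
        """Edges of the undirected is-a graph around a concept."""
        ns = []
        if concept in PARENT:
            ns.append(PARENT[concept])
        ns.extend(c for c, p in PARENT.items() if p == concept)
        return ns

    def semantic_distance(a, b):
        """Shortest number of is-a edges between two concepts (breadth-first search)."""
        seen, queue = {a}, deque([(a, 0)])
        while queue:
            node, d = queue.popleft()
            if node == b:
                return d
            for n in neighbours(node):
                if n not in seen:
                    seen.add(n)
                    queue.append((n, d + 1))
        return None  # not connected in this fragment

    def similarity(a, b):
        d = semantic_distance(a, b)
        return 0.0 if d is None else 1.0 / (1.0 + d)

    # e.g. similarity("car", "bicycle") > similarity("car", "dog")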

CA-0495:
Thresholding the Postings Lists in Information Retrieval
F. Kelledy and A.F. Smeaton

A variety of methods for speeding up the response time of information retrieval processes have been put forward, and one of these is the idea of thresholding. The basic assumption behind thresholding is that in information retrieval storage structures, data can be organised in such a way as to allow cut-off points to be employed during processing. These cut-off points or thresholds are designed to reduce the amount of information to be processed while maintaining the quality of, or allowing only minimal degradation in, the response to the user's query. In this paper we report experiments carried out with a portion of the TREC data in which we introduced some features into the retrieval process designed to improve response time, and we found that these features significantly improve response time while maintaining the same level of retrieval effectiveness.
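
A minimal sketch of the general idea, assuming a simple term-at-a-time scorer over an inverted file: only the strongest postings for each query term are processed, and no new document score accumulators are created once a cap is reached. The specific cut-off rules and values below are illustrative assumptions, not the thresholds evaluated in the paper.

    # Hedged sketch of postings-list thresholding during term-at-a-time scoring.
    def ranked_retrieval(query_terms, index, postings_cutoff=100, max_accumulators=1000):
        """index maps term -> list of (doc_id, weight), sorted by weight descending."""
        accumulators = {}
        for term in query_terms:
            postings = index.get(term, [])
            for doc_id, weight in postings[:postings_cutoff]:    # threshold: ignore the tail
                if doc_id not in accumulators and len(accumulators) >= max_accumulators:
                    continue    # threshold: no new accumulators once the cap is reached
                accumulators[doc_id] = accumulators.get(doc_id, 0.0) + weight
        return sorted(accumulators.items(), key=lambda p: p[1], reverse=True)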

CA-0595:
Automatic Word Sense Disambiguation in a KBIR Application
R. Richardson and A.F. Smeaton

In this paper we discuss the implementation and design of an automatic word sense disambiguator. This semantic tagger is used in an overall Knowledge Based Information Retrieval (KBIR) system which uses a knowledge base derived from WordNet and two independent semantic similarity estimators. The KB is used as a controlled vocabulary to represent documents and queries and the semantic similarity estimators are employed to determine the degree of relatedness between these KB representations.

The process of arriving at a KB representation of documents and queries involves locating KB nodes equivalent to index and query terms. It was found that, on average, 72% of index terms had more than one KB sense, and the semantic tagger was subsequently developed to determine the most likely sense in these cases. The surrounding words in the text are used as a context when disambiguating ambiguous terms. The tagger itself is made up of four independent components, each of which allocates scores to the senses of ambiguous terms. The sense(s) with the highest scores are deemed the "winners". Evaluation of the semantic tagger has so far been within the context of results from the KBIR system; however, an independent evaluation using a manually tagged corpus is currently taking place.
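
As an illustration of a single overlap-based scoring component in this spirit (the tagger's four actual components are not reproduced here), the sketch below scores each candidate sense of an ambiguous term by its overlap with the surrounding context words; the sense representation and context handling are assumptions.

    # Illustrative, Lesk-style scoring of candidate senses against a context window.
    def score_senses(context_words, sense_glosses):
        """sense_glosses maps each candidate sense id to a set of related words."""
        context = set(w.lower() for w in context_words)
        scores = {sense: len(words & context)
                  for sense, words in sense_glosses.items()}
        best = max(scores.values(), default=0)
        winners = [s for s, sc in scores.items() if sc == best]
        return scores, winners    # the sense(s) with the highest score are the "winners"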

CA-0695:
A Method for Producing LOTOS Specifications from Real Time Structured Analysis Specifications
D. Martin and R. Verbruggen.
To be presented at the 17th International Conference on Software Engineering Workshop on Formal Methods Application in Software Engineering Practice in Seattle, April 25 1995

Despite the advantages associated with the use of formal (mathematical) techniques for the specification of computerised systems, industry has been slow to abandon existing informal techniques. A suggested solution to this situation involves integrating informal and formal techniques within the development process. Such an integration would allow the advantages of both techniques to be exploited. This paper describes a method for integrating the widely used models of Real Time Structured Analysis with the formal specification technique LOTOS. The approach suggested uses the Real Time Structured Analysis specification as a cognitive vehicle for producing a LOTOS specification, and is illustrated by means of a small case study.

CA-0795:
Steganography: its history and its application to computer-based data files
P. Davern and M. Scott.

Steganography is the science of hiding data within otherwise ordinary text or images. This document is split into two sections. Section one gives a brief history of the use of steganography by mankind, describing interesting events in history where steganography was used to great or sometimes disastrous effect. Section two describes some computer applications which apply steganographic techniques to allow a user to hide information in data files.
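
Purely as an illustration of one classical technique for hiding information in a data file, the sketch below embeds a message in the least significant bits of a carrier's bytes; it is not taken from the applications described in section two.

    # Least-significant-bit embedding, sketched for illustration only.
    def embed_lsb(carrier: bytes, message: bytes) -> bytes:
        """Hide message bits in the least significant bit of each carrier byte."""
        bits = [(byte >> (7 - i)) & 1 for byte in message for i in range(8)]
        if len(bits) > len(carrier):
            raise ValueError("carrier too small for message")
        out = bytearray(carrier)
        for i, bit in enumerate(bits):
            out[i] = (out[i] & 0xFE) | bit
        return bytes(out)

    def extract_lsb(carrier: bytes, length: int) -> bytes:
        """Recover `length` bytes hidden by embed_lsb."""
        bits = [carrier[i] & 1 for i in range(length * 8)]
        return bytes(sum(bit << (7 - i) for i, bit in enumerate(bits[n*8:(n+1)*8]))
                     for n in range(length))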

CA-0895:
Building Hypertext under the Influence of Topology Metrics
A.F. Smeaton.

It is well known that there are many problems inhibiting the spread of hypertext and hypermedia, and among those problems is the cognitive overload or strain on the authors of such material. In addition to authoring the content of each node in a hypertext, an author is responsible for creating information links and link anchors while also having to maintain an overall balanced hypertext document. Our recent work at Dublin City University has addressed this aspect of the authoring problem by investigating whether the values of graph topology metrics can be used as an aid to a hypertext author in maintaining an overall balanced document. In this position paper we report on the progress we have made to date towards the goal of realising such an aid. As an exercise we have applied the compactness metric to a number of hypertexts which have been hand-crafted, constructed semi-automatically or generated in a completely automatic way, each created for a different purpose. The results of these exercises are reported here. Some of these documents have been marked up in HTML and can thus form part of the World Wide Web.
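
A minimal sketch of a common formulation of hypertext compactness (in the style of Botafogo et al.) is given below: converted distances between all ordered node pairs are summed and normalised so that 1 corresponds to a hypertext in which every node links directly to every other, and 0 to a completely disconnected one. The conversion constant used for unreachable pairs is an assumption, as is the directed-link representation.

    # Illustrative compactness computation over a directed hypertext graph.
    from collections import deque

    def compactness(nodes, links):
        """nodes: list of node ids; links: set of directed (src, dst) pairs."""
        n = len(nodes)
        k = n    # assumed conversion constant for unreachable node pairs
        adj = {u: [v for (s, v) in links if s == u] for u in nodes}

        def distances_from(src):
            dist, queue = {src: 0}, deque([src])
            while queue:
                u = queue.popleft()
                for v in adj[u]:
                    if v not in dist:
                        dist[v] = dist[u] + 1
                        queue.append(v)
            return dist

        total = 0
        for u in nodes:
            d = distances_from(u)
            total += sum(d.get(v, k) for v in nodes if v != u)

        max_total = (n * n - n) * k    # every pair unreachable
        min_total = n * n - n          # every pair at distance 1
        return (max_total - total) / (max_total - min_total)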

CA-0995:
Regular and Context-Free Institutions
J. Power and T. Moynihan.

Institutions are abstract frameworks used for the description of formal systems; one common application is giving formal semantics for languages used in formal specification. This facilitates comparison and integration of specifications written using these formalisms.
Two of the main formalisms used in the description of programming language syntax are regular expressions and context-free grammars. In this report we treat these as separate specification formalisms, and construct an institution for each.

CA-1095:
Institutions for Programming Language Semantics
J. Power and T. Moynihan.

Building on previous work in programming language syntax, this paper gives institutions for some formalisms used in the description of programming language semantics. Specifically, we construct institutions for attribute grammars and van Wijngaarden grammars, and examine the possibility of extending the framework to include denotational- and axiomatic-style definitions.

CA-1195:
New Tools for Financial Forecasting
S. Donohue, M. O'hEigeartaigh, and F. Bradley.

This paper attempts to assess the potential of back-propagation neural networks in predicting spot foreign exchange rates. A newer technique is also introduced and appraised: genetic programming based symbolic regression. Both techniques are used to predict 1, 2, 3, 4 and 20 day-ahead spot U.S. Dollar/Japanese Yen exchange rates using four years of daily sampled values taken from the period 1990-1994.
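
A minimal sketch of the back-propagation side of such an experiment, assuming a small one-hidden-layer network that maps a window of past rates to the next value; the window size, network size, learning rate and training regime are illustrative assumptions, and the genetic programming approach is not shown.

    # Illustrative back-propagation training for one-step-ahead prediction.
    import numpy as np

    def train_backprop(series, window=4, hidden=8, lr=0.01, epochs=500):
        """Fit a one-hidden-layer net mapping `window` past values to the next one."""
        X = np.array([series[i:i + window] for i in range(len(series) - window)])
        y = np.array(series[window:]).reshape(-1, 1)
        rng = np.random.default_rng(0)
        W1 = rng.normal(scale=0.1, size=(window, hidden)); b1 = np.zeros(hidden)
        W2 = rng.normal(scale=0.1, size=(hidden, 1));      b2 = np.zeros(1)
        for _ in range(epochs):
            h = np.tanh(X @ W1 + b1)            # forward pass
            pred = h @ W2 + b2
            err = pred - y                       # gradient of squared error, up to a constant
            gW2 = h.T @ err / len(X); gb2 = err.mean(axis=0)
            dh = (err @ W2.T) * (1 - h ** 2)     # back-propagate through tanh
            gW1 = X.T @ dh / len(X); gb1 = dh.mean(axis=0)
            W1 -= lr * gW1; b1 -= lr * gb1; W2 -= lr * gW2; b2 -= lr * gb2
        return W1, b1, W2, b2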

CA-1295:
A Model for Information Retrieval from Free Text in a Police Environment
L. Baxter and A.F. Smeaton.

The system proposed here is intended to address two aspects of a police investigation. The first is the need for a facility to extract and compare information and the second is the necessity for an advanced text retrieval tool.

CA-1395:
Check-Out Check-In Replication
E. Hanley and D. Sinclair.

Check-out check-in replication is an attempt to address the needs of the more volatile mobile computing environment. It allows data to be 'checked out' of a server database and downloaded onto a PC or laptop. While the data is checked out, it can be independently updated both at the server site and on the PC (or 'client' site). At some later stage the data is 'checked in' again, and the check-in process updates the server database with any changes made to the checked-out copy. The aim of this project is to simulate this situation by building and testing a prototype implementation.
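
A rough sketch of the check-in step under simple assumptions (records keyed by identifier, with the check-out baseline retained for comparison): client changes are merged into the server state, and a conflict is reported when both sides changed the same record differently. The prototype's actual data model and conflict policy may differ.

    # Illustrative check-in merge with conflict detection.
    def check_in(baseline, server, client):
        """Each argument maps record_id -> value; `baseline` is the state at check-out time."""
        merged, conflicts = dict(server), []
        for rid, client_value in client.items():
            base_value = baseline.get(rid)
            server_value = server.get(rid)
            if client_value == base_value:
                continue                       # client did not change this record
            if server_value == base_value:
                merged[rid] = client_value     # only the client changed it: accept the change
            elif server_value != client_value:
                conflicts.append(rid)          # both sides changed it differently
        return merged, conflicts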

CA-1495:
Cost Estimation in a Medium Sized Software House
G. Gray and R. Verbruggen.

This paper examines how best to carry out software cost estimation. The aim of the paper is to recommend a cost estimating methodology suitable for a medium sized software house which develops and maintains customised packages. Thus the focus is on a stable environment where there is a lot of relevant expertise, and many projects are similar in nature.

CA-1595:
A Formal Approach to HW/SW Co-Design: The INSYDE Project
D. Sinclair, L. Cuypers, K. Verschaeve, E. Holz, A. Birbas, V. Mariatos, N. Kyrloglou, J.L. Roux.

This paper presents a formal approach to the co-design of hybrid systems based on object-oriented analysis and design, and the formal description languages VHDL and SDL. This methodology covers the whole development process from requirements capture, through design and implementation, to validation. The paper also presents some of our experiences to date with the methodology.

CA-1695:
Resolving Anchor Locations for Pre-defined Links in Hypertext Nodes
P. Donnelly and A. Smeaton.

This work presents a method by which hypertexts may be initially authored with anchors presented as a list of nodes separate from the information content of the nodes; locations for replacement anchors, which form part of the actual information content of the nodes, are subsequently resolved automatically. This eases the cognitive overload problem for the author while not compromising the usability of the final product.
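
One simple way such resolution could work, sketched under illustrative assumptions (exact phrase match with a single-word fallback); the actual resolution strategy used in this work may differ.

    # Illustrative anchor-location resolution within a node's text.
    def resolve_anchor(node_text, anchor_phrase):
        """Return (start, end) character offsets for the anchor, or None if not found."""
        lowered = node_text.lower()
        phrase = anchor_phrase.lower()
        pos = lowered.find(phrase)
        if pos >= 0:
            return pos, pos + len(phrase)
        for word in phrase.split():            # fall back to a single-word match
            pos = lowered.find(word)
            if pos >= 0:
                return pos, pos + len(word)
        return None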

CA-1795:
Information Retrieval Applications: Using Linguistic Resources or using Language Processing
A.F. Smeaton, F. Kelledy , R. O'Donnell, I. Quigley, R. Richardson and E. Townsend

In the work we report here we are concerned with delivering information retrieval applications to users where our overriding constraints are the handling of large volumes of domain-independent texts. Our observations from published work and from our own previously reported efforts are that NLP techniques, beyond simple tagging and morphological analysis, do not yet offer enhancements to retrieval effectiveness, and certainly not to retrieval efficiency, when dealing with large-volume, domain-independent applications. With this in mind we have adopted a philosophy of using NLP resources rather than NLP processes in our work. Here we outline details of four of our ongoing projects: retrieval on +2 Gbytes of English texts, retrieval on +200 Mbytes of Spanish newspaper texts, retrieval on +3000 multimedia images, and information filtering of USENET News articles.
This article appears in the proceedings of IA95, June 1995, Montpellier, France

CA-1895:
Fast Machine Code for Modular Multiplication
M. Scott.

Techniques are described to speed up the calculation of modular products, which are required at the heart of software implementations of many cryptographic protocols such as the well-known RSA Public Key algorithm. By combining ideas from Comba and Dusse and Kaliski, and exploiting Montgomery's method for modular reduction, very efficient machine code routines can be generated. The method is illustrated for the popular 80386/80486/Pentium Family of processors, but uses only a very small subset of the instruction set, and so can easily be adapted to many other conventional processors with a word length of 16 bits or more.
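
Setting the word-level machine-code tricks aside, the underlying Montgomery arithmetic can be sketched in a few lines. The sketch below computes a*b*R^-1 mod n using the standard quantities R and n'; the per-word Comba-style accumulation that makes the real routines fast is not shown.

    # Illustrative Montgomery multiplication (high-level, not the machine-code routines).
    def montgomery_mul(a, b, n, r_bits):
        """Compute a*b*R^-1 mod n, with R = 2**r_bits, n odd, and a, b < n < R."""
        R = 1 << r_bits
        n_prime = (-pow(n, -1, R)) % R         # n' = -n^-1 mod R
        t = a * b
        m = (t * n_prime) % R                  # m = t*n' mod R
        u = (t + m * n) >> r_bits              # (t + m*n) is exactly divisible by R
        return u - n if u >= n else u

    # To multiply x and y mod n: convert both to Montgomery form (x*R mod n, y*R mod n),
    # call montgomery_mul on them, then convert the result back out of Montgomery form.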

CA-1995:
Distributed Multimedia QoS Parameters from Presentation Modelling by Coloured Petri Nets
A.F. Smeaton and A. Gregan.

We address the problems associated with storing and retrieving multimedia presentations where the multimedia objects making up the presentation are stored in a distributed manner on a network of computers. In order to deliver multimedia presentations to users in an efficient manner, quality of service (QoS) parameters for various components such as networks, disks, etc., need to be defined before a multimedia presentation is launched. It is our hypothesis that a multimedia presentation authoring tool should use an internal representation known as a coloured Petri net (CPN) in order to represent the synchronisation and the spatial aspects of a multimedia presentation. The CPN representation for a presentation would be represented internally only and need not be visible. In this report we outline how some commonly used multimedia synchronisation features can be represented using CPNs and we discuss what kinds of persistent storage would be needed for this.
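
As a much-simplified illustration of the style of synchronisation modelling involved (uncoloured, and not the CPN representation from the report), the sketch below fires a transition only when every input place holds a token, which is one way to express "start audio and video together only when both are ready".

    # Illustrative place/transition firing rule for synchronisation.
    def fire(marking, transition):
        """marking: place -> token count; transition: (input_places, output_places)."""
        inputs, outputs = transition
        if any(marking.get(p, 0) < 1 for p in inputs):
            return None                        # not enabled: some object is not ready yet
        new = dict(marking)
        for p in inputs:
            new[p] -= 1
        for p in outputs:
            new[p] = new.get(p, 0) + 1
        return new

    # e.g. start = ({"audio_ready", "video_ready"}, {"playing"})
    # fire({"audio_ready": 1, "video_ready": 0}, start) -> None (video not yet fetched)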

CA-2095:
Running TREC-4 Experiments: A Chronological Report of Query Expansion Experiments Carried out as Part of TREC-4
A.F. Smeaton and Catherine Berrut.

This report describes work done as part of the TREC-4 benchmarking exercise by a team from Dublin City University, with help from the CLIPS-IMAG laboratory while the first author was on sabbatical leave at that laboratory in Grenoble, France. In the first section we briefly outline what TREC is and what its modus operandi is, and following that we describe the experimental environment in which we worked. The bulk of the work reported is based on an automatic expansion of the original query using a knowledge base derived from WordNet. The knowledge base, WordNet and our previous applications of them are also presented. In section 4 we give a chronological presentation of the query expansion methods we tried and the experimental results, vis-a-vis retrieval effectiveness, we obtained. This presentation is accompanied by an analysis of results as they are presented, and in a concluding section we analyse and try to explain the results we obtained.
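
The general shape of such an expansion can be sketched as follows, with hypothetical lookup functions standing in for the WordNet-derived knowledge base; the depth thresholds and the exact expansion rules used in the experiments are not reproduced here.

    # Illustrative query expansion driven by term specificity in a concept hierarchy.
    def expand_query(query_terms, hypernyms, hyponyms, depth_of):
        """hypernyms/hyponyms map a term to related terms; depth_of gives its specificity."""
        expanded = []
        for term in query_terms:
            if depth_of(term) <= 3:                    # broad term: add narrower terms
                expanded.extend(hyponyms(term))
            elif depth_of(term) >= 8:                  # specific term: add broader terms
                expanded.extend(hypernyms(term))
            else:                                      # in between: add both
                expanded.extend(hypernyms(term))
                expanded.extend(hyponyms(term))
            expanded.append(term)
        return expanded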

CA-2195:
An Overview of Discourse Representation Theory
E. Mahon and A. Way.

Discourse Representation Theory (DRT), developed by Hans Kamp (Kamp 1981), works on a discourse by processing it sentence by sentence, building on what has gone before to produce a semantic representation, the Discourse Representation Structure (DRS). It is a departure from traditional methods which treat sentences in isolation, giving each sentence a separate semantic representation, and so fail to find antecedents for anaphoric pronouns within a discourse. In DRT, anaphora are analysed not as a relationship between pronouns and other NPs, but as one between pronouns and the discourse referents that are already present in the semantic representation under construction. In this paper the basic mechanisms, construction rules and antecedent prediction methods of DRT will be discussed.
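
As a worked illustration (not taken from the paper), the two-sentence discourse "A man walks. He whistles." yields a single DRS in which the pronoun is resolved to the discourse referent introduced by the first sentence, shown here in the usual linear notation:

    [ x, y | man(x), walks(x), y = x, whistles(y) ]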

CA-2295:
TREC-4 Experiments at Dublin City University: Thresholding Posting Lists, Query Expansion with WordNet and POS Tagging of Spanish
A.F. Smeaton, F. Kelledy and R. O'Donnell

In this paper we describe work done as part of the TREC-4 benchmarking exercise by a team from Dublin City University. In TREC-4 we had 3 activities as follows:

  1. In work on improving the efficiency of standard SMART-like query processing we have applied various thresholding processes to the postings list of an inverted file and we have limited the number of document score accumulators available during query processing. The first run we submitted for evaluation in TREC-4 (DCU951) used our best set of thresholding and accumulator set parameters.
  2. The second run we submitted is based upon a query expansion using terms from WordNet. Essentially, for each original query term we determine its level of specificity or abstraction; for broad terms we add more specific terms, for specific original terms we add broader ones, and for terms in between we add both broader and narrower terms. When the query is expanded we then delete all the original query terms, in order to add to the judged pool documents that our expansion would find but that would not have been found by other retrieval runs. This is run DCU952.
  3. The third run we submitted was for Spanish data. We ran the entire document corpus through a POS tagger and indexed documents (and queries) by a combination of the base form of non-stopwords plus their POS class. Retrieval is performed using SMART, with extra weights for query and document terms depending on their POS class (a rough sketch of this indexing scheme is given after this list). The performance figures we obtained in terms of precision and recall are given at the end of the paper.
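
A rough sketch of the base-form-plus-POS indexing scheme from the third run, with a hypothetical tagger interface and an assumed weight table; the actual tagger, stopword list and SMART weighting are not reproduced.

    # Illustrative indexing by base form plus POS class with assumed extra weights.
    POS_WEIGHT = {"NOUN": 2.0, "VERB": 1.5, "ADJ": 1.2}    # assumed weight table

    def index_terms(tagged_tokens, stopwords):
        """tagged_tokens: list of (base_form, pos_class) pairs produced by a POS tagger."""
        terms = {}
        for base, pos in tagged_tokens:
            if base in stopwords:
                continue
            key = f"{base}/{pos}"                          # e.g. "correr/VERB"
            terms[key] = terms.get(key, 0.0) + POS_WEIGHT.get(pos, 1.0)
        return terms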

CA-2395:
Preprocessing and the General Valid Inequalities for Mixed Integer Linear Programs,
A. Mushi and M. O'hEigeartaigh.

Recent advances in mathematical programming include the improved performance of Integer Programming (IP) using strong Linear Programming formulations. One focus is on the area of reformulation and preprocessing of the LP relaxation. Preprocessing refers to elementary operations that can be performed automatically to tighten a given formulated model. These strategies can lead to vast improvements in the solution times of IP problems. The aim of this paper is twofold. Firstly, we present a brief survey of the early efforts in the search for a general solution of IP and MIP problems by the cutting-planes approach, and give a brief description of the general valid inequalities derived for this purpose. Secondly, we present procedures that automatically transform a given formulated model into a tighter equivalent representation (preprocessing). We implement these procedures and present computational results.
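
As an illustration of one such elementary operation in isolation, the sketch below tightens variable bounds using a single constraint of the form sum(a_j * x_j) <= b; integer rounding and the other preprocessing operations described in the paper are not included.

    # Illustrative bound tightening from one linear constraint sum(a_j * x_j) <= b.
    def tighten_bounds(a, b, lower, upper):
        """a, lower, upper: lists indexed by variable; returns new (lower, upper) lists."""
        # minimum possible activity of the constraint given the current bounds
        lmin = sum(a[j] * (lower[j] if a[j] > 0 else upper[j]) for j in range(len(a)))
        new_lower, new_upper = list(lower), list(upper)
        for k in range(len(a)):
            if a[k] > 0:
                residual = b - (lmin - a[k] * lower[k])    # slack left for variable k
                new_upper[k] = min(upper[k], residual / a[k])
            elif a[k] < 0:
                residual = b - (lmin - a[k] * upper[k])
                new_lower[k] = max(lower[k], residual / a[k])
        return new_lower, new_upper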