25 Aug 1997 ............... Length about 4800 words (29000 bytes).
This is a WWW version of a document. You may copy it.
To ftp a postscript version of it to print click this.

Temporal and Situated Aspects of Data: Final report

Stephen W. Draper, Richard Cooper
Department of Psychology
University of Glasgow
Glasgow G12 8QQ U.K.
email: steve@psy.gla.ac.uk
WWW URL: http://www.psy.gla.ac.uk/~steve

Contents (click to jump to a section)

Introduction

This is the detailed report (section 16 of the final report), written in August 1997, for EPSRC grant GR/K25021 "Temporal and situated aspects of data", which ran October 1994 (actually April 1995 because of a late start) to May 1997.

This project aimed to study a number of problems in data modelling, particularly those associated with the treatment of time and with varying validity in the data. The strategy however was not to identify these problems, develop a general solution, and then to illustrate it by reference to some applications, but to use one particular problem domain as the focus and attempt to solve as many problems as possible that occurred in that domain.

The problem domain chosen was that of simple accountancy: for instance managing a personal bank account, the budget for a research project, or the accounts of a small company. These simple accounts are currently usually handled by spreadsheets. We were interested in identifying and solving those issues in this domain that are poorly supported by spreadsheets and represent quite general problems for data modelling and databases. A number of these problems and issues were described in the grant application. In addition we interviewed experienced users in our domain (responsible for managing small budgets and producing simple accounts), and this produced an important improvement in our understanding of the domain and the problems it presents to software attempting to support it.

We have developed a data model based on three types of object: time varying functions of money, status variables, and an explicit representation of the network of formulae linking money items. We can show how these are sufficient to solve many issues in our problem domain, supporting facilities useful to users and far in advance of what can be accomplished by spreadsheets or indeed many complex accounting packages. We have implemented a program incorporating these objects and demonstrating their application to accounting, and how they solve in this particular domain a variety of data modelling problems of wide interest.

In the rest of this report we first describe what we learned of the nature of our problem domain; then define our data model objects; then how these can be applied in many ways to address the problems; then the general issues in data modelling these can be seen as solving; and finally report on our practical work.

The domain of simple accountancy

We are concerned with managing small budgets such as personal bank accounts, research grants, a department or firm with two or three employees. Many items are recurrent e.g. standing orders for mortgage repayments. Many in fact are recurrent but with varying amounts e.g. salaries incremented each year, direct debits for gas bills whose amount is adjusted annually.

More importantly, many items go through a standard set of states, rather than being one-off items like a grocery bill that is only entered at the time payment is made. For instance, standing orders are committed to in advance (when the order is signed), but payment is made on a particular date. Cheques are committed to when signed and handed over, but only paid from the bank account many days later. The first implication from this is that such items must be modelled as entities that persist over time, but change status. The second implication is that account holders want more than one view of their account: a view for reconciling bank statements which deals only in cleared funds, and another view (which the users we interviewed told us is called "commitment accounting") that shows how much money you have left after all the payments in the pipeline have been made.

In addition to these aspects of the domain, our interviews taught us the existence and importance of the following three issues which we had not foreseen. Firstly, that even with simple accounts, users need multiple views of what to them are "the same" accounts. Committed versus cleared funds is just one example, others being the requirement to produce accounts for different financial years such as the tax year (from April), the university financial year (from September), and from the start of a grant (often October, but could be any month). Secondly, the facility to produce accounts at varying levels of detail. Committees and individuals typically look at the totals first, then the first level of subdivision. If there is something unexpected, then more and more details will be required for the subheading producing the anomaly down until it is understood. That is, from the user point of view, the level of detail required is not fixed in advance, but ideally should be very easily changed. While control of the level of detail is an issue for the user interface, the associated user task is that of looking for the justification (reasons for) a particular item (such as the budget total).

The third and most important issue, though, is that such accounts are not produced simply for the sake of recording history but mainly to support decisions. Individuals want to know how much money they have to spare, and whether they have overspent, and so does a budget holder. One of our informants produces as a matter of routine totals, next to planned expenditure, and a "variation" column indicating significant differences. The fundamental user action here is not that of reading off a total, but of comparing current totals against previous plans, and if necessary then modifying the plan. Both plans and actual spending actions need to be recorded in the same system, and indeed closely related as frequently they refer to the same items. Supporting this caused a major modification to our initial designs because it corresponds to a change in understanding about the essential nature of this domain: not that of recording historical payment actions and calculating some totals, but comparing two views (planned and actual) of "the same" thing with particular attention to differences.

Our data modelling objects

We found that we could support all the tasks and functions for our domain using only three new kinds of object.

Time varying functions of money

Money amounts are modelled, not as simple integer quantities, but as functions that take a time period (in the world being described by the database) and return a money value. (Of course the same machinery can address domains where the "currency" is not money but, say, exam marks.) One-off payments are a special degenerate case, returing either zero or the one-off amount depending on whether it occurred during the period specifed as the argument. Standing orders pay out repeatedly on specified dates. Totals (e.g. "equipment budget") are functions like spreadsheet formulae that refer to other functions (passing on the time period they received as an argument).

Status variables

Items (time functions of money) can have one or more status variables associated with them. For a cheque, a status variable might be used to represent its state e.g. written, signed, handed over (or sent) to the payee, paid in by payee, cleared by payer's bank. In a small company, a job might have a status variable with values such as: prospect, estimate given, order received, job executed, invoice sent, payment received. In constructing a plan, items might be given a status variable with values such as: residual funds, estimated cost, actual cost.

A status variable is an enumerated set of values, specified by the user in setting up a particular model (schema). The values in general form a directed graph, possibly with cycles; but may have an order defined over them. That is, users are allowed to enter events that change a status value in any way, and in many cases such transitions will be meaningful as representing unusual but possible changes to status (e.g. a cheque having bounced being re-presented). However the defined order is useful to users so that they can ask for a view of the data that reports (only) on items with a status greater than or equal to some value e.g. to reconcile a bank statement ask for all items actually paid, but to calculate spare funds, ask for all items either committed or actually paid. A company chasing unpaid bills would ask for a list of jobs with status billed but not paid, while when it wants to direct its salesmen it might ask for a list of jobs quoted for but not ordered.

The formula DAG

When a money function is created, if the function itself is a formula referring to other items, this is added to the graph relating formulae, with the new item as a parent node to those referred to. The user is also asked to specify what set(s) a new item belongs to (e.g. an item might be named "computer" and added to the set "equipment"), and this tacitly adds it to the formula graph that specifies the sum (subtotal) for that set.

Informally we can think of this graph as a tree whose root is the budget's overall total, but since it is not uncommon for alternative groupings of items to be used simultaneously, the structure of the network is in general a DAG (Directed Acyclic Graph). The formula DAG is stored explicitly, and used in a number of ways.

Additional features

All data (and schema) entry is done by creating an "event". This is time-stamped for the time of data entry, but often also has an associated world time, entered by the user, to which it refers (e.g. the time the payment is made). An event may change the value of a status variable, create a time function, change the time function associated with a given item or its parameters. It may also change the metadata rather than the data in the sense of creating or modifying an item representing a new subtotal i.e. a set that groups other items in a new way e.g. moves an item from equipment to consumables.

The user can ask for a report or view, specifying the period of world time the view is to describe, and the values for any status variables to be used to filter the calculation. Values for all functions (including totals) are then computed. Alternatively, the user can ask for a trace (graphical or textual) of the values of a particular item as it changes over time.

Applying these primitives to the accountancy domain

Managing a budget: comparing plan and actuality

To implement a normal account, we expect a user to create and maintain a "history stream" and a "plan stream" referring to the same budget or account. The history stream can be thought of as a set of items organised bottom-up that can be evaluated together to give the current state of an account, while the plan stream is a corresponding set of items organised top-down from the overall total down to some level. The streams share a common formula DAG: the structure of formulae grouping items into subtotals and totals. All items are time varying functions of money, as explained above. The formula DAG is created as money functions are created: if a function contains a formula that refers to other items, then it is a superordinate node to those items, and when created the user assigns it to one or more sets which are its superordinate sets.

Like the history stream, the plan stream also consists of a set of money functions, but these are tagged as of type "plan" with a given plan version number (so that alternative plans and plans made at different dates can be held together indefinitely). As in the history stream, every item corresponds to one specific node in the formula DAG. A plan item is either a pointer to the formula from that node (i.e. its value depends on the corresponding subordinate plan items), or a pointer to an item that is a money function in the history stream (to represent a plan for a specific item such as the purchase of a major piece of equipment), or else a separate money function (to represent an overall plan for a whole subtree e.g. petty cash, travel expenses at the rate of £1000 per month, etc.). By specifying a world-time and a plan version number, the plan stream can be evaluated to show the planned budget at that date. This can be compared to a view of the history stream. That is, the planned and actual versions of the account can be juxtaposed (textually or graphically) or, alternatively, the history of the actual and planned versions of an item can similarly be compared.

A typical practice might be to review the accounts once a month, but to enter expenditure daily as it arises. At the end of the month, the latest plan (now a month old) would be evaluated for the current date to show what the planned spend should have been under each heading, while the history stream would also be evaluated with the current date to show actual expenditure. If the comparison showed discrepancies then a new plan could be made, by copying and modifying the last plan, to take account of what has happened. For instance if travel were underspent, the function for the rate of spend could be increased, while if equipment were overspent then either money could be moved ("vired") from another heading or the rate of the spend function for the remaining period reduced.

In creating a plan, typically the totals are known and some items or subtotals must be adjusted to conform with this. Functions in the interface are offered to the user to make this easy. A status variable associated with every plan item marks totals and pointers to history stream items as "fixed" and plan-stream money functions as "provisional". Proposed values (for the user to modify and approve) for the provisional items can be calculated by using the formula DAG to work backwards from the "fixed" totals.

Tracking the evolution of items

Status variables can be used so that an item can be represented throughout its financial history. Thus in a research budget, a large item of equipment might be specified in the original application, and have status values "proposed", "ordered", "delivered", "paid for". Thus the specific plan for buying this equipment could be represented by entering this item at the start, even though purchase might not take place for some months. Actions that change its status are represented by entering "events" that refer to the item and specify the change. Its amount i.e. the price of the equipment is likely to be modified (by an event) too, since only when it is ordered is the exact cost likely to become fixed and known.

Similarly cheques and direct debits for, say, regular subscriptions could be handled with their own similar status variable, thus allowing a user to enter an item for an anticipated bill months in advance, while keeping track of what is paid when.

Using views to produce periodic accounts

Besides data entry, the basic user action is to request a view on the data by specifying a period. For instance, to produce the accounts for a month, the user specifies the period (e.g. 1 March 1997 to 31 March 1997) plus status variable values (e.g. to specify whether cleared or committed funds are to be reported) and the software will produce the values of all items including totals for that period. The essential advantage of this approach is that many different views can be produced of the same data: different accounting periods, different status subsets.

Data modelling generalisations

Qualitative and quantitative time

In general there are two approaches to modelling time: qualitative and quantitative. The former, usually associated with logical treatments, describes orderings of events without describing the size of durations, while the latter describes durations. (This is not exactly the distinction between discrete and continuous, as durations may be described in discrete units such as days.) We have provided separate mechanisms to handle both these modelling requirements. Status variables allow qualitative time changes as an ordered sequence of values, specific to a user-defined type of object. Time as a quantity is handled as a basic value, passed as an argument to functions.

Data entry logging

All user actions are time stamped and a non-deletable record is kept. To correct an entry, the user enters a new event superceding the previous one. It is thus possible to do complete roll-back to the state of the database at any given user time, and reproduce the state as the database believed it to be then. In the accountancy domain, this allows a complete record of errors to be retained even after correcting them. This could be used for calculating compensation, calculating error rates with a view to improving the human and overall operations, and showing what the company believed at a given time in order to deal with claims about misleading customers or auditors, or about responsible management. Given that many numbers are subject to change independently of human error (e.g. prices are corrected from estimates to actual billings, plans are modified frequently), this is hard to do in present systems. Currently it is most often done by keeping complete paper or disk dumps of accounts at frequent intervals and keeping them for many years: clearly our approach is more economical.

Kinds of time

Note that that facility is wholly about the user-machine (i.e. user-database) interaction over time. That "user time" is quite separate from "world time": the times stored that represent time in the world being modelled e.g. dates of bank payments. Because in practice, at least as we understand it, the users' main task is to review and manage an account, they are doing equal amounts of planning and of recording definite (past) events. Both of these refer to world time, but view it from opposite directions. Thus planning tries to reason about world times that are still in the future (using estimates). Thus in our design, a user requests a view of the data in a single uniform way, specifying some world date which might be in the past or in the future.

The relationship of the database to the world

This single uniform approach is esthetically pleasing, but it also draws attention to the fact that any model of the world, including of the recent past, is only an estimate: it is only as accurate as the data currently entered. In much database work, it is assumed that the design of a data model or schema defines how much detail the database will hold, but then it is assumed to be in exact correspondence (within those limits) with the world. Data entry errors are held to a low level by validation procedures at data entry. In the world of accounting, people are much more aware that the relationship of validity between a database and the world it models is more complex and more problematic than that. One of the biggest issues is the time delay in acquiring data. Companies typically publish accounts months after the end of the accounting year, reflecting how long it takes to acquire and enter financial data. Government figures on the economy, e.g. for high street spending, are even more subject to both delay and inaccuracy. Thus in fact any given report should be specified in terms of both world time and user time i.e. both the time in the world which the report tries to describe, and the time at which the database was interrogated (not necessarily the present time if rollback were used to undo some data entry actions). Very small companies may also have significant lags in data entry (e.g. accounts may only be updated and bills paid once a week or once a month), just as large companies do.

Users who have a model of the kinds of uncertainty in the data may define status variables to describe this (e.g. provisional vs. confirmed figures). Another example might be, in a personal account, to enter credit card items as they are spent, but use the itemised bill sent by the credit card company to check items and to catch items that had been forgotten. In this way uncertainties in the validity of the data may be modelled.

Asking for justifications

A frequent task for account managers is seek out where a figure comes from: why it is the amount it is. For instance if consumables expenditure seems high this week, then the manager will want to look at the items making it up. Another example is in looking at an estimated total (perhaps in a plan), the manager will want to understand what this estimate depends on, and what items could best be changed in order to change the estimate i.e. which are least certain or least fixed. During the project, we took ideas from the Artificial Intelligence literature on Reason Maintainence Systems and applied them here. In this domain, a figure is calculated by a formula, which in turn draws on other items. If the user asks "why/explain" and points to an item, the formula DAG can be used to identify the set of items on which it depends. If a status variable has been defined marking some items as "certain", others as "estimated" and so on, with an order defined on the values for that status variable, then the mechanism will identify the subset with the lowest values: an inference is only as good as the weakest (least certain) assumption on which it is based.

Fisheye views

A visual approach corresponding to that is the fisheye view approach, another idea from the literature that we have adapted for this domain. A fisheye lens shows the centre of the field at high magnification, and peripheral things at progressively lower resolution. A computer equivalent is to be able to show any item at a number of levels of detail or size on the display, and to use a measure of "interest" to control this. Here, if the user points to an item (say a subtotal), this defines the centre of interest. Other items are assigned interest in inverse proportion to their distance in the formula DAG from the point of interest. Status variables can be used to modify this interest function further.

Dynamic schema modification

In practice, many database users would like to modify the database's schema during its lifetime, raising problems that are in general intractable. However because of the simple structure of our data model, we can allow this. Modifications are changes to the formula DAG. New nodes can be inserted or removed without leaving any ambiguities. Items can be moved from one place in the DAG to another, as long as the interface prevents them being left isolated. Such changes will be logged and timestamped as "events", even though they change the metadata rather than the data modelling the world. Because of the time stamps, it would then be possible to ask for a view of the world specifying one world time and a different metadata time i.e. you could ask for the accounts using a new structure, but referring to a past time, or conversely the accounts of the current time but using an earlier structure. This is likely to prove useful in the accounting world, as practices periodically change making comparisons between different years difficult if they were reported following different standards. Furthermore some companies report using more than one structure: one they think the best reflection of the "true state of affairs", and another reflecting current legal requirements for reporting.

Work done and current status

Because our long term plan was and is to test our software on real users, it is important to be able to deliver it on multiple platforms, particularly the PC (most businesses) and Macs (many local university account managers, and some multimedia businesses). We therefore spent some time at the start of the project identifying a suitable environment, settling on a version of Smalltalk that would support iterative experimental programming, but would run on both Macs and PCs. Early outline designs were implemented using this, as were student projects on graph drawing packages, a Reason Maintainence system, and a Fisheye package. At the end of the project we switched to an implementation in Java which ported instantly between a Sun and a PC running WindowsNT, and also a Mac. All the facilities mentioned above have been implemented.

Of the objectives listed on the application form, we have accomplished all of them except going back to the outside users we interviewed early on, to test the resulting software. We feel disappointed by this, but the persistent ill health of the RA on this project held back the work over and above the two months unpaid leave she asked for. Extensive user testing will also require additional software work, particularly on routines to read in existing spreadsheets and convert them into our data structures in order to persuade users to use our software on their real work.

These delays also mean that we are only now writing up papers to report on this work, and formulating a new grant application for further work.

Future work

The immediate further work required is to test our software on users, both ourselves trying to use it to manage our accounts, and other outside users. This would doubtless lead to many revisions, some to the user interface, perhaps some to our data modelling ideas, and some to our identification of and support for the main user tasks in this domain.

However this would probably be only the start of a new issue. Judging from what we personally have learned about "simple" accounting during this project, a major issue will be teaching users new practices: to understand that what they are really doing is not just writing down sums spent, but managing accounts, that there could be substantial advantages in entering items months in advance when they are first foreseen, and that there isn't really any such thing as "printing out the accounts" but rather the question of what view they want to see for what purpose. It will be one thing to polish up a user interface for a user with a deep understanding of the domain, and quite another to teach accounting-naive users what the domain is about through the medium of the software interface and documentation.

This reflects the issues involved in building a task or domain-specific program as opposed to a general purpose tool like a spreadsheet. A special purpose tool will have better adapted functionality, and save every user having to repeat many pieces of implementation (e.g. calculating salaries with increment dates, having separate spreadsheets for each required accounting period). But it raises rather than solves the increasingly important issue in software design of how to teach users about the domain the tool addresses. This may well be a cruical issue in databases in general, where apparent poor usability may often really be due to a lack of support for the user to understand the view of the domain implied by a given database, and the relationship of the database to the world it is trying to represent. Addressing this issue, of how to present a domain to a user not already familiar with it, is an important and general issue that we would like to explore through this particular domain of simple accountancy.

In our view, this work could inspire commercial software products in future, but is still early in its development phase.

Conclusion

We are pleased with our strategy of focussing on a particular small problem domain as a way of studying general problems. Not only does it force us to pay attention to problems because they matter to real users (at least in one domain) rather than selecting them arbitrarily, but the solutions are nevertheless much more general than the domain we studied, even if they do not have the complete generality that, say, relational calculus does for databases. Time varying functions, status variables, and formula DAGs not only solve multiple problems in this domain, but it seems clear they will be useful in other application areas. Furthermore they capture what is common about what is the same item to users (e.g. buying one item of equipment), yet handled in duplicate in both spreadsheet and manual systems of accounting (e.g. double entry book-keeping) depending on whether the item is being viewed as a plan or a past expenditure, part of the equipment budget or an item with an invoice to be paid.