Last changed 10 Sept 1998 ............... Length about 1,000 words (6,000 bytes).
This is a WWW document maintained by Steve Draper, installed at http://www.psy.gla.ac.uk/~steve/miraplans.html. You may copy it. How to refer to it.

[A successor document is available at http://www.psy.gla.ac.uk/~steve/mmtc2.html]

Project for a multimedia test collection

At the recent MIRA workshop near Grenoble, there was a discussion about planning a future project to build a multimedia test collection. This message represents my ideas on that, and will hopefully be of interest to Miche and others. The ideas owe a lot to the discussion: to how Yves introduced and defined it, and to Peter's point that any proposal should be built around a definite research question.

Main aim(s)

The overall aim of the proposed research would be to investigate whether, and in what way, it would be possible and useful to build a test collection for interactive and multimedia IR. (This is the research question.) The approach would be to build a "test bed", and to make (1) "evaluating the evaluation" one of the main activities: running tests not just on different pieces of software, but also testing the validity of those tests themselves. A second main strand (2) would be to compare and develop alternative proxy measures, of which the traditional precision and recall would be the first candidates. These measures are proxies for (that is, they substitute for) more directly valid measures of utility for users doing work. Their point is that they are cheaper to measure and more easily repeatable, and so support direct comparisons between alternative software designs; but they are only useful provided they retain some continuing association (correlation) with "real" measures such as work completed or value received by users.
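To make the proxy idea concrete, here is a minimal sketch in Python of the two traditional measures, and of the kind of check strand (2) calls for: whether a proxy still orders systems the same way as a direct utility measure. All names and figures below are invented for illustration; nothing here describes an existing system.

# Sketch only: precision/recall as proxy measures, plus a check of whether
# they still correlate with a "real" utility measure. All data is invented.

def precision_recall(retrieved, relevant):
    """The classic proxy measures, computed over sets of document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

def spearman(xs, ys):
    """Rank correlation (assuming no ties): do the proxy and the "real"
    measure order the candidate systems the same way?"""
    def ranks(vals):
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        r = [0] * len(vals)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

print(precision_recall(["d1", "d2", "d3"], ["d2", "d3", "d4"]))  # approx (0.67, 0.67)

# Invented scores for four systems: a proxy (say, mean precision) against
# a direct utility measure (say, work completed by users per session).
proxy   = [0.42, 0.55, 0.31, 0.60]
utility = [3.1, 4.0, 2.5, 3.8]
print(spearman(proxy, utility))  # 0.8 here: the proxy still tracks utility

A correlation near +1 would support continuing to use the proxy; one near zero would mean the proxy has lost its association with real value, which is exactly what strand (1), "evaluating the evaluation", is meant to detect.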

Work situation

A key lesson from the MIRA workshops, and a problem for the idea of a MMTC, is that the particular work situation and task can have large effects on what makes a retrieval engine useful. A wide spectrum of views is possible here. At one extreme, we might believe that the work situation dominates, and that no testbed is possible because no measurements will generalise to, or predict performance in, particular workplaces. At the other extreme is the position implied (but never really tested) by past test collection work: that the users' work is relatively unimportant, and that the performance of retrieval engines as measured by test collections dominates how useful they are to users.

MMTC design: compromise redefinition of test items

In this proposal we shall, at least as a starting point, adopt a compromise position as the basis for building a test bed. (Part of the research will be to test how successful this compromise is in reality.) The compromise is to adopt a definition of task and task success that is abstracted from real work (and so can be used to repeat tests in many labs) but takes more account of users than traditional test collections do. Instead of the old test collection use of queries to represent tasks and relevance judgements to represent task success, we shall use descriptions of information needs (perhaps like those being developed by Pia Borlund) and judgements by a sample of users from the original task domain. (In fact we need to develop this point further. Mizzaro's paper on the different dimensions of relevance could be used here.)

The purposes of this compromise are:
* It allows evaluation of interactive systems, in which a whole session of retrievals, not a single one, is the natural unit of activity. For this we need a task to set at the start of a session, and measures of the goodness of the results of the session as a whole (a sketch of such a test item follows this list).
* It allows us to research proxy measures by offering a set of higher-level judgements of goodness (utility).
* It responds to the finding (from Yves' work in FERMI) that in image retrieval there is remarkably little consensus between users / judges on the relevance of images to a query.
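As a purely illustrative sketch of what such a redefined test item might look like (the structure, names and numbers are all invented, and the scoring rule is just one possible choice), the Python below pairs an information-need description with graded judgements from a panel of users, and scores the output of a whole session rather than a single retrieval:

# Sketch only: one possible shape for a redefined test item. An
# information-need description replaces the bare query, and goodness is
# judged over the whole session's results by a panel of users.

need = {
    "description": "Find pictures suitable for the cover of a tourist "
                   "leaflet about the Scottish highlands.",  # invented example
    "judges": ["user_a", "user_b", "user_c"],  # sampled from the task domain
}

# Graded judgements per judge: document id -> goodness (0..1). Invented data.
judgements = {
    "user_a": {"img07": 1.0, "img12": 0.5},
    "user_b": {"img07": 1.0, "img31": 0.5},
    "user_c": {"img12": 1.0},
}

def session_goodness(session_results, judgements):
    """Mean, over judges, of the goodness credited to the set of items
    the whole session produced -- not to any single retrieval."""
    totals = [sum(graded.get(doc, 0.0) for doc in session_results)
              for graded in judgements.values()]
    return sum(totals) / len(totals)

print(session_goodness({"img07", "img12", "img31"}, judgements))  # about 1.33

The graded (rather than binary) judgements here are one possible response to the point about Mizzaro's dimensions of relevance; as noted above, that part of the design still needs development.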

Research actual work situations / tasks

A crucial part of building any test collection is selecting the set of tasks. In the present proposal for a testbed, this has an added importance because of existing work suggesting that there may be much less agreement between judges on the relevance of each image to a given query than traditional test collection methods assume. If this is a general fact, then it would seem to make test collections for images (and hence for multimedia) impossible. It therefore needs to be directly investigated. Activity strand (3) would thus be to research a set of user tasks chosen to be as different in their characteristics as possible, both as part of the testbed and in order to measure the degree of consensus (or lack of it) for each task, so as to survey how far tasks differ in this respect.

One candidate for this set is the use of thumbnail images of screens in a user interface for navigation. HyperCard uses these: it displays a kind of history list where each "place" is represented by a thumbnail reduction of the screen referred to, and the user clicks on one to go back to that screen. This would probably also be extremely useful in a web browser. It seems likely to be an example where user agreement about the correctness of retrieval would be very high. At the other extreme would be a task like Joemon's, where the user is selecting pictures from an image library to use in producing a tourist leaflet. There seems likely to be very low agreement between users on this. Note however that this task could be reinterpreted: the judgements could (in principle) be made not by the users of the image retrieval system who are using it to design the leaflet, but by the tourists who use the leaflet. The judgements would then represent success at the real work, rather than at the retrieval alone.
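One simple way to measure the degree of consensus that strand (3) calls for, sketched below with invented judgements for the two example tasks just described, is the mean pairwise overlap (Jaccard) of the judges' relevant sets; chance-corrected statistics such as kappa would be an obvious refinement.

# Sketch only: consensus between judges on which items are relevant to one
# query. Jaccard overlap is just one simple choice of agreement statistic.

from itertools import combinations

def jaccard(a, b):
    """Overlap of two judges' relevant sets: 1.0 means full agreement."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a | b) else 1.0

def mean_pairwise_agreement(judge_sets):
    pairs = list(combinations(judge_sets, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)

# Invented judgements: a high-consensus task (recognising screen thumbnails)
# against a low-consensus one (choosing pictures for a leaflet).
thumbnails = [{"s1", "s2"}, {"s1", "s2"}, {"s1", "s2", "s3"}]
leaflet    = [{"i1", "i2"}, {"i3"}, {"i2", "i4", "i5"}]
print(mean_pairwise_agreement(thumbnails))  # about 0.78: high consensus
print(mean_pairwise_agreement(leaflet))     # about 0.08: very low consensus

A survey of this figure across tasks chosen to differ as widely as possible is what strand (3) would produce.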

The document collections

It will probably not be necessary to spend resources ourselves on the actual document collections, because a) we will need a number of rather different types of document and hence collection; and b) we should be able to use other people's. Particularly promising are the collections funded for general research availability by DARPA, as described by Bill Arms.

Conclusion

Clearly much more remains to be said; but those are the ideas I came away with.