Last changed 10 Sept 1998 ............... Length about 1,000 words (10,000 bytes).
This is a WWW document maintained by Steve Draper, installed at http://www.psy.gla.ac.uk/~steve/mmtc2.html.
This document is only intended for those in the MIRA MMTC working group.


(Towards an) MMTC proposal


draft 2.2 by Stephen W. Draper

Preface

This is a draft representing the thinking of the MIRA working group (chaired by Miche) on the MMTC proposal. This version has been edited in the light of comments Miche sent to me in August.

Forerunner documents:

  • Miche offer
  • My report on Grenoble MMTC discussion
  • First draft of this document for a meeting with Mark and Miche.

    MM = multimedia
    TC = test collection

    A. The problem


    Aim/need

    TCs have played, and are still playing, an important role in IR research. They allow software designs to be compared under controlled conditions. If testing can be fast and cheap, then designs can be rapidly improved. TCs embody the fact that in IR, testing is cheap for a designer if, but only if, someone else has built the test facility: the TC. This emphasis on technical software improvement is justified either a) because it includes improvements to the user interface (improving usability, i.e. reducing costs to the user) or b) because technical performance (e.g. speed, precision, recall) has an important impact on utility to the user.
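
    To be concrete about what the last two of those measures compute, here is a minimal sketch (Python, chosen purely for illustration; the function name is invented):

        def precision_recall(retrieved, relevant):
            # Set-based precision and recall over document ids.
            # retrieved: ids the engine returned for one query.
            # relevant:  ids human assessors judged relevant (the TC's judgements).
            retrieved, relevant = set(retrieved), set(relevant)
            hits = len(retrieved & relevant)
            precision = hits / len(retrieved) if retrieved else 0.0
            recall = hits / len(relevant) if relevant else 0.0
            return precision, recall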

    The technical basis of IR has changed enormously as computing has changed: it must now additionally address the WWW, multimedia, hypertext, and highly interactive systems.

    The question is: what can be done to build a TC to cover some or all of these new demands?

    Points that must be accommodated

    Here is a set of points, learned during MIRA, that need to be accommodated in any satisfactory solution for new TC work.

  • There are different types / levels of user goal ("information need").

  • There is not one type of relevance but many.

  • IR engines produce an ordered list, but past TCs compare this to binary human relevance judgements: surely a category error. (This mismatch is illustrated in the sketch after this list.)

  • HCI has always said: test real users. Today it would add: study real work situations.

  • Other media (e.g. video) still need to catch up with print: don't confuse the huge advantage of adding print's simple mechanisms (indices, random access) with the advantage of IR mechanisms.

  • Text retrieval is a magic trick that extracts meaning from word-stems without syntax: can we expect a comparable trick to appear for non-text media?

  • Cross-media retrieval (e.g. text queries for image retrieval) is NOT just a shabby trick, but is required by some real work domains.

  • The importance of (studies of) work domains: it takes such studies, for example, to establish the last point.

  • Low consensus in relevance judgements of pictures. (FERMI study; but is this actually also true of text? Does this undermine TCs?)

  • Precision and recall are clearly of dubious value as measures of utility for real tasks.

  • Even if they are valid, parts of the system other than the retrieval engine have major effects on task success.

  • We must be able to evaluate more radical designs (such as that by Iain Campbell).
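
    To make the ordered-list point from the list above concrete: even the simplest rank-sensitive measure, precision at a cutoff k, gives one ranking different scores at different cutoffs, something a single binary comparison cannot express. A minimal sketch (Python purely for illustration; data and names invented):

        def precision_at_k(ranking, relevant, k):
            # ranking:  document ids in the order the engine returned them.
            # relevant: the TC's binary human judgements, as a set of ids.
            return sum(1 for doc in ranking[:k] if doc in relevant) / float(k)

        ranking = ["d3", "d7", "d1", "d9", "d2"]
        relevant = {"d3", "d2"}
        for k in (1, 3, 5):
            print(k, precision_at_k(ranking, relevant, k))  # 1.0, 0.33..., 0.4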

    Summary of the aim

    We would like a test facility that has at least the advantages of old TCs: fast, automatic (i.e. no human subjects) lab tests of engines that can drive rapid technical improvement. But it must also apply to the new technical scope of IR, and accommodate the issues listed above. The project is addressed to discovering whether this is possible, what such a facility would be like, and to providing demonstrators of parts of such a facility.

    B. Outline solution

    The research question

    To investigate the possibility and usefulness of building a TC evaluation facility for interactive and multimedia IR. (I.e. not to build a full TC to a pre-specification, but to borrow and build illustrative collections, in order to establish what is useful and in what way.)

    Main research activities

    Strand 1 To study a number of real work domains / situations.
    This will allow validity checks (strand 3), exercise various document types (strand 2), and identify user needs that may not be satisfied by any existing technology.

    Strand 2 Gain access to a variety of multimedia document collections.
    These will often be borrowed from others rather than created and stored by us (e.g. Bill Arms' project). Work described as "digital libraries" focusses on this; our emphasis is on testing (evaluation), not collecting.

    Strand 3 Evaluating the evaluations
    Testing the validity of the tests, e.g. comparing test scores against measures of utility to users in workplaces. A large part of this will be comparing different levels/types of user goal (information need).
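
    One crude form such a validity check could take (my illustration; neither the method nor its inputs are fixed by this proposal): score the same set of engines with a lab measure and with a workplace utility measure, then ask whether the two order the engines the same way, e.g. by rank correlation.

        def spearman_rho(lab_scores, utility_scores):
            # Spearman rank correlation (no-ties formula) between a lab measure
            # and a workplace utility measure over the same engines; +1 means
            # the two measures rank the engines identically.
            def ranks(vs):
                order = sorted(range(len(vs)), key=lambda i: vs[i])
                r = [0] * len(vs)
                for rank, i in enumerate(order, 1):
                    r[i] = rank
                return r
            rx, ry = ranks(lab_scores), ranks(utility_scores)
            n = len(lab_scores)
            d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
            return 1.0 - 6.0 * d2 / (n * (n * n - 1))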

    Strand 4 Explore alternative proxy measures
    Precision and recall are the standard measures, used for convenience (i.e. speed, cheapness, direct comparability between IR engines) by designers to stand in for direct measures of value to the end users. Are there other useful proxy measures?
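
    As one illustration of the kind of alternative we might explore (reciprocal rank is my example here, not one the group has chosen): for tasks where the user needs only one good document, the rank of the first relevant hit may be a better proxy than recall, which rewards an exhaustiveness the user does not want.

        def reciprocal_rank(ranking, relevant):
            # 1/rank of the first relevant document; 0.0 if none is retrieved.
            # Computable from the same judgements a TC already holds.
            for rank, doc in enumerate(ranking, 1):
                if doc in relevant:
                    return 1.0 / rank
            return 0.0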

    Overview

    We wish to explore the design of a test facility. As will be seen, this is a large space, and we will attempt to sample it rather than cover all parts exhaustively.

    The space is generated by the "dimensions":
    A. Collection, query, engine, evaluation measures
    B. Types (or levels) of information need

    A.
    The traditional system-orientated text retrieval test facility includes four components: query, engine, collection, evaluation measures. In interactive and multimedia retrieval each of these components acquires new dimensions. To take account of the user, the query needs to be extended to include the concept of information need (see the reference to Mizzaro's paper below). Equally, the search engine cannot be considered in isolation: some consideration needs to be given to the user interface, i.e. how the system and different types of information are presented to the searcher, as well as to the system/user interaction. In addition, in the context of multimedia the collection component incorporates different types of presentational media and document (besides issues of different types of document in a single medium, e.g. academic paper abstracts vs. newspaper articles). Since we are studying evaluation and wish to measure the validity of measures in various cases, a fourth factor is that of the evaluation measures used. Consequently, in order to reflect all these different and new dimensions, a variety of new measures needs to be explored.

    B.
    Much of our approach can be organised around a dimension of 4 types or levels of information need. We take these from Stefano Mizzaro's paper (which not only defines them, but relates them to the many related distinctions made in the literature). He defines the types in terms of the representation in which the need is expressed. (See also my summary of his framework.)

    Traditional TCs test only the fourth level (of queries). Workplace studies address the first level (RIN). By attempting to measure the validity of laboratory measures, we can study the approximations that have been accepted for so long, and which may or may not apply to multimedia and other new forms of IR. On the other hand, we will also study how to supply, if possible, the needs of laboratory testing: that is, the advantage of having tests that are fast and cheap and can be applied without human users. One avenue to pursue may be the development of "simulated users", so that some lab benchmarks can be run on interactive software but without human users. At least a few simple types could easily be constructed: a simulated user that chooses a document (from the list) randomly, one that picks the first document with a keyword in the surrogate, etc.; two such are sketched below. These simulated users can be a) motivated by simulating strategies observed in at least some real users, and b) checked against real human users to estimate the validity of using them in lab tests.
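
    A minimal sketch of those two simple simulated users (Python and all names are inventions for illustration; a real benchmark harness would feed the chosen document back into the interactive system):

        import random

        # Each hit is a (doc_id, surrogate) pair, the surrogate being the short
        # text shown in the result list (e.g. a title or snippet).

        def random_chooser(hits, rng=random.Random(0)):
            # Simulated user: choose a document from the list at random.
            return rng.choice(hits)[0] if hits else None

        def first_keyword_chooser(hits, keywords):
            # Simulated user: choose the first document whose surrogate
            # contains any query keyword.
            for doc_id, surrogate in hits:
                if any(kw.lower() in surrogate.lower() for kw in keywords):
                    return doc_id
            return None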

    This set of levels of information need also allows us to cover the "task-artifact" cycle in IR: that is, the fact that in some cases technology is developed to satisfy pre-existing user needs (traditional task-driven design), and in other cases new needs and even jobs are created to exploit new technology that was invented without any clear need (artifact-created demand). Studying work domains (RINs) will have as one result the identification of user needs not currently well served (manufacturers could identify new markets from this). Studying the performance of new collections and new engines in the lab against artificial measures is one way of studying artifacts independently of users (comparable to building and testing a rocket car that goes faster than anyone could possibly use a car for). Quite new uses for retrieval from video and images may well emerge if new technology makes them possible: perhaps in 10 years we will attach a picture to every email to illustrate the mood, which would be possible if image retrieval by "mood" became fast and effective. This approach from both ends (from both tasks and from new artifacts) is also important in our research strategy. We do not know, as we write the proposal, what technology will seem vital by the end of the project. Lab tests (the level of the query) will always be important, but they are by definition closely bound to the type of technology (which defines the "query" language and functions). There are very few examples of image retrieval engines at the moment, so we do not yet know what will come to seem standard as an image query language and function. But we can already study work domains (and RINs) in this area: only later will the query level settle down.

    It is essential to develop a test framework (and research strategy) that takes account of the WWW. Not only is the WWW now very important in most work domains; it is also a challenge because it is too big for any engine to search fully. Thus the old approach to TCs is no longer adequate, and we must work on sampling as well as on exhaustive retrieval, aiming to develop standard test procedures based on sampling.
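
    One simple shape a sampling-based procedure might take (a sketch under my own assumptions, not a worked-out design): judge a random sample of the retrieved set instead of the whole of it, which holds judging costs constant however large the retrieved set, or the WWW behind it, grows. Recall is harder, since estimating it also requires sampling the unretrieved part of the collection.

        import random

        def sampled_precision(retrieved, judge, sample_size=100,
                              rng=random.Random(0)):
            # Estimate precision from a random sample of the retrieved ids
            # (a list). judge(doc_id) -> True/False stands in for a human
            # relevance judgement of one sampled document.
            sample = rng.sample(retrieved, min(sample_size, len(retrieved)))
            return sum(1 for doc in sample if judge(doc)) / float(len(sample))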

    Tactics

    In every aspect of this proposal we should specify one or two precise activities and leave others explicitly unspecified, because technical and human opportunities will change during the project, and being very explicit would condemn our work to be out of date before the project ends. Furthermore, part of the project work should be to explore what is available and possible. In relation to work domain studies, I therefore suggest we pick only one or two that we specify in the proposal and for which users have already agreed to participate. Tentatively, I suggest we choose the work domain of designers of leaflets and brochures who need to pick images as illustrations. The advantages of this particular work domain are:
  • In text retrieval, IR adds most value when the goal consists of many weak constraints or requirements, rather than one strong one (which could be found without IR). I think this kind of image illustration task is like that, unlike "a picture of Chirac taken yesterday in Paris".
  • It is a real work domain, i.e. we can find people who have this as a paid job.
  • But we can also simulate it by hiring students who have the task described to them. Thus we could run many tests with substitute users and hope to find that this is a reasonable approximation.
