Last changed: 10 Sept 1998. Length: about 1,000 words (6,000 bytes).
This is a WWW document maintained by Steve Draper, installed at http://www.psy.gla.ac.uk/~steve/miraplans.html.
You may copy it.
[A successor document is available at
http://www.psy.gla.ac.uk/~steve/mmtc2.html]
At the recent MIRA workshop near Grenoble, there was a discussion on
planning a future project on building a multimedia test collection. This
message represents my ideas on this, and will hopefully be of interest to Miche
and others. The ideas owe a lot to the discussion: to how Yves introduced and
defined the discussion, and to Peter's point that there should be a definite
research question around which to base any proposal.
Main aim(s)
The overall aim of the proposed research would be to investigate whether, and
in what way, it could be possible and useful to build a test collection for
interactive and multimedia IR. (This is the research question.) The approach
would be to build a "test bed", and to make (1) "evaluating the evaluation" one
of the main activities, i.e. running tests not just on different pieces of
software, but also on the validity of those tests themselves. A second main strand (2)
would be to compare and develop alternative proxy measures, of which the
traditional precision and recall would be the first candidates. These measures
are proxies for (that is they substitute for) more directly valid measures of
utility for users doing work. Their point is that they are cheaper to measure,
are more easily repeatable and so support direct comparisons between
alternative software designs; but they are only useful provided they retain
some continuing association (correlation) with "real" measures such as work
completed or value received by users.
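To make the proxy measures concrete, here is a minimal sketch (with hypothetical document IDs and relevance judgements) of how precision and recall are computed for a single query:

```python
# Minimal sketch: precision and recall for one query, given the set of
# documents a system retrieved and the set judged relevant.
# All document IDs below are hypothetical illustrations.

def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a retrieved set against relevance judgements."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

p, r = precision_recall(retrieved=["d1", "d2", "d3", "d4"],
                        relevant=["d2", "d4", "d7"])
print(p, r)  # precision 0.5, recall 2/3
```

The point made above is that numbers like these are only informative insofar as they stay correlated with the "real" measures of user utility.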
Work situation
A key lesson from the MIRA workshops, and a problem for the idea of a MMTC, is
that the particular work situation and task can have large effects on what
makes a retrieval engine useful. A wide spectrum of views is possible here.
At one extreme, we might believe that the work situation dominates and that no
testbed is possible because no measurements will generalise to or predict
performance in particular workplaces. At the other extreme is the position
implied (but not really ever tested) by past test collection work: that the
users' work is relatively unimportant, and the performance of retrieval engines
as measured by test collections dominates how useful they are to users.
MMTC design: compromise redefinition of test items
In this proposal we shall, at least as a starting point, adopt a compromise
position as the basis for building a test bed. (Part of the research will be
to test how successful this compromise is in reality.) The compromise is to
adopt a definition of task and task success that is abstracted from real work
(and so can be used to repeat tests in many labs) but takes more account of
users than the traditional test collections do. Instead of the old test
collection use of queries to represent tasks and relevance judgements to
represent task success, we shall use descriptions of information needs (perhaps
like those being developed by Pia Borlund) and judgements by a sample of users
from the original task domain. (In fact we need to develop this point more.
Mizzaro's paper on the different dimensions of relevance could be used here.)
The purpose of this compromise is:
* It allows evaluation of interactive systems in which only a set of
retrievals, not a single one, is a natural unit of activity. For this we need
a task to set at the start of a session, and measures of the goodness of the
results of the session as a whole.
* It allows us to research proxy measures by offering a set of higher level
judgements of goodness (utility).
* In image retrieval, remarkably little consensus was found (by Yves' work in
FERMI) between users / judges on the relevance of images to a query.
Research actual work situations / tasks
A crucial part of building any test collection is selecting the set of tasks.
In the present proposal for a testbed, this has an added importance because of
some existing work suggesting that there may be much less agreement between
judges on the relevance of each image for a given query than is found for text. If this is a general
fact, then it would seem to make test collections for images (and hence for
multimedia) impossible. It needs therefore to be directly investigated.
Therefore activity strand (3) would be to research a set of user tasks chosen
to be as different in their characteristics as possible, both as part of the
testbed and to measure the degree of consensus, or lack of it, for each task, in
order to survey how differently tasks behave in this respect.
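One simple way strand (3) could quantify consensus for a task, sketched here with hypothetical binary judgements, is the mean pairwise agreement between judges (more refined statistics such as kappa correct for chance agreement, but the idea is the same):

```python
# Sketch: mean pairwise agreement between judges' binary relevance
# judgements for one task. The judgement data below are hypothetical.
from itertools import combinations

def mean_pairwise_agreement(judgements):
    """judgements: list of dicts mapping item id -> True/False (relevant or not).
    Returns the mean fraction of items on which each pair of judges agrees."""
    items = judgements[0].keys()
    pairs = list(combinations(judgements, 2))
    total = sum(sum(a[i] == b[i] for i in items) / len(items) for a, b in pairs)
    return total / len(pairs)

# Three judges, four images (hypothetical):
judges = [
    {"img1": True,  "img2": True,  "img3": False, "img4": False},
    {"img1": True,  "img2": False, "img3": False, "img4": False},
    {"img1": True,  "img2": True,  "img3": True,  "img4": False},
]
print(mean_pairwise_agreement(judges))  # 2/3: moderate consensus
```

Running this per task across the chosen spectrum of tasks would give the survey of consensus that strand (3) proposes.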
One candidate for this set is the use of thumbnail images of screens in a user
interface for navigation. HyperCard uses these: it displays a kind of history
list where each "place" is represented by a thumbnail reduction of the screen
referred to, and the user clicks on one to go back to that screen. This would
probably also be extremely useful in a web browser. This seems likely to be an
example where user agreement about the correctness of retrieval would be very
high. At the other extreme would be a task like Joemon's, where the user is
selecting from an image library pictures to use in producing a tourist leaflet.
There seems likely to be very low agreement between users on this. Note
however that this task could be reinterpreted. The judgements could (in
principle) be made not by the users of the image retrieval system who are using
it to design the leaflet, but by tourists who use the leaflet. The judgements
would then represent success at the real work, rather than at the retrieval
alone.
The document collections
It will probably not be necessary to spend resources ourselves on the actual
document collections. This is because a) we will need a number of rather
different types of document and hence collection; b) we should be able to use
other people's. Particularly promising are the collections funded for general
research availability by DARPA, as described by Bill Arms.
Conclusion
Clearly much more remains to be said; but those are the ideas I came away with.