Last changed 10 Sept 1998 ............... Length about 1,000 words (10,000 bytes).
This is a WWW document maintained by Steve Draper, installed at http://www.psy.gla.ac.uk/~steve/mmtc2.html.
This document is only intended for those in the MIRA MMTC working group.


(Towards an) MMTC proposal


draft 2.2 by Stephen W. Draper

Preface

This is a draft representing the thinking of the MIRA working group (chaired by Miche) on the MMTC proposal. This version has been edited in the light of comments Miche sent to me in August.

Forerunner documents:

  • Miche offer
  • My report on Grenoble MMTC discussion
  • First draft of this document for a meeting with Mark and Miche.

    MM = multimedia
    TC = test collection

    A. The problem


    Aim/need

    TCs have played, and are still playing, an important role in IR research. They allow software designs to be compared under controlled conditions. If testing can be fast and cheap, then designs can be rapidly improved. TCs embody the fact that in IR, testing is cheap for a designer if, but only if, someone else has built the test facility: the TC. This emphasis on technical software improvement is justified either a) because it includes improvements to the user interface (improving usability, i.e. reducing costs to the user) or b) because technical performance (e.g. speed, precision, recall) has an important impact on utility to the user.
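
    To be concrete about what the last two of those measures compute, here is a minimal sketch (Python, chosen purely for illustration; the function name is invented):

        def precision_recall(retrieved, relevant):
            # Set-based precision and recall over document ids.
            # retrieved: ids the engine returned for one query.
            # relevant:  ids human assessors judged relevant (the TC's judgements).
            retrieved, relevant = set(retrieved), set(relevant)
            hits = len(retrieved & relevant)
            precision = hits / len(retrieved) if retrieved else 0.0
            recall = hits / len(relevant) if relevant else 0.0
            return precision, recall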

    The technical basis of IR has changed enormously as computing has changed: it must now additionally address the WWW, multimedia, hypertext, and highly interactive systems.

    The question is: what can be done to build a TC to cover some or all of these new demands?

    Points that must be accommodated

    Here is a set of points, learned during MIRA, that need to be accommodated in any satisfactory solution for new TC work.

  • There are different types / levels of user goal ("information need").

  • There is not one type of relevance but many.

  • IR engines produce an ordered list, but past TCs compare this to binary human relevance judgements: surely a category error. (This mismatch is illustrated in the sketch after this list.)

  • HCI has always said: test real users. Today it would add: study real work situations.

  • Other media (e.g. video) still need to catch up with print: don't confuse the huge advantage of adding print's simple mechanisms (indices, random access) with the advantage of IR mechanisms.

  • Text retrieval is a magic trick that extracts meaning from word-stems without syntax: can we expect a comparable trick to appear for non-text media?

  • Cross-media retrieval (e.g. text queries for image retrieval) is NOT just a shabby trick, but is required by some real work domains.

  • The importance of (studies of) work domains: it takes such studies, for example, to establish the last point.

  • Low consensus in relevance judgements of pictures. (FERMI study; but is this actually also true of text? Does this undermine TCs?)

  • Precision and recall are clearly of dubious value as measures of utility for real tasks.

  • Even if they are valid, parts of the system other than the retrieval engine have major effects on task success.

  • We must be able to evaluate more radical designs (such as that by Iain Campbell).
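
    To make the ordered-list point from the list above concrete: even the simplest rank-sensitive measure, precision at a cutoff k, gives one ranking different scores at different cutoffs, something a single binary comparison cannot express. A minimal sketch (Python purely for illustration; data and names invented):

        def precision_at_k(ranking, relevant, k):
            # ranking:  document ids in the order the engine returned them.
            # relevant: the TC's binary human judgements, as a set of ids.
            return sum(1 for doc in ranking[:k] if doc in relevant) / float(k)

        ranking = ["d3", "d7", "d1", "d9", "d2"]
        relevant = {"d3", "d2"}
        for k in (1, 3, 5):
            print(k, precision_at_k(ranking, relevant, k))  # 1.0, 0.33..., 0.4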

    Summary of the aim

    We would like a test facility that has at least the advantages of old TCs: fast, automatic (i.e. no human subjects) lab tests of engines that can drive rapid technical improvement. But it must also apply to the new technical scope of IR, and accommodate the issues listed above. The project is addressed to discovering whether this is possible, what such a facility would be like, and to providing demonstrators of parts of such a facility.

    B. Outline solution

    The research question

    To investigate the possibility and usefulness of building a TC evaluation facility for interactive and multimedia IR. (I.e. not to build a full TC to a pre-specification, but to borrow and build illustrative collections, in order to establish what is useful and in what way.)

    Main research activities

    Strand 1 To study a number of real work domains / situations.
    This will allow validity checks (strand 3), exercise various document types (strand 2), and identify user needs that may not be satisfied by any existing technology.

    Strand 2 Gain access to a variety of multimedia document collections.
    These will often be borrowed from others rather than created and stored by us (e.g. Bill Arms' project). Work described as "digital libraries" focusses on this; our emphasis is on testing (evaluation), not collecting.

    Strand 3 Evaluating the evaluations
    Testing the validity of the tests, e.g. comparing test scores against measures of utility to users in workplaces. A large part of this will be comparing different levels/types of user goal (information need).
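
    One crude form such a validity check could take (my illustration; neither the method nor its inputs are fixed by this proposal): score the same set of engines with a lab measure and with a workplace utility measure, then ask whether the two order the engines the same way, e.g. by rank correlation.

        def spearman_rho(lab_scores, utility_scores):
            # Spearman rank correlation (no-ties formula) between a lab measure
            # and a workplace utility measure over the same engines; +1 means
            # the two measures rank the engines identically.
            def ranks(vs):
                order = sorted(range(len(vs)), key=lambda i: vs[i])
                r = [0] * len(vs)
                for rank, i in enumerate(order, 1):
                    r[i] = rank
                return r
            rx, ry = ranks(lab_scores), ranks(utility_scores)
            n = len(lab_scores)
            d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
            return 1.0 - 6.0 * d2 / (n * (n * n - 1))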

    Strand 4 Explore alternative proxy measures
    Precision and recall are the standard measures, used for convenience (i.e. speed, cheapness, direct comparability between IR engines) by designers to stand in for direct measures of value to the end users. Are there other useful proxy measures?
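
    As one illustration of the kind of alternative we might explore (reciprocal rank is my example here, not one the group has chosen): for tasks where the user needs only one good document, the rank of the first relevant hit may be a better proxy than recall, which rewards an exhaustiveness the user does not want.

        def reciprocal_rank(ranking, relevant):
            # 1/rank of the first relevant document; 0.0 if none is retrieved.
            # Computable from the same judgements a TC already holds.
            for rank, doc in enumerate(ranking, 1):
                if doc in relevant:
                    return 1.0 / rank
            return 0.0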

    Overview

    We wish to explore the design of a test facility. As will be seen, this is a large space, and we will attempt to sample it rather than cover all parts exhaustively.

    The space is generated by the "dimensions":
    A. Collection, query, engine, evaluation measures
    B. Types (or levels) of information need

    A.
    The traditional system-orientated text retrieval test facility includes four components: query, engine, collection, evaluation measures. In interactive and multimedia retrieval each of these components acquires new dimensions. To take account of the user, the query needs to be extended to include the concept of information need (see the reference to Mizzaro's paper below). Equally, the search engine cannot be considered in isolation: some consideration needs to be given to the user interface, i.e. how the system and different types of information are presented to the searcher, as well as to the system/user interaction. In addition, in the context of multimedia the collection component incorporates different types of presentational media and document (besides issues of different types of document in a single medium, e.g. academic paper abstracts vs. newspaper articles). Since we are studying evaluation and wish to measure the validity of measures in various cases, a fourth factor is that of the evaluation measures used. Consequently, in order to reflect all these different and new dimensions, a variety of new measures needs to be explored.

    B.
    Much of our approach can be organised around a dimension of 4 types or levels of information need. We take these from Stefano Mizzaro's paper (which not only defines them, but relates them to the many related distinctions made in the literature). He defines the types in terms of the representation in which the need is expressed. (See also my summary of his framework.)

    Traditional TCs test only the fourth level (of queries). Workplace studies address the first level (RIN). By attempting to measure the validity of laboratory measures, we can study the approximations that have been accepted for so long, and which may or may not apply to multimedia and other new forms of IR. On the other hand, we will also study how to supply, if possible, the needs of laboratory testing: that is, the advantage of having tests that are fast and cheap and can be applied without human users. One avenue to pursue may be the development of "simulated users", so that some lab benchmarks can be run on interactive software but without human users. At least a few simple types could easily be constructed: a simulated user that chooses a document (from the list) randomly, one that picks the first document with a keyword in the surrogate, etc.; two such are sketched below. These simulated users can be a) motivated by simulating strategies observed in at least some real users, and b) checked against real human users to estimate the validity of using them in lab tests.
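
    A minimal sketch of those two simple simulated users (Python and all names are inventions for illustration; a real benchmark harness would feed the chosen document back into the interactive system):

        import random

        # Each hit is a (doc_id, surrogate) pair, the surrogate being the short
        # text shown in the result list (e.g. a title or snippet).

        def random_chooser(hits, rng=random.Random(0)):
            # Simulated user: choose a document from the list at random.
            return rng.choice(hits)[0] if hits else None

        def first_keyword_chooser(hits, keywords):
            # Simulated user: choose the first document whose surrogate
            # contains any query keyword.
            for doc_id, surrogate in hits:
                if any(kw.lower() in surrogate.lower() for kw in keywords):
                    return doc_id
            return None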

    This set of levels of information need also allows us to cover the "task-artifact" cycle in IR: that is, the fact that in some cases technology is developed to satisfy pre-existing user needs (traditional task-driven design), and in other cases new needs and even jobs are created to exploit new technology that was invented without any clear need (artifact-created demand). Studying work domains (RINs) will have as one result the identification of user needs not currently well served (manufacturers could identify new markets from this). Studying the performance of new collections and new engines in the lab against artificial measures is one way of studying artifacts independently of users (comparable to building and testing a rocket car that goes faster than anyone could possibly use a car for). Quite new uses for retrieval from video and images may well emerge if new technology makes them possible: perhaps in 10 years we will attach a picture to every email to illustrate the mood, which would be possible if image retrieval by "mood" became fast and effective. This approach from both ends (from both tasks and from new artifacts) is also important in our research strategy. We do not know, as we write the proposal, what technology will seem vital by the end of the project. Lab tests (the level of the query) will always be important, but they are by definition closely bound to the type of technology (which defines the "query" language and functions). There are very few examples of image retrieval engines at the moment, so we do not yet know what will come to seem standard as an image query language and function. But we can already study work domains (and RINs) in this area: only later will the query level settle down.

    It is essential to develop a test framework (and research strategy) that takes account of the WWW. Not only is the WWW now very important in most work domains; it is also a challenge because it is too big for any engine to search fully. Thus the old approach to TCs is no longer adequate, and we must work on sampling as well as on exhaustive retrieval, aiming to develop standard test procedures based on sampling.
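
    One simple shape a sampling-based procedure might take (a sketch under my own assumptions, not a worked-out design): judge a random sample of the retrieved set instead of the whole of it, which holds judging costs constant however large the retrieved set, or the WWW behind it, grows. Recall is harder, since estimating it also requires sampling the unretrieved part of the collection.

        import random

        def sampled_precision(retrieved, judge, sample_size=100,
                              rng=random.Random(0)):
            # Estimate precision from a random sample of the retrieved ids
            # (a list). judge(doc_id) -> True/False stands in for a human
            # relevance judgement of one sampled document.
            sample = rng.sample(retrieved, min(sample_size, len(retrieved)))
            return sum(1 for doc in sample if judge(doc)) / float(len(sample))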

    Tactics

    In every aspect of this proposal we should specify one or two precise activities and leave others explicitly unspecified, because technical and human opportunities will change during the project, and being very explicit would condemn our work to be out of date before the project ends. Furthermore, part of the project work should be to explore what is available and possible. In relation to work domain studies, I therefore suggest we pick only one or two that we specify in the proposal and for which users have already agreed to participate. Tentatively, I suggest we choose the work domain of designers of leaflets and brochures who need to pick images as illustrations. The advantages of this particular work domain are:
  • In text retrieval, IR adds most value when the goal consists of many weak constraints or requirements, rather than one strong one (which could be found without IR). I think this kind of image illustration task is like that, unlike "a picture of Chirac taken yesterday in Paris".
  • It is a real work domain, i.e. we can find people who have this as a paid job.
  • But we can also simulate it by hiring students who have the task described to them. Thus we could run many tests with substitute users and hope to find that this is a reasonable approximation.
