draft 2.2 by
Stephen W. Draper
Forerunner documents:
MM = multimedia
TC = test collection
The technical basis of IR has changed enormously as computing has changed. It must now additionally address the WWW, multimedia, hypertext, and highly interactive systems. That is:
The question is: what can be done to build a TC to cover some or all of these new demands?
Strand 2
Gain access to a variety of multimedia document collections.
These will often be borrowed from others rather than created and stored
ourselves (e.g. Bill Arms' project). Work described as "digital libraries"
focusses on this: our emphasis is on testing (evaluation) not collecting.
Strand 3 Evaluating the evaluations
Testing the validity of the tests, e.g. comparing test scores against measures
of utility to users in workplaces. A large part of this will be comparing
different levels/types of user goal (information need).
Strand 4 Explore alternative proxy measures
Precision and recall are the standard measures, used for convenience (i.e.
speed, cheapness, direct comparability between IR engines) by designers to
stand in for direct measures of value to the end users. Are there other useful
proxy measures?
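As a reminder of what the standard proxy measures compute, here is a minimal sketch for a single query. The function name and the document identifiers are illustrative assumptions, and relevance is treated as a binary judgement, which is itself one of the simplifications under discussion.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query, assuming binary relevance."""
    retrieved = set(retrieved)
    relevant = set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    # Precision: fraction of retrieved documents that are relevant.
    precision = hits / len(retrieved) if retrieved else 0.0
    # Recall: fraction of relevant documents that were retrieved.
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical data: four documents retrieved, three known to be relevant.
retrieved_docs = ["d1", "d2", "d3", "d4"]
relevant_docs = ["d2", "d4", "d7"]
p, r = precision_recall(retrieved_docs, relevant_docs)
```

Here two of the four retrieved documents are relevant, and two of the three relevant documents were found. Note that recall needs the full set of relevant documents to be known, which is what a traditional TC supplies.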
The space is generated by the "dimensions":
A. Collection, query, engine, evaluation measures
B. Types (or levels) of information need
A.
The traditional system-orientated text retrieval test facility
includes four components: query, engine, collection, evaluation measures.
In interactive and multimedia retrieval each of these components incorporates
new dimensions. To take account of the user, the query needs to be extended
to include the concept of information need (see the reference to Mizzaro's
paper below). Equally the search engine cannot be considered in isolation.
Some consideration needs to be given to the user interface, i.e. how the
system and different types of information are presented to the searcher as
well as the system/user interaction. In addition the collection component in
the context of multimedia incorporates different types of presentational
media as well as documents (besides issues of different types of document in
a single medium, e.g. academic paper abstracts vs. newspaper articles). Since
we are studying evaluation and wish to measure the validity of measures in
various cases, a fourth factor here is that of the evaluation measures used.
Consequently, in order to reflect all these different and new dimensions, a
variety of new measures needs to be explored.
B.
Much of our approach can be organised around a dimension of 4 types or levels
of information need. We take these from
Stefano Mizzaro's paper
(which not only defines them, but relates them to the many related
distinctions made in the literature). He defines the types in terms of the
representation in which the need is expressed. (See also
my summary of his framework.)
Traditional TCs test only the fourth level (the query). Workplace studies address the first level (the real information need, RIN). By attempting to measure the validity of laboratory measures, we can study the approximations that have been accepted for so long, and which may or may not apply to multimedia and other new forms of IR. On the other hand, we will also study how, if possible, to meet the needs of laboratory testing: that is, to keep the advantage of tests that are fast and cheap and can be applied without human users. One avenue to pursue may be the development of "simulated users", so that some lab. benchmarks can be run on interactive software without human users. At least a few simple types could easily be constructed: a simulated user that chooses a document from the list at random, one that picks the first document with a keyword in the surrogate, etc. These simulated users may be a) motivated by strategies observed in at least some real users, and b) checked against real human users to estimate the validity of using them in lab. tests.
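The two simple simulated users mentioned above could be sketched as follows. This is only an illustration under stated assumptions: the result-list format (a list of dictionaries with "id" and "surrogate" fields), the function names, and the sample documents are all hypothetical, and keyword matching is done by plain case-insensitive substring search.

```python
import random


def random_user(results, rng):
    """Simulated user that picks any document from the list at random."""
    if not results:
        return None
    return rng.choice(results)["id"]


def keyword_user(results, keyword):
    """Simulated user that picks the first document whose surrogate
    contains the keyword (case-insensitive substring match)."""
    for doc in results:
        if keyword.lower() in doc["surrogate"].lower():
            return doc["id"]
    return None  # no surrogate mentions the keyword


# Hypothetical ranked result list as an interactive system might present it.
results = [
    {"id": "d1", "surrogate": "Annual report on river pollution"},
    {"id": "d2", "surrogate": "Image retrieval by colour histogram"},
    {"id": "d3", "surrogate": "Evaluating interactive retrieval systems"},
]

rng = random.Random(0)  # seeded so that a benchmark run is repeatable
random_choice = random_user(results, rng)
keyword_choice = keyword_user(results, "retrieval")  # picks "d2"
```

Passing the random number generator in explicitly keeps runs reproducible, which matters if such users are to serve as a benchmark. Checking them against logs of real users' choices would be the validity test described above.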
This set of levels of information need also allows us to cover the "task-artifact" cycle in IR: that is, the fact that in some cases technology is developed to satisfy pre-existing user needs (traditional task-driven design), and in other cases new needs and even jobs are created to exploit new technology that was invented without any clear need (artifact-created demand). Studying work domains (RINs) will have as one result the identification of user needs not currently well served (manufacturers could identify new markets from this). Studying the performance of new collections and new engines in the lab against artificial measures is one way of studying artifacts independently of users (comparable to building and testing a rocket car that goes faster than anyone could possibly use a car for). It is likely that quite new uses for retrieval from video and images will emerge if new technology makes them possible: for example, in 10 years we might attach a picture to every email to illustrate its mood, which would become possible if image retrieval by "mood" became fast and effective.
This approach from both ends (from tasks and from new artifacts) is also important in our research strategy. We do not know, as we write the proposal, what technology will seem vital by the end of the project. Lab tests (the level of the query) will always be important, but they are by definition closely bound to the type of technology (which defines the "query" language and functions). There are very few examples of image retrieval engines at the moment, and so we do not yet know what will come to seem standard as an image query language and function. But we can already study work domains (and RINs) in this area: only later will the query level settle down.
It is essential to develop a test framework (and research strategy) that takes account of the WWW. The WWW is not only now very important in most work domains; it is also a challenge, because it is too big for any engine to search fully. Thus the old approach to TCs is no longer adequate, and we must work on sampling as well as on exhaustive retrieval, aiming to develop standard test procedures based on sampling.
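One possible reading of a sampling-based procedure is to judge only a random sample of what an engine retrieves and to report an estimate with its uncertainty. The sketch below assumes simple random sampling and binary relevance. The function name, the stand-in judgement function, and the numbers are hypothetical, and the standard error uses the ordinary binomial approximation rather than any procedure proposed here.

```python
import math
import random


def estimate_precision(retrieved, judge, sample_size, rng):
    """Estimate precision from a simple random sample of the retrieved set.

    `judge` stands in for a human relevance judgement (document -> bool).
    Returns (estimated precision, approximate standard error).
    """
    sample = rng.sample(retrieved, min(sample_size, len(retrieved)))
    if not sample:
        return 0.0, 0.0
    n = len(sample)
    p_hat = sum(1 for doc in sample if judge(doc)) / n
    # Binomial standard error; ignores the finite-population correction.
    std_err = math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat, std_err


# Hypothetical retrieved set too large to judge exhaustively.
retrieved = list(range(10_000))


def is_relevant(doc):
    # Stand-in for human judgement: every fourth document is relevant,
    # so the true precision of this artificial set is 0.25.
    return doc % 4 == 0


estimate, std_err = estimate_precision(
    retrieved, is_relevant, sample_size=200, rng=random.Random(42)
)
```

Judging 200 documents instead of 10,000 gives an estimate with a standard error of roughly 0.03 in this artificial case. Estimating recall by sampling is harder, since it needs a sample of the relevant documents that were not retrieved.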