28 Nov 1997. Length about 2,000 words (14,000 bytes).
This is a WWW document by Steve Draper, installed at http://www.psy.gla.ac.uk/~steve/Dag.html.
You may copy it.
Dagstuhl: what I learned
This is a review, or rather a short essay, on what I learned at Dagstuhl,
written for the web as part of the workshop output. It is partly a
contribution to the post-workshop deliverable, partly for my own sake, partly
a small foundation for planning Nancy, and partly groundwork for a possible
future paper.
Stephen W. Draper
At the MIRA workshop at Dagstuhl (14-18 April 1997) I personally learned
two important things. This is a note expanding on the one-sentence summaries
we were asked to give at the workshop. Each of the two things has a section
here; but in brief, they were the importance of "context" in what a retrieval
system needs to deliver to the user, and that using text queries to retrieve
images is not a cheat but is what important classes of user need, and is also
representative of one of the most central issues in multimedia information
retrieval: how media can cross-refer to other media.
If you use a textbook's index to find the places where the book discusses a
term, you look up the term in the index, go to the page listed, and scan the
page for the term. But you do NOT then
just start reading from that term onwards: at the very least you scan backwards
to the beginning of the sentence, and more probably you look back to see what
chapter and section it is part of in order to see the context in which the term
is discussed. If you did not do that, you would probably not get the
information you needed. Thus the structure of the document and the language
are used implicitly but crucially in delivering the content you want.
It is important to realise this. First, we must recognise that even though the
query engine in a typical IR program does not use the document or language
structures, the overall system does in a crucial way. Indeed the whole point
of an IR system is usually not just to re-print the query terms if they are
found in a document, but to print all or part of the document. Furthermore,
because so many words (at least in English) are ambiguous, the word itself does
not carry the meaning: the context carries the information about which of the
word's meanings is valid here. To put it another way: without the context the
text is not information but only data, and the user may not be able to judge
its relevance. The underlying technical and philosophical point is that while
formal languages rest their meaning on separate, prior definitions which it is
assumed that transmitter and receiver have previously agreed and synchronised
by mechanisms outside the language, natural languages carry at least part of
the meaning within themselves and the "context". That is why taking a word or
a sentence out of context often conveys a different meaning from the one
intended by the speaker and understood by the original audience.
The IR design issue is how to present the document or document-part so that the
presentation carries the context the user wants. In practice, retrieval of the
whole document is often not what is wanted. On the one hand, if the whole
document is retrieved without the query words being highlighted, then many
users reject it because they cannot see its relevance. At the other extreme,
if only the query words are re-printed, or if the document is printed starting
only from the first query word, the user will not find that useful. So the general design
problem, to which whole-document presentation is a poor solution, is how to
present retrieved documents so as to be most useful to the user given the
search query. The solution will be one which presents the right amount of
"context" around the hits, probably using the document structure even though
that may not have been used in the query-driven retrieval.
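The idea of presenting each hit with its surrounding context can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a description of any existing IR system: the function name `hits_in_context`, the toy document format (sections separated by blank lines, first line of each section acting as its heading) and the sample text are all invented here. It returns the whole sentence around each hit together with the heading of its section, and marks the query word so the user can see why the sentence matched.

```python
import re

def hits_in_context(document, term):
    """Return (heading, sentence) pairs for each sentence containing `term`.

    Assumed toy format: sections are separated by blank lines, and the
    first line of each section is its heading.
    """
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    results = []
    for section in document.strip().split("\n\n"):
        heading, _, body = section.partition("\n")
        # Naive sentence split: break after ., ! or ? followed by whitespace.
        for sentence in re.split(r"(?<=[.!?])\s+", body.strip()):
            if pattern.search(sentence):
                # Mark the hit so the user can see why the sentence matched.
                marked = pattern.sub(lambda m: "*" + m.group(0) + "*", sentence)
                results.append((heading, marked))
    return results

# Hypothetical sample document with an ambiguous word in two sections.
SAMPLE = """Rivers
The bank of a river erodes over time. Floods speed this up.

Finance
A bank lends money to customers. Interest is charged on loans."""

for heading, sentence in hits_in_context(SAMPLE, "bank"):
    print(f"[{heading}] {sentence}")
```

The query itself uses no document structure at all (it is a plain word match), yet the presentation relies on sentence and section boundaries, and it is the section heading that tells the user which meaning of the ambiguous word is in play.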
Another aspect of this is suggested by the observation that in some image
retrieval tasks, users are explicit that they do NOT want a single image
returned even if the software were so good that it could correctly calculate
which image would be chosen finally for the user's task. (For example,
designers asking a stock agency to supply a picture for an illustration.)
Instead, such users say they want a set returned: apparently this is not just
so that they can make a choice, but so that they can see something about the
set of neighbours. This may be an image retrieval equivalent of "context": of
seeing the sentence in which a query word appeared.
One of the half-day sessions focussed on image retrieval, and it also
focussed on particular user tasks. Taking these in turn brought home to me
several lessons, the first of which is how there are quite different user tasks
all involving retrieval of images.
The first case described was pictures of "art" i.e. paintings and photographs,
and a multi-dimensional description system founded on professional thinking in
that area. Professionals think in terms of concepts, and retrieval based on a
database-like system matches this: the database should be structured to
represent these concepts. The different "dimensions" (i.e. attributes) mean
that images can be retrieved in a number of independent ways e.g. what an image
is of (e.g. a woman) or about (e.g. motherhood). Next some experiments using
non-experts to group images in various ways extended the conceptual analysis.
Thirdly, there are commercial agencies with several million images
(photographs) on file who supply them to clients. These retrievals are based
on words filled in on a form i.e. again, essentially a database-like system.
Because an industry is based around this, it seems we can conclude that there
is a large and important user group who have needs for images expressed in
text. A variation of this would be a journalism-oriented collection, where
pictures could be categorised by who (e.g. Chancellor Kohl) and by expression
(e.g. laughing). Finally, however, another example was a retrieval program that
allowed users to express part of the query in terms of 2D space e.g. having a
lot of empty sky at the top right of the picture. This was associated with a
user group of graphic designers, who would use the pictures on leaflets but
would also need space to superimpose words. Because their work was essentially
that of 2D layout on the page, a 2D spatial query system matched their needs.
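The database-like, multi-dimensional description of images mentioned above might look roughly like the following Python sketch. It is an assumption-laden toy: the catalogue entries, the dimension names used as keys ("of", "about", "who", "expression") and the function `find_images` are invented for illustration and do not describe any of the systems presented at the workshop.

```python
def find_images(catalogue, **wanted):
    """Return ids of images whose attributes include every wanted term.

    `wanted` maps a dimension name (e.g. "of", "about") to a single term.
    """
    matches = []
    for image_id, attributes in catalogue.items():
        if all(term in attributes.get(dimension, set())
               for dimension, term in wanted.items()):
            matches.append(image_id)
    return matches

# Hypothetical catalogue: each image is described along independent dimensions.
CATALOGUE = {
    "img001": {"of": {"woman", "child"}, "about": {"motherhood"}},
    "img002": {"of": {"woman"}, "about": {"work"}},
    "img003": {"who": {"Chancellor Kohl"}, "expression": {"laughing"}},
}

print(find_images(CATALOGUE, of="woman"))
print(find_images(CATALOGUE, of="woman", about="motherhood"))
print(find_images(CATALOGUE, who="Chancellor Kohl", expression="laughing"))
```

Because each dimension is queried independently, the same image can be reached by what it is of, by what it is about, or by both together, all expressed in text.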
What does this tell me?
- Image retrieval is not one problem: there are at least as many different user
tasks involving retrieving images as there are involving retrieving words.
Different users may need completely different software to support their tasks.
- Studying particular real user groups is very informative for us; not only in
designing a particular IR program, but also in understanding general ideas
about multimedia IR.
- Using text queries to retrieve images is not a "cheat"; using visual queries
may be entirely the wrong way to retrieve images for some user groups e.g.
journalists, art historians. If users think about images in terms of verbal
concepts, then that is how their queries should be expressed.
- In fact, this probably represents a key issue for multimedia retrieval: how
one medium (text) refers to another (images).
Inter-reference of one kind of modality or medium to another is in fact a
general issue that is relevant in a number of ways to IR.
In human-computer interaction, one underlying but pervasive issue is how
user input may often not mean something by itself but only by referring either
to computer output (e.g. with menus, mouse input only means something by
referring to the menu displayed on the screen) or to previous user input (e.g.
in the Unix shell's history mechanism, or in any undo command). Output may
also refer to previous output or user input (e.g. in many error messages). For
a longer discussion, see Draper, S.W. (1986) "Display managers as the basis
for user-machine communication" in User Centered System Design, eds.
D.A. Norman & S.W. Draper (Erlbaum: London) pp. 339-352.
Note that interaction seems to entail inter-reference of input and output. If
IR uses interactive systems, it involves inter-referentiality.
At the MIRA workshop in Monselice a member of my group described how
the Swiss government has a considerable requirement for cross-language document
retrieval. There are four official languages in Switzerland. Government
officials are usually competent readers of all these languages, but may write
competently in only one of them. Hence their need to be able to issue a
query in one language, but to retrieve documents in all of them.
Because basic IR techniques have no understanding of language apart from word
stem matches, this would require an index of words in one language to refer to
collections in all four languages. This is probably comparable to the issues
of having text refer to images.
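As a rough illustration of an index in which words of one language refer to collections in all four, here is a minimal Python sketch. Everything in it is an assumption made for the example: the one-entry lexicon, the tiny toy documents, the language codes and the function names are hypothetical, and it matches whole lower-cased words rather than stems.

```python
import re
from collections import defaultdict

# Hypothetical hand-made lexicon: a German query word mapped to its
# equivalent in each of the four languages.
LEXICON = {
    "wasser": {"de": "wasser", "fr": "eau", "it": "acqua", "rm": "aua"},
}

# Toy collection: document id -> (language code, text).
DOCUMENTS = {
    "doc-de": ("de", "Das Wasser ist kalt."),
    "doc-fr": ("fr", "L'eau est froide."),
    "doc-it": ("it", "L'acqua è fredda."),
    "doc-rm": ("rm", "L'aua è fraida."),
    "doc-fr-2": ("fr", "Le vin est rouge."),
}

def build_index(documents):
    """Map (language, word) to the set of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, (language, text) in documents.items():
        for word in re.findall(r"\w+", text.lower()):
            index[(language, word)].add(doc_id)
    return index

def cross_language_search(query_word, lexicon, index):
    """Look up a German query word in every language via the lexicon."""
    found = set()
    for language, word in lexicon.get(query_word.lower(), {}).items():
        found |= index.get((language, word), set())
    return sorted(found)

INDEX = build_index(DOCUMENTS)
print(cross_language_search("Wasser", LEXICON, INDEX))
```

The matching engine still knows nothing about language; all the cross-reference lives in the lexicon, which plays much the same role as the text descriptions attached to images.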
Getting text queries to refer to images is just one of the many
combinations we need to consider in media retrieval problems.
In addition, another whole kind of inter-media reference is the kind tackled by
Lynda Hardman and her colleagues: to do with delivering ("playing") a
multimedia document. Naive multimedia technology uses only time sequence (the
time line) to organise and relate multiple media. In fact the structure within
each medium is important, and the structures could cross-refer. The very common
film technique in which, when cutting between scenes, the visuals and the
soundtrack cut at different times (e.g. you hear the sound of the new scene
several seconds before you see the new shot) shows that the relationship is
non-trivial. For
instance consider a pop video. The music has a structure (e.g. bars of music),
and the lyrics (words) are divided into lines; if you have multi-lingual
subtitles, then these are also divided into lines; the visuals will have a
structure of shots, and this structure probably has a relationship to the
I hope this makes it clear that this is a big subject: I will say no more
It may be worth a moment's thought to reflect on the ways in which, in
classical IR, the different items in a typical retrieval cycle refer to each
other.
- A formulated query (i.e. a set of terms) is meant to refer to a user's
information need; and bridging that gap is one big issue.
- A query refers to an ordered set of documents: that is what the IR engines
do. How good that mapping is forms the subject of classical evaluations.
- A surrogate (i.e. the summary description of a document on the ordered list
returned from a retrieval) is meant to refer to a whole document in a way that
allows a user to make a relevance judgement quickly.
- A document presentation must also support relevance judgements: hence the
considerable importance of highlighting query terms within the document, and
considering how to present the relevant context around them. These are in
effect attempts to improve the effectiveness with which a document can refer to
the user's information need.
- Relevance feedback is an attempt to get subsets of documents to refer to or
represent the user's information need; and to refer to another (to be
retrieved) subset of documents.
Clearly this is a complex subject. In the light of the above, the major
components may be:
- Each medium and its own technical problems, including: a) how to do user
input of queries in the medium when that is appropriate for a user group; b)
how to present a "document" when retrieved including how to present b2) its
"context" and b3) why it matched the query; c) how to present surrogates e.g.
thumbnail images, speeded up audio.
- User tasks, for which the retrieval machine will only be one tool that is a
means to part of the task. There is an infinite variety of such tasks.
- Human-machine interaction: how to design a complete user-machine interactive
cycle that maximises retrieval success and, more importantly, user-task-success
over a number of cycles.
- Inter-reference: between media, and between user input and machine output.