23 Jul 1998. Length about 2,000 words (13,000 bytes).
This is a WWW document maintained by Steve Draper, installed at http://www.psy.gla.ac.uk/~steve/mpeg7.html.
You may copy it.
MPEG-7 and IR
This is a note on MPEG-7 issues, following reports by Alan Smeaton and
Mark Dunlop to MIRA on MPEG-7 meetings, and the
presentation by Rob Koenen
to one of the MIRA meetings. It may of course show that I have not
understood the MPEG work properly.
Stephen W. Draper
MPEG's interest is in developing standards for future video technology. MIRA's
interest is in IR (information retrieval) for multimedia, and this will take
place in an environment strongly influenced by future MPEG decisions, so
MIRA and the IR community in general must keep up to date on MPEG.
As I understand it, MPEG-7 is about creating a standard notation
for video for extra information to support searching, or specifying a
framework for a set of such notations: that is, to do multimedia content
description to allow efficient search using standardised descriptions.
I suppose that the design decisions are about what the notation language is,
and how it is to be encoded along with the video.
The point of this memo is to make some points about the relationship of the
video medium, the new advances MPEG-7 is trying to make, and the IR field. By
"video" I mean the multimedia combination of moving images with an audio track
and sometimes a text track for subtitles (frequently provided by teletext in
broadcast television). By "IR" I mean the kind of text retrieval now familiar
from web search engines such as Alta Vista: based on free-form queries and
search that uses the whole text of documents, ignoring linguistic and document
structure.
The arrival of CD-ROM, and even more the ability to hold video documents on
disk, at least in compressed form, has for the first time made possible random
access to video. By random access, I mean fast access to any arbitrary place
in the document, with the same fast access time for all such places. Thus for
the first time, video can begin to have the basic retrieval advantages of
printed text. With video tape, it takes many minutes to get ready to show a
video clip: with random access, it at last becomes possible to "turn" to the
part you want wherever it happens to be stored in the original document. In a
sense, video is now coming out of the stone age and catching up with print. In
my view, the main point of MPEG-7 search information will be, not advanced
IR, but simply providing the same facilities that a printed non-fiction book
does — the equivalents of:
- Bookmarks. A great feature of printed books is that the reader can
make arbitrary marks on them e.g. insert a strip of paper as a page marker,
turn over a page corner for the same purpose, mark words or passages with
pencil, yellow highlighter, etc. Essentially these are an index created by the
reader, not the author; we are now used to them in web browsers as bookmarks.
Bookmarking does not even need page numbers as a reference mechanism. Simply
having bookmarks and random access will create a huge increase in the utility
of video.
- Page numbers. These are very crude but very useful. Note that a
page has no relationship to the structure or meaning of the text: the page
boundaries are meaningless. Furthermore they change with each new edition of a
book. Yet a teacher, for example, can give students page numbers as a way of
referring to a passage: they make possible personal references between people,
independent of the author, that allow random access to arbitrary parts of the
document. For video, the equivalent might be the running time in minutes.
(120 minutes for a film compared to 150 pages for a book, say.)
- Paragraphs, sections etc.
In books, typography is used to make these
visually salient, so that once open to the right page, the reader's eye can
find these very quickly: essentially by random access (relative to reading
speed) within a page. In video, these structural divisions are shot and scene
changes and will need to be provided. (I imagine that this will be done by a
table giving this structure and associating each structural element with a
pointer into the video using, say, time in minutes, seconds, frames; and this
table will be associated with the video as "search information", stored with
it, and connected to standard controls for the end user). With a book I
might tell you to start reading on p.23 at the paragraph beginning "He walked
outside ..."; and in video I might tell you to go to the 23rd minute and start
at the shot where the character walks out into the sunlight. Or with a book I
might say read from pages 53-62, and you would expect p.53 to have a major
section boundary e.g. be the start of a chapter. With video I might say view
from minutes 53-62, and you would actually go to the point at 53 minutes 0
seconds, and then skip forward to the next scene change, then view to 62
minutes and on to the end of the current scene.
- A contents page: a hierarchical structure manually created by the
author. Note that a contents page uses page numbers to allow random access.
But it is also an overview of the whole structure that is read by itself. So
with video, a contents page must be capable of being read in on demand, and
displayed visually (as text or graphics perhaps).
- An index page: a sorted list of keywords, manually selected by
the author, as another random-access mechanism into the document. Note that
indices to books are still generated basically by hand, not automatically
i.e. the author chooses personally what terms are to appear as keywords in
the index. Again, an index page relies on page numbers as the reference
mechanism; and needs to be accessible at any time.
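The contents page and index page described above are both author-made access structures built on a common reference mechanism. As a minimal sketch (in Python, with pointers given as minutes, seconds, frames; all labels and times are invented examples, not part of any MPEG-7 proposal):

```python
# A "contents page" for a video: an ordered list of author-labelled
# structural entries, each with a pointer into the video.
contents = [
    ("Part 1: before the summit", (0, 0, 0)),
    ("Part 2: the summit itself", (23, 4, 12)),
    ("Part 3: aftermath", (53, 0, 0)),
]

# An "index page": author-chosen keywords, each mapped to one or more
# pointers, kept sorted for display as in a book index.
index = {
    "African leaders": [(23, 10, 0), (54, 2, 8)],
    "press conference": [(12, 0, 0)],
}

def entry_at_or_after(table, minutes):
    """First contents entry starting at or after the given minute:
    how a player might 'skip forward to the next structural boundary'."""
    for label, (m, s, f) in table:
        if m >= minutes:
            return label
    return None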
Having discussed the kinds of structure we expect to want by analogy to
printed text documents, we may now consider structures suggested by video
itself. The first thing to grasp is that there will never be a single correct
or canonical hierarchy to use to represent structure: there are always
multiple non-commensurable structures. We could have guessed as much from the
fact that books standardly have both a contents page and an index, representing
different structures. In video we may want a hierarchy representing the scenes and
shots (and sometimes larger structures for when the story moves between times
and locations), but we would need another hierarchy to address requests such
as "the US president meeting African leaders". (For instance, someone
interested in this topic might construct a tree beginning with all the
president's duties, divided into foreign and domestic policy, then within
foreign into world regions, and so on.) There may be a disconnected set of
scenes all to do with such a meeting; but conversely there may be a scene
where the president does a number of things, only one of which concerns
African leaders. Thus these two structures are independent, and in general
have no simple mapping between them. They would need to be separately and
independently represented.
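The point that the scene hierarchy and a topical hierarchy are independent structures over the same timeline can be made concrete. In this sketch (names and time ranges invented for illustration) a topical range can cut across, or fall inside, scene boundaries:

```python
# Two independent hierarchies over the same video, each resolving to
# time ranges in seconds.
scene_structure = {
    "Act 1": {"Scene 1": (0, 300), "Scene 2": (300, 900)},
    "Act 2": {"Scene 3": (900, 1500)},
}
topic_structure = {
    "foreign policy": {"African leaders": [(310, 520), (1000, 1100)]},
    "domestic policy": {"budget": [(40, 250)]},
}

def scenes_overlapping(structure, lo, hi):
    """Scenes whose time range overlaps [lo, hi): shows that a topical
    passage need not coincide with any one scene."""
    hits = []
    for act in structure.values():
        for name, (start, end) in act.items():
            if start < hi and end > lo:
                hits.append(name)
    return hits
```

Here the "African leaders" topic touches material in two different scenes, while each of those scenes also contains material on other topics: neither hierarchy can be derived from the other.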
Similarly there is no fixed connection or priority between the three media
involved: images, sound, and text subtitles. For instance in a sports
programme, the video is probably the main organising medium and the sound track
(mainly commentary) is organised around it. But on news and documentary
programmes, it is mainly the other way round with the meaning carried by a
carefully scripted sound track and images used to illustrate or merely decorate
the words. Note too that it is a common technique in film for the sound track
to cut to a new scene several seconds before the vision does: so scene
boundaries do not happen at the same time on the two media. There is a single
logical structure of scenes, but no simple mapping to time and media.
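The observation that the sound track may cut to a new scene before the vision does implies that a single logical boundary needs per-medium times. A minimal sketch (the times are invented):

```python
# Per-medium boundary times for one logical scene change: in film the
# sound track often cuts to the new scene a few seconds before the
# vision does. Times are in seconds.
scene_change = {
    "scene": "Scene 4",
    "sound_start": 612.0,   # dialogue/commentary for the new scene begins
    "vision_start": 615.5,  # picture cuts to the new scene
}

def lead_time(change):
    """How far the sound anticipates the vision at this boundary."""
    return change["vision_start"] - change["sound_start"]
```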
This point is connected to the work by Lynda Hardman
and others on multimedia authoring languages.
(See for instance "The Amsterdam Hypermedia Model"
Communications of the ACM vol.37 (2), Feb 1994, pp.50-62 and earlier
papers.) Their contribution to multimedia is to go beyond the simple timeline
view of many tools, and show that in addition to that, multimedia authors need
an explicit hierarchical structure and view; and that these are not simply
equivalent with both views showing all the information. In fact this is
strictly comparable to word processors, where in general you need both a
"WYSIWYG" (cf. the timeline) view of how a document will be rendered on a page
and a structural view of sections, paragraphs, etc. Note that the
display hierarchy of pages, lines, words and characters does not have a
simple mapping on to the structural hierarchy of sections etc. You cannot
predict where line and page breaks will go from the structural view alone: it
also depends on things like page size; and similarly, footnotes belong
structurally with the point they refer to, but are displayed some distance
away at the foot of the page.
Thus, to repeat, there is a single logical structure of scenes, but no simple
mapping to the display structures of time and media; and as argued earlier,
there are semantic hierarchies that could describe the meaning of the content
that do not map simply on to either of these, just as a book's index cuts
across the structure of the book as represented by the contents page.
Note how book structure and mark-up is done mainly manually: we expect
the author to insert marks by hand for the boundaries of paragraphs, sections,
etc., and in addition to provide the information for the contents list and for
the index (keywords pointing to page numbers). Automatic extraction plays
little part. Readers provide a substantial additional amount of their own mark-up,
which they occasionally communicate to each other, but which isn't shared
globally (my private bookmarks are not of much interest to other people).
Flexibility and ease of use are important; standardisation is not very
important. An author may invent a new structure: they choose whether to call
their divisions "chapters" or "parts" or "sections"; and they choose how deep
their hierarchy will be. Readers cope with a wide variety of these. Note too
that books vary widely: fiction has less and less structure precisely because
random access is not important: stories are designed to be read in strict
sequence. Non-fiction however uses the full variety of alternative access
structures (sequence, contents page, index at the back).
The use of current IR technology for retrieving text documents may be
said, at the risk of gross over-generalisation, to have the following
characteristics:
- It emphasises big collections, e.g. all the articles that appeared over 5
years in one newspaper.
- It has found that it can successfully ignore all the structure carefully
added by hand (e.g. titles, sections, paragraphs), and indeed all the structure
in the language (e.g. grammar) i.e. it ignores all author mark-up.
- It uses automatic extraction i.e. it automatically re-processes documents to
use all the words in them and nothing else to build new indices that are used
by its software.
Its main use is to search collections that are so big they couldn't possibly be
searched by hand nor marked up again for this new purpose. Of course it isn't
very accurate, and it works by providing short lists of "likely" documents
which are then manually inspected by the user (and during that inspection, all
their internal structure is again important).
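The all-words, structure-ignoring approach described above can be sketched as a tiny inverted index (a simplification, of course: real engines add ranking, stemming, and much else):

```python
# A minimal inverted index in the spirit described above: every word of
# every document is indexed; all author mark-up and structure is ignored.
from collections import defaultdict

def build_index(docs):
    """Map each word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def search(index, query):
    """Return documents containing every query word (a crude AND query)."""
    sets = [index.get(word.lower(), set()) for word in query.split()]
    return set.intersection(*sets) if sets else set()
```

For example, indexing two short "documents" and searching for "African leaders" returns only the document containing both words.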
IR for video will probably be wholly parasitic on what is stored
for other reasons, and it will probably and rightly use a mixture of
methods. For instance if I could do text IR on the transcripts of TV
documentaries linked into random access to the corresponding points in the
video, that would be enormously useful. (BBC's best science documentary
series, Horizon, now has
full transcripts on the web.) But equally, a
total relevance feedback approach
like Iain Campbell's
would allow a user to find a visual sequence by similarity to other visual
sequences without explicit use of an associated text channel. It uses hidden
symbols associated with each document that the user never sees. This would
probably work with whatever content description was included from the
authoring process, however apparently meaningless this was to most end users.
In summary:
- The biggest single benefit will come simply from random access plus
users' private bookmarks: not in fact from content description or mark-up
transmitted with the video at all.
- The next biggest benefit will come from the equivalent of contents and
index "pages". The main lesson is that multiple alternative indices
will be required. They could all use, underneath, a common scheme for
referring to places in the video in terms of minutes, seconds, and frames.
In fact they should probably be arbitrary files, some of which will be sent
round with the video (they should probably be available at the start, so that
users can review the structure immediately), but some of which will be held
locally as private "bookmark" files.
These will be mainly manually created by the authors of the document. In
fact, we should probably encourage authors to add in all the information they
have to hand as part of the authoring process: storyboards, scenarios and
screenplays, complete texts, etc. Editing a video should soon include adding
in this extra information, which will be as useful during the authoring/editing
process as in viewing. Automatic extraction of scene boundaries
retrospectively will probably not be very important: just convenient for a
short period in the near future. Strong standardisation is probably
unimportant: the end user just needs to know where the marks lie, and the
hierarchy that the author imposed on the markers (like a section structure in a
document or book). Displaying the "contents page" will show the end user what
structure or "language" was used for this particular document.
- Finally IR will eventually add an ability to search across large
collections of video documents. It will probably be able to do this no
matter what content description and mark-up is supplied for other reasons
(see below). Certainly, that is what it has been able to do for text.
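The points above all assume a common referring scheme in minutes, seconds, and frames. Such a pointer reduces to a single absolute frame number once a frame rate is fixed; 25 fps is assumed here (PAL material), purely for illustration:

```python
FPS = 25  # assumed frame rate (PAL); NTSC material would differ

def to_frame(minutes, seconds, frames, fps=FPS):
    """Convert a (minutes, seconds, frames) pointer to an absolute frame number."""
    return (minutes * 60 + seconds) * fps + frames

def from_frame(frame, fps=FPS):
    """Convert an absolute frame number back to (minutes, seconds, frames)."""
    seconds, frames = divmod(frame, fps)
    minutes, seconds = divmod(seconds, 60)
    return (minutes, seconds, frames)
```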
A small test for IR would require hundreds of hours of video documentary (say),
while a small test of the basic facilities would only require one or two sample
video documents and look at how users could find their way within them. That
is where it would start; but in fact half of using IR is opening the documents
the IR engine returns as candidates and then trying to scan them quickly to
make a yes/no decision: so the basic within-document facilities are in fact
crucial to the success of an IR session, even though they are not used by the
IR engine itself.
The above arguments suggest that almost the only thing that matters for
standards is a standard syntax for referring to places in the video (e.g. by
minutes, seconds, frames) and a way of associating such pointers with a piece
of descriptive information. Other standardisation may not matter much. Within a
document, provided the content description can be displayed, users will make
sense of it whether it is a contents page or a transcript of the soundtrack.
The need to display such content suggests a language like HTML should be used
for which rendering software (i.e. browsers) already exist.
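To illustrate the suggestion that existing rendering software could display such descriptions, a contents table could be emitted as ordinary HTML for any browser. A sketch (the anchor-naming convention here is invented, not from any standard):

```python
def contents_to_html(entries):
    """Render (label, (minutes, seconds, frames)) entries as an HTML list,
    so any existing browser can display the 'contents page' of a video."""
    items = []
    for label, (m, s, f) in entries:
        items.append(
            f'<li><a href="#t{m:02d}{s:02d}{f:02d}">'
            f'{m:02d}:{s:02d}:{f:02d}</a> {label}</li>'
        )
    return "<ul>\n" + "\n".join(items) + "\n</ul>"
```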
In searching over a large collection, IR techniques will probably not be
sensitive to the type, structure, or format of the content description.
Database retrieval techniques will be vulnerable to a lack of standardisation,
but the difficulties of getting authors to conform will probably simply favour
the use of less fragile techniques such as IR, that can make some use of
whatever is provided.
A reply received from KPN Research:
MPEG-7 should allow what you describe ("structuring Video"), but
will also attempt to go beyond this. It addresses not only Video
but also other MM material (stand-alone or in combination) and it also
wants to make search on the basis of similarity possible. This requires
'low level' descriptions.
This means that your conclusion ("the only thing we need is a standard
for referring to places in video") is not one we can share, if we look
at the whole application base MPEG-7 is intended to support.
Especially in the long run, the approach will prove too limited.
There are many things that MPEG can learn about IR though, which
is why we greatly value MIRA's participation in our discussions.
ps: I guess you know that by following the links from the MPEG home page
you can find the relevant MPEG-7 documents. Especially the
Applications Document (in zipped WORD)
is interesting to read in this case.
Senior Project Manager
Multimedia Technology Group, KPN Research
PO Box 421, 2260 AK Leidschendam The Netherlands
tel +31 70 332 5310 fax +31 70 332 5567