Last changed 23 Jul 1998 ............... Length about 2,000 words (13,000 bytes).
This is a WWW document maintained by Steve Draper, installed at http://www.psy.gla.ac.uk/~steve/mpeg7.html. You may copy it.

MPEG-7 and IR

Contents

Preface
Introduction
Video achieves random access
Multiple structures for video
Summary of book technology
Summary of IR technology
Summary of benefits to video of search facilities
IR for video
Implications for content description standards
A short reply by Rob Koenen

by
Stephen W. Draper

Preface

This is a note on MPEG-7 issues, following reports by Alan Smeaton and Mark Dunlop to MIRA on MPEG-7 meetings, and the presentation by Rob Koenen to one of the MIRA meetings. It may of course show that I have not understood the MPEG work properly.

The MPEG interest is in developing standards for future video technology. The MIRA interest is in IR (information retrieval) for multimedia, which will take place in an environment strongly influenced by future MPEG decisions, so MIRA and the IR community in general must keep up to date on MPEG.

Introduction

As I understand it, MPEG-7 is about creating a standard notation for extra information attached to video to support searching, or about specifying a framework for a set of such notations: that is, multimedia content description that allows efficient search using standardised descriptions. I suppose that the design decisions concern what the notation language is, and how it is to be encoded along with the video.

This memo makes some points about the relationship between the video medium, the new advances MPEG-7 is trying to make, and the IR field. By "video" I mean the multimedia combination of moving images with an audio track and sometimes a text track for subtitles (frequently provided by teletext in broadcast television). By "IR" I mean the kind of text retrieval now familiar from web search engines such as Alta Vista: based on free-form queries and search that uses the whole text of documents, ignoring linguistic and document structure.

Video achieves random access

The arrival of CD-ROM and even the ability to hold video documents on disk, at least in compressed form, has for the first time made possible random access to video. By random access, I mean fast access to any arbitrary place in the document, with the same fast access time for all such places. Thus for the first time, video can begin to have the basic retrieval advantages of printed text. With video tape, it takes many minutes to get ready to show a video clip: with random access, it at last becomes possible to "turn" to the part you want wherever it happens to be stored in the original document. In a sense, video is now coming out of the stone age and catching up with print. In my view, the main point of MPEG-7 search information will be, not advanced IR, but simply providing the same facilities that a printed non-fiction book does — the equivalents of:

Bookmarks. A great feature of printed books is that the reader can make arbitrary marks on them, e.g. insert a strip of paper as a page marker, turn over a page corner for the same purpose, or mark words or passages with pencil, yellow highlighter etc. Essentially these are an index created by the reader, not the author; we are used to these now in web browsers. Such marks do not even need page numbers as a reference mechanism. Simply having bookmarks and random access will create a huge increase in the utility of video.
Page numbers. These are very crude but very useful. Note that a page has no relationship to the structure or meaning of the text: the page boundaries are meaningless. Furthermore they change with each new edition of a book. Yet a teacher, for example, can give students page numbers as a way of referring to a passage: they make possible personal references between people, independent of the author, that allow random access to arbitrary parts of the document. For video, the equivalent might be the running time in minutes. (120 minutes for a film compared to 150 pages for a book, say.)
Paragraphs, sections etc. In books, typography is used to make these visually salient, so that once open to the right page, the reader's eye can find these very quickly: essentially by random access (relative to reading speed) within a page. In video, these structural divisions are shot and scene changes and will need to be provided. (I imagine that this will be done by a table giving this structure and associating each structural element with a pointer into the video using, say, time in minutes, seconds, frames; and this table will be associated with the video as "search information", stored with it, and connected to standard controls for the end user). With a book I might tell you to start reading on p.23 at the paragraph beginning "He walked outside ..."; and in video I might tell you to go to the 23rd minute and start at the shot where the character walks out into the sunlight. Or with a book I might say read from pages 53-62, and you would expect p.53 to have a major section boundary e.g. be the start of a chapter. With video I might say view from minutes 53-62, and you would actually go to the point at 53 minutes 0 seconds, and then skip forward to the next scene change, then view to 62 minutes and on to the end of the current scene.
A contents page: a hierarchical structure manually created by the author. Note that a contents page uses page numbers to allow random access. But it is also an overview of the whole structure that is read by itself. So with video, a contents page must be capable of being read in on demand, and displayed visually (as text or graphics perhaps).
An index page: a sorted list of keywords, manually selected by the author, as another random-access mechanism into the document. Note that indices to books are still generated basically by hand, not automatically i.e. the author chooses personally what terms are to appear as keywords in the index. Again, an index page relies on page numbers as the reference mechanism; and needs to be accessible at any time.
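
The "view from minutes 53-62" behaviour described under paragraphs and sections can be sketched in a few lines. This is only an illustration under stated assumptions: the scene table, its times in seconds, and the function name snap_range are all hypothetical, not part of any MPEG proposal.

```python
from bisect import bisect_left, bisect_right

# Hypothetical table of scene-start times, in seconds from the start
# of the video, sorted in ascending order.
SCENE_STARTS = [0, 1200, 3150, 3195, 3600, 3745, 4000]


def snap_range(start_s, end_s, scene_starts, total_s):
    """Turn a rough time range into one aligned with scene boundaries.

    The start skips forward to the next scene change (or stays put if it
    already lies on one). The end runs on to the end of the scene that is
    playing at end_s, i.e. to the following scene start, or to the end of
    the video if there is no later scene.
    """
    i = bisect_left(scene_starts, start_s)
    snapped_start = scene_starts[i] if i < len(scene_starts) else start_s
    j = bisect_right(scene_starts, end_s)
    snapped_end = scene_starts[j] if j < len(scene_starts) else total_s
    return snapped_start, snapped_end


# "View from minutes 53-62" of a two-hour video.
print(snap_range(53 * 60, 62 * 60, SCENE_STARTS, total_s=7200))
```

With the sample table, the requested range 53:00 to 62:00 becomes 53:15 to 62:25, the nearest enclosing scene boundaries in the sense described above.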

Multiple structures for video

Having discussed the kinds of structure we expect to want by analogy to printed text documents, we may now consider structures suggested by video itself. The first thing to grasp is that there will never be a single correct or canonical hierarchy to use to represent structure: there are always multiple incommensurable structures. We could have guessed this from the fact that books standardly have both a contents page and an index, representing different structures. In video we may want a hierarchy representing the scenes and shots (and sometimes larger structures for when the story moves between times and locations), but we would need another hierarchy to address requests such as "the US president meeting African leaders". (For instance, someone interested in this topic might construct a tree beginning with all the president's duties, divided into foreign and domestic policy, then within foreign into world regions, and so on.) There may be a disconnected set of scenes all to do with such a meeting; but conversely there may be a scene where the president does a number of things, only one of which concerns African leaders. Thus these two structures are independent, and in general have no simple mapping between them. They would need to be separately and independently represented.
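
A small sketch may make the independence of the two structures concrete. Everything here is invented for illustration: the scene names, the times (in seconds), and the helper scenes_touched are assumptions, and the only thing the two structures share is the timeline they point into.

```python
# Structure 1 (hypothetical): the scene list, as (name, start, end) in seconds.
scenes = [
    ("Arrival", 0, 300),
    ("Press conference", 300, 900),
    ("State dinner", 900, 1500),
]

# Structure 2 (hypothetical): a topic index, mapping a topic to the time
# segments that concern it. It is built without reference to the scenes.
topic_index = {
    "president meets African leaders": [(240, 360), (1100, 1200)],
}


def scenes_touched(segments, scenes):
    """List the scenes that overlap any of the given time segments."""
    names = []
    for seg_start, seg_end in segments:
        for name, start, end in scenes:
            overlaps = seg_start < end and start < seg_end
            if overlaps and name not in names:
                names.append(name)
    return names


print(scenes_touched(topic_index["president meets African leaders"], scenes))
```

In this example the single topic cuts across all three scenes while covering only part of each, so neither structure can be derived from the other.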

Similarly there is no fixed connection or priority between the three media involved: images, sound, and text subtitles. For instance in a sports programme, the images are probably the main organising medium and the sound track (mainly commentary) is organised around them. But on news and documentary programmes, it is mainly the other way round, with the meaning carried by a carefully scripted sound track and images used to illustrate or merely decorate the words. Note too that it is a common technique in film for the sound track to cut to a new scene several seconds before the vision does: so scene boundaries do not happen at the same time on the two media. There is a single logical structure of scenes, but no simple mapping to time and media.

This point is connected to the work by Lynda Hardman and others on multimedia authoring languages. (See for instance "The Amsterdam Hypermedia Model", Communications of the ACM vol.37 (2), Feb 1994, pp.50-62, and earlier papers.) Their contribution to multimedia is to go beyond the simple timeline view of many tools and show that, in addition, multimedia authors need an explicit hierarchical structure and view; and that these are not simply two equivalent views that each show all the information. In fact this is strictly comparable to word processors, where in general you need both a "WYSIWYG" view (cf. the timeline) of how a document will be rendered on a page and a structural view of sections, paragraphs, etc. Note that the display hierarchy of pages, lines, words and characters does not have a simple mapping on to the structural hierarchy of sections etc. You cannot predict where line and page breaks will go from the structural view alone: it also depends on things like page size; and similarly, footnotes belong structurally with the point they refer to, but are displayed some distance away at the foot of the page.

Thus, to repeat, there is a single logical structure of scenes, but no simple mapping to the display structures of time and media; and as argued earlier, there are semantic hierarchies that could describe the meaning of the content that do not map simply on to either of these, just as a book's index cuts across the structure of the book as represented by the contents page.

Summary of book technology

Note how book structure and mark-up are done mainly manually: we expect the author to insert marks manually for the boundaries of paragraphs, sections etc. and, in addition, to provide the information for the contents list and for the index (keywords pointing to page numbers). Automatic extraction has little part. Readers provide a substantial additional amount of their own mark-up, which they occasionally communicate to each other, but which isn't shared globally (my private bookmarks are not of much interest to other people). Flexibility and ease of use are important; standardisation is not very important. An author may invent a new structure: they choose whether to call their divisions "chapters" or "parts" or "sections"; and they choose how deep their hierarchy will be. Readers cope with a wide variety of these. Note too that books vary widely: fiction has much less structure, precisely because random access is not important to it: stories are designed to be read in strict sequence. Non-fiction however uses the full variety of alternative access structures (sequence, contents page, index at the back).

Summary of IR technology

The use of current IR technology for retrieving text documents may be said, at the cost of considerable over-generalisation, to have the following characteristics:

It emphasises big collections e.g. all the articles that appeared over 5 years in one newspaper.
It has found that it can successfully ignore all the structure carefully added by hand (e.g. titles, sections, paragraphs), and indeed all the structure in the language (e.g. grammar) i.e. it ignores all author mark-up.
It uses automatic extraction i.e. it automatically re-processes documents to use all the words in them and nothing else to build new indices that are used by its software.

Its main use is to search collections that are so big they couldn't possibly be searched by hand nor marked up again for this new purpose. Of course it isn't very accurate, and it works by providing short lists of "likely" documents which are then manually inspected by the user (and during that inspection, all their internal structure is again important).
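
The characteristics above can be illustrated with a toy example. This is a deliberately crude sketch, not a description of any real search engine: the scoring (count of distinct query words present) and the function names are assumptions chosen for brevity. It shows only the general idea of discarding all structure, indexing every word, and returning a short ranked list for manual inspection.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Reduce text to lower-case words, discarding all structure and mark-up."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Build an inverted index: word -> set of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index


def search(index, query, limit=10):
    """Return a short list of 'likely' documents, best match first.

    A document's score is the number of distinct query words it contains;
    ties are broken by document id so the order is deterministic.
    """
    scores = defaultdict(int)
    for word in set(tokenize(query)):
        for doc_id in index.get(word, ()):
            scores[doc_id] += 1
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [doc_id for doc_id, _ in ranked[:limit]]


# Hypothetical miniature "collection".
docs = {
    "a": "President meets African leaders",
    "b": "Football results: leaders win",
    "c": "Weather report",
}
print(search(build_index(docs), "african leaders"))
```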

Summary of benefits to video of search facilities

The biggest single benefit will come simply from random access plus users' private bookmarks: not in fact from content description or mark-up transmitted with the video at all.
The next biggest benefit will come from the equivalent of contents and index "pages". The main lesson is that multiple alternative indices will be required. They could all use, underneath, a common scheme for referring to places in the video in terms of minutes, seconds, and frames. In fact they should probably be arbitrary files, some of which will be sent round with the video (they should probably be available at the start, so that users can review the structure immediately), but some of which will be held locally as private "bookmark" files.
These will be mainly manually created by the authors of the document. In fact, we should probably encourage authors to add in all the information they have to hand as part of the authoring process: storyboards, scenarios and screenplays, complete texts, etc. Editing a video should soon include adding in this extra information, which will be as useful during the authoring/editing process as in viewing. Automatic extraction of scene boundaries retrospectively will probably not be very important: just convenient for a short period in the near future. Strong standardisation is probably unimportant: the end user just needs to know where the marks lie, and the hierarchy that the author imposed on the markers (like a section structure in a document or book). Displaying the "contents page" will show the end user what structure or "language" was used for this particular document.
Finally IR will eventually add an ability to search across large collections of video documents. It will probably be able to do this no matter what content description and mark-up is supplied for other reasons (see below). Certainly, that is what it has been able to do for text.
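
One way to picture "multiple alternative indices" over a common reference scheme is as several small, independent files whose entries all point at times in the same video. The sketch below assumes a made-up JSON layout (a list of entries with a time "t" in seconds and a "label"); the layout, the sample data, and the function names are illustrative assumptions only.

```python
import json

# Hypothetical author-supplied indices that travel with the video.
contents = [{"t": 0, "label": "Part 1"}, {"t": 1800, "label": "Part 2"}]
keyword_index = [{"t": 950, "label": "eclipse"}, {"t": 2400, "label": "eclipse"}]


def save_bookmarks(bookmarks):
    """Serialise a viewer's private bookmarks, ordered by time, as JSON text."""
    return json.dumps(sorted(bookmarks, key=lambda b: b["t"]))


def load_bookmarks(text):
    """Read bookmarks back from JSON text (empty text means no bookmarks)."""
    return json.loads(text) if text else []


def jump_targets(**sources):
    """Merge any number of indices into one time-ordered list of
    (time, source name, label) tuples."""
    merged = [
        (entry["t"], name, entry["label"])
        for name, entries in sources.items()
        for entry in entries
    ]
    return sorted(merged)


# The private bookmarks are kept locally; the other two came with the video.
stored = save_bookmarks([{"t": 1000, "label": "rewatch"}, {"t": 10, "label": "intro"}])
private = load_bookmarks(stored)
for target in jump_targets(contents=contents, index=keyword_index, mine=private):
    print(target)
```

Because every index uses the same time reference, a player could treat author-supplied and private indices identically, which is the point made above.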

IR for video

IR for video will probably be wholly parasitic on what is stored for other reasons, and it will probably and rightly use a mixture of methods. For instance if I could do text IR on the transcripts of TV documentaries linked into random access to the corresponding points in the video, that would be enormously useful. (BBC's best science documentary series, Horizon, now has full transcripts on the web.) But equally, a total relevance feedback approach like Iain Campbell's would allow a user to find a visual sequence by similarity to other visual sequences without explicit use of an associated text channel. It uses hidden symbols associated with each document that the user never sees. This would probably work with whatever content description was included from the authoring process, however apparently meaningless this was to most end users.
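
The transcript idea can be sketched very simply, assuming the transcript is available as segments that each carry a start time. The sample segments, the word-overlap scoring, and the name find_moments are all hypothetical; a real system would use a proper IR engine, but the essential link from a text match to a point in the video is the same.

```python
import re

# Hypothetical transcript: (start time in seconds, spoken text) per segment.
TRANSCRIPT = [
    (0, "Tonight we look at the total eclipse of the sun."),
    (95, "Astronomers travelled to the coast to watch the eclipse."),
    (240, "Back in the laboratory the data is analysed."),
]


def words(text):
    """Return the set of lower-case words in a piece of text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def find_moments(transcript, query):
    """Return the start times of matching segments, best match first.

    A segment's score is the number of query words it shares; segments
    sharing none are dropped, and ties keep the earlier segment first.
    """
    wanted = words(query)
    hits = []
    for start, text in transcript:
        score = len(wanted & words(text))
        if score > 0:
            hits.append((score, start))
    hits.sort(key=lambda hit: (-hit[0], hit[1]))
    return [start for _, start in hits]


# Each returned time is a place the player can jump to by random access.
print(find_moments(TRANSCRIPT, "eclipse coast"))
```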

A small test for IR would require hundreds of hours of video documentary (say), while a small test of the basic facilities would only require one or two sample video documents and would look at how users could find their way within them. That is where testing would start; but in fact half of using IR is opening the documents the IR engine returns as candidates and then trying to scan them quickly to make a yes/no decision: so the basic within-document facilities are in fact crucial to the success of an IR session, even though they are not used by the search engines.

Implications for content description standards

The above arguments suggest that almost the only thing that matters for standards is a standard syntax for referring to places in the video (e.g. by minutes, seconds, frames) and a way of associating such pointers with a piece of description.
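
To show how little such a reference syntax needs, here is a minimal sketch. It assumes a made-up "MM:SS:FF" notation and a fixed rate of 25 frames per second; neither is taken from any MPEG document, and a real standard would also have to deal with other frame rates.

```python
FPS = 25  # assumed frame rate; purely illustrative


def to_frames(minutes, seconds, frames, fps=FPS):
    """Convert a (minutes, seconds, frames) position to an absolute frame number."""
    return (minutes * 60 + seconds) * fps + frames


def from_frames(frame_number, fps=FPS):
    """Convert an absolute frame number back to (minutes, seconds, frames)."""
    total_seconds, frames = divmod(frame_number, fps)
    minutes, seconds = divmod(total_seconds, 60)
    return minutes, seconds, frames


def parse(reference, fps=FPS):
    """Parse a hypothetical 'MM:SS:FF' reference into a frame number."""
    minutes, seconds, frames = (int(part) for part in reference.split(":"))
    if not (0 <= seconds < 60 and 0 <= frames < fps):
        raise ValueError(f"out-of-range field in {reference!r}")
    return to_frames(minutes, seconds, frames, fps)


# A pointer paired with a piece of description, as suggested above.
entry = {"at": parse("23:15:12"), "description": "He walks out into the sunlight"}
print(entry["at"], from_frames(entry["at"]))
```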

Other standardisation may not matter much. Within a document, provided the content description can be displayed, users will make sense of it whether it is a contents page or a transcript of the soundtrack. The need to display such content suggests using a language like HTML, for which rendering software (i.e. browsers) already exists.

In searching over a large collection, IR techniques will probably not be sensitive to the type, structure, or format of the content description. Database retrieval techniques will be vulnerable to a lack of standardisation, but the difficulties of getting authors to conform will probably simply favour the use of less fragile techniques such as IR, that can make some use of whatever is provided.

A short reply by Rob Koenen

MPEG-7 should allow what you describe ("structuring Video"), but will also attempt to go beyond this. It addresses not only Video but also other MM material (stand-alone or in combination) and it also wants to make search on the basis of similarity possible. This requires 'low level' descriptions.

This means that your conclusion ('the only thing we need is a standard for referring to places in video') is not one we can share, if we look at the whole application base MPEG-7 is intended to support. Especially in the long run the approach will prove too limited.

There are many things that MPEG can learn about IR though, which is why we greatly value MIRA's participation in our discussions.

kind regards,
Rob Koenen

ps: I guess you know that by following http://www.cselt.it/mpeg you can find the relevant MPEG-7 documents. Especially the Applications Document (in zipped WORD) is interesting to read in this case.

Rob Koenen,
Senior Project Manager
Multimedia Technology Group, KPN Research
PO Box 421, 2260 AK Leidschendam The Netherlands
tel +31 70 332 5310 fax +31 70 332 5567
