Address correspondence to Mike Burton, Department of Psychology, University
of Glasgow, Glasgow G12 8QQ, UK. Tel. +44 141 330 4060 Email mike@psy.gla.ac.uk
Abstract
Security surveillance systems often produce poor quality video, and
this may be problematic in gathering forensic evidence. We examine the
ability of subjects to identify target people captured by a commercially
available video security device. In Experiment 1 we show that subjects
personally familiar with the targets perform very well at identifying them,
but subjects unfamiliar with the targets perform very poorly. Police officers
with experience in forensic identification perform as poorly as other subjects
unfamiliar with the targets. In Experiment 2, we ask how familiar subjects
can perform so well. Using the same video device, we edited clips to obscure
the head, body or gait of the targets. Obscuring body and gait produced
a small decrement in recognition performance. Obscuring the target's head
had a dramatic effect on subjects' abilities to recognise them. This implies
that subjects are recognising the targets' faces, even in these poor quality
images.
Introduction
The psychological study of face recognition divides into two rather different topics. First, there are projects whose main focus is the recognition of faces previously unfamiliar to subjects (e.g. Brown, Deffenbacher and Sturgill, 1977; Ellis, 1975; Laughery, Alexander and Lane, 1971; for reviews see Clifford and Bull, 1978; Shepherd, Ellis & Davies, 1982). Second, there is a large literature on processes underlying recognition of familiar faces (for example see reviews by Bruce, 1988; Bruce & Humphreys, 1994; and theoretical developments by Bruce & Young, 1986; Burton, Bruce & Johnston, 1990; Burton, Young, Bruce, Johnston & Ellis, 1991).
Studies of unfamiliar face recognition often have a forensic motivation. In typical experiments, subjects are shown faces of unfamiliar people and are subsequently tested using a recognition memory procedure. It has been shown on a number of occasions that recognition of previously unfamiliar faces is rather poor (e.g. Yarmey, 1979). Despite this, juries are said to favour eye-witness face recognition reports, and attach considerable weight to them. It is therefore very important to establish the reliability of such reports across a range of conditions, and to discover techniques for improving the reliability of recognition (Shepherd et al, 1982).
Research in both familiar and unfamiliar face recognition very commonly uses high quality images of target people. However, recent developments in security surveillance throw up a particular problem with image quality. Small-scale security systems based on VHS video or closed circuit TV have become very common in Europe and North America. Such systems are often installed with little attention to optimising lighting conditions or viewing angle. This means that when an image or video sequence is needed for evidence (for example following a crime) it is not always easy to confirm whether the person captured in the security device is the same person accused or suspected of the crime.
In this paper we examine the effectiveness of human face recognition in poor-quality video images. In particular, we are concerned with the effects of familiarity on recognition ability. There are two main questions of interest. First, how good is face recognition in poor quality images? To answer this question we use video sequences captured from a commercially-available video security system. Second, if subjects are able to recognise people from these images, what is the basis for their recognition?
Experimental setting
Both experiments reported here use images from the same security device, which was chosen to be typical of many low-cost security systems. The Department of Psychology at the University of Glasgow, UK, uses a video security system installed by a local company. This is a VHS video device which is triggered when a person approaches the main entrance door of the building. Each time a person enters or leaves, a security light is automatically turned on, and about four seconds of video is recorded. The video camera is located on an inside wall directed towards the main entrance door, at a height of 9 feet. The detailed specification is: a vista NCD 340 CCTV camera (8mm f1.2 lens) with a Mitsubishi HS-5424E(B) Timelapse Cassette Recorder and a Fuji HQ+ 180 PAL VHS video tape.
Systems of this kind are very common in the local area, and the same security company supplies many local businesses. The system was not configured in any way for the purposes of this experiment. Informal observations suggest that the resulting image quality is rather poor, though tolerable in a low-cost system. Figure 1 shows a still from this system.
Experiment 1
In the first experiment, we examined the ability of subjects to recognise images from the security system, depending on whether they were personally familiar with the targets or not. Many of the people who walk through this particular system are lecturing staff at the University of Glasgow. It is therefore relatively easy to find subjects who are familiar with them, i.e. students who take classes in psychology. Similarly, it is relatively easy to find subjects who are unfamiliar with targets, i.e. students who do not take classes in psychology. In this experiment we also examined the ability of a set of police officers to recognise the targets. The police subjects were unfamiliar with the targets, but were experienced in making identification judgements.
The experiment makes use of a recognition memory procedure. In the first phase, subjects are shown a set of video sequences and told they will be asked to recognise people in these clips later. In the second phase, subjects are shown a set of high quality photos, and asked, for each photo, whether this person appeared in the first phase. Although this is not a direct analogue of the usual forensic situation (in which only one target would normally be being sought), it is a convenient task to use experimentally, because the same procedure can be used both with the familiar and the unfamiliar subjects. Recognition memory typically covaries with other face recognition tasks, and this procedure has been used commonly in the past to compare familiar and unfamiliar face recognition with the same target stimuli (e.g. Bruce, 1982).
Method
Video clips were chosen of 20 members of lecturing staff, 10 male and 10 female. These clips were taken from the routinely-collected video of people entering the building, i.e. they were not posed, and target people were not aware at the time that their video images would form part of an experiment. Clips were chosen which contained only one person entering the building. In addition to these clips, each target person was photographed on a different day, using a high-quality digital camera, under good lighting conditions. Examples are shown in Figure 1.
Sixty subjects volunteered to take part in the study. Of these, twenty students were recruited from the Department of Psychology, and each had been taught by all 20 of the target lecturers. A further twenty students were recruited from different departments throughout the University, none of whom had taken courses in the Department of Psychology. Finally, a group of twenty serving police officers was selected from those attending a course at a local police training school. These were experienced officers with an average of 13.5 years service.
Subjects were tested individually in an experimental room, and were shown video clips on a standard video recorder and TV. They were initially shown 10 of the available 20 video clips, and told they would be asked to identify these people later. Each subject was shown these clips twice, each time in a different random order. There was a short gap (2-3 seconds) between each clip, and a rest period of one minute after the videos had been seen. The particular subset of videos shown in this phase was counterbalanced across subjects.
There followed a test phase in which subjects were shown each of the 20 high quality images, one at a time. They were told that they would be shown 20 faces, and that half of these people had been present in the videos. They were asked to assign a rating of 1-7 to each of these photos. A score of 7 indicates that the subject is sure that the person appeared in the video; a score of 1 indicates that the person definitely did not appear in the video.
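The rating procedure above gives, for each subject, ten ratings for seen targets and ten for unseen targets. The following is a minimal sketch of how one subject's responses might be summarised. The function name, the target labels and the ratings are all hypothetical, and the difference score is only an illustrative summary, not the analysis reported in this paper.

```python
from statistics import mean


def summarise_ratings(ratings, seen_targets):
    """Summarise one subject's 1-7 ratings.

    ratings: dict mapping target label -> rating (1 = definitely not in
             the videos, 7 = sure the person was in the videos).
    seen_targets: set of target labels shown to this subject in phase one.
    Returns (mean seen rating, mean unseen rating, seen minus unseen).
    """
    seen = [r for t, r in ratings.items() if t in seen_targets]
    unseen = [r for t, r in ratings.items() if t not in seen_targets]
    mean_seen, mean_unseen = mean(seen), mean(unseen)
    return mean_seen, mean_unseen, mean_seen - mean_unseen


# Made-up ratings for four hypothetical targets (the experiment used 20).
example_ratings = {"T1": 7, "T2": 6, "T3": 1, "T4": 2}
example_seen = {"T1", "T2"}
summary = summarise_ratings(example_ratings, example_seen)
print(summary)
```

A subject who separates seen from unseen targets well produces a large positive difference; a subject who cannot tell them apart produces a difference near zero.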
Results and Discussion
Figure 2 shows the mean recognition scores given to seen and unseen targets. The results show that people in the known group perform well, assigning high scores to seen targets, and low scores to unseen targets. Subjects in the other two groups perform less well, making a smaller discrimination between seen and unseen targets. Formal analysis can be summarised as follows: i) all groups score seen targets significantly higher than unseen targets, though the effect is much larger in the familiar group; ii) there is no difference in performance between the unfamiliar student and police groups, but both are significantly poorer than the familiar group. Full details of ANOVAs are available from the authors. In summary, 2-way ANOVA shows no main effect of subject group (F(2,57) < 1), a significant effect of seen/unseen target (F(2,57) = 324, p < 0.001) and a highly significant interaction (F(2,57) = 92, p < 0.001).
These data show a very marked benefit for people personally familiar with the targets. The use of the ends of the rating scale was common in the familiar group, and subjects were very accurate indeed in making the seen/unseen decisions. Subjects unfamiliar with the targets performed very poorly, regardless of whether they were students or police officers. Although there were reliable differences on the judgements between targets which had been present and those not present, differences were comparatively small. These results seem particularly important for the issue of security surveillance. If images of this quality are to be used as legal evidence, it is important to demonstrate the basis of recognition judgements. From this study, it seems that only personal familiarity will provide a good basis for accuracy of judgements.
What is the basis for the high scores of the familiar subjects? People familiar with the targets may be recognising a number of characteristics. For example, it may be that subjects are recognising the clothes of their lecturers, or their body shape or gait. It seems reasonable to propose that subjects will use any cue available in order to make the identification. The very low resolution of the information carried in the face (very few scan lines on the video) leads to the hypothesis that it is not faces which subjects are recognising in this study, but whole bodies, and we examine this further in Experiment 2.
Experiment 2
In this study clips from the same video security device were used. However, only subjects familiar with the targets were recruited. In order to examine the basis for the familiarity advantage, we selectively disrupted aspects of the video by obscuring the head, body or gait of the targets in videos. As this experiment uses only familiar subjects, a simple identification task is used, rather than the recognition memory task used in Experiment 1.
Method
Video clips were taken of 15 target people. Ten of these people were lecturing staff (6 male, 4 female) who would be familiar to all subjects. The remaining five people were visitors (3 male, 2 female) who would not be familiar to subjects. In contrast to Experiment 1, video clips were not taken from naturally occurring incidents on the surveillance video. Instead, target people were asked to walk into the building on a prescribed route through the door and towards the camera until they passed out of its range. All clips were gathered on the same day. All clips were edited to last for 3 seconds. Figure 3a shows a still from one of these videos.
Copies of the resulting 15 video clips were edited, using digital video editing equipment, in each of the following ways:
Body obscured: A black rectangle was positioned over the body, scaled to fit the body but not to obscure the background or the head of the person. The rectangle tracked the person through the video sequence, changing shape as necessary (i.e. growing as the subject approaches the camera). A still from this sequence is shown in Figure 3b.
Head obscured: A black rectangle was positioned over the head, scaled to obscure the head but not the body of the person. Again, this rectangle tracked the head through the sequence, changing size as necessary. A still from this sequence is shown in Figure 3c.
Gait obscured: To disrupt gait information, the video frames were sampled at 7 equal intervals through the three second period. Instead of showing all frames (and hence continual motion) only seven still frames were shown, each for an equal period, and summing to 3 seconds. This manipulation destroys the apparent motion of the video. The viewer sees 7 snapshots rather than a moving display, and this makes it very difficult to perceive the gait of the target.
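The gait manipulation can be sketched as a frame schedule: the clip is divided into seven near-equal segments and one still is held for the whole of each segment. This is a minimal sketch under stated assumptions. The function name is hypothetical, the figure of 75 frames assumes a standard 25 frames-per-second PAL signal over 3 seconds (the time-lapse recorder may have stored fewer), and holding the first frame of each segment is a guess, since the text does not say which frame within each interval was used.

```python
def snapshot_schedule(n_frames: int, n_stills: int = 7) -> list[int]:
    """For each output frame, return the index of the source frame shown.

    The clip is split into n_stills near-equal segments; the first frame
    of each segment is held for the duration of that segment.
    """
    if n_stills < 1 or n_frames < n_stills:
        raise ValueError("need at least one source frame per still")
    schedule = []
    held = 0
    previous_segment = -1
    for out in range(n_frames):
        segment = out * n_stills // n_frames  # 0 .. n_stills - 1
        if segment != previous_segment:
            held = out  # a new segment starts here: hold this frame
            previous_segment = segment
        schedule.append(held)
    return schedule


# Assumed: 3 seconds at 25 frames per second.
schedule = snapshot_schedule(75)
print(sorted(set(schedule)))
```

Because 75 is not divisible by 7, the stills are held for 10 or 11 frames each rather than for exactly equal periods.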
The editing procedure resulted in 60 different clips, 15 people x 4 conditions (body obscured, head obscured, gait obscured, unedited). Five different stimulus tapes were prepared in the following way. On each tape, the first 45 clips showed a randomly-ordered sequence of the 45 edited clips (i.e. all clips except the original unedited version). The 15 unedited clips then appeared in a randomly ordered sequence. So, the edited clips were not presented in blocks, condition by condition, but in mixed order. However, these all preceded the unedited clips.
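The tape layout described above (45 edited clips in mixed random order, followed by the 15 unedited clips in random order) might be generated along these lines. This is only an illustrative sketch: the function name, the use of integer target labels and the random seeds are assumptions, and the paper does not say how the random orders were actually produced.

```python
import random

EDITED_CONDITIONS = ["body obscured", "head obscured", "gait obscured"]


def build_tape(n_targets=15, seed=0):
    """Return one presentation order as a list of (target, condition)."""
    rng = random.Random(seed)
    edited = [(t, c) for t in range(n_targets) for c in EDITED_CONDITIONS]
    unedited = [(t, "unedited") for t in range(n_targets)]
    rng.shuffle(edited)    # edited conditions are mixed, not blocked
    rng.shuffle(unedited)  # unedited clips are shuffled separately
    return edited + unedited  # unedited clips always come last


# Five tapes, each with a different random order.
tapes = [build_tape(seed=s) for s in range(5)]
print(len(tapes), len(tapes[0]))
```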
Twenty-five volunteer subjects were recruited from students studying in the Department of Psychology, University of Glasgow. None had taken part in Experiment 1. Subjects were asked to identify each of the 60 clips in turn. The five different tapes (containing different random orders) were counterbalanced. Subjects were tested individually. They were told that they would see a series of videos and that some would contain people familiar to them. After each 3 second clip the experimenter asked whether they recognised the person in the clip, and if so, to identify them by name or other distinguishing information. There was no time limit for responses, and subjects were told that they should concentrate on the accuracy of their judgements.
Results
Overall accuracy was high. Averaging over all stimuli, subjects correctly identified 73% of the familiar targets, and correctly rejected 92% of the unfamiliar stimuli.
Responses to familiar stimuli were analysed in two
ways. First, data were analysed as though all four conditions were presented
in random order, taking subjects' average accuracy score in each of the
four conditions (body obscured, head obscured, gait obscured, unedited).
However, there are two potential problems here. First, the unedited condition
was not presented in random order, but always last. Therefore recognition
rates may be artificially high, due to subjects having become familiar
with the stimuli through exposure to the edited conditions. Second, and
potentially more serious, recognition in any condition could be affected
by prior exposure to a target person in a different condition. So, subjects
may recognise a person in the "head obscured" condition, because they
have recently seen that person in the "gait obscured" condition. For this
reason the data were also analysed for "first view" of each target person
only. In this analysis, subjects contribute only 10 data points, one for
each familiar target person. Only the response made in the condition in which
this person was first seen enters into the analysis. This gives a second measure
of accuracy.
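The "first view" analysis can be sketched as a filter over one subject's responses: only the first presentation of each familiar target is kept, and accuracy is then expressed as a proportion within each condition. The function name, the target labels and the responses below are hypothetical and serve only to illustrate the selection rule.

```python
from collections import defaultdict


def first_view_proportions(presentation_order, correct, familiar_targets):
    """Proportion correct per condition, counting first views only.

    presentation_order: list of (target, condition) in the order shown.
    correct: list of booleans, one per presentation.
    familiar_targets: set of targets known to the subject.
    """
    already_seen = set()
    hits = defaultdict(int)
    counts = defaultdict(int)
    for (target, condition), ok in zip(presentation_order, correct):
        if target in already_seen:
            continue  # later views of the same person are ignored
        already_seen.add(target)
        if target not in familiar_targets:
            continue  # only familiar targets enter this analysis
        counts[condition] += 1
        hits[condition] += int(ok)
    return {c: hits[c] / counts[c] for c in counts}


# Made-up responses for three hypothetical familiar targets.
order = [("A", "head obscured"), ("B", "gait obscured"),
         ("A", "body obscured"), ("C", "body obscured"),
         ("B", "head obscured")]
responses = [False, True, True, True, False]
result = first_view_proportions(order, responses, {"A", "B", "C"})
print(result)
```

Since each tape places a different number of first views in each condition, proportions rather than raw counts are the natural unit here.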
Details of all statistical analyses are available
from the authors, but can be summarised as follows. Analysis of the "hit"
scores revealed a highly significant effect of condition (F(3,72) = 233;
p < 0.001). Tukey HSD (honestly significant difference) tests showed
that the unedited condition produced significantly more correct identifications
than any other condition, the gait obscured and body obscured conditions
did not differ, but both produced reliably more identifications than the
head obscured condition. Analysis of the "miss" scores shows the identical
pattern of results. Finally, "incorrect" errors were very infrequent, and
were not analysed further.
Figure 5 shows mean identification scores expressed only for the first time each item is encountered. Each of the different orders (tapes) presented to subjects differed slightly in the number of targets appearing for the first time in each category, and so data are expressed as proportions. Note that the targets in the unedited condition were always shown last, and so do not appear in Figure 5. Analysis of the "hit" scores reveals a highly significant effect of condition (F(2,48) = 107; p < 0.001). Tukey HSD tests revealed that the head obscured condition gave rise to significantly fewer hits than either of the other conditions and that the body obscured and gait obscured conditions did not differ significantly. ANOVA on the "miss" scores showed the same pattern of poorer performance for the head obscured condition.
Discussion
The data from this experiment strongly suggest that subjects are using information from the face to identify people in these videos. There is a small (but reliable) reduction in accuracy when a person's gait or body is obscured. It is evident from Figure 4 that the "head obscured" condition is much worse than all others. This is most apparent in Figure 5, showing that when these images are seen for the first time, people are extremely poor at recognising them. It is in this condition that subjects have to rely on information from body shape, gait and knowledge of the people's clothes. However, it seems that they are unable to make good use of these cues to identify the target people.
The pattern of data described here can be summarised as follows. When viewing poor quality videos, people are very good at recognising familiar targets, and very poor at recognising unfamiliar targets. The advantage given by familiarity appears to be largely due to recognition of the face itself, rather than recognition of other cues such as gait, body shape or clothing.
These results have a number of important implications, both for theoretical and applied research in face recognition. Psychologists concerned with familiar face recognition have routinely sought to discover the building blocks of the recognition process. Faces can be parameterised in a number of different ways. For example, some researchers trying to automate the process have tried to characterise faces by a list of 2D distances in the picture plane, and relations between such measures (e.g. Sakai, Nagao and Kanade, 1972; Kanade, 1977; Burton, Bruce and Dench, 1993). More recently, others have used image-based tools relying on patterns of light and dark across the whole image (Turk and Pentland, 1991; Kirby and Sirovich, 1990; Burton, Bruce & Hancock, in press). It seems from these results that facial identities are available in relatively low resolutions and this is consistent with previous research on the spatial scale at which information about identity is available (Bachmann, 1991; Harmon and Julesz, 1973). However, the fact that videos are seen as sequences of frames provides much more information than any individual frame at this resolution. These issues of resolution are likely to be important to theoretical developments in face recognition.
The implications for forensic practice are also very important. In particular, it seems that identification of these types of video sequences is very unreliable, unless the viewer happens to know the target person. There have been some other recent findings which suggest that unfamiliar face matching is difficult, even in the context of high quality images. For example, Kemp, Towell and Pike (1997) studied the ability of supermarket cashiers to verify the identity of shoppers from a small (2cm square) photograph printed onto a credit card. Kemp et al found a high error rate in this setting. Cashiers correctly detected fraudulent identity cards on only 36% of trials when foils were chosen to resemble the card-bearers. Even when foils bore no particular resemblance to the bearer, detection of frauds was only 66%.
Some recent work in our own laboratory underlines the difficulty of unfamiliar face matching (Bruce, Henderson, Greenwood, Hancock, Burton and Miller, in press). In a series of three experiments we showed subjects pictures of unfamiliar targets taken on very high quality video, and asked them to pick out the same person from an array of high quality photographs. The video and photograph of the targets had been taken in good lighting conditions and on the same day, and so superficial aspects of the faces (hairstyle, weight etc) remained constant. Even in these apparently very favourable conditions, there was a high error rate. Using stills taken from the videos, errors were highest when there was a pose difference between the target video face and the photo arrays. However, even in a 10AFC condition, with no time pressure, simultaneous presentation of target and array, and unaltered pose, errors in the order of 25% were observed. Finally, we tested subjects' ability to match moving high quality video clips with an item from a simultaneously presented array of photographs. Once again, errors were unexpectedly high, in the order of 30%. These results, coupled with the results from the present paper, suggest that face recognition for unfamiliar people is dominated by pictorial codes, capturing image-specific details. Recognition of familiar people, on the other hand, is much more flexible, and appears to be mediated by more abstract representations, capable of generalisation over significant changes in image properties.
There are several issues which need to be resolved as a result of this work. First, it will be important to establish exactly the range of video material over which results such as these hold. The particular security system used here was only one example of a commercially available system, and it may be that systems with better image quality support better identification by unfamiliar viewers (though the study by Bruce et al, in press, suggests that improved quality will never eliminate completely the disadvantages observed for unfamiliar viewers). Furthermore, the particular setting of this experiment gives considerable contextual help to viewers familiar with the targets. All subjects familiar with targets in these experiments knew that the setting was the Psychology Department in their University, and that the people they were likely to see would be local academics. The help given by context and expectation needs to be quantified. For example, we do not yet know whether subjects would recognise a famous TV personality, should one happen to have passed unexpectedly through this video context. Similarly, it is not clear how accurate they would have been in recognising their lecturers if they had been presented in an unexpected context, such as a security recording of a crime. These are empirical questions, and it seems that there is a need for full exploration of the various parameters in order to guide good practice in the security industry. Second, these results show rather poor recognition of moving bodies, even by those subjects personally familiar with the target people. Again, this finding needs to be explored further. It seems intuitively reasonable to suppose that we do use gait and body-shape information to discriminate amongst people, but this intuition is not supported in the data.
Finally, those relying on video security surveillance systems need to examine the potential of biometric procedures for identification. In the particular case of poor quality video and unfamiliar viewers, one needs to establish a procedure for automatically deriving matches between targets and suspects. This will be a particularly difficult job. In the case of familiar face recognition, there are no existing systems which can out-perform human recognition. However, in the case of unfamiliar face recognition, it is clear that automatic procedures are needed which out-perform human abilities by a very large margin. Automatic recognition systems which have been developed and tested against one another do generally show good performance, routinely achieving over 90% accuracy in standardised tests (e.g. Phillips, Moon, Rauss and Rizvi, 1997). However, all these systems use high quality images on which to perform their analysis. The challenge for the next generation of automatic face recognition devices is to out-perform human levels of performance matching unfamiliar faces in low quality images.
Acknowledgement
This work was funded by a research project from the ESRC (ref R000236688)
to Vicki Bruce and Mike Burton.
References
Bachmann, T. (1991). Identification of spatially
quantised tachistoscopic images of faces: How many pixels does it take
to carry identity. European Journal of Cognitive Psychology, 3,
87-103.
Brown, E., Deffenbacher, K. & Sturgill, W. (1977).
Memory for faces and the circumstances of encounter. Journal of Applied
Psychology, 62, 311-318.
Bruce, V. (1982). Changing faces: Visual and non-visual
coding processes in face recognition. British Journal of Psychology,
73, 105-116.
Bruce, V. (1988). Recognising Faces. London:
Lawrence Erlbaum Associates.
Bruce, V., Henderson, Z., Greenwood, K., Hancock,
P.J.B., Burton, A.M. & Miller, P. (in press). Verification of face
identities from images captured on video. Journal of Experimental Psychology:
Applied.
Bruce, V. & Young, A. (1986). Understanding
face recognition. British Journal of Psychology, 77, 305-327.
Bruce, V. & Humphreys, G.W. (1994). Recognizing
objects and faces. Visual Cognition, 1, 141-180.
Burton, A.M., Bruce, V. & Johnston, R.A. (1990).
Understanding face recognition with an interactive activation model. British
Journal of Psychology, 81, 361-380.
Burton, A.M., Bruce, V. & Dench, N. (1993).
What's the difference between men and women? Evidence from facial measurement.
Perception, 22, 153-176.
Burton, A.M., Bruce, V. & Hancock, P.J.B. (in
press) From pixels to people: a model of familiar face recognition. Cognitive
Science.
Burton, A.M., Young, A.W., Bruce, V., Johnston,
R.A., & Ellis, A.W. (1991). Understanding covert recognition. Cognition,
39, 129-166.
Clifford, B.R. & Bull, R. (1978) The Psychology
of Person Identification. London: Routledge & Kegan Paul.
Ellis, H.D. (1975) Recognising faces. British
Journal of Psychology, 66, 409-426.
Harmon, L.D. & Julesz, B. (1973). Masking in
visual recognition: Effects of two-dimensional filtered noise. Science,
180, 1194-1197.
Kanade, T. (1977). Computer recognition of human
faces. Basel: Birkhauser Verlag.
Kemp, R., Towell, N. & Pike, G. (1997). When
seeing should not be believing: Photographs, credit cards and fraud. Applied
Cognitive Psychology, 11, 211-222.
Kirby, M. & Sirovich, L. (1990). Applications
of the Karhunen-Loeve procedure for the characterisation of human face.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 12,
103-108.
Laughery, K.R., Alexander, J.F. & Lane, A.B.
(1971). Recognition of human faces: Effects of target exposure time, target
position, pose position and type of photograph. Journal of Applied Psychology,
55, 477-483.
Phillips, P.J., Moon, H., Rauss, P. & Rizvi,
S. (1997). The FERET evaluation methodology for face recognition algorithms.
Proceedings of Computer Vision and Pattern Recognition 97 pp. 137-143.
Los Alamitos, Ca.: IEEE.
Sakai, T., Nagao, M. & Kanade, T. (1972). Computer
analysis and classification of photographs of human faces. Proceedings
of the first USA-Japan Computer Conference, 55-62.
Shepherd, J.W., Ellis, H.D. & Davies, G.M. (1982).
Identification evidence: A psychological evaluation. Aberdeen: University
of Aberdeen Press.
Turk, M. & Pentland, A. (1991). Eigenfaces for
recognition, Journal of Cognitive Neuroscience, 3, pp. 71-86.
Yarmey, A.D. (1979) The psychology of eye-witness
testimony. New York: The Free Press.