In press, Psychological Science
(to appear 1999)
Face recognition in poor quality video: evidence from security surveillance
 
A. Mike Burton, Stephen Wilson, Michelle Cowan
Department of Psychology
University of Glasgow
UK

Vicki Bruce
Department of Psychology
University of Stirling
UK
 
 
 

Address correspondence to Mike Burton, Department of Psychology, University of Glasgow, Glasgow G12 8QQ, UK. Tel. +44 141 330 4060 Email mike@psy.gla.ac.uk
 
 

Abstract

Security surveillance systems often produce poor quality video, and this may be problematic in gathering forensic evidence. We examine the ability of subjects to identify target people captured by a commercially available video security device. In Experiment 1 we show that subjects personally familiar with the targets perform very well at identifying them, but subjects unfamiliar with the targets perform very poorly. Police officers with experience in forensic identification perform as poorly as other subjects unfamiliar with the targets. In Experiment 2, we ask how familiar subjects can perform so well. Using the same video device, we edited clips to obscure the head, body or gait of the targets. Obscuring body and gait produced a small decrement in recognition performance. Obscuring the target's head had a dramatic effect on subjects' abilities to recognise them. This implies that subjects are recognising the targets' faces, even in these poor quality images.
 

Introduction

    The psychological study of face recognition divides into two rather different topics. First, there are projects whose main focus is the recognition of faces previously unfamiliar to subjects (e.g. Brown, Deffenbacher and Sturgill, 1977; Ellis, 1975; Laughery, Alexander and Lane, 1971; for reviews see Clifford and Bull, 1978; Shepherd, Ellis & Davies, 1982). Second, there is a large literature on processes underlying recognition of familiar faces (for example see reviews by Bruce, 1988; Bruce & Humphreys, 1994; and theoretical developments by Bruce & Young, 1986; Burton, Bruce & Johnston, 1990; Burton, Young, Bruce, Johnston & Ellis, 1991).

    Studies of unfamiliar face recognition often have a forensic motivation. In typical experiments, subjects are shown faces of unfamiliar people and are subsequently tested using a recognition memory procedure. It has been shown on a number of occasions that recognition of previously unfamiliar faces is rather poor (e.g. Yarmey, 1979). Despite this, juries are said to favour eye-witness face recognition reports, and attach considerable weight to them. It is therefore very important to establish the reliability of such reports across a range of conditions, and to discover techniques for improving the reliability of recognition (Shepherd et al, 1982).

    Research in both familiar and unfamiliar face recognition very commonly uses high quality images of target people. However, recent developments in security surveillance throw up a particular problem with image quality. Small-scale security systems based on VHS video or closed circuit TV have become very common in Europe and North America. Such systems are often installed with little attention to optimising lighting conditions or viewing angle. This means that when an image or video sequence is needed for evidence (for example following a crime) it is not always easy to confirm whether the person captured in the security device is the same person accused or suspected of the crime.

    In this paper we examine the effectiveness of human face recognition in poor-quality video images. In particular, we are concerned with the effects of familiarity on recognition ability. There are two main questions of interest. First, how good is face recognition in poor quality images? To answer this question we use video sequences captured from a commercially-available video security system. Second, if subjects are able to recognise people from these images, what is the basis for their recognition?

Experimental setting

    Both experiments reported here use images from the same security device, which was chosen to be typical of many low-cost security systems. The Department of Psychology at the University of Glasgow, UK, uses a video security system installed by a local company. This is a VHS video device which is triggered when a person approaches the main entrance door of the building. Each time a person enters or leaves, a security light is automatically turned on, and about four seconds of video is recorded. The video camera is located on an inside wall directed towards the main entrance door, at a height of 9 feet. The detailed specification is: a vista NCD 340 CCTV camera (8mm f1.2 lens) with a Mitsubishi HS-5424E(B) Timelapse Cassette Recorder and a Fuji HQ+ 180 PAL VHS video tape.

    Systems of this kind are very common in the local area, and the same security company supplies many local businesses. The system was not configured in any way for the purposes of this experiment. Informal observations suggest that the resulting image quality is rather poor, though tolerable in a low-cost system. Figure 1 shows a still from this system.

Experiment 1: the effects of familiarity

    In the first experiment, we examined the ability of subjects to recognise images from the security system, depending on whether they were personally familiar with the targets or not. Many of the people who walk through this particular system are lecturing staff at the University of Glasgow. It is therefore relatively easy to find subjects who are familiar with them, i.e. students who take classes in psychology. Similarly, it is relatively easy to find subjects who are unfamiliar with targets, i.e. students who do not take classes in psychology. In this experiment we also examined the ability of a set of police officers to recognise the targets. The police subjects were unfamiliar with the targets, but were experienced in making identification judgements.

    The experiment makes use of a recognition memory procedure. In the first phase, subjects are shown a set of video sequences and told they will be asked to recognise people in these clips later. In the second phase, subjects are shown a set of high quality photos, and asked, for each photo, whether this person appeared in the first phase. Although this is not a direct analogue of the usual forensic situation (in which only one target would normally be sought), it is a convenient task to use experimentally, because the same procedure can be used both with the familiar and the unfamiliar subjects. Recognition memory typically covaries with other face recognition tasks, and this procedure has been used commonly in the past to compare familiar and unfamiliar face recognition with the same target stimuli (e.g. Bruce, 1982).

Method

Figure 1: Images of the type used in Experiment 1. (1a) A still from a video; (1b) a photo taken in good lighting.

    Video clips were chosen of 20 members of lecturing staff, 10 male and 10 female. These clips were taken from the routinely-collected video of people entering the building, i.e. they were not posed, and target people were not aware at the time that their video images would form part of an experiment. Clips were chosen which contained only one person entering the building. In addition to these clips, each target person was photographed on a different day, using a high-quality digital camera, under good lighting conditions. Examples are shown in Figure 1.

    Sixty subjects volunteered to take part in the study. Of these, twenty students were recruited from the Department of Psychology, and each had been taught by all 20 of the target lecturers. A further twenty students were recruited from different departments throughout the University, none of whom had taken courses in the Department of Psychology. Finally, a group of twenty serving police officers was selected from those attending a course at a local police training school. These were experienced officers with an average of 13.5 years' service.

    Subjects were tested individually in an experimental room, and were shown video clips on a standard video recorder and TV. They were initially shown 10 of the available 20 video clips, and told they would be asked to identify these people later. Each subject was shown these clips twice, each time in a different random order. There was a short gap (2-3 seconds) between each clip, and a rest period of one minute after the videos had been seen. The particular subset of videos shown in this phase was counterbalanced across subjects.
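    The study-phase procedure described above can be sketched as a small scheduling routine. This is an illustration only: the text does not say how the counterbalanced subsets were formed, so the split into two fixed halves, the function name and the seeding scheme are all assumptions.

```python
import random


def study_phase_schedule(subject_index, n_clips=20, n_shown=10, seed=0):
    """Return the clips one subject sees and the two presentation orders.

    Assumption (not stated in the paper): clips are split into two fixed
    halves and alternate subjects see alternate halves, so that each clip
    serves as 'seen' and 'unseen' equally often across subjects.
    """
    clips = list(range(n_clips))
    half = subject_index % 2
    shown = clips[half * n_shown:(half + 1) * n_shown]

    rng = random.Random(seed + subject_index)
    first_pass = rng.sample(shown, len(shown))
    second_pass = rng.sample(shown, len(shown))
    # The clips are shown twice, each time in a different random order.
    while second_pass == first_pass:
        second_pass = rng.sample(shown, len(shown))
    return shown, first_pass, second_pass


shown, first_pass, second_pass = study_phase_schedule(subject_index=0)
```

    Under this assumed scheme, the ten clips not shown to a subject act as the "unseen" items in that subject's test phase.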

    There followed a test phase in which subjects were shown each of the 20 high quality images, one at a time. They were told that they would be shown 20 faces, and that half of these people had been present in the videos. They were asked to assign a rating of 1-7 to each of these photos. A score of 7 indicates that subjects are sure that the person appeared in the video, a score of 1 indicates that the person definitely did not appear in the video.

Results and Discussion

Figure 2: Accuracy of identification

    Figure 2 shows the mean recognition scores given to seen and unseen targets. The results show that subjects in the familiar group perform well, assigning high scores to seen targets and low scores to unseen targets. Subjects in the other two groups perform less well, discriminating less sharply between seen and unseen targets. Formal analysis can be summarised as follows: i) all groups score seen targets significantly higher than unseen targets, though the effect is much larger in the familiar group; ii) there is no difference in performance between the unfamiliar student and police groups, but both are significantly poorer than the familiar group. Full details of ANOVAs are available from the authors. In summary, a 2-way ANOVA shows no main effect of subject group (F(2,57) < 1), a significant effect of seen/unseen target (F(1,57) = 324, p < 0.001) and a highly significant interaction (F(2,57) = 92, p < 0.001).
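    The quantity underlying this comparison is each subject's mean rating for seen targets against their mean rating for unseen targets. A minimal sketch of that per-subject summary follows; the function name and the example ratings are invented for illustration and are not data from the experiment.

```python
from statistics import mean


def seen_unseen_summary(ratings, seen_ids):
    """Summarise one subject's 1-7 ratings.

    ratings  : dict mapping photo id -> rating (1 = definitely not seen,
               7 = sure the person appeared in the video)
    seen_ids : set of photo ids whose video clip was shown in phase one
    Returns (mean seen rating, mean unseen rating, difference).
    """
    seen = [r for pid, r in ratings.items() if pid in seen_ids]
    unseen = [r for pid, r in ratings.items() if pid not in seen_ids]
    return mean(seen), mean(unseen), mean(seen) - mean(unseen)


# Hypothetical ratings for four photos, two of which were 'seen'.
example = {0: 7, 1: 6, 2: 1, 3: 2}
print(seen_unseen_summary(example, seen_ids={0, 1}))  # (6.5, 1.5, 5.0)
```

    A large difference indicates good discrimination. The group by seen/unseen interaction reported above corresponds to this difference being larger in the familiar group than in the two unfamiliar groups.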

    These data show a very marked benefit for people personally familiar with the targets. The use of the ends of the rating scale was common in the familiar group, and subjects were very accurate indeed in making the seen/unseen decisions. Subjects unfamiliar with the targets performed very poorly, regardless of whether they were students or police officers. Although there were reliable differences on the judgements between targets which had been present and those not present, differences were comparatively small. These results seem particularly important for the issue of security surveillance. If images of this quality are to be used as legal evidence, it is important to demonstrate the basis of recognition judgements. From this study, it seems that only personal familiarity will provide a good basis for accuracy of judgements.

    What is the basis for the high scores of the familiar subjects? People familiar with the targets may be recognising a number of characteristics. For example, it may be that subjects are recognising the clothes of their lecturers, or their body shape or gait. It seems reasonable to propose that subjects will use any cue available in order to make the identification. The very low resolution of the information carried in the face (very few scan lines on the video) leads to the hypothesis that it is not faces which subjects are recognising in this study, but whole bodies, and we examine this further in Experiment 2.

Experiment 2: the basis of the familiarity advantage

    In this study clips from the same video security device were used. However, only subjects familiar with the targets were recruited. In order to examine the basis for the familiarity advantage, we selectively disrupted aspects of the video by obscuring the head, body or gait of the targets in videos. As this experiment uses only familiar subjects, a simple identification task is used, rather than the recognition memory task used in Experiment 1.

Method

    Video clips were taken of 15 target people. Ten of these people were lecturing staff (6 male, 4 female) who would be familiar to all subjects. The remaining five people were visitors (3 male, 2 female) who would not be familiar to subjects. In contrast to Experiment 1, video clips were not taken from naturally occurring incidents on the surveillance video, but target people were asked to walk into the building on a prescribed route through the door and towards the camera until they passed out of its range. All clips were gathered on the same day. All clips were edited to last for 3 seconds. Figure 3a shows a still from one of these videos.

Figure 3: Stills from video sequences used in Experiment 2. (3a) Unedited; (3b) body concealed; (3c) face concealed.

    Copies of the resulting 15 video clips were edited, using digital video editing equipment, in each of the following ways:

Body obscured: A black rectangle was positioned over the body, scaled to fit the body but not to obscure the background or the head of the person. The rectangle tracked the person through the video sequence, changing shape as necessary (i.e. growing as the target approaches the camera). A still from this sequence is shown in Figure 3b.

Head obscured: A black rectangle was positioned over the head, scaled to obscure the head but not the body of the person. Again, this rectangle tracked the head through the sequence, changing size as necessary. A still from this sequence is shown in Figure 3c.

Gait obscured: To disrupt gait information, the video frames were sampled at 7 equal intervals through the three second period. Instead of showing all frames (and hence continual motion) only seven still frames were shown, each for an equal period, and summing to 3 seconds. This manipulation destroys the apparent motion of the video. The viewer sees 7 snapshots rather than a moving display, and this makes it very difficult to perceive the gait of the target.
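    The gait manipulation amounts to choosing seven frames at equal intervals and holding each for one seventh of the clip. A minimal sketch follows, assuming a 75-frame clip (3 seconds at the nominal PAL rate of 25 frames per second) and sampling that starts at the first frame; neither detail is given in the text, and the time-lapse recorder may have stored fewer frames.

```python
def gait_obscured_schedule(n_frames=75, n_stills=7, duration_s=3.0):
    """Return a list of (frame_index, hold_seconds) pairs.

    n_frames=75 assumes 25 frames per second over 3 seconds (an assumption).
    Frames are taken at equal intervals starting from frame 0, and every
    still is held for the same period so the total matches the clip length.
    """
    step = n_frames / n_stills
    hold = duration_s / n_stills
    return [(int(i * step), hold) for i in range(n_stills)]


for frame_index, hold in gait_obscured_schedule():
    print(f"show frame {frame_index:2d} for {hold:.3f} s")
```

    With these assumed values each still is held for roughly 0.43 seconds, which is long enough that successive frames are seen as separate snapshots rather than as continuous motion.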

    The editing procedure resulted in 60 different clips, 15 people x 4 conditions (body obscured, head obscured, gait obscured, unedited). Five different stimulus tapes were prepared in the following way. On each tape, the first 45 clips showed a randomly-ordered sequence of the 45 edited clips (i.e. all clips except the original unedited version). The 15 unedited clips then appeared in a randomly ordered sequence. So, the edited clips were not presented in blocks, condition by condition, but in mixed order. However, these all preceded the unedited clips.
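    The tape layout can be written down compactly. The sketch below builds five tapes under the constraints just described: 45 edited clips in mixed random order, followed by the 15 unedited clips in random order. The function name, the condition labels as strings and the use of seeds are illustrative assumptions.

```python
import random

EDITED_CONDITIONS = ["body obscured", "head obscured", "gait obscured"]


def build_tape(n_people=15, seed=0):
    """Return one tape as an ordered list of (person_id, condition)."""
    rng = random.Random(seed)
    edited = [(p, c) for p in range(n_people) for c in EDITED_CONDITIONS]
    unedited = [(p, "unedited") for p in range(n_people)]
    rng.shuffle(edited)    # edited conditions are mixed, not blocked
    rng.shuffle(unedited)  # unedited clips always come last
    return edited + unedited


# Five tapes, each with a different random order.
tapes = [build_tape(seed=s) for s in range(5)]
```

    Placing the unedited clips last means that no target is ever seen for the first time in the unedited condition.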

    Twenty-five volunteer subjects were recruited from students studying in the Department of Psychology, University of Glasgow. None had taken part in Experiment 1. Subjects were asked to identify the person in each of the 60 clips in turn. The five different tapes (containing different random orders) were counterbalanced. Subjects were tested individually. They were told that they would see a series of videos and that some would contain people familiar to them. After each 3 second clip the experimenter asked whether they recognised the person in the clip, and if so, to identify them by name or other distinguishing information. There was no time limit for responses, and subjects were told that they should concentrate on the accuracy of their judgements.

Results

    Overall accuracy was high. Averaging over all stimuli, subjects correctly identified 73% of the familiar targets, and correctly rejected 92% of the unfamiliar stimuli.

    Responses to familiar stimuli were analysed in two ways. First, data were analysed as though all four conditions were presented in random order, taking subjects' average accuracy score in each of the four conditions (body obscured, head obscured, gait obscured, unedited). However, there are two potential problems here. First, the unedited condition was not presented in random order, but always last. Therefore recognition rates may be artificially high, due to subjects having become familiar with the stimuli through exposure to the edited conditions. Second, and potentially more serious, recognition in any condition could be affected by prior exposure to a target person in a different condition. So, subjects may recognise a person in the "head obscured" condition because they have recently seen that person in the "gait obscured" condition. For this reason the data were also analysed for "first view" of each target person only. In this analysis, subjects contribute only 10 data points, one for each familiar target person. Only the response made in the condition in which each person was first seen enters the analysis. This gives a second measure of accuracy.
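    The "first view" measure can be made concrete with a short sketch. For one subject it keeps only the first trial on which each familiar target appears and reports the proportion of hits per condition. The trial format, the function name and the example responses are hypothetical and are not data from the experiment.

```python
from collections import defaultdict


def first_view_hit_rates(trials, familiar_ids):
    """Proportion of hits per condition, counting first views only.

    trials       : list of (person_id, condition, hit) in presentation order
    familiar_ids : set of person ids known to the subject
    """
    already_seen = set()
    counts = defaultdict(lambda: [0, 0])  # condition -> [hits, first views]
    for person, condition, hit in trials:
        if person not in familiar_ids or person in already_seen:
            continue  # skip unfamiliar people and repeat exposures
        already_seen.add(person)
        counts[condition][0] += int(hit)
        counts[condition][1] += 1
    return {cond: hits / n for cond, (hits, n) in counts.items()}


# Hypothetical trial sequence for one subject (person 9 is unfamiliar).
trials = [
    (1, "head obscured", False),
    (2, "gait obscured", True),
    (1, "gait obscured", True),   # repeat exposure, ignored
    (3, "body obscured", True),
    (9, "head obscured", False),  # unfamiliar, ignored
]
rates = first_view_hit_rates(trials, familiar_ids={1, 2, 3})
```

    Because the number of first views per condition differs from tape to tape, expressing the result as a proportion rather than a raw count keeps conditions comparable.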

    Figure 4 shows analysis of all the data by condition. "Hit" refers to a correct identification of a target familiar to the subject, "miss" refers to a failure to recognise a familiar person as familiar, and "incorrect" refers to an error in which a familiar person is identified as familiar, but mistaken for another familiar person.

    Details of all statistical analyses are available from the authors, but can be summarised as follows. Analysis of the "hit" scores revealed a highly significant effect of condition (F(3,72) = 233; p < 0.001). Tukey HSD (honestly significant difference) tests showed that the unedited condition produced significantly more correct identifications than any other condition, the gait disguised and body concealed conditions did not differ, but both produced reliably more identifications than the face concealed condition. Analysis of the "miss" scores shows the identical pattern of results. Finally, "incorrect" errors were very infrequent, and were not analysed further.

    Figure 5 shows mean identification scores expressed only for the first time each item is encountered. Each of the different orders (tapes) presented to subjects differed slightly in the number of targets appearing for the first time in each category, and so data are expressed as proportions. Note that the targets in the unedited condition were always shown last, and so do not appear in Figure 5. Analysis of the "hit" scores reveals a highly significant effect of condition (F(2,48) = 107; p < 0.001). Tukey HSD tests revealed that the face concealed condition gave rise to significantly fewer hits than either of the other conditions and that the body concealed and gait disguised conditions did not differ significantly. ANOVA on the "miss" scores showed the same pattern of poorer performance for the face concealed condition.

Discussion

    The data from this experiment strongly suggest that subjects are using information from the face to identify people in these videos. There is a small (but reliable) reduction in accuracy when a person's gait or body is concealed. It is evident from Figure 4 that the "face obscured" condition is much worse than all others. This is most apparent in Figure 5, showing that when these images are seen for the first time, people are extremely poor at recognising them. It is in this condition that subjects have to rely on information from body shape, gait and knowledge of the people's clothes. However, it seems that they are unable to make good use of these cues to identify the target people.

General Discussion

    The pattern of data described here can be summarised as follows. When viewing poor quality videos, people are very good at recognising familiar targets, and very poor at recognising unfamiliar targets. The advantage given by familiarity appears to be largely due to recognition of the face itself, rather than recognition of other cues such as gait, body shape or clothing.

    These results have a number of important implications, both for theoretical and applied research in face recognition. Psychologists concerned with familiar face recognition have routinely sought to discover the building blocks of the recognition process. Faces can be parameterised in a number of different ways. For example, some researchers trying to automate the process have tried to characterise faces by a list of 2D distances in the picture plane, and relations between such measures (e.g. Sakai, Nagao and Kanade, 1972; Kanade, 1977; Burton, Bruce and Dench, 1993). More recently, others have used image-based tools relying on patterns of light and dark across the whole image (Turk and Pentland, 1991; Kirby and Sirovich, 1990; Burton, Bruce & Hancock, in press). It seems from these results that facial identities are available at relatively low resolutions, and this is consistent with previous research on the spatial scale at which information about identity is available (Bachmann, 1991; Harmon and Julesz, 1973). However, the fact that videos are seen as sequences of frames provides much more information than any individual frame at this resolution. These issues of resolution are likely to be important to theoretical developments in face recognition.

    The implications for forensic practice are also very important. In particular, it seems that identification of these types of video sequences is very unreliable, unless the viewer happens to know the target person. There have been some other recent findings which suggest that unfamiliar face matching is difficult, even in the context of high quality images. For example, Kemp, Towell and Pike (1997) studied the ability of supermarket cashiers to verify the identity of shoppers from a small (2cm square) photograph printed onto a credit card. Kemp et al found a high error rate in this setting. Cashiers correctly detected fraudulent identity cards on only 36% of trials when foils were chosen to resemble the card-bearers. Even when foils bore no particular resemblance to the bearer, detection of frauds was only 66%.

    Some recent work in our own laboratory underlines the difficulty of unfamiliar face matching (Bruce, Henderson, Greenwood, Hancock, Burton and Miller, in press). In a series of three experiments we showed subjects pictures of unfamiliar targets taken on very high quality video, and asked them to pick out the same person from an array of high quality photographs. The video and photograph of the targets had been taken in good lighting conditions and on the same day, and so superficial aspects of the faces (hairstyle, weight etc) remained constant. Even in these apparently very favourable conditions, there was a high error rate. Using stills taken from the videos, errors were highest when there was a pose difference between the target video face and the photo arrays. However, even in a 10AFC condition, with no time pressure, simultaneous presentation of target and array, and unaltered pose, errors in the order of 25% were observed. Finally, we tested subjects' ability to match moving high quality video clips with an item from a simultaneously presented array of photographs. Once again, errors were unexpectedly high, in the order of 30%. These results, coupled with the results from the present paper, suggest that face recognition for unfamiliar people is dominated by pictorial codes, capturing image-specific details. Recognition of familiar people, on the other hand, is much more flexible, and appears to be mediated by more abstract representations, capable of generalisation over significant changes in image properties.

    There are several issues which need to be resolved as a result of this work. First, it will be important to establish exactly the range of video material over which results such as these hold. The particular security system used here was only one example of a commercially available system, and it may be that systems with better image quality support better identification by unfamiliar viewers (though the study by Bruce et al, in press, suggests that improved quality will never eliminate completely the disadvantages observed for unfamiliar viewers). Furthermore, the particular setting of this experiment gives considerable contextual help to viewers familiar with the targets. All subjects familiar with targets in these experiments knew that the setting was the Psychology Department in their University, and that the people they were likely to see would be local academics. The help given by context and expectation needs to be quantified. For example, we do not yet know whether subjects would recognise a famous TV personality, should one happen to have passed unexpectedly through this video context. Similarly, it is not clear how accurate they would have been in recognising their lecturers if they had been presented in an unexpected context, such as a security recording of a crime. These are empirical questions, and it seems that there is a need for full exploration of the various parameters in order to guide good practice in the security industry. Second, these results show rather poor recognition of moving bodies, even by those subjects personally familiar with the target people. Again, this finding needs to be explored further. It seems intuitively reasonable to suppose that we do use gait and body-shape information to discriminate amongst people, but this intuition is not supported in the data.

    Finally, those relying on video security surveillance systems need to examine the potential of biometric procedures for identification. In the particular case of poor quality video and unfamiliar viewers, one needs to establish a procedure for automatically deriving matches between targets and suspects. This will be a particularly difficult job. In the case of familiar face recognition, there are no existing systems which can out-perform human recognition. However, in the case of unfamiliar face recognition, it is clear that automatic procedures are needed which out-perform human abilities by a very large margin. Automatic recognition systems which have been developed and tested against one another do generally show good performance, routinely achieving over 90% accuracy in standardised tests (e.g. Phillips, Moon, Rauss and Rizvi, 1997). However, all these systems use high quality images on which to perform their analysis. The challenge for the next generation of automatic face recognition devices is to out-perform human levels of performance matching unfamiliar faces in low quality images.

Acknowledgement

This work was funded by a research project from the ESRC (ref R000236688) to Vicki Bruce and Mike Burton.
 

References

    Bachmann, T. (1991). Identification of spatially quantised tachistoscopic images of faces: How many pixels does it take to carry identity. European Journal of Cognitive Psychology, 3, 87-103.
    Brown, E., Deffenbacher, K. & Sturgill, W. (1977). Memory for faces and the circumstances of encounter. Journal of Applied Psychology, 62, 311-318.
    Bruce, V. (1982). Changing faces: Visual and non-visual coding processes in face recognition. British Journal of Psychology, 73, 105-116.
    Bruce, V. (1988). Recognising Faces. London: Lawrence Erlbaum Associates.
    Bruce, V., Henderson, Z., Greenwood, K., Hancock, P.J.B., Burton, A.M. & Miller, P. (in press). Verification of face identities from images captured on video. Journal of Experimental Psychology: Applied.
    Bruce, V. & Young, A. (1986). Understanding face recognition. British Journal of Psychology, 77, 305-327.
    Bruce, V. & Humphreys, G.W. (1994). Recognizing objects and faces. Visual Cognition, 1, 141-180.
    Burton, A.M., Bruce, V. & Johnston, R.A. (1990). Understanding face recognition with an interactive activation model. British Journal of Psychology, 81, 361-380.
    Burton, A.M., Bruce, V. & Dench, N. (1993). What's the difference between men and women? Evidence from facial measurement. Perception, 22, 153-176.
    Burton, A.M., Bruce, V. & Hancock, P.J.B. (in press). From pixels to people: a model of familiar face recognition. Cognitive Science.
    Burton, A.M., Young, A.W., Bruce, V., Johnston, R.A., & Ellis, A.W. (1991). Understanding covert recognition. Cognition, 39, 129-166.
    Clifford, B.R. & Bull, R. (1978). The Psychology of Person Identification. London: Routledge & Kegan Paul.
    Ellis, H.D. (1975). Recognising faces. British Journal of Psychology, 66, 409-426.
    Harmon, L.D. & Julesz, B. (1973). Masking in visual recognition: Effects of two-dimensional filtered noise. Science, 180, 1194-1197.
    Kanade, T. (1977). Computer recognition of human faces. Basel: Birkhauser Verlag.
    Kemp, R., Towell, N. & Pike, G. (1997). When seeing should not be believing: Photographs, credit cards and fraud. Applied Cognitive Psychology, 11, 211-222.
    Kirby, M. & Sirovich, L. (1990). Application of the Karhunen-Loeve procedure for the characterisation of human faces. IEEE Transactions on Pattern Analysis and Machine Intelligence, 12, 103-108.
    Laughery, K.R., Alexander, J.F. & Lane, A.B. (1971). Recognition of human faces: Effects of target exposure time, target position, pose position and type of photograph. Journal of Applied Psychology, 55, 477-483.
    Phillips, P.J., Moon, H., Rauss, P. & Rizvi, S. (1997). The FERET evaluation methodology for face recognition algorithms. Proceedings of Computer Vision and Pattern Recognition 97, pp. 137-143. Los Alamitos, Ca.: IEEE.
    Sakai, T., Nagao, M. & Kanade, T. (1972). Computer analysis and classification of photographs of human faces. Proceedings of the first USA-Japan Computer Conference, 55-62.
    Shepherd, J.W., Ellis, H.D. & Davies, G.M. (1982). Identification evidence: A psychological evaluation. Aberdeen: University of Aberdeen Press.
    Turk, M. & Pentland, A. (1991). Eigenfaces for recognition. Journal of Cognitive Neuroscience, 3, 71-86.
    Yarmey, A.D. (1979). The psychology of eye-witness testimony. New York: The Free Press.