Humans identify speakers from a variety of attributes, including acoustic cues, visual appearance cues, and behavioral characteristics (such as characteristic gestures and lip movements). In the past, machine implementations of person identification have focused on single techniques relying on audio cues alone (speaker recognition), visual cues alone (face identification, iris identification), or other biometrics. More recently, researchers have begun to combine multiple modalities for person identification. Speaker identification is an important technology for a variety of applications, including security and, more recently, indexing for search and retrieval of digitized multimedia content (for instance, in the MPEG-7 standard). The accuracy of audio-based speaker recognition under acoustically degraded conditions (such as background noise) and channel mismatch (for example, telephone channels) still needs further improvement, and improving it under such conditions is a hard problem. We have begun to investigate combining audio-based processing with visual processing for speaker recognition, with the aim of improving accuracy under acoustically degraded conditions in the broadcast news domain.
Key component technologies:
- Face detection, tracking and identification
- Audio-based speaker identification
- Fusion techniques
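As a rough illustration of how the audio and visual components might be combined, the sketch below shows a generic weighted-sum (late) fusion of per-speaker scores. It is not the method of the papers listed below; the function names, the score dictionaries, and the fixed `audio_weight` are all hypothetical, and a real system would need to normalize the scores of each modality and might vary the weight over time according to the estimated reliability of each stream.

```python
def fuse_scores(audio_scores, visual_scores, audio_weight=0.5):
    """Weighted-sum late fusion of per-speaker scores from two modalities.

    audio_scores / visual_scores: dicts mapping speaker ID -> score
    (hypothetical; assumed already normalized to comparable ranges).
    audio_weight: trust placed in the audio stream, between 0 and 1.
    """
    if not 0.0 <= audio_weight <= 1.0:
        raise ValueError("audio_weight must be in [0, 1]")
    # Only speakers scored by both modalities can be fused.
    speakers = audio_scores.keys() & visual_scores.keys()
    return {
        s: audio_weight * audio_scores[s]
        + (1.0 - audio_weight) * visual_scores[s]
        for s in speakers
    }


def identify(audio_scores, visual_scores, audio_weight=0.5):
    """Return the speaker ID with the highest fused score."""
    fused = fuse_scores(audio_scores, visual_scores, audio_weight)
    return max(fused, key=fused.get)


# Made-up scores for two hypothetical speakers.
audio = {"anchor_a": -2.0, "anchor_b": -1.5}
visual = {"anchor_a": -0.5, "anchor_b": -3.0}

# Clean audio: rely on the audio stream alone.
print(identify(audio, visual, audio_weight=1.0))
# Noisy audio: shift most of the weight to the visual stream.
print(identify(audio, visual, audio_weight=0.2))
```

Lowering `audio_weight` when the audio is degraded lets the visual score dominate the decision, which is the intuition behind combining the two modalities under noise or channel mismatch.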
Papers:
- U.V. Chaudhari, G.N. Ramaswamy, G. Potamianos, and C. Neti, Information fusion and decision cascading for audio-visual speaker recognition based on time varying stream reliability prediction, Proc. Int. Conf. Multimedia Expo., vol. III, pp. 9-12, Baltimore, July 2003.
- U.V. Chaudhari, G.N. Ramaswamy, G. Potamianos, and C. Neti, Audio-visual speaker recognition using time-varying stream reliability prediction, Proc. Int. Conf. Acoust. Speech Signal Process., vol. V, pp. 712-715, Hong Kong, Apr. 2003.
- Benoit Maison, Chalapathy Neti, Andrew Senior. Audio-visual speaker recognition for video broadcast news: some fusion techniques, IEEE Multimedia Signal Processing (MMSP99), Denmark, Sept, 1999.
- Andrew Senior, Chalapathy Neti, Benoit Maison. On the use of visual information for improving audio-based speaker recognition, Audio-Visual Speech Processing Conference (AVSP99), Santa Cruz, CA, Aug, 1999.
- Chalapathy Neti, Andrew Senior. Audio-visual speaker recognition for video broadcast news, DARPA HUB4 Workshop, Washington D.C., March 1999.
- A. Senior. Recognizing faces in broadcast video. IEEE International Workshop on Recognition, Analysis, and Tracking of Faces and Gestures in Real-Time Systems, ICCV 1999.
