Speech recognition systems have opened the way towards intuitive and natural human-computer interaction (HCI). However, current HCI systems that use speech recognition require the user to explicitly indicate an intent to speak by turning on a microphone with the keyboard or mouse.
A key aspect of natural speech communication is the human ability to detect an intent to speak. Humans detect this intent through a combination of visual and auditory cues. Visual cues include physical proximity, eye contact, and lip movement.
Automatic detection of speech onset for open-microphone solutions can be carried out using silence/speech detection. However, purely audio-based techniques are sensitive to background noise. We are exploring the combination of visual and audio cues to provide robust indicators of speech intent and speech onset/offset.
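To make the audio-only baseline concrete, the following is a minimal sketch of energy-based silence/speech detection. It is an illustration under stated assumptions, not the detector used in this project: the function names, the frame length of 160 samples, the -30 dB threshold, and the hangover count are all hypothetical choices.

```python
import math


def frame_energies_db(samples, frame_len=160):
    """Return the log energy (in dB) of each non-overlapping frame.

    frame_len=160 is an assumed value (10 ms at a 16 kHz sampling rate).
    Trailing samples that do not fill a whole frame are ignored.
    """
    energies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        mean_sq = sum(x * x for x in frame) / frame_len
        # Small constant avoids log10(0) on digital silence.
        energies.append(10.0 * math.log10(mean_sq + 1e-12))
    return energies


def detect_speech(energies_db, threshold_db=-30.0, hangover=3):
    """Label each frame as speech (True) or silence (False).

    A frame above threshold_db is speech. After the energy drops,
    the next `hangover` frames are still labeled speech, which keeps
    short pauses from ending an utterance. Both values are assumptions.
    """
    labels = []
    remaining = 0
    for energy in energies_db:
        if energy > threshold_db:
            remaining = hangover
            labels.append(True)
        elif remaining > 0:
            remaining -= 1
            labels.append(True)
        else:
            labels.append(False)
    return labels
```

Because the decision depends only on frame energy, any sufficiently loud background sound crosses the threshold as well, which is the noise sensitivity described above.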
Our current approach uses the following visual cues: user proximity to the computer, frontality of pose, and visual speech activity. These cues will be combined with audio cues based on speech/silence detection.
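One simple way to picture how such cues could be combined is a weighted score compared against a threshold. The sketch below is only an assumed illustration, not the fusion method used in this work: the `Cues` class, the weights, the 0.6 threshold, and the convention that each cue is a confidence in [0, 1] are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Cues:
    """Per-frame cue confidences, each assumed to lie in [0, 1]."""
    proximity: float      # user is close to the computer
    frontality: float     # user is facing the screen
    visual_speech: float  # lip movement suggests speech
    audio_speech: float   # audio speech/silence detector output


def intent_score(cues, weights=(0.2, 0.3, 0.25, 0.25)):
    """Weighted sum of the four cues. The weights are assumed values."""
    values = (cues.proximity, cues.frontality,
              cues.visual_speech, cues.audio_speech)
    return sum(w * v for w, v in zip(weights, values))


def microphone_open(cues, threshold=0.6):
    """Open the microphone when the fused score reaches the threshold."""
    return intent_score(cues) >= threshold


# Loud background noise while the user looks away: audio alone would
# trigger, but the weak pose and lip cues keep the microphone closed.
noisy_room = Cues(proximity=0.9, frontality=0.1,
                  visual_speech=0.0, audio_speech=0.9)

# User facing the screen and speaking: all cues agree.
speaking = Cues(proximity=0.9, frontality=0.9,
                visual_speech=0.8, audio_speech=0.9)
```

Under these assumed weights, `microphone_open(noisy_room)` is false and `microphone_open(speaking)` is true, which illustrates how visual cues can veto an audio detection caused by background noise.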
Research Areas:
- Robust frontal pose detection
- Visual speech onset estimation
- Fusion of audio-based speech onset cues with visual cues
Papers:
- P. de Cuetos, C. Neti, A.W. Senior. Audio-visual intent-to-speak detection for human-computer interaction, ICASSP June 5-9 2000, pp. 1325-1328, Istanbul, Turkey.
- G. Iyengar, C. Neti. A vision-based microphone switch for speech intent detection, July 2001, Vancouver, Canada.
Demos:
- Audio Visual Speech Intent Detection System in Action: This demo shows an initial version of a speech recognition system that uses visual and audio information to better understand the user's intent to communicate.
- Video Demo of Vision-Based Microphone Switch
