Ongoing Projects

Gerasimos Potamianos has joined the Institute of Informatics and Telecommunications at the National Center for Scientific Research (NCSR), "Demokritos", in Athens, Greece, as a Research Director.

He can be contacted at gpotam@ieee.org.

Updated information can be found at http://www.iit.demokritos.gr/~gpotam

This web page is no longer being maintained.




Gerasimos (Makis) Potamianos

Manager, Multimodal Conversational Solutions Department
Human Language Technologies / Multilingual Analytics and User Technologies, IBM T.J. Watson Research Center

Makis manages the Multimodal Conversational Solutions Department at the IBM Thomas J. Watson Research Center. In this role, he leads the definition and execution of the IBM research agenda for multimodal conversational technologies, which involves developing multiple input modalities (speech, video, etc.), natural language processing, and dialog management, and integrating them into conversational platforms and solutions.

A main focus of the group is developing conversational interaction solutions on embedded devices using natural language understanding and dialog management. The goal is to support free-form speech input, advancing beyond traditional command and control by allowing recognition of multiple tokens in one turn, and to implement sophisticated dialog management for resolving ambiguities and recovering from low-confidence input. The resulting systems are expected to run on commercially available embedded platforms. Part of this work is conducted within the DICIT project, which focuses on multimodal control (far-field voice and traditional remote control) of interactive television. Additional work is devoted to developing navigation and music selection systems that run on automotive embedded platforms.

In addition to his management responsibilities, Makis is currently involved in research activities that span the areas of multimodal speech processing for human-computer interaction and ambient intelligence. Work concentrates on audio-visual speech processing, automatic speech recognition, multimedia signal processing and fusion, as well as computer vision for human detection and tracking. These activities are briefly described next.

Makis' earlier research work (1990-1996) on language modeling and image analysis (while at the Johns Hopkins University) is discussed briefly at the end.

Technology Development for Ambient Intelligence as Part of the CHIL and NETCARITY Projects

Recent efforts of the group have been steered towards extending research expertise and assets developed as part of audio-visual speech research (discussed later) and monomodal technologies (speech, vision) to the problem of technology development for ambient intelligence environments. These efforts were initiated with IBM participation in the now completed European Union funded project CHIL (2004-2007), and continue in the recently commenced NETCARITY and DICIT efforts.

CHIL, "Computers in the Human Interaction Loop", is a recently completed integrated project within the FP6 European Union framework programme, with the participation of 15 partner sites from nine countries, under the joint coordination of the Fraunhofer Institut für Informations- und Datenverarbeitung (IITB) and the Interactive Systems Labs (ISL) of the University of Karlsruhe (UKA), Germany. CHIL focused on analyzing, understanding, and facilitating human interaction during lectures and meetings inside smart rooms equipped with multiple far-field audio and visual sensors. Using these sensors, the CHIL consortium effort concentrated on detecting, classifying, and understanding human activity in the space, addressing the basic questions of the "who", "where", "what", "when", and "how" of the interaction. In the CHIL vision, computers fade into the background, reduced to discreet observers of human activity, ready to provide services proactively and implicitly supporting the meeting participants.

The IBM effort in this project has been led by Dr. Jan Kleindienst, head of a team located at IBM Czech Republic in Prague, which is part of the Human Language Technology Department of IBM Research. The main emphasis of that work is on the CHIL architecture and user modeling. In particular, the IBM team has developed tools for facilitating communication between CHIL perception technology components and CHIL services, as well as for modeling the lecture/meeting "situation".

These activities are being complemented by research work led by Makis at the IBM T. J. Watson Research Center. In particular, Makis has been working on and overseeing efforts that focus on the following:

  • Speech technologies using far-field sensors in the CHIL scenarios of interest, i.e., seminars (lectures/meetings) inside smart rooms. In particular, effort has been concentrating on automatic speech recognition (ASR), speech activity detection (SAD), and speaker diarization (SPKR) technologies, focusing on acoustic and language modeling, as well as multi-channel processing. The developed technologies have been evaluated in the Rich Transcription (RT) evaluation campaigns, overseen by the National Institute of Standards and Technology (NIST) during the springs of 2006 and 2007. Multimodal (audio-visual) speech technologies are also being developed in the CHIL scenarios, as described later (see Audio-Visual Speech Technologies). Relevant to this work, Makis has been leading workpackage five (WP5) for the CHIL consortium.
    speech detector training

    Training schematic of the IBM statistical system for far-field speech activity detection in the CHIL smart room. Two types of features, energy-based and acoustic likelihood-based, are fused and used in a full-covariance Gaussian mixture classifier [6].

  • Computer vision technology for tracking the lecture/meeting participants using multiple (four to five) fixed cameras with overlapping views. Work has been focusing on both 3D head tracking in the 3D space and 2D face tracking in the available 2D camera views. Multiple initialization and tracking algorithms have been developed, including a novel extension of mean shift tracking to 3D, adaptive subspace tracking with a "forgetting mechanism", and a variant of the IBM smart surveillance engine. The developed technologies have been evaluated in the evaluation campaign for the Classification of Events, Activities and Relationships (CLEAR).
    tracking algorithm examples

    Three steps in a motion estimation based initialization system assisted by face detection and the mean shift algorithm, for tracking the presenter in CHIL smart room lectures, using two selected cameras [11].

  • Development of a smart room for data collection and demos in support of the CHIL activities at the IBM T. J. Watson Research Center. The smart room is a regular conference room at IBM that has been retrofitted with numerous audio-visual sensors connected to multiple computers. In particular, the smart room infrastructure consists of nine cameras, 152 microphones, and seven computers running Linux, interconnected via 1 Gb Ethernet, as well as dedicated audio and video data links to the sensors. Among the cameras, five are fixed (Firewire): four are located at the room corners, and one wide-lens panoramic camera is mounted on the ceiling pointing down. The other four cameras are pan-tilt-zoom (analog), focusing on areas of interest (lecturer, meeting participants, room door, etc.). The fixed cameras have highly overlapping fields of view, allowing detection and tracking in the 3D space. In the acoustic domain, there are two 64-channel linear microphone arrays, four 4-channel T-shaped microphone arrays, three table-top microphones, and five close-talking headsets (one wireless). These sensors allow both far-field and near-field speech processing, beamforming, and source localization. The IBM team has recorded ten interactive seminars (meetings) in this smart room in support of the 2006 and 2007 RT and CLEAR evaluation campaigns.
    IBM smart room diagram

    IBM smart room camera views

    IBM smart room schematic diagram (upper) and example camera views of a recorded meeting (lower).
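The mean shift idea mentioned in the computer vision item above can be illustrated with a minimal sketch. This is the generic mode-seeking procedure with a Gaussian kernel applied to a set of 3D points, not the IBM 3D tracking extension itself, whose details are not given here. The function name, the bandwidth value, and the toy data are assumptions made for illustration only.

```python
import math


def mean_shift_mode(points, start, bandwidth=1.0, tol=1e-6, max_iter=100):
    """Move `start` uphill to a local density mode of `points` (any dimension)."""
    x = list(start)
    for _ in range(max_iter):
        # Gaussian kernel weight of every sample relative to the current estimate.
        weights = [
            math.exp(-sum((p_d - x_d) ** 2 for p_d, x_d in zip(p, x))
                     / (2.0 * bandwidth ** 2))
            for p in points
        ]
        total = sum(weights)
        if total == 0.0:
            break  # no sample within numerical reach of the kernel
        # The mean shift update is the kernel-weighted mean of the samples.
        new_x = [
            sum(w * p[d] for w, p in zip(weights, points)) / total
            for d in range(len(x))
        ]
        shift = math.sqrt(sum((a - b) ** 2 for a, b in zip(new_x, x)))
        x = new_x
        if shift < tol:
            break
    return x


if __name__ == "__main__":
    offsets = (-0.2, 0.0, 0.2)
    # Hypothetical "head" cluster centred at (5, 5, 5), plus two distant outliers.
    cluster = [(5 + dx, 5 + dy, 5 + dz)
               for dx in offsets for dy in offsets for dz in offsets]
    outliers = [(0.0, 0.0, 0.0), (0.5, 0.0, 0.0)]
    print(mean_shift_mode(cluster + outliers, start=(4.0, 4.0, 4.0)))
```

In a tracker, the points would typically carry weights derived from image evidence, and the search would restart each frame from the previous position; both are left out of this sketch.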

An overview of the IBM activities in the CHIL project can be found in Makis' keynote talk at VisHCI'06, with more details described in relevant publications.

The above work has continued in the framework of the NETCARITY project. Most recently the IBM team led by Makis has been concentrating on the problem of acoustic scene analysis for smart home environments with emphasis on the protection and monitoring of the elderly. For example, the work aims to detect acoustic events that signify danger (falls, break-ins), as well as longer-term activities of daily living.

Audio-Visual Speech Technologies Research

Visual information is beneficial to a number of speech technologies. For example, it can dramatically improve automatic speech recognition accuracy in noise (similarly to human lipreading), help disambiguate the "who" and "when" of the active speaker in multi-party interaction (speech activity detection, speaker localization), help with person recognition (identification, verification), and improve speech "delivery" through the use of visual speech synthesis (avatars, photorealistic talking faces).

Makis has been actively involved in many of the above aspects of audio-visual speech processing over the past several years with IBM, conducting research and coordinating team efforts. His main activity focuses on audio-visual automatic speech recognition (AVASR). He has worked on all aspects of the problem, including algorithms for face detection and tracking, extraction of speech-informative visual features, integration of audio and visual features into the speech recognition process, as well as development of AVASR prototypes. In this effort, Makis has also collaborated with external customers, as well as colleagues in academia, starting with the summer 2000 workshop at the Johns Hopkins University (WS'00). The work received an IBM Research Accomplishment award in 2002, and it led to the 2006 North American Frost and Sullivan Award for Excellence in Research in the speech recognition field, awarded to the IBM Corporation. Some highlights of Makis' research work on audio-visual speech technologies include:

  • Databases: The group has collected state-of-the-art corpora that allow large-scale AVASR experiments in a variety of environments. These range from controlled settings ("studio"-like) to significantly more challenging domains, such as offices, automobiles, broadcast news, and smart rooms. In addition to these sets that contain full-face data, the team has collected mouth-only frontal data using a specially designed audio-visual headset, as well as multi-view (frontal / profile) data using multiple synchronized cameras. All corpora contain a large number of speakers, and include both continuous large-vocabulary speech and a "control" small-vocabulary (connected digits) set.
    IBM AV data

    Example frames of four IBM audio-visual datasets recorded, top-to-bottom, in a studio-like setting, offices, automobiles, and using the audio-visual IBM headset [28].

  • Visual front end: Work has been focusing on face and facial feature detection and tracking, as well as the extraction of visual features relevant to speech. For the former, AdaBoost- and GMM-based classification schemes are being used. For visual feature extraction, the focus has been on appearance-based methods that employ pattern recognition and image compression techniques, among them linear discriminant analysis, the discrete cosine transform, and mutual information approaches. In addition, comparisons with alternative visual features have been conducted, for example geometric features and features based on active appearance models. The visual front end has been extended to handle profile view data, as well as mouth-only region data provided by the IBM-developed audio-visual headset. Real-time, computationally efficient implementations of these algorithms have been achieved.

    non-frontal face and mouth detection

    Examples of detected face and mouth regions in non-frontal data collected for multi-view AVASR [52].

  • Fusion for speech recognition: Work has been focusing on HMM-based integration approaches for optimal gains in recognition performance. Techniques for feature, decision, and hybrid fusion have been developed, including asynchronous integration of the audio and visual streams. Of particular interest is ongoing research on stream reliability modeling and integration, including algorithms for training global or locally adaptive stream weights. As a result of this work, large improvements in speech recognition performance have been demonstrated, as compared to traditional audio-only ASR systems.

    AV recognition results

    Audio-visual vs. audio-only results using various fusion techniques for connected digit recognition in studio-like visual conditions and various acoustic noise levels under matched training/testing [26].

  • Prototype systems: Efficient AVASR algorithms have been implemented based on the IBM ViaVoice engine platform. The resulting AVASR prototype operates in real time, and it can accept input from various visual sensors, including a web-cam, the IBM audio-visual headset, and a specially designed camera for the automobile environment. Embedded, smaller-footprint implementations are currently under development for targeted applications.

    AV headset prototype

    The IBM audio-visual headset prototype. Both wired and wireless versions exist that provide audio-visual input to the IBM AVASR ViaVoice engine. A helmet variant of this headset has also been developed [27].

In addition to AVASR, Makis has been conducting research concentrating on the problems of audio-visual speaker recognition, speech activity detection, and speech enhancement. All share significant commonalities with the AVASR problem. An overview of Makis' work on audio-visual speech technologies can be found in his recent keynote talk at VisHCI'06. Detailed algorithms and results can be found in his publications.
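The role of stream weights in the fusion work described above can be illustrated with a small sketch. It assumes one common formulation of decision fusion, in which per-class audio and visual log-likelihoods are combined linearly with weights that sum to one. The class labels, scores, and function names are hypothetical, and the sketch does not reproduce the HMM-based systems or the weight training algorithms of the cited work.

```python
def fuse_log_likelihoods(audio_ll, visual_ll, audio_weight):
    """Linearly combine per-class log-likelihoods of the two streams.

    `audio_weight` lies in [0, 1]; the visual stream receives the remainder.
    """
    if not 0.0 <= audio_weight <= 1.0:
        raise ValueError("audio_weight must lie in [0, 1]")
    visual_weight = 1.0 - audio_weight
    return {
        label: audio_weight * audio_ll[label] + visual_weight * visual_ll[label]
        for label in audio_ll
    }


def classify(audio_ll, visual_ll, audio_weight):
    """Return the class with the highest fused score."""
    fused = fuse_log_likelihoods(audio_ll, visual_ll, audio_weight)
    return max(fused, key=fused.get)


if __name__ == "__main__":
    # Hypothetical scores: noisy audio slightly prefers "nine",
    # while the visual stream clearly prefers "five".
    audio = {"five": -10.0, "nine": -9.0}
    visual = {"five": -4.0, "nine": -8.0}
    for weight in (1.0, 0.9, 0.5):
        print(weight, classify(audio, visual, weight))
```

Lowering the audio weight as acoustic noise increases lets the visual evidence decide, which is the intuition behind modeling stream reliability with global or locally adaptive weights.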

Prior Work on Statistical Language Modeling (1994-1996)

During his graduate studies, and as part of his teaching assistant duties, Makis began interacting with faculty at the newly formed Center for Language and Speech Processing (CLSP) at the Johns Hopkins University, and became interested in the problems of acoustic and language modeling for ASR. He subsequently remained at Hopkins past the completion of the Ph.D. as a postdoctoral fellow, focusing his research efforts on the problem of statistical language modeling.

Improving the performance of available language models is essential in the quest for reliable ASR, as well as for improvements in machine translation, optical character recognition, and spelling correction. Two critical issues in language modeling are the partition of the observed "history" space into equivalence classes, and the estimation of conditional probabilities of the next word, given the observed equivalence class, based on typically sparse data. In his postdoctoral research, Makis attacked both problems: equivalence classes were determined by means of n-gram language models, decision tree classifiers, and decision trellis classifiers, whereas conditional probabilities were estimated from sparse data using variants of the classical deleted interpolation smoothing algorithm.

More specifically, Makis developed a baseline n-gram language model, employing the widely available Brown corpus. In the process, he devised an improved smoothing algorithm that significantly reduced test data perplexity over the traditional smoothing approach. Makis then studied three variants of decision tree language models for the same problem, and obtained the best results by using a K-means type clustering algorithm to design decision tree splits. The decision tree language model was further improved by a merge-split algorithm, which converted the decision tree into a decision trellis. This approach gave encouraging results on the Brown corpus. Details can be found in [73].
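As a rough illustration of the interpolation smoothing referred to above, the following sketch mixes maximum-likelihood bigram and unigram estimates with a single weight, and fits that weight by expectation-maximization on held-out text. It uses one fixed held-out split, which is the simplest case; deleted interpolation proper rotates the held-out part across the training data. The toy sentences and all names are assumptions for illustration, and this is not the improved algorithm of [73].

```python
from collections import Counter


def train(tokens):
    """Collect counts for maximum-likelihood unigram and bigram estimates."""
    return {
        "unigrams": Counter(tokens),
        "bigrams": Counter(zip(tokens, tokens[1:])),
        "histories": Counter(tokens[:-1]),  # times each word occurs as a history
        "total": len(tokens),
    }


def ml_estimates(prev, word, model):
    """Return (unigram, bigram) maximum-likelihood probabilities of `word`."""
    p_uni = model["unigrams"][word] / model["total"]
    seen = model["histories"][prev]
    p_bi = model["bigrams"][(prev, word)] / seen if seen else 0.0
    return p_uni, p_bi


def interpolated_prob(prev, word, lam, model):
    """P(word | prev) = lam * bigram estimate + (1 - lam) * unigram estimate."""
    p_uni, p_bi = ml_estimates(prev, word, model)
    return lam * p_bi + (1.0 - lam) * p_uni


def estimate_lambda(heldout, model, iterations=50):
    """Fit the mixture weight by EM on held-out text.

    Assumes every held-out word also occurs in the training text, so that
    the unigram estimate is never zero.
    """
    pairs = list(zip(heldout, heldout[1:]))
    lam = 0.5
    for _ in range(iterations):
        responsibility = 0.0
        for prev, word in pairs:
            p_uni, p_bi = ml_estimates(prev, word, model)
            mix = lam * p_bi + (1.0 - lam) * p_uni
            # Posterior probability that the bigram component produced the word.
            responsibility += lam * p_bi / mix
        lam = responsibility / len(pairs)
    return lam


if __name__ == "__main__":
    model = train("the cat sat on the mat the cat ate the fish".split())
    heldout = "the cat sat on the mat the fish ate the cat".split()
    lam = estimate_lambda(heldout, model)
    print(lam, interpolated_prob("fish", "ate", lam, model))
```

The held-out text contains a bigram ("fish ate") that never occurs in training, which is what keeps the fitted weight below one: estimating the weight on the training text itself would always favor the bigram component.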

PhD Thesis Work on Image Analysis Using Markov Random Fields (1990-1994)

Makis' doctoral thesis focused on the theory and applications of Markov random fields (MRFs) in image processing and analysis. MRFs belong to a well known exponential parametric family of random field models, and are extensively used for modeling spatial interaction phenomena in terms of a few parameters. Although conceptually simple, their probability distribution has a rather involved form, due to the intractable nature of its normalizing constant. This constant is known as the partition function and, in the general case, lacks a closed-form expression, thus hindering statistical inference (e.g., parameter estimation and hypothesis testing) of fully or partially observed MRF images. Reliable estimation of the partition function (and, hence, of likelihoods) would allow replacing various ad hoc and only moderately successful MRF parameter estimation techniques with efficient maximum likelihood parameter estimation. That would result in improved performance in problems such as image restoration, segmentation, texture modeling, and classification.

Motivated by the above facts, Makis' dissertation concentrated on the problem of efficiently estimating the partition function of MRF models. A stochastic simulation (i.e., Monte Carlo) approach was proposed for this purpose, with a number of Monte Carlo algorithms introduced and rigorously analyzed in terms of computational complexity and statistical properties of the resulting estimators. This unified analysis allowed a comparative study of the algorithms, and was backed up by extensive simulation experiments. The best such algorithm was then successfully applied to maximum likelihood statistical inference of MRF images. Original contributions of the dissertation included: (a) Development of new partition function estimation (PFE) algorithms. (b) Analysis and classification of all Monte Carlo PFE algorithms into two categories, in terms of their computational complexity. (c) Suggestion of a practical and most efficient PFE algorithm, based on the above mentioned analysis. (d) Application of this algorithm to Monte Carlo maximum likelihood based parameter estimation and hypothesis testing of fully and partially observed MRF images, as well as to image restoration. Details can be found in [74].
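To make the partition function problem concrete, the sketch below treats a small Ising-type MRF on a 3 x 3 lattice, where the partition function Z(theta) can still be computed by enumerating all 512 configurations, and compares it with a basic Monte Carlo estimate. The estimator draws configurations uniformly (the theta = 0 model, for which Z = 2^n) and uses the identity Z(theta) = 2^n E[exp(theta T(x))], with T(x) the sum of neighbor products. This is only the simplest importance-sampling scheme, chosen for illustration; it is not one of the dissertation's algorithms, and all names and parameter values are assumptions.

```python
import itertools
import math
import random


def grid_edges(rows, cols):
    """Nearest-neighbor edges of a rows x cols lattice with free boundaries."""
    edges = []
    for r in range(rows):
        for c in range(cols):
            i = r * cols + c
            if c + 1 < cols:
                edges.append((i, i + 1))
            if r + 1 < rows:
                edges.append((i, i + cols))
    return edges


def interaction(spins, edges):
    """Sufficient statistic T(x): sum of products of neighboring spins."""
    return sum(spins[i] * spins[j] for i, j in edges)


def exact_partition(theta, n, edges):
    """Brute-force Z(theta); feasible only for very small lattices."""
    return sum(
        math.exp(theta * interaction(spins, edges))
        for spins in itertools.product((-1, 1), repeat=n)
    )


def monte_carlo_partition(theta, n, edges, num_samples, rng):
    """Estimate Z(theta) as 2^n times the uniform-sample mean of exp(theta * T)."""
    total = 0.0
    for _ in range(num_samples):
        spins = [rng.choice((-1, 1)) for _ in range(n)]
        total += math.exp(theta * interaction(spins, edges))
    return (2 ** n) * total / num_samples


if __name__ == "__main__":
    edges = grid_edges(3, 3)
    exact = exact_partition(0.2, 9, edges)
    estimate = monte_carlo_partition(0.2, 9, edges, 20000, random.Random(0))
    print(exact, estimate)
```

The variance of this estimator grows quickly with the lattice size and with the interaction strength, which is one reason more careful Monte Carlo schemes are needed for images of realistic size.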


Last Update: Jan. 30, 2008