Gerasimos Potamianos has joined the Institute of Informatics and Telecommunications at the National Center for Scientific Research (NCSR), "Demokritos", in Athens, Greece, as a Research Director.
He can be contacted at gpotam@ieee.org.
Updated information can be found at http://www.iit.demokritos.gr/~gpotam
This web page is no longer being maintained.
Gerasimos (Makis) Potamianos
Manager, Multimodal Conversational Solutions Department
Human Language Technologies / Multilingual Analytics and User Technologies, IBM T.J. Watson Research Center
Makis manages the Multimodal Conversational Solutions Department at the IBM Thomas J. Watson Research Center. In this role, he leads the definition and execution of the IBM research agenda for multimodal conversational technologies, which involves developing and integrating multiple input modalities (speech, video, etc.), natural language processing, and dialog management into conversational platforms and solutions.
A main focus of the group is developing conversational interaction solutions on embedded devices using natural language understanding and dialog management. The goal is to do so using free-form speech input, advancing beyond traditional command and control and allowing recognition of multiple tokens in one turn, and to implement sophisticated dialog management for resolving ambiguities and recovering from low-confidence input. The resulting systems are expected to run on commercially available embedded platforms. Part of this work is conducted within the DICIT project, which focuses on multimodal control (far-field voice and traditional remote control) of interactive television. Additional work is devoted to developing navigation and music selection systems that run on automotive embedded platforms.
In addition to his management responsibilities, Makis is currently involved in research activities that span the areas of multimodal speech processing for human-computer interaction and ambient intelligence. Work concentrates on audio-visual speech processing, automatic speech recognition, multimedia signal processing and fusion, as well as computer vision for human detection and tracking. These activities are briefly described next.
Makis' earlier research work (1990-1996) on language modeling and image analysis (while at the Johns Hopkins University) is discussed briefly at the end.
Technology Development for Ambient Intelligence as Part of the CHIL and NETCARITY Projects
Recent efforts of the group have been steered towards extending research expertise and assets developed as part of audio-visual speech research (discussed later) and monomodal technologies (speech, vision) to the problem of technology development for ambient intelligence environments. These efforts were initiated with IBM participation in the now completed European Union funded project CHIL (2004-2007), and continue in the recently commenced NETCARITY and DICIT efforts.
CHIL, "Computers in the Human Interaction Loop", is a recently completed integrated project within the FP6 European Union framework programme with the participation of 15 partner sites from nine countries, under the joint coordination of the Fraunhofer Institute für Informations- und Datenverarbeitung (IITB) and the Integrated Systems Lab (ISL) of the University of Karlsruhe (UKA), Germany. CHIL focuses on analyzing, understanding, and facilitating human interaction during lectures and meetings inside smart rooms equipped with multiple far-field audio and visual sensors. Based on these, the CHIL consortium effort has concentrated on detecting, classifying, and understanding human activity in the space, addressing the basic questions of the "who", "where", "what", "when", and "how" of the interaction. In the CHIL vision, computers fade into the background, reduced to discreet observers of human activity, ready to provide services proactively and implicitly supporting the meeting participants.
The IBM effort in this project has been led by Dr. Jan Kleindienst, head of a team located at IBM Czech Republic in Prague, and part of the Human Language Technology Department of IBM Research. The main emphasis of that work is on the CHIL architecture and user modeling. In particular, the IBM team has developed tools for facilitating communication between CHIL perception technology components and CHIL services, as well as for modeling the lecture/meeting "situation".
These activities are being complemented by research work led by Makis at the IBM T. J. Watson Research Center. In particular, Makis has been working on and overseeing efforts focusing on the following:
An overview of the IBM activities in the CHIL project can be found in Makis' keynote talk at VisHCI'06, with more details described in relevant publications.
The above work has continued in the framework of the NETCARITY project. Most recently, the IBM team led by Makis has been concentrating on the problem of acoustic scene analysis for smart home environments, with emphasis on the protection and monitoring of the elderly. For example, the work aims to detect acoustic events that signify danger (falls, break-ins), as well as longer-term activities of daily living.
Audio-Visual Speech Technologies Research
Visual information is beneficial to a number of speech technologies. For example, it can dramatically improve automatic speech recognition accuracy in noise (similarly to human lipreading), help disambiguate the "who" and "when" of the active speaker in multi-party interaction (speech activity detection, speaker localization), help with person recognition (identification, verification), and improve speech "delivery" through the use of visual speech synthesis (avatars, photorealistic talking faces).
Makis has been actively involved in many of the above aspects of audio-visual speech processing over the past several years with IBM, conducting research and coordinating team efforts. His main activity focuses on audio-visual automatic speech recognition (AVASR). He has worked on all aspects of the problem, including algorithms for face detection and tracking, extraction of speech-informative visual features, integration of audio and visual features into the speech recognition process, as well as development of AVASR prototypes. In this effort, Makis has also collaborated with external customers, as well as colleagues in academia, starting with the summer 2000 workshop at the Johns Hopkins University (WS'00). The work received an IBM Research Accomplishment award in 2002, and it led to the 2006 North American Frost and Sullivan Award for Excellence in Research in the speech recognition field, awarded to the IBM Corporation. Some highlights of Makis' research work on audio-visual speech technologies include:
In addition to AVASR, Makis has been conducting research concentrating on the problems of audio-visual speaker recognition, speech activity detection, and speech enhancement. All share significant commonalities with the AVASR problem. An overview of Makis' work on audio-visual speech technologies can be found in his recent keynote talk at VisHCI'06. Detailed algorithms and results can be found in his publications.
Prior Work on Statistical Language Modeling (1994-1996)
During his graduate studies, and as part of his teaching assistant duties, Makis began interacting with faculty at the newly formed Center for Language and Speech Processing (CLSP) at the Johns Hopkins University, and became interested in the problems of acoustic and language modeling for ASR. He subsequently remained at Hopkins past the completion of the Ph.D. as a PostDoctoral Fellow, focusing his research efforts on the problem of statistical language modeling.
Improving the performance of available language models is essential in the quest for reliable ASR, as well as improvements in machine translation, optical character recognition, and spelling correction. Two critical issues in language modeling are the partition of the observed "history" space into equivalence classes, and the estimation of conditional probabilities of the next word, given the observed equivalence class, based on typically sparse data. In his PostDoctoral research, Makis attacked both problems: equivalence classes were determined by means of n-gram language models, decision tree, and decision trellis classifiers, whereas conditional probabilities were estimated from sparse data using variants of the classical deleted interpolation smoothing algorithm.
More specifically, Makis developed a baseline n-gram language model, employing the widely available Brown corpus. In the process, he devised an improved smoothing algorithm that significantly reduced test data perplexity over the traditional smoothing approach. Makis then studied three variants of decision tree language models for the same problem, and obtained the best results by using a K-means type clustering algorithm to design decision tree splits. The decision tree language model was further improved by a merge-split algorithm, which converted the decision tree into a decision trellis. This approach gave encouraging results on the Brown corpus. Details can be found in [73].
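To make the two issues above concrete, the following is a minimal sketch of a bigram language model whose equivalence class is simply the previous word, with probabilities smoothed by linear interpolation against a unigram estimate and evaluated by test-data perplexity. It is not the improved smoothing algorithm described above. The function names, the add-one unigram floor, the single global weight `lam`, and the grid search over held-out data (a crude stand-in for the way classical deleted interpolation estimates its weights) are all assumptions made for illustration.

```python
import math
from collections import Counter


def train(tokens):
    """Collect the counts needed by an interpolated bigram model."""
    return {
        "uni": Counter(tokens),                  # word counts
        "hist": Counter(tokens[:-1]),            # counts of words seen as a history
        "bi": Counter(zip(tokens, tokens[1:])),  # (history, word) pair counts
        "n": len(tokens),
        "v": len(set(tokens)) + 1,               # +1 slot reserved for unseen words
    }


def prob(model, prev, word, lam):
    """P(word | prev) as a mix of bigram and add-one unigram estimates."""
    p_uni = (model["uni"][word] + 1) / (model["n"] + model["v"])
    h = model["hist"][prev]
    if h == 0:
        # Unseen history: nothing to interpolate, fall back to the unigram.
        return p_uni
    p_bi = model["bi"][(prev, word)] / h
    return lam * p_bi + (1.0 - lam) * p_uni


def perplexity(model, tokens, lam):
    """exp of the average negative log-probability over consecutive pairs."""
    pairs = list(zip(tokens, tokens[1:]))
    log_sum = sum(math.log(prob(model, p, w, lam)) for p, w in pairs)
    return math.exp(-log_sum / len(pairs))


def pick_lambda(model, heldout, grid=(0.1, 0.3, 0.5, 0.7, 0.9)):
    """Choose the interpolation weight that minimizes held-out perplexity."""
    return min(grid, key=lambda lam: perplexity(model, heldout, lam))


if __name__ == "__main__":
    train_tokens = "the cat sat on the mat the cat ate".split()
    heldout_tokens = "the cat sat on the mat".split()
    model = train(train_tokens)
    lam = pick_lambda(model, heldout_tokens)
    print("lambda:", lam)
    print("perplexity:", perplexity(model, "the mat sat".split(), lam))
```

Keeping `lam` strictly below 1 guarantees that every word pair receives nonzero probability, so the perplexity stays finite even for pairs never seen in training.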
PhD Thesis Work on Image Analysis Using Markov Random Fields (1990-1994)
Makis' doctoral thesis focused on the theory and applications of Markov random fields (MRFs) in image processing and analysis. MRFs belong to a well known exponential parametric family of random field models, and are extensively used for modeling spatial interaction phenomena in terms of a few parameters. Although conceptually simple, their probability distribution has a rather involved form, due to the intractable nature of its normalizing constant. This constant is known as the partition function and, in the general case, lacks a closed-form expression, thus hindering statistical inference (e.g., parameter estimation and hypothesis testing) of fully or partially observed MRF images. Reliable estimation of the partition function (and, hence, of likelihoods) would allow replacing various ad hoc and only moderately successful MRF parameter estimation techniques with efficient maximum likelihood parameter estimation. That would result in improved performance in problems such as image restoration, segmentation, texture modeling, classification, etc.
Motivated by the above facts, Makis' dissertation concentrated on the problem of efficiently estimating the partition function of MRF models. A stochastic simulation (i.e., Monte Carlo) approach was proposed for this purpose, with a number of Monte Carlo algorithms introduced and rigorously analyzed in terms of computational complexity and statistical properties of the resulting estimators. This unified analysis allowed a comparative study of the algorithms, and was backed up by extensive simulation experiments. The best such algorithm was then successfully applied to maximum likelihood statistical inference of MRF images. Original contributions of the dissertation included:
(a) Development of new partition function estimation (PFE) algorithms.
(b) Analysis and classification of all Monte Carlo PFE algorithms into two categories, in terms of their computational complexity.
(c) Suggestion of a practical and most efficient PFE algorithm, based on the above mentioned analysis.
(d) Application of this algorithm to Monte Carlo maximum likelihood based parameter estimation and hypothesis testing of fully and partially observed MRF images, as well as to image restoration.
Details can be found in [74].
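As a small illustration of why the partition function is hard and how Monte Carlo can estimate it, the sketch below uses a tiny Ising model (one simple MRF, chosen here as an assumption) on a lattice small enough that the exact sum over all configurations is still feasible. The estimator shown is the naive one that samples configurations uniformly and rescales the average weight; it is not one of the dissertation's PFE algorithms, and the function names and the coupling parameter `beta` are illustrative.

```python
import itertools
import math
import random


def energy(spins, rows, cols, beta):
    """Ising energy: -beta times the sum of products over neighboring spins."""
    total = 0
    for r in range(rows):
        for c in range(cols):
            v = spins[r * cols + c]
            if c + 1 < cols:                      # right neighbor
                total += v * spins[r * cols + c + 1]
            if r + 1 < rows:                      # neighbor below
                total += v * spins[(r + 1) * cols + c]
    return -beta * total


def exact_partition(rows, cols, beta):
    """Brute-force sum over all 2^(rows*cols) configurations."""
    configs = itertools.product((-1, 1), repeat=rows * cols)
    return sum(math.exp(-energy(x, rows, cols, beta)) for x in configs)


def mc_partition(rows, cols, beta, n_samples, rng):
    """Naive Monte Carlo: Z = 2^n * E_uniform[exp(-energy)]."""
    n = rows * cols
    acc = 0.0
    for _ in range(n_samples):
        x = [rng.choice((-1, 1)) for _ in range(n)]
        acc += math.exp(-energy(x, rows, cols, beta))
    return (2 ** n) * acc / n_samples


if __name__ == "__main__":
    rng = random.Random(0)
    print("exact:", exact_partition(3, 3, 0.2))
    print("Monte Carlo:", mc_partition(3, 3, 0.2, 20000, rng))
```

The exact sum doubles in cost with every added pixel, which is what rules it out for real images. The uniform estimator scales to larger lattices in cost but its variance grows quickly with lattice size and coupling strength, which is one reason more refined Monte Carlo schemes are needed in practice.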
Last Update: Jan. 30, 2008








