IBM Research has been actively pursuing research in Natural Language Processing over the last few decades. Our mission is to offer speech and language technologies that form the core of current and future products and solutions for processing natural language. We apply linguistic, knowledge-based, statistical, and hybrid methods to natural language processing. Our work also addresses theoretical issues of computational linguistics and encompasses a wide range of application areas, such as speech processing, machine translation, question answering, interactive dialogue systems, text mining and information extraction, natural language understanding and generation, information retrieval, and automatic text summarization.
Some current projects include:
Blue Lego
The goal of this project is business analytics over structured and unstructured data. By tightly integrating structured data with annotated unstructured text in an OLAP-like data model, this approach models the aggregated information extracted from unstructured text.
GlossOnt
A domain ontology building tool which applies text mining technology to build domain-specific ontologies. GlossOnt focuses on a particular domain concept at a time, and dynamically acquires source documents about the target concept, such as domain glossaries and search documents. It induces ontological concepts and relations which are related to the target concept from the source documents.
Language Translation
This project deals with natural language analysis and translation by computer. Because the expansion of the Internet increases the chance of encountering documents written in foreign languages, we focus on the translation of documents accessible through the Internet.
PIQUANT (Practical Intelligent QUestion ANswering Technology)
The focus of the PIQUANT project is to explore how best to integrate and balance various technologies, such as a knowledge base, NLP, planning, and traditional text-based IR, in order to build an efficient, modular, multi-agent Question Answering environment. The primary goal of PIQUANT is to improve question answering performance by leveraging knowledge-based, statistical, and linguistic approaches to QA. We have developed a modular and extensible QA architecture that facilitates the integration of independently produced knowledge sources, provides a uniform interface for accessing knowledge from these distinct sources, and enables the use of multiple answering agents that may employ vastly different strategies for answering questions.
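The idea of several answering agents behind one uniform interface can be illustrated with a small sketch. This is not the PIQUANT architecture itself: the two agents, their canned answers, and the rule of summing confidences are all hypothetical, chosen only to show how candidates from agents with different strategies might be merged.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

# Uniform interface (an assumption for this sketch): every agent maps a
# question to a list of (answer, confidence) candidates.
Agent = Callable[[str], List[Tuple[str, float]]]


def knowledge_base_agent(question: str) -> List[Tuple[str, float]]:
    """Hypothetical agent backed by a tiny hand-built fact table."""
    facts = {"capital of france": "Paris"}
    for key, value in facts.items():
        if key in question.lower():
            return [(value, 0.9)]
    return []


def text_search_agent(question: str) -> List[Tuple[str, float]]:
    """Hypothetical agent standing in for text-based retrieval."""
    if "capital of france" in question.lower():
        return [("Paris", 0.6), ("Lyon", 0.3)]
    return []


def combine_answers(question: str, agents: List[Agent]) -> List[Tuple[str, float]]:
    """Sum per-answer confidences across agents and rank the answers."""
    scores: Dict[str, float] = defaultdict(float)
    for agent in agents:
        for answer, confidence in agent(question):
            scores[answer] += confidence
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


ranked = combine_answers(
    "What is the capital of France?",
    [knowledge_base_agent, text_search_agent],
)
print(ranked[0][0])  # the answer supported by both agents ranks first
```

Because each agent only has to satisfy the shared call signature, a new strategy can be added to the list without changing the combination step.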
SAIL
Machine learning applied to information extraction shows promising results. However, to learn accurate annotators, learning algorithms require large amounts of accurately labeled training data, which is often unavailable. To address this problem in the area of named entities (e.g., names of people, places, organizations, genes, proteins, and diseases), we have developed an interactive learning workbench (SAIL) that substantially reduces the amount of manual labor and level of expertise required to train annotators. Key to SAIL is the use of our Robust Risk Minimization learning algorithm (Zhang et al., 2002), a very efficient statistical algorithm for learning linear classifiers that also provides in-class probability estimates. SAIL uses in-class probability estimates for various forms of active or semi-supervised learning, in which a developer need only evaluate, and perhaps revise, a relatively small number of candidate annotations to produce accurate results. SAIL provides a number of functions that speed up the learning process, e.g., automatic acceptance or rejection of annotations based on user-determined confidence levels.
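The confidence-based acceptance and rejection described above can be sketched as follows. This is a minimal illustration, not SAIL's implementation: the function name `triage`, the threshold values, and the sample probabilities are assumptions, and the probabilities are simply given rather than produced by a trained classifier.

```python
from typing import Dict, List, Tuple


def triage(
    candidates: List[Tuple[str, float]],
    accept_at: float = 0.9,   # assumed user-chosen acceptance threshold
    reject_at: float = 0.1,   # assumed user-chosen rejection threshold
) -> Dict[str, List[str]]:
    """Split candidate annotations by their in-class probability estimate."""
    accepted: List[str] = []
    rejected: List[str] = []
    uncertain: List[Tuple[str, float]] = []
    for text, prob in candidates:
        if prob >= accept_at:
            accepted.append(text)
        elif prob <= reject_at:
            rejected.append(text)
        else:
            uncertain.append((text, prob))
    # Show the developer the least certain candidates first (closest to 0.5),
    # a common active-learning heuristic.
    uncertain.sort(key=lambda item: abs(item[1] - 0.5))
    return {
        "accepted": accepted,
        "rejected": rejected,
        "review": [text for text, _ in uncertain],
    }


# Hypothetical candidate organization names with made-up probabilities.
result = triage([("IBM", 0.97), ("Watson", 0.55), ("the", 0.02), ("Yorktown", 0.80)])
print(result)
```

Only the candidates in the `review` bucket need human attention, which is where the reduction in manual labeling effort would come from.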
Statistical Machine Translation
In the Statistical Machine Translation project we are exploring new algorithms to improve the quality of machine translation. These algorithms exploit large parallel corpora, consisting of sentence pairs that are translations of each other, to build a statistical translation model between the two languages. We are building systems for several language pairs, including Arabic-English, Chinese-English, Hindi-English, and English-French. We are exploring systems that use word-to-word, phrase-to-phrase, and parse-based translation techniques. We have also developed the world's first statistical machine translation product for Arabic-English, which runs on Windows and Linux and has been deployed at customer locations.
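As an illustration of how a word-to-word translation model can be estimated from sentence pairs, here is a minimal sketch of expectation-maximization training in the style of the classic IBM Model 1. It is a textbook simplification and not a description of the project's systems: the toy German-English corpus is invented, there is no NULL word, and the function name `train_model1` is hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

SentencePair = Tuple[List[str], List[str]]  # (foreign tokens, English tokens)


def train_model1(pairs: List[SentencePair], iterations: int = 10) -> Dict[Tuple[str, str], float]:
    """Estimate t[(f, e)], an approximation of P(f | e), with EM."""
    foreign_vocab = {f for foreign, _ in pairs for f in foreign}
    # Start from a uniform distribution over the foreign vocabulary.
    t: Dict[Tuple[str, str], float] = defaultdict(lambda: 1.0 / len(foreign_vocab))
    for _ in range(iterations):
        count: Dict[Tuple[str, str], float] = defaultdict(float)
        total: Dict[str, float] = defaultdict(float)
        # E-step: collect expected alignment counts.
        for foreign, english in pairs:
            for f in foreign:
                norm = sum(t[(f, e)] for e in english)
                for e in english:
                    fraction = t[(f, e)] / norm
                    count[(f, e)] += fraction
                    total[e] += fraction
        # M-step: renormalize the counts into probabilities.
        for (f, e), c in count.items():
            t[(f, e)] = c / total[e]
    return t


# Invented toy parallel corpus.
corpus: List[SentencePair] = [
    ("das Haus".split(), "the house".split()),
    ("das Buch".split(), "the book".split()),
    ("ein Buch".split(), "a book".split()),
]
table = train_model1(corpus)
print(round(table[("Haus", "house")], 3), round(table[("das", "house")], 3))
```

Even on three sentence pairs, the co-occurrence pattern lets the model prefer "Haus" over "das" as the translation of "house"; phrase-based and parse-based techniques build on richer units than single words.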
Text mining
The aim of the text mining project is to research technologies for discovering useful knowledge in enormous collections of documents, and to develop a system that provides this knowledge and supports the user's decisions. Data mining technologies usually mine knowledge from data with well-formed schemas, such as relational tables. Text data, however, have no such schema, and information is expressed freely in the documents. Therefore, we focus on Natural Language Processing (NLP) technologies to extract such information. Using NLP technologies, documents are transformed into a collection of concepts, described using terms discovered in the text.
XTeKS (Extensible Text Knowledge Services)
A prototype text analysis system that richly annotates documents in a topic domain, based on thesauri and other input linguistic resources. Annotations include entity names and selected categories of facts and relations. These annotations can support advanced text search, document clustering, and text mining applications. A version of XTeKS called "BioTeKS" is being used to explore the analysis of biomedical text for problem solving in the Life Sciences.
"HowtogetaChineseName: Segmentation and Combination Issues", Hongyan Jing, Radu Florian, Xiaoqiang Luo, Tong Zhang, and Abraham Ittycheriah, EMNLP-2003.
"In Question Answering, Two Heads are Better Than One", Jennifer Chu-Carroll, Krzysztof Czuba, John Prager, and Abraham Ittycheriah, HLT/NAACL-2003.
"Language Model Based Arabic Word Segmentation", Young-Suk Lee, Kishore Papineni, Salim Roukos, Ossama Emam, and Hany Hassan, ACL-2003.
"Sentiment analysis: capturing favorability using natural language processing", Tetsuya Nasukawa and Jeonghee Yi, The Second International Conference on Knowledge Capture (K-CAP 2003).
"Towards Ontologies On Demand", Youngja Park, Roy Byrd, and Branimir Boguraev, Proceedings of Workshop on Semantic Web Technologies for Scientific Search and Information Retrieval.
"tRuEcasIng", Lucian Vlad Lita, Abe Ittycheriah, Salim Roukos, and Nanda Kambhatla, ACL-2003.
