IBM Israel Research Seminars
 
Information extraction -- building annotators to find structured information in unstructured text -- is an important component of many emerging enterprise applications. For the past three and a half years, the Avatar group at IBM Almaden Research Center has been working to make these applications a reality by building a robust and scalable platform for information extraction. Our work has already found its way into two IBM products, and we hope to bring out several additional applications in the coming years.
This talk describes our experiences building annotators and the systems that support them. Our initial efforts made use of cascading regular expression grammars, the current state of the art in rule-based information extraction. In the course of our work, we encountered issues with the performance and expressivity of cascading grammars. These issues motivated several generations of improvements, culminating in the development of an algebraic approach to information extraction. We defined an algebra of basic text extraction operations and a system for composing these operators into dataflow graphs. These operator graphs can find important concepts that are difficult or impossible to express using a cascading grammar. The algebraic approach has also led to significant performance improvements. Borrowing ideas from relational query optimizers, we have developed several optimization strategies that greatly increase annotator performance without changing semantics.
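To make the idea of composing extraction operators into a dataflow graph more concrete, here is a minimal sketch in Python. It is not the Avatar group's system or its algebra: the names `Span`, `regex`, and `follows`, the sample patterns, and the sample text are all assumptions invented for illustration. Each operator maps a document to a list of spans, and a join-like operator combines the outputs of two upstream operators.

```python
import re
from typing import Callable, List, NamedTuple


class Span(NamedTuple):
    """A half-open character range [begin, end) in the input text."""
    begin: int
    end: int


# Hypothetical operator type: a function from document text to spans.
Operator = Callable[[str], List[Span]]


def regex(pattern: str) -> Operator:
    """Leaf operator: emit one span per regular expression match."""
    compiled = re.compile(pattern)

    def run(text: str) -> List[Span]:
        return [Span(m.start(), m.end()) for m in compiled.finditer(text)]

    return run


def follows(left: Operator, right: Operator, max_gap: int) -> Operator:
    """Join-like operator: pair each left span with every right span that
    starts between 0 and max_gap characters after the left span ends, and
    emit the merged span covering both."""

    def run(text: str) -> List[Span]:
        right_spans = right(text)
        merged = []
        for l in left(text):
            for r in right_spans:
                if 0 <= r.begin - l.end <= max_gap:
                    merged.append(Span(l.begin, r.end))
        return merged

    return run


# A tiny operator graph: two leaf extractors feeding one join.
first_name = regex(r"\b(?:Alice|Robert)\b")   # illustrative name list
capitalized = regex(r"\b[A-Z][a-z]+\b")       # any capitalized word
person = follows(first_name, capitalized, max_gap=1)

text = "Contact: Robert Smith of Example Corp"
matches = [text[s.begin:s.end] for s in person(text)]
print(matches)
```

Because `person` is built from operator values rather than from a fixed sequence of grammar passes, a planner could in principle reorder or share the leaf extractions without changing the result, which is the kind of optimization opportunity the abstract attributes to the algebraic approach.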
If time permits, I will describe some preliminaries of AQL, a language that enables the definition of annotator rules in a declarative manner and operates on top of the algebra discussed above.
Short bio:
Frederick Reiss is a Research Staff Member at IBM Almaden Research Center in San Jose, California. He joined IBM in 2006 after receiving a Ph.D. in Database Systems from U.C. Berkeley.
 
- Speaker: Frederick R. Reiss, IBM
- Time: 12/06/2008, 11:00 AM - 12:00 PM
