Helping employees quickly find the correct and precise information they need is key to improving the productivity of modern medium to large enterprises. Such organizations often produce and accumulate textual data from a wide range of sources and in a variety of formats, and effective tools for searching over that data are central to their success.
Today's medium to large enterprises and organizations often generate and accumulate large amounts of data in wide-ranging formats. Much of the data is textual, such as office-type documents, HTML pages, email, and internal newsgroups and forums. In addition, both text and numeric data are often kept in relational databases. The data relevant to an employee's information need is often spread across several of these data sources, and may be in multiple digital formats. Helping employees quickly find the correct and precise information they need (and which they are allowed to access) is crucial to improving the productivity of such organizations.
Trevi Key Technologies
The challenges facing enterprise search solutions are quite different from those that Web search engines face. Web search engines operate at a much larger scale, both in the amount of data and in the number of users and queries. However, enterprise engines usually need to index data much more frequently in order to serve fresh content. Enterprise solutions need to collect, index and rank content from more varied data sources than Web search engines, and need to do a better job of estimating result set sizes (Internet search engines often intentionally report inaccurate result counts). Enterprise users often require advanced search features, so the performance emphasis must shift to accommodate scenarios that are rarely encountered on the Web. Correspondingly, the query syntax of enterprise solutions is typically richer than what is supported on the Web. In enterprises, users are often not allowed to see all documents, so search solutions must return only results that the querying person is actually allowed to see. This challenge doesn't exist on the Web, where search engines only index data that is open to the general public.
The research team responded to the challenges above by developing highly efficient and flexible indexing and retrieval algorithms that deliver high-quality results. Search indices may be updated dozens of times each day, and novel data representation techniques were developed to keep index sizes small while still preserving fine-grained details of the original data. To support the rich retrieval requirements of the enterprise world, a flexible query language was devised that combines easy-to-use, common free-text syntax with advanced features such as taxonomy-based constraints, searches over numeric data and security-driven filtering of results. The indexing and retrieval components were adapted accordingly to support the evaluation of these rich queries. To improve the quality of search results, each index undergoes a phase of global analysis, whose logic is tailored to the characteristics of the various data sources from which its documents originate. Furthermore, the query engine automatically and dynamically tunes the ranking parameters for each incoming query, based on the statistical properties of that query. To monitor and measure search quality, new robust and scalable evaluation methodologies for information retrieval were developed.
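As a rough illustration of how a query might combine free-text terms with taxonomy, numeric and security constraints, the sketch below runs over a toy in-memory collection. Everything in it is an assumption made for illustration: the names `Doc`, `Query`, `under` and `search`, the slash-separated taxonomy paths, the group-based access model and the term-count scoring are hypothetical and do not reflect the actual Trevi or OmniFind query language or APIs.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    doc_id: str
    text: str
    taxonomy: str                                   # hypothetical category path, e.g. "hr/policies"
    numeric: dict = field(default_factory=dict)     # numeric fields, e.g. {"year": 2004}
    allowed_groups: frozenset = frozenset()         # groups permitted to read the document


@dataclass
class Query:
    terms: list                                     # free-text terms
    taxonomy_prefix: str = ""                       # "" means no taxonomy constraint
    numeric_ranges: dict = field(default_factory=dict)  # field -> (low, high), inclusive


def under(path, prefix):
    """True if `path` equals `prefix` or lies below it in the category tree."""
    return prefix == "" or path == prefix or path.startswith(prefix + "/")


def search(docs, query, user_groups):
    """Return ids of matching documents the user may read, best score first."""
    groups = set(user_groups)
    results = []
    for doc in docs:
        # Security-driven filtering: skip documents the user cannot read.
        if not (doc.allowed_groups & groups):
            continue
        # Taxonomy-based constraint.
        if not under(doc.taxonomy, query.taxonomy_prefix):
            continue
        # Numeric constraints: every requested field must exist and be in range.
        if not all(
            name in doc.numeric and low <= doc.numeric[name] <= high
            for name, (low, high) in query.numeric_ranges.items()
        ):
            continue
        # Toy relevance score: total occurrences of the query terms.
        words = doc.text.lower().split()
        score = sum(words.count(term.lower()) for term in query.terms)
        if score > 0:
            results.append((score, doc.doc_id))
    # Highest score first; ties broken by document id for a stable order.
    results.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in results]


docs = [
    Doc("d1", "travel expense policy for employees", "hr/policies",
        {"year": 2004}, frozenset({"all"})),
    Doc("d2", "executive expense report", "finance/reports",
        {"year": 2004}, frozenset({"executives"})),
    Doc("d3", "expense policy archive expense", "hr/policies",
        {"year": 2001}, frozenset({"all"})),
]

query = Query(terms=["expense", "policy"], taxonomy_prefix="hr",
              numeric_ranges={"year": (2003, 2005)})
print(search(docs, query, {"all"}))  # ['d1']
```

Because documents the user cannot read are excluded before scoring, the length of the returned list is an exact count of the visible matches in this sketch. A production engine would evaluate such constraints against index structures rather than by scanning every document.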
The technology developed by Research has been incorporated in WebSphere Information Integrator OmniFind Edition, which adds enterprise search to IBM's expanding portfolio of data/content management solutions.
WebSphere Information Integrator OmniFind Edition evolved from an IBM project called Trevi, which aimed to develop a scalable search solution for the IBM intranet (one of the larger intranets in the private sector, with a very large user population). Trevi combined large-scale system skills and indexing expertise from Almaden Research Center with award-winning information retrieval technologies developed in the Haifa Research Lab, along with crawling, integration and administration components developed in IBM's Silicon Valley Lab. Trevi compared favorably with the then-operational outsourced search solution for IBM's intranet, and replaced that product in September 2003. At the same time, IBM's Software Group decided to invest in expanding Trevi into a general, competitive, enterprise-ready search product. The first release of WebSphere Information Integrator OmniFind Edition became available in November 2004.
Related Publications
Mining Anchor Text for Query Refinement, R. Kraft and J. Zien, Thirteenth International World Wide Web Conference (WWW'2004), New York, NY, USA, 2004.
Scaling IR-System Evaluation using Term Relevance Sets, E. Amitay, D. Carmel, R. Lempel, A. Soffer, 27th Annual International Conference on Research and Development in Information Retrieval (SIGIR'2004), Sheffield, UK, 2004.
High Performance Index Build Algorithms for Intranet Search Engines, M. F. Fontoura, A. Neumann, S. Rajagopalan, E. Shekita, and J. Zien, 30th International Conference on Very Large Data Bases (VLDB'2004), Toronto, Canada, 2004.
Efficient Query Evaluation using a Two-Level Retrieval Process, A. Broder, D. Carmel, M. Herscovici, A. Soffer, J. Zien, Twelfth International Conference on Information and Knowledge Management (CIKM'2003), New Orleans, LA, USA, 2003.











