Enterprise Search Technology

Innovation Matters


Helping employees quickly find the correct and precise information they need is essential to improving the productivity of modern medium to large enterprises. Such organizations often produce and accumulate textual data from a wide range of sources and in a variety of formats, and effective tools for searching that data are central to their success.

Today's medium to large enterprises and organizations often generate and accumulate large amounts of data in wide-ranging formats. Much of the data is textual: office-type documents, HTML pages, email, internal newsgroups and forums, and so on. In addition, both text and numeric data are often kept in relational databases. The data relevant to an employee's information need is often spread across several of these data sources and may be stored in multiple digital formats. Helping employees quickly find the correct and precise information they need (and which they are allowed to access) is crucial to improving the productivity of such organizations.

WebSphere Information Integrator OmniFind Edition
Trevi Key Technologies

The challenges facing enterprise search solutions are quite different from those facing Web search engines. Web search engines operate at much larger scales, both in the amount of data and in the number of users and queries. However, enterprise engines usually need to index data much more frequently in order to serve fresh content. Enterprise solutions need to collect, index and rank content from more varied data sources than Web search engines, and need to do a better job of estimating result set sizes (Internet search engines often intentionally report inaccurate result counts). Enterprise users often require advanced search features, so the performance emphasis must shift to accommodate scenarios that are rarely encountered on the Web. Correspondingly, the query syntax of enterprise solutions is typically richer than what is supported on the Web. In enterprises, users are often not allowed to see all documents, so search solutions must return only the results that the person issuing the query is actually allowed to see. This challenge doesn't exist on the Web, where search engines only index data that is open to the general public.
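
The access restriction described above can be illustrated with a minimal Python sketch. It is not how OmniFind enforces security; the `Document` class, the `allowed_groups` field and the `search` function are hypothetical names chosen for illustration, and the matching logic is deliberately naive. The point is only that the permission check is applied before a hit is counted or returned, so the result list and its size reflect what the user may actually see.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    doc_id: str
    text: str
    allowed_groups: frozenset  # hypothetical ACL: groups permitted to read the document


def search(docs, query, user_groups):
    """Return IDs of documents that contain every query term and that the user may read."""
    terms = query.lower().split()
    user_groups = frozenset(user_groups)
    hits = []
    for doc in docs:
        words = set(doc.text.lower().split())
        matches = all(term in words for term in terms)
        # The user may read the document if they share at least one group with its ACL.
        readable = bool(doc.allowed_groups & user_groups)
        if matches and readable:
            hits.append(doc.doc_id)
    return hits


DOCS = [
    Document("hr-001", "salary review schedule", frozenset({"hr"})),
    Document("eng-007", "build schedule for the search release", frozenset({"engineering", "hr"})),
    Document("pub-100", "cafeteria schedule", frozenset({"everyone"})),
]

# The same query yields different results for different users.
engineer_hits = search(DOCS, "schedule", {"engineering", "everyone"})
hr_hits = search(DOCS, "schedule", {"hr"})
```

In this sketch an engineer never learns that `hr-001` exists, and the count `len(engineer_hits)` is exact for that user because filtering happens before counting.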

The research team responded to the challenges above by developing highly efficient and flexible indexing and retrieval algorithms that deliver high-quality results. Search indices may be updated dozens of times each day, and novel data representation techniques were developed to keep index sizes small while still preserving fine-grained details of the original data. To support the rich retrieval requirements of the enterprise world, a flexible query language was devised that combines easy-to-use, common free-text syntax with advanced features such as taxonomy-based constraints, searches over numeric data and security-driven filtering of results. The indexing and retrieval components were adapted accordingly to support the evaluation of these rich queries. To improve the quality of search results, each index undergoes a phase of global analysis, whose logic is tailored to the characteristics of the various data sources from which its documents originate. Furthermore, the ranking parameters for each incoming query are automatically and dynamically tuned by the query engine, based on the statistical properties of that query. To monitor and measure search quality, new robust and scalable evaluation methodologies for information retrieval were developed.
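
To make the idea of a query that mixes free text with taxonomy and numeric constraints more concrete, here is a small Python sketch. It is an illustration under stated assumptions, not the product's query language or index format: the `TinyIndex` class, its `add` and `search` methods, the slash-separated category paths and the `year` field are all hypothetical. Free-text terms are resolved by intersecting postings lists, and the structured constraints are then applied to the surviving candidates.

```python
from collections import defaultdict


class TinyIndex:
    """Toy inverted index with per-document taxonomy and numeric metadata (illustrative only)."""

    def __init__(self):
        self.postings = defaultdict(set)  # term -> set of document IDs
        self.taxonomy = {}                # document ID -> category path as a tuple
        self.numeric = {}                 # document ID -> {field name: numeric value}

    def add(self, doc_id, text, category, **numbers):
        for term in text.lower().split():
            self.postings[term].add(doc_id)
        self.taxonomy[doc_id] = tuple(category.split("/"))
        self.numeric[doc_id] = numbers

    def search(self, text, under=None, ranges=None):
        terms = text.lower().split()
        if not terms:
            return []
        # Free-text part: a document must contain every term.
        candidates = set.intersection(*(self.postings.get(t, set()) for t in terms))
        prefix = tuple(under.split("/")) if under else ()
        results = []
        for doc_id in sorted(candidates):
            # Taxonomy constraint: the document's category must lie under the given node.
            if self.taxonomy[doc_id][:len(prefix)] != prefix:
                continue
            # Numeric constraints: every requested field must exist and fall in its range.
            values = self.numeric[doc_id]
            in_range = all(
                field in values and low <= values[field] <= high
                for field, (low, high) in (ranges or {}).items()
            )
            if in_range:
                results.append(doc_id)
        return results


index = TinyIndex()
index.add("d1", "quarterly sales report", "finance/reports", year=2004)
index.add("d2", "sales training guide", "hr/training", year=2003)
index.add("d3", "annual sales report", "finance/reports", year=2002)

recent_finance = index.search("sales report", under="finance", ranges={"year": (2003, 2004)})
```

Here `recent_finance` contains only `d1`: `d2` lacks the term "report" and sits outside the finance branch, and `d3` falls outside the requested year range. A production engine would evaluate such constraints inside the index rather than as a post-filter, but the combined semantics are the same.
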
The technology developed by Research has been incorporated in WebSphere Information Integrator OmniFind Edition, which adds enterprise search to IBM's expanding portfolio of data and content management solutions.

WebSphere Information Integrator OmniFind Edition evolved from an IBM project called Trevi, which aimed to develop a scalable search solution for the IBM intranet (one of the larger intranets in the private sector, coupled with a very large user population). Trevi combined large-scale system skills and indexing expertise from Almaden Research Center with award-winning information retrieval technologies developed in the Haifa Research Lab, along with crawling, integration and administration components developed in IBM's Silicon Valley Lab. Trevi compared favorably with the then-operational outsourced search solution for IBM's intranet, and replaced that product in September 2003. At the same time, IBM's Software Group decided to invest in expanding Trevi into a general, competitive, enterprise-ready search product. The first release of WebSphere Information Integrator OmniFind Edition became available in November 2004.

Related Publications  

Mining Anchor Text for Query Refinement, R. Kraft and J. Zien, Thirteenth International World Wide Web Conference (WWW'2004), New York, NY, USA, 2004.

Scaling IR-System Evaluation using Term Relevance Sets, E. Amitay, D. Carmel, R. Lempel, A. Soffer, 27th Annual International Conference on Research and Development in Information Retrieval (SIGIR'2004), Sheffield, UK, 2004.

High Performance Index Build Algorithms for Intranet Search Engines, M. F. Fontoura, A. Neumann, S. Rajagopalan, E. Shekita, and J. Zien, 30th International Conference on Very Large Data Bases (VLDB'2004), Toronto, Canada, 2004.

Efficient Query Evaluation using a Two-Level Retrieval Process, A. Broder, D. Carmel, M. Herscovici, A. Soffer, J. Zien, Twelfth International Conference on Information and Knowledge Management (CIKM'2003), New Orleans, LA, USA, 2003.


Innovator's corner  

Marcus Fontoura, Researcher

What is the most exciting potential future use for the work you're doing?
One of the most important problems today is searching large data sets such as the Web, large collections of text, genomic information, and databases. Trevi represents a significant advance in search technology in the areas of scalability, performance, search result quality, ease of use and automatic operation. Trevi is the basis of the recently released WebSphere Information Integrator OmniFind Edition enterprise search product, which enables corporations to index and search large document collections very efficiently.

What is the most interesting part of your research?
Developing new algorithms and data structures that advance search technology into new areas. I also enjoy the cooperation between research and development, including understanding the business needs and finding the best ways to transition technology built in the research labs into IBM products.

What inspired you to go into this field?
The challenges of helping people to be more productive in this era of the Web and information explosion.

What is your favorite invention of all time?
Musical instruments.

Research team  

Andrei Broder

Nadav Eiron

Michael Herscovici

Shila Ofek-Koifman

Dafna Sheinwald

Eugene Shekita

Aya Soffer

Benjamin Sznajder

Jason Zien

Related Research  

Disciplines: Computer Science
Research Areas: Web
Research Labs: Almaden Research Center, Haifa Research Lab