Online data continues to grow at an explosive pace, due to the Internet and the widespread use of database technology. This phenomenon has created an immense opportunity and the need for methodologies of Knowledge Discovery and Data Mining (KDD). An interdisciplinary area, KDD focuses upon building automated techniques for extracting useful knowledge from data. Research in this area draws principally upon methods from statistics, data management, pattern recognition, and machine learning to deliver advanced techniques for business intelligence. IBM Research has been at the forefront of this exciting new area from its very beginning. Key advances in robust and scalable data mining techniques, methods for fast pattern detection in very large databases, text and web mining, as well as innovative business-intelligence applications have come from our worldwide research laboratories.
An area of particular focus in our KDD research has been high-performance, scalable data-mining techniques for large-scale databases and data repositories. IBM’s early leadership in this area was established by our invention of association-rule and sequential-patterns technology for efficiently detecting patterns in large-scale databases. These and other technologies for scalable and parallel data mining provided the original basis and impetus for IBM’s flagship data-mining products. This theme continues in recent research activities that include automatic subspace clustering, discovery-driven exploration of OLAP (On Line Analytical Processing) data cubes, and fast techniques for pre-computing and maintaining OLAP data.
Another area of investigation has focused on predictive data-mining algorithms, systems, and solutions. One long-term effort has centered on rule-based predictive modeling and its integration into data-mining frameworks. A recent effort has resulted in new data-mining middleware for rule-based probabilistic estimation, which combines machine learning with principles from statistical learning theory and data management for scalable predictive modeling of massive data sets. This technology has been embedded in innovative business-intelligence applications for areas such as insurance risk management and retail targeted marketing. Related research continues in such areas as rare-event predictive mining, robust feature selection, ensemble-based and regularization methods for predictive modeling, and support vector machines.
Some selected on-going projects include:
- Data Analytics Research
The goals of the project are to develop scalable, automated data-mining techniques based on machine learning and statistical modeling for extracting actionable insights from structured and unstructured data sources; to embed these techniques in middleware platforms; and to engage in consulting and services that drive our research agenda for novel predictive-modeling solutions to data-rich problems in business and industry. 2003 activities include: developing a parallelized implementation of the ProbE predictive-modeling technologies for database environments; developing a Cross-Channel Optimized Marketing (CCOM) solution for the retail industry; developing data-mining-based yield-management solutions for the travel and transportation industry; developing data-mining-based solutions for market intelligence and competitive insights; and basic research in KDD and ML (cost-sensitive learning, active learning, reinforcement learning, game theory, and margin classifiers).
- Exploratory Data Mining
Current goals of the project include developing advanced algorithms to explore new data mining techniques and applications, indexing techniques that facilitate non-conventional similarity searches, and other optimization and performance topics (such as optimization techniques for scheduling). In the data mining area, we focus on overcoming the limitations and shortcomings of state-of-the-art mining methods and on exploring new application areas enabled by the new mining capabilities. These include stream data mining, grid mining, interactive mining, and anomaly detection.
- Event Mining
To date, event management has depended largely on model-based approaches, in which domain knowledge is acquired and translated into rules used for real-time monitoring. However, this approach becomes cumbersome for managing distributed computer systems, because such systems are dynamic in nature and evolve rapidly. Since 1998, we have developed a data-driven, or discovery-based, approach. It is fundamentally different in that we aim to learn rules from collected historical data through visual analysis and event mining techniques. Complementary to the knowledge-based approaches, it can be used to learn rules that the model-based approaches have not captured, to validate models, and to assist in establishing a model. Further, it can be used to analyze historical event data to discover unexpected yet interesting information embedded in the data. Such information can be used to examine the performance of an event management system, pinpoint installation issues, discover normal and abnormal situations, and analyze and validate the temporal relationships of events. Several data mining tools have been developed, including Event Browser (a tool for visualizing and analyzing a large number of events at multiple information levels) and Event Miner (an integrated set of algorithms for automatically discovering event patterns).
- Query Refinement and Disambiguation
We have built a system that answers queries by navigating large hierarchies or taxonomies such as Internet directories, library catalogues, and product catalogues. The system also has query refinement and disambiguation tools, which determine what part of the hierarchy is relevant to the user's query by seeking relevance feedback. Our approach to query disambiguation is to generate a compact representation of all contexts of the query from all documents that are possibly relevant to it. The user can choose a particular context, thereby clarifying the query, and the system then continues the search within that context. SearchEssence and ContextMiner are two examples of miners that help refine the query and zero in on the information the user needs. SearchEssence generates a topic hierarchy from the results returned by a search engine. Each node in the hierarchy is associated with a topic or sub-topic and is populated with search results that are representative of that node. ContextMiner derives the concepts, features, and specializations associated with the query. The technology behind this miner includes part-of-speech tagging followed by mining of generalized disjunctive association rules.
- eShopMonitor
Automation of site monitoring can provide valuable market intelligence and insight. It can also help maintain the quality of the site. For example, the ability to identify incorrect or missing data quickly will help prevent problems such as the $9.99 air fare, missing URLs, and products disappearing and reappearing on the Web. Moreover, the ability to track trends at competitor sites in a timely fashion will add significantly to commerce intelligence. In the eShopMonitor project, we developed solutions to several problems related to automatically extracting information from Web sites, detecting anomalies in prices or descriptions, and tracking trends. The technology can also be used for other applications such as price comparisons and competitive intelligence. The eShopMonitor includes: a crawler component, which can crawl dynamic pages by emulating form filling and JavaScript execution; a miner component, which can be configured to extract fields of interest based on a few examples given by the user and which stores the extracted information in a database; a query language and interface for accessing the stored data; and a method for automatically reporting interesting changes to subscribers.
- Privacy Preserving Data Mining
The goal of this project is to invent algorithms for discovering accurate data mining models without access to precise information in individual data records, thus finessing the conflict between privacy and data mining. This research provides a practical way out of the current dilemma facing e-businesses: how to create services based on data analytics while respecting the privacy of individuals and complying with the privacy regulations on the horizon (already in place in Europe and affecting multinationals in the U.S. under Safe Harbor). Our initial work developed algorithms for building classifiers from training data in which the values of individual records have been perturbed. Recently we also invented algorithms for mining association rules from transactions consisting of categorical items, where the data has been randomized to preserve the privacy of individual transactions.
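As a rough illustration of how an aggregate statistic can be recovered from randomized records, the sketch below uses classic randomized response on a single yes/no item. This is a simpler scheme than the randomization used in the work described above and is not the project's algorithm. The function names, the keep probability `p_keep`, and the toy data are all assumptions made for illustration.

```python
import random


def randomize(bits, p_keep, rng):
    """Report each 0/1 value truthfully with probability p_keep, else flip it."""
    return [b if rng.random() < p_keep else 1 - b for b in bits]


def estimate_support(reported, p_keep):
    """Estimate the true fraction of 1s from randomized reports.

    The expected observed fraction is
        p_keep * s + (1 - p_keep) * (1 - s),
    where s is the true fraction; this function solves that equation for s.
    """
    if p_keep == 0.5:
        raise ValueError("p_keep = 0.5 destroys all information")
    observed = sum(reported) / len(reported)
    return (observed - (1 - p_keep)) / (2 * p_keep - 1)


# Hypothetical data: 30% of 10,000 transactions contain the item.
rng = random.Random(0)
true_bits = [1] * 3000 + [0] * 7000
reported = randomize(true_bits, 0.8, rng)   # what the miner is allowed to see
estimate = estimate_support(reported, 0.8)  # close to 0.30 for large samples
```

No individual report can be trusted, since any one of them may have been flipped, yet the estimate over many records converges on the true support. The estimate is noisy for small samples and can fall slightly outside the range 0 to 1.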
Related Publications
"A Clustering Algorithm for Asymmetrically Related Data with its Applications to Text Mining", K. Krishna and R. Krishnapuram, CIKM 2001, Atlanta, USA, November 2001.
"A System for Real-time Competitive Market Intelligence", S. M. Weiss and N. K. Verma, in Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (SIGKDD), Edmonton, Canada, July 2002.
"Clustering by Pattern Similarity in Large Data Sets", H. Wang, J. Yang, W. Wang, and P.S. Yu, Proc. ACM SIGMOD Conference, Madison, WI, June 2002.
"Discovering actionable patterns in event data", J.L. Hellerstein, S. Ma, and C. Perng, IBM Systems Journal, Vol. 41, No. 3, 2002.
"Mining Generalised Disjunctive Association Rules", Amit A. Nanavati, Krishna P. Chitrapura, Sachindra Joshi, and Raghu Krishnapuram, CIKM 2001, Atlanta, November 2001, pp. 482-489.
"Privacy Preserving Mining of Association Rules", A. Evfimievski, R. Srikant, R. Agrawal, and J. Gehrke, in Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (SIGKDD), Edmonton, Canada, July 2002.
"Sequential Cost-Sensitive Decision Making with Reinforcement Learning", E. Pednault, N. Abe, B. Zadrozny, H. Wang, W. Fan, and C. Apte, in Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (SIGKDD), Edmonton, Canada, July 2002.
"A Probabilistic Estimation Framework for Predictive Modeling Analytics", C. Apte, R. Natarajan, E.P.D. Pednault, and F. Tipu, in IBM Systems Journal, Vol. 41, No. 3, August 2002.
"Mining Associations between Sets of Items in Massive Databases", R. Agrawal, T. Imielinski, and A. Swami, Proc. of the ACM SIGMOD Int'l Conference on Management of Data, Washington D.C., pages 207-216, May 1993.
"Privacy-Preserving Data Mining", R. Agrawal and R. Srikant, Proc. of the ACM SIGMOD Conference on Management of Data, Dallas, May 2000.
Recent Accomplishments
- Parallelized predictive modeling integrated into parallel database environments
ProbE (for Probabilistic Estimation) is a customizable data mining engine that is being developed to enhance IBM's predictive modeling products and services. ProbE mines data to produce a segmentation-based if-then predictive model. The segments are defined by conditions that appear in the "if" part of each rule, which can be range tests on reals, integers, and ordinals, or subset tests on nominals. Predictions are made by statistical models that appear in the "then" part of each rule, which can be linear regression with feature selection, logistic regression with feature selection, or more specialized models. A key consideration in the design of ProbE is scalability. ProbE is designed to work with very large, out-of-memory data sets. ProbE is also designed to be data-partition parallelized: large data sets are partitioned across multiple processors, each processor accesses data only in the partition assigned to it, and only statistical summary information is exchanged among processors. Because the approach minimizes the amount of communication among processors, it is anticipated that it will achieve near-linear improvements in execution speed. We are currently developing and testing prototypes of parallelized predictive modeling integrated into parallel database environments.
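The data-partition idea can be sketched with a one-feature least-squares fit, where each partition is reduced to a handful of running sums and only those sums are combined. This is a minimal illustration, not ProbE's implementation: the `Summary` class, the function names, and the toy data are hypothetical, and a plain loop stands in for the separate processors.

```python
from dataclasses import dataclass


@dataclass
class Summary:
    """Sufficient statistics for simple linear regression of y on x."""
    n: int = 0
    sx: float = 0.0
    sy: float = 0.0
    sxx: float = 0.0
    sxy: float = 0.0

    def merge(self, other):
        """Combine two summaries; this is the only data that is exchanged."""
        return Summary(self.n + other.n, self.sx + other.sx,
                       self.sy + other.sy, self.sxx + other.sxx,
                       self.sxy + other.sxy)


def summarize(partition):
    """Scan one partition once and reduce it to its summary."""
    s = Summary()
    for x, y in partition:
        s.n += 1
        s.sx += x
        s.sy += y
        s.sxx += x * x
        s.sxy += x * y
    return s


def fit(summary):
    """Least-squares slope and intercept computed from the merged summary."""
    denom = summary.n * summary.sxx - summary.sx ** 2
    if denom == 0:
        raise ValueError("x has no variance; slope is undefined")
    slope = (summary.n * summary.sxy - summary.sx * summary.sy) / denom
    intercept = (summary.sy - slope * summary.sx) / summary.n
    return slope, intercept


# Toy data on the line y = 2x + 1, split into three partitions.
data = [(float(x), 2.0 * x + 1.0) for x in range(10)]
partitions = [data[0:3], data[3:7], data[7:10]]

total = Summary()
for part in partitions:          # each iteration stands in for one processor
    total = total.merge(summarize(part))

slope, intercept = fit(total)
```

Each partition contributes five numbers regardless of how many rows it holds, so the communication cost does not grow with the size of the data.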
- Data-driven event management design announced as new service offering
Based on our event mining techniques and tools, a new IGS service offering, data-driven event management design, was announced in October 2002. Since then, several customer engagements have been successfully completed to improve customers' event management systems.
