Spotlight Recent Publications

"A Clustering Algorithm for Asymmetrically Related Data with its Applications to Text Mining," K. Krishna and Raghu Krishnapuram, CIKM-2001, Atlanta, USA, November 2001, pp. 571-573.

Clustering techniques find a collection of subsets of a data set such that the collection satisfies a criterion that is dependent on a relation defined on the data set. The underlying relation is traditionally assumed to be symmetric. However, there exist many practical scenarios where the underlying relation is asymmetric. One example of an asymmetric relation in text analysis is the inclusion relation, i.e., the inclusion of the meaning of a block of text in the meaning of another block. In this paper, we consider the general problem of clustering of asymmetrically related data and propose an algorithm to cluster such data. To demonstrate its usefulness, we consider two applications in text mining: (1) summarization of short documents, and (2) generation of a concept hierarchy from a set of documents. Our experiments show that the performance of the proposed algorithm is superior to that of more traditional algorithms.


"A System for Real-time Competitive Market Intelligence", S. M. Weiss and N. K. Verma . in Eigth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (SIGKDD), Edmonton, Canada, July 2002.

A method is described for real-time market intelligence and competitive analysis. News stories are collected online or a designated group of companies. The goal is to detect critical differences in the text written about a company versus the text or its competitors. A solution is found by mapping the task into a non-stationary text categorization model. The overall design consists o the following components:(a)a real-time crawler that monitors newswires or stories about the competitors (b)a conditional document retriever that selects only those documents that meet the indicated conditions (c)text analysis techniques that convert the documents to a numerical format (d)rule induction methods for finding patterns in data (e)presentation techniques or displaying results. The method is extended to combine text with numerical measures, such as those based on stock prices and market capitalizations, that allow or more objective evaluations and projections.


"Clustering by Pattern Similarity in Large Data Sets", H. Wang, J. Yang, W. Wang, and P.S. Yu, Proc. ACM SIGMOD Conference, Madison, WI, June 2002.

In most of the clustering models, the concept of similarity is based on distance. Here we explore a more general type of similarity, where two objects are considered to be similar if they exhibit a coherent pattern on a subset of dimensions. For instance, in DNA microarray analysis, the expression levels of two genes may rise and fall synchronously in response to a set of environmental stimuli. Although the magnitude of their expression levels may not be close, the patterns they exhibit can be very much alike. Discovery of such clusters of genes is essential in revealing significant connection in gene regulatory networks. E-commerce applications, such as collaborative filtering, can also benefit from the new model, which captures not only the closeness of values of certain leading indicators, but also the closeness of (purchasing, browsing, etc.) patterns exhibited by the customers. Our work introduces an effective algorithm to detect such clusters.


"Discovering actionable patterns in event data", J.L, Hellerstein, S. Ma and C. Perng, IBM system journal, Vol. 41, No. 3, 2002.

Applications such as those for systems management and intrusion detection employ an automated real-time operation system in which sensor data are collected and processed in real time. Although such a system effectively reduces the need for operation staff, it requires constructing and maintaining correlation rules. Currently, rule construction requires experts to identify problem patterns, a process that is time-consuming and error-prone. In this paper, we propose reducing this burden by mining historical data that are readily available. Specifically, we first present efficient algorithms to mine three types of important patterns from historical event data: event bursts, periodic patterns, and mutually dependent patterns. We then discuss a framework for efficiently mining events that have multiple attributes. Last, we present Event Correlation Constructor—a tool that validates and extends correlation knowledge.


"Mining Generalised Disjunctive Association Rules," Amit A. Nanavati, Krishna P. Chitrapura, Sachindra Joshi, and Raghu Krishnapuram, CIKM2001, Atlanta, November 2001, pp. 482-489

This paper introduces generalised disjunctive association rules such as "People who buy bread also buy butter jam", and "People who buy either raincoats or umbrellas also buy flashlights". A generalised disjunctive association rule allows the disjunction of conjuncts, "People who buy jackets also buy bow ties or neckties and tiepins". Such rules capture contextual inter-relationships among items.Given a context (antecedent), there may be a large number of generalised disjunctive association rules that satisfy the minsupp and minconf constraints. It is computationally expensive to find all such rules. We present algorithm thrifty traverse which borrows concepts such as subsumption from propositional logic to mine a subset of such rules in a computationally feasible way. We experimented with our algorithm on US census data as well as transaction data from a grocery superstore to demonstrate its computational feasibility, utility and scalability.


"Privacy Preserving Mining of Association Rules", A. Evfimievski, R. Srikant, R. Agrawal, J. Gehrke, in Eigth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (SIGKDD), Edmonton, Canada, July 2002.

This paper presents a framework for mining association rules from transactions consisting of categorical items where the data has been randomized to preserve privacy of individual transactions. While it is feasible to recover association rules and preserve privacy using a straightforward uniform randomization, the discovered rules can unfortunately be exploited to find privacy breaches. We analyze the nature of privacy breaches and propose a class of randomization operators that are much more effective than uniform randomization in limiting the breaches. We derive formulae for an unbiased support estimator and its variance, which allow us to recover itemset supports from randomized datasets, and show how to incorporate these formulae into mining algorithms. Finally, we present experimental results that validate the algorithm by applying it on real datasets.


"Sequential Cost-Sensitive Decision Making with Reinforcement Learning", E. Pednault, N. Abe, B. Zadrozny, H. Wang, W. Fan, and C. Apte, in Eigth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (SIGKDD), Edmonton, Canada, July 2002.

Recently, there has been increasing interest in the issues of cost-sensitive learning and decision making in a variety of applications of data mining. A number of approaches have been developed that are effective at optimizing cost-sensitive decisions when each decision is considered in isolation. However, the issue of sequential decision making, with the goal of maximizing total benefits accrued over a period of time instead of immediate benefits, has rarely been addressed. This paper proposes a novel approach to sequential decision making based on the reinforcement learning framework. Our approach attempts to learn decision rules that optimize a sequence of cost-sensitive decisions so as to maximize the total benefits accrued over time. We use the domain of targeted marketing as a testbed for empirical evaluation, and show that the proposed method for optimizing total accrued benefits outperforms the usual targeted-marketing methodology of optimizing each promotion in isolation.


Seminal Papers
"A Probabilistic Estimation Framework for Predictive Modeling Analytics", by C. Apte, R. Natarajan, E.P.D. Pednault, and F. Tipu, in IBM Systems Journal, Vol. 41, No. 3, August 2002.

IBM ProbE (for probabilistic estimation) is an extensible, embeddable, and scalable modeling engine, particularly well-suited for implementing segmentation-based modeling techniques, wherein data records are partitioned into segments and separate predictive models are developed for each segment. We describe the ProbE framework and discuss two key business solutions that have been built using ProbE: the IBM Underwriting Profitability Analysis for insurance risk management, and the IBM Advanced Targeted Marketing for Single Events for direct mail database marketing.


"Mining Associations between Sets of Items in Massive Databases", R. Agrawal, T. Imielinski and A. Swami., Proc. of the ACM SIGMOD Int'l Conference on Management of Data, Washington D.C., pages 207-216, May 1993.

We are given a large database of customer transactions. Each transaction consists of items purchased by a customer in a visit. We present an efficient algorithm that generates all significant association rules between items in the database. The algorithm incorporates buffer management and novel estimation and pruning techniques. We also present results of applying this algorithm to sales data obtained from a large retailing company, which shows the effectiveness of the algorithm.


"Privacy-Preserving Data Mining", R. Agrawal and R. Srikant, Proc. of the ACM SIGMOD Conference on Management of Data, Dallas, May 2000.

A fruitful direction for future data mining research will be the development of techniques that incorporate privacy concerns. Specifically, we address the following question. Since the primary task in data mining is the development of models about aggregated data, can we develop accurate models without access to precise information in individual data records? We consider the concrete case of building a decision-tree classifier from training data in which the values of individual records have been perturbed. The resulting data records look very different from the original records and the distribution of data values is also very different from the original distribution. While it is not possible to accurately estimate original values in individual data records, we propose a novel reconstruction procedure to accurately estimate the distribution of original data values. By using these reconstructed distributions, we are able to build classifiers whose accuracy is comparable to the accuracy of classifiers built with the original data.