A variety of data mining techniques, recently combined in a series of new products and services, enables companies to make maximum use of information that they collect.
In Brief:
Data mining technology permits organizations to make the most effective use of the vast amounts of data that they have gathered about customers, suppliers and industry trends.
During the past six years, IBM scientists have devised a series of "mining" techniques that can find associations among data, make predictions on the basis of current records and identify clusters of related data. Those methods have now been incorporated into a wide range of products and services that users can tailor to their specific data mining needs.
Names and addresses. Social security numbers. Buying habits. Geographic preferences. Suppliers' price lists. Rivals' activities and pricing strategies. Prices and other characteristics of the many thousands of items on the shelves of the typical supermarket. Modern companies are awash in data on customers, clients, suppliers, competitors and industry trends. But they collect their data far more effectively than they use it. The CEO of a retail company who approached IBM's Research Division for help with that problem six years ago estimated that his company was using only 4 percent of the data that it gathered.
The problem lies in knowing how to use the data effectively. The chosen method for doing so is data mining, a technology to which Research and its partners in IBM have made significant contributions. The work has already produced practical results in the marketplace. IBM recently unveiled a range of "decision-support offerings" to permit clients to excavate their data for high-value business intelligence - such items as hidden relationships, fresh trends and previously undetected patterns that conventional searching techniques do not reveal.
The technology has a wide variety of potential applications. Organizations can use it to detect fraud. Retailers can glean ideas on how best to target their promotional campaigns. Customers can apply the technology to identify suppliers - and suppliers to identify customers - free of geographical constraints. The key in every case is the use of data mining to extract useful pearls from a vast and apparently disordered mélange of data. "The idea is to discover information buried inside the data. You want to find combinations in data that are not readily apparent," explains Bill Pulleyblank, director of mathematical sciences at Watson, and Research's data mining strategist.
The new technology has broad appeal to two major groups of customers. It provides data analysts who specialize in mining with a selection of tools for specific problems, such as setting up rules to detect fraud. More important, it permits business executives who have little experience in analyzing data to carry out their own data mining. For example, rules for mining data are put in visual terms, and the results arrive in user-friendly forms, making it relatively simple to see the relationships in the data.
The new products and services, introduced in April 1996, provide companies of all sizes with the ability to customize their searches for significant nuggets in their collections of data. One product, the Intelligent Miner - a data mining toolkit that permits users to analyze, extract and validate data from data warehouses - can deal with data in both flat files and the DB2® family of database products, and operates on AIX®, MVS® and the parallel-processing SP2®. In fact, the ability to operate in a parallel computing environment is a key to data mining technology, because it often involves very large amounts of data.
The products include algorithms developed at IBM's Almaden Research Laboratory, and involved major contributions from many organizations across IBM. They include the Thomas J. Watson Research Center, the Santa Teresa Laboratory, the European Center for Applied Mathematics in Paris, France, the Tokyo Research Laboratory and laboratories in Rochester, Minnesota; Hursley, England; and Böblingen, Germany.
Data mining vs. database querying
Although the term "data mining" has become something of a buzzword in recent months, there is still some confusion about its definition. "Data mining is the process of extracting valid, previously unknown, and ultimately comprehensible information from large databases and using it to make crucial decisions," explains Evangelos Simoudis, vice president, Decision Support Solutions for IBM North America, who works at the Almaden Research Center.
That definition greatly extends the simple procedure of database querying that is often included under the aegis of data mining. True data mining is a far more sophisticated procedure than simply asking a question of a database. It permits users to explore and draw useful information from "unstructured" data that is arranged in apparently random patterns. Database queries require that the original data be structured in particular ways - arranged in specific fields, for example. "Certain data mining queries can be regarded as database queries," explains Watson researcher Se-June Hong, "but databases can't answer them." Pulleyblank explains the difference more succinctly. "Data mining comes in when you don't know the questions to ask," he asserts.
To illustrate the technology's sophistication, imagine a typical candidate for using the technology: a bank manager who wants to identify the best customers to contact in a promotional campaign for the bank's new offering of mutual funds.
Using traditional database searching, the manager must make a reasonable hypothesis about the types of customers likely to be the best bets for the mutual funds' promotional campaign. He or she might assume, for example, that married, two-income families with a high net worth are the most likely targets, then set up the appropriate query of the database, and try to interpret the results to determine whether the hypothesis is correct. "Data mining, in contrast, automatically discovers relationships hidden in the databank, and presents them in an understandable way," explains Ashok Chandra, director of database systems at Almaden. A search might be set up, for example, to detect clusters of bank customers with similar characteristics, and to determine which of those clusters are most likely to respond to a mutual funds promotion. "Data mining seeks useful data of two kinds," explains Hong. "That traditionally known to be useful, and that which is useful because it pops up."
Forms of data mining
IBM scientists have defined five major types of data mining operations: associations, sequential patterns, similar time series, classification and regression, and clustering. The Intelligent Miner toolkit, just announced, incorporates all five.
Invention of the association operation was initially inspired by problems in the retail business. By trawling through records from point-of-sale terminals, for example, retailers can discover what types of items sell together, then adjust their store layouts and advertising campaigns to profit most effectively from those associations. "Statistical tools are available to deal with some of the data, but the problem is the sheer amount," says Almaden researcher Rakesh Agrawal, who led the team that developed Quest, a basic tool for several data mining operations. "Looking for relationships among retail items becomes a real problem when 10,000 items are involved, as is typical in the retail business."
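The idea of finding items that sell together can be illustrated with a minimal Python sketch. It is not the Quest algorithm, whose details this article does not give; it simply counts item pairs across point-of-sale baskets and reports each pair's support (the share of baskets containing both items) and confidence (how often the second item appears when the first does). The function name, the threshold and the sample baskets are all invented for illustration.

```python
from collections import Counter
from itertools import combinations


def pair_associations(baskets, min_support=0.5):
    """Return {(a, b): (support, confidence of a -> b)} for frequent item pairs.

    Hypothetical helper for illustration only; not IBM's Quest algorithm.
    """
    n = len(baskets)
    item_counts = Counter()
    pair_counts = Counter()
    for basket in baskets:
        items = sorted(set(basket))                  # ignore repeats within one basket
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))   # every unordered pair in the basket
    rules = {}
    for (a, b), count in pair_counts.items():
        support = count / n
        if support >= min_support:
            rules[(a, b)] = (support, count / item_counts[a])
    return rules


# Invented sample data: four point-of-sale baskets.
baskets = [
    ["bread", "butter", "milk"],
    ["bread", "butter"],
    ["bread", "jam"],
    ["milk", "butter"],
]
rules = pair_associations(baskets, min_support=0.5)
for pair, (support, confidence) in sorted(rules.items()):
    print(pair, round(support, 2), round(confidence, 2))
```

Counting every pair this way shows why scale is the hard part: with 10,000 items there are roughly 50 million possible pairs, before larger combinations are even considered.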
Other novel data mining operations in Quest, whose development team included Manish Mehta, R. Srikant and John Shafer, include discovery of sequential patterns and similar time series. While association finds events that occur together from logical collections of events, the sequential-patterns operation finds common events that occur over a period of time. Examples of applications include attaching fliers advertising one type of mail-order item to invoices for another, based on the chances of add-on sales, and identifying patterns of symptoms and diseases in medical research.
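The difference between an association and a sequential pattern is the role of order over time. The following Python sketch, offered as a simplified stand-in rather than the Quest operation itself, counts how many customers bought one item and later bought another. The function name, the minimum-customer threshold and the purchase histories are assumptions made up for the example.

```python
from collections import Counter


def sequential_pairs(histories, min_customers=2):
    """Count ordered pairs (first, later) seen in at least min_customers histories.

    Each history is one customer's purchases in time order. Hypothetical helper.
    """
    counts = Counter()
    for history in histories:
        seen_pairs = set()                  # count each pair once per customer
        for i, first in enumerate(history):
            for later in history[i + 1:]:
                if later != first:
                    seen_pairs.add((first, later))
        counts.update(seen_pairs)
    return {pair: c for pair, c in counts.items() if c >= min_customers}


# Invented mail-order histories, oldest purchase first.
histories = [
    ["tent", "sleeping bag", "stove"],
    ["tent", "stove"],
    ["stove", "tent"],
]
patterns = sequential_pairs(histories)
print(patterns)
```

In this toy data, two customers bought a stove after a tent, so a flier for stoves attached to tent invoices would be the kind of add-on promotion the text describes.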
Similar time series, as its name suggests, can be used to identify similar series gathered over a period of time in large databases. Applications include identifying companies with similar patterns of growth, finding products with similar selling patterns and discovering stocks or mutual funds with similar price movements.
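One simple way to compare price movements rather than price levels is to rescale each series to zero mean and unit spread, then measure the distance between the rescaled series. The Python sketch below takes that approach; it is an assumption chosen for illustration, not the similarity measure used in Quest, and it requires series of equal length. The fund names and prices are invented.

```python
from math import sqrt


def normalize(series):
    """Rescale a series to zero mean and unit standard deviation."""
    mean = sum(series) / len(series)
    std = sqrt(sum((x - mean) ** 2 for x in series) / len(series))
    if std == 0:
        return [0.0 for _ in series]        # a flat series has no shape to compare
    return [(x - mean) / std for x in series]


def distance(a, b):
    """Euclidean distance between the normalized forms of two equal-length series."""
    na, nb = normalize(a), normalize(b)
    return sqrt(sum((x - y) ** 2 for x, y in zip(na, nb)))


def most_similar(target, candidates):
    """Return the name of the candidate series whose shape is closest to target."""
    return min(candidates, key=lambda name: distance(target, candidates[name]))


# Invented price histories for two hypothetical funds.
prices = {
    "fund_a": [10, 11, 12, 13, 14],
    "fund_b": [50, 48, 51, 47, 52],
}
best = most_similar([100, 110, 120, 130, 140], prices)
print(best)
```

Because of the rescaling, a fund priced near 10 and a stock priced near 100 count as similar when they rise in step, which is the sense of "similar price movements" in the text.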
Classification and regression uses existing data to create models of the behavior of variables in databases. For example, a credit card company might use the technique to model characteristics of individuals regarded as good risks for providing credit, as opposed to poor credit risks, based on the total record of many characteristics in its database. Such characteristics might include income, credit history, type and location of employment and more abstruse data. The capability has an almost human quality. "It permits moving into the front office and point-of-sale locations the emulated intelligence of back-office underwriters and approvers," says Chid Apte of Watson.
Clustering involves segmenting information in databases - related to customers or documents, for example - into definable, homogeneous groups on the basis of specific characteristics. The basic idea is not new; clustering has a long history in statistics. What is original in the data mining context is the application of clustering techniques to nonnumerical attributes. Edna Grossman, a member of IBM's new Decision Support Unit, which works with Watson researchers on developing data mining, points out another attribute of clustering technology. "One thing that makes it different is that you don't tell it how many clusters you want. IBM's unique clustering technology does that itself."
Research has used one other approach to data mining in a unique tool that has found value, and publicity, in the National Basketball Association (see "Nothing but Net"). That is Advanced Scout, a prototype data mining tool based on "attribute focusing" - the ability to pick out a few salient facts in huge databanks of information. In a way, this method is the opposite of clustering. What it does is help a user to manipulate data relevant to a topic in which he or she is an expert. "It makes it easy for the user to do the mining," explains Watson's Inderpal Bhandari.
The value of associations
What types of information do users seek? Users themselves had little idea when IBM's data mining project started. Retail company Marks and Spencer approached scientists at Almaden in 1990, seeking ways to mine the vast amounts of data that it possessed on its customers. "It became clear that companies wanted value from systems that the traditional approach wasn't providing," explains Agrawal.
The challenge was that of sheer size. "Nobody had looked at such huge amounts of data," says Agrawal, noting that Marks and Spencer had three to five years' worth of customers' inputs, amounting to several gigabytes. "Most of the technology had been developed for small quantities." In response, Almaden scientists developed Quest to detect less-than-obvious associations among items in databases.
Then, using that tool, the scientists conducted four studies to validate and assess the technology's market value. Subjects of the studies included two direct mail companies, a major chain of discount stores and a market information provider. After a while, recalls Agrawal, the joke was: "We know more about these companies than they know themselves." The success of those studies led to the formation of a commercial data mining venture in IBM Marketing and Sales. Later, a product group in the German software development laboratory in Böblingen was formed to build an IBM data mining product.
The most common use of association technology comes in businesses that sell tangible items to the general public. In a project outside the retail domain, the Australian Health Insurance Commission used Quest to look at associations in its data. The data mining keyed in on the over-ordering of medical tests, which it spotted by finding associations among codes for specific tests that were commonly being performed together. Human experts observed that some of the tests were redundant and could be eliminated. The effort found enough cases to realize substantial savings for the commission.
How can companies use these results? Managers who have discovered an unlikely association of items could ensure that both do not go on sale at the same time. And retailer ShopKo is using association technology to assist in its store layout, and to discover cause-and-effect relationships between store items and customers' buying habits. "In general, the data mining process has been worth millions for us," said ShopKo CIO Jim Tucker in a recent Datamation magazine article.
Meanwhile, researchers at IBM's Tokyo Research Laboratory have developed an extension to the Quest association algorithm, named SONAR (System for Optimized Numeric Association Rules), to extract items of interest from numeric data. Japanese banks, for example, are using the tool to identify customers most likely to make late credit card payments.
From past to future
Classification and regression applies most effectively to vast collections of information, such as databases of customers that include multiple fields for every individual. Searching through such databases, classification tools can identify patterns from the past that have predictive value for the future. In the "learning phase," the tools search the data to generate minimal classification rules, which the tools use to classify that collected data into groups that have several characteristics in common. Then, in the "prediction phase," the tools place fresh data in the appropriate groups, which permits them to predict the behavior of the person or object from which that data derived.
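The two phases can be made concrete with a deliberately tiny Python sketch. It learns a single-attribute rule (pick the one field whose values best predict the label in past records) and then applies that rule to a fresh record. This is a textbook-style simplification chosen for brevity, not the method used by IBM's classifiers; the field names, function names and customer records are all invented.

```python
from collections import Counter, defaultdict


def learn_rule(records, label_key):
    """Learning phase: find the attribute whose values best predict the label.

    Returns (attribute, {value: predicted label}). Hypothetical helper.
    """
    best = None
    attributes = [k for k in records[0] if k != label_key]
    for attr in attributes:
        by_value = defaultdict(Counter)
        for rec in records:
            by_value[rec[attr]][rec[label_key]] += 1
        # For each value of the attribute, predict its most common label.
        rule = {value: counts.most_common(1)[0][0] for value, counts in by_value.items()}
        errors = sum(rec[label_key] != rule[rec[attr]] for rec in records)
        if best is None or errors < best[2]:
            best = (attr, rule, errors)
    return best[0], best[1]


def predict(model, record, default=None):
    """Prediction phase: place a fresh record in a group using the learned rule."""
    attr, rule = model
    return rule.get(record[attr], default)


# Invented records of long-standing customers.
past_customers = [
    {"income": "high", "history": "good", "risk": "good"},
    {"income": "high", "history": "poor", "risk": "poor"},
    {"income": "low", "history": "good", "risk": "good"},
    {"income": "low", "history": "poor", "risk": "poor"},
]
model = learn_rule(past_customers, "risk")
new_applicant = {"income": "low", "history": "good"}
prediction = predict(model, new_applicant)
print(model[0], prediction)
```

In this toy data, credit history separates good risks from poor ones perfectly while income does not, so the learned rule relies on history alone. Real classifiers combine many fields and must cope with far larger and noisier databases.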
Apte expects the typical client to be one that needs to predict the behavior of new customers on the basis of trends shown by long-standing customers. Credit card companies, for example, which have vast stores of data on their clients' credit histories and payment patterns, can use the approach to set up a model for judging the creditworthiness of new applicants for the card.
The method can even help develop new types of markets. One trend in the credit business is wooing "deadbeats" - individuals with poor credit standings, owing to their failure to pay off loans and credit cards in a timely fashion. Some of those individuals, it turns out, change their approach to spending so effectively that they become good credit risks. A potentially profitable market is starting up to serve such individuals, who can be identified by classification and regression.
Other institutions can use the method in similar ways. Insurance companies, for example, can mine data on present policyholders to determine the number of claims that a new customer is likely to make - and then set the cost of the policy accordingly. Stockbrokers can extract predictive data on stocks and bonds, based on past performance.
The Intelligent Miner incorporates a classifier, called SLIQ (Supervised Learning in Quest), and its sibling SPRINT (Scalable, Parallelizable Induction of Decision Trees), both developed as part of the Quest project. The two have speed and the ability to develop prediction models for very large databases. They can also obtain higher accuracies by classifying large databases that other classifiers cannot handle.
Two years ago, Apte's team at Watson applied a system that they had developed called RAMP (Rule Abstraction for Modeling and Prediction) to a project for an Argentine bank, designed to detect fraudulent coins. After determining all the characteristics of real and forged coins that had passed through the bank, the team used data mining to develop simple rules for an automatic sorting system to separate real coins from forgeries.
Creating clusters
Clustering boasts the longest history in data mining. The approach originated in the mind of the Marquis de Condorcet, an 18th-century French philosopher-mathematician, who sought the best way to represent the results of an election with several candidates. In the late 1970s, IBM's European Center for Applied Mathematics proposed a so-called integer programming solution to the voter problem that, while not perfect, was close to optimal. The basic idea, says Grossman, is to ask each voter to give his or her own ranking of candidates, rather than to state which single candidate the voter likes best. Manipulating that data provides a solution that comes as close as possible to the voters' collective wishes.
That was very theoretical. However, the Paris team realized that it could use the same approach to segment customers on the basis of their collective attributes recorded in databases, such as age, zip code and spending habits. The team sought to group customers into significant clusters that are both optimally homogeneous within themselves and maximally different from all other clusters. To do so, Grossman explains, "the team thought of each attribute as a 'voter' in the Condorcet problem. Then they said, 'Let's find the clustering that agrees with as many votes as possible.'"
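The "attributes as voters" idea can be sketched in a few lines of Python. For any two customers, each attribute casts one vote: "together" if the two share the same value, "apart" if they do not. The sketch then assigns each customer greedily to the cluster with the largest positive net vote, or starts a new cluster when no existing one wins a majority. The greedy pass is an assumption made to keep the example short; it is not the integer programming approach described above, and the customer records are invented.

```python
def agreement(a, b):
    """Net vote of the attributes on putting records a and b in the same cluster."""
    same = sum(1 for x, y in zip(a, b) if x == y)
    return same - (len(a) - same)           # "together" votes minus "apart" votes


def condorcet_clusters(records):
    """Greedy sketch: join the cluster with the best positive net vote, else start one."""
    clusters = []
    for rec in records:
        best_gain, best_cluster = 0, None
        for cluster in clusters:
            gain = sum(agreement(rec, other) for other in cluster)
            if gain > best_gain:
                best_gain, best_cluster = gain, cluster
        if best_cluster is None:
            clusters.append([rec])          # no cluster wins a majority of votes
        else:
            best_cluster.append(rec)
    return clusters


# Invented customers described by nonnumerical attributes.
customers = [
    ("urban", "renter", "young"),
    ("urban", "renter", "middle"),
    ("rural", "owner", "senior"),
    ("rural", "owner", "middle"),
]
clusters = condorcet_clusters(customers)
print(len(clusters), clusters)
```

Two points from the text show up even in this toy version: the attributes need not be numerical, and the number of clusters is never specified in advance. It emerges from the votes.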
Recently, neural network technology developed at IBM's laboratory in Rochester, Minnesota, has been added to the toolkit in an effort to strengthen its ability to detect clusters. Such networks are particularly good at handling numerical data.
Marketing represents an obvious target for clustering. The mining technique can segment potential customers according to specific sets of attributes. Other areas of application include the insurance industry - IBM's group at Hursley in the United Kingdom is developing clustering-based solutions for European insurance companies - and the sorting of documents. The ultimate goal, says Grossman, is automatic or machine-assisted classification of patents, newswires and similar texts.
Fresh applications will undoubtedly emerge as the technology makes its way into the market. And the availability of data mining will persuade companies to adapt their methods of operation. "The ability to discover hidden facts in its data will give any company a huge advantage in its competitive position," asserts Pulleyblank. "Knowing better what their customers really want will revolutionize the way companies do business."
Nothing but Net
Advanced Scout is IBM data mining software geared to detecting statistical patterns, including rare events. It is written specifically with user-friendliness in mind. And it has made an instant hit in particularly public arenas: the stadiums in which National Basketball Association teams play.
The project started at the Thomas J. Watson Research Center in 1990 as a software engineering effort to help IBM improve its software production. It was designed to find faults in the production process rather than in the software. "We realized that we had to do some data mining to find process problems," recalls Inderpal Bhandari of Watson, who heads the Advanced Scout group. "Instead of using skilled data mining analysts, we had to gear it for the development team, who were less experienced in the technique."
Even in an early, primitive form, says Bhandari, the user-friendly program became a really successful internal product. Then, he adds, "It hit me that it was so successful because we had given the production team the ability to do the mining." At that point, Bhandari looked for an external situation that could benefit from the technology. A fan of professional basketball since his student days at the University of Massachusetts, when he followed the Boston Celtics, Bhandari decided to concentrate on the NBA. He had read that Pat Riley, then the coach of the New York Knicks, kept a database; early last year, he approached the Knicks with the idea of applying Advanced Scout. Riley's assistant coach, Bob Salmi, agreed to give it a try.
The technology benefits from the fact that the NBA keeps a record of every "event" in every game: passes, shots, rebounds, double-teaming of one player by the opposition team and the like. By mining the data effectively, Bhandari realized, the software can isolate happenings that coaches don't necessarily notice when they watch games live and on film.
The mining revealed something that the Knicks coaching staff had missed. Double-teaming of one player can often open up another for an easy shot. But when the Chicago Bulls played the Knicks, the shooting percentage after Knicks center Patrick Ewing was double-teamed was remarkably low, indicating that the Knicks weren't reacting effectively to the double-teams. To see why, the New York coaching staff looked carefully at tapes of games against Chicago. They showed that Chicago's players broke away from the double-team so quickly that they could get to the open shooter before he could line up his shot effectively.
With that knowledge, the coaches were able to devise alternative strategies to deal with the double-teams. "It's like having another person on your staff who tells you what he notices," enthused Salmi.
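The kind of finding described above, a shooting percentage that is unusually low in one particular situation, can be illustrated with a short Python sketch. It compares the success rate within each context against the overall rate and lists the contexts that deviate most. This is only a rough stand-in for attribute focusing, whose actual workings the article does not detail; the function name, the event fields and the numbers are invented and do not reflect real game data.

```python
from collections import defaultdict


def salient_contexts(events, outcome_key, min_events=3):
    """Rank (attribute, value) contexts by how far their rate is from the overall rate.

    Hypothetical helper; outcome values are 1 for success and 0 for failure.
    """
    overall = sum(e[outcome_key] for e in events) / len(events)
    groups = defaultdict(list)
    for e in events:
        for key, value in e.items():
            if key != outcome_key:
                groups[(key, value)].append(e[outcome_key])
    findings = []
    for context, outcomes in groups.items():
        if len(outcomes) >= min_events:     # skip contexts with too few events
            rate = sum(outcomes) / len(outcomes)
            findings.append((abs(rate - overall), context, rate))
    findings.sort(key=lambda f: f[0], reverse=True)
    return overall, [(context, rate) for _, context, rate in findings]


# Invented shot events: was the center double-teamed, and was the shot made?
events = (
    [{"double_team": True, "made": 0}] * 3
    + [{"double_team": True, "made": 1}] * 1
    + [{"double_team": False, "made": 1}] * 3
    + [{"double_team": False, "made": 0}] * 1
)
overall, findings = salient_contexts(events, "made")
print(overall, findings)
```

The output only flags where to look. As in the Knicks example, it takes an expert reviewing the game tapes to explain why the numbers come out that way.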
Last September, following its successful tryout, IBM offered Advanced Scout to the NBA, for which it was a corporate sponsor. The NBA, in turn, gave its 29 teams the opportunity to use it. Sixteen teams adopted the technology last season and two more - including the Miami Heat, which Riley now coaches - have signed on since the season ended.
This application, says Bhandari, has proved its user-friendliness, because none of the coaches who have used it have computer backgrounds. The project has one other distinction. "It went," reports Bhandari, "directly from Research to the customers."
Last summer, that customer base expanded to a special group of clients. In collaboration with the Knicks and the NBA, IBM introduced Advanced Scout to inner-city public school students. The program, called HOOPS (Help Out Our Public Schools), permits students to analyze game data using Advanced Scout. IBM and the NBA now plan to establish HOOPS in every city whose local NBA team is using Advanced Scout.