Software That Automatically Sorts Incoming
Messages Could Speed Responses, Improve Service And Yield Valuable Clues
About Customer Needs
In Brief:
A machine-learning system being developed at IBM Research can quickly
classify and route email messages in a sample provided by NationsBank. It
could also be adapted to provide templates that speed employees'
responses to customer queries, or reply on its own. Generating and refining
its own classification rules on new data, the system is gradually improving
in accuracy.
Sooner or later, anyone who uses email discovers that there really can be
too much of a good thing. Large businesses are especially hard hit. Some
corporations, flooded with tens of thousands of electronic messages a
month, are forced to hire scores of people just to sort and respond to the
steady stream of inquiries.
New software being developed at IBM Research
could provide the tools to manage this important but sometimes overwhelming
medium. One prototype analyzes, categorizes and routes messages according
not merely to their subject lines but to their contents. Potentially, it
could save a company thousands of person-hours -- and millions of dollars
-- each year while improving customer service.
NationsBank learned just how daunting large quantities of email can be. In
1996, as its electronic banking clientele grew from 50,000 to 250,000, the
number of messages soared from a few hundred to 20,000 a month. The
following year, the bank had to hire about 100 people to cope with the
extra load. To complicate matters, three or four times as much email
arrived on Monday as on any other day, creating a major work-balancing
problem. NationsBank wanted to reply quickly in order to provide better
customer service than its competitors, but the bank's studies showed that
an employee could reliably answer only 20 to 25 email requests per day. Just
routing the mail was a formidable task.
The bank asked Jim Deupree, a principal in IBM Consulting's Banking,
Finance and Securities Industry unit, if software could be developed to
categorize incoming email automatically.
Deupree approached David Johnson, manager of natural language understanding
at IBM's Thomas J. Watson Research Center, who had been working on email
categorization. The inquiry led to the establishment of a "first-of-a-kind"
project with NationsBank, which began in April 1997.
The resulting email classifier works with Lotus Notes® and runs on the
Windows 95®, Windows NT® and AIX® operating systems. The system
continues to be improved, and could ultimately be developed into a
commercial product. Related work is also under way to create categorization
software that individuals can use to simplify the task of filing mail in
Lotus Notes.
Filling A Tall Order
When NationsBank approached IBM, it soon became apparent that the bank
needed a highly sophisticated system. The software would have to be far
more powerful than the "filter" functions provided by email packages that
file incoming messages according to the sender's name, or a word in the
subject line.
For a bank, the sender's name would suggest nothing about the message's
content, and a word like "loan" in a subject line could refer to a
mortgage, home improvement, business or car loan -- each of which would
need to be routed differently. Even if the subject line specified "mortgage
loan," the sender could be asking for payment status, a copy of a statement
or tax information -- again, different tasks handled by different people.
To effectively categorize email, software would have to analyze the full
text of every message.
To speed responses, the bank also wanted a
system that could provide templates for replies. If the request was for tax
information, the system could provide a boilerplate letter, so a customer
service representative would have only to type in a few particulars.
Ideally, the system would even respond to standard requests automatically.
Only the small number of inquiries remaining would require full human
attention.
The bank hoped to reap downstream benefits, as well. By analyzing past
email, the system should be able to unearth patterns in customer complaints
or requests, yielding strong clues about new products or services worth
developing and old ones that need fixing, as well as
opportunities for cross-selling.
To address the NationsBank project, Johnson formed a team that included
Watson colleagues Fred Damerau, Thilo Goetz, Frank Oles and Thomas Hampp.
Their work was motivated by a text categorization system developed several
years earlier by Chid Apte, manager of data abstraction research at Watson,
Damerau, and Professor Sholom Weiss, a visiting scientist from Rutgers
University. That system was based on a machine-learning algorithm known as
Swap, written by Weiss. Swap discovered classification rules by
systematically searching training data, in which the appropriate categories
had been preassigned, for combinations of words predictive of each category.
The trio had tested Swap on the Reuters financial newswire database in a
joint project with the Reuters news agency, which wanted a program to
create the category headers under which stories were sent over the wires,
freeing editors from the chore. The researchers trained Swap on some 10,000
existing stories, then turned it loose on several thousand more.
The Swap-based system scored high on a key metric known as the break-even
point -- the point at which the system's recall rate (the percentage of the
items belonging to a category that the system actually assigns to it)
matches its precision (the percentage of the items it assigns to a category
that truly belong there). The system achieved a break-even point
of 80.5 percent on all of the 93 Reuters categories, compared with a
previous best of 67 percent. Although Reuters needed guarantees of accuracy
that the Swap-based system could not provide, the system has since become
recognized as a seminal contribution to the text-classification field.
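To make the break-even idea concrete, here is a minimal Python sketch. It assumes, hypothetically, that a categorizer attaches a confidence score to each item for a single category; the Swap-based system itself produced rules, and the article does not say how its break-even point was computed. The scores, labels and function names are invented for illustration. Lowering the score threshold raises recall and usually lowers precision, and the break-even point is where the two meet.

```python
def precision_recall(scores, labels, threshold):
    """Precision and recall when every item scoring >= threshold gets the category."""
    assigned = [label for score, label in zip(scores, labels) if score >= threshold]
    true_positives = sum(assigned)
    total_relevant = sum(labels)
    precision = true_positives / len(assigned) if assigned else 1.0
    recall = true_positives / total_relevant if total_relevant else 1.0
    return precision, recall


def break_even_point(scores, labels):
    """Sweep thresholds and return (precision, recall, threshold) where the gap is smallest."""
    best = None
    for threshold in sorted(set(scores), reverse=True):
        p, r = precision_recall(scores, labels, threshold)
        gap = abs(p - r)
        if best is None or gap < best[0]:
            best = (gap, p, r, threshold)
    _, p, r, threshold = best
    return p, r, threshold


# Invented data: 1 means the item truly belongs to the category, 0 means it does not.
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20]
labels = [1, 1, 0, 1, 1, 0, 0, 0]

print(break_even_point(scores, labels))
```

On this toy data, precision and recall both equal 0.75 at a threshold of 0.70.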
Johnson's team faced some stiff challenges. They needed to build a system
that would outperform the Swap-based system on more difficult data.
NationsBank had identified more than 100 types of responses it sent to
email correspondents; these would form the categories.
The team first developed a text categorizer toolkit that helped them
determine the best techniques for selecting the words and phrases to be
used in training. They also evaluated three machine-learning algorithms,
which led to the selection of a decision-tree system developed at IBM's
Almaden Research Center by the data mining and decision support group
managed by Rakesh Agrawal. The system is part of Intelligent Miner,
an IBM product that uncovers patterns in data. Johnson's team also wrote a
program that converts decision trees to classification rules and developed
a general categorization engine that applies classification rules to new
documents.
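The conversion from a decision tree to classification rules can be sketched as follows. This is not the team's program; it is a minimal illustration that assumes a hypothetical tree format in which each node tests whether a word's count reaches a threshold and each leaf says whether the message belongs to the category. Every path that ends in a "yes" leaf becomes one rule.

```python
def tree_to_rules(node, path=()):
    """Return one rule (a tuple of conditions) for each path ending in a True leaf.

    A node is either a bool leaf or a tuple:
        (word, threshold, subtree_if_count_below, subtree_if_count_at_or_above)
    """
    if isinstance(node, bool):
        return [path] if node else []
    word, threshold, below, at_or_above = node
    return (
        tree_to_rules(below, path + ((word, "<", threshold),))
        + tree_to_rules(at_or_above, path + ((word, ">=", threshold),))
    )


# Hypothetical tree for one category: "apply" at least once, then "card" at least once.
tree = ("apply", 1, False, ("card", 1, False, True))

rules = tree_to_rules(tree)
print(rules)
```

The single rule that comes out reads as a conditional: if "apply" occurs at least once and "card" occurs at least once, the message belongs to the category.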
They first benchmarked the prototype email classifier on the Reuters data
set, achieving a precision of 88 percent and a recall of 78 percent on all
93 categories. This was quite encouraging, since NationsBank was
particularly interested in precision. "As far as I am aware, this result is
among the best ever reported on Reuters," Johnson says. Beginning in
February, Johnson's team
started working on NationsBank email. In March, they trained the classifier
on almost 5,000 messages and then presented it with about 1,000 new ones.
The preliminary experiments on 14 categories of email received by one bank
center yielded an average precision of 91 percent and an average recall of
81 percent. "We were very excited about the result," Johnson says. "With
more data, there is every reason to believe the current system will do even
better."
To date, the system performs well enough to meet some, but not all, of the
performance requirements of NationsBank. For example, says Johnson, it is
not yet precise enough to respond automatically to incoming mail -- that
would require a percentage in the very high 90s. But the system's current
precision would be good enough for prompting customer service
representatives with reply templates. Indeed, work is under way with Lotus
Consulting to integrate the categorizer into a Domino-based Lotus
Notes system that customer service representatives use to respond to email.
Whereas they must now scroll through a list of response templates, the
categorizer will pop up a suggested response template, speeding up this
tedious task.
The Watson researchers are continuing to improve the classifier. One
development in the works is a machine-learning algorithm specifically
designed for text categorization that, unlike the current algorithm, can be
trained incrementally and can provide confidence measures.
Along with Weiss and Apte, the team is also experimenting with a data
sampling technique called "boosting," which applies multiple decision trees
to the same data. Used with other systems, boosting achieved results
superior to all previously reported results on the Reuters domain. "We
expect similar results when we use boosting with our current algorithm,"
Johnson predicts.
As more messages from NationsBank become available, the team plans to
continue to experiment with the classifier, to see if the program can
improve further. Marshall Schor, manager of knowledge systems, who oversees
Johnson's group, is eager for the answer, because it will show how far the
system's machine-learning capability can refine rules. "That," he says, "is
the key to classifying a message that humans readily understand but that
lacks certain key terms -- for example, a request for a mortgage that
reads, 'I want to build a house, and need money.'"
Branching Out
The scope of tasks the email classifier could
tackle is broad, spanning numerous industries, according to Deupree. "Any
company that provides general product information or customer service could
use the system to answer requests for parts, service or documentation," he
says.
Eventually, the approach could be applied to voice messages as well -- a
boon to call centers -- since the rules needed to analyze content are the
same. In collaboration with the data abstraction research group, Johnson's
group is also working with IBM Global Services on "intelligent call
routing," which involves categorizing summaries of telephone conversations
describing problem statements to determine the proper work queue. Companion
work on categorizing voice is in progress both at Watson and at IBM's Haifa
Research Laboratory.
"Personally, I'd love it if IBM had an email classification product for its
own use," says
Bill Pulleyblank, director of Watson's mathematical sciences department.
"It would save me at least an hour a day."
Mark Fischetti is a freelance science writer in Lenox,
Massachusetts.
More Information:
Metrics for Categorization
Your Personal Email Assistant
Teaching Itself to Learn
Metrics for Categorization
The abilities of text categorizers can be measured. First, sample data
with preassigned categories are randomly divided into a training set and an
independent test set. The training set is provided to the machine-learning
algorithm, which attempts to discover rules that will correctly predict the
categories of unseen data in the test set.
Two metrics are used: precision and recall. High precision means that few
of the categories the system assigned to test documents were wrong, whereas
high recall means that the system found most of the categories to which the
documents actually belonged.
Both metrics are needed to evaluate an algorithm. A system could achieve
perfect recall at the expense of precision by simply labeling each document
with every category. Conversely, a system might achieve perfect precision
at the expense of recall by labeling only one document out of a large
collection but labeling it correctly. Researchers often report
text categorization results in terms of a precision-recall "break-even
point," a hypothetical point at which precision and recall are the same.
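The two extreme cases described above can be checked with a few lines of Python. This sketch pools all category assignments before dividing (often called micro-averaging), which is one common convention and an assumption here; the article does not say how its figures were averaged. The documents and category names are invented.

```python
def micro_precision_recall(predicted, actual):
    """Precision and recall over per-document sets of categories, pooled together."""
    correct = sum(len(p & a) for p, a in zip(predicted, actual))
    assigned = sum(len(p) for p in predicted)
    relevant = sum(len(a) for a in actual)
    precision = correct / assigned if assigned else 1.0
    recall = correct / relevant if relevant else 1.0
    return precision, recall


categories = {"earnings", "grain", "trade"}
actual = [{"earnings"}, {"grain"}, {"trade"}, {"earnings"}]

# Extreme 1: label every document with every category.
label_everything = [set(categories) for _ in actual]
# Extreme 2: label a single document, and get it right.
label_one = [{"earnings"}, set(), set(), set()]

print(micro_precision_recall(label_everything, actual))  # recall 1.0, precision 1/3
print(micro_precision_recall(label_one, actual))         # precision 1.0, recall 0.25
```

Neither extreme is useful in practice, which is why both numbers are reported together.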
Your Personal Email Assistant
You walk into your office, hang up your coat and check your email.
There are 28 new messages. It's the same chore every morning. Don't you
wish your computer could help you file them all? If Jeff Kephart succeeds,
it soon may.
Regardless of which email program you use, filing mail is tedious. To save
a message, you click on an icon that says something like "Move to Folder."
You then have to scroll and click through levels of folders to find the
right destination. "It takes about 10 seconds," says Kephart, manager of
agents and emergent phenomena at IBM Research. "That may not sound like much,
but it's enough of a barrier to cause people like me to let email pile up,
and finding anything in that mess becomes a nightmare."
Kephart and his colleagues Rich Segal and
Hoi Chan have created software that learns how you file your mail. As you
open a message, the program scans the contents and presents buttons showing
the three most likely folders for saving it. Click one, and the message is
filed there. "It only takes about one and a half seconds," Kephart says.
The program must be trained on past email, but it also learns from the
manual filings you may make each day, so its predictive accuracy continues
to improve. The researchers trained the program on 500 messages and 25
possible folders. When they tested it on 500 new messages, one of the three
buttons showed the correct folder
98 percent of the time.
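The 98 percent figure is what is often called top-3 accuracy: a message counts as a success if the correct folder is among the three suggestions. A minimal sketch of that measurement follows; the folder names and rankings are invented, and nothing here reflects how Kephart's program actually ranks folders.

```python
def top_k_accuracy(ranked_folders, correct_folders, k=3):
    """Fraction of messages whose correct folder is among the first k suggestions."""
    hits = sum(
        correct in ranked[:k]
        for ranked, correct in zip(ranked_folders, correct_folders)
    )
    return hits / len(correct_folders)


# Invented example: each inner list is the program's ranking for one message.
ranked_folders = [
    ["Budget", "Travel", "Patents", "Staff"],
    ["Travel", "Staff", "Budget", "Patents"],
    ["Patents", "Budget", "Staff", "Travel"],
    ["Staff", "Patents", "Budget", "Travel"],
]
correct_folders = ["Travel", "Travel", "Travel", "Staff"]

print(top_k_accuracy(ranked_folders, correct_folders))       # three buttons
print(top_k_accuracy(ranked_folders, correct_folders, k=1))  # a single button
```

Here three of the four messages are covered by three buttons (0.75), but only two by a single button (0.5), which suggests why offering three choices rather than one is attractive.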
The mail categorizer was designed to work with Lotus Notes. "But," Kephart
says, "nothing would prevent it from being used in just about any
commercial email application."
The next stage of development is to test the program for several months on
a handful of IBM Research employees, who will use it to file 100,000
messages. "A lot of people have volunteered," says Kephart.
Teaching Itself to Learn
The key to the email classifier's performance is a text categorizer, which,
during a preliminary training phase, creates the rules by which email is
sent to one category or another. Rules are formulated in several steps, using
a collection of email already categorized by the bank. First, the system
scans all the email and eliminates words such as "the" and "Sincerely" that
are common to most email. As it proceeds, it builds a "local dictionary"
of characteristic terms for each category. These dictionaries are then
submitted to a machine-learning algorithm.
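The dictionary-building step can be sketched in Python. This is an illustration under stated assumptions, not the team's toolkit: the stop-word list is hypothetical (the article names only "the" and "Sincerely"), "characteristic" is taken to mean simply "most frequent in that category," and the sample messages are invented.

```python
import re
from collections import Counter, defaultdict

# Hypothetical stop list of words common to most email.
STOP_WORDS = {"the", "sincerely", "a", "an", "and", "to", "of", "for",
              "i", "me", "my", "please"}


def tokenize(text):
    """Lowercase the text and split it into words."""
    return re.findall(r"[a-z']+", text.lower())


def build_local_dictionaries(messages, top_n=2):
    """messages: iterable of (category, text) pairs already labeled by people.

    Returns {category: the top_n most frequent non-stop words in that category}.
    """
    counts = defaultdict(Counter)
    for category, text in messages:
        counts[category].update(w for w in tokenize(text) if w not in STOP_WORDS)
    return {cat: [w for w, _ in c.most_common(top_n)] for cat, c in counts.items()}


messages = [
    ("Mortgage Loan", "Please send the payment status of my mortgage. Sincerely, Pat"),
    ("Mortgage Loan", "I need a copy of my mortgage statement."),
    ("Credit Card Application", "I want to apply for a credit card."),
    ("Credit Card Application", "Please tell me how to apply for the card."),
]

dictionaries = build_local_dictionaries(messages)
print(dictionaries)
```

With these four messages, "mortgage" heads the first dictionary, and "apply" and "card" make up the second.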
For each email category, the algorithm builds a decision tree that
determines which words -- and how many occurrences of those words --
distinguish messages within that category from those in other categories.
The algorithm does this by asking questions such as "Which word best splits
the data into a 'Mortgage Loan' class and everything else? Which other word
would best refine the previous split?" Eventually, the system decides it
can no longer improve by asking new questions and so halts. A program
developed by Johnson's team translates the trees into rules, which are
expressed as conditionals -- for example, if the words "apply" and "card"
each occur at least once, the message belongs in the "Credit Card Application"
category.
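The question "Which word best splits the data?" can be illustrated with a short Python sketch. The article does not say how the Intelligent Miner decision-tree system scores a split, so information gain is used here purely as an assumption; it is one widely used criterion. For brevity the sketch tests only whether a word is present, not how many times it occurs, and the messages are invented.

```python
import math


def entropy(labels):
    """Shannon entropy, in bits, of a list of True/False labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def information_gain(messages, labels, word):
    """Reduction in entropy from splitting messages on the presence of word."""
    with_word = [lab for msg, lab in zip(messages, labels) if word in msg]
    without = [lab for msg, lab in zip(messages, labels) if word not in msg]
    n = len(labels)
    remainder = (len(with_word) / n) * entropy(with_word) \
        + (len(without) / n) * entropy(without)
    return entropy(labels) - remainder


def best_split_word(messages, labels):
    """The word whose presence best separates the category from everything else."""
    vocabulary = sorted(set().union(*messages))
    return max(vocabulary, key=lambda w: information_gain(messages, labels, w))


# Each message is reduced to its set of words; True marks the "Mortgage Loan" class.
messages = [
    {"mortgage", "payment", "status"},
    {"mortgage", "statement", "copy"},
    {"apply", "card"},
    {"car", "loan", "payment"},
]
labels = [True, True, False, False]

print(best_split_word(messages, labels))
```

A real tree builder would repeat this question on each resulting subset until no further split helps, which corresponds to the halting behavior described above.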
Once the rules are devised, determining the category of a new message is
simple and fast. The email classifier simply counts the occurrences of the
words in the text and works through the set of rules to see which, if any,
apply. If it does not find a rule, it leaves the category blank, for a
human operator to determine. The proper categories for those messages, and
corrections of the system's mistaken categorizations, can be fed back to
the system so that it can refine its own rules, improving its
performance.
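Applying the rules to a new message amounts to counting words and checking conditions. The sketch below assumes a hypothetical rule format, a list of (word, minimum count) pairs per category, modeled on the "apply" and "card" example; the actual rule syntax of the categorization engine is not given in the article.

```python
import re
from collections import Counter

# Hypothetical rules: each category is paired with (word, minimum_count) conditions.
RULES = [
    ("Credit Card Application", [("apply", 1), ("card", 1)]),
    ("Mortgage Loan", [("mortgage", 1)]),
]


def classify(text, rules=RULES):
    """Return every category whose rule matches; an empty list means 'leave blank'."""
    counts = Counter(re.findall(r"[a-z']+", text.lower()))
    return [
        category
        for category, conditions in rules
        if all(counts[word] >= minimum for word, minimum in conditions)
    ]


print(classify("I would like to apply for a credit card."))
print(classify("I want to build a house, and need money."))
```

The second message matches no rule and comes back empty, so it would be left for a human operator. Once that person assigns a category, the message can be added to the training data.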
The rules devised by the algorithm are not always
intuitive, Johnson explains. "Some are based on relationships that no human
would ever conceive or have thought sensible." For example, when the email
classifier was run on the Reuters sample, it created a rule stating that,
"If 'U.S.' occurs no more than twice, and 'estimate' occurs no more than
once, and 'vs.' occurs at least once, then the article belongs in the
category 'Earnings.'" This rule proved wrong only twice when applied to 942
articles, for a precision of 99.8 percent.
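The quoted rule, with its upper and lower bounds on word counts, and the precision figure can both be written out in a few lines. The word counts below are invented, and how the system tokenized terms such as "U.S." and "vs." is not described in the article, so the counts are supplied directly.

```python
from collections import Counter


def earnings_rule(counts):
    """The quoted rule: 'U.S.' at most twice, 'estimate' at most once, 'vs.' at least once."""
    return counts["U.S."] <= 2 and counts["estimate"] <= 1 and counts["vs."] >= 1


# Hypothetical word counts for two articles.
earnings_story = Counter({"vs.": 4, "net": 2, "U.S.": 1})
trade_story = Counter({"U.S.": 5, "trade": 3})

print(earnings_rule(earnings_story))  # True
print(earnings_rule(trade_story))     # False

# Precision of the rule as reported: wrong twice in 942 applications.
articles_matched, errors = 942, 2
precision = (articles_matched - errors) / articles_matched
print(f"{precision:.1%}")  # 99.8%
```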