IBM Research

Think Research


 

Featured Concept
Made to order

By Victor D. Chase

Most of us like to have a certain amount of structure in our lives, yet all too frequently the complexity of modern-day living confronts us with what seems more like chaos than order. When that happens, we strive to regain a level of organization that helps us deal with the complexities of our imperfect world.

Much the same holds true in the inanimate world of computerized data, which is increasingly confronted with confusion merely by virtue of its exponential growth. As a result, IBM researchers are developing a series of complex software systems aimed at bringing this seemingly uncontrollable mass of information, known as unstructured data, into an orderly, usable, structured state.

Data disperses
The growth of unstructured data tracks the evolution of computer utilization. During the 1960s, computing was primarily a backroom, punch-card operation, used for basic functions like billing and tracking inventory. Gradually, computers moved to the front office to handle more people-oriented transactions such as making airline and hotel reservations. Most of these tasks could be accomplished by categorizing and accessing information in ordered rows and columns of numbers, known as relational databases. It is precisely this type of routine "grunt work" that computers were originally designed to handle.

With the advent of the Internet and e-commerce, computers had to deal with huge stores of information recorded in a variety of disparate formats, including text, audio, video and graphics. This unstructured data does not fit easily into relational databases. Consequently, the job of computers has become much more difficult. It is estimated that as much as 85 percent of the information that currently flows through businesses is unstructured.

Unfortunately, however, capturing this data has proven difficult. "Unstructured data do not have the same characteristics that make structured data easily searchable, and today they frequently are not stored in structured databases," says Anant Jhingran, director of computer science at IBM's Almaden Research Center in California.




Unstructured data is any data residing unorganized outside of a database and can be text, audio, video, or graphics.

Where did I read that?
The advantages of harnessing the information contained in unstructured data are many. For example, say you are preparing a report and want to include an important fact that you came across several weeks ago, but you can't find the source.

"You would like to ask our data management software to find that bit of information no matter where it is, whether it's in spreadsheets, audio form, a Word document, plain old flat files, or a database engine," says integrated information researcher and IBM Fellow Pat Selinger, of IBM's Silicon Valley lab. "Being able to do that would be a great productivity tool, and that's really what we're after."

This software could help a pharmaceutical researcher find information about a specific molecule with certain characteristics simply by giving the system the desired parameters and letting it search. The system might sift through a multitude of sources such as graphic, text, and chemical compound databases maintained at various locations, including manufacturers and academic institutions. And once found, says Selinger, who serves as chief technology officer for the information integration area of IBM's Data Management Division, "I don't necessarily have to put the information in my database. It can continue to live like it lives today in some file system. I just have to know how to reach out and where to get it."

Herding the data
Some of these capabilities are currently available in IBM's new DiscoveryLink™ software, designed specifically for the life sciences field.

DiscoveryLink provides the user with a uniform method of querying for the desired information without having to understand the languages of the databases being searched. "A user doesn't need to learn how to use all these specific searches. DiscoveryLink does that," explains Selinger. "It translates the request into the specific dialect for the search that has to take place on a particular data source."

Researchers could apply DiscoveryLink's capabilities in the field of medical research, where it can be used to determine the efficacy of specific treatment protocols. For instance, if hundreds of patients with varying medical backgrounds have received different treatment protocols from different doctors over the course of several years, and researchers want to determine the most effective treatment for specific categories of patients, DiscoveryLink could be used to seek out and help correlate the relevant information from various databases.

Another crucial step in corralling unstructured data involves text. One financial institution is using IBM ViaVoice® voice recognition software to convert complaint calls (considered unstructured data) into text. These text files are then combined with a structured database containing information about the calling customer's transactions and other relevant information. A survey is then used to determine if the problem was handled to the customer's satisfaction. That information is also entered into the database so analyses can be run to determine the types of complaint responses that satisfy customers.

Unstructured emotions
Helping computers understand human emotions is the goal of another unstructured data research project. This work is taking natural language understanding — which allows a computer to recognize natural speech rather than specific commands — a significant step further. In this case, a computer will review text and determine the sentiment of the individual who created it.

So how can a plain piece of paper express feelings like anger or enthusiasm? "Many of the cues that authors provide to human readers are cues that are available to machines," explains Jhingran. "And if a machine takes those cues into account it doesn't necessarily have to do very deep natural language understanding in order to comprehend what the document is about, what the sentiment of the document is, what the important features of the document are."

Jhingran points to the Web as an example of text-based emotional cues. "People love to create lists, italics, boldface, and flashings," he says. "That is their way of cueing to the reader that this is important according to them. So if you take many of these things into account you can get away without having a very sophisticated natural language understanding, yet increase the level of understanding of documents far beyond what it is today."
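The kind of cue Jhingran describes can be picked up with very little machinery. The following is a minimal sketch, not IBM's method: it assumes a hypothetical set of "emphasis" tags (bold, italics, blinking text, list items) and simply collects the text an author chose to mark up with them, using only Python's standard HTML parser.

```python
from html.parser import HTMLParser

# Assumption: these tags stand in for the "lists, italics, boldface, and
# flashings" mentioned in the article. The set is illustrative only.
EMPHASIS_TAGS = {"b", "strong", "i", "em", "blink", "li"}


class EmphasisCollector(HTMLParser):
    """Collects text that appears inside any emphasis tag."""

    def __init__(self):
        super().__init__()
        self.depth = 0          # how many emphasis tags are currently open
        self.emphasized = []    # text fragments the author emphasized

    def handle_starttag(self, tag, attrs):
        if tag in EMPHASIS_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in EMPHASIS_TAGS and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if self.depth > 0 and text:
            self.emphasized.append(text)


def emphasized_text(html):
    """Return the emphasized fragments of an HTML string, in document order."""
    parser = EmphasisCollector()
    parser.feed(html)
    parser.close()
    return parser.emphasized


page = "<p>Our <b>new</b> model is <i>much</i> faster.</p><ul><li>Cheaper</li></ul>"
print(emphasized_text(page))
```

The sketch assumes well-formed markup; an unclosed emphasis tag would cause everything after it to be counted as emphasized.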

People also make their feelings known in less direct ways, says Jhingran. "People actually vote their preferences by providing links to different documents," he explains. "You may be able to determine that a page is authoritative because lots of people have found it important enough to have links to it. People explicitly create links from page one to page two, and if many people point to page two it looks like it is an important link to something." Businesses could use such analytical capability to determine the "buzz" about their products found in chat rooms and forums on the Internet.
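The "voting by links" idea can be illustrated by counting how many distinct pages point at each page. This is a deliberately simple sketch with made-up page names; real link analysis weighs the votes in more sophisticated ways, and nothing here is drawn from IBM's own implementation.

```python
from collections import Counter


def inlink_counts(links):
    """Count how many distinct pages link to each target page.

    `links` is an iterable of (source_page, target_page) pairs.
    Duplicate links and links from a page to itself are ignored,
    so each page gets at most one "vote" per target.
    """
    unique_votes = {(src, dst) for src, dst in links if src != dst}
    return Counter(dst for _, dst in unique_votes)


# Hypothetical link graph: pages a, c, and d all point to page b.
links = [
    ("a", "b"), ("c", "b"), ("d", "b"),
    ("a", "c"),
    ("a", "b"),   # duplicate vote, ignored
    ("b", "b"),   # self link, ignored
]

counts = inlink_counts(links)
print(counts.most_common(1))
```

Here page b collects three votes and page c one, so b would be treated as the more authoritative page.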

Alphabet soup
In addition to locating and understanding data, unstructured data research also focuses on enabling the combination of unstructured with structured data. "This will allow businesses to derive value from all of the data they own, not just from specific islands of data," says Jhingran.

Harnessing unstructured data, merging it with structured data, and manipulating the information so that it is easily accessible requires a myriad of complex programs, which can sound like a software engineer's alphabet soup.

One tool used to corral unstructured data is XML (extensible markup language), which tags salient parts of unstructured electronic documents so they can be searched. The structure of XML documents resembles that of a tree, with branches of tagged information, while relational databases consist of regimented rows. "Being able to produce, accept, store, and search XML provides a little structure to unstructured information," explains Selinger of the Silicon Valley Lab.

For instance, using XML tags, the various fields of an electronic purchase order can be identified by the information they contain, such as product, quantity, price, tax, and delivery date. A piece of IBM software currently under development, code-named Clio, will also enable that data to be transferred automatically from one company's electronic form to another even if the forms are different. "It uses mining techniques to look at the format of the data," explains Hamid Pirahesh, an IBM Fellow and senior manager of the Database Technology Institute at Almaden. "It says, 'This smells like a last name to me and in the target form it does the same thing, so it looks like they should be paired.'"
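The "smells like a last name" idea can be sketched as pairing fields by the shape of their values rather than by their names. The rules, field names, and sample values below are all hypothetical, chosen only to illustrate the pairing step; this is not a description of how Clio itself works.

```python
import re


def looks_like(values):
    """Classify a field by the shape of its values (crude, illustrative rules)."""
    if all(re.fullmatch(r"\d{4}-\d{2}-\d{2}", v) for v in values):
        return "date"
    if all(re.fullmatch(r"\d+(\.\d+)?", v) for v in values):
        return "number"
    if all(re.fullmatch(r"[A-Z][a-z]+", v) for v in values):
        return "name"
    return "text"


def pair_fields(source, target):
    """Pair each source field with the first unused target field of the same kind."""
    pairs = {}
    used = set()
    for s_name, s_values in source.items():
        kind = looks_like(s_values)
        for t_name, t_values in target.items():
            if t_name not in used and looks_like(t_values) == kind:
                pairs[s_name] = t_name
                used.add(t_name)
                break
    return pairs


# Two hypothetical forms that label the same kinds of data differently.
source_form = {
    "surname": ["Smith", "Jones"],
    "qty": ["12", "3"],
    "ship_on": ["2002-03-15", "2002-04-01"],
}
target_form = {
    "delivery": ["2002-05-20"],
    "last_name": ["Garcia"],
    "amount": ["7"],
}

print(pair_fields(source_form, target_form))
```

With these samples, "surname" pairs with "last_name", "qty" with "amount", and "ship_on" with "delivery", even though none of the field names match.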

To accomplish this pairing there must be a way to search the XML tags, and that's where XQuery enters the picture. The universally accepted tool for searching structured, or relational, databases, known as SQL (structured query language), was developed at IBM during the 1970s. When it became apparent that a new tool would be needed to track the trees of XML, IBM contributed to the creation of XQuery, which was adopted as a universal standard in January 2001 by W3C (World Wide Web Consortium), the international software standards group.
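To give a feel for what searching tagged data looks like, here is a minimal sketch over a hypothetical purchase order. It does not use XQuery, which is not part of Python's standard library; instead it relies on the limited path syntax of the standard `xml.etree.ElementTree` module as a stand-in. The tag names are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical purchase order; tag names are illustrative, not from any real schema.
PURCHASE_ORDER = """
<purchaseOrder>
  <item>
    <product>Widget</product>
    <quantity>12</quantity>
    <price>4.50</price>
  </item>
  <item>
    <product>Gasket</product>
    <quantity>3</quantity>
    <price>1.25</price>
  </item>
  <deliveryDate>2002-03-15</deliveryDate>
</purchaseOrder>
"""


def find_products(xml_text, min_quantity):
    """Return the products whose ordered quantity is at least min_quantity."""
    root = ET.fromstring(xml_text.strip())
    return [
        item.findtext("product")
        for item in root.findall("./item")       # walk the <item> branches
        if int(item.findtext("quantity")) >= min_quantity
    ]


def order_total(xml_text):
    """Sum quantity * price over every <item> branch of the tree."""
    root = ET.fromstring(xml_text.strip())
    return sum(
        int(item.findtext("quantity")) * float(item.findtext("price"))
        for item in root.findall("./item")
    )


print(find_products(PURCHASE_ORDER, 10))
print(order_total(PURCHASE_ORDER))
```

Because each value is reached through its tag rather than its position, the same search keeps working if fields are reordered or new ones are added to the form.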

But the need for greater capability does not stop there. By definition, unstructured data resides in many locations, so being able to make coherent use of it requires that these various sources interact with each other despite the fact that many of them speak different languages. Pirahesh compares it to the numerous telephone systems throughout the world. Despite their differences there is a uniform standard that allows one to easily place calls to almost anywhere in the world. "In the new world of XML we need to have that kind of standard," says Pirahesh.

Providing that standard is Web services, a protocol developed by IBM that enables authorized manufacturers, vendors, and customers to have immediate access to each other's data. Encryption is built into this protocol to provide secure exchanges of data among authorized parties. All of this is accomplished within what is referred to as a federated system, or a series of linked databases. "You as a provider of a set of products can publish a set of Web services that allows me as one of your partners to come in and ask you about the status of my orders. I federate with your database to enhance my data," explains Pirahesh.

IBM began delivering an early version of a Web services product this past summer as part of its DB2™ database product. "It is coming in stages, so you are going to see a lot more of this coming up with more features and greater capabilities," says Pirahesh.

