Tags that help machines interpret data could transform e-business
and simplify the exchange of information.
Much as people and economies depend on information, the exchange of data has often been hindered by the incompatible formats of proprietary hardware and software. That was less of a problem when computers rarely communicated with each other, but now it's a major obstacle to the spread of global networking and the growth of e-business. HTML, the language that allows data to be tagged so its style or format can be read on different platforms, is a step in the right direction. But an emerging standard called XML, for eXtensible Markup Language, can label data in still more useful ways. XML is what is known as a "metalanguage" -- a language for creating other languages, in this case new and useful markup languages.
For all its promise, however, the nature of XML can be difficult to grasp. Like the fabled elephant that three men perceived variously as a snake, a rope or a tree, depending on which part they touched, XML can appear to be different things to different people. For consumers and researchers, XML promises to help search engines and intelligent agents return more meaningful results from forays onto the Web. For companies expanding into e-commerce, XML provides a low-cost way to exchange purchase orders and other business documents over the Internet. For content providers, meanwhile, XML can automatically reformat a document to feed many different publishing media.
IBM is developing and strongly supporting XML, which it sees as a strategic technology for spreading e-business across different computing platforms, much as the Java software environment does. Indeed, the two technologies are highly complementary. While Java enables the creation of platform-independent programs, XML enables the manipulation of platform-independent data. In combination with the Internet, which enables platform-independent networking, they provide the three prerequisites for universal computing: global communications, portable software, and portable data.
The easiest way to explain XML is that it's a way of tagging any kind of data to make its significance understandable even to machines. A human may be able to tell the difference between a subtotal and a total, or a billing address and a shipping address, or a retail price and a sale price, but agents, bots and other programs need extra help. Indeed, XML is intended mainly to benefit computer programs. "The most important application for XML in the years to come will be computer-to-computer communications instead of computer-to-human or human-to-human communications," says Simon Phipps, IBM's chief Java/XML evangelist, based at the IBM Center for Java Technology Development in Hursley, England. "It's all about exchanging data."
Although the tags created with XML resemble the HTML tags used today to create Web pages, there are two important differences: XML tags separate content from presentation, and XML is extensible -- that is, it allows the creation of new tags for new and unforeseen purposes.
Both types of tags enclose a keyword or character in angle brackets, like this: <I>Sample HTML tags</I>. In this case, the HTML tag <I> means the following text should appear on the screen in italics. The tag </I> turns off italics and returns to the previous type style. XML tags follow the same pattern, enclosing content between "on" and "off" tags. But while HTML tags generally indicate only how the content should appear, XML tags indicate what the content means. Content therefore grows more accessible to different kinds of software and less dependent on a specific output device.
The value of such information is apparent when you look at how HTML and XML might be used to label a price on a Web page. Here's the HTML way:
<H3>Sale price: $24.95</H3> <I>(Suggested retail: $39.95)</I> <B>Shipping cost: $4.00 UPS Ground</B>
The pairs of <H3>, <I> and <B> tags merely indicate that the enclosed text should appear in headline type, italics or boldface. They give no hint as to what the enclosed content refers to.
That's acceptable for human readers, but a search engine or automated "shopbot" program seeking out the lowest price for an item on the Web would have to be very smart indeed to figure out which amount is the actual selling price. Because there's no standard method of specifying this information in HTML, different Web sites will display prices in different ways.
Here's a possible XML alternative:
<PRICE type="sale" unit="US Dollar">24.95</PRICE>
<PRICE type="retail" unit="US Dollar">39.95</PRICE>
<SHIPPING type="UPS Ground" unit="US Dollar">4.00</SHIPPING>
These tags clearly identify the sale price, regular retail price, shipping cost, shipping method and currency type. An XML-enabled search engine or bot can readily interpret such tags, especially if they are standardized for a particular product category or industry. Users will get more meaningful results with fewer false hits.
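To make the shopbot scenario concrete, here is a minimal Python sketch of how a program might pick out the selling price from the tags above. It is only an illustration: the enclosing <ITEM> element and the selling_price function are assumptions added for this example (an XML document needs a single top-level element), not part of any standard tag set.

```python
import xml.etree.ElementTree as ET

# The three tags from the article, wrapped in a hypothetical <ITEM> root
# because an XML document needs exactly one top-level element.
LISTING = """
<ITEM>
  <PRICE type="sale" unit="US Dollar">24.95</PRICE>
  <PRICE type="retail" unit="US Dollar">39.95</PRICE>
  <SHIPPING type="UPS Ground" unit="US Dollar">4.00</SHIPPING>
</ITEM>
"""


def selling_price(xml_text):
    """Return (amount, unit) for the sale price, or None if there is none."""
    root = ET.fromstring(xml_text)
    for price in root.findall("PRICE"):
        if price.get("type") == "sale":
            return float(price.text), price.get("unit")
    return None


print(selling_price(LISTING))
```

The program never has to guess which number on the page is the real price; it simply asks for the PRICE element whose type is "sale".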
Unlike HTML tags, XML tags don't specify how to display the information. That's done with a style sheet written in XSL (eXtensible Style Language), an XML language for reformatting and presentation. By separating the definition of content from the style of presentation, XML makes it easier to use the content in different ways.
For instance, a retailer might design different style sheets for the high-resolution screens of PCs, the low-resolution screens of TVs, the even lower-resolution monochrome LCDs of palmtop computers and the colorful pages of a printed catalog. If the sale price for an item changes, only the original XML document needs updating -- the style sheets can stay the same. A style sheet can also omit certain data depending on the intended use; for example, a European style sheet might display only metric units and a U.S. style sheet only imperial units.
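The separation of content from presentation can be sketched in a few lines of Python. This is not XSL: the two render functions below are hypothetical stand-ins for style sheets, and the <ITEM> wrapper is again an assumption made for the example. The point is only that one XML document can feed several presentations, and that a price change touches the data alone.

```python
import xml.etree.ElementTree as ET

# One source document; the <ITEM> root is a hypothetical wrapper.
PRODUCT = """
<ITEM>
  <PRICE type="sale" unit="US Dollar">24.95</PRICE>
  <PRICE type="retail" unit="US Dollar">39.95</PRICE>
  <SHIPPING type="UPS Ground" unit="US Dollar">4.00</SHIPPING>
</ITEM>
"""


def prices(root):
    """Map each price type ("sale", "retail") to its text value."""
    return {p.get("type"): p.text for p in root.findall("PRICE")}


def render_for_pc(root):
    """Stand-in for a style sheet aimed at a large screen: show everything."""
    p = prices(root)
    ship = root.find("SHIPPING")
    return (f"Sale price: ${p['sale']} (suggested retail: ${p['retail']}), "
            f"shipping ${ship.text} via {ship.get('type')}")


def render_for_palmtop(root):
    """Stand-in for a style sheet aimed at a tiny screen: sale price only."""
    return f"${prices(root)['sale']}"


root = ET.fromstring(PRODUCT)
print(render_for_pc(root))
print(render_for_palmtop(root))

# A price change is made once, in the data; both renderings pick it up.
root.find("PRICE").text = "19.95"
print(render_for_palmtop(root))
```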
Style sheets can be designed for any number of new media types. "This will be vital as new classes of computing devices appear," notes Rakesh Mohan, a researcher at IBM's Thomas J. Watson Research Center.
LANGUAGES ON DEMAND
XML's extensibility gives it still further advantages. While HTML tags must be blessed by an international standards body -- or chaotically imposed on the market by individual browser vendors -- new XML tags can be created by anyone at any time.
To prevent chaos, the tags used by a particular XML document can be described in a file called a "document type definition" (DTD). The file can reside in the same place as the XML document or at any location on the Internet. By referring to the definition file, an XML parser, or reader, can verify that the document is "valid" -- in compliance with both XML grammar and the rules of the newly created definition file.
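The difference between obeying XML grammar and obeying a definition file can be illustrated with a rough Python sketch. Python's standard-library parser checks grammar but does not validate against a DTD, so the REQUIRED_ATTRS table below is a hypothetical, much-simplified stand-in for one; a real DTD can express far more (element nesting, ordering, default values).

```python
import xml.etree.ElementTree as ET

# Hypothetical rules standing in for a DTD: each allowed tag maps to the
# attributes it must carry.
REQUIRED_ATTRS = {
    "ITEM": set(),
    "PRICE": {"type", "unit"},
    "SHIPPING": {"type", "unit"},
}


def check(xml_text):
    """Return 'not well-formed', 'invalid: ...' or 'valid'."""
    try:
        root = ET.fromstring(xml_text)  # step 1: XML grammar
    except ET.ParseError:
        return "not well-formed"
    for elem in root.iter():  # step 2: the rules for this tag set
        if elem.tag not in REQUIRED_ATTRS:
            return f"invalid: unknown tag {elem.tag}"
        missing = REQUIRED_ATTRS[elem.tag] - set(elem.attrib)
        if missing:
            return f"invalid: {elem.tag} lacks {sorted(missing)}"
    return "valid"


print(check('<ITEM><PRICE type="sale" unit="US Dollar">24.95</PRICE></ITEM>'))
print(check('<ITEM><PRICE type="sale">24.95</PRICE></ITEM>'))
print(check('<ITEM><PRICE>24.95</ITEM>'))
```

The first document passes both checks, the second is grammatical but breaks the tag rules, and the third fails the grammar check because its PRICE tag is never closed.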
IBM's XML Parser for Java, originated at the Tokyo Research Laboratory (TRL) and extended at IBM's Java Technology Center in Cupertino, California, is a leading example of such a program. It allows users to generate new XML documents or manipulate existing documents. In either case, the parser validates a document's use of tags as defined in a DTD.
Extensibility will create endless opportunities for customization. "For example," says Hiroshi Maruyama, who led the development of the XML parser at TRL, "an industry, or even a single business, can easily create its own complex order forms or catalogs, and an XML-enabled browser will be able to read them, so long as a corresponding style sheet is provided."
Xeena, developed at IBM's Haifa Research Lab, is a Java application built on top of the XML parser that provides an intuitive visual interface for creating and editing XML documents by pointing and clicking on a menu of XML tags. Moreover, by changing the availability of certain tags according to the context, it enables users to produce valid documents without being familiar with a particular document type definition or even knowing much about XML.
Many specialized but openly documented tag sets already exist. They include Genealogical Markup Language, or GedML, a set of tags for describing family trees; Chemical Markup Language, or CML, for molecules and crystals; and Mathematical Markup Language, or MathML, for equations (see "IBM Software + Web = Science Online," IBM Research, Number 3, 1998).
A group at IBM Research led by David Epstein has even created a technique that goes beyond tagging data. They are using XML to represent programs. The team's Bean Markup Language (BML) is a set of XML tags that describes the structure of Java applications. Using this approach, an abstract XML description of an application, such as an order entry system, can be automatically transformed into a BML document, which is directly executable. This not only makes it possible for machines to generate applications on demand but also provides a new way to insert additional functionality into an application while it's running. For example, an email application could dynamically incorporate a translation capability if it detected that mail was being sent to another country.
"This technology adds a new range of dynamic capabilities to Java," says Epstein. "It provides a unified way of customizing applications. It improves performance by building applications from smaller parts that can be independently added only when needed. And it provides a framework for building applications that can automatically extend themselves."
A striking example of the new business capabilities XML will make possible is YODA (Your Own Data Access), a Java applet developed at Watson. YODA uses XML tags and XSL to bring the advantages of Electronic Data Interchange (EDI) to thousands of small- and medium-sized companies.
EDI is a standard that lets companies exchange electronic documents. A company might transmit an electronic purchase order to a supplier, which might send back another electronic document containing an invoice. But EDI requires expensive software and special network connections, making it too costly for most small businesses. "The challenge for large companies like IBM is to reduce the amount of paper purchase orders and invoices. An important step toward that goal is to lower the cost of EDI transactions, so that small- and medium-size companies can make use of EDI," says Jen-Yao Chung, a Watson researcher at IBM's Institute of Advanced Commerce who helped develop YODA.
Thanks to YODA and IBM's XML/EDI tags, small businesses will be able to affordably exchange many types of EDI messages. All they'll need is an XML-enabled Web browser and Internet access -- YODA downloads automatically and handles the document exchange. A pilot project with YODA is expected to conclude this year.
Of course, none of these new applications can take off until XML standards are set. The World Wide Web Consortium (W3C) adopted XML 1.0 as a "recommendation" in 1998. But there's still much work to do on related standards. Two related technologies known as XLink and XPointer, which allow much more flexible and powerful hyperlinking than is possible with HTML, are expected to win W3C approval this spring. The still-evolving XSL style-sheet standard is in the final phases of evaluation and should be adopted this summer. In addition to the five current standards, 10 other fundamental XML standards should be ready within the next year, says Watson researcher Bob Schloss.
For its part, IBM is helping develop a number of these standards. For example, it is participating in the formation of a standard for specifying the structure of XML documents. "Among other things, this will help us specify whether a piece of data is an integer or a floating-point number or a currency such as dollars, not just text strings like we have now," explains Ashok Malhotra, a Watson researcher who's working with the W3C on the new standard. Malhotra is also working on standards to translate database contents into XML documents that can then be searched using XML-tagged queries. "This would allow us to work with information gathered from diverse sources in a single format," he says.
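A small Python sketch shows why typed data matters. Without a declared type, a program sees only text, and text compares alphabetically. The retail price of 100.00 used here is an illustrative value chosen to expose the problem, and the <ITEM> wrapper is an assumption of the example.

```python
from decimal import Decimal
import xml.etree.ElementTree as ET

root = ET.fromstring(
    "<ITEM>"
    '<PRICE type="sale" unit="US Dollar">24.95</PRICE>'
    '<PRICE type="retail" unit="US Dollar">100.00</PRICE>'
    "</ITEM>"
)
sale, retail = (p.text for p in root.findall("PRICE"))

# As plain text the comparison is alphabetical, so "100.00" sorts
# before "24.95" and the retail price wrongly looks cheaper.
print(retail < sale)

# Once the program knows the values are decimal amounts, the
# comparison is numeric and the sale price is correctly the lower one.
print(Decimal(retail) < Decimal(sale))
```

Here the program has to be told by its author that PRICE holds a decimal amount; a standard for declaring data types would let the document's own definition carry that knowledge.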
Standardizing data formats not only helps companies exchange documents but also makes their own archives more accessible. "People will often say, 'Where's that report we wrote eight years ago?' and find that the report was written in WordPerfect 1.6 or some other format that no software can read," says Schloss. "With XML, data will be durable."
Yet another use for XML is the exchange of "metadata" -- descriptions of data instead of actual data. For instance, today's Internet search engines can't index the dynamic Web pages that servers create on the fly in response to user input. The pages don't exist until a user requests some data. A new standard called the Resource Description Framework (RDF) uses XML, and allows a Web site to describe its dynamic content without creating any pages that contain the content. "Search engines could use this metadata to lead users to the site in response to a query," says Schloss, who is co-chairman of the RDF committee of the W3C.
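As a rough illustration of metadata expressed in XML, the Python sketch below reads a small RDF description of a dynamically generated page. The record itself is hypothetical: the example.com address, the wording of the title and description, and the choice of Dublin Core property names are all assumptions made for this example, and the describe function is not part of any RDF toolkit.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC_NS = "http://purl.org/dc/elements/1.1/"  # Dublin Core, assumed vocabulary

# Hypothetical metadata record for a page that exists only on request.
METADATA = f"""
<rdf:RDF xmlns:rdf="{RDF_NS}" xmlns:dc="{DC_NS}">
  <rdf:Description rdf:about="http://example.com/catalog/search">
    <dc:title>Product catalog search</dc:title>
    <dc:description>Prices and shipping costs, generated per query</dc:description>
  </rdf:Description>
</rdf:RDF>
"""


def describe(rdf_text):
    """Map each described resource address to its property names and values."""
    root = ET.fromstring(rdf_text)
    result = {}
    for desc in root.findall(f"{{{RDF_NS}}}Description"):
        about = desc.get(f"{{{RDF_NS}}}about")
        # Strip the "{namespace}" prefix ElementTree puts on each tag name.
        result[about] = {child.tag.split("}")[1]: child.text for child in desc}
    return result


print(describe(METADATA))
```

A search engine could index a record like this without ever requesting the dynamic page it describes.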
Other IBM researchers are using RDF as a way to describe the internal structure of multimedia streams. A new search site for XML, called xCentral, is setting the direction for the evolution of future Web search engines. xCentral includes a Web crawler -- the world's first RDF search engine -- that discovers document type definitions in use, newsgroup articles about XML and XML documents on the Web, which it then represents in RDF. "The value of using RDF is that the information one is looking for is in a structured form that allows far more precise matches between a query and the responses to it than is possible with today's search engines," says Neel Sundaresan, a Web technology researcher at IBM's Almaden Research Center.
Besides the standards issue, other factors could slow the adoption of XML. For one thing, the language is harder to grasp than HTML. It also consists of many layers of technology that must fit together to yield the desired results. So XML won't change the world overnight. Although some companies are already using XML, its impact is likely to be gradual this year and then accelerate. The language may not become mainstream for five or six years.
In its favor, however, XML enjoys broad support. Microsoft's Channel Definition Format (CDF), for example, uses the language to define how Web servers should "push" information to PC desktops. The company's forthcoming Office 2000 will store some information in XML format, and Microsoft® Internet Explorer 5.0 supports XML and an early version of XSL. The language also has the backing of AOL/Netscape, Sun, Oracle, and virtually every other mover and shaker in the industry. Despite disagreements over the details of particular standards, everyone recognizes XML's key benefits: more compatibility between applications, easier exchange of information, less dependence on native platforms, greater access to enterprise data, more efficient e-commerce and a smoother road to pervasive computing.
"It's going to be a faster world because of XML," says Dan Ford, who led the xCentral development team at Almaden. "XML is going to make everything a lot easier."
Tom R. Halfhill, a former senior editor at Byte, lives near San Francisco.
XML is a subset of SGML (Standard Generalized Markup Language), which in turn descends from earlier markup languages first developed at IBM as early as 1969. The oldest direct ancestor is GML, which both stands for Generalized Markup Language and comprises the initials of the IBM researchers who created it: Charles F. Goldfarb, Edward Mosher and Raymond Lorie.
SGML is a far more extensive markup language, adopted as ISO standard 8879 in 1986. To this day, it remains the last word in markup languages, and applications such as Adobe® FrameMaker® use it for desktop publishing. But SGML is considered too complex for widespread e-business and similar applications. By strategically omitting large chunks of SGML, XML provides about 80 percent of the functionality with only about 20 percent of the complexity.
More information: IBM's XML site. Many of the XML tools IBM is developing are available for downloading, experimentation and discussion on the alphaWorks Web site.