
Data Format Description Language (DFDL) Overview
IBM T. J. Watson Research Center, P.O. Box 704, Yorktown Heights, NY 10598, USA, krisrose@us.ibm.com
This is an overview of the variant of the Data Format Description Language (DFDL, pronounced “daffodil”) supported by the “Virtual XML Garden” release [Virtual] from IBM alphaWorks. After the introduction (with goals, terms, and references) we illustrate the basic principles of DFDL through an example (keeping it simple), then go through the data format directives and the data format properties that can be set, before we revisit the example (with bells and whistles). Finally, we explain how to run the examples and how our description diverges from published DFDL drafts and current working group discussions.
The Data Format Description Language (DFDL) is being developed by the appropriately named Data Format Description Language Working Group [DFDL-WG] of the Global Grid Forum (GGF). The purpose of DFDL [DFDL] is to define mappings between non-XML formatted data (in byte strings) and XML data structures [XML]. This is achieved by specifying the desired XML structure through an XML schema [Schema0] which is then augmented with “annotations” that give the details of how the formatted data is to be mapped into XML. DFDL is by design limited to fairly “direct” mappings to make sure that implementations can be used for converting both ways between the XML and non-XML form.
The language described in this document is a variant that tries to stay close to the latest draft [DFDL] while allowing experiments with and research of some specific goals:
Integrate strongly with XPath [XPath2].
Have a completely uniform and orthogonal treatment of all data format properties.
Support a simple lexical scoping mechanism.
The following terms are used freely throughout this document.
Data. The file (or other data source) that contains the actual data that is being read by the DFDL interpreter to construct an XML data model instance.
“data” and “xs” namespaces. Throughout this document we associate the DFDL draft's namespace URI “http://dataformat.org/” with the “data:” prefix and, as usual, the XML schema namespace URI “http://www.w3.org/2001/XMLSchema” with the “xs:” prefix.
DFDL description. Denotes an XML schema with DFDL annotations as described in this document.
Facet. Constraint used to impose restrictions on an XML schema simple type.
Particle. Any of the principal XML schema declarations which allow repetition (through the minOccurs and maxOccurs declarations).
XML Instance. Denotes an XML data “document” that can be accessed using the conventions of the XPath/XQuery Data Model [DM].
XML schema. An XML document with a description of the structure and datatypes allowed for XML instances. DFDL specifications are themselves XML schema.
DFDL: Mike Beckerle, Data Format Description Language (DFDL), A Proposal, Working Draft, 2005-03-30, available from [DFDL-WG].
DFDL-Prop: Michael Beckerle, DFDL Representation Properties, 2004, available from [DFDL-WG].
DFDL-WG: Martin Westhead, Alan Chappell, and Mike Beckerle, Data Format Description Language Working Group (http://forge.gridforum.org/projects/dfdl-wg/).
DM: Mary Fernández, Ashok Malhotra, Jonathan Marsh, Marton Nagy, and Norman Walsh, XQuery 1.0 and XPath 2.0 Data Model (http://www.w3.org/TR/xpath-datamodel/), World Wide Web Consortium Working Draft, 2005.
Schema0: David C. Fallside and Priscilla Walmsley, XML Schema Part 0: Primer (http://www.w3.org/TR/xmlschema-0/), World Wide Web Consortium Recommendation, 2004.
Virtual: Kristoffer Rose, Lionel Villard, Achille Fokoué, Rajeshwari Rajendra, Paul Castro, Christopher Holtz, William Li, Virtual XML Garden (http://www.alphaWorks.ibm.com/tech/virtualxml), IBM alphaWorks, 2005.
XML: Tim Bray, Jean Paoli, C. M. Sperberg-McQueen, and Eve Maler, Extensible Markup Language (XML) 1.0 (http://www.w3.org/TR/REC-xml), World Wide Web Consortium Recommendation, 1999.
XPath2: Anders Berglund, Scott Boag, Don Chamberlin, Mary F. Fernández, Michael Kay, Jonathan Robie, and Jérôme Siméon, XML Path Language (XPath) 2.0 (http://www.w3.org/TR/xpath20/), World Wide Web Consortium Working Draft, 2005.
In this section we explain how a simple legacy data format can be described using DFDL.
One of the most common formats for legacy data exchange is the COBOL “copybook” record storage format. A COBOL copybook looks like the following:
00572 ****************************************************
00572 * COBOL COPYBOOK - CUSTOMERS
00572 * DATA FOR CUSTOMER TABLE
00572 ****************************************************
00572 77 CUSTOMER-RECORD-COUNT PIC 9999.
00573 01 CUSTOMER-RECORD.
00574 05 CUSTOMER-LAST-NAME PIC X(20).
00575 05 CUSTOMER-FIRST-NAME PIC X(15).
00576 05 CUSTOMER-AGE PIC 999.
00577 05 CUSTOMER-PHONE PIC 9(10).
This specific copybook can be read as follows:
The record count (at the special 77 level) is a four digit integer called CUSTOMER-RECORD-COUNT.
The record container unit (at level 01) is CUSTOMER-RECORD.
Each record has four fields (at level 05).
The first and second field contain twenty and fifteen characters (X(20) and X(15), respectively), and the third and fourth contain three and ten digits (999 and 9(10), respectively).
Here is a simple XML schema with a structure equivalent to such a collection of records:
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
<xs:element name="copybook">
<xs:complexType>
<xs:sequence>
<xs:element name="CUSTOMER-RECORD" minOccurs="0" maxOccurs="9999">
<xs:complexType>
<xs:sequence>
<xs:element name="CUSTOMER-LAST-NAME" type="xs:string"/>
<xs:element name="CUSTOMER-FIRST-NAME" type="xs:string"/>
<xs:element name="CUSTOMER-AGE" type="xs:int"/>
<xs:element name="CUSTOMER-PHONE" type="xs:string"/>
</xs:sequence>
</xs:complexType>
</xs:element>
</xs:sequence>
<xs:attribute name="CUSTOMER-RECORD-COUNT" type="xs:int"/>
</xs:complexType>
</xs:element>
</xs:schema>
We have translated the last field as a string rather than an integer because phone numbers are not really numbers (where leading zeros are ignored, etc.), and we have manifested the record count as an attribute on the container element.
The purpose of DFDL is to define a mapping that provides an XML view of real copybook instances, so that XML tools can be used to read, update, and write the copybook records. DFDL achieves this by leveraging the information in the XML schema, adding annotations with the additional details of the actual data format.
Here is our XML schema from before with a few extra annotations (the attributes with the “data:” prefix) to make the mapping between copybook records and XML unambiguous; this is now a DFDL specification:
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
xmlns:data="http://dataformat.org/">
<xs:element name="copybook">
<xs:complexType>
<xs:sequence>
<xs:element name="CUSTOMER-RECORD"
minOccurs="0" maxOccurs="9999">
<xs:complexType>
<xs:sequence>
<xs:element name="CUSTOMER-LAST-NAME" type="xs:string"
data:encoding="ebcdic-cp-us" data:length="20"/>
<xs:element name="CUSTOMER-FIRST-NAME" type="xs:string"
data:encoding="ebcdic-cp-us" data:length="15"/>
<xs:element name="CUSTOMER-AGE" type="xs:int"
data:encoding="ebcdic-cp-us" data:length="3"/>
<xs:element name="CUSTOMER-PHONE" type="xs:string"
data:encoding="ebcdic-cp-us" data:pattern="\d{{10}}"/>
</xs:sequence>
</xs:complexType>
</xs:element>
</xs:sequence>
<xs:attribute name="CUSTOMER-RECORD-COUNT" type="xs:int"
data:value="{count(CUSTOMER-RECORD)}"/>
</xs:complexType>
</xs:element>
</xs:schema>
These “data [format] properties” make the mapping to the actual records precise. Here's how it works:
At the top we make sure to introduce a prefix (“data:”) for the DFDL namespace.
The first two fields are encoded as 20 and 15 EBCDIC characters, respectively.
The third field of each record is a three digit number (again with EBCDIC encoding).
The last field of each record is a string of exactly ten digits (in EBCDIC encoding). The digit string is specified by a regular expression pattern that follows the XML schema standard, except that braces are doubled.
At the bottom we indicate that the value of the CUSTOMER-RECORD-COUNT attribute on the “copybook” element is obtained by computing the actual number of records.
That is all: using this DFDL specification, the following data with two copybook records
ROSE                KRISTOFFER     0402025555555
ROSE                SOFUS          0060000000000
(one contiguous stream of EBCDIC-encoded characters, shown here with one record per line and with the space padding of the fixed-width fields) will be viewed as if it were the following XML document (where the element content is taken from the characters of the actual record):
<copybook CUSTOMER-RECORD-COUNT="2">
  <CUSTOMER-RECORD>
    <CUSTOMER-LAST-NAME>ROSE                </CUSTOMER-LAST-NAME>
    <CUSTOMER-FIRST-NAME>KRISTOFFER     </CUSTOMER-FIRST-NAME>
    <CUSTOMER-AGE>40</CUSTOMER-AGE>
    <CUSTOMER-PHONE>2025555555</CUSTOMER-PHONE>
  </CUSTOMER-RECORD>
  <CUSTOMER-RECORD>
    <CUSTOMER-LAST-NAME>ROSE                </CUSTOMER-LAST-NAME>
    <CUSTOMER-FIRST-NAME>SOFUS          </CUSTOMER-FIRST-NAME>
    <CUSTOMER-AGE>6</CUSTOMER-AGE>
    <CUSTOMER-PHONE>0000000000</CUSTOMER-PHONE>
  </CUSTOMER-RECORD>
</copybook>
The DFDL engine generates this view by making some simple assumptions:
The only top-level element is assumed to be the root element.
The data is expected to only contain the simply typed values mandated by the XML schema particles with a data format annotation, juxtaposed in the same order as in the XML view.
The XML schema lexical form, as constrained by DFDL properties, is used for the default textual form of each value.
The details are explained in the following two sections.
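To make these assumptions concrete, the following is a minimal Python sketch that hand-codes the mapping of the introductory example; it is not how the DFDL engine is implemented. It assumes that Python's cp037 codec is an acceptable stand-in for the “ebcdic-cp-us” encoding, and the names FIELDS and parse_copybook are hypothetical.

```python
import xml.etree.ElementTree as ET

# (element name, width in characters, is_integer), mirroring the annotated schema.
FIELDS = [
    ("CUSTOMER-LAST-NAME", 20, False),
    ("CUSTOMER-FIRST-NAME", 15, False),
    ("CUSTOMER-AGE", 3, True),
    ("CUSTOMER-PHONE", 10, False),
]
RECORD_WIDTH = sum(width for _, width, _ in FIELDS)  # 48 characters per record


def parse_copybook(data: bytes) -> ET.Element:
    """Build the XML view of juxtaposed fixed-width EBCDIC records."""
    text = data.decode("cp037")  # assumption: cp037 stands in for ebcdic-cp-us
    if len(text) % RECORD_WIDTH:
        raise ValueError("data is not a whole number of records")
    root = ET.Element("copybook")
    for start in range(0, len(text), RECORD_WIDTH):
        record = ET.SubElement(root, "CUSTOMER-RECORD")
        pos = start
        for name, width, is_integer in FIELDS:
            raw = text[pos:pos + width]
            pos += width
            # xs:int values use the schema lexical form; strings keep their padding.
            ET.SubElement(record, name).text = str(int(raw)) if is_integer else raw
    # The attribute is computed, not read: count(CUSTOMER-RECORD).
    root.set("CUSTOMER-RECORD-COUNT", str(len(root)))
    return root


sample = (
    "ROSE".ljust(20) + "KRISTOFFER".ljust(15) + "040" + "2025555555"
    + "ROSE".ljust(20) + "SOFUS".ljust(15) + "006" + "0000000000"
).encode("cp037")
doc = parse_copybook(sample)
```

Calling ET.tostring(doc) on the result serializes the view; the string fields keep their trailing padding, as in the XML document shown earlier.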
The core of DFDL is the data format directive, supplemented by a single auxiliary directive, the defaults declaration directive. Both directives follow the XML schema convention of being embedded in “application information” annotations in the XML schema, like this:
<xs:annotation>
  <xs:appinfo source="http://dataformat.org/">
    Directives
  </xs:appinfo>
</xs:annotation>
We first explain the data format directive with its variations, and then the defaults directive.
Data format properties are the primary means to guide the mapping between non-XML and XML. Data format annotations are only allowed on XML schema particles, specifically element, attribute, and type declarations. Any number of data format properties can be bound with the directive
<data:format Property="Value" ... />
The Property must be either one of the pseudo-properties (there are five: name, ref, useRestrictions, value, and guard, explained in subsequent subsections), or one of the generic data format properties explained in Section 4. Such properties are only in effect for the directly annotated XML schema particle. There is no scoping (except as introduced by the “defaults” directive, discussed below).
The Value string used for data format property definitions must be a non-empty string which is subject to two conversions before it is used:
Non-XML character literals, identified by “\#”.
Embedded XPath computation, identified by single braces “{” and “}”.
These are explained below.
We refer to the XML schema component with the property as “the annotated schema component”.
If an XML schema particle has more than one data format in it then the first data format is used. (The guard pseudo-property, explained below, can be used to change this.)
As an alternative notation you can use attributes directly on XML schema components (as was done in the illustrative example): attribute declarations
data:Property="Value" ...
are a short-hand for adding the following annotation to the XML schema component with the attribute:
<xs:annotation>
  <xs:appinfo source="http://dataformat.org/">
    <data:format Property="Value" ... />
  </xs:appinfo>
</xs:annotation>
Non-XML (and other) characters can be inserted into property value strings with the following special DFDL character escape syntax:
“\#D...;” (“D...” denotes one or more digits) inserts a single character with that decimal code,
“\#xH...;” (“H...” denotes hexadecimal digits) inserts a single character with that hexadecimal code,
“\#;” is ignored, and
“\\#” inserts a literal “\#” character pair.
Note that this is in addition to the use of XML numeric character entities of the form “&#D...;” because XML character entities can only represent the subset of the Unicode characters that are explicitly allowed by the XML standard [XML]. In addition, the DFDL escape conversion happens after the XML entities are interpreted, so “&#92;#65;” is the same as “\#65;”, which is then interpreted by DFDL as “A”, whereas “\#38;#65;” is interpreted as the literal string “&#65;”. In any case, it only makes sense to use character codes that are actually supported by the character encoding in use.
The “file magic” example below illustrates the use of non-XML character escapes.
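The four escape forms can be sketched with a single regular expression substitution. This is a minimal Python illustration of the rules as listed above, not the engine's code; the function name expand_dfdl_escapes is hypothetical, and the sketch assumes the XML parser has already resolved any XML entities.

```python
import re

# Alternatives, tried left to right at each position:
#   \\#        two backslashes and a hash: literal "\#"
#   \#xH...;   hexadecimal character code
#   \#D...;    decimal character code
#   \#;        ignored
_ESCAPE = re.compile(r"\\\\#|\\#(?:x([0-9A-Fa-f]+)|([0-9]+))?;")


def expand_dfdl_escapes(value: str) -> str:
    """Expand DFDL character escapes in a property value string."""
    def replace(match):
        hex_digits, dec_digits = match.group(1), match.group(2)
        if match.group(0) == "\\\\#":
            return "\\#"
        if hex_digits is not None:
            return chr(int(hex_digits, 16))
        if dec_digits is not None:
            return chr(int(dec_digits))
        return ""  # the ignored escape "\#;"
    return _ESCAPE.sub(replace, value)


print(repr(expand_dfdl_escapes(r"PK\#3;\#4;")))  # 'PK\x03\x04'
```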
XPath computations can be inserted into property value strings using curly brace characters “{” and “}” as follows:
Single braces are interpreted as surrounding a string XPath expression to be computed (details below). The single braces must be matched and cannot contain another single brace inside.
Double braces can be used to insert literal single braces. Double braces do not need to be matched.
XPath computation happens after XML entities have been interpreted. This means that “&#123;” and “&#125;” are completely equivalent to “{” and “}”, respectively. The XPath expression is not subject to DFDL character escapes, so “\#123;” and “\#125;” are only equivalent to “{{” and “}}” outside of XPath expressions. To include a single brace inside the XPath expression to compute you must use the corresponding double brace. When in doubt, do not hesitate to insert the ignored literal “\#;”.
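The brace rules can be illustrated with a small scanner that splits a property value into literal text and XPath expressions. This is a sketch of one reading of the rules above; the function name split_property_value and the ("text", ...) / ("xpath", ...) result shape are hypothetical, and evaluating the XPath itself is out of scope.

```python
def split_property_value(value: str):
    """Split a property value into ("text", s) and ("xpath", s) parts."""
    parts, literal = [], []
    i, n = 0, len(value)
    while i < n:
        ch = value[i]
        if ch in "{}" and value[i:i + 2] == ch * 2:
            literal.append(ch)          # doubled brace: literal single brace
            i += 2
        elif ch == "{":
            if literal:
                parts.append(("text", "".join(literal)))
                literal = []
            expr = []
            i += 1
            while True:                 # scan to the matching single "}"
                if i >= n:
                    raise ValueError("unmatched '{'")
                c = value[i]
                if c in "{}" and value[i:i + 2] == c * 2:
                    expr.append(c)      # doubled brace inside the expression
                    i += 2
                elif c == "}":
                    i += 1
                    break
                elif c == "{":
                    raise ValueError("single '{' inside an XPath expression")
                else:
                    expr.append(c)
                    i += 1
            parts.append(("xpath", "".join(expr)))
        elif ch == "}":
            raise ValueError("unmatched '}'")
        else:
            literal.append(ch)
            i += 1
    if literal:
        parts.append(("text", "".join(literal)))
    return parts


print(split_property_value("{count(CUSTOMER-RECORD)}"))
```

With this reading, the pattern value “\d{{10}}” from the introductory example yields the single text part “\d{10}”, while the value of CUSTOMER-RECORD-COUNT yields a single XPath part.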
The embedded XPath expressions should follow the XPath syntax from the specification [XPath2]. The XPath expression is computed in an XPath context defined as follows:
The “context node” for the XPath is the node in the generated XML document that is the parent of the node that might be constructed by the annotated XML schema particle. The context sequence is the singleton with the context node.
For each data format property defined for the XML schema component, the variable “$data:property” is defined with the same value.
Be aware that it is possible to create infinite loops in the XPath computation since computations can depend on other computations through the use of variables and by the XPath navigating to peer nodes. (This is similar to “recalculate” loops in spreadsheets.)
The computed attribute CUSTOMER-RECORD-COUNT in the introductory example illustrates the use of an embedded XPath.
A data format directive can be named by including the special name pseudo-property:
<data:format ... name="FormatName" .../>
Named data format directives are only allowed in annotations directly on the XML schema declaration itself. Such a named data format is then available to other data format directives by the ref pseudo-property:
<data:format ... ref="FormatName" .../>
Searching for a property in a data format with a ref pseudo-property first searches for the property in the format itself and then searches the referenced named format.
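As a rough illustration of this search order, the following Python sketch models each data format as a dictionary of properties and follows ref links until the property is found. The names lookup and named_formats are hypothetical, and the cycle check is an added safeguard that the text does not mention.

```python
def lookup(fmt: dict, prop: str, named: dict):
    """Return the value of prop, searching fmt first and then the formats it refs."""
    seen = set()
    while fmt is not None:
        if prop in fmt:
            return fmt[prop]
        ref = fmt.get("ref")
        if ref is None or ref in seen:   # no reference, or a reference cycle
            return None
        seen.add(ref)
        fmt = named.get(ref)
    return None


# The named format and one referencing format from the example schema.
named_formats = {"ebcdic": {"name": "ebcdic", "encoding": "ebcdic-cp-us"}}
last_name = {"ref": "ebcdic", "length": "20"}

print(lookup(last_name, "encoding", named_formats))  # ebcdic-cp-us
```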
Here is our XML schema from before, rewritten to use a named data format:
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
xmlns:data="http://dataformat.org/">
<xs:annotation>
<xs:appinfo source="http://dataformat.org/">
<data:format name="ebcdic" encoding="ebcdic-cp-us"/>
</xs:appinfo>
</xs:annotation>
<xs:element name="copybook">
<xs:complexType>
<xs:sequence>
<xs:element name="CUSTOMER-RECORD"
minOccurs="0" maxOccurs="9999">
<xs:complexType>
<xs:sequence>
<xs:element name="CUSTOMER-LAST-NAME" type="xs:string"
data:ref="ebcdic" data:length="20"/>
<xs:element name="CUSTOMER-FIRST-NAME" type="xs:string"
data:ref="ebcdic" data:length="15"/>
<xs:element name="CUSTOMER-AGE" type="xs:int"
data:ref="ebcdic" data:length="3"/>
<xs:element name="CUSTOMER-PHONE" type="xs:string"
data:ref="ebcdic" data:pattern="\d{{10}}"/>
</xs:sequence>
</xs:complexType>
</xs:element>
</xs:sequence>
<xs:attribute name="CUSTOMER-RECORD-COUNT" type="xs:int"
data:value="{count(CUSTOMER-RECORD)}"/>
</xs:complexType>
</xs:element>
</xs:schema>
The named data format directive has been lifted to the top-level annotation of the XML schema, as required for named declarations. Since we're not using the “defaults” DFDL directive, no scoping is in effect, so we have to explicitly bind the data format to every value where it is needed. Notice how we used the “attribute shorthand” instead of full annotations.
Including the pseudo-property guard makes the data format optional:
<data:format ... guard="Value" .../>
The value must evaluate to either “true” or “false”: if the value of the guard property is “true” then the data format is considered, if it is “false” then the data format is ignored. As a special rule, XML schema components that have only data formats with “false” guard values are not allowed to be instantiated at all. This is mostly useful for guards written as embedded XPath expressions, i.e., using the form
<data:format ... guard="{boolean(Test)}" .../>
where the conversion rules will correctly translate the Test to either “true” or “false”.
Special support is available for reusing simple XML schema types. Including the pseudo-property
<data:format ... useRestrictions="Name1 ... Namen" .../>
(with Name1 ... Namen a space-separated list of type names) tells the DFDL engine that for each XML schema declaration of the form
<xs:simpleType name="Namei">
  <xs:annotation>
    <xs:appinfo source="http://dataformat.org/">
      Directives
    </xs:appinfo>
  </xs:annotation>
  <xs:restriction base="BaseType"/>
</xs:simpleType>
(where Namei is one of the names in Name1 ... Namen), every use of BaseType as the type of an attribute or element should be interpreted as a use of the type Namei, which in particular implies that it should use the Directives as if they were specified directly for the BaseType.
To avoid inconsistent specifications, the Directives cannot in turn include data formats which (re)define the useRestrictions pseudo-property.
Finally, the “defaults” directive
<data:defaults> Directives </data:defaults>
where Directives should be individual data format directives, introduces scoping: for any XML schema component contained within the XML schema component annotated with the defaults directive, the data format directives will be searched (in order) for properties that are not already found on that component. (If defaults directives are nested, this implies that they are searched bottom-up, following the usual lexical scoping conventions.)
To summarize, here are the rules for resolving the value of a data format property associated with a given XML schema component with a particular target type:
1. First search for a local value: if the component has any “format” annotations that define the property (either directly or by reference to a named data format, recursively), then resolve to the first such value found in a data format annotation (in schema document order) that does not include a guard property that is “false”, if any.
2. If no value was found, search for one in the type of attribute and element declarations, as follows:
   a. First search for a type restriction value for simple types: resolve the pseudo-property useRestrictions recursively (using just rules 1 and 3) into a list of type names Name1...Namen. Then check if the target type is one of the types BaseType1...BaseTypen that Name1...Namen are restrictions of. If so, resolve to the value obtained by applying rules 1 and 3 to the type declaration component.
   b. If no type restriction value was found, then search for a type local value by applying rules 1 and 3 to the type declaration component.
   c. If neither a type restriction nor a type local value was found but the type is derived, repeat the type subrules a–c recursively for the base type; otherwise no property value can be found for the type.
3. If neither a local nor a type value was found, search for a defaults value: repeat the rules recursively for formats in the nearest enclosing XML schema component with a data:defaults annotation.
While these rules are not complex, be warned that mixing nested “defaults” directives with type declarations and “useRestrictions” pseudo-properties can lead to complex format specifications. Keep it simple!
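The interplay of the local search and the defaults search can be sketched as follows. This is a deliberately simplified Python model, not the engine's algorithm: it covers only the local-value and defaults-value searches, omits the type rules and named references, and assumes guards have already been evaluated to the strings “true” or “false”. The Component class and the function names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Component:
    formats: list = field(default_factory=list)    # data:format directives, in document order
    defaults: list = field(default_factory=list)   # formats inside a data:defaults directive
    parent: Optional["Component"] = None           # enclosing schema component


def first_value(formats, prop):
    """First value of prop among the formats whose guard is not "false"."""
    for fmt in formats:
        if fmt.get("guard", "true") == "false":
            continue
        if prop in fmt:
            return fmt[prop]
    return None


def resolve(component, prop):
    value = first_value(component.formats, prop)        # local value
    scope = component.parent
    while value is None and scope is not None:           # defaults, innermost scope first
        value = first_value(scope.defaults, prop)
        scope = scope.parent
    return value


schema = Component(defaults=[{"encoding": "US-ASCII"}])
section = Component(
    formats=[{"guard": "false", "encoding": "bytes"}, {"initiator": "["}],
    parent=schema,
)
print(resolve(section, "encoding"))  # US-ASCII
```

Here the guarded-out local format is skipped, so the encoding falls through to the enclosing defaults, while the initiator is found locally.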
This section lists the few DFDL data format properties that our prototype supports to determine what (if any) data should be read to create instances of the elements, attributes, and simple values declared by the annotated XML schema. A separate working group draft details the list of properties that is currently being considered for the full specification language [DFDL-Prop].
Describes the low-level characteristics of how the bytes of a particular value are stored.
| Property | Values | Description |
|---|---|---|
| byteSize | integer (≥0) | Number of bytes occupied by data (excluding initiator, etc.; default is special). |
| byteOrder | “bigEndian” or “littleEndian” | Representation of multi-byte values (default: “bigEndian”). |
| encoding | string | IANA character set name or the special constant “bytes” (the default) with one-byte character units with value from 0 through 255. |
The byteSize property default value is special:
For the primitive fixed-length XML schema types it has a default value: 1 for xs:byte, 2 for xs:short, 4 for xs:int and xs:float, and 8 for xs:long and xs:double.
For values computed using the special “value” pseudo-property it has the default value of 0.
In other cases there is no default.
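These defaults, together with the byteOrder property, can be sketched in Python as follows. The sketch assumes that a computed value (the “value” pseudo-property) takes precedence over the type-based default, which the text does not state explicitly; the function names are hypothetical.

```python
# Default byteSize for the primitive fixed-length XML schema types.
DEFAULT_BYTE_SIZE = {
    "xs:byte": 1, "xs:short": 2, "xs:int": 4,
    "xs:float": 4, "xs:long": 8, "xs:double": 8,
}


def default_byte_size(xs_type, has_value_property=False):
    """Return the default byteSize, or None when there is no default."""
    if has_value_property:      # assumption: computed values read no data
        return 0
    return DEFAULT_BYTE_SIZE.get(xs_type)


def read_integer(data, offset, xs_type, byte_order="bigEndian"):
    """Decode one signed integer type and return (value, next offset)."""
    if xs_type not in ("xs:byte", "xs:short", "xs:int", "xs:long"):
        raise ValueError("not an integer type: " + xs_type)
    size = default_byte_size(xs_type)
    order = "little" if byte_order == "littleEndian" else "big"
    chunk = data[offset:offset + size]
    return int.from_bytes(chunk, order, signed=True), offset + size


print(read_integer(b"\x01\x00", 0, "xs:short", "littleEndian"))  # (1, 2)
```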
Constrain the textual format of values. The strings and regular expressions will match the decoded character string (which with the “bytes” encoding are just the raw bytes).
| Property | Values | Description |
|---|---|---|
| length | integer | Length of data in character units (default from facet). |
| pattern | string | Regular expression (following XML schema) for matching the textual representation of the value. |
| patternGroup | integer | The “group” of the regular expression that actually contains the value, or 0 (the default) to specify that the regular expression matches the entire value. |
The length and pattern properties correspond to the XML schema facets of the same name, which are explained in more detail in the XML schema standard. As a special case, if the encoding property is set to a character set (not “bytes”) then the length and pattern properties inherit their values from the corresponding XML Schema facets.
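The effect of pattern and patternGroup can be approximated with Python's re module. This is only a sketch: Python's regular expression dialect is not identical to the XML schema one, the function name match_value is hypothetical, and the pattern is assumed to have already gone through brace processing (so “{{10}}” has become “{10}”).

```python
import re


def match_value(text, pattern, pattern_group=0):
    """Match pattern at the start of text.

    Returns (value, consumed) where value is the selected group and consumed
    is the number of characters matched by the whole pattern, or None.
    """
    m = re.match(pattern, text)
    if m is None:
        return None
    return m.group(pattern_group), m.end()


# Group 1 holds the value; the closing bracket and newline are consumed too.
print(match_value("homes]\ncomment=Home Directories\n", r"([a-z]+)\]\n", 1))
```

The distinction between the value and the consumed length matters when patternGroup is not 0: the delimiters matched by the rest of the pattern are read from the data but do not become part of the value.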
Describes how a sequence of data is grouped into “units”.
|
Property |
Values |
Description |
|---|---|---|
|
initiator |
string |
Required before first unit (default: empty). |
|
separator |
string |
Required between units (default: empty). |
|
terminator |
string |
Required (or allowed) after all units (default: empty). |
|
finalTerminatorCanBeMissing |
boolean |
Whether the terminator can be omitted if redundant (default: false). |
These strings match when they contain the same character units as the formatted data; in case the used character encoding allows non-XML characters (such is certainly the case for the bytes pseudo-encoding) the DFDL character escapes should be used.
We present some further examples.
Here is our example again, this time done in the fully modular style that will fit larger scale XML schema descriptions extended into DFDL specifications by using type restrictions; this is somewhat overkill in this example but hopefully it illustrates the idea.
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
xmlns:data="http://dataformat.org/">
<!-- Generic COPYBOOK representation rules -->
<xs:annotation>
<xs:appinfo source="http://dataformat.org/">
<data:defaults>
<data:format useRestrictions="ebcdic-string ebcdic-short child-count"/>
</data:defaults>
</xs:appinfo>
</xs:annotation>
<xs:simpleType name="ebcdic-string" data:encoding="ebcdic-cp-us">
<xs:restriction base="xs:string"/>
</xs:simpleType>
<xs:simpleType name="ebcdic-short" data:encoding="ebcdic-cp-us">
<xs:restriction base="xs:short"/>
</xs:simpleType>
<xs:simpleType name="child-count" data:value="{count(*)}">
<xs:restriction base="count"/>
</xs:simpleType>
<!-- COPYBOOK types -->
<xs:simpleType name="last-name">
<xs:restriction base="xs:string">
<xs:length value="20"/>
</xs:restriction>
</xs:simpleType>
<xs:simpleType name="first-name">
<xs:restriction base="xs:string">
<xs:length value="15"/>
</xs:restriction>
</xs:simpleType>
<xs:simpleType name="age">
<xs:restriction base="xs:short">
<xs:length value="3"/>
</xs:restriction>
</xs:simpleType>
<xs:simpleType name="phone">
<xs:restriction base="xs:string">
<xs:length value="10"/>
<xs:pattern value="\d{10}"/>
</xs:restriction>
</xs:simpleType>
<xs:simpleType name="count">
<xs:restriction base="xs:short">
<xs:minInclusive value="0"/>
<xs:maxInclusive value="9999"/>
</xs:restriction>
</xs:simpleType>
<xs:element name="copybook">
<xs:complexType>
<xs:sequence minOccurs="0" maxOccurs="unbounded">
<xs:element name="CUSTOMER-RECORD">
<xs:complexType>
<xs:sequence>
<xs:element name="CUSTOMER-LAST-NAME" type="last-name"/>
<xs:element name="CUSTOMER-FIRST-NAME" type="first-name"/>
<xs:element name="CUSTOMER-AGE" type="age"/>
<xs:element name="CUSTOMER-PHONE" type="phone"/>
</xs:sequence>
</xs:complexType>
</xs:element>
</xs:sequence>
<xs:attribute name="CUSTOMER-RECORD-COUNT" type="count"/>
</xs:complexType>
</xs:element>
</xs:schema>
This is fully modular: the entire data format information is at the top, and the rest of the XML Schema is merely a reasonable XML representation of the COPYBOOK.
The top annotation proclaims that throughout the document (because of data:defaults) the data format implied by the type restrictions “ebcdic-string”, “ebcdic-short”, and “child-count” applies.
These type restrictions immediately follow, and we note that the first two restrict the XML schema simple types xs:string and xs:short to the EBCDIC character set. The third, “child-count”, restricts every occurrence of an attribute or element of the type “count” to be computed as the number of children of the parent.
The rest of the schema is merely a precise modular schema for the COPYBOOK using precise type restrictions with the appropriate facets for each field. Notice how the facet names coincide with the DFDL property names, which makes it easy to choose whether or not to expose them to the schema.
When the DFDL engine processes the structure at the bottom of the specification it will instantiate the “copybook” document element since there is no choice in the matter. Then it will search for an instantiation of the CUSTOMER-RECORD-COUNT attribute. Since the attribute declaration itself has no data format annotation but there is an active “useRestrictions” for the type, we use the data format annotation of the “child-count” restriction, which tells us how to compute the attribute by counting the children of the element with the attribute. This will cause those children to be visited, which in turn causes the engine to attempt to instantiate the children of the first CUSTOMER-RECORD element, of which the first is the CUSTOMER-LAST-NAME. This is a “last-name”, which gets properties from the facets of that type as well as the data format restriction of xs:string, since “last-name” derives from xs:string. And so on...
A classic application is the “file” program that investigates the first few bytes (usually) of a file to determine the file type. Here is the beginning of a DFDL specification to replace the venerable /etc/magic file on un*x systems:
<xs:schema
xmlns:xs="http://www.w3.org/2001/XMLSchema"
elementFormDefault="qualified" attributeFormDefault="unqualified"
xmlns:data="http://dataformat.org/">
<xs:element name="file-description">
<xs:complexType>
<xs:choice>
<xs:element name="java-class" type="xs:string" minOccurs="0" maxOccurs="1"
data:initiator="\#xCA;\#xFE;\#xBA;\#xBE;" data:value="\#;"/>
<xs:element name="zip-archive" type="xs:string" minOccurs="0" maxOccurs="1"
data:initiator="PK\#3;\#4;" data:value="\#;"/>
</xs:choice>
</xs:complexType>
</xs:element>
</xs:schema>
This identifies files starting with the bytes with the hexadecimal values CA, FE, BA, and BE, as Java Class files, and files starting with the two letters “P” and “K” followed by bytes with value 3 and 4 as Zip archives. The key property is that the XML schema dictates that there is a choice between the two elements, and the data format property declarations are used by the DFDL engine to decide between them.
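The way the initiators discriminate between the choices can be sketched in a few lines of Python. This only mimics the outcome for the two elements of the example; the names MAGIC and identify are hypothetical, and rejecting ambiguous data reflects the requirement that at most one choice may apply.

```python
# (element name, initiator bytes) for each branch of the xs:choice.
MAGIC = [
    ("java-class", b"\xCA\xFE\xBA\xBE"),
    ("zip-archive", b"PK\x03\x04"),
]


def identify(data: bytes):
    """Return the name of the single choice whose initiator matches, or None."""
    matches = [name for name, initiator in MAGIC if data.startswith(initiator)]
    if len(matches) > 1:
        raise ValueError("ambiguous choice: " + ", ".join(matches))
    return matches[0] if matches else None


print(identify(b"PK\x03\x04rest of archive"))  # zip-archive
```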
Many applications use configuration files such as “registry files” of the form
[homes]
comment=Home Directories
browseable=no
[printers]
comment=All Printers
browseable=no
Here is an XML version:
<configuration>
  <section>
    <name>homes</name>
    <key><name>comment</name><value>Home Directories</value></key>
    <key><name>browseable</name><value>no</value></key>
  </section>
  <section>
    <name>printers</name>
    <key><name>comment</name><value>All Printers</value></key>
    <key><name>browseable</name><value>no</value></key>
  </section>
</configuration>
This is generated by the DFDL format specification
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
xmlns:data="http://dataformat.org/">
<xs:annotation>
<xs:appinfo source="http://dataformat.org/">
<data:defaults>
<data:format encoding="US-ASCII"/>
</data:defaults>
</xs:appinfo>
</xs:annotation>
<xs:element name="configuration">
<xs:complexType>
<xs:sequence minOccurs="0" maxOccurs="unbounded">
<xs:element name="section" data:initiator="[">
<xs:complexType>
<xs:sequence>
<xs:element name="name" type="xs:string"
data:pattern="([a-z]+)\]\n" data:patternGroup="1"/>
<xs:element name="key" minOccurs="0" maxOccurs="unbounded" data:initiator=" ">
<xs:complexType>
<xs:sequence>
<xs:element name="name" type="xs:string" data:terminator="="/>
<xs:element name="value" type="xs:string" data:terminator=" "/>
</xs:sequence>
</xs:complexType>
</xs:element>
</xs:sequence>
</xs:complexType>
</xs:element>
</xs:sequence>
</xs:complexType>
</xs:element>
</xs:schema>
Our final example illustrates the use of binary records. Consider the following initialized C array:
struct { char c; short s; int i; long l; float f; double d; }
onetwo[2] = {'\1', 1, 1, 1, 1.0, 1.0, '\2', 2, 2, 2, 2.0, 2.0};
Assume we write this data structure on a little-endian machine to create the byte sequence (shown in hexadecimal, one array element per line):
1 1 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 80 3f 0 0 0 0 0 0 f0 3f
2 2 0 2 0 0 0 2 0 0 0 0 0 0 0 0 0 0 40 0 0 0 0 0 0 0 40
The following DFDL specification
<xs:schema
xmlns:xs="http://www.w3.org/2001/XMLSchema"
xmlns:data="http://dataformat.org/">
<xs:annotation>
<xs:appinfo source="http://dataformat.org/">
<data:defaults>
<data:format encoding="bytes" byteOrder="littleEndian"/>
</data:defaults>
</xs:appinfo>
</xs:annotation>
<xs:element name="sextet">
<xs:complexType>
<xs:sequence>
<xs:element name="group" minOccurs="0" maxOccurs="unbounded">
<xs:complexType>
<xs:sequence>
<xs:element name="byte" type="xs:byte"/>
<xs:element name="short" type="xs:short"/>
<xs:element name="int" type="xs:int"/>
<xs:element name="long" type="xs:long"/>
<xs:element name="float" type="xs:float"/>
<xs:element name="double" type="xs:double"/>
</xs:sequence>
</xs:complexType>
</xs:element>
</xs:sequence>
</xs:complexType>
</xs:element>
</xs:schema>
will interpret the byte sequence as the XML data
<sextet>
  <group><byte>1</byte><short>1</short><int>1</int><long>1</long><float>1.0</float><double>1.0</double></group>
  <group><byte>2</byte><short>2</short><int>2</int><long>2</long><float>2.0</float><double>2.0</double></group>
</sextet>
(as always without the spacing).
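The byte sequence of this example can be cross-checked with Python's struct module. The sketch below assumes the layout implied by the byte sequence itself: little-endian, an 8-byte long, and no padding between fields (a C compiler would normally insert padding, so this is a property of the example data rather than of C structs in general). The name RECORD is hypothetical.

```python
import struct

# "<" selects little-endian with standard sizes and no padding:
# b = 1-byte integer, h = short, i = int, q = 8-byte long, f = float, d = double.
RECORD = struct.Struct("<bhiqfd")

data = RECORD.pack(1, 1, 1, 1, 1.0, 1.0) + RECORD.pack(2, 2, 2, 2, 2.0, 2.0)

# Hexadecimal dump in the same style as the byte sequence shown earlier.
hex_dump = " ".join(format(b, "x") for b in data)

# Decode the stream group by group, as the DFDL specification would.
groups = [RECORD.unpack_from(data, offset)
          for offset in range(0, len(data), RECORD.size)]

print(RECORD.size)  # 27
print(groups)
```

Each unpacked tuple corresponds to one group element of the XML view, with the fields in schema order.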
To run the above examples you can download the “Virtual XML Garden” release [Virtual] from IBM alphaWorks. Once installed, you can invoke the DFDL engine from the command line with something like this:
java... 'v:dfdl("dfdl-schema.xsd", v:file("binary-file"))'
assuming “java...” invokes the Java runtime with the virtual XML program and the proper class path and that the command is executed in the same directory as the dfdl-schema.xsd and binary-file that you wish to use.
Our prototype implementation is subject to some limitations compared to the language described in the referenced draft [DFDL]:
We only include a very minimal set of data representation properties; specifically, all are byte oriented and there are no good provisions for skipping and reordering data. This is intentional as we wish to encourage experimentation with using DFDL for just the initial step of getting the data in “some” XML form and then using XML processing for further transformation.
The “encoding” property captures the combination of the draft's “repType” and “charset” properties.
The “format” directive captures the draft's “dataFormat”, “useType”, and “configuration” directives.
The “format” directive implies no scoping or inheritance of the defined properties.
We have kept the “defaults” directive separate as it is, in fact, completely orthogonal to the DFDL annotations and could be generic across many annotation languages.
Annotations on XML schema restrictions are not presently handled; in particular, there is no good way to generate values with enumeration restrictions.
There is currently no way to write embedded XPath expressions that generate non-XML character codes.
XPath expressions are evaluated with the context node set to the parent of the node to be generated, and references to the node to be constructed are disallowed. The current draft suggests using the (not yet constructed) new node as the context node; however, the exact meaning of this seems hard to define.
It is currently not possible to specify that the value of a property is the empty string.
There are some restrictions on the use of XML Schema that are imposed by our DFDL engine to ensure that the engine can deterministically construct a unique document:
Mixed content is not supported.
There is no good way to specify the sequence of multiple attributes beyond what can be done with the initiator and guard format properties.
Every choice, including among possible document (root) elements, must be guarded in such a way that at most one choice is applicable for any concrete data. (The choice can either be explicit through XML schema constraining facets or implicit from DFDL directives that restrict the possible data that can generate each choice, such as the “initiator” property.)
Other than that there are no restrictions on the use of complex XML schema constructions...
The “Defuddle” project (http://sourceforge.net/projects/defuddle) started by researchers from the Pacific Northwest National Laboratory also implements a variant of DFDL.