Data Format Description Language (DFDL) Overview

Kristoffer H. Rose

IBM T. J. Watson Research Center, P.O.Box 704, Yorktown Heights, NY 10598, USA
krisrose@us.ibm.com
November 8, 2005

This is an overview of the variant of the Data Format Description Language (DFDL, pronounced “daffodil”) supported by the “Virtual XML Garden” release [Virtual] from IBM alphaWorks. After the introduction (with goals, terms, and references) we illustrate the basic principles of DFDL through an example (keeping it simple), go through the data format directives and the data format properties that can be set before we revisit the example (with bells and whistles). Finally, we explain how to run the examples and how our description diverges from published DFDL drafts and current working group discussions.

Introduction

The Data Format Description Language (DFDL) is being developed by the appropriately named Data Format Description Language Working Group [DFDL-WG] of the Global Grid Forum (GGF). The purpose of DFDL [DFDL] is to define mappings between non-XML formatted data (in byte strings) and XML data structures [XML]. This is achieved by specifying the desired XML structure through an XML schema [Schema0] which is then augmented with “annotations” that give the details of how the formatted data is to be mapped into XML. DFDL is by design limited to fairly “direct” mappings to make sure that implementations can be used for converting both ways between the XML and non-XML form.

Goals

The language described in this document is a variant that tries to stay close to the latest draft [DFDL] while allowing experiments with and research of some specific goals:

  1. Integrate strongly with XPath [XPath2].

  2. Have a completely uniform and orthogonal treatment of all data format properties.

  3. Support a simple lexical scoping mechanism.

Terms

The following terms will be used freely in the following.

References

DFDL: Mike Beckerle, Data Format Description Language (DFDL), A Proposal, Working Draft, 2005-03-30, available from [DFDL-WG].

DFDL-Prop: Michael Beckerle, DFDL Representation Properties, 2004, available from [DFDL-WG].

DFDL-WG: Martin Westhead, Alan Chappell, and Mike Beckerle, Data Format Description Language Working Group (http://forge.gridforum.org/projects/dfdl-wg/).

DM: Mary Fernández, Ashok Malhotra, Jonathan Marsh, Marton Nagy, and Norman Walsh, XQuery 1.0 and XPath 2.0 Data Model (http://www.w3.org/TR/xpath-datamodel/), World Wide Web Consortium Working Draft, 2005.

Schema0: David C. Fallside and Priscilla Walmsley, XML Schema Part 0: Primer (http://www.w3.org/TR/xmlschema-0/), World Wide Web Consortium Recommendation, 2004.

Virtual: Kristoffer Rose, Lionel Villard, Achille Fokoué, Rajeshwari Rajendra, Paul Castro, Christopher Holtz, William Li, Virtual XML Garden (http://www.alphaWorks.ibm.com/tech/virtualxml), IBM alphaWorks, 2005.

XML: Tim Bray, Jean Paoli, C. M. Sperberg-McQueen, and Eve Maler, Extensible Markup Language (XML) 1.0 (http://www.w3.org/TR/REC-xml), World Wide Web Consortium Recommendation, 1999.

XPath2: Anders Berglund, Scott Boag, Don Chamberlin, Mary F. Fernández, Michael Kay, Jonathan Robie, and Jérôme Siméon, XML Path Language (XPath) 2.0 (http://www.w3.org/TR/xpath20/), World Wide Web Consortium Working Draft, 2005.

Introductory example

In this section we explain how a simple legacy data format can be described using DFDL.

One of the most common formats for legacy data exchange is the COBOL “copybook” record storage format. A COBOL copybook looks like the following:

00572 ****************************************************
00572 *  COBOL COPYBOOK - CUSTOMERS
00572 *  DATA FOR CUSTOMER TABLE
00572 ****************************************************
00572  77 CUSTOMER-RECORD-COUNT   PIC 9999.
00573  01 CUSTOMER-RECORD.
00574     05 CUSTOMER-LAST-NAME   PIC X(20).
00575     05 CUSTOMER-FIRST-NAME  PIC X(15).
00576     05 CUSTOMER-AGE         PIC 999.
00577     05 CUSTOMER-PHONE       PIC 9(10).

This specific copybook can be read as follows:

Here is a simple XML schema with a structure equivalent to such a collection of records:

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
 <xs:element name="copybook">
  <xs:complexType>
   <xs:sequence>
    <xs:element name="CUSTOMER-RECORD" minOccurs="0" maxOccurs="9999">
     <xs:complexType>
      <xs:sequence>
       <xs:element name="CUSTOMER-LAST-NAME" type="xs:string"/>
       <xs:element name="CUSTOMER-FIRST-NAME" type="xs:string"/>
       <xs:element name="CUSTOMER-AGE" type="xs:int"/>
       <xs:element name="CUSTOMER-PHONE" type="xs:string"/>
      </xs:sequence>
     </xs:complexType>
    </xs:element>
   </xs:sequence>
   <xs:attribute name="CUSTOMER-RECORD-COUNT" type="xs:int"/>
  </xs:complexType>
 </xs:element>
</xs:schema>

We have translated the last field as a string rather than an integer because phone numbers are not really numbers (where leading zeros are ignored, etc.), and we have manifested the record count as an attribute on the container element.

The purpose of DFDL is to define a mapping that provides an XML view of real copybook instances, so that XML tools can be used to read, update, and write the copybook records. DFDL achieves this by leveraging the information in the XML schema, adding annotations with the additional details of the actual data format.

Here is our XML schema from before with a few extra annotations (in bold) to make the mapping between copybook records and XML unambiguous; this is now a DFDL specification:

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns:data="http://dataformat.org/">
 <xs:element name="copybook">
  <xs:complexType>
   <xs:sequence>
    <xs:element name="CUSTOMER-RECORD"
                minOccurs="0" maxOccurs="9999">
     <xs:complexType>
      <xs:sequence>
       <xs:element name="CUSTOMER-LAST-NAME" type="xs:string"
         data:encoding="ebcdic-cp-us" data:length="20"/>
       <xs:element name="CUSTOMER-FIRST-NAME" type="xs:string"
         data:encoding="ebcdic-cp-us" data:length="15"/>
       <xs:element name="CUSTOMER-AGE" type="xs:int"
         data:encoding="ebcdic-cp-us" data:length="3"/>
       <xs:element name="CUSTOMER-PHONE" type="xs:string"
         data:encoding="ebcdic-cp-us" data:pattern="\d{{10}}"/>
      </xs:sequence>
     </xs:complexType>
    </xs:element>
   </xs:sequence>
   <xs:attribute name="CUSTOMER-RECORD-COUNT" type="xs:int"
         data:value="{count(CUSTOMER-RECORD)}"/>
  </xs:complexType>
 </xs:element>
</xs:schema>

The bold “data [format] properties” make the mapping to the actual records precise. Here's how it works:

That is all: using this DFDL specification the following data with two copybook records

ROSE                KRISTOFFER     0402025555555         
ROSE                SOFUS          0060000000000         

(as one contiguous stream of the underlined characters in EBCDIC encoding) will be viewed as if it were the following XML document (where we have preserved the underline to indicate characters that were copied out of the actual record):

<copybook CUSTOMER-RECORD-COUNT="2">
 <CUSTOMER-RECORD>
  <CUSTOMER-LAST-NAME>ROSE                </CUSTOMER-LAST-NAME>
  <CUSTOMER-FIRST-NAME>KRISTOFFER     </CUSTOMER-FIRST-NAME>
  <CUSTOMER-AGE>40</CUSTOMER-AGE>
  <CUSTOMER-PHONE>2025555555</CUSTOMER-PHONE>
 </CUSTOMER-RECORD>
 <CUSTOMER-RECORD>
  <CUSTOMER-LAST-NAME>ROSE                </CUSTOMER-LAST-NAME>
  <CUSTOMER-FIRST-NAME>SOFUS          </CUSTOMER-FIRST-NAME>
  <CUSTOMER-AGE>6</CUSTOMER-AGE>
  <CUSTOMER-PHONE>0000000000</CUSTOMER-PHONE>
 </CUSTOMER-RECORD>
</copybook>

The DFDL engine generates this view by making some simple assumptions:

The details are explained in the following two sections.
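
To make the mapping concrete, the following Python sketch hand-codes what the annotations of the introductory example ask a DFDL engine to do for this one format. It is not part of the Virtual XML Garden implementation: the function name parse_copybook and the dictionary output are our own illustration, we assume that Python's cp037 codec is an acceptable stand-in for “ebcdic-cp-us”, and we assume that records follow each other with no padding in between.

```python
import re

PHONE = re.compile(r"\d{10}")  # plays the role of the data:pattern on CUSTOMER-PHONE


def parse_copybook(data: bytes) -> dict:
    """Decode consecutive fixed-width customer records (illustrative only)."""
    text = data.decode("cp037")  # assumption: cp037 stands in for ebcdic-cp-us
    records = []
    pos = 0
    while pos < len(text):
        last_name = text[pos:pos + 20]         # data:length="20"
        first_name = text[pos + 20:pos + 35]   # data:length="15"
        age = int(text[pos + 35:pos + 38])     # data:length="3", converted to xs:int
        phone = PHONE.match(text, pos + 38)    # pattern match decides the field length
        if phone is None:
            raise ValueError(f"no phone number at offset {pos + 38}")
        records.append({
            "CUSTOMER-LAST-NAME": last_name,
            "CUSTOMER-FIRST-NAME": first_name,
            "CUSTOMER-AGE": age,
            "CUSTOMER-PHONE": phone.group(),
        })
        pos = phone.end()
    # The record count is computed from the records, not read from the data.
    return {"CUSTOMER-RECORD-COUNT": len(records), "CUSTOMER-RECORD": records}
```

Note that the padding blanks of the name fields are kept, and that the age loses its leading zeros only because it is converted to an integer.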

Data format directives

The core of DFDL is the data format directive, supplemented by a single auxiliary directive, the defaults declaration directive. Both directives follow the XML schema convention of being embedded in “application information” annotations in the XML schema, like this:

<xs:annotation>
 <xs:appinfo source="http://dataformat.org/">
  Directives
 </xs:appinfo>
</xs:annotation>

We first explain the data format directive with its variations, and then the defaults directive.

Basic data formats

Data format properties are the primary means to guide the mapping between non-XML and XML. Data format annotations are only allowed on XML schema particles, specifically element, attribute, and type declarations. Any number of data format properties can be bound with the directive

<data:format Property="Value" ... />

The Property must be either one of the pseudo-properties (there are five: name, ref, useRestrictions, value, and guard, explained in subsequent subsections), or one of the generic data format properties explained in Section 4. Such properties are only in effect for the directly annotated XML schema particle. There is no scoping (except as introduced by the “defaults” directive, discussed below).

The Value string used for data format property definitions must be a non-empty string which is subject to two conversions before it is used:

  1. Non-XML character literals, identified by “\#”.

  2. Embedded XPath computation, identified by single braces “{” and “}”.

These are explained below.

We refer to the XML schema component with the property as “the annotated schema component”.

If an XML schema particle has more than one data format in it then the first data format is used. (The guard pseudo-property, explained below, can be used to change this.)

As an alternative notation you can use attributes directly on XML schema components (as was done in the illustrative example): attribute declarations

data:Property="Value" ...

are a short-hand for adding the following annotation to the XML schema component with the attribute:

<xs:annotation><xs:appinfo source="http://dataformat.org/">
  <data:format Property="Value" ... />
</xs:appinfo></xs:annotation>

Non-XML character literals

Non-XML (and other) characters can be inserted into property value strings with the following special DFDL character escape syntax:

Note that this is in addition to the use of XML numeric character entities of the form “&#D...;” because XML character entities can only represent the subset of the Unicode characters that are explicitly allowed by the XML standard [XML]. In addition, the DFDL escape conversion happens after the XML entities are interpreted, so “\&#35;65;” is the same as “\#65;”, which is then interpreted by DFDL as “A”, whereas “\#38;#65;” is interpreted as the literal string “&#65;”. In any case, it only makes sense to use character codes that are actually supported by the character encoding in use.

The “file magic” example below illustrates the use of non-XML character escapes.
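
The following Python sketch shows one way such escapes could be expanded. It covers only the three forms that actually appear in this document's examples (a decimal code as in “\#65;”, a hexadecimal code as in “\#xCA;”, and the ignored literal “\#;”); the function name expand_escapes is hypothetical and the real engine may accept more forms than this.

```python
import re

# "\#" followed by an optional decimal or x-prefixed hexadecimal code, then ";"
_ESCAPE = re.compile(r"\\#(?:x([0-9A-Fa-f]+)|([0-9]+))?;")


def expand_escapes(value: str) -> str:
    """Replace DFDL character escapes in a property value (after XML parsing)."""
    def replace(match: re.Match) -> str:
        if match.group(1) is not None:      # hexadecimal form, e.g. \#xCA;
            return chr(int(match.group(1), 16))
        if match.group(2) is not None:      # decimal form, e.g. \#65;
            return chr(int(match.group(2)))
        return ""                           # \#; is the ignored literal
    return _ESCAPE.sub(replace, value)
```

Because the expansion runs on the attribute value after the XML parser has resolved entities, it reproduces the two cases discussed above: the text “\#65;” becomes “A”, while “\#38;#65;” becomes the five characters “&#65;”.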

Embedded XPath computation

XPath computations can be inserted into property value strings using curly brace characters “{” and “}” as follows:

XPath computation happens after XML entities have been interpreted. This means that “&#123;” and “&#125;” are completely equivalent to “{” and “}”, respectively. The XPath expression is not subject to DFDL character escapes, so “\#123;” and “\#125;” are only equivalent to “{{” and “}}” outside of XPath expressions. To include a single brace inside the XPath expression you must use the corresponding double brace. When in doubt, do not hesitate to insert the ignored literal “\#;”.

The embedded XPath expressions should follow the XPath syntax from the specification [XPath2]. The XPath expression is computed in an XPath context defined as follows:

Be aware that it is possible to create infinite loops in the XPath computation since computations can depend on other computations through the use of variables and by the XPath navigating to peer nodes. (This is similar to “recalculate” loops in spreadsheets.)

The computed attribute CUSTOMER-RECORD-COUNT in the introductory example illustrates the use of an embedded XPath.
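
The brace conventions can be sketched as a small splitter that separates a property value into literal text and XPath expressions. This is our own illustration under the rules stated in this subsection (single braces delimit an expression, doubled braces stand for one literal brace both inside and outside expressions); the function name split_value is hypothetical, and the expressions are only extracted, not evaluated.

```python
def split_value(value: str) -> list[tuple[str, str]]:
    """Split a property value into ("literal", text) and ("xpath", expr) parts."""
    parts = []
    buffer = []
    kind = "literal"
    i = 0
    while i < len(value):
        ch = value[i]
        if value[i:i + 2] in ("{{", "}}"):
            buffer.append(ch)            # a doubled brace is one literal brace
            i += 2
        elif ch == "{" and kind == "literal":
            if buffer:
                parts.append((kind, "".join(buffer)))
            buffer, kind = [], "xpath"   # a single "{" opens an expression
            i += 1
        elif ch == "}" and kind == "xpath":
            parts.append((kind, "".join(buffer)))
            buffer, kind = [], "literal"  # a single "}" closes it
            i += 1
        elif ch in "{}":
            raise ValueError(f"unbalanced brace at offset {i}")
        else:
            buffer.append(ch)
            i += 1
    if kind == "xpath":
        raise ValueError("unterminated XPath expression")
    if buffer:
        parts.append((kind, "".join(buffer)))
    return parts
```

Applied to the two values from the introductory example, the pattern “\d{{10}}” yields the single literal “\d{10}”, and “{count(CUSTOMER-RECORD)}” yields the single expression “count(CUSTOMER-RECORD)”.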

Named data formats

A data format directive can be named by including the special name pseudo-property:

<data:format ... name="FormatName" ... />

Named data format directives are only allowed in annotations directly on the XML schema declaration itself. Such a named data format is then available to other data format directives by the ref pseudo-property:

<data:format ... ref="FormatName" ... />

Searching for a property in a data format with a ref pseudo-property first searches for the property in the format itself and then searches the referenced named format.

Here is our XML schema from before, rewritten to use a named data format:

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns:data="http://dataformat.org/">
 <xs:annotation>
  <xs:appinfo source="http://dataformat.org/">
   <data:format name="ebcdic" encoding="ebcdic-cp-us"/>
  </xs:appinfo>
 </xs:annotation>

 <xs:element name="copybook">
  <xs:complexType>
   <xs:sequence>
    <xs:element name="CUSTOMER-RECORD"
                minOccurs="0" maxOccurs="9999">
     <xs:complexType>
      <xs:sequence>
       <xs:element name="CUSTOMER-LAST-NAME" type="xs:string"
         data:ref="ebcdic" data:length="20"/>
       <xs:element name="CUSTOMER-FIRST-NAME" type="xs:string"
         data:ref="ebcdic" data:length="15"/>
       <xs:element name="CUSTOMER-AGE" type="xs:int"
         data:ref="ebcdic" data:length="3"/>
       <xs:element name="CUSTOMER-PHONE" type="xs:string"
         data:ref="ebcdic" data:pattern="\d{{10}}"/>
      </xs:sequence>
     </xs:complexType>
    </xs:element>
   </xs:sequence>
   <xs:attribute name="CUSTOMER-RECORD-COUNT" type="xs:int"
         data:value="{count(CUSTOMER-RECORD)}"/>
  </xs:complexType>
 </xs:element>
</xs:schema>

The named data format directive has been lifted into the annotation of the top-level XML schema element, as required for named declarations. Since we are not using the “defaults” DFDL directive no scoping is in effect, so we have to explicitly bind the data format to every value where it is needed. Notice how we used the “attribute shorthand” instead of full annotations.

Guarded data formats

Including the pseudo-property guard makes the data format optional:

<data:format ... guard="Value" ... />

The value must evaluate to either “true” or “false”: If the value of the guard property is “true” then the data format is considered, if it is “false” then the data format is ignored. As a special rule, XML schema components that have only data formats with “false” guard values are not allowed to be instantiated at all. This is mostly useful for guards written as embedded XPath expressions, i.e., using the form

<data:format ... guard=”{boolean(Test)}” .../>

where the conversion rules will correctly translate the Test to either “true” or “false”.

Type-specific formats

Special support is available for reusing simple XML schema types. Including the pseudo-property

<data:format ... useRestrictions="Name1 ... Namen" ... />

(with Name1 ... Namen a space-separated list of type names) tells the DFDL engine that for each XML schema declaration of the form

<xs:simpleType name="Namei">
 <xs:annotation>
  <xs:appinfo source="http://dataformat.org/">
   Directives
  </xs:appinfo>
 </xs:annotation>
 <xs:restriction base="BaseType"/>
</xs:simpleType>

(where Namei is one of the names in Name1 ... Namen), every use of BaseType as the type of an attribute or element should be interpreted as a use of the type Namei, which in particular implies that it should use the Directives as if they were specified directly for the BaseType.

To avoid inconsistent specifications, the Directives cannot in turn include data formats which (re)define the useRestrictions pseudo-property.

Scoped defaults

Finally, the “defaults” directive

<data:defaults>
  Directives
</data:defaults>

where Directives should be individual data format directives, introduces scoping: the data format directives will be searched (in order) for properties that are not already found for any XML schema component contained within the XML schema component annotated with the defaults directive. (In case defaults directives are nested then this implies that they are searched bottom-up following usual lexical scoping conventions.)

To summarize, here are the rules for resolving the value of a data format property associated with a given XML schema component with a particular target type:

  1. First search for a local value: If the component has any “format” annotations that define the property (either directly or by reference to a named data format, recursively), then resolve to the first such value found in a data format annotation (in schema document order) that does not include a guard property that is “false”, if any.

  2. If no value was found, search for one in the type of attribute and element declarations, as follows:

    a. First search for a type restriction value for simple types: resolve the pseudo-property useRestrictions recursively (using just rules 1 and 3) into a list of type names Name1...Namen. Then check if the target type is one of the types BaseType1...BaseTypen that Name1...Namen are restrictions of. If so, resolve to the value obtained by applying rules 1 and 3 to the type declaration component.

    b. If no type restriction value was found, search for a type local value by applying rules 1 and 3 to the type declaration component.

    c. If neither a type restriction nor a type local value was found but the type is derived, repeat the type subrules a–c recursively for the base type; otherwise no property value can be found for the type.

  3. If neither a local nor a type value was found, search for a defaults value: repeat the rules recursively for formats in the nearest enclosing XML schema component with a data:defaults annotation.

While these rules are not complex, be warned that mixing nested “defaults” directives with type declarations and “useRestrictions” pseudo-properties can lead to complex format specifications. Keep it simple!
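
As a rough illustration of the resolution order, the Python sketch below implements rules 1 and 3 only (local formats with ref and guard, then enclosing defaults); the type rules of step 2 are deliberately left out. The Component class, the dictionary representation of a format, and the assumption that guards have already been evaluated to the strings “true” or “false” are all ours and do not reflect the data structures of the actual engine.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Component:
    """A schema component with its own formats and any data:defaults formats."""
    formats: list = field(default_factory=list)    # in schema document order
    defaults: list = field(default_factory=list)   # formats inside data:defaults
    parent: Optional["Component"] = None


def lookup_in_format(fmt: dict, prop: str, named: dict) -> Optional[str]:
    """Search the format itself, then the named formats it references via ref."""
    seen = set()
    while fmt is not None:
        if prop in fmt:
            return fmt[prop]
        ref = fmt.get("ref")
        if ref is None or ref in seen:   # stop at the end of a chain or on a cycle
            return None
        seen.add(ref)
        fmt = named.get(ref)
    return None


def first_value(formats: list, prop: str, named: dict) -> Optional[str]:
    """First value in document order, skipping formats whose guard is false."""
    for fmt in formats:
        if fmt.get("guard") == "false":
            continue
        value = lookup_in_format(fmt, prop, named)
        if value is not None:
            return value
    return None


def resolve(component: Component, prop: str, named: dict) -> Optional[str]:
    value = first_value(component.formats, prop, named)   # rule 1: local value
    scope = component.parent
    while value is None and scope is not None:            # rule 3: nearest defaults
        value = first_value(scope.defaults, prop, named)
        scope = scope.parent
    return value
```

Walking the parent chain outwards gives the bottom-up search order of nested defaults directives: an inner defaults directive shadows an outer one for every property it defines.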

Data format properties

This section lists the few DFDL data format properties that our prototype supports to determine what (if any) data should be read to create instances of the elements, attributes, and simple values declared by the annotated XML schema. A separate working group draft details the list of properties currently being considered for the full specification language [DFDL-Prop].

Byte properties

Describes the low-level characteristics of how the bytes of a particular value are stored.

| Property | Values | Description |
|---|---|---|
| byteSize | integer (≥0) | Number of bytes occupied by data (excluding initiator, etc.; default is special). |
| byteOrder | “bigEndian” or “littleEndian” | Representation of multi-byte values (default: “bigEndian”). |
| encoding | string | IANA character set name or the special constant “bytes” (the default) with one-byte character units with values from 0 through 255. |
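
The effect of the byteOrder and encoding properties can be illustrated with a few lines of Python. This is a sketch of the idea only: the helper names are hypothetical, integers are assumed to be signed, and we assume that Python's latin-1 codec (which maps each byte 0 through 255 to the character with the same code) behaves like the “bytes” pseudo-encoding.

```python
def decode_integer(raw: bytes, byte_order: str = "bigEndian") -> int:
    """Interpret raw bytes as a signed integer in the given DFDL byte order."""
    order = {"bigEndian": "big", "littleEndian": "little"}[byte_order]
    return int.from_bytes(raw, order, signed=True)


def decode_text(raw: bytes, encoding: str = "bytes") -> str:
    """Decode raw bytes to character units; "bytes" keeps one unit per byte."""
    # Assumption: latin-1 is used as a stand-in for the "bytes" pseudo-encoding.
    return raw.decode("latin-1" if encoding == "bytes" else encoding)
```

The same two bytes 01 00 therefore read as 256 with the default “bigEndian” order and as 1 with “littleEndian”.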

The byteSize property default value is special:

Character properties

Constrain the textual format of values. The strings and regular expressions will match the decoded character string (which with the “bytes” encoding are just the raw bytes).

| Property | Values | Description |
|---|---|---|
| length | integer | Length of data in character units (default from facet). |
| pattern | string | Regular expression (following XML schema) for matching the textual representation of the value. |
| patternGroup | integer | The “group” of the regular expression that actually contains the value, or 0 (the default) to specify that the regular expression matches the entire value. |

The length and pattern properties correspond to the XML schema facets of the same name, which are explained in more detail in the XML schema standard. As a special case, if the encoding property is set to a character set (not “bytes”) then the length and pattern properties inherit their values from the corresponding XML Schema facets.
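
The interplay of pattern and patternGroup can be sketched in Python as follows. The helper match_value is hypothetical, and Python's re module is only an approximation of the XML schema regular expression dialect, so this should be read as an illustration of the idea rather than of the engine's exact matching rules.

```python
import re
from typing import Optional


def match_value(text: str, pos: int, pattern: str,
                pattern_group: int = 0) -> Optional[tuple[str, int]]:
    """Match pattern at pos; return (value, new position) or None on failure.

    The whole match is consumed from the input, but only the selected group
    (0 meaning the entire match) becomes the value.
    """
    match = re.compile(pattern).match(text, pos)
    if match is None:
        return None
    return match.group(pattern_group), match.end()
```

With the pattern “([a-z]+)\]\n” and patternGroup 1, as used for section names in the configuration file example, the input “homes]” followed by a newline produces the value “homes” while the bracket and the newline are consumed without becoming part of the value.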

Grouping properties

Describes how a sequence of data is grouped into “units”.

| Property | Values | Description |
|---|---|---|
| initiator | string | Required before first unit (default: empty). |
| separator | string | Required between units (default: empty). |
| terminator | string | Required (or allowed) after all units (default: empty). |
| finalTerminatorCanBeMissing | boolean | Whether the terminator can be omitted if redundant (default: false). |

These strings match when they contain the same character units as the formatted data; in case the used character encoding allows non-XML characters (such is certainly the case for the bytes pseudo-encoding) the DFDL character escapes should be used.
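
A minimal Python sketch of how the four grouping properties could cooperate is shown below. It makes simplifying assumptions that the text above does not state: the complete group is already available as one decoded string, and the units themselves never contain the separator or terminator. The function name split_units is hypothetical.

```python
def split_units(text: str, initiator: str = "", separator: str = "",
                terminator: str = "",
                final_terminator_can_be_missing: bool = False) -> list[str]:
    """Split one delimited group into its units (illustrative only)."""
    if not text.startswith(initiator):
        raise ValueError("missing initiator")
    body = text[len(initiator):]
    if terminator and body.endswith(terminator):
        body = body[:-len(terminator)]           # drop the terminator
    elif terminator and not final_terminator_can_be_missing:
        raise ValueError("missing terminator")
    # An empty separator means the group holds a single unit.
    return body.split(separator) if separator else [body]
```

For example, with initiator “[”, separator “,” and terminator “]”, the text “[a,b,c]” splits into the three units a, b, and c.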

More examples

We present some further examples.

The COPYBOOK, revisited

Here is our example again, this time done in the fully modular style that will fit larger scale XML schema descriptions extended into DFDL specifications by using type restrictions; this is somewhat overkill in this example but hopefully it illustrates the idea.

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns:data="http://dataformat.org/">
 
 <!-- Generic COPYBOOK representation rules -->
 <xs:annotation>
  <xs:appinfo source="http://dataformat.org/">
   <data:defaults>
    <data:format useRestrictions="ebcdic-string ebcdic-short child-count"/>
   </data:defaults>
  </xs:appinfo>
 </xs:annotation>
 
 <xs:simpleType name="ebcdic-string" data:encoding="ebcdic-cp-us">
  <xs:restriction base="xs:string"/>
 </xs:simpleType>
 
 <xs:simpleType name="ebcdic-short" data:encoding="ebcdic-cp-us">
  <xs:restriction base="xs:short"/>
 </xs:simpleType>
 
 <xs:simpleType name="child-count" data:value="{count(*)}">
  <xs:restriction base="count"/>
 </xs:simpleType>

 <!-- COPYBOOK types -->
 <xs:simpleType name="last-name">
  <xs:restriction base="xs:string">
   <xs:length value="20"/>
  </xs:restriction>
 </xs:simpleType>
 
 <xs:simpleType name="first-name">
  <xs:restriction base="xs:string">
   <xs:length value="15"/>
  </xs:restriction>
 </xs:simpleType>
 
 <xs:simpleType name="age">
  <xs:restriction base="xs:short">
   <xs:length value="3"/>
  </xs:restriction>
 </xs:simpleType>
 
 <xs:simpleType name="phone">
  <xs:restriction base="xs:string">
   <xs:length value="10"/>
   <xs:pattern value="\d{10}"/>
  </xs:restriction>
 </xs:simpleType>
 
 <xs:simpleType name="count">
  <xs:restriction base="xs:short">
   <xs:minInclusive value="0"/>
   <xs:maxInclusive value="9999"/>
  </xs:restriction>
 </xs:simpleType>
 
 <xs:element name="copybook">
  <xs:complexType>
  
   <xs:sequence minOccurs="0" maxOccurs="unbounded">
    <xs:element name="CUSTOMER-RECORD">
     <xs:complexType>
     
      <xs:sequence>
       <xs:element name="CUSTOMER-LAST-NAME" type="last-name"/>
       <xs:element name="CUSTOMER-FIRST-NAME" type="first-name"/>
       <xs:element name="CUSTOMER-AGE" type="age"/>
       <xs:element name="CUSTOMER-PHONE" type="phone"/>
      </xs:sequence>
      
     </xs:complexType>
    </xs:element>
   </xs:sequence>
   
   <xs:attribute name="CUSTOMER-RECORD-COUNT" type="count"/>
   
  </xs:complexType>
 </xs:element>

</xs:schema>

This is fully modular: the entire data format information is at the top (in bold) and the rest of the XML Schema is merely a reasonable XML representation of the COPYBOOK.

The top annotation proclaims that throughout the document (because of data:defaults) the data formats implied by the type restrictions “ebcdic-string”, “ebcdic-short”, and “child-count” apply. These type restrictions immediately follow, and we note that the first two restrict the XML schema simple types xs:string and xs:short to the EBCDIC character set. The third, “child-count”, restricts every occurrence of an attribute or element of the type “count” to be computed as the number of children of the parent.

The rest of the schema is merely a precise modular schema for the COPYBOOK using precise type restrictions with the appropriate facets for each field. Notice how the facet names coincide with the DFDL property names to ease exposing them to the schema or not. When the DFDL engine processes the structure at the bottom of the specification it will instantiate the “copybook” document element since there is no choice in the matter. Then it will search for an instantiation of the CUSTOMER-RECORD-COUNT attribute. Since the attribute declaration itself has no data format annotation but there is an active “useRestrictions” for the type, we use the data format annotation of the “child-count” restriction, which tells us how to compute the attribute by counting the children of the element with the attribute. This will cause those children to be visited, which in turn causes the engine to attempt to instantiate the children of the first CUSTOMER-RECORD element, of which the first is the CUSTOMER-LAST-NAME. This is a “last-name”, which gets properties from the facets of that type as well as the data format restriction of xs:string since “last-name” derives from xs:string. And so on...

File magic

A classic application is the “file” program that investigates the first few bytes (usually) of a file to determine the file type. Here is the beginning of a DFDL specification to replace the venerable /etc/magic file on un*x systems:

<xs:schema
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    elementFormDefault="qualified" attributeFormDefault="unqualified"
    xmlns:data="http://dataformat.org/">

 <xs:element name="file-description">
  <xs:complexType>
   <xs:choice>
    <xs:element name="java-class" type="xs:string" minOccurs="0" maxOccurs="1"
        data:initiator="\#xCA;\#xFE;\#xBA;\#xBE;" data:value="\#;"/>
    <xs:element name="zip-archive" type="xs:string" minOccurs="0" maxOccurs="1"
        data:initiator="PK\#3;\#4;" data:value="\#;"/>
   </xs:choice>
  </xs:complexType>
 </xs:element>
</xs:schema>

This identifies files starting with the bytes with the hexadecimal values CA, FE, BA, and BE, as Java Class files, and files starting with the two letters “P” and “K” followed by bytes with value 3 and 4 as Zip archives. The key property is that the XML schema dictates that there is a choice between the two elements, and the data format property declarations are used by the DFDL engine to decide between them.
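
The decision procedure that this specification implies can be sketched in a few lines of Python. The table MAGIC and the function describe are our own illustration; returning None when no alternative (or more than one) applies is an assumption about how an unguarded choice would be reported.

```python
from typing import Optional

# Element name and initiator bytes for each alternative of the xs:choice.
MAGIC = [
    ("java-class", b"\xca\xfe\xba\xbe"),
    ("zip-archive", b"PK\x03\x04"),
]


def describe(data: bytes) -> Optional[str]:
    """Return the name of the single alternative whose initiator matches."""
    matches = [name for name, initiator in MAGIC if data.startswith(initiator)]
    # Exactly one alternative must apply for the choice to be deterministic.
    return matches[0] if len(matches) == 1 else None
```

Only the initiator is inspected here, which mirrors the specification: the element value itself is the empty “\#;” literal and consumes no data.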

Configuration files

Many applications use configuration files such as “registry files” of the form

[homes]
 comment=Home Directories
 browseable=no
[printers]
 comment=All Printers
 browseable=no

Here is an XML version:

<configuration>
 <section>
  <name>homes</name>
  <key><name>comment</name><value>Home Directories</value></key>
  <key><name>browseable</name><value>no</value></key>
 </section>
 <section>
  <name>printers</name>
  <key><name>comment</name><value>All Printers</value></key>
  <key><name>browseable</name><value>no</value></key>
 </section>
</configuration>

This is generated by the DFDL format specification

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns:data="http://dataformat.org/">
 <xs:annotation>
  <xs:appinfo source="http://dataformat.org/">
   <data:defaults>
    <data:format encoding="US-ASCII"/>
   </data:defaults>
  </xs:appinfo>
 </xs:annotation>

 <xs:element name="configuration">
  <xs:complexType>
   <xs:sequence minOccurs="0" maxOccurs="unbounded">
    <xs:element name="section" data:initiator="[">
     <xs:complexType>
      <xs:sequence>
       <xs:element name="name" type="xs:string"
          data:pattern="([a-z]+)\]\n" data:patternGroup="1"/>
       <xs:element name="key" minOccurs="0" maxOccurs="unbounded" data:initiator=" ">
        <xs:complexType>
         <xs:sequence>
          <xs:element name="name" type="xs:string" data:terminator="="/>
          <xs:element name="value" type="xs:string" data:terminator="&#10;"/>
         </xs:sequence>
        </xs:complexType>
       </xs:element>
      </xs:sequence>
     </xs:complexType>
    </xs:element>
   </xs:sequence>
  </xs:complexType>
 </xs:element>
</xs:schema>
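
For comparison, here is a hand-written Python parser that follows the same steps as the annotations in this specification: a “[” initiator for each section, the name pattern with its group, a single blank as initiator for each key, and the “=” and newline terminators. It is a sketch for illustration only; the function name and the list-of-dictionaries result are our own choices, not the output format of the engine.

```python
import re

SECTION_NAME = re.compile(r"([a-z]+)\]\n")   # data:pattern with patternGroup 1


def parse_configuration(text: str) -> list[dict]:
    """Parse registry-style text into sections with (name, value) keys."""
    sections = []
    pos = 0
    while text.startswith("[", pos):          # section initiator
        name = SECTION_NAME.match(text, pos + 1)
        if name is None:
            raise ValueError(f"bad section name at offset {pos + 1}")
        section = {"name": name.group(1), "keys": []}
        pos = name.end()
        while text.startswith(" ", pos):      # key initiator
            equals = text.index("=", pos + 1)         # terminator of key name
            newline = text.index("\n", equals + 1)    # terminator of key value
            section["keys"].append((text[pos + 1:equals],
                                    text[equals + 1:newline]))
            pos = newline + 1
        sections.append(section)
    if pos != len(text):
        raise ValueError(f"unparsed data at offset {pos}")
    return sections
```

Applied to the six-line registry file above, this yields two sections, “homes” and “printers”, each with the keys “comment” and “browseable”.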

C structure

Our final example illustrates the use of binary records. Consider the following initialized C array:

struct { char c; short s; int i; long l; float f; double d; }
  onetwo[2] = {'\1', 1, 1, 1, 1.0, 1.0, '\2', 2, 2, 2, 2.0, 2.0};

Assume we write this data structure on a little-endian machine to create the byte sequence:

1 1 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 80 3f 0 0 0 0 0 0 f0 3f
2 2 0 2 0 0 0 2 0 0 0 0 0 0 0 0 0  0 40 0 0 0 0 0 0  0 40

The following DFDL specification

<xs:schema
  xmlns:xs="http://www.w3.org/2001/XMLSchema"
  xmlns:data="http://dataformat.org/">
 <xs:annotation>
  <xs:appinfo source="http://dataformat.org/">
   <data:defaults>
    <data:format encoding="bytes" byteOrder="littleEndian"/>
   </data:defaults>
  </xs:appinfo>
 </xs:annotation>

 <xs:element name="sextet">
  <xs:complexType>
   <xs:sequence>
    <xs:element name="group" minOccurs="0" maxOccurs="unbounded">
     <xs:complexType>
      <xs:sequence>
       <xs:element name="byte" type="xs:byte"/>
       <xs:element name="short" type="xs:short"/>
       <xs:element name="int" type="xs:int"/>
       <xs:element name="long" type="xs:long"/>
       <xs:element name="float" type="xs:float"/>
       <xs:element name="double" type="xs:double"/>
      </xs:sequence>
     </xs:complexType>
    </xs:element>
   </xs:sequence>
  </xs:complexType>
 </xs:element>
</xs:schema>

will interpret the byte sequence as the XML data

<sextet>
  <group><byte>1</byte><short>1</short><int>1</int><long>1</long><float>1.0</float><double>1.0</double></group>
  <group><byte>2</byte><short>2</short><int>2</int><long>2</long><float>2.0</float><double>2.0</double></group>
</sextet>

(as always without the spacing).
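
The same interpretation can be cross-checked with Python's standard struct module. In the format string below, “<” selects little-endian byte order without padding, and the codes b, h, i, q, f, d stand for fields of 1, 2, 4, 8, 4 and 8 bytes, which matches the 27 bytes per record shown in the byte sequence; the field names are taken from the schema, while the function name decode_sextet is hypothetical.

```python
import struct

# Little-endian, unpadded: byte, short, int, 8-byte long, float, double.
RECORD = struct.Struct("<bhiqfd")
FIELDS = ("byte", "short", "int", "long", "float", "double")

DATA = bytes.fromhex(
    "01 01 00 01 00 00 00 01 00 00 00 00 00 00 00"
    " 00 00 80 3f 00 00 00 00 00 00 f0 3f"
    " 02 02 00 02 00 00 00 02 00 00 00 00 00 00 00"
    " 00 00 00 40 00 00 00 00 00 00 00 40"
)


def decode_sextet(data: bytes) -> list[dict]:
    """Decode consecutive 27-byte records into one dictionary per group."""
    return [dict(zip(FIELDS, values)) for values in RECORD.iter_unpack(data)]
```

Decoding DATA gives two groups whose fields are all 1 (respectively 2), in agreement with the XML view above.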

Using the implementation

To run the above examples you can download the “Virtual XML Garden” release [Virtual] from IBM alphaWorks. Once installed, you can invoke the DFDL engine from the command line with something like this:

java... 'v:dfdl("dfdl-schema.xsd", v:file("binary-file"))'

assuming “java...” invokes the Java runtime with the virtual XML program and the proper class path and that the command is executed in the same directory as the dfdl-schema.xsd and binary-file that you wish to use.

Limitations and the relationship to DFDL working group drafts

Our prototype implementation is subject to some limitations compared to the language described in the referenced draft [DFDL]:

  1. We only include a very minimal set of data representation properties, all of them byte oriented and with no good provisions for skipping and reordering data. This is intentional as we wish to encourage experimentation with using DFDL for just the initial step of getting the data in “some” XML form and then using XML processing for further transformation.

  2. The “encoding” property captures the combination of the draft's “repType” and “charset” properties.

  3. The “format” directive captures the draft's “dataFormat”, “useType”, and “configuration” directives.

  4. The “format” directive implies no scoping or inheritance of the defined properties.

  5. We have kept the “defaults” directive separate as it is, in fact, completely orthogonal to the DFDL annotations and could be generic across many annotation languages.

  6. Annotations on XML schema restrictions are not presently handled; in particular, there is no good way to generate values with enumeration restrictions.

  7. There is currently no way to write embedded XPath expressions that generate non-XML character codes.

  8. XPath expressions are evaluated with the context node set to the parent of the node to be generated, and references to the node under construction are disallowed. The current draft suggests using the (not yet constructed) new node as the context node; however, the exact meaning of this seems hard to define.

  9. It is currently not possible to specify that the value of a property is the empty string.

  10. There are some restrictions on the use of XML Schema that are imposed by our DFDL engine to ensure that the engine can deterministically construct a unique document:

    1. Mixed content is not supported.

    2. There is no good way to specify the sequence of multiple attributes beyond what can be done with the initiator and guard format properties.

    3. Every choice, including among possible document (root) elements, must be guarded in such a way that at most one choice is applicable for any concrete data. (The choice can either be explicit through XML schema constraining facets or implicit from DFDL directives that restrict the possible data that can generate each choice, such as the “initiator” property.)

    4. Other than that there are no restrictions on the use of complex XML schema constructions...

Other implementations

The “Defuddle” project (http://sourceforge.net/projects/defuddle) started by researchers from the Pacific Northwest National Laboratory also implements a variant of DFDL.


Kristoffer H. Rose, Virtual XML Garden, November 8, 2005.