Copyright ©2000 W3C® (MIT, INRIA, Keio), All Rights Reserved. W3C liability, trademark, document use and software licensing rules apply.
This document describes a subset of the information contained in an XML document and a syntax for expressing that subset. This syntax, called Canonical XML, is designed to encode the logical structure of XML documents; two XML documents whose Canonical-XML form is identical will be considered equivalent for the purposes of many applications.
The XML Core Working Group, with this 19 January 2000 Infoset Last Call working draft, invites comment on this specification. The Last Call period ends on 22 February 2000.
The W3C Membership and other interested parties are invited to review the specification and report implementation experience. Please send comments to [email protected] (archive).
Note: The XML Core Working Group strongly solicits commentary, especially from early implementors of this Working Draft, on the appropriateness of the requirement that Canonical XML be in W3C normalized text form as well. The Working Group has published a minority report on this question at http://lists.w3.org/Archives/Public/www-xml-canonicalization-comments/2000Jan/0000.html. A rationale for the majority viewpoint embodied in this draft has been published at http://lists.w3.org/Archives/Public/www-xml-canonicalization-comments/2000Jan/0001.html.
For background on this work, please see the XML Activity Statement. While we welcome implementation experience reports, the XML Core Working Group will not allow early implementation to constrain its ability to make changes to this specification prior to final release.
A list of current W3C Recommendations and other technical documents can be found at http://www.w3.org/TR.
1 Introduction
2 Information Included in Canonical XML
2.1 The Document Information Item
2.2 Element Information Items
2.3 Attribute Information Items
2.4 Processing Instruction Information Item
2.5 Reference to Skipped Entity Information Items
2.6 Character Information Items
2.7 Comment Information Items
2.8 Document Type Declaration Information Items
2.9 Entity Information Items
2.10 Notation Information Items
2.11 Entity Start Marker Information Items
2.12 Entity End Marker Information Items
2.13 CDATA Start Marker Information Items
2.14 CDATA End Marker Information Items
2.15 Namespace Declaration Information Items
3 Document Type Definition Processing
4 Entity and Reference Processing
5 The Syntax of Canonical XML
5.1 Character Encoding
5.2 Character Escaping
5.3 Prolog
5.4 Epilog
5.5 Elements
5.6 Tags
5.7 Attributes
5.8 Processing Instructions
5.9 Namespaces
A References
B Acknowledgements (Non-Normative)
The XML 1.0 Recommendation [XML] describes the syntax of a class of resources called XML documents. It is possible for XML documents which are equivalent for the purposes of many applications to differ in their physical representation. In particular, they may differ in their entity structure, attribute ordering, and character encoding. This means that much equivalence testing of XML documents cannot be done at the byte-comparison level. This Canonical XML specification aims to introduce a notion of equivalence between XML documents which can be tested at the syntactic level and, in particular, by byte-for-byte comparison. In the syntax it describes, logically equivalent documents are byte-for-byte identical.
The syntax described in this specification is called Canonical XML. XML documents may be transformed into Canonical XML, potentially with some information loss; the result of this transformation is called the canonical form of the original document. Canonical XML is XML: the canonical form of any XML document is itself an XML document.
There are two essential aspects to the specification of Canonical XML:
For the purposes of this specification, the information in an XML document is that described by the XML Information Set Specification [Infoset]. The canonical form of an XML document, which is itself an XML document, also has an information set. This section describes what portion of an XML document's information set is included in that of its canonical form.
Note that information not included in Canonical XML may still be used to produce it. In particular:
The information set of the canonical form includes only the "children" property of the document information item. It does not include any of the peripheral properties of the document information item, nor the "notations" or "entities" properties.
The information set of the canonical form includes the properties: "namespace URI," "local name," "children" and "attributes" from each element information item. It does not include the "declared namespaces" property, nor any of the peripheral properties. Note that the infoset lists the "children" property as including references to skipped entity information items, but the canonical form does not include these.
The information set of the canonical form includes all of the core properties, but none of the peripheral properties, of the attribute information item.
For Processing Instructions appearing outside of the Document Type Definition, the information set of the canonical form includes all of the core properties, but none of the peripheral properties, of the processing instruction information item. For those which appear in the Document Type Definition, the information set of the canonical form includes no Processing Instruction information items.
Reference to skipped entity information items are not included in the information set of the canonical form of a document. Such information items could not appear in Canonical XML because canonicalization requires the reading of declarations for all entities referenced in a document.
The information set of the canonical form includes the core "character code" property of the character information item. None of the peripheral properties of the character information item are included.
The information set of the canonical form does not include comment information items.
The information set of the canonical form does not include document type declaration information items.
The information set of the canonical form does not include entity information items.
The information set of the canonical form does not include notation information items.
The information set of the canonical form does not include entity start marker information items.
The information set of the canonical form does not include entity end marker information items.
The information set of the canonical form does not include CDATA start marker information items.
No CDATA sections occur in the information set of the canonical form. They are not necessary since all syntactically-significant characters in Canonical XML are escaped in the fashion described in this specification.
The information set of the canonical form does not include CDATA end marker information items.
The information set of the canonical form does not include namespace declaration information items.
The process of canonicalizing an XML document depends on its standalone document declaration. If the declaration is present and its value is "yes", then assuming the XML document satisfies the Standalone Document Declaration validity constraint, no external portion of the DTD can contain material which affects its canonical form.
In all other cases, the process of canonicalization requires reading the whole of the DTD. The following information from the DTD affects the canonical form of an XML document:
Note that the process of canonicalization is effectively impossible for a non-standalone document for which some external component of the DTD cannot be retrieved. Implementors of software which is designed to produce Canonical XML should provide an interface to users which allows this error condition to be signaled.
The canonical form of an XML document is standalone.
The canonical form of an XML document contains no general entity references; all such references are expanded so that the canonical form contains only the replacement text. Since it contains no DTD, it also contains no parameter entity references.
Suppose a file named "e1.xml" contains the following text, with no trailing newline (#xA) character.
Hallelujah, I'm a bum!
then if the following XML document is stored in a file in the same directory
<!DOCTYPE d [
<!ENTITY lsb '['>
<!ENTITY rsb ']'>
<!ENTITY bum SYSTEM "e1.xml">
]>
<d>&lsb;&bum;&rsb;</d>
its canonical form is
<d>[Hallelujah, I'm a bum!]</d>
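The expansion of internal general entities can be observed with a short Python sketch. It assumes the standard library parser `xml.etree.ElementTree`, which expands entities declared in the internal DTD subset but does not retrieve external entities such as `bum`; for that reason the external text is written inline here. The helper name `expand_internal_entities` is hypothetical.

```python
import xml.etree.ElementTree as ET


def expand_internal_entities(document: str) -> str:
    """Return the root element's character data with entities expanded."""
    root = ET.fromstring(document)
    return root.text or ""


# Only the internal entities lsb and rsb are declared; the text that the
# example above stores in e1.xml is placed inline (an assumption of this sketch).
source = (
    "<!DOCTYPE d [\n"
    "<!ENTITY lsb '['>\n"
    "<!ENTITY rsb ']'>\n"
    "]>\n"
    "<d>&lsb;Hallelujah, I'm a bum!&rsb;</d>"
)
expanded = expand_internal_entities(source)
```

A full canonicalizer would also need to read the external entity, which this sketch deliberately leaves out.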
This section describes the syntax of Canonical XML. This syntax is a proper subset of the syntax of XML 1.0. The canonical form of an XML document is identical to its original form except as described in this section.
Each Canonical XML document must match the production labeled canonXML in the grammar below, where the notation and the semantics of the word "match" are those described in the XML 1.0 specification.
[1] canonXML ::= (PI #xA)* element #xA (PI #xA)*
[2] element ::= Stag (Datachar | element | PI)* Etag
[3] Stag ::= '<' Name NSDecl? (Att NSDecl?)* '>'
[4] Etag ::= '</' Name '>'
[5] NSDecl ::= #x20 'xmlns:' Prefix '=' '"' Attvalchar* '"'
[6] Att ::= #x20 Name '=' '"' Attvalchar* '"'
[7] Datachar ::= '&amp;' | '&lt;' | '&gt;' | '&#13;' | (Char - ('&' | '<' | '>' | #xD))
[8] Attvalchar ::= '&amp;' | '&lt;' | '&quot;' | '&#9;' | '&#10;' | '&#13;' | (Char - ('&' | '<' | '"' | #x9 | #xA | #xD))
[9] Name ::= (Prefix ':')? NCName
[10] Prefix ::= 'n' [1-9] [0-9]*
[11] PI ::= '<?' PITarget (#x20 (Char+ - (Char* '?>' Char*)))? '?>'
[12] PITarget ::= NCName - (('X' | 'x') ('M' | 'm') ('L' | 'l'))
The remainder of this section expresses additional constraints beyond those expressed in the grammar and provides further explanatory material on key aspects of Canonical XML.
Canonical XML uses UTF-8 in the normalized form recommended by [CharModel] as the character encoding.
For example, consider the following small document:
<?xml version="1.0" encoding="ISO-8859-1"?> <lang>Español</lang>
Since it is encoded in ISO-8859-1 ("ISO Latin-1"), the character "ñ" is represented as #xF1. In Canonical XML, however, that character must be represented using UTF-8 in two bytes whose values are #xC3 and #xB1.
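The byte values mentioned above can be checked with a small Python sketch. The function name `transcode_to_utf8` is hypothetical, and the sketch covers only the re-encoding step, not normalization.

```python
def transcode_to_utf8(data: bytes, source_encoding: str) -> bytes:
    """Decode bytes from the declared encoding and re-encode them as UTF-8."""
    return data.decode(source_encoding).encode("utf-8")


# The character content of the <lang> element, as stored in ISO-8859-1.
latin1_bytes = "Español".encode("iso-8859-1")
utf8_bytes = transcode_to_utf8(latin1_bytes, "iso-8859-1")
```

The ISO-8859-1 form occupies one byte per character, while the UTF-8 form is one byte longer because "ñ" takes two bytes.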
The Unicode standard [Unicode] allows multiple different representations of certain "precomposed characters" (a simple example is "ç"). Thus two XML documents with content that is equivalent for the purposes of most applications may contain differing character sequences. The W3C has recommended a normalized representation [CharModel]. Canonical XML uses this normalized form.
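As a hedged illustration, the following Python sketch assumes that the normalized form recommended by [CharModel] corresponds to Unicode Normalization Form C (NFC); this specification does not name the form, so that correspondence is an assumption. The helper name `normalize_text` is hypothetical.

```python
import unicodedata


def normalize_text(text: str) -> str:
    """Apply NFC, assumed here to be the normalized form of [CharModel]."""
    return unicodedata.normalize("NFC", text)


decomposed = "c\u0327"   # "c" followed by COMBINING CEDILLA
precomposed = "\u00e7"   # the single precomposed character "ç"
normalized = normalize_text(decomposed)
```

Under this assumption, both spellings of "ç" yield the same character sequence and therefore the same UTF-8 bytes.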
Note: The XML Core Working Group strongly solicits commentary, especially from early implementors of this Working Draft, on the appropriateness of this requirement for normalized form. The Working Group has published a minority report on this question at http://lists.w3.org/Archives/Public/www-xml-canonicalization-comments/2000Jan/0000.html. A rationale for the majority viewpoint embodied in this draft has been published at http://lists.w3.org/Archives/Public/www-xml-canonicalization-comments/2000Jan/0001.html.
The XML 1.0 specification requires XML processors to perform certain simple transformations on white-space characters in XML documents, when they serve as line separators and when they appear in attribute values. For each character in the result of the transformation, there will be a character information item as described by the Information Set. For example, in an XML 1.0 document:
All character information items are represented in a Canonical XML document by their UTF-8 encoding, with the following exceptions:
Canonical-XML documents have a prolog which contains only those Processing Instructions appearing before the start-tag of the root element but not within the Document Type Definition. Each PI is followed by a single newline (#xA) character. These PIs and newline characters make up the whole content of the prolog. If there are no such PIs, the first character is the "<" marking the beginning of the root element's start-tag.
For the following XML document
<!DOCTYPE x PUBLIC "myX" "x.dtd" [
<!ENTITY a "aVal">
]>
<x>y</x>
the canonical form is
<x>y</x>
If PIs are involved
<?t1 t1-body ?>
<!DOCTYPE x PUBLIC "myX" "x.dtd" [
<?t2 t2-body ?>
<!ENTITY a "aVal">
]>
<?xml-stylesheet href="mystyle.css" type="text/css" ?>
<?rating mostly-harmless?>
<x>y</x><?t3 ?>
the canonical form is
<?t1 t1-body ?>
<?xml-stylesheet href="mystyle.css" type="text/css" ?>
<?rating mostly-harmless?>
<x>y</x>
<?t3?>
The epilog of all Canonical-XML documents contains a single newline (#xA) character, which immediately follows the ">" marking the end of the root element's end-tag. If the epilog contains Processing Instructions they are preserved in the Canonical-XML epilog, each followed by a newline (#xA) character.
For the following XML document
<x>y</x><?audio stop here ?> <!-- Local variables: mode: xml End: --><?pi?>
the canonical form is
<x>y</x>
<?audio stop here ?>
<?pi?>
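The layout required by production [1], with each prolog PI, the root element, and each epilog PI followed by a single newline (#xA), can be sketched in Python as follows. The function name `assemble_document` is hypothetical, and the PIs and the root element are assumed to be already serialized in canonical form.

```python
def assemble_document(prolog_pis, root, epilog_pis):
    """Join already-canonical pieces following (PI #xA)* element #xA (PI #xA)*."""
    parts = [pi + "\n" for pi in prolog_pis]
    parts.append(root + "\n")          # the root end-tag is always followed by #xA
    parts.extend(pi + "\n" for pi in epilog_pis)
    return "".join(parts)


document = assemble_document([], "<x>y</x>", ["<?audio stop here ?>", "<?pi?>"])
```

When there are no prolog PIs, the first character of the result is the "<" of the root element's start-tag, as the Prolog section requires.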
In Canonical XML, all elements have a start-tag and an end-tag. For elements which have no content, the end-tag follows the start-tag with no intervening characters.
For the following element
<x> <a n="1"/><b n="2"/> <c n="3"/></x>
the canonical form is
<x> <a n="1"></a><b n="2"></b> <c n="3"></c></x>
In Canonical XML, for end-tags and start-tags which contain no attributes, the ">" character closing the tag follows the element type immediately with no intervening white space. Any attributes and namespace declarations follow with each attribute and namespace declaration preceded by one space (#x20) character. When the element type and the attribute names do not have namespaces, the attributes are sorted lexicographically by attribute name (based on Unicode character code points); the ordering when namespaces are present is described in [5.9 Namespaces].
The canonical form of an XML document includes all its attributes, whether provided explicitly or by default in the original document.
For the following element
<x a="Earth" ñ="Wind" z="Fire" >!!</x >
the canonical form is
<x a="Earth" z="Fire" ñ="Wind">!!</x>
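The ordering in this example can be reproduced with a short Python sketch. It handles only names without namespaces and assumes that attribute values are already escaped; the function names `start_tag` and `empty_element` are hypothetical.

```python
def start_tag(name, attributes):
    """Build a start-tag with attributes sorted by Unicode code point.

    `attributes` maps attribute names to values that are assumed to be
    already escaped; namespaces are not handled here.
    """
    parts = ["<" + name]
    # Python compares str values code point by code point, which is the
    # ordering the text above describes.
    for attr_name in sorted(attributes):
        parts.append(f' {attr_name}="{attributes[attr_name]}"')
    parts.append(">")
    return "".join(parts)


def empty_element(name, attributes):
    """An element with no content: the end-tag directly follows the start-tag."""
    return start_tag(name, attributes) + f"</{name}>"


tag = start_tag("x", {"ñ": "Wind", "z": "Fire", "a": "Earth"})
```

Because "ñ" (#xF1) has a higher code point than "z" (#x7A), it sorts last, which matches the canonical form shown above.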
In the canonical form of an XML document, attribute values are normalized in the fashion required of an XML processor.
In Canonical XML, attribute names and values are separated by a single "=" character and no spaces. All attribute values are delimited by double-quote (") characters. Within attribute values, all occurrences of double-quote are replaced by "&quot;".
For the following start-tag
<x a = '"Don&apos;t!", he cried.' b = "'>'">
the canonical form is
<x a="&quot;Don't!&quot;, he cried." b="'>'">
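The escaping implied by productions [7] (Datachar) and [8] (Attvalchar) can be sketched in Python as follows. The sketch assumes that the escaped forms are the predefined entity references and decimal character references ("&#9;", "&#10;", "&#13;"); the function names `escape_data` and `escape_attr` are hypothetical.

```python
# Escapes for character data, production [7]: &, <, > and #xD.
DATA_ESCAPES = {"&": "&amp;", "<": "&lt;", ">": "&gt;", "\r": "&#13;"}

# Escapes for attribute values, production [8]: &, <, ", #x9, #xA and #xD.
# Decimal character references are an assumption of this sketch.
ATTR_ESCAPES = {
    "&": "&amp;", "<": "&lt;", '"': "&quot;",
    "\t": "&#9;", "\n": "&#10;", "\r": "&#13;",
}


def escape_data(text: str) -> str:
    """Escape element character data, one character at a time."""
    return "".join(DATA_ESCAPES.get(ch, ch) for ch in text)


def escape_attr(value: str) -> str:
    """Escape an attribute value for use between double quotes."""
    return "".join(ATTR_ESCAPES.get(ch, ch) for ch in value)


escaped_a = escape_attr('"Don\'t!", he cried.')
escaped_b = escape_attr("'>'")
```

Note that ">" and the apostrophe are left untouched in attribute values, while ">" is escaped in character data.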
In Canonical XML, there is no Document Type Definition and thus no PIs contained in it. PIs which precede and follow the root element are normalized as follows:
PIs which are contained in the content of an element are normalized as follows:
For the following XML document
<?pi1 v1 ?><?pi2 v2 ?><root>Hello <?audio bang! ?> he said.</root><?pi3?>
the canonical form is
<?pi1 v1 ?>
<?pi2 v2 ?>
<root>Hello <?audio bang! ?> he said.</root>
<?pi3?>
In Canonical XML, namespace prefixes always have the form n1, n2 and so on. The positive integer following the n is called the index of the prefix.
A start-tag always contains namespace declarations for exactly those prefixes that are used in the element type and the attribute names occurring in the start-tag. Namespace declarations are never inherited.
NOTE: This approach was chosen so that canonicalization is context-independent: the canonical form of an element is independent of where it occurs in the document.
The default namespace is never used. An attribute name never has the same prefix as the element type or another attribute name. The namespace declaration for a prefix immediately follows the element type or attribute that uses the prefix. Attributes are ordered primarily by the lexicographic order of the namespace URI with which the prefix of the attribute name is associated, and secondarily by the lexicographic order of the local part of the attribute name. A null namespace URI is considered to precede a non-null namespace URI: thus all attributes without prefixes precede all attributes with prefixes.
In the start-tag, namespace prefixes occur in order of prefix index. The index of the first namespace prefix in the start-tag is always 1. The indices of the prefixes occurring in the start-tag are always consecutive integers. Thus if the element type has a prefix, its prefix will be n1; the prefix of the first attribute name in the start-tag that has a prefix will be n2 if the element type has a prefix, and n1 otherwise; for subsequent attributes, the index of the prefix of the attribute name will be one greater than the index of the prefix of the name of the preceding attribute.
For example, for the following element
<doc xmlns:x="http://w3.org/2" xmlns:y="http://w3.org/1"> <x:e a="a"/> <x:e x:a="x:a"/> <e x:a="x:a"/> <e x:a="x:a" y:a="y:a"/> <e x:a="x:a" a="a"/> <e x:a="x:a" x:b="x:b"/> </doc>
the canonical form is
<doc> <n1:e xmlns:n1="http://w3.org/2" a="a"></n1:e> <n1:e xmlns:n1="http://w3.org/2" n2:a="x:a" xmlns:n2="http://w3.org/2"></n1:e> <e n1:a="x:a" xmlns:n1="http://w3.org/2"></e> <e n1:a="y:a" xmlns:n1="http://w3.org/1" n2:a="x:a" xmlns:n2="http://w3.org/2"></e> <e a="a" n1:a="x:a" xmlns:n1="http://w3.org/2"></e> <e n1:a="x:a" xmlns:n1="http://w3.org/2" n2:b="x:b" xmlns:n2="http://w3.org/2"></e> </doc>
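The prefix assignment and attribute ordering described in this section can be sketched in Python as follows. Names are modelled as (namespace URI, local name) pairs with `None` for a null namespace URI, and all values and URIs are assumed to be already escaped; the function name `canonical_start_tag` and this data model are hypothetical.

```python
def canonical_start_tag(element, attributes):
    """Serialize a start-tag with per-tag namespace prefixes n1, n2, ...

    `element` is a (namespace_uri, local_name) pair; `attributes` maps such
    pairs to attribute values. A namespace_uri of None means no namespace.
    """
    index = 0
    ns_uri, local = element
    if ns_uri is None:
        parts = [f"<{local}"]
    else:
        index += 1
        # The declaration immediately follows the name that uses the prefix.
        parts = [f'<n{index}:{local} xmlns:n{index}="{ns_uri}"']

    # Null namespace first, then by namespace URI, then by local name.
    ordered = sorted(
        attributes.items(),
        key=lambda item: (item[0][0] is not None, item[0][0] or "", item[0][1]),
    )
    for (attr_ns, attr_local), value in ordered:
        if attr_ns is None:
            parts.append(f' {attr_local}="{value}"')
        else:
            index += 1  # every prefixed attribute gets a fresh prefix
            parts.append(f' n{index}:{attr_local}="{value}" xmlns:n{index}="{attr_ns}"')
    parts.append(">")
    return "".join(parts)


NS1, NS2 = "http://w3.org/1", "http://w3.org/2"
tag = canonical_start_tag((None, "e"), {(NS2, "a"): "x:a", (NS1, "a"): "y:a"})
```

Since prefixes are assigned afresh in every start-tag and never inherited, the function needs no information about ancestor elements, which reflects the context independence mentioned in the note above.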
The work of producing this specification was accomplished by the membership of the W3C XML Syntax Working Group and its successor, the W3C XML Core Working Group: