A modern, accurate HTML parser and serializer for PHP

HTML DOM serialization tests

The format of these tests is essentially the format of html5lib's tree construction tests in reverse. There are, however, important differences, so the format is documented in full here.

Each file containing serialization tests consists of any number of tests separated by two newlines (LF), with a single newline before the end of the file. For instance:

[TEST]LF
LF
[TEST]LF
LF
[TEST]LF

where [TEST] has the following format:

Each test begins with a line reading #document or #fragment; subsequent lines represent the document or document fragment (respectively) used as input, until a line is encountered which reads #output, #script-on, or #script-off.

Each DOM node in the input is written on its own line beginning with the characters "| " (a vertical bar followed by a single space); lines which begin with other characters are a continuation of the previous line. Attributes are treated as distinct nodes and have their own entries. There is no escape mechanism: all input is literal, including newlines and quotation marks. Two spaces are used to denote each level of nesting. For example:

| node
|   child node
continuation of child node
|     grandchild node
|   child node
|     attribute node of child
|     grandchild node
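
The line rules above can be sketched in Python as follows. This is only an illustration of the format, not code from the library itself; the function name `parse_entries` and the (depth, text) tuple layout are assumptions made for this example.

```python
from typing import List, Tuple


def parse_entries(block: str) -> List[Tuple[int, str]]:
    """Split an input block into (depth, text) pairs, one per node entry."""
    entries: List[Tuple[int, str]] = []
    for line in block.split("\n"):
        if line.startswith("| "):
            # A new entry: count the spaces after the "| " marker.
            body = line[2:]
            text = body.lstrip(" ")
            indent = len(body) - len(text)
            # Two spaces denote one level of nesting.
            entries.append((indent // 2, text))
        elif entries:
            # Any other line continues the previous entry; the newline
            # between them is literal data.
            depth, text = entries[-1]
            entries[-1] = (depth, text + "\n" + line)
        else:
            raise ValueError("continuation line before any entry")
    return entries
```

Because there is no escape mechanism, a continuation line that itself begins with "| " cannot be told apart from a new entry; the sketch follows the format and treats it as a new entry.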

The different types of nodes are:

  • Element nodes in the form <body> for an element in the HTML namespace, or <svg svg> for an element in a foreign namespace. Qualified names are written as usual, e.g. <math math:math>, though such elements are not produced by the parser.
  • Attribute nodes in the form id="value" or e.g. xml xml:id="value", with a quotation mark immediately followed by a newline marking the end of the attribute value (in other words, attribute values may contain literal quotation marks)
  • Text nodes in the form "text data"; like attributes, only a quotation mark followed by a newline marks the end of text data
  • Comment nodes of the form <!-- comment data -->; the space characters are padding and are not part of the comment data
  • Document type nodes in the form <!DOCTYPE html "public" "system">, or <!DOCTYPE html> or simply <!DOCTYPE> depending on its contents
  • Processing instructions in the form <?target PI data>. Processing instructions are not generated by the HTML parser, but may appear in documents by other means
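
As a rough illustration of how the node forms listed above might be told apart, here is a minimal Python sketch. The function name `classify_node` and the returned labels are hypothetical, and the checks are heuristics: they assume the entry has already had its "| " marker and indentation removed, and they ignore pathological cases such as attribute names that begin with a quotation mark or "<".

```python
def classify_node(text: str) -> str:
    """Guess the node type of one entry from its leading and trailing characters."""
    if text.startswith('"'):
        # Text data runs up to the final quotation mark of the entry.
        if len(text) >= 2 and text.endswith('"'):
            return "text"
        raise ValueError(f"unterminated text node: {text!r}")
    # "<!-- " and " -->" are padding around the comment data, so the
    # shortest valid comment entry is 9 characters long.
    if len(text) >= 9 and text.startswith("<!-- ") and text.endswith(" -->"):
        return "comment"
    if text.startswith("<!DOCTYPE") and text.endswith(">"):
        return "doctype"
    if text.startswith("<?") and text.endswith(">"):
        return "processing instruction"
    if text.startswith("<") and text.endswith(">"):
        return "element"
    # Attributes: name="value", where the value may itself contain quotes.
    start = text.find('="')
    if start != -1 and text.endswith('"') and start + 2 <= len(text) - 1:
        return "attribute"
    raise ValueError(f"unrecognized entry: {text!r}")
```

The order of the checks matters: comments, document types and processing instructions all begin with "<", so they are tested before the generic element form.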

Namespaces are represented by the following short names:

Name    URL
xml     http://www.w3.org/XML/1998/namespace
xmlns   http://www.w3.org/2000/xmlns/
xlink   http://www.w3.org/1999/xlink
math    http://www.w3.org/1998/Math/MathML
svg     http://www.w3.org/2000/svg

Other namespaces may also appear; these should be interpreted as literal URLs.
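
The short-name lookup could be sketched in Python along these lines. The names `NAMESPACES` and `parse_element_label` are hypothetical, and returning `None` to stand for "no namespace written, i.e. the HTML namespace" is a choice made for this example only.

```python
from typing import Optional, Tuple

# Short names as listed in the table above.
NAMESPACES = {
    "xml": "http://www.w3.org/XML/1998/namespace",
    "xmlns": "http://www.w3.org/2000/xmlns/",
    "xlink": "http://www.w3.org/1999/xlink",
    "math": "http://www.w3.org/1998/Math/MathML",
    "svg": "http://www.w3.org/2000/svg",
}


def parse_element_label(label: str) -> Tuple[Optional[str], str]:
    """Split an element entry such as "<svg svg>" into (namespace URL, name)."""
    inner = label[1:-1]  # drop the surrounding "<" and ">"
    prefix, sep, name = inner.partition(" ")
    if not sep:
        # No namespace written: the element is in the HTML namespace.
        return (None, inner)
    # Unknown names are interpreted as literal namespace URLs.
    return (NAMESPACES.get(prefix, prefix), name)
```

For instance, `parse_element_label("<math math:math>")` yields the MathML URL paired with the qualified name `math:math`.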

After the input block either #script-on or #script-off may appear. These signal that the test should be run with scripting on or off, respectively. If neither line is present, the test should be run in both modes.

Finally, #output marks the beginning of the output. All subsequent text consists of literal characters, up to two consecutive newlines followed by either #document or #fragment, or up to the end of the file.
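
Putting the section markers together, a test file might be read roughly as in the Python sketch below. The names `split_tests` and `parse_test` and the dictionary keys are assumptions for illustration. A blank line only ends a test when it is followed by #document or #fragment, since text nodes and output may contain blank lines of their own.

```python
import re
from typing import Dict, List, Optional


def split_tests(data: str) -> List[str]:
    """Split file contents into individual tests."""
    if data.endswith("\n"):
        data = data[:-1]  # the single newline before the end of the file
    # Split only where a blank line is followed by a test's first line.
    return re.split(r"\n\n(?=#(?:document|fragment)\n)", data)


def parse_test(test: str) -> Dict[str, Optional[object]]:
    """Break one test into its kind, input block, scripting mode and output."""
    lines = test.split("\n")
    kind = lines[0]
    if kind not in ("#document", "#fragment"):
        raise ValueError(f"unexpected first line: {kind!r}")
    # The input block runs until one of the three marker lines.
    i = 1
    while i < len(lines) and lines[i] not in ("#output", "#script-on", "#script-off"):
        i += 1
    input_block = "\n".join(lines[1:i])
    # None means "run in both modes".
    scripting: Optional[bool] = None
    if i < len(lines) and lines[i] != "#output":
        scripting = lines[i] == "#script-on"
        i += 1
    if i >= len(lines) or lines[i] != "#output":
        raise ValueError("missing #output line")
    # Everything after #output is literal.
    output = "\n".join(lines[i + 1:])
    return {"kind": kind, "input": input_block, "scripting": scripting, "output": output}
```

Since the format has no escaping, an input continuation line that reads exactly #output, or output that contains a blank line followed by #document, would be misread by this sketch.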

Below is a complete example:

#document
| <!-- This is longer than most tests -->
| <!DOCTYPE html "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
| <html>
|   lang="en"
|   <head>
|   <body>
|     style="font-family: "Times New Roman""
|     <svg svg>
|       xml xml:id="image"
|     <div>
|       "This is a text node.
It has an embedded newline. It is in fact pretty "busy" and has
multiple newlines.

And even a blank line."
|       <!-- This comment also
has a newline -->