# HTML-Parser
A modern, accurate HTML parser and serializer for PHP.
## Usage
### Parsing documents
```php
public static MensBeam\HTML\Parser::parse(
string $data,
?string $encodingOrContentType = null,
?MensBeam\HTML\Parser\Config $config = null
): MensBeam\HTML\Parser\Output
```
The `MensBeam\HTML\Parser::parse` static method is used to parse documents. An arbitrary string and optional encoding are taken as input, and a `MensBeam\HTML\Parser\Output` object is returned as output. The `Output` object has the following properties:
- `document`: A `DOMDocument` object representing the parsed document
- `encoding`: The original character encoding of the document, as supplied by the user or otherwise detected during parsing
- `quirksMode`: The detected "quirks mode" property of the document. This will be one of `Parser::NO_QUIRKS_MODE` (`0`), `Parser::QUIRKS_MODE` (`1`), or `Parser::LIMITED_QUIRKS_MODE` (`2`)
- `errors`: An array containing the list of parse errors emitted during processing if parse error reporting was turned on (see **Configuration** below), or `null` otherwise
Extra configuration parameters may be given to the parser by passing a `MensBeam\HTML\Parser\Config` object as the final `$config` argument. See the **Configuration** section below for more details.
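As a minimal sketch of reading the `Output` properties described above (it assumes the library was installed with Composer, and the input markup is only illustrative):

```php
require "vendor/autoload.php"; // assumption: Composer autoloader is available

use MensBeam\HTML\Parser;

$out = Parser::parse("<!DOCTYPE html><title>Test</title><p>Hello", "UTF-8");

// $out->document is a DOMDocument, so the usual DOM API applies
$title = $out->document->getElementsByTagName("title")[0]->textContent;

// A document with a standard doctype is expected to be in no-quirks mode
$standards = ($out->quirksMode === Parser::NO_QUIRKS_MODE);

// errors is null unless parse error reporting was turned on via Config
$errorCount = is_array($out->errors) ? count($out->errors) : 0;
```

Checking `errors` with `is_array()` keeps the same code usable whether or not error reporting is enabled.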
### Parsing with `DOMParser`
Since version 1.3.0, the library also provides an implementation of [the `DOMParser` interface](https://html.spec.whatwg.org/multipage/dynamic-markup-insertion.html#dom-parsing-and-serialization).
```php
class MensBeam\HTML\DOMParser {
public function parseFromString(
string $string,
string $type
): \DOMDocument
}
```
Like the standard interface, it will parse either HTML or XML documents. This implementation does, however, differ in the following ways:
- Any XML MIME content-type (e.g. `application/rss+xml`) is acceptable, not just the restricted list mandated by the interface
- MIME content-types may include a `charset` parameter to specify an authoritative encoding of the document
- If no `charset` is provided, the encoding will be detected from document hints; the default encoding is `windows-1252` for HTML and `UTF-8` for XML
- `InvalidArgumentException` is thrown in place of JavaScript's `TypeError`
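The differences listed above might look like this in practice. This is a sketch rather than a definitive reference: it assumes a Composer install, the sample markup is made up, and the use of `text/plain` as a rejected type is an assumption based on the standard interface's behaviour.

```php
require "vendor/autoload.php"; // assumption: Composer autoloader is available

use MensBeam\HTML\DOMParser;

$parser = new DOMParser;

// HTML with an authoritative encoding supplied through the charset parameter
$html = $parser->parseFromString("<!DOCTYPE html><p>Hello</p>", "text/html; charset=utf-8");
$text = $html->getElementsByTagName("p")[0]->textContent;

// Any XML content-type should be accepted, not only the standard's short list
$xml = $parser->parseFromString('<rss version="2.0"><channel/></rss>', "application/rss+xml");
$rootName = $xml->documentElement->localName;

// Unsupported types are expected to raise InvalidArgumentException
$rejected = false;
try {
    $parser->parseFromString("plain text", "text/plain");
} catch (\InvalidArgumentException $e) {
    $rejected = true;
}
```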
### Parsing into existing documents
```php
public static MensBeam\HTML\Parser::parseInto(
string $data,
\DOMDocument $document,
?string $encodingOrContentType = null,
?MensBeam\HTML\Parser\Config $config = null
): MensBeam\HTML\Parser\Output
```
The `MensBeam\HTML\Parser::parseInto` static method is used to parse into an existing document. The supplied document must be an instance of (or derived from) `\DOMDocument` and also must be empty. All other arguments are identical to those used when parsing documents normally.
*NOTE:* The `documentClass` configuration option has no effect when using this method.
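A short sketch of parsing into a caller-supplied document follows. `MyDocument` is a hypothetical subclass introduced only for illustration, and a Composer install is assumed.

```php
require "vendor/autoload.php"; // assumption: Composer autoloader is available

use MensBeam\HTML\Parser;

// Hypothetical DOMDocument subclass; any empty DOMDocument instance would do
class MyDocument extends \DOMDocument {}

$document = new MyDocument; // must be empty before parsing
Parser::parseInto("<!DOCTYPE html><p>Hello</p>", $document, "UTF-8");

// The parsed nodes are now part of the supplied document
$text = $document->getElementsByTagName("p")[0]->textContent;
```

Because the caller constructs the document, its class is chosen at instantiation, which is why `documentClass` is irrelevant here.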
### Parsing fragments
```php
public static MensBeam\HTML\Parser::parseFragment(
DOMElement $contextElement,
int $quirksMode,
string $data,
?string $encodingOrContentType = null,
?MensBeam\HTML\Parser\Config $config = null
): DOMDocumentFragment
```
The `MensBeam\HTML\Parser::parseFragment` static method is used to parse document fragments. The primary use case for this method is in the implementation of the `innerHTML` setter of HTML elements. Consequently, a context element is required, as well as the "quirks mode" property of the context element's document (which must be one of `Parser::NO_QUIRKS_MODE` (`0`), `Parser::QUIRKS_MODE` (`1`), or `Parser::LIMITED_QUIRKS_MODE` (`2`)). The remaining arguments are identical to those used when parsing documents.
If the "quirks mode" property of the document is not known, using `Parser::NO_QUIRKS_MODE` (`0`) is usually the best choice.
Unlike the `parse()` method, the `parseFragment()` method returns a `DOMDocumentFragment` object belonging to `$contextElement`'s owner document.
### Serializing nodes
```php
public static MensBeam\HTML\Parser::serialize(
DOMNode $node,
array $config = []
): string
```
```php
public static MensBeam\HTML\Parser::serializeInner(
DOMNode $node,
array $config = []
): string
```
The `MensBeam\HTML\Parser::serialize` method can be used to convert most `DOMNode` objects into strings, using the basic algorithm defined in the HTML specification. Nodes of the following types can be successfully serialized:
- `DOMDocument`
- `DOMElement`
- `DOMText`
- `DOMComment`
- `DOMDocumentFragment`
- `DOMDocumentType`
- `DOMProcessingInstruction`
Similarly, the `MensBeam\HTML\Parser::serializeInner` method can be used to convert the children of non-leaf `DOMNode` objects into strings, using the basic algorithm defined in the HTML specification. Children of nodes of the following types can be successfully serialized:
- `DOMDocument`
- `DOMElement`
- `DOMDocumentFragment`
The serialization methods use an associative array for configuration, and the possible keys and value types are:
- `booleanAttributeValues` (`bool|null`): Whether to include the values of boolean attributes on HTML elements during serialization. Per the standard this is `true` by default
- `foreignVoidEndTags` (`bool|null`): Whether to print the end tags of foreign void elements rather than self-closing their start tags. Per the standard this is `true` by default
- `groupElements` (`bool|null`): Group like "block" elements and insert extra newlines between groups
- `indentStep` (`int|null`): The number of spaces or tabs (depending on the setting of `indentWithSpaces`) to indent at each step. This is `1` by default and has no effect unless `reformatWhitespace` is `true`
- `indentWithSpaces` (`bool|null`): Whether to use spaces or tabs to indent. This is `true` by default and has no effect unless `reformatWhitespace` is `true`
- `reformatWhitespace` (`bool|null`): Whether to reformat whitespace (pretty-print) or not. This is `false` by default
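The two serialization methods and the configuration array can be combined roughly as follows. This is a sketch assuming a Composer install; the markup and the chosen option values are illustrative only.

```php
require "vendor/autoload.php"; // assumption: Composer autoloader is available

use MensBeam\HTML\Parser;

$document = Parser::parse("<!DOCTYPE html><ul><li>One<li>Two</ul>", "UTF-8")->document;
$list = $document->getElementsByTagName("ul")[0];

// Serialize the element itself, start and end tags included
$outer = Parser::serialize($list);

// Serialize only the element's children
$inner = Parser::serializeInner($list);

// Pretty-print the whole document, indenting two spaces per level
$pretty = Parser::serialize($document, [
    'reformatWhitespace' => true,
    'indentWithSpaces'   => true,
    'indentStep'         => 2,
]);
```

Omitted keys fall back to their defaults, so the array only needs to list the options being changed.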
## Examples
- Parsing a document with unknown encoding:
```php
use MensBeam\HTML\Parser;
echo Parser::parse('Hello world!')->encoding;
// prints "windows-1252"
echo Parser::parse('<meta charset="UTF-8">Hello world!')->encoding;
// prints "UTF-8"
```
- Parsing a document with a known encoding:
```php
use MensBeam\HTML\Parser;
echo Parser::parse("\u{3088}", "UTF-8")
->document
->getElementsByTagName("body")[0]
->textContent;
// prints "よ"
echo Parser::parse("\u{3088}", "text/html; charset=utf-8")
->document
->getElementsByTagName("body")[0]
->textContent;
// also prints "よ"
```
- Parsing a document with a different default encoding:
```php
use MensBeam\HTML\Parser;
use MensBeam\HTML\Parser\Config;
$config = new Config;
$config->encodingFallback = "Shift_JIS";
echo Parser::parse("\x82\xE6", null, $config)
->document
->getElementsByTagName("body")[0]
->textContent;
// also also prints "よ"
```
- Parsing document fragments:
```php
use MensBeam\HTML\Parser;
use MensBeam\HTML\Parser\Config;
$config = new Config;
$config->htmlNamespace = true;
// set up two context nodes
$document = Parser::parse("<math></math>", "UTF-8", $config)->document;
$body = $document->getElementsByTagName("body")[0];
$math = $document->getElementsByTagName("math")[0];
echo $body->namespaceURI; // prints "http://www.w3.org/1999/xhtml"
echo $math->namespaceURI; // prints "http://www.w3.org/1998/Math/MathML"
// parse two identical fragments using different context elements
$htmlFragment = Parser::parseFragment($body, 0, " Eek