Modern DOM library written in PHP for HTML documents
Find a file
2021-04-01 09:50:13 -05:00
lib Remove lowercasing of namespace URIs 2021-04-01 09:50:13 -05:00
tests More tests 2021-03-31 15:48:01 -04:00
vendor-bin Update dependencies 2021-03-25 15:13:53 -04:00
.gitattributes Add missing tests for charset pre-scan 2021-03-17 15:35:44 -04:00
.gitignore Added innerHTML to Element, getting of outerHTML, started on setting 2021-03-16 14:37:20 -05:00
AUTHORS Added authors file and updated license 2018-08-03 23:21:15 -05:00
composer.json Update dependencies 2021-03-25 15:13:53 -04:00
composer.lock Update dependencies 2021-03-25 15:13:53 -04:00
LICENSE Added authors file and updated license 2018-08-03 23:21:15 -05:00
README.md Started DOM documentation 2021-03-25 10:42:10 -05:00
robo Basic skeleton of test suite 2019-12-10 18:00:08 -05:00
robo.bat Basic skeleton of test suite 2019-12-10 18:00:08 -05:00
RoboFile.php Don't claim copyright on machine-generated data 2021-03-22 10:26:28 -04:00

HTML

Tools for parsing and printing HTML5 documents and fragments.

<?php
$dom = MensBeam\HTML\Parser::parse('<!DOCTYPE html><html lang="en" charset="utf-8"><head><title>Ook!</title></head><body><h1>Ook!</h1><p>Ook-ook? Oooook. Ook ook oook ook oooooook ook ooook ook.</p><p>Eek!</p></body></html>');
?>

or:

<?php
$dom = new MensBeam\HTML\Document;
$dom->loadHTML('<!DOCTYPE html><html lang="en" charset="utf-8"><head><title>Ook!</title></head><body><h1>Ook!</h1><p>Ook-ook? Oooook. Ook ook oook ook oooooook ook ooook ook.</p><p>Eek!</p></body></html>');
?>

Comparison with masterminds/html5

This library and masterminds/html5 serve similar purposes. Generally, we are more accurate, but they are much faster. The following table summarizes the main functional differences.

DOMDocument Masterminds MensBeam
Minimum PHP version 5.0 5.3 7.1
Extensions required dom dom, ctype, mbstring or iconv dom
Target HTML version HTML 4.01 HTML 5.0 WHATWG Living Standard
Supported encodings System-dependent System-dependent Per specification
Encoding detection BOM, http-equiv None Per specification (Steps 1-5 & 9)
Fallback encoding ISO 8859-1 UTF-8, configurable Windows-1252, configurable
Handling of invalid characters Bytes are passed through Characters are dropped Per specification
Handling of invalid XML element names Variable Name is changed to "invalid" Per specification
Handling of invalid XML attribute names Variable Attribute is dropped Per specification
Handling of misnested tags Parent end tags always close children Parent end tags always close children Per specification
Handling of data between table cells Left as-is Left as-is Per specification
Handling of omitted start tags Elements are not inserted Elements are not inserted Per specification
Handling of processing instructions Processing instructions are retained Processing instructions are retained Per specification
Handling of bogus XLink namespace* Foreign content not supported XLink attributes are lost if preceded by bogus namespace Bogus namespace is ignored
Namespace for HTML elements Null Per specification, configurable Null
Time needed to parse single-page HTML specification 0.5 seconds 2.7 seconds† 6.0 seconds‡
Peak memory needed for same 11.6 MB 38 MB 13.9 MB

* For example: <svg xmlns:xlink='http://www.w3.org/1999/xhtml' xlink:href='http://example.com/'/>. It is unclear what correct behaviour is, but we believe our behaviour to be more consistent with the intent of the specification.

† With HTML namespace disabled. With HTML namespace enabled it does not finish in a reasonable time due to a PHP bug.

‡ With parse errors suppressed. Reporting parse errors adds approximately 10% overhead.

Document Object Model

This library works by parsing HTML strings into PHP's existing XML DOM. It, however, has to force the antiquated PHP DOM extension into working properly with modern HTML DOM by extending many of the node types. The documentation below follows PHP's doc style guide with the exception of inherited methods and properties not being listed. Therefore, only new constants, properties, and methods will be listed; in addition, extended methods which change outward behavior from their parent class will be listed.

MensBeam\HTML\Document

MensBeam\HTML\Document extends \DOMDocument {

    /* Constants */
    public const NO_QUIRKS_MODE = 0;
    public const QUIRKS_MODE = 1;
    public const LIMITED_QUIRKS_MODE = 2;

    /* Properties */
    public string|null $documentEncoding = null;
    public int $quirksMode = 0;

    /* Methods */
    public load ( string $filename , null $options = null , string|null $encodingOrContentType = null ) : bool
    public loadHTML ( string $source , null $options = null , string|null $encodingOrContentType = null ) : bool
    public loadHTMLFile ( string $filename , null $options = null , string|null $encodingOrContentType = null ) : bool
    public loadXML ( string $source , null $options = null ) : false
    public save ( string $filename , null $options = null ) : int|false
    public saveXML ( DOMNode|null $node = null , null $options = null ) : false
    public validate ( ) : true
    public xinclude ( null $options = null ) : false

}

Properties

documentEncoding
Encoding of the document, as specified when parsing or when determining encoding type.
quirksMode
Used when parsing. Can be not in quirks mode, quirks mode, or limited quirks mode. See the `MensBeam\HTML\Document` constants to see the valid values.

The following properties inherited from \DOMDocument have no effect on Mensbeam\HTML\Document:

  • actualEncoding
  • config
  • encoding
  • formatOutput
  • preserveWhiteSpace
  • recover
  • resolveExternals
  • standalone
  • substituteEntities
  • validateOnParse
  • version
  • xmlEncoding
  • xmlStandalone
  • xmlVersion