Added properties to Document

• Added Document::compatMode
• Removed Document::quirksMode
• Added Document::characterSet
• Added Document::charset
• Added Document::inputEncoding
• Added Document::contentType
wrapper-classes
Dustin Wilson 3 years ago
parent
commit
bbe22e2e61
  1. README.md (16 lines changed)
  2. lib/Document.php (66 lines changed)
  3. lib/Element.php (6 lines changed)
  4. tests/cases/TestDocument.php (59 lines changed)
  5. tests/test.html (7 lines changed)

16
README.md

@@ -65,4 +65,18 @@ The primary aim of this library is accuracy. If the document model differs from
 3. While `DOMDocumentType` can be extended and registered by PHP's `DOMDocument::registerNodeClass` `DOMImplementation` cannot; this means that doctypes created with `DOMImplementation::createDocumentType` can't ever be a registered class. Therefore, doctypes remain as `DOMDocumentType` in this library and retain the same limitations as ones in PHP's DOM.
 4. The DOM specification mentions that [`HTMLCollection`][a] has to be kept around for backwards compatibility in browsers, but any new implementations should use [`sequence<T>`][b] instead which is essentially just a typed array object of some kind. Any methods should also return a copy of an object instead of a reference to the platform object, meaning the bane of any web developer's existence -- live lists -- shouldn't be in any new additions to the DOM. Since this implementation is not a fully userland PHP implementation of the DOM but instead an extension of it, this implementation will use `DOMNodeList` where the specification says to use an `HTMLCollection` and an array where the specification says to use a `sequence<T>`. In addition, if the specification states to return a static `NodeList` this implementation will use `MensBeam\\HTML\\DOM\\NodeList` instead; this is because `DOMNodeList` is always live in PHP.
 5. Aside from `HTMLTemplateElement` there are no other specific element classes such as `HTMLAnchorElement`, `HTMLDivElement`, etc. and therefore are no DOM methods and properties that are specific to those elements. Implementing them is possible, but we weighed it against its utility as each specific element slows down the DOM seemingly exponentially especially when parsing serialized HTML because each element has to be converted to the specific variety manually and recursively. For instance, when parsing the WHATWG's single page HTML specification (which is an absurdly enormous HTML document on the very edge of what we should be able to parse) in our tests it takes around 6.5 seconds; with specific element classes it instead takes *15 minutes*. [`phpgt/dom`][c] mitigates this by only converting when querying for elements, but it's still slow. We decided not to go this route.
-6. This implementation will not implement the `NodeIterator` and `TreeWalker` APIs. They are horribly conceived and impractical APIs that few people actually use because it's literally easier to write recursive loops to walk through the DOM than it is to use those APIs. They have instead been replaced with the `ChildNode::moonwalk`, `ParentNode::walk`, `ChildNode::walkFollowing`, and `ChildNode::walkPreceding` generators.
+6. This implementation will not implement the `NodeIterator` and `TreeWalker` APIs. They are horribly conceived and impractical APIs that few people actually use because it's literally easier to write recursive loops to walk through the DOM than it is to use those APIs. They have instead been replaced with the `ChildNode::moonwalk`, `ParentNode::walk`, `ChildNode::walkFollowing`, and `ChildNode::walkPreceding` generators.
+7. Readonly properties inherited from PHP DOM cannot be overridden in this implementation and therefore might produce incorrect data. Below are the properties that will show invalid or useless data along with suggested replacements:
+| property | replacement(s) |
+| ------------------------------- | ------------------------------------------------------------------------ |
+| `Document::documentURI` | `Document::URL` |
+| `Document::actualEncoding` | `Document::characterSet`, `Document::charset`, `Document::inputEncoding` |
+| `Document::encoding` | `Document::characterSet`, `Document::charset`, `Document::inputEncoding` |
+| `Document::preserveWhitespace` | |
+| `Document::recover` | |
+| `Document::resolveExternals` | |
+| `Document::standalone` | |
+| `Document::strictErrorChecking` | |
+| `Document::substituteEntities` | |
+| `Document::validateOnParse` | |

66
lib/Document.php

@@ -19,9 +19,9 @@ class Document extends \DOMDocument implements Node {
     use DocumentOrElement, MagicProperties, ParentNode;
     protected ?Element $_body = null;
-    /** Non-standard */
-    protected ?string $_documentEncoding = null;
-    protected int $_quirksMode = Parser::NO_QUIRKS_MODE;
+    protected string $_charset = 'windows-1252';
+    protected string $_compatMode = 'CSS1Compat';
     protected string $_URL = '';
     /** Non-standard */
     protected ?\DOMXPath $_xpath = null;
@@ -108,12 +108,28 @@ class Document extends \DOMDocument implements Node {
         $this->_body = $value;
     }
-    protected function __get_documentEncoding(): ?string {
-        return $this->_documentEncoding;
+    protected function __get_characterSet(): string {
+        return $this->_charset;
     }
+    protected function __get_charset(): string {
+        return $this->_charset;
+    }
+    protected function __get_compatMode(): string {
+        return $this->_compatMode;
+    }
+    protected function __get_contentType(): string {
+        return 'text/html';
+    }
-    protected function __get_quirksMode(): int {
-        return $this->_quirksMode;
+    protected function __get_inputEncoding(): string {
+        return $this->_charset;
     }
     protected function __get_URL(): string {
         return $this->_URL;
     }
     protected function __get_xpath(): \DOMXPath {
@@ -135,14 +151,14 @@ class Document extends \DOMDocument implements Node {
         parent::registerNodeClass('DOMProcessingInstruction', '\MensBeam\HTML\DOM\ProcessingInstruction');
         parent::registerNodeClass('DOMText', '\MensBeam\HTML\DOM\Text');
-        $this->_documentEncoding = $encoding;
         if ($source !== null) {
             if (is_string($source)) {
                 $this->loadHTML($source, null, $encoding);
             } else {
                 $this->loadDOM($source, $encoding);
             }
+        } elseif ($encoding !== null) {
+            $this->_charset = Charset::fromCharset((string)$encoding) ?? 'windows-1252';
         }
     }
@@ -404,28 +420,34 @@ class Document extends \DOMDocument implements Node {
         $data = stream_get_contents($f);
         $encoding = Charset::fromCharset((string)$encoding) ?? Charset::fromTransport((string)$encoding);
-        if (!$encoding) {
-            $meta = stream_get_meta_data($f);
-            if ($meta['wrapper_type'] === 'http') {
-                // Try to find a Content-Type header field
-                foreach ($meta['wrapper_data'] as $h) {
-                    $h = explode(':', $h, 2);
-                    if (count($h) === 2 && preg_match("/^\s*Content-Type\s*$/i", $h[0])) {
-                        // Try to get an encoding from it
-                        $encoding = Charset::fromTransport($h[1]);
-                        break;
-                    }
+        $meta = stream_get_meta_data($f);
+        $wrapperType = $meta['wrapper_type'];
+        if (!$encoding && $wrapperType === 'http') {
+            // Try to find a Content-Type header field
+            foreach ($meta['wrapper_data'] as $h) {
+                $h = explode(':', $h, 2);
+                if (count($h) === 2 && preg_match("/^\s*Content-Type\s*$/i", $h[0])) {
+                    // Try to get an encoding from it
+                    $encoding = Charset::fromTransport($h[1]);
+                    break;
                 }
             }
         }
+        if ($wrapperType === 'plainfile') {
+            $filename = realpath($filename);
+            $this->_URL = "file://$filename";
+        } else {
+            $this->_URL = $filename;
+        }
         $this->loadHTML($data, null, $encoding);
         return true;
     }
     public function loadDOM(\DOMDocument $source, ?string $encoding = null, int $quirksMode = Parser::NO_QUIRKS_MODE) {
-        $this->_documentEncoding = $encoding;
-        $this->_quirksMode = $quirksMode;
+        $this->_charset = Charset::fromCharset((string)$encoding) ?? 'windows-1252';
+        $this->_compatMode = ($quirksMode === Parser::NO_QUIRKS_MODE || $quirksMode === Parser::LIMITED_QUIRKS_MODE) ? 'CSS1Compat' : 'BackCompat';
         // If there are already-existing child nodes then remove them before loading the
         // DOM.
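
The header scan in `loadHTMLFile` can be tried outside the library. Below is a minimal standalone sketch, assuming the header lines arrive as a plain array of strings, which is the shape `stream_get_meta_data()` reports under `wrapper_data` for HTTP streams. The function name `charsetFromHeaders` and the charset regular expression are hypothetical; the regex is only a simplified stand-in for `Charset::fromTransport`, whose implementation is not part of this diff.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper mirroring the loop in loadHTMLFile: inspect the first
 * Content-Type header line and return its charset label, or null if absent.
 */
function charsetFromHeaders(array $headers): ?string {
    foreach ($headers as $h) {
        // Split "Name: value" once; status lines without a colon are skipped.
        $h = explode(':', $h, 2);
        if (count($h) === 2 && preg_match('/^\s*Content-Type\s*$/i', $h[0])) {
            // Assumption: a simple regex stands in for Charset::fromTransport().
            if (preg_match('/;\s*charset\s*=\s*"?([^";\s]+)"?/i', $h[1], $m)) {
                return $m[1];
            }
            // Like the original loop, stop at the first Content-Type field.
            return null;
        }
    }
    return null;
}
```

Unlike the library call, this sketch returns the raw label from the header and does not normalize it to a canonical encoding name.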

6
lib/Element.php

@@ -41,7 +41,9 @@ class Element extends \DOMElement implements Node {
         # 2. Let fragment be the result of invoking the fragment parsing algorithm with
         # the new value as markup, and with context element.
-        $fragment = Parser::parseFragment($this, $this->ownerDocument->quirksMode, $value, 'UTF-8');
+        $fragment = Parser::parseFragment($this, ($this->ownerDocument->compatMode === 'CSS1Compat') ? Parser::NO_QUIRKS_MODE : Parser::QUIRKS_MODE, $value, 'UTF-8');
         $fragment = $this->ownerDocument->importNode($fragment);
         # 3. If the context object is a template element, then let context object be the
@@ -132,7 +134,7 @@ class Element extends \DOMElement implements Node {
         # 5. Let fragment be the result of invoking the fragment parsing algorithm with
         # the new value as markup, and parent as the context element.
-        $fragment = Parser::parseFragment($parent, $this->ownerDocument->quirksMode, $value, 'UTF-8');
+        $fragment = Parser::parseFragment($parent, ($this->ownerDocument->compatMode === 'CSS1Compat') ? Parser::NO_QUIRKS_MODE : Parser::QUIRKS_MODE, $value, 'UTF-8');
         $fragment = $this->ownerDocument->importNode($fragment);
         # 6. Replace the context object with fragment within the context object's
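
The two conversions in this commit, quirks mode to `compatMode` in `Document::loadDOM` and `compatMode` back to a parser mode in `Element`, can be sketched together. The constant values and the function names `compatModeFor` and `quirksModeFor` are hypothetical stand-ins, since the real `Parser::*_MODE` values are not shown in this diff; only the comparisons mirror the changed lines.

```php
<?php
declare(strict_types=1);

// Hypothetical stand-ins for Parser::NO_QUIRKS_MODE and friends.
const NO_QUIRKS_MODE = 0;
const LIMITED_QUIRKS_MODE = 1;
const QUIRKS_MODE = 2;

// Mirrors the expression added to Document::loadDOM.
function compatModeFor(int $quirksMode): string {
    return ($quirksMode === NO_QUIRKS_MODE || $quirksMode === LIMITED_QUIRKS_MODE)
        ? 'CSS1Compat'
        : 'BackCompat';
}

// Mirrors the expression added to the two Parser::parseFragment calls in Element.
function quirksModeFor(string $compatMode): int {
    return ($compatMode === 'CSS1Compat') ? NO_QUIRKS_MODE : QUIRKS_MODE;
}
```

Under these assumptions the round trip is lossy: a limited-quirks document reports `CSS1Compat`, so mapping back yields no-quirks mode rather than limited-quirks mode.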

59
tests/cases/TestDocument.php

@@ -137,7 +137,8 @@ class TestDocument extends \PHPUnit\Framework\TestCase {
      * @covers \MensBeam\HTML\DOM\Document::loadHTMLFile
      * @covers \MensBeam\HTML\DOM\Document::preInsertionValidity
      * @covers \MensBeam\HTML\DOM\Document::replaceTemplates
-     * @covers \MensBeam\HTML\DOM\Document::__get_quirksMode
+     * @covers \MensBeam\HTML\DOM\Document::__get_compatMode
+     * @covers \MensBeam\HTML\DOM\Document::__get_URL
      * @covers \MensBeam\HTML\DOM\NodeTrait::getRootNode
      */
     public function testDocumentCreation(): void {
@@ -146,9 +147,9 @@ class TestDocument extends \PHPUnit\Framework\TestCase {
         $this->assertSame('MensBeam\HTML\DOM\Document', get_class($d));
         $this->assertSame(null, $d->firstChild);
-        // Test string source
+        // Test compatibility mode
         $d = new Document('<html><body>Ook!</body></html>');
-        $this->assertSame(Parser::QUIRKS_MODE, $d->quirksMode);
+        $this->assertSame('BackCompat', $d->compatMode);
         // Test DOM source
         $d = new \DOMDocument();
@@ -177,18 +178,25 @@ class TestDocument extends \PHPUnit\Framework\TestCase {
         $this->assertFalse(@$d->load('fileDoesNotExist.html'));
         $d->load($f);
         $this->assertNotNull($d->documentElement);
-        $this->assertSame('ISO-2022-JP', $d->documentEncoding);
+        $this->assertSame('ISO-2022-JP', $d->charset);
         // Test http source
         $d = new Document();
         $d->load('https://google.com');
         $this->assertNotNull($d->documentElement);
-        $this->assertSame('UTF-8', $d->documentEncoding);
+        $this->assertSame('UTF-8', $d->charset);
+        $this->assertSame('https://google.com', $d->URL);
+        $this->assertNull($d->documentURI);
         // Test document encoding
         $d = new Document();
         $d->loadHTMLFile($f, null, 'UTF-8');
-        $this->assertSame('UTF-8', $d->documentEncoding);
+        $this->assertSame('UTF-8', $d->charset);
+        // Test real document loading
+        $d = new Document();
+        $d->loadHTMLFile(dirname(__FILE__) . '/../test.html', null, 'UTF-8');
+        $this->assertStringStartsWith('file://', $d->URL);
         // Test templates in source
         $d = new Document('<!DOCTYPE html><html><body><template class="test"><template></template></template></body></html>');
@@ -419,34 +427,49 @@ class TestDocument extends \PHPUnit\Framework\TestCase {
     }
-    /** @covers \MensBeam\HTML\DOM\Document::__get_documentEncoding */
-    public function testPropertyGetDocumentEncoding(): void {
+    /**
+     * @covers \MensBeam\HTML\DOM\Document::__get_charset
+     * @covers \MensBeam\HTML\DOM\Document::__get_characterSet
+     * @covers \MensBeam\HTML\DOM\Document::__get_inputEncoding
+     */
+    public function testPropertyGetCharset(): void {
         $d = new Document(null, 'UTF-8');
-        $this->assertSame('UTF-8', $d->documentEncoding);
+        $this->assertSame('UTF-8', $d->charset);
+        $this->assertSame('UTF-8', $d->characterSet);
+        $this->assertSame('UTF-8', $d->inputEncoding);
         $d = new Document('<!DOCTYPE html><html><head><meta charset="GB18030"></head></html>');
-        $this->assertSame('gb18030', $d->documentEncoding);
+        $this->assertSame('gb18030', $d->charset);
+        $this->assertSame('gb18030', $d->characterSet);
+        $this->assertSame('gb18030', $d->inputEncoding);
     }
-    public function providePropertyGetQuirksMode(): iterable {
+    public function providePropertyGetCompatMode(): iterable {
         return [
             // Empty document
-            [ null, Parser::NO_QUIRKS_MODE ],
+            [ null, 'CSS1Compat' ],
             // Document without doctype
-            [ '<html></html>', Parser::QUIRKS_MODE ],
+            [ '<html></html>', 'BackCompat' ],
             // Document with doctype
-            [ '<!DOCTYPE html><html></html>', Parser::NO_QUIRKS_MODE ]
+            [ '<!DOCTYPE html><html></html>', 'CSS1Compat' ]
         ];
     }
     /**
-     * @dataProvider providePropertyGetQuirksMode
-     * @covers \MensBeam\HTML\DOM\Document::__get_quirksMode
+     * @dataProvider providePropertyGetCompatMode
+     * @covers \MensBeam\HTML\DOM\Document::__get_compatMode
      */
-    public function testPropertyGetQuirksMode(?string $html, int $quirksMode): void {
+    public function testPropertyGetCompatMode(?string $html, string $compatMode): void {
         $d = new Document($html);
-        $this->assertSame($quirksMode, $d->quirksMode);
+        $this->assertSame($compatMode, $d->compatMode);
     }
+    /** @covers \MensBeam\HTML\DOM\Document::__get_contentType */
+    public function testPropertyGetContentType(): void {
+        $d = new Document();
+        $this->assertSame('text/html', $d->contentType);
+    }

7
tests/test.html

@@ -0,0 +1,7 @@
+<!DOCTYPE html>
+<html>
+<head>
+<meta charset="ISO-2022-JP">
+<title>Ook</title>
+</head>
+</html>