
Added support for closures as resolvers in XPath expressions

master 1.0.7
Dustin Wilson 2 years ago
parent
commit
edd22dc35d
  1. README.md (7 lines changed)
  2. lib/Node.php (20 lines changed)
  3. lib/XPathEvaluate.php (13 lines changed)
  4. lib/XPathEvaluatorBase.php (2 lines changed)
  5. tests/cases/TestXPathEvaluate.php (5 lines changed)

7
README.md

@ -405,7 +405,7 @@ Returns the wrapper node that corresponds to the provided inner node. If one doe
## Limitations & Differences from Specification ##
The primary aim of this library is accuracy. However, due either to limitations imposed by PHP's DOM, by assumptions made by the specification that aren't applicable to a PHP library, or simply because of impracticality some changes have needed to be made. These are as follows:
The primary aim of this library is accuracy. However, due either to limitations imposed by PHP's DOM, by assumptions made by the specification that aren't applicable to a PHP library, or simply because of impracticality, some changes have needed to be made. There appear to be a lot of deviations from the specification below, but this is simply an exhaustive list of details about the implementation, with a few even explaining why we follow the specification instead of what browsers do.
1. Any mention of scripting or anything necessary because of scripting (such as the `ElementCreationOptions` options dictionary on `Document::createElement`) will not be implemented.
2. Due to a PHP bug which severely degrades performance with large documents and in consideration of existing PHP software and because of bizarre uncircumventable `xmlns` attribute bugs when the document is in the HTML namespace, HTML elements in HTML documents are placed in the null namespace internally rather than in the HTML namespace. However, externally they will be shown as having the HTML namespace. Even though null namespaced elements do not exist in the HTML specification one can create them using the DOM. However, in this implementation they will be treated as HTML namespaced elements due to the HTML namespace limitation.
@ -420,8 +420,9 @@ The primary aim of this library is accuracy. However, due either to limitations
11. All of the `Range` APIs will also not be implemented due to the sheer complexity of creating them in userland and how it adds undue difficulty to node manipulation in the "core" DOM. Numerous operations reference in excruciating detail what to do with Ranges when manipulating nodes and would have to be added here to be compliant or mostly so -- slowing everything else down in the process on an already extremely front-heavy library.
12. The `DOMParser` and `XMLSerializer` APIs will not be implemented because they are ridiculous and limited in their scope. For instance, `DOMParser::parseFromString` won't set a document's character set to anything but UTF-8. This library needs to be able to print to other encodings due to the nature of how it is used. `Document::__construct` will accept optional `$source` and `$charset` arguments, and there are both `Document::load` and `Document::loadFile` methods for loading DOM from a string or a file respectively.
13. Aside from `HTMLElement`, `HTMLPreElement`, `HTMLTemplateElement`, `HTMLUnknownElement`, `MathMLElement`, and `SVGElement` none of the specific derived element classes (such as `HTMLAnchorElement` or `SVGSVGElement`) are implemented. The ones listed before are required for the element interface algorithm. The focus on this library will be on the core DOM before moving onto those -- if ever.
14. This class is meant to be used with HTML, but it will -MOSTLY- work as needed with XML. Loading of XML uses PHP DOM's XML parser which does not completely conform to the XML specification. Writing an actual conforming XML parser is outside of the scope of this library.
14. This class is meant to be used with HTML, but it will -MOSTLY- work as needed with XML. Loading of XML uses PHP DOM's XML parser which does not completely conform to the XML specification. Writing an actual conforming XML parser is outside of the scope of this library. One notable feature of this library which won't work per the XML specification is Unicode characters in element names. XML allows for capital letters while HTML doesn't. This implementation's workaround (because PHP's DOM doesn't support Unicode at all in element names) internally coerces all non-ASCII characters to 'Uxxxx' which would be valid modern XML names. Something like a lookup table would be necessary for XML instead, but this isn't implemented and may not be because of complexity.
15. While there is implementation of much of the XPath extensions, there will only be support for XPath 1.0 because that is all PHP DOM's XPath supports.
16. This library's XPath API is -- like the rest of the library itself -- a wrapper that wraps PHP's implementation but instead works like the specification, so there is no need to manually register namespaces. Namespaces that are associated with prefixes will be looked up when evaluating the expression if a `XPathNSResolver` is specified. However, access to registering PHP functions for use within XPath isn't in the specification but is available through `Document::registerXPathFunctions` and `XPathEvaluator::registerXPathFunctions`.
17. `XPathEvaluatorBase::evaluate` has a `result` argument where one provides it with an existing result object to use. I can't find any usable documentation on what this is supposed to do, and the specifications on it are vague. So, at present it does nothing until what it needs to do can be deduced.
18. At present XPath expressions cannot select elements or attributes which use any valid non-ascii character. This is because those nodes are coerced internally to work within PHP's DOM which doesn't support those characters. This can be worked around by coercing names in XPath queries, but that can only be reliably accomplished through an XPath parser. Writing an entire XPath parser for what amounts to an edge case isn't desirable.
19. The XPath API itself is an ill-conceived API that is entirely impractical to use because doing anything with the `XPathResult` object is cumbersome and stupid. Per the specification one cannot iterate over the result even if the result type is an iterator type (why in the hell call it that, then?). One has to instead repeatedly call the `XPathResult::iterateNext()` method. This implementation will allow for treating `XPathResult` snapshot or iterator types as arrays.

20
lib/Node.php

@ -813,11 +813,20 @@ abstract class Node implements \Stringable {
// contents of the node should be appended to the wrapper element's content
// document fragment. Otherwise, clone the content document fragment instead.
if (!$parsing) {
$copyWrapperContent = $copyWrapperContent->innerNode;
$copyWrapperContentInner = $copyWrapperContent->innerNode;
$nodeWrapperContent = $node->ownerDocument->getWrapperNode($node)->content->innerNode;
$childNodes = $nodeWrapperContent->childNodes;
foreach ($childNodes as $child) {
$copyWrapperContent->appendChild($this->cloneInnerNode($child, $document, true));
if ($childNodes->length > 0) {
// This garbage is necessary because the appendChildInner method
// needs to be invoked on another document here. This is because of a nasty PHP
// DOM bug (see Node::preInsertionBugFixes for a description).
$appendChildInner = new \ReflectionMethod($copyWrapperContent->ownerDocument, 'appendChildInner');
$appendChildInner->setAccessible(true);
$copyWrapperContentDocument = $copyWrapperContent->ownerDocument;
foreach ($childNodes as $child) {
$appendChildInner->invoke($copyWrapperContentDocument, $copyWrapperContentInner, $this->cloneInnerNode($child, $document, true));
}
}
} else {
$copyContent = $copyWrapperContent->innerNode;
@ -837,7 +846,8 @@ abstract class Node implements \Stringable {
if ($node instanceof \DOMElement || $node instanceof \DOMDocumentFragment) {
$childNodes = $node->childNodes;
foreach ($childNodes as $child) {
$this->appendChildInner($copy, $this->cloneInnerNode($child, $document, true, $parsing));
$clone = $this->cloneInnerNode($child, $document, true, $parsing);
$this->appendChildInner($copy, $clone);
}
}
}
@ -970,7 +980,7 @@ abstract class Node implements \Stringable {
if ($node instanceof Element || $node instanceof DocumentFragment || $node instanceof Document) {
$childNodes = $innerNode->childNodes;
foreach ($childNodes as $child) {
$copy->appendChild($this->cloneInnerNode($child, $innerDocument, true));
$this->appendChildInner($copy, $this->cloneInnerNode($child, $innerDocument, true));
}
}
}

13
lib/XPathEvaluate.php

@ -22,20 +22,27 @@ trait XPathEvaluate {
}
} // @codeCoverageIgnore
protected function xpathEvaluate(string $expression, Node $contextNode, ?XPathNSResolver $resolver = null, int $type = XPathResult::ANY_TYPE, ?XPathResult $result = null): XPathResult {
protected function xpathEvaluate(string $expression, Node $contextNode, \Closure|XPathNSResolver|null $resolver = null, int $type = XPathResult::ANY_TYPE, ?XPathResult $result = null): XPathResult {
$innerContextNode = $contextNode->innerNode;
$doc = ($innerContextNode instanceof \DOMDocument) ? $innerContextNode : $innerContextNode->ownerDocument;
if ($resolver !== null && preg_match_all('/([A-Z_a-z\x{C0}-\x{D6}\x{D8}-\x{F6}\x{F8}-\x{2FF}\x{370}-\x{37D}\x{37F}-\x{1FFF}\x{200C}-\x{200D}\x{2070}-\x{218F}\x{2C00}-\x{2FEF}\x{3001}-\x{D7FF}\x{F900}-\x{FDCF}\x{FDF0}-\x{FFFD}\x{10000}-\x{EFFFF}][A-Z_a-z\x{C0}-\x{D6}\x{D8}-\x{F6}\x{F8}-\x{2FF}\x{370}-\x{37D}\x{37F}-\x{1FFF}\x{200C}-\x{200D}\x{2070}-\x{218F}\x{2C00}-\x{2FEF}\x{3001}-\x{D7FF}\x{F900}-\x{FDCF}\x{FDF0}-\x{FFFD}\x{10000}-\x{EFFFF}-\.0-9\x{B7}\x{0300}-\x{036F}\x{203F}-\x{2040}]+):([A-Z_a-z\x{C0}-\x{D6}\x{D8}-\x{F6}\x{F8}-\x{2FF}\x{370}-\x{37D}\x{37F}-\x{1FFF}\x{200C}-\x{200D}\x{2070}-\x{218F}\x{2C00}-\x{2FEF}\x{3001}-\x{D7FF}\x{F900}-\x{FDCF}\x{FDF0}-\x{FFFD}\x{10000}-\x{EFFFF}][A-Z_a-z\x{C0}-\x{D6}\x{D8}-\x{F6}\x{F8}-\x{2FF}\x{370}-\x{37D}\x{37F}-\x{1FFF}\x{200C}-\x{200D}\x{2070}-\x{218F}\x{2C00}-\x{2FEF}\x{3001}-\x{D7FF}\x{F900}-\x{FDCF}\x{FDF0}-\x{FFFD}\x{10000}-\x{EFFFF}-\.0-9\x{B7}\x{0300}-\x{036F}\x{203F}-\x{2040}]+)/u', $expression, $m, \PREG_SET_ORDER)) {
foreach ($m as $prefix) {
$prefix = $prefix[1];
if ($namespace = $contextNode->lookupNamespaceURI($prefix)) {
if ($resolver instanceof XPathNSResolver) {
$namespace = $contextNode->lookupNamespaceURI($prefix);
} elseif ($namespace = $resolver($prefix)) {
$namespace = (string)$namespace;
}
if ($namespace !== null) {
$doc->xpath->registerNamespace($prefix, $namespace);
}
}
}
// PHP's DOM XPath incorrectly issues a warnings rather than exceptions when
// PHP's DOM XPath incorrectly issues warnings rather than exceptions when
// expressions are incorrect, so we must use a custom error handler here to
// "catch" it and throw an exception in its place.
set_error_handler([ $this, 'xpathErrorHandler' ]);
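
The comment above describes turning PHP DOM XPath warnings into exceptions with a temporary error handler. A minimal, self-contained sketch of that pattern follows. It uses only PHP's built-in `DOMXPath`; `evaluateOrThrow` is a hypothetical helper written for illustration and is not the library's `xpathErrorHandler`, and the choice of `InvalidArgumentException` is likewise an assumption.

```php
<?php
// Hypothetical wrapper, not part of the library: evaluates an expression and
// converts any warning raised during evaluation into an exception.
function evaluateOrThrow(\DOMXPath $xpath, string $expression): mixed {
    // Install a temporary handler that only intercepts warnings.
    set_error_handler(function (int $errno, string $message): bool {
        throw new \InvalidArgumentException($message, $errno);
    }, E_WARNING);

    try {
        return $xpath->evaluate($expression);
    } finally {
        // Always restore the previous handler, even when an exception was thrown.
        restore_error_handler();
    }
}

$xpath = new \DOMXPath(new \DOMDocument());
$caught = null;
try {
    // '//[' is not a valid XPath expression, so PHP's DOM raises a warning.
    evaluateOrThrow($xpath, '//[');
} catch (\InvalidArgumentException $e) {
    $caught = $e;
}
```

Restoring the handler in a `finally` block keeps the custom handler scoped to the single evaluation, so unrelated warnings elsewhere are not affected.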

2
lib/XPathEvaluatorBase.php

@ -24,7 +24,7 @@ trait XPathEvaluatorBase {
return Reflection::createFromProtectedConstructor(__NAMESPACE__ . '\\XPathNSResolver', $nodeResolver);
}
public function evaluate(string $expression, Node $contextNode, ?XPathNSResolver $resolver = null, int $type = XPathResult::ANY_TYPE, ?XPathResult $result = null): XPathResult {
public function evaluate(string $expression, Node $contextNode, \Closure|XPathNSResolver|null $resolver = null, int $type = XPathResult::ANY_TYPE, ?XPathResult $result = null): XPathResult {
return $this->xpathEvaluate($expression, $contextNode, $resolver, $type, $result);
}
}

5
tests/cases/TestXPathEvaluate.php

@ -106,6 +106,11 @@ class TestXPathEvaluate extends \PHPUnit\Framework\TestCase {
$d->documentElement->setAttributeNS(Node::XMLNS_NAMESPACE, 'xmlns:poop', 'https://poop.poop');
$poop = $d->body->appendChild($d->createElementNS('https://poop.poop', 'poop:poop'));
$this->assertSame($poop, $d->evaluate('//poop:poop', $d->body, $d->createNSResolver($d->body), XPathResult::FIRST_ORDERED_NODE_TYPE)->singleNodeValue);
$svg = $d->body->appendChild($d->createElementNS(Node::SVG_NAMESPACE, 'svg'));
$this->assertSame($svg, $d->evaluate('//svg:svg', $d->body, function(string $prefix): ?string {
return ($prefix === 'svg') ? Node::SVG_NAMESPACE : null;
}, XPathResult::FIRST_ORDERED_NODE_TYPE)->singleNodeValue);
}
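
The test above passes a closure as the resolver. The same idea can be sketched without the library, using only PHP's built-in `DOMXPath`: collect the prefixes used in the expression, ask the closure for each namespace URI, and register the ones it recognizes. `evaluateWithResolver` is a hypothetical helper, and the prefix pattern is a simplified ASCII-only assumption rather than the full XML name ranges used in the commit.

```php
<?php
// Hypothetical helper, not part of the library: registers every prefix found
// in the expression by asking a closure for its namespace URI.
function evaluateWithResolver(\DOMXPath $xpath, string $expression, \Closure $resolver, \DOMNode $context): \DOMNodeList {
    // Simplified ASCII-only prefix pattern (assumption); it skips axes such as child::.
    if (preg_match_all('/([A-Za-z_][\w.\-]*):[A-Za-z_]/', $expression, $matches)) {
        foreach (array_unique($matches[1]) as $prefix) {
            $namespace = $resolver($prefix);
            // Skip prefixes the closure does not recognize.
            if ($namespace !== null && $namespace !== '') {
                $xpath->registerNamespace($prefix, (string)$namespace);
            }
        }
    }

    $result = $xpath->query($expression, $context);
    if ($result === false) {
        throw new \InvalidArgumentException('Invalid XPath expression');
    }
    return $result;
}

$doc = new \DOMDocument();
$doc->loadXML('<root><svg xmlns="http://www.w3.org/2000/svg"/></root>');
$xpath = new \DOMXPath($doc);

// The closure maps the "svg" prefix to the SVG namespace and rejects anything else.
$nodes = evaluateWithResolver(
    $xpath,
    '//svg:svg',
    fn(string $prefix): ?string => ($prefix === 'svg') ? 'http://www.w3.org/2000/svg' : null,
    $doc->documentElement
);
```

A closure is convenient here because the document itself declares no `svg` prefix, so a node-based resolver would have nothing to look up.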
