Commit graph

37 commits

Author SHA1 Message Date
d33929f4a1 Change namespace; add copyright info 2021-03-21 17:38:05 -04:00
aaf85387be Remove uses of is_null for consistency 2021-03-21 12:33:24 -04:00
82621a11e3 Sort out namespaced attributes 2021-03-18 12:40:54 -04:00
6cac402375 Minor cleanup 2021-03-16 14:42:21 -04:00
7f53465951 Fix remaining error positions 2021-03-13 21:00:59 -05:00
c6c51475cf Convert tokenizer to generator
Some error positions still need to be fixed
2021-03-13 18:03:15 -05:00
3f23040e1d Fix most parse error counts
More remain, though most have been addressed
2021-03-10 22:42:53 -05:00
01361efdb8 Various fixes 2021-03-06 21:41:12 -05:00
752ab05464 Implement rest of in-body insertion mode 2021-02-20 12:18:03 -05:00
a8d2ee4174 Fill out more of the "in body" insertion mode
This only passes a few more tests because handling of end tags
is still mostly missing
2021-02-19 20:18:13 -05:00
baaa00e544 Implement "a" in body
Adoption agency will be handled later
2021-02-18 23:13:55 -05:00
6798c128e4 Correct unknown DOCTYPE checking 2021-02-14 19:33:23 -05:00
a8ff431370 Corrective pass over existing insertion modes 2021-02-14 15:09:00 -05:00
4e5fd35775 Fix a few tree tests 2021-02-12 23:26:57 -05:00
00bf9974c5 Fix up most error reporting positions 2019-12-19 22:28:11 -05:00
58a1177888 Address errors and omissions in error emission
One test still fails, though it is arguably immaterial. This does not
account for line and column number, which are known to be mostly
off by one.
2019-12-19 15:13:20 -05:00
ec199f4f11 Report input stream errors 2019-12-18 21:10:18 -05:00
19fb541806 New from-scratch character reference consumer 2019-12-16 22:39:16 -05:00
43f380c1f9 Fix EOF and end tags
- End tags now emit errors if they have attributes
- End tags now emit errors if they are self-closing
- The last character before EOF is now correctly reconsumed

Also changed the tokenizer debug log to be zero-cost
2019-12-15 19:45:59 -05:00
d08438052a Baseline pass over tokenizer
- Implemented missing states (except entity and char ref states)
- Re-copied and reformatted most text from the specification
- Emitted parse errors per spec (except invalid characters)
- Properly handled null characters
- Passed through invalid characters (these do not yet emit errors)
- Added assertions before manipulation of tokens and temporary buffers
- Removed problematic optimizations
- Removed explicit continue statements
- Allowed end tags to have attributes
- Simplified duplicate attribute detection
- Corrected DOCTYPE properties not being "missing"
- Skipped BOM in encoding-neutral way

I may have introduced regressions, and the assertions are mostly serving to
mask undefined-variable errors rather than helping to fix them, but at least
warnings and notices are not being spammed this way.

Work still needs to be done in emitting errors for invalid characters (and
invalid character sequences), as well as in consuming character
references and entities correctly, not to mention general debugging.
2019-12-15 17:47:45 -05:00
a0c3883363 Another infinite loop in Tokenizer caused by Data 2019-12-12 22:45:13 -06:00
6b42f08fbc Change some if-the-exception blocks to assertions
This has only been done in some parts of the code that are internal
to the parser at large.
2019-12-12 17:35:24 -05:00
bb2a7b5a95 Rewrite how parse errors are handled
Everything which can emit a parse error should have the error handler
and data stream as properties and use the ParseErrorEmitter trait to
avoid complicating the task of actually producing an error.

Normally the Parser would be expected to set the error handler before it
begins (this commit does not do this) and unset it after it's done.
Alternatively, the entire means of reporting errors can now be easily
replaced.
2019-12-12 15:23:15 -05:00
8644b6c757 Explicitly index state names and error messages 2019-12-12 10:11:36 -05:00
51ac79128b Multiple minor fixes 2019-12-11 23:28:32 -05:00
30003fce1f Fixed various issues with Data::consumeCharacterReference 2019-12-11 21:38:04 -06:00
0624e0be93 Pushing forward on TreeBuilder
• Updated mensbeam/intl dependency.
• Moved scope methods from Element to OpenElementsStack. They don't need to be used outside of the parser and don't make sense there.
• Cleaned up parse errors. Displaying what is expected or found is not helpful.
2018-09-19 09:09:36 -05:00
33363ab2d3 Fixed Data bug
• Fixed bug where Data::consumeWhile and Data::consumeUntil wouldn't move the pointer back one position if there were no matches.
• Changed DataStream to Data.
• Made each class have its own debug static property so each can print debug information separately.
2018-08-27 14:57:47 -05:00
d95f3e37e4 Fixed document building
• The document was being rewritten when tree building and therefore not being output when the parser completed.
• Allowed DOM to be instanced, containing an implementation and document so the tree builder can create a document when a doctype is found.
2018-08-17 16:26:27 -05:00
48d125e18a Continuing work on TreeBuilder 2018-08-09 16:59:35 -05:00
298decab24 Decouple ParseError from Parser 2018-08-03 23:08:18 -05:00
222d60579c Have Parser destroy its instance when finished
• Getting ready to work on fragment parsing, simplifying Parser::parseFragment.
• Added short example in README
2018-08-03 16:57:51 -05:00
027e5b9f58 Moved tokenizer to its own class
• Changed the name of the parser instance variable from Parser::$self to Parser::$instance
• Added parse errors for entities into ParseError.
• Moved Parser::fixDOM to DOM::fixIdAttributes.
• Added an exception for when the tokenizer enters an invalid state (infinite looping).
• Made ParseError use Parser::$instance->data instead of a passed around DataStream object.
2018-08-01 16:40:03 -05:00
1fc65f85bd Started HTML content tree building
• Removed html5.php; shouldn't have been there to begin with.
• Fixed bug where ParseError::trigger wouldn't have the correct exception to throw when fed the wrong number of parameters.
2018-07-26 16:30:29 -05:00
de7cc7cbfa Fixing foreign content stuff
• Changes to the spec since the last edit required a rewrite of the tree building algorithm.
• Searching the stack should search in reverse by default because the spec works that way.
• Rewrote StartTagToken because the token attributes need to be easily editable: per the spec, foreign attributes are edited before the token goes through the element creation process, not after.
• Yes, there's a goto. Sue me.
2018-07-25 09:57:27 -05:00
6f74630c98 Begin Implementation of Tree Builder
• Added parsing instructions for tokens in foreign content
2018-04-08 10:46:30 -05:00
a89f6c9f09 Beginning Rewrite 2018-03-21 10:55:32 -05:00