Commit graph

23 commits

Author SHA1 Message Date
164e5ff1e8 Add standard charset detection tests
- Various new tests needed for full coverage, noted in comment
- Byte Order Mark detection method added
- Japanese encodings not yet supported, so tests marked incomplete
- Tests requiring scripting suppressed
2019-12-22 22:51:18 -05:00
a7e1083681 Prototype character encoding detection 2019-12-22 13:36:59 -05:00
49f31015ac Start on character encoding detection 2019-12-21 14:53:51 -05:00
00bf9974c5 Fix up most error reporting positions 2019-12-19 22:28:11 -05:00
58a1177888 Address errors and omissions in error emission
One test still fails, though it is arguably immaterial. This does not
account for line and column number, which are known to be mostly
off by one.
2019-12-19 15:13:20 -05:00
ec199f4f11 Report input stream errors 2019-12-18 21:10:18 -05:00
9560358021 Character consumption cleanup
- Newline normalization now done on-the-fly
- Consequently, original input string is used as-is
- Byte order mark is not supposed to be skipped
- Use more straightforward method of tracking column position
- Simplify backtracking when spanning
- Genericize character interpretation: this will be expanded to emit
illegal-character parse errors when appropriate
2019-12-18 18:03:47 -05:00
1ed679c50d Pass through surrogate characters
This fixes the last four failing tests
2019-12-18 15:15:02 -05:00
5a12fa8ad7 Tidying 2019-12-17 17:08:19 -05:00
59456b078f Fix consuming of overlong entity 2019-12-17 12:32:29 -05:00
b9b892e6a6 Remove obsolete character reference consumer 2019-12-16 22:56:47 -05:00
43f380c1f9 Fix EOF and end tags
- End tags now emit errors if they have attributes
- End tags now emit errors if they are self-closing
- The last character before EOF is now correctly reconsumed

Also changed the tokenizer debug log to be zero-cost
2019-12-15 19:45:59 -05:00
d08438052a Baseline pass over tokenizer
- Implemented missing states (except entity and char ref states)
- Re-copied and reformatted most text from the specification
- Emitted parse errors per spec (except invalid characters)
- Properly handled null characters
- Passed through invalid characters (these do not yet emit errors)
- Added assertions before manipulation of tokens and temporary buffers
- Removed problematic optimizations
- Removed explicit continue statements
- Allowed end tags to have attributes
- Simplified duplicate attribute detection
- Corrected DOCTYPE properties not being "missing"
- Skipped BOM in encoding-neutral way

I may have introduced regressions, and the assertions are mostly serving to
mask undefined-variable errors rather than helping to fix them, but at least
warnings and notices are not being spammed this way.

Work still needs to be done in emitting errors for invalid characters (and
invalid character sequences), as well as in consuming character
references and entities correctly, not to mention general debugging.
2019-12-15 17:47:45 -05:00
4e4aee2edd Update intl dependency 2019-12-13 12:13:44 -05:00
a0c3883363 Another infinite loop in Tokenizer caused by Data 2019-12-12 22:45:13 -06:00
6b42f08fbc Change some if-then-exception blocks to assertions
This has only been done in some parts of the code that are internal
to the parser at large.
2019-12-12 17:35:24 -05:00
bb2a7b5a95 Rewrite how parse errors are handled
Everything which can emit a parse error should have the error handler
and data stream as properties and use the ParseErrorEmitter trait to
avoid complicating the task of actually producing an error.

Normally the Parser would be expected to set the error handler before it
begins (this commit does not do this) and unset it after it's done.
Alternatively, the entire means of reporting errors can now be easily
replaced.
2019-12-12 15:23:15 -05:00
51ac79128b Multiple minor fixes 2019-12-11 23:28:32 -05:00
30003fce1f Fixed various issues with Data::consumeCharacterReference 2019-12-11 21:38:04 -06:00
ab507a177f Data::consumeCharacterReference checked for false instead of empty string 2019-12-11 19:44:45 -06:00
64d8a2ab2c Fixed infinite loop caused by Data::consumeWhile and consumeUntil 2019-12-10 22:48:02 -06:00
66ec4dab27 Fix character reference parsing 2018-08-31 13:25:05 -05:00
33363ab2d3 Fixed Data bug
• Fixed bug where Data::consumeWhile and Data::consumeUntil wouldn't move the pointer back one position if there were no matches.
• Changed DataStream to Data.
• Made each class have its own debug static property so each can print debug information separately.
2018-08-27 14:57:47 -05:00
Renamed from lib/DataStream.php