The while loop has been replaced with gotos where appropriate, and
switching has been replaced with a series of if-blocks in line with the
same logic in the tokenizer.
- Various new tests needed for full coverage, noted in comment
- Byte Order Mark detection methopd added
- Japanese encodings nt yet supported, so tests marked incomplete
- Tests requiring scripting suppressed
- Newline normalization now done on-the-fly
- Consequently, original input string is used as-is
- Byte order mark is not supposed to be skipped
- Use more straightforward method of tracking column position
- Simplify backtracking when spanning
- Genericize character interpretation: this will be expanded to emit
illegal-character parse errors when appropriate
- End tags now emit errors if they have attributes
- End tags now emit errors if they are self-closing
- The last character before EOF is now correctly reconsumed
Also changed the tokenizer debug log to be zero-cost
- Implemented missing states (except entity and char ref states)
- Re-copied and reformated most text from the specification
- Emitted parse errors per spec (except invalid characters)
- Properly handled null characters
- Passed through invalid characters (these do not yet emit errors)
- Added assertions before manipulation of tokens and temporary buffers
- Removed problematic optimizations
- Reoved explicit continue statements
- Allowed end tags to have attributes
- Simplified duplicate attribute detection
- Corrected DOCTYPE properties not being "missing"
- Skipped BOM in encoding-neutral way
I may have introduced regressions, and the assertions are mostly serving to
mask undefined-variable errors rather than helping to fix them, but at least
warnings and notices are not being spammed this way.
Work still need to be done in emitting errors for invalid characters (and
invalid character sequences), also well as in consuming character
references and entities correctly, not to mention general debugging.