Nick Wellnhofer
3ac214f01e
xmllint: Support --html --sax
2024-10-06 20:04:00 +02:00
Nick Wellnhofer
225ed70737
html: Accelerate htmlParseCharData
2024-10-06 20:04:00 +02:00
Nick Wellnhofer
74dfc49b5f
parser: Clarify logic in xmlParseStartTag2
2024-10-06 20:04:00 +02:00
Nick Wellnhofer
207999793f
html: Handle numeric character references directly
2024-10-06 20:04:00 +02:00
Nick Wellnhofer
0bc4608c50
html: Use hash table to check for duplicate attributes
2024-10-06 20:04:00 +02:00
Nick Wellnhofer
24a6149fc4
html: Make sure that character data mode is reset
2024-10-06 20:04:00 +02:00
Nick Wellnhofer
c32397d51f
html: Improve character class macros
2024-10-06 20:04:00 +02:00
Nick Wellnhofer
e840655414
html: Rewrite parsing of most data
2024-10-06 20:04:00 +02:00
Nick Wellnhofer
f77ec16db0
html: Optimize htmlParseCharData
2024-10-06 20:04:00 +02:00
Nick Wellnhofer
440bd64c69
html: Optimize htmlParseHTMLName
2024-10-06 20:04:00 +02:00
Nick Wellnhofer
c34d0ae9cc
html: Deprecate htmlIsBooleanAttr
2024-10-06 20:04:00 +02:00
Nick Wellnhofer
6040785ac4
html: Deprecate AutoClose API
2024-10-06 20:04:00 +02:00
Nick Wellnhofer
188cad68a4
html: Remove obsolete content model
2024-10-06 20:04:00 +02:00
Nick Wellnhofer
0144f662d7
html: Remove obsolete code
2024-10-06 20:04:00 +02:00
Nick Wellnhofer
0ce7bfe559
html: Try to avoid passing XML options to HTML parser
2024-10-06 20:04:00 +02:00
Nick Wellnhofer
76cc63942a
test: Fix XML_PARSE_HTML constant
2024-10-06 20:04:00 +02:00
Nick Wellnhofer
575be6c1f1
html: Fix line numbers with CRs
2024-10-06 20:04:00 +02:00
Nick Wellnhofer
be874d7831
html: Ignore unexpected DOCTYPE declarations
2024-10-06 20:04:00 +02:00
Nick Wellnhofer
462bf0b7a5
html: Rework options
...
Introduce htmlCtxtSetOptions, see similar changes made to XML parser.
Add HTML_PARSE_HUGE alias. Support HTML_PARSE_BIG_LINES.
2024-10-06 20:04:00 +02:00
Nick Wellnhofer
16de1346eb
parser: Make new options actually work
2024-10-06 20:04:00 +02:00
Nick Wellnhofer
42c3823df0
html: Update comment
2024-10-06 20:04:00 +02:00
Nick Wellnhofer
9f04cce695
html: Remove unused or useless return codes
...
htmlParseStartTag should always succeed (except for malloc failures).
2024-10-06 20:04:00 +02:00
Nick Wellnhofer
e179f3ec0e
html: Stop reporting syntax errors
...
It doesn't make much sense to keep the old syntax error handling which
doesn't conform to HTML5.
Handling HTML5 parser errors is rather involved and not essential for
parsers.
2024-10-06 20:04:00 +02:00
Nick Wellnhofer
c6af101728
html: Test tokenizer against html5lib test suite
2024-10-06 18:13:05 +02:00
Nick Wellnhofer
27752f75ca
html: Fix EOF handling in start tags
2024-10-06 18:13:05 +02:00
Nick Wellnhofer
b19d353970
html: Fix EOF handling in comments
2024-10-06 18:13:05 +02:00
Nick Wellnhofer
17e56ac54a
html: Fix parsing of end tags
2024-10-06 18:13:05 +02:00
Nick Wellnhofer
24a09033c9
html: Fix bogus end tags
2024-10-06 18:13:05 +02:00
Nick Wellnhofer
bca6485476
html: Allow U+000C FORM FEED as whitespace
2024-10-06 18:13:05 +02:00
Nick Wellnhofer
6edf1a645e
html: Fix DOCTYPE parsing
2024-10-06 18:13:05 +02:00
Nick Wellnhofer
9678163f54
html: Don't check for valid XML characters
2024-10-06 18:13:05 +02:00
Nick Wellnhofer
a6955c13c7
html: Parse numeric character references according to HTML5
2024-10-06 18:13:05 +02:00
Nick Wellnhofer
4eeac30944
html: Start to fix EOF and U+0000 handling
2024-10-06 18:13:05 +02:00
Nick Wellnhofer
e062a4a9b3
html: Add HTML5 parser option
...
This option passes tokenizer output directly to the SAX callbacks,
making it possible to test the tokenizer against the html5lib test
suite.
This will produce unbalanced calls to the startElement and endElement
callbacks, but it's the only way to support a SAX-like interface for
HTML5. It can be used for filtering or rewriting HTML5, for example.
An HTML5 tree builder could then be implemented on top of the SAX
callbacks.
2024-10-06 18:13:05 +02:00
Nick Wellnhofer
17da54c522
html: Normalize newlines
2024-10-06 18:13:05 +02:00
Nick Wellnhofer
341dc78f24
html: Deduplicate code in htmlCurrentChar
2024-10-06 18:13:05 +02:00
Nick Wellnhofer
3adb396d87
html: Parse bogus comments instead of ignoring them
...
Also treat XML processing instructions as bogus comments.
2024-10-06 18:13:05 +02:00
Nick Wellnhofer
8444017578
html: Add missing calls to htmlCheckParagraph()
2024-10-06 18:13:05 +02:00
Nick Wellnhofer
86d6b9b051
html: Deduplicate some code
2024-10-06 18:13:05 +02:00
Nick Wellnhofer
0d324bde36
html: Simplify node info accounting
2024-10-06 18:13:05 +02:00
Nick Wellnhofer
ccb61f599e
html: Remove duplicate calls to htmlAutoClose
2024-10-06 18:13:05 +02:00
Nick Wellnhofer
e1834745e0
html: Add character data tests
2024-10-06 18:13:05 +02:00
Nick Wellnhofer
f9ed30e972
html: HTML5 character data states
2024-10-06 18:13:05 +02:00
Nick Wellnhofer
5951179239
html: Parse named character references according to HTML5
2024-10-06 18:13:05 +02:00
Nick Wellnhofer
d5cd0f07f8
html: Prefer SKIP(1) over NEXT in HTML parser
...
Use SKIP(1) where it's safe to avoid a function call.
2024-10-06 18:13:05 +02:00
Nick Wellnhofer
dc2d498318
html: Rework htmlLookupSequence
...
Rename to htmlLookupString and use strstr for increased performance.
2024-10-06 18:13:05 +02:00
Nick Wellnhofer
637215a4de
html: Always terminate doctype declarations on '>'
...
Align with the HTML5 spec. This allows removing the old quote handling in
htmlLookupSequence.
2024-10-06 18:13:05 +02:00
Nick Wellnhofer
72e29f9a3d
html: Fix quadratic behavior in push parser
...
Fix quadratic behavior related to unquoted attribute values. We really
have to replicate parts of the HTML5 state machine to find the end of
tags reliably.
Fixes #533.
2024-10-06 18:13:05 +02:00
Nick Wellnhofer
a80f8b64a9
html: Allow attributes in end tags
...
Attributes are syntactically allowed in HTML5 end tags but otherwise
ignored.
2024-10-06 18:13:05 +02:00
Nick Wellnhofer
f2272c231b
html: Handle unexpected-solidus-in-tag according to HTML5
2024-10-06 18:13:05 +02:00