libxml2's HTML parser adds <p> start tags in some situations. This
behavior, which doesn't follow any standard, was added in 2000; see
http://veillard.com/XML/messages/0655.html
Text nodes that contain only whitespace don't imply a <p> tag, but the
whitespace check cannot work reliably when we're parsing partial text
data, which can happen with both the pull and the push parser.
The logic in `areBlanks` is hard to follow. The checks involving `CUR`
depend on the position of the input pointer and seem dubious. It's also
possible that the behavior changed inadvertently with a later commit.
As a result, it's hard to come up with good test cases.
We now process leading whitespace before creating implied tags. This is
more in line with HTML5 and should avoid at least some issues with
partial text data.
For example, parsing the string "<head> x" used to result in:
    <html>
      <head></head>
      <body><p> x</p></body>
    </html>
And now results in:
    <html>
      <head> </head>
      <body><p>x</p></body>
    </html>
Except for the implied <p> tag, this matches HTML5.
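The idea of handling leading whitespace before implied tags can be
sketched roughly as follows. This is only an illustration, not the
actual libxml2 code: `isHtmlSpace` and `leadingSpaceLen` are
hypothetical helpers, and the whitespace set is assumed to be HTML5's
ASCII whitespace (tab, LF, FF, CR, space).

```c
#include <stddef.h>

/* Hypothetical helper, not part of libxml2. Assumes the HTML5
 * definition of ASCII whitespace: tab, LF, FF, CR and space. */
static int
isHtmlSpace(unsigned char c) {
    return c == 0x20 || c == 0x09 || c == 0x0A || c == 0x0C || c == 0x0D;
}

/* Return the length of the whitespace run at the start of
 * buf[0..len). A parser could emit that run as a text node in the
 * current context first, and only create implied tags for whatever
 * non-whitespace content follows. */
size_t
leadingSpaceLen(const char *buf, size_t len) {
    size_t i = 0;

    while (i < len && isHtmlSpace((unsigned char) buf[i]))
        i++;
    return i;
}
```

For the text " x" following <head>, this returns 1: the single space
can stay inside <head>, and only "x" needs the implied <body> and <p>.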
Not long ago, Apple decided to replace GNU libiconv with a patched-up
version of FreeBSD's iconv implementation in their operating systems.
Unfortunately, the quality of both the original implementation and
Apple's patches is so abysmal that you routinely find issues when
fuzzing your own code.
For now, incomplete UTF-8 is always an error in push mode.
Eventually, we could pass chunked data to the character handler when
push-parsing. Then we'd have to handle incomplete sequences.
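If chunked data were passed on, one way to handle incomplete sequences
would be to hold back the unfinished bytes at the end of a chunk until
more data arrives. The following is a minimal sketch under that
assumption, not what libxml2 does today. `incompleteUtf8Tail` is a
hypothetical name, and the function only looks at lead-byte patterns
without validating the data.

```c
#include <stddef.h>

/* Hypothetical helper, not part of libxml2. Return the number of
 * bytes at the end of buf[0..len) that begin a multi-byte UTF-8
 * sequence without completing it, or 0 if the buffer doesn't end
 * in the middle of a sequence. No validation is performed; invalid
 * bytes are left for the decoder to report. */
size_t
incompleteUtf8Tail(const unsigned char *buf, size_t len) {
    size_t back;

    /* A sequence has at most 4 bytes, so an incomplete one has at
     * most 3. Walk backwards to find the lead byte. */
    for (back = 1; back <= 3 && back <= len; back++) {
        unsigned char c = buf[len - back];
        size_t need;

        if ((c & 0xC0) == 0x80)
            continue;               /* continuation byte, keep looking */
        if (c < 0x80)
            return 0;               /* ASCII, nothing pending */
        if ((c & 0xE0) == 0xC0)
            need = 2;
        else if ((c & 0xF0) == 0xE0)
            need = 3;
        else if ((c & 0xF8) == 0xF0)
            need = 4;
        else
            return 0;               /* invalid lead byte */
        return back < need ? back : 0;
    }
    return 0;
}
```

A caller would process `len - incompleteUtf8Tail(buf, len)` bytes and
prepend the remaining tail to the next chunk.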
Don't check for UTF-32 or EBCDIC.
We now perform BOM sniffing and the first step of the HTML5 prescan
algorithm (detect UTF-16 XML declarations). The rest of the algorithm
still has to be implemented.
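The two detection steps can be sketched as below. This is an
illustration based on the byte patterns in the HTML5 encoding sniffing
algorithm, not libxml2's implementation; `sniffEncoding` is a
hypothetical name, and returning the encoding as a string is an
assumption made for the example.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of libxml2. Return an encoding name
 * detected from the first bytes of the input, or NULL if neither a
 * BOM nor a UTF-16 XML declaration is found. */
const char *
sniffEncoding(const unsigned char *buf, size_t len) {
    static const unsigned char bomUtf8[]  = { 0xEF, 0xBB, 0xBF };
    static const unsigned char bomBE[]    = { 0xFE, 0xFF };
    static const unsigned char bomLE[]    = { 0xFF, 0xFE };
    /* "<?x" encoded as UTF-16LE and UTF-16BE */
    static const unsigned char declLE[]   = { 0x3C, 0, 0x3F, 0, 0x78, 0 };
    static const unsigned char declBE[]   = { 0, 0x3C, 0, 0x3F, 0, 0x78 };

    /* BOM sniffing. UTF-32 and EBCDIC are deliberately not checked. */
    if (len >= 3 && memcmp(buf, bomUtf8, 3) == 0)
        return "UTF-8";
    if (len >= 2 && memcmp(buf, bomBE, 2) == 0)
        return "UTF-16BE";
    if (len >= 2 && memcmp(buf, bomLE, 2) == 0)
        return "UTF-16LE";

    /* First prescan step: UTF-16 XML declarations. */
    if (len >= 6 && memcmp(buf, declLE, 6) == 0)
        return "UTF-16LE";
    if (len >= 6 && memcmp(buf, declBE, 6) == 0)
        return "UTF-16BE";

    return NULL;
}
```

When this returns NULL, the remaining prescan steps (such as scanning
for <meta> elements) would have to decide the encoding.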
It doesn't make much sense to keep the old syntax error handling, which
doesn't conform to HTML5.
Handling HTML5 parser errors is rather involved and not essential for
parsers.