Unlike iconv or the internal converters, ICU consumes truncated multi-
byte sequences at the end of an input buffer. We currently check for a
non-empty raw input buffer to detect truncated sequences, so this fails
with ICU.
It might be possible to inspect the pivot buffer pointers, but it seems
cleaner to implement a `flush` flag for some encoding and I/O functions.
After flushing, we can check for U_TRUNCATED_CHAR_FOUND with ICU, or
detect remaining input with other converters.
Also fix detection of truncated sequences for HTML, XML content and
DTDs with iconv.
libxml2's HTML parser adds <p> start tags in some situations. This
behavior, which doesn't follow any standard, was added in 2000, see
here: http://veillard.com/XML/messages/0655.html
Text nodes that only contain whitespace don't imply a <p> tag, but the
whitespace check cannot work reliably if we're parsing partial text data
which can happen with both pull and push parser.
The logic in `areBlanks` is hard to follow. The checks involving `CUR`
depend on the position of the input pointer and seem dubious. It's also
possible that the behavior changed inadvertently with a later commit.
As a result, it's hard to come up with good test cases.
We now process leading whitespace before creating implied tags. This is
more in line with HTML5 and should avoid at least some issues with
partial text data.
For example, parsing the string "<head> x" used to result in:
<html>
<head></head>
<body><p> x</p></body>
</html>
And now results in:
<html>
<head> </head>
<body><p>x</p></body>
</html>
Except for the implied <p> tag, this matches HTML5.
Fix a long-standing issue where QNames starting with a non-ASCII
character would be rejected. This became more visible after "streaming"
XPath evaluation was disabled since the latter handled non-ASCII names
correctly.
Fixes#818.
It doesn't make much sense to keep the old syntax error handling which
doesn't conform to HTML5.
Handling HTML5 parser errors is rather involved and not essential for
parsers.
The Document Object Model (DOM) Level 3 Core Specification says:
> Adjacent CDATASection nodes are not merged by use of the normalize
> method of the Node interface.
Fixes#412.
Revert a change from d025cfbb and don't overwrite ID table entries, so
that the first attribute will be returned if there are duplicate IDs.
This requires two other changes:
- Attributes in entity content are never added to the ID table. This
seems reasonable.
- Remove the optimization to skip ID lookup when copying and the target
document has an empty ID table. This also seems more correct since the
document could have ID declarations nevertheless or we could be
copying xml:ids into the document for the first time.
Fixes#757.
The latest spec for what it essentially an XPath extension seems to be
this working draft from 2002:
https://www.w3.org/TR/xptr-xpointer/
The xpointer() scheme is listed as "being reviewed" in the XPointer
registry since at least 2006. libxml2 seems to be the only modern
software that tries to implement this spec, but the code has many bugs
and quality issues.
If you configure --with-legacy, old symbols are retained for ABI
compatibility.
Throw an error if entity substitution was requested.
Now we only downgrade to a warning if
- XML_PARSE_DTDLOAD wasn't specified, and
- entity aren't substituted or XML_PARSE_NO_XXE was specified.
Should fix#724.
This should only be done in xmlParseReference.
The handling of undeclared entities is still somewhat inconsistent. In
element content we create references even if entity substitution is
enabled. In attribute values undeclared entities are always ignored.
Always use XML_WAR_UNDECLARED_ENTITY with warning error level in
documents with external subset or parameter entities. Use
XML_ERR_UNDECLARED_ENTITY otherwise.
Also decode entities in namespace URIs if entity substitution wasn't
requested. This should fix some corner cases when comparing namespace
URIs. The Namespaces in XML 1.0 spec says:
> In a namespace declaration, the URI reference is the normalized value
> of the attribute, so replacement of XML character and entity
> references has already been done before any comparison.
Make the serialization code escape special characters in namespace URIs
like in attribute values. This fixes serialization if entities were
substituted when parsing.
Fixes https://gitlab.gnome.org/GNOME/libxslt/-/issues/106
Commit 9e1c72da from 2001 introduced a bug where xmlAddPrevSibling and
xmlAddNextSibling would only try to merge text nodes with one of its
new siblings. Commit 4ccd3eb8 fixed this bug but unfortunately, lxml
and possibly other downstream code depend on text nodes not being
merged.
To avoid breaking downstream code while still having somewhat
consistent API behavior, it's probably best to make these functions
never coalesce text nodes.
Fixes a regression from commit e0dd330b, resulting in duplicate
attributes in the predefined XML namespace not being detected or
extraneous default attributes being passed.
Fixes#704.
This is a useful function to get a verbose error report.
Allows to remove duplicated code from runtest.c. Also reactivate check
for schema parser failures.