7333 Commits

Author SHA1 Message Date
Nick Wellnhofer
7a61c32bfa html: Use enum instead of magic values for insertion modes 2025-02-17 11:41:57 +01:00
Nick Wellnhofer
3793eaadb7 fuzz: Fix build 2025-02-16 13:55:18 +01:00
Nick Wellnhofer
69b91da3a8 Revert "xpath: Make contextSize and proximityPosition default to 1"
This reverts commit afbc0a0405236de4ab8cbac94745e9885db0a198.
2025-02-13 20:20:17 +01:00
Nick Wellnhofer
9c16a153d8 Revert "include: Make most IS_* macros private"
This reverts commit 84a6c82ff83d04963d6e1c5cd18ded68ea02d99f.
2025-02-13 20:20:17 +01:00
Nick Wellnhofer
6c716d491d pattern: Fix compilation of explicit child axis
The child axis is the default axis and should generate XML_OP_ELEM like
the case without an axis.
2025-02-13 20:20:17 +01:00
Nick Wellnhofer
8cf6129bbd html: Stop implying <p> start tags
Only <html>, <head> or <body> should be implied. Opening extra <p> tags
has always been a libxml2 quirk.
2025-02-13 20:20:17 +01:00
Nick Wellnhofer
71122421a1 html: Make implied <p> tags more deterministic
libxml2's HTML parser adds <p> start tags in some situations. This
behavior, which doesn't follow any standard, was added in 2000, see
here: http://veillard.com/XML/messages/0655.html

Text nodes that only contain whitespace don't imply a <p> tag, but the
whitespace check cannot work reliably if we're parsing partial text data
which can happen with both the pull and the push parser.

The logic in `areBlanks` is hard to follow. The checks involving `CUR`
depend on the position of the input pointer and seem dubious. It's also
possible that the behavior changed inadvertently with a later commit.
As a result, it's hard to come up with good test cases.

We now process leading whitespace before creating implied tags. This is
more in line with HTML5 and should avoid at least some issues with
partial text data.

For example, parsing the string "<head>   x" used to result in:

<html>
<head></head>
<body><p>   x</p></body>
</html>

And now results in:

<html>
<head>   </head>
<body><p>x</p></body>
</html>

Except for the implied <p> tag, this matches HTML5.
2025-02-13 14:31:44 +01:00
Nick Wellnhofer
ebbc31cc6b malloc-fail: Check for malloc failure in xhtmlNodeDumpOutput 2025-02-13 12:09:58 +01:00
Nick Wellnhofer
79ab721cb3 tests: Fix error return in testHugeEncodedChunk
Fixes #859.
2025-02-11 11:39:08 +01:00
Nick Wellnhofer
cfc854b839 fuzz: Work around glibc iconv() bug 2025-02-11 00:21:12 +01:00
Nick Wellnhofer
3a1526a5f7 xpath: Don't raise OOM error on long names
Short-lived regression.
2025-02-10 19:32:32 +01:00
Daniel Cheng
3dcde736d0 Use __has_attribute to check for __counted_by__ support
The initial clang patch to support __counted_by__ was landed and
reverted several times. There are some clang toolchains (e.g. the
Android toolchain) that report themselves as version 18 but do not
support __counted_by__. While it is debatable if Android should be
shipping a pre-release clang, using __has_attribute should be a bit
simpler overall.

Note that this doesn't migrate everything else to use __has_attribute:
while clang has always supported __has_attribute, gcc didn't support
it until a bit later.
2025-02-06 10:17:09 +01:00
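The feature check described in the commit above can be sketched roughly as follows. This is a minimal illustration, not the code from the commit: `DEMO_COUNTED_BY`, `struct demo_vec`, `demo_vec_new` and `demo_vec_get` are hypothetical names. The fallback definition of `__has_attribute` is the usual idiom for compilers that lack it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Compilers without __has_attribute get a fallback that reports every
 * attribute as unsupported. */
#ifndef __has_attribute
  #define __has_attribute(x) 0
#endif

/* DEMO_COUNTED_BY is a hypothetical macro name used only in this sketch.
 * The attribute is probed directly instead of comparing version numbers. */
#if __has_attribute(__counted_by__)
  #define DEMO_COUNTED_BY(member) __attribute__((__counted_by__(member)))
#else
  #define DEMO_COUNTED_BY(member)
#endif

struct demo_vec {
    size_t len;
    int items[] DEMO_COUNTED_BY(len);   /* flexible array sized by len */
};

/* Allocate a zero-filled vector; returns NULL on overflow or malloc failure. */
struct demo_vec *demo_vec_new(size_t len) {
    struct demo_vec *v;

    if (len > (SIZE_MAX - sizeof(*v)) / sizeof(v->items[0]))
        return NULL;
    v = calloc(1, sizeof(*v) + len * sizeof(v->items[0]));
    if (v == NULL)
        return NULL;
    v->len = len;   /* set the count before touching items */
    return v;
}

/* Read one element with a manual bounds check. */
int demo_vec_get(const struct demo_vec *v, size_t i) {
    assert(i < v->len);
    return v->items[i];
}
```

On a toolchain that lacks the attribute, `DEMO_COUNTED_BY` expands to nothing and the struct compiles unchanged, which is the point of probing the attribute rather than the compiler version.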
Nick Wellnhofer
35d8a230a8 tests: Fix expected errors in runxmlconf
The extra failure if regexps weren't enabled was actually a regression
fixed by the previous commit.
2025-02-06 10:14:56 +01:00
Zak Ridouh
b466e70ae5 Fix early return in vstateVPush in valid.c
While looking over the code in the fallback method for `vstateVPush` in
valid.c when `LIBXML_REGEXP_ENABLED` is not defined, I noticed that
there is an ungated `return(-1)` after attempting to allocate memory.

I believe this return should be guarded by a check for malloc failure.
2025-02-05 14:11:04 -08:00
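The kind of bug described in the commit above can be illustrated with a generic growable stack. This is a hedged sketch of the pattern only: `struct demo_stack` and `demo_stack_push` are hypothetical and do not reproduce the actual `vstateVPush` code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical growable stack, used only to show the pattern. */
struct demo_stack {
    int *items;
    size_t len;
    size_t cap;
};

/* Push a value, growing the array when it is full.
 * Returns 0 on success and -1 if memory could not be allocated. */
int demo_stack_push(struct demo_stack *s, int value) {
    assert(s != NULL);

    if (s->len == s->cap) {
        size_t new_cap = s->cap ? s->cap * 2 : 4;
        int *tmp = realloc(s->items, new_cap * sizeof(*tmp));

        /* The error return is gated on allocation failure. Without the
         * check, every push that has to grow the array would fail. */
        if (tmp == NULL)
            return -1;
        s->items = tmp;
        s->cap = new_cap;
    }
    s->items[s->len++] = value;
    return 0;
}
```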
Nick Wellnhofer
62d4697db6 gitlab-ci: Disable cmake:mingw for now
Executing /mingw64/bin/cmake.exe with any arguments fails without error
message and exit code 127 since 2025-01-21. I have no idea why.
2025-02-02 18:07:14 +01:00
Nick Wellnhofer
a25dc4398f Debug CI failure 2025-02-02 15:49:49 +01:00
Nick Wellnhofer
cd491ac07d dict: Handle ENOSYS from getentropy gracefully
Also add some comments.

Should fix #854.
2025-02-02 13:23:20 +01:00
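One way to handle ENOSYS gracefully is to fall back to a weaker seed when the entropy call is unavailable. The sketch below is an assumption about the general approach, not the code from dict.c. The entropy source is passed in as a function pointer with the same signature as getentropy() so that the example stays portable; all `demo_*` names are hypothetical, and how the real code treats errors other than ENOSYS is not stated in the commit message.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Same signature as getentropy(); injected so the sketch builds anywhere. */
typedef int (*demo_entropy_fn)(void *buf, size_t len);

/* Fill buf with seed material.
 * Returns 1 if the entropy source succeeded, 0 if the source reported
 * ENOSYS and the weaker fallback was used, -1 on any other error. */
int demo_fill_seed(demo_entropy_fn source, unsigned char *buf, size_t len,
                   uint32_t fallback) {
    assert(source != NULL && buf != NULL);

    if (source(buf, len) == 0)
        return 1;
    if (errno != ENOSYS)
        return -1;

    /* ENOSYS: the system call is not implemented. Derive bytes from the
     * caller-supplied fallback value with a simple LCG instead. */
    for (size_t i = 0; i < len; i++) {
        fallback = fallback * 1664525u + 1013904223u;
        buf[i] = (unsigned char)(fallback >> 24);
    }
    return 0;
}

/* Stand-in sources for demonstration. */
int demo_source_ok(void *buf, size_t len) {
    memset(buf, 0xAB, len);
    return 0;
}

int demo_source_missing(void *buf, size_t len) {
    (void)buf;
    (void)len;
    errno = ENOSYS;
    return -1;
}
```

The fallback bytes are predictable and only suitable where a weak seed is acceptable.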
Nick Wellnhofer
bc43786822 fuzz: Improve HTML fuzzer
Verify that pull and push parser produce the same result.

Fixes #849.
2025-02-02 11:15:45 +01:00
Nick Wellnhofer
c4f760be8a encoding: Handle iconv() returning EOPNOTSUPP on Apple
iconv() really shouldn't return undocumented error codes.
2025-02-02 11:15:45 +01:00
Nick Wellnhofer
8d7e38d536 fuzz: Ignore encodings when fuzzing on Apple
Not long ago, Apple decided to replace GNU libiconv with a patched up
version of FreeBSD's iconv implementation in their operating systems.
Unfortunately, the quality of both the original implementation and
Apple's patches is so abysmal that you routinely find issues when
fuzzing your own code.
2025-02-02 11:15:45 +01:00
Nick Wellnhofer
68be036f29 fuzz: Disable HTML encoding detection for now
This doesn't work with the push parser.
2025-02-02 11:15:45 +01:00
Nick Wellnhofer
b4d3d87ed2 parser: Fix parsing of doctype declarations
Fix some long-standing issues.

Fixes #504.
2025-02-02 11:15:45 +01:00
Nick Wellnhofer
c13fcc1910 html: Chunk text data in push parser
Follow the logic of the XML parser and chunk large text nodes.
2025-02-02 11:15:45 +01:00
Nick Wellnhofer
080285724b html: Make data parsing modes work with push parser
This can't be solved with a simple scan for a terminator. Instead, we
make htmlParseCharData handle incomplete data if the "partial" flag is
set.
2025-02-02 11:15:45 +01:00
Nick Wellnhofer
4be1e8befb html: Simplify htmlParseTryOrFinish a little 2025-02-02 11:15:45 +01:00
Nick Wellnhofer
12732592ef html: Remove unused epilog state 2025-02-02 11:15:45 +01:00
Nick Wellnhofer
70bf754e24 html: Fix pull-parsing of incomplete end tags
Handle this HTML5 quirk in htmlParseEndTag.
2025-02-02 11:15:45 +01:00
Nick Wellnhofer
4a776c78ec html: Use htmlParseElementInternal in push parser 2025-02-02 11:15:45 +01:00
Nick Wellnhofer
ba1537374b html: Fix corner case when push-parsing HTML5 comments 2025-02-02 11:15:45 +01:00
Nick Wellnhofer
e48fb5e4f2 html: Handle incomplete UTF-8 when push-parsing
For now, incomplete UTF-8 is always an error in push mode.

Eventually, we could pass chunked data to the character handler when
push-parsing. Then we'd have to handle incomplete sequences.
2025-02-02 11:15:45 +01:00
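Handling incomplete sequences, as the commit above anticipates, would first require detecting them at the end of a chunk. The following is a minimal sketch of such a check, not code from libxml2: `demo_utf8_incomplete_tail` is a hypothetical helper, and it inspects only lead-byte lengths without validating overlong or otherwise malformed sequences.

```c
#include <assert.h>
#include <stddef.h>

/* Return the number of trailing bytes in buf that start a UTF-8 sequence
 * but do not complete it, or 0 if the buffer ends on a sequence boundary. */
size_t demo_utf8_incomplete_tail(const unsigned char *buf, size_t len) {
    /* A lead byte at most 3 bytes from the end can still be incomplete. */
    size_t max_back = len < 3 ? len : 3;

    assert(buf != NULL || len == 0);

    for (size_t back = 1; back <= max_back; back++) {
        unsigned char c = buf[len - back];
        size_t need;

        if ((c & 0xC0) == 0x80)
            continue;                   /* continuation byte, keep looking */
        if ((c & 0xE0) == 0xC0)
            need = 2;
        else if ((c & 0xF0) == 0xE0)
            need = 3;
        else if ((c & 0xF8) == 0xF0)
            need = 4;
        else
            return 0;                   /* ASCII or invalid lead byte */
        return need > back ? back : 0;  /* incomplete if bytes are missing */
    }
    return 0;
}
```

A caller could hold back the reported number of bytes and prepend them to the next chunk.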
Nick Wellnhofer
6bb2ea8e70 html: Adjust xmlDetectEncoding for HTML
Don't check for UTF-32 or EBCDIC.

We now perform BOM sniffing and the first step of the HTML5 prescan
algorithm (detect UTF-16 XML declarations). The rest of the algorithm
still has to be implemented.
2025-02-02 11:15:44 +01:00
Nick Wellnhofer
227d8f739b html: Support encoding auto-detection in push parser
Align with pull parser.
2025-02-02 11:15:44 +01:00
Nick Wellnhofer
641fb1acf5 html: Fix state update in push parser 2025-02-02 11:15:44 +01:00
Nick Wellnhofer
a86a8ae922 html: Fix push-parsing of empty documents
Also simplify end-of-document handling in push parser.

Align with pull parser.
2025-02-02 11:15:44 +01:00
Nick Wellnhofer
d2fb68ed24 fuzz: Make large chunk size more likely
This now detects issues like 3eced32e in about 30 seconds.
2025-01-31 19:02:33 +01:00
Nick Wellnhofer
cdfb54ff7b Fix typos 2025-01-31 18:41:41 +01:00
Nick Wellnhofer
57e4bbd803 parser: Improve handling of NOCDATA option
Don't modify the callback structure. This makes sure that unsetting the
option works.
2025-01-31 18:41:35 +01:00
Nick Wellnhofer
1f5b5371cf parser: Improve handling of NOBLANKS option
Don't change the SAX handler.

Use a helper function to invoke "characters" SAX callback.

The old code didn't advance the input pointer consistently before
invoking the callback. There was also some inconsistency with respect to
ctxt->space handling. I don't understand the ctxt->space thing, but
now we always behave like the non-complex case before.
2025-01-31 18:09:22 +01:00
Nick Wellnhofer
7a8722f557 parser: Document that XML_PARSE_NOBLANKS is broken
Long text content can generate multiple "characters" callbacks which can
lead to NOBLANKS removing whitespace in non-whitespace text nodes. So
the NOBLANKS option doesn't even work reliably with the pull parser.
This would be extremely hard to fix.

Unfortunately, `xmllint --format` relies on this option which is another
reason why this feature never really worked.
2025-01-31 18:09:03 +01:00
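The failure mode described in the commit above can be reproduced with a simplified model: if text arrives in several pieces and each piece is tested for blankness on its own, whitespace inside a non-blank text node can be dropped. The sketch is an illustration under that assumption. The real check in libxml2 is more involved, and the `demo_*` functions are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Return 1 if the span consists only of XML whitespace characters. */
int demo_is_blank(const char *s, size_t len) {
    for (size_t i = 0; i < len; i++)
        if (s[i] != ' ' && s[i] != '\t' && s[i] != '\n' && s[i] != '\r')
            return 0;
    return 1;
}

/* Deliver text in pieces of chunk_size bytes, as repeated "characters"
 * callbacks would, and drop every piece that is blank on its own.
 * The surviving bytes are copied to out. Returns the number of bytes kept. */
size_t demo_keep_nonblank_chunks(const char *text, size_t len,
                                 size_t chunk_size, char *out) {
    size_t kept = 0;

    assert(chunk_size > 0);

    for (size_t pos = 0; pos < len; pos += chunk_size) {
        size_t n = len - pos < chunk_size ? len - pos : chunk_size;

        if (demo_is_blank(text + pos, n))
            continue;                   /* treated as ignorable whitespace */
        memcpy(out + kept, text + pos, n);
        kept += n;
    }
    return kept;
}
```

With the eight-byte text "ab    cd", a chunk size of 8 keeps all eight bytes, while a chunk size of 2 keeps only "abcd" because the two middle chunks are blank when viewed in isolation.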
Nick Wellnhofer
40e423d6c2 fuzz: Improve fuzzing of push parser
Also serialize the result of push-parsing and compare whether pull and
push parser produce the same result (differential fuzzing).

We lose the ability to inject IO errors when serializing for now, but
this isn't too important.

Use variable chunk size for push parser.

Fixes #849.
2025-01-31 15:50:00 +01:00
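The differential approach from the commit above, feeding the same input once as a whole and once in variable-sized chunks and then comparing the results, can be sketched with a toy word counter standing in for the parser. Everything here is hypothetical and only shows the shape of the comparison. The actual fuzzer compares serialized documents.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy "parser": counts whitespace-separated words. in_word carries state
 * from one chunk to the next. */
struct demo_counter {
    size_t words;
    int in_word;
};

void demo_feed(struct demo_counter *c, const char *data, size_t len) {
    for (size_t i = 0; i < len; i++) {
        int space = data[i] == ' ' || data[i] == '\n' || data[i] == '\t';

        if (!space && !c->in_word)
            c->words++;
        c->in_word = !space;
    }
}

/* "Pull" mode: the whole input in one call. */
size_t demo_count_pull(const char *data, size_t len) {
    struct demo_counter c = {0, 0};

    demo_feed(&c, data, len);
    return c.words;
}

/* "Push" mode: chunks of 1 to 7 bytes, sizes derived from seed. */
size_t demo_count_push(const char *data, size_t len, unsigned seed) {
    struct demo_counter c = {0, 0};
    size_t pos = 0;

    while (pos < len) {
        size_t n;

        seed = seed * 1103515245u + 12345u;
        n = 1 + (seed >> 16) % 7;
        if (n > len - pos)
            n = len - pos;
        demo_feed(&c, data + pos, n);
        pos += n;
    }
    return c.words;
}

/* Differential check: both modes must agree for any chunking. */
int demo_differential_ok(const char *data, unsigned seed) {
    size_t len;

    assert(data != NULL);
    len = strlen(data);
    return demo_count_pull(data, len) == demo_count_push(data, len, seed);
}
```

A fuzzer would call `demo_differential_ok` with generated inputs and seeds and report any disagreement as a bug.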
Nick Wellnhofer
9efe141422 parser: Fix detection of ']]>' when push-parsing
Fixes #850.
2025-01-31 15:50:00 +01:00
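A terminator such as ']]>' can be split across chunks when push-parsing, so a scanner has to carry its match state from one chunk to the next. The sketch below shows one way to do that. It is not the fix from the commit, whose details are not given here, and `struct demo_scan` and `demo_find_cdata_end` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Number of leading characters of "]]>" matched so far (0, 1 or 2).
 * The value persists across chunks. */
struct demo_scan {
    int matched;
};

/* Scan one chunk. Returns the offset just past the '>' that completes
 * "]]>" within this chunk, or -1 if the terminator has not ended yet. */
long demo_find_cdata_end(struct demo_scan *st, const char *chunk, size_t len) {
    assert(st->matched >= 0 && st->matched <= 2);

    for (size_t i = 0; i < len; i++) {
        char c = chunk[i];

        if (c == ']') {
            /* A third ']' keeps the state at 2: "]]]>" still ends in "]]>". */
            if (st->matched < 2)
                st->matched++;
        } else if (c == '>' && st->matched == 2) {
            st->matched = 0;
            return (long)(i + 1);
        } else {
            st->matched = 0;
        }
    }
    return -1;
}
```

Feeding the chunks "abc]", "]" and ">x" in sequence reports the terminator in the third chunk at offset 1, even though no single chunk contains the full ']]>'.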
Nick Wellnhofer
115b13f9d1 parser: Document push parser limitations 2025-01-31 15:50:00 +01:00
Nick Wellnhofer
53a48468ae xmllint: Make --push report parse errors
The push parser leaves documents in ctxt->myDoc even if they're invalid.

Also fix documentation.

Regressed with f8ff4d86.
2025-01-31 15:50:00 +01:00
Nick Wellnhofer
5535721f04 parser: Grow input buffer after lots of whitespace
Make sure that the input buffer is grown after consuming large amounts
of whitespace.

Also move a comment.
2025-01-31 15:49:53 +01:00
Nick Wellnhofer
218264fada parser: Always shrink input buffer
Shrinking the input buffer is cheap now and should be done as soon as
possible.
2025-01-30 01:26:01 +01:00
Nick Wellnhofer
0de90f518d parser: Define SIZE_MAX 2025-01-30 01:25:31 +01:00
Nick Wellnhofer
3eced32ea3 parser: Fix push parser with encoding and single chunk
When push-parsing with an encoding handler, we must convert the whole
buffer in the initial conversion. Otherwise, parsing a single chunk
larger than ~4KB would fail.

Regressed with commit 34c9108f.
2025-01-30 00:02:34 +01:00
Nick Wellnhofer
4bd66d4549 Mention contributors in Copyright
To clarify that libxml2 is the work of many people, add the following
copyright notice to Copyright:

    Copyright (C) The Libxml2 Contributors.
2025-01-29 13:11:38 +01:00
Nick Wellnhofer
fdc73dd07c README: Fix CMake example options
zlib is disabled by default now.
2025-01-29 12:58:31 +01:00
Nick Wellnhofer
64bfe1f7be README: Add note about security issues 2025-01-29 12:51:11 +01:00