7222 Commits

Author SHA1 Message Date
Daniel Cheng
3dcde736d0 Use __has_attribute to check for __counted_by__ support
The initial clang patch to support __counted_by__ was landed and
reverted several times. There are some clang toolchains (e.g. the
Android toolchain) that report themselves as version 18 but do not
support __counted_by__. While it is debatable if Android should be
shipping a pre-release clang, using __has_attribute should be a bit
simpler overall.

Note that this doesn't migrate everything else to use __has_attribute:
while clang has always supported __has_attribute, gcc didn't support
it until a bit later.
2025-02-06 10:17:09 +01:00
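A minimal sketch of the feature-test pattern this commit describes. The macro name `MY_COUNTED_BY`, the struct `int_buf`, and both helper functions are hypothetical and not taken from libxml2. The `#ifndef __has_attribute` fallback is one common way to cope with compilers that predate the operator.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Compilers without __has_attribute: treat every attribute as unsupported. */
#ifndef __has_attribute
  #define __has_attribute(x) 0
#endif

/* Hypothetical macro name: ask the compiler directly instead of
 * comparing version numbers, which some toolchains misreport. */
#if __has_attribute(__counted_by__)
  #define MY_COUNTED_BY(field) __attribute__((__counted_by__(field)))
#else
  #define MY_COUNTED_BY(field)
#endif

/* Hypothetical struct: `len` is the element count of `data`. */
struct int_buf {
    size_t len;
    int data[] MY_COUNTED_BY(len);
};

/* Allocates a zero-filled buffer with room for `len` elements. */
static struct int_buf *int_buf_new(size_t len) {
    struct int_buf *b = calloc(1, sizeof(*b) + len * sizeof(int));
    if (b != NULL)
        b->len = len;   /* set the count before touching data[] */
    return b;
}

/* Sums all elements, relying on `len` as the bound. */
static int int_buf_sum(const struct int_buf *b) {
    int sum = 0;
    assert(b != NULL);
    for (size_t i = 0; i < b->len; i++)
        sum += b->data[i];
    return sum;
}
```

On a compiler without the attribute the macro expands to nothing, so the struct still compiles and only the extra bounds information is lost.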
Nick Wellnhofer
35d8a230a8 tests: Fix expected errors in runxmlconf
The extra failure if regexps weren't enabled was actually a regression
fixed by the previous commit.
2025-02-06 10:14:56 +01:00
Zak Ridouh
b466e70ae5 Fix early return in vstateVPush in valid.c
While looking over the code in the fallback method for `vstateVPush` in
valid.c when `LIBXML_REGEXP_ENABLED` is not defined, I noticed that
there is an ungated `return(-1)` after attempting to allocate memory.

I believe this should be inside a check for malloc failure.
2025-02-05 14:11:04 -08:00
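The kind of bug described above can be illustrated with a generic growable stack. This is a sketch only: `int_stack` and `int_stack_push` are hypothetical names, and the code is not the `vstateVPush` implementation from valid.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical growable stack, not libxml2's validation state. */
typedef struct {
    int *items;
    size_t len;
    size_t cap;
} int_stack;

/* Returns 0 on success, -1 if memory could not be allocated. */
static int int_stack_push(int_stack *s, int value) {
    assert(s != NULL);
    if (s->len == s->cap) {
        size_t new_cap = s->cap ? s->cap * 2 : 4;
        int *tmp = realloc(s->items, new_cap * sizeof(*tmp));
        if (tmp == NULL)
            return -1;  /* gated: taken only when allocation fails */
        /* An unconditional `return -1;` at this point would make
         * every growth step fail, even after a successful allocation. */
        s->items = tmp;
        s->cap = new_cap;
    }
    s->items[s->len++] = value;
    return 0;
}
```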
Nick Wellnhofer
62d4697db6 gitlab-ci: Disable cmake:mingw for now
Executing /mingw64/bin/cmake.exe with any arguments fails without error
message and exit code 127 since 2025-01-21. I have no idea why.
2025-02-02 18:07:14 +01:00
Nick Wellnhofer
a25dc4398f Debug CI failure 2025-02-02 15:49:49 +01:00
Nick Wellnhofer
cd491ac07d dict: Handle ENOSYS from getentropy gracefully
Also add some comments.

Should fix #854.
2025-02-02 13:23:20 +01:00
Nick Wellnhofer
bc43786822 fuzz: Improve HTML fuzzer
Verify that pull and push parser produce the same result.

Fixes #849.
2025-02-02 11:15:45 +01:00
Nick Wellnhofer
c4f760be8a encoding: Handle iconv() returning EOPNOTSUPP on Apple
iconv() really shouldn't return undocumented error codes.
2025-02-02 11:15:45 +01:00
Nick Wellnhofer
8d7e38d536 fuzz: Ignore encodings when fuzzing on Apple
Not long ago, Apple decided to replace GNU libiconv with a patched up
version of FreeBSD's iconv implementation in their operating systems.
Unfortunately, the quality of both the original implementation and
Apple's patches is so abysmal that you routinely find issues when
fuzzing your own code.
2025-02-02 11:15:45 +01:00
Nick Wellnhofer
68be036f29 fuzz: Disable HTML encoding detection for now
This doesn't work with the push parser.
2025-02-02 11:15:45 +01:00
Nick Wellnhofer
b4d3d87ed2 parser: Fix parsing of doctype declarations
Fix some long-standing issues.

Fixes #504.
2025-02-02 11:15:45 +01:00
Nick Wellnhofer
c13fcc1910 html: Chunk text data in push parser
Follow the logic of the XML parser and chunk large text nodes.
2025-02-02 11:15:45 +01:00
Nick Wellnhofer
080285724b html: Make data parsing modes work with push parser
This can't be solved with a simple scan for a terminator. Instead, we
make htmlParseCharData handle incomplete data if the "partial" flag is
set.
2025-02-02 11:15:45 +01:00
Nick Wellnhofer
4be1e8befb html: Simplify htmlParseTryOrFinish a little 2025-02-02 11:15:45 +01:00
Nick Wellnhofer
12732592ef html: Remove unused epilog state 2025-02-02 11:15:45 +01:00
Nick Wellnhofer
70bf754e24 html: Fix pull-parsing of incomplete end tags
Handle this HTML5 quirk in htmlParseEndTag.
2025-02-02 11:15:45 +01:00
Nick Wellnhofer
4a776c78ec html: Use htmlParseElementInternal in push parser 2025-02-02 11:15:45 +01:00
Nick Wellnhofer
ba1537374b html: Fix corner case when push-parsing HTML5 comments 2025-02-02 11:15:45 +01:00
Nick Wellnhofer
e48fb5e4f2 html: Handle incomplete UTF-8 when push-parsing
For now, incomplete UTF-8 is always an error in push mode.

Eventually, we could pass chunked data to the character handler when
push-parsing. Then we'd have to handle incomplete sequences.
2025-02-02 11:15:45 +01:00
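Handling incomplete sequences, as mentioned above, would first require noticing that a chunk ends in the middle of a multi-byte character. The following is a rough sketch of such a check. The function name `utf8_incomplete_tail` is hypothetical, the code is not from libxml2, and it does not validate the sequence beyond its lead byte.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Returns the number of bytes at the end of `buf` that form the start
 * of a UTF-8 sequence whose remaining bytes are missing, or 0 if the
 * buffer does not end in such a truncated sequence.
 */
static size_t utf8_incomplete_tail(const unsigned char *buf, size_t len) {
    size_t max = len < 3 ? len : 3;  /* a truncated tail has at most 3 bytes */
    size_t need;

    assert(buf != NULL || len == 0);
    for (size_t i = 1; i <= max; i++) {
        unsigned char c = buf[len - i];

        if ((c & 0xC0) == 0x80)
            continue;                /* continuation byte, keep looking back */
        if (c < 0x80)
            need = 1;                /* ASCII */
        else if ((c & 0xE0) == 0xC0)
            need = 2;
        else if ((c & 0xF0) == 0xE0)
            need = 3;
        else if ((c & 0xF8) == 0xF0)
            need = 4;
        else
            return 0;                /* invalid lead byte, not merely incomplete */
        return need > i ? i : 0;
    }
    return 0;
}
```

A caller could hold back the reported number of bytes and prepend them to the next chunk instead of raising an error.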
Nick Wellnhofer
6bb2ea8e70 html: Adjust xmlDetectEncoding for HTML
Don't check for UTF-32 or EBCDIC.

We now perform BOM sniffing and the first step of the HTML5 prescan
algorithm (detect UTF-16 XML declarations). The rest of the algorithm
still has to be implemented.
2025-02-02 11:15:44 +01:00
Nick Wellnhofer
227d8f739b html: Support encoding auto-detection in push parser
Align with pull parser.
2025-02-02 11:15:44 +01:00
Nick Wellnhofer
641fb1acf5 html: Fix state update in push parser 2025-02-02 11:15:44 +01:00
Nick Wellnhofer
a86a8ae922 html: Fix push-parsing of empty documents
Also simplify end-of-document handling in push parser.

Align with pull parser.
2025-02-02 11:15:44 +01:00
Nick Wellnhofer
d2fb68ed24 fuzz: Make large chunk size more likely
This now detects issues like 3eced32e in about 30 seconds.
2025-01-31 19:02:33 +01:00
Nick Wellnhofer
cdfb54ff7b Fix typos 2025-01-31 18:41:41 +01:00
Nick Wellnhofer
57e4bbd803 parser: Improve handling of NOCDATA option
Don't modify the callback structure. This makes sure that unsetting the
option works.
2025-01-31 18:41:35 +01:00
Nick Wellnhofer
1f5b5371cf parser: Improve handling of NOBLANKS option
Don't change the SAX handler.

Use a helper function to invoke "characters" SAX callback.

The old code didn't advance the input pointer consistently before
invoking the callback. There was also some inconsistency with respect to
ctxt->space handling. I don't understand the ctxt->space thing, but
now we always behave like the non-complex case before.
2025-01-31 18:09:22 +01:00
Nick Wellnhofer
7a8722f557 parser: Document that XML_PARSE_NOBLANKS is broken
Long text content can generate multiple "characters" callbacks which can
lead to NOBLANKS removing whitespace in non-whitespace text nodes. So
the NOBLANKS option doesn't even work reliably with the pull parser.
This would be extremely hard to fix.

Unfortunately, `xmllint --format` relies on this option which is another
reason why this feature never really worked.
2025-01-31 18:09:03 +01:00
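The failure mode described above can be reproduced with a toy model. This sketch assumes a deliberately simplified rule ("drop any chunk that is entirely whitespace"), which is not libxml2's actual blank-detection logic, and all function names are hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Returns 1 if the chunk consists only of whitespace. */
static int chunk_is_blank(const char *chunk, size_t len) {
    for (size_t i = 0; i < len; i++)
        if (!isspace((unsigned char)chunk[i]))
            return 0;
    return 1;
}

/*
 * Toy model of a "characters" callback with blank removal: `text` is
 * delivered in pieces of at most `chunk_size` bytes and every piece
 * that is entirely whitespace is dropped. `out` must have room for
 * strlen(text) + 1 bytes.
 */
static void deliver_dropping_blank_chunks(const char *text,
                                          size_t chunk_size, char *out) {
    size_t len = strlen(text), pos = 0, written = 0;

    assert(chunk_size > 0);
    while (pos < len) {
        size_t n = len - pos < chunk_size ? len - pos : chunk_size;

        if (!chunk_is_blank(text + pos, n)) {
            memcpy(out + written, text + pos, n);
            written += n;
        }
        pos += n;
    }
    out[written] = '\0';
}
```

With the text `"ab    cd"` delivered as a single chunk, nothing is removed. Delivered two bytes at a time, the two whitespace-only chunks in the middle are dropped, so whitespace inside a non-whitespace text node is lost.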
Nick Wellnhofer
40e423d6c2 fuzz: Improve fuzzing of push parser
Also serialize the result of push-parsing and compare whether pull and
push parser produce the same result (differential fuzzing).

We lose the ability to inject IO errors when serializing for now, but
this isn't too important.

Use variable chunk size for push parser.

Fixes #849.
2025-01-31 15:50:00 +01:00
Nick Wellnhofer
9efe141422 parser: Fix detection of ']]>' when push-parsing
Fixes #850.
2025-01-31 15:50:00 +01:00
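The commit message does not describe the fix itself. As general background only, one difficulty with detecting `]]>` in a push parser is that the three characters can be split across chunk boundaries. The sketch below shows one way to track that with a small state machine. The names `cdata_end_scanner` and `cdata_end_scan` are hypothetical and the code is not libxml2's.

```c
#include <assert.h>
#include <stddef.h>

/* Scanner state that survives between chunks. */
typedef struct {
    int brackets;   /* consecutive ']' seen so far, capped at 2 */
} cdata_end_scanner;

/*
 * Feeds one chunk to the scanner. Returns the offset just past the '>'
 * that completes "]]>" within this chunk, or -1 if the terminator has
 * not been completed yet.
 */
static long cdata_end_scan(cdata_end_scanner *s, const char *buf, size_t len) {
    assert(s != NULL);
    for (size_t i = 0; i < len; i++) {
        char c = buf[i];

        if (c == ']') {
            if (s->brackets < 2)
                s->brackets++;      /* "]]]" still ends in two brackets */
        } else if (c == '>' && s->brackets == 2) {
            s->brackets = 0;
            return (long)(i + 1);
        } else {
            s->brackets = 0;        /* any other character breaks the run */
        }
    }
    return -1;
}
```

Because the bracket count is kept in the scanner rather than recomputed per chunk, feeding `"abc]"`, `"]"` and `">x"` in three separate calls still finds the terminator in the third chunk.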
Nick Wellnhofer
115b13f9d1 parser: Document push parser limitations 2025-01-31 15:50:00 +01:00
Nick Wellnhofer
53a48468ae xmllint: Make --push report parse errors
The push parser leaves documents in ctxt->myDoc even if they're invalid.

Also fix documentation.

Regressed with f8ff4d86.
2025-01-31 15:50:00 +01:00
Nick Wellnhofer
5535721f04 parser: Grow input buffer after lots of whitespace
Make sure that the input buffer is grown after consuming large amounts
of whitespace.

Also move a comment.
2025-01-31 15:49:53 +01:00
Nick Wellnhofer
218264fada parser: Always shrink input buffer
Shrinking the input buffer is cheap now and should be done as soon as
possible.
2025-01-30 01:26:01 +01:00
Nick Wellnhofer
0de90f518d parser: Define SIZE_MAX 2025-01-30 01:25:31 +01:00
Nick Wellnhofer
3eced32ea3 parser: Fix push parser with encoding and single chunk
When push-parsing with an encoding handler, we must convert the whole
buffer in the initial conversion. Otherwise, parsing a single chunk
larger than ~4KB would fail.

Regressed with commit 34c9108f.
2025-01-30 00:02:34 +01:00
Nick Wellnhofer
4bd66d4549 Mention contributors in Copyright
To clarify that libxml2 is the work of many people, add the following
copyright notice to Copyright:

    Copyright (C) The Libxml2 Contributors.
2025-01-29 13:11:38 +01:00
Nick Wellnhofer
fdc73dd07c README: Fix CMake example options
zlib is disabled by default now.
2025-01-29 12:58:31 +01:00
Nick Wellnhofer
64bfe1f7be README: Add note about security issues 2025-01-29 12:51:11 +01:00
Nick Wellnhofer
93506d41cb parser: Make catalog PIs opt-in
This is an obscure feature that shouldn't be enabled by default.
2025-01-29 00:50:47 +01:00
Nick Wellnhofer
1082d813e8 parser: Prepare to make decompression opt-in
Add a new parser option XML_PARSE_UNZIP that enables decompression.
xmlReadFile, xmlCtxtReadFile and xmlCreateURLParserCtxt always set
this option currently, but downstream users should start to set the
option if they really need it.
2025-01-29 00:49:57 +01:00
Nick Wellnhofer
a78843be5e xmllint: Support compressed input from stdin
Another regression related to reading from stdin.

Making a "-" filename read from stdin was deeply baked into the core
IO code but is inherently insecure. I really want to reenable this
dangerous feature as sparingly as possible.

This now enables compressed input when using the "Fd" API functions
which wasn't supported before. But XML_PARSE_NO_UNZIP will be
inverted later.

Allow compressed stdin in xmlReadFile to support xmlstarlet and older
versions of xsltproc. So far, these are the only known command-line
tools that rely on "-" meaning stdin.
2025-01-28 23:20:37 +01:00
Nick Wellnhofer
a8d8a70c51 uri: Fix handling of Windows drive letters
Allow drive letters in URI paths. Technically, these should be treated
as URI schemes, but this is not what users expect. This also makes sure
that paths with drive letters are resolved as filesystem paths and
unescaped, for example when used in libxslt's document() function.

Should fix #832.
2025-01-27 14:28:29 +01:00
Nick Wellnhofer
6904d4c225 fuzz: Fix OSS-Fuzz build of lint fuzzer 2025-01-25 13:55:23 +01:00
Benjamin Gilbert
cd7299a8e3 meson: Fix setup with ICU as sibling subproject
Meson wrapdb provides a wrap for ICU, so libxml2 and ICU could both be
built as subprojects of the same Meson parent project.  In this case, with
the icu option enabled, setup was failing with:

    subprojects/libxml2-2.13.5/meson.build:603:22: ERROR: Could not get an internal variable and no default provided for <InternalDependency dep228908115162702543524838879388991448872: True>

This is because we can't get a dependency variable from a subproject that
hasn't been built yet.  Fall back to assuming DEFS is empty, as it is on
my system.
2025-01-24 18:59:12 -08:00
Nick Wellnhofer
6ec616ba26 encoding: Don't allow POSIX indicator suffixes in encoding names
Suffixes like "//IGNORE" change the behavior of iconv.

Also add comment on how we currently rely on GNU libiconv behavior
which technically violates the POSIX spec.
2025-01-24 20:47:52 +01:00
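A rough sketch of how such suffixes could be rejected before a name is handed to iconv. The function `encoding_name_is_plain` is hypothetical, and treating any occurrence of `//` as a suffix marker is an assumption made for this example, not necessarily the rule libxml2 applies.

```c
#include <assert.h>
#include <string.h>

/*
 * Returns 1 if the encoding name carries no "//" indicator suffix
 * such as "//IGNORE" or "//TRANSLIT", and 0 otherwise.
 */
static int encoding_name_is_plain(const char *name) {
    assert(name != NULL);
    return strstr(name, "//") == NULL;
}
```

A caller would refuse to create a converter when the function returns 0, so that a document cannot change conversion behavior through its declared encoding name.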
Nick Wellnhofer
9b1028c906 fuzz: Fix comments 2025-01-23 20:37:37 +01:00
Nick Wellnhofer
e95c4b07ae fuzz: Also test xmllint --repeat option 2025-01-23 20:30:40 +01:00
Nick Wellnhofer
dc6270d110 xmllint: Fix UAF with --push --repeat
Short-lived regression. Fixes #841.
2025-01-23 20:30:25 +01:00
Grzegorz Szymaszek
9d7bbf1952 tree: Fix variable name in xmlAddChild documentation 2025-01-23 17:49:54 +00:00