350 Commits

Author SHA1 Message Date
Nick Wellnhofer
84c6524e26 encoding: Support input-only and output-only converters
Make it possible to open an encoding handler only for input or output.
This avoids the creation of unnecessary converters.

Should also fix #863.
2025-03-13 22:15:10 +01:00
Nick Wellnhofer
69b83bb68e encoding: Detect truncated multi-byte sequences with ICU
Unlike iconv or the internal converters, ICU consumes truncated multi-
byte sequences at the end of an input buffer. We currently check for a
non-empty raw input buffer to detect truncated sequences, so this fails
with ICU.

It might be possible to inspect the pivot buffer pointers, but it seems
cleaner to implement a `flush` flag for some encoding and I/O functions.
After flushing, we can check for U_TRUNCATED_CHAR_FOUND with ICU, or
detect remaining input with other converters.

Also fix detection of truncated sequences for HTML, XML content and
DTDs with iconv.
2025-03-13 22:15:10 +01:00
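The "non-empty raw input buffer" check mentioned in this entry comes down to asking whether the leftover bytes form an incomplete multi-byte sequence. Below is a minimal sketch for UTF-8 only. It is not libxml2 or ICU code, and `utf8_truncated_tail` is a hypothetical helper name chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper, not part of libxml2: return the number of bytes
 * at the end of buf that start a UTF-8 sequence without completing it.
 * Returns 0 if the buffer ends on a sequence boundary, or if the tail
 * is invalid for another reason (a real decoder reports that case
 * separately).
 */
static size_t
utf8_truncated_tail(const unsigned char *buf, size_t len) {
    size_t back, need;

    assert(buf != NULL || len == 0);

    /* Look at the last four bytes at most, walking backwards. */
    for (back = 1; back <= len && back <= 4; back++) {
        unsigned char c = buf[len - back];

        if ((c & 0xC0) == 0x80)
            continue;                   /* continuation byte 10xxxxxx */

        if (c < 0x80)
            need = 1;                   /* ASCII */
        else if ((c & 0xE0) == 0xC0)
            need = 2;
        else if ((c & 0xF0) == 0xE0)
            need = 3;
        else if ((c & 0xF8) == 0xF0)
            need = 4;
        else
            return 0;                   /* invalid lead byte */

        /* Truncated only if the lead byte needs more bytes than exist. */
        return (need > back) ? back : 0;
    }

    return 0;
}
```

A caller that has flushed its input could treat a non-zero result as a truncated-sequence error instead of waiting for more data.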
Nick Wellnhofer
25490528af parser: Fix spurious error in SAX mode
Short-lived regression from 5f0b1378.
2025-03-11 16:34:30 +01:00
Nick Wellnhofer
5f0b1378d7 parser: Add more parser context accessors
Fixes #763.
2025-03-08 22:36:06 +01:00
Nick Wellnhofer
6bb2ea8e70 html: Adjust xmlDetectEncoding for HTML
Don't check for UTF-32 or EBCDIC.

We now perform BOM sniffing and the first step of the HTML5 prescan
algorithm (detect UTF-16 XML declarations). The rest of the algorithm
still has to be implemented.
2025-02-02 11:15:44 +01:00
Nick Wellnhofer
0de90f518d parser: Define SIZE_MAX 2025-01-30 01:25:31 +01:00
Nick Wellnhofer
3eced32ea3 parser: Fix push parser with encoding and single chunk
When push-parsing with an encoding handler, we must convert the whole
buffer in the initial conversion. Otherwise, parsing a single chunk
larger than ~4KB would fail.

Regressed with commit 34c9108f.
2025-01-30 00:02:34 +01:00
Nick Wellnhofer
1082d813e8 parser: Prepare to make decompression opt-in
Add a new parser option XML_PARSE_UNZIP that enables decompression.
xmlReadFile, xmlCtxtReadFile and xmlCreateURLParserCtxt always set
this option currently, but downstream users should start to set the
option if they really need it.
2025-01-29 00:49:57 +01:00
Nick Wellnhofer
a78843be5e xmllint: Support compressed input from stdin
Another regression related to reading from stdin.

Making a "-" filename read from stdin was deeply baked into the core
IO code but is inherently insecure. I really want to reenable this
dangerous feature as sparingly as possible.

This now enables compressed input when using the "Fd" API functions
which wasn't supported before. But XML_PARSE_NO_UNZIP will be
inverted later.

Allow compressed stdin in xmlReadFile to support xmlstarlet and older
versions of xsltproc. So far, these are the only known command-line
tools that rely on "-" meaning stdin.
2025-01-28 23:20:37 +01:00
Nick Wellnhofer
2e3a91a766 doc: Fix documentation 2024-12-26 21:05:39 +01:00
Nick Wellnhofer
8231c03663 parser: Check reallocations for overflow 2024-12-21 19:37:37 +01:00
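As an illustration of what an overflow check on a reallocation can look like, here is a minimal sketch. It is not the code from this commit: `grow_array`, its doubling policy and the initial capacity of 8 are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical helper, not the libxml2 implementation: double the
 * capacity of an array of `*cap` elements of `elemSize` bytes each.
 * Refuses any request whose element count or byte count would
 * overflow size_t. Returns the new block, or NULL on failure, in
 * which case the old block and *cap are left untouched.
 */
static void *
grow_array(void *ptr, size_t *cap, size_t elemSize) {
    size_t newCap;
    void *tmp;

    assert(cap != NULL && elemSize > 0);

    if (*cap == 0) {
        newCap = 8;                     /* arbitrary initial capacity */
    } else {
        if (*cap > SIZE_MAX / 2)
            return NULL;                /* doubling would overflow */
        newCap = *cap * 2;
    }

    if (newCap > SIZE_MAX / elemSize)
        return NULL;                    /* byte count would overflow */

    tmp = realloc(ptr, newCap * elemSize);
    if (tmp == NULL)
        return NULL;

    *cap = newCap;
    return tmp;
}
```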
Nick Wellnhofer
0dd910e82b save: Fix handling of catastrophic errors
Don't overwrite catastrophic errors in xmlSaveErr.

Overwrite non-catastrophic errors in xmlOutputBufferClose.
2024-12-19 02:30:36 +01:00
Nick Wellnhofer
1e1b48918c parser: Also raise error if ctxt is NULL
Update global error variable even if context is missing because of an
invalid (NULL) argument.
2024-12-13 17:57:11 +01:00
Nick Wellnhofer
70cce2ece3 parser: Make XML_ERR_RESOURCE_LIMIT non-catastrophic 2024-11-26 14:20:25 +01:00
Nick Wellnhofer
57087e5fc7 parser: Don't overwrite catastrophic errors
Stop reporting errors after a catastrophic error.

Also make sure that ctxt->errNo matches ctxt->lastError.code.
2024-11-26 00:47:48 +01:00
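The rule "stop reporting errors after a catastrophic error" can be sketched with a small state record. This is a simplified illustration, not libxml2's parser context: the `ErrState` type and `record_error` function are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified error state; not libxml2's xmlParserCtxt. */
typedef struct {
    int code;           /* last recorded error code, 0 if none */
    int catastrophic;   /* non-zero once an unrecoverable error is stored */
} ErrState;

/*
 * Record an error unless a catastrophic one is already stored.
 * Returns 1 if the error was recorded, 0 if it was suppressed.
 */
static int
record_error(ErrState *st, int code, int isCatastrophic) {
    assert(st != NULL);

    if (st->catastrophic)
        return 0;       /* keep the original catastrophic error */

    st->code = code;
    st->catastrophic = isCatastrophic;
    return 1;
}
```

Keeping a single record as the source of truth is also one way to avoid two error fields drifting apart.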
Nick Wellnhofer
0f4f89005d parser: Rename inputPush to xmlCtxtPushInput 2024-11-19 00:25:23 +01:00
Nick Wellnhofer
e2ad249c23 parser: Deprecate more internal symbols
- xmlParseExternalSubset
- xmlPushInput
- xmlPopInput
- xmlCopyCharMultiByte
- xmlCreateEntityParserCtxt
- xmlStringComment
2024-11-19 00:25:23 +01:00
Nick Wellnhofer
bd9eed4694 parser: Make unsupported encodings an error in declarations
This was changed in 45157261, but in encoding declarations, unsupported
encodings should raise a fatal error.

Fixes #794.
2024-09-02 19:29:39 +02:00
Nick Wellnhofer
1d009fe35d parser: Report at least one fatal error 2024-08-05 15:14:21 +02:00
Nick Wellnhofer
bfed6e6ae8 parser: Fix error handling after reaching limit
Mark document as non-wellformed and stop parser even if error limit was
reached.

Regressed in abd74186.
2024-08-05 14:58:37 +02:00
Nick Wellnhofer
6a3c0b0d93 parser: Increase XML_MAX_DICTIONARY_LIMIT
This limit is somewhat arbitrary and can be reached when fuzzing
documents up to 1 MB.

Increase limit to 100 MB and disable limit if XML_PARSE_HUGE is set.
2024-07-22 12:53:00 +02:00
Nick Wellnhofer
a6f54f055b io: Fine-tune initial IO buffer size 2024-07-16 17:42:10 +02:00
Nick Wellnhofer
34c9108f15 encoding: Add sizeOut argument to xmlCharEncInput
When push parsing, we want to convert as much of the input as possible.
When pull parsing memory buffers, we want to convert data chunk by chunk
to save memory.
2024-07-16 17:42:10 +02:00
Nick Wellnhofer
92f30711de parser: Optimize buffer shrinking
Remove checks now that we can shrink memory buffers efficiently.

Shrink more aggressively.
2024-07-16 17:42:10 +02:00
Nick Wellnhofer
a221cd7849 buf: Rework xmlBuf code
Always use what the old implementation called the "IO" allocation
scheme, which allows the content pointer to move past the start of the
initial allocation. This is inexpensive and allows efficient shrinking.

Optimize xmlBufGrow, reusing shrunken memory as much as possible.

Simplify xmlBufAdd.

Make xmlBufBackToBuffer return an error on overflow.

Make "size" exclude the terminating NUL byte.

Always provide an initial size.

Reintroduce static buffers.

Remove xmlBufResize and several other functions.
2024-07-16 17:42:10 +02:00
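The idea of a content pointer that can move inside the allocation, with shrunken space reused before growing, can be sketched as follows. This is loosely modelled on the description in this entry and is not the actual xmlBuf layout. The `Buf` type and the `buf_*` functions are hypothetical, and the sketch returns an error where a real buffer would reallocate.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical buffer; `content` may point anywhere inside `mem`. */
typedef struct {
    unsigned char *mem;     /* start of the allocation */
    unsigned char *content; /* first unconsumed byte */
    size_t use;             /* number of content bytes */
    size_t size;            /* capacity from mem, excluding the NUL */
} Buf;

static int
buf_init(Buf *b, size_t size) {
    assert(b != NULL);
    b->mem = malloc(size + 1);          /* one extra byte for the NUL */
    if (b->mem == NULL)
        return -1;
    b->content = b->mem;
    b->use = 0;
    b->size = size;
    b->mem[0] = 0;
    return 0;
}

/* Drop up to `len` bytes from the front by advancing the pointer. */
static size_t
buf_shrink(Buf *b, size_t len) {
    if (len > b->use)
        len = b->use;
    b->content += len;
    b->use -= len;
    return len;
}

/* Append data, reclaiming space freed by earlier shrinks if needed. */
static int
buf_add(Buf *b, const void *data, size_t len) {
    size_t lead = (size_t) (b->content - b->mem);

    if (len > b->size - lead - b->use) {
        /* Not enough room at the tail. */
        if (len > b->size - b->use)
            return -1;                  /* sketch only: no realloc */
        /* Move content back to the start to reuse shrunken memory. */
        memmove(b->mem, b->content, b->use);
        b->content = b->mem;
    }

    memcpy(b->content + b->use, data, len);
    b->use += len;
    b->content[b->use] = 0;
    return 0;
}
```

Shrinking costs only a pointer increment, and the `memmove` is paid only when the tail runs out of room.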
Nick Wellnhofer
728869809e error: Add helper functions to print errors and abort 2024-07-15 16:33:38 +02:00
Nick Wellnhofer
aa6aec19b0 parser: Fix xmlInputSetEncodingHandler again
Short-lived regression.
2024-07-11 12:42:13 +02:00
Nick Wellnhofer
8af55c8d20 parser: Rename new input API functions
These weren't made public yet.
2024-07-11 01:33:29 +02:00
Nick Wellnhofer
d74ca59491 parser: Rename internal xmlNewInput functions 2024-07-11 01:31:50 +02:00
Nick Wellnhofer
4f329dc524 parser: Implement xmlCtxtParseContent
This implements xmlCtxtParseContent, a better alternative to
xmlParseInNodeContext or xmlParseBalancedChunkMemory. It accepts a
parser context and a parser input, making it a lot more versatile.

xmlParseInNodeContext is now implemented in terms of
xmlCtxtParseContent. This makes sure that xmlParseInNodeContext never
modifies the target document, improving thread safety.
xmlParseInNodeContext is also more lenient now with regard to undeclared
entities.

Fixes #727.
2024-07-11 01:26:32 +02:00
Nick Wellnhofer
4fec0889e0 parser: Fix memory leak in xmlInputSetEncodingHandler
Short-lived regression.
2024-07-10 22:32:33 +02:00
Nick Wellnhofer
5935471732 parser: Fix malloc failure handling in xmlInputSetEncodingHandler
Don't set encoder if allocating buffer failed. This could lead to
xmlByteConsumed processing invalid UTF-8.
2024-07-09 14:11:28 +02:00
Nick Wellnhofer
ea31ac5bba fuzz: Fix spaceMax 2024-07-07 04:19:09 +02:00
Nick Wellnhofer
29e3ab92f0 fuzz: Make reallocs more likely 2024-07-06 15:48:43 +02:00
Nick Wellnhofer
38195cf596 parser: Don't produce names with invalid UTF-8 in recovery mode 2024-07-06 15:33:06 +02:00
Nick Wellnhofer
ec0881099b parser: Upgrade XML_IO_NETWORK_ATTEMPT to error
Fixes XML::LibXML test suite.
2024-07-04 15:47:20 +02:00
Nick Wellnhofer
fdfeecfe5e parser: Reenable ctxt->directory
Unused internally, but used in downstream code.

Should fix #753.
2024-07-02 22:06:53 +02:00
Nick Wellnhofer
606f410891 parser: Allow to disable catalogs with parser options
Implement XML_PARSE_NO_SYS_CATALOG and XML_PARSE_NO_CATALOG_PI.

Fixes #735.
2024-07-02 22:06:53 +02:00
Nick Wellnhofer
197e09d5c5 parser: Fix xmlLoadResource
Short-lived regression.
2024-07-02 20:03:23 +02:00
Nick Wellnhofer
ede5d99af3 parser: Fix typo 2024-07-02 16:38:15 +02:00
Nick Wellnhofer
30ef77554b parser: Don't use deprecated xmlCopyChar 2024-07-02 13:34:11 +02:00
Nick Wellnhofer
751ba00e00 parser: Don't use deprecated xmlSwitchInputEncoding 2024-07-02 13:34:04 +02:00
Nick Wellnhofer
9a4770ef84 doc: Improve documentation 2024-07-02 13:34:04 +02:00
Nick Wellnhofer
0b0dd98983 parser: Fix EBCDIC detection 2024-07-01 18:05:40 +02:00
Nick Wellnhofer
221df37529 parser: Support custom charset conversion implementations
Implement xmlCtxtSetCharEncConvImpl. I agree that the name is terrible.
2024-07-01 18:05:40 +02:00
Nick Wellnhofer
e72eda101e parser: Add NULL check in xmlNewIOInputStream 2024-06-29 01:22:02 +02:00
Nick Wellnhofer
bc793390d5 parser: Update documentation 2024-06-27 16:23:14 +02:00
Nick Wellnhofer
193f4653a5 parser: Implement xmlCtxtGetStatus
This allows access to ctxt->wellFormed, ctxt->nsWellFormed and
ctxt->valid. It also detects several fatal non-parser errors which
really should be another error level.
2024-06-27 15:17:40 +02:00
Nick Wellnhofer
cc0cc2d3b7 parser: Add more parser context accessors 2024-06-27 14:45:33 +02:00
Nick Wellnhofer
eca972e682 parser: Add getters for XML declaration to parser context
Access to struct members will be deprecated.
2024-06-27 14:44:49 +02:00