227 Commits

Author SHA1 Message Date
Nick Wellnhofer
b349225952 include: Change some return types from int to enum
This also affects some new functions from 2.13.
2025-03-14 02:31:01 +01:00
Nick Wellnhofer
84c6524e26 encoding: Support input-only and output-only converters
Make it possible to open an encoding handler only for input or output.
This avoids the creation of unnecessary converters.

Should also fix #863.
2025-03-13 22:15:10 +01:00
Nick Wellnhofer
69b83bb68e encoding: Detect truncated multi-byte sequences with ICU
Unlike iconv or the internal converters, ICU consumes truncated
multi-byte sequences at the end of an input buffer. We currently check
for a non-empty raw input buffer to detect truncated sequences, so this
fails with ICU.

It might be possible to inspect the pivot buffer pointers, but it seems
cleaner to implement a `flush` flag for some encoding and I/O functions.
After flushing, we can check for U_TRUNCATED_CHAR_FOUND with ICU, or
detect remaining input with other converters.

Also fix detection of truncated sequences for HTML, XML content and
DTDs with iconv.
2025-03-13 22:15:10 +01:00
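
The flush idea described in the message above can be illustrated outside of libxml2. The following is a minimal sketch, not the library's code: all `demo*` and `DEMO_*` names are hypothetical, UTF-8 stands in for an arbitrary multi-byte encoding, and only the structure of each sequence is checked (overlong forms and surrogates are not rejected). An incomplete tail is left unconsumed for the next call unless `flush` is set, in which case it is reported as truncated.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result codes for this sketch, not libxml2 error codes. */
enum { DEMO_OK = 0, DEMO_TRUNCATED = -1, DEMO_INVALID = -2 };

/* Expected length of a UTF-8 sequence given its lead byte, 0 if invalid. */
static size_t demoSeqLen(unsigned char c) {
    if (c < 0x80) return 1;
    if (c >= 0xC2 && c <= 0xDF) return 2;
    if (c >= 0xE0 && c <= 0xEF) return 3;
    if (c >= 0xF0 && c <= 0xF4) return 4;
    return 0;
}

/*
 * Consume complete sequences from in[0..len) and store the number of
 * bytes consumed in *consumed. An incomplete trailing sequence stays
 * unconsumed; it is only an error (DEMO_TRUNCATED) if flush is nonzero,
 * meaning the caller has no more input to append.
 */
int demoConsume(const unsigned char *in, size_t len, int flush,
                size_t *consumed) {
    size_t pos = 0;

    assert(in != NULL && consumed != NULL);

    while (pos < len) {
        size_t n = demoSeqLen(in[pos]);
        size_t i;

        if (n == 0) {
            *consumed = pos;
            return DEMO_INVALID;
        }
        if (len - pos < n)
            break; /* incomplete tail, decide below */
        for (i = 1; i < n; i++) {
            if ((in[pos + i] & 0xC0) != 0x80) {
                *consumed = pos;
                return DEMO_INVALID;
            }
        }
        pos += n;
    }

    *consumed = pos;
    if (pos < len && flush)
        return DEMO_TRUNCATED;
    return DEMO_OK;
}
```

Without `flush`, a call on a buffer ending in the first two bytes of a three-byte sequence succeeds and simply leaves those bytes for later. With `flush` set, the same input yields `DEMO_TRUNCATED`, which corresponds to the "detect remaining input" case mentioned for non-ICU converters.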
Nick Wellnhofer
ef44c240f5 encoding: Fix memory leak in xmlCharEncNewCustomHandler
Short-lived regression.
2025-03-10 14:16:14 +01:00
Nick Wellnhofer
87c9e000e5 encoding: Rework custom encoding implementation API 2025-03-09 22:37:13 +01:00
Nick Wellnhofer
38f475072a encoding: Make conversion callbacks more type-safe 2025-03-05 22:25:14 +01:00
Nick Wellnhofer
a846d96468 encoding: Remove compatibility struct members 2025-03-05 16:49:42 +01:00
Nick Wellnhofer
0b27097a92 encoding: Rename unprefixed public functions 2025-03-04 16:46:53 +01:00
Nick Wellnhofer
3793eaadb7 fuzz: Fix build 2025-02-16 13:55:18 +01:00
Nick Wellnhofer
9c16a153d8 Revert "include: Make most IS_* macros private"
This reverts commit 84a6c82ff83d04963d6e1c5cd18ded68ea02d99f.
2025-02-13 20:20:17 +01:00
Nick Wellnhofer
cfc854b839 fuzz: Work around glibc iconv() bug 2025-02-11 00:21:12 +01:00
Nick Wellnhofer
c4f760be8a encoding: Handle iconv() returning EOPNOTSUPP on Apple
iconv() really shouldn't return undocumented error codes.
2025-02-02 11:15:45 +01:00
Nick Wellnhofer
cdfb54ff7b Fix typos 2025-01-31 18:41:41 +01:00
Nick Wellnhofer
6ec616ba26 encoding: Don't allow POSIX indicator suffixes in encoding names
Suffixes like "//IGNORE" change the behavior of iconv.

Also add comment on how we currently rely on GNU libiconv behavior
which technically violates the POSIX spec.
2025-01-24 20:47:52 +01:00
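
One way such a check could look is sketched below. This is an assumption for illustration, not the code from the commit: the function name `demoEncodingNameIsPlain` is hypothetical, and the sketch simply rejects any name containing `//`, the marker that introduces suffixes such as `//IGNORE`.

```c
#include <assert.h>
#include <string.h>

/*
 * Return 1 if the encoding name contains no "//" suffix marker,
 * 0 otherwise. Hypothetical helper, not part of libxml2.
 */
int demoEncodingNameIsPlain(const char *name) {
    assert(name != NULL);
    return strstr(name, "//") == NULL;
}
```

A caller would run this before handing the name to `iconv_open`, so that a document cannot alter conversion behavior through its declared encoding name.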
Nick Wellnhofer
fbaacfe223 encoding: Clean up UCS-4 encodings
Use "UCS-*" instead of "ISO-10646-UCS-*". While the XML spec recommends
"ISO-10646-UCS-2" and "ISO-10646-UCS-4", GNU iconv doesn't understand
these names.

Ignore UCS4_2143 and UCS4_3412 which were never supported.
2025-01-16 16:09:14 +01:00
Nick Wellnhofer
df0f16fa26 encoding: Check reallocations for overflow 2024-12-21 19:37:37 +01:00
Nick Wellnhofer
dae160c64b encoding: Fix table entry for "UTF16" 2024-09-13 12:08:20 +02:00
Nick Wellnhofer
6e503eb742 encoding: Handle more ICU error codes
U_ILLEGAL_ESCAPE_SEQUENCE and U_UNSUPPORTED_ESCAPE_SEQUENCE can occur
with ISO-2022.
2024-09-10 03:34:46 +02:00
Nick Wellnhofer
55d36c5990 encoding: Fix error code in xmlUconvConvert
Broke in 46ec621e.
2024-09-10 03:11:18 +02:00
Nick Wellnhofer
34c9108f15 encoding: Add sizeOut argument to xmlCharEncInput
When push parsing, we want to convert as much of the input as possible.
When pull parsing memory buffers, we want to convert data chunk by chunk
to save memory.
2024-07-16 17:42:10 +02:00
Nick Wellnhofer
1cfc5b8089 entities: Rework serialization of numeric character references 2024-07-16 17:42:10 +02:00
Nick Wellnhofer
69f12d6d47 encoding: Deprecate xmlByteConsumed
This was only used by Chromium/WebKit to detect whether xmlParseContent
really succeeded. It's a horrible, overcomplicated hack.

See 8c5848bd and #767.
2024-07-13 15:42:02 +02:00
Nick Wellnhofer
d099795611 encoding: Readd some UTF-8 validation to encoders
This isn't strictly needed but avoids generating invalid UTF-16 and
unsigned integer overflows.
2024-07-10 22:26:19 +02:00
Nick Wellnhofer
f48eefe3d0 encoding: Rework xmlByteConsumed
Don't loop infinitely if input buffer is too large. Allocate conversion
buffer on the heap.
2024-07-09 14:25:32 +02:00
Nick Wellnhofer
f86d17c163 encoding: Fix xmlParseCharEncoding
Make "UTF-16" return the UTF16LE handler as before.

Fix error return.
2024-07-04 15:47:20 +02:00
Nick Wellnhofer
46ec621eb7 encoding: Clarify xmlUconvConvert 2024-07-03 16:06:59 +02:00
Nick Wellnhofer
48fec2429b encoding: Remove duplicate code
Fix recent commit.
2024-07-03 15:11:20 +02:00
Nick Wellnhofer
71fb257912 encoding: Fix ICU build 2024-07-03 14:35:49 +02:00
Nick Wellnhofer
9a4770ef84 doc: Improve documentation 2024-07-02 13:34:04 +02:00
Nick Wellnhofer
0b0dd98983 parser: Fix EBCDIC detection 2024-07-01 18:05:40 +02:00
Nick Wellnhofer
37a9ff11d8 encoding: Simplify xmlCharEncCloseFunc 2024-07-01 18:05:40 +02:00
Nick Wellnhofer
1167c3340e encoding: Don't include iconv.h from libxml/encoding.h 2024-07-01 18:05:40 +02:00
Nick Wellnhofer
30be984a0f encoding: Rework ISO-8859-X conversion
Optimize code. Pass tables as context parameter. Check for
XML_ENC_ERR_SPACE.
2024-07-01 18:05:40 +02:00
Nick Wellnhofer
282ec1d548 encoding: Rework xmlCharEncodingHandler layout
Reuse some of the old members.

The "input" and "output" function pointers are actually of type
xmlCharEncConvFunc, accepting an additional argument. For default
handlers, this argument is unused, so this should work with most ABIs.
For iconv handlers, these function pointers used to be NULL but now
point to a function which requires the extra argument.

"iconv_in" and "iconv_out" are made void pointers. "uconv_in" and
"uconv_out" are renamed and made void pointers. This is unlikely to
cause issues.

We now expect that the built-in conversion functions correctly report
XML_ENC_ERR_SPACE. For UTF8ToHtml and the ISO-8859-X code, this will be
done in the following commits.
2024-07-01 18:05:40 +02:00
Nick Wellnhofer
57e37dff4e encoding: Rework UTF-16 conversion functions
Optimize UTF-16 conversion functions. Avoid misaligned memory access.
Don't rely on 'sizeof(short) == 2'. Check for XML_ENC_ERR_SPACE. Add
some tests for UTF-16 conversion.
2024-07-01 18:05:40 +02:00
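
The two portability points in the message above (no misaligned access, no reliance on `sizeof(short) == 2`) can be shown with a small standalone decoder. This is a sketch under stated assumptions rather than the reworked libxml2 functions: the `demo*` names and the return convention are hypothetical. Code units are assembled from individual bytes, so the input pointer may have any alignment and no 16-bit integer type is needed.

```c
#include <assert.h>
#include <stddef.h>

/* Assemble one UTF-16LE code unit from two bytes, at any alignment. */
static unsigned demoReadU16LE(const unsigned char *p) {
    return (unsigned) p[0] | ((unsigned) p[1] << 8);
}

/*
 * Decode one code point from UTF-16LE input.
 * Returns the number of bytes consumed (2 or 4), 0 if the input ends
 * in the middle of a code point, or -1 on an invalid surrogate.
 */
int demoDecodeUtf16LE(const unsigned char *in, size_t len,
                      unsigned long *out) {
    unsigned c, d;

    assert(in != NULL && out != NULL);

    if (len < 2)
        return 0;
    c = demoReadU16LE(in);
    if (c < 0xD800 || c > 0xDFFF) {
        *out = c;
        return 2;
    }
    if (c >= 0xDC00)
        return -1; /* low surrogate without a preceding high one */
    if (len < 4)
        return 0;
    d = demoReadU16LE(in + 2);
    if (d < 0xDC00 || d > 0xDFFF)
        return -1; /* high surrogate not followed by a low one */
    *out = 0x10000UL + ((unsigned long) (c - 0xD800) << 10) + (d - 0xDC00);
    return 4;
}
```

Casting the buffer to `unsigned short *` would be shorter, but that is exactly what breaks on odd addresses and on platforms where `short` is not 16 bits wide.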
Nick Wellnhofer
bb8e81c788 encoding: Rework simple conversions function
Use a single function for ASCII conversion. Optimize code. Check for
XML_ENC_ERR_SPACE.
2024-07-01 18:05:40 +02:00
Nick Wellnhofer
501e5d195d encoding: Stop using XML_ENC_ERR_PARTIAL 2024-07-01 18:05:40 +02:00
Nick Wellnhofer
c59c24494d encoding: Support custom implementations 2024-07-01 18:05:40 +02:00
Nick Wellnhofer
1e3da9f4d4 encoding: Start with callbacks 2024-07-01 18:05:40 +02:00
Nick Wellnhofer
6d8427dc97 encoding: Rework encoding lookup
Add missing xmlCharEncoding enum values.

Simplify and speed up encoding lookup by using a table mapping names to
xmlCharEncoding enums and binary search. Rearrange the default handler
table to match the enum layout.

For some encodings we now only look up the provided or most canonical
name instead of trying several names, expecting that iconv or ICU handle
aliases:

- IBM037 (EBCDIC)
- UCS-2
- UCS-4
- Shift_JIS
2024-07-01 18:05:40 +02:00
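
The table-plus-binary-search lookup described above can be sketched as follows. This is an illustration only: the enum, the alias table and the `demo*` names are hypothetical and much smaller than anything in libxml2, and the case-insensitive comparison is an assumption of the sketch. The table must be kept sorted by name for the search to work.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical enum for this sketch, not libxml2's xmlCharEncoding. */
typedef enum {
    ENC_NONE = 0,
    ENC_ASCII,
    ENC_LATIN1,
    ENC_UTF16LE,
    ENC_UTF8
} demoEncoding;

typedef struct {
    const char *name; /* upper-case name or alias */
    demoEncoding enc;
} demoAlias;

/* Sorted by name in byte order, as required by the binary search. */
static const demoAlias demoAliases[] = {
    { "ASCII", ENC_ASCII },
    { "ISO-8859-1", ENC_LATIN1 },
    { "LATIN1", ENC_LATIN1 },
    { "US-ASCII", ENC_ASCII },
    { "UTF-16LE", ENC_UTF16LE },
    { "UTF-8", ENC_UTF8 },
    { "UTF8", ENC_UTF8 }
};

/* Compare two strings, ignoring ASCII case. */
static int demoCaseCmp(const char *a, const char *b) {
    for (;; a++, b++) {
        int ca = toupper((unsigned char) *a);
        int cb = toupper((unsigned char) *b);

        if (ca != cb)
            return ca - cb;
        if (ca == 0)
            return 0;
    }
}

/* Map a name to an enum value, or ENC_NONE if the name is unknown. */
demoEncoding demoLookup(const char *name) {
    size_t lo = 0;
    size_t hi = sizeof(demoAliases) / sizeof(demoAliases[0]);

    assert(name != NULL);

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        int cmp = demoCaseCmp(name, demoAliases[mid].name);

        if (cmp == 0)
            return demoAliases[mid].enc;
        if (cmp < 0)
            hi = mid;
        else
            lo = mid + 1;
    }
    return ENC_NONE;
}
```

Because the result is an enum value, a handler table arranged in enum order can then be indexed directly, which is the arrangement the message describes for the default handlers.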
Nick Wellnhofer
f4e63f7a4f Regenerate libxml2-api.xml and testapi.c 2024-06-27 15:17:40 +02:00
Nick Wellnhofer
b1a416bf52 encoding: Restore old lookup order in xmlOpenCharEncodingHandler
When looking up encodings with xmlLookupCharEncodingHandler, the
returned handler can have a different name than requested
(capitalization, internal aliases). This should eventually be fixed.
For now we revert part of commit 5b893fa9, start the lookup with
xmlFindHandler and add an explicit check for UTF-8.

Should fix the encoding name issue mentioned in #749.
2024-06-27 12:34:45 +02:00
Nick Wellnhofer
c4d8343b7c encoding: Make xmlFindCharEncodingHandler return UTF-8 handler
xmlFindCharEncodingHandler must always return a handler.

Remove UTF-8 handler from default handler list.

Fixes 5b893fa9.
2024-06-24 20:08:27 +02:00
Nick Wellnhofer
5b893fa999 encoding: Fix encoding lookup with xmlOpenCharEncodingHandler
Make xmlOpenCharEncodingHandler call xmlParseCharEncoding first so we
prefer our own handlers for names like "UTF8". Only UTF-16 needs an
exception.

Make callers check the return value. For UTF-8, a NULL encoding doesn't
mean an error.

Remove unnecessary UTF-8 check from htmlFindOutputEncoder. Don't try to
look up ASCII handler since the HTML handler is always available.

Fix return code of xmlParseCharEncoding.

Should fix #744.
2024-06-22 21:59:03 +02:00
Rosen Penev
2def7b4b28 clang-tidy: move assignments out of if
Found with bugprone-assignment-in-if-condition

Signed-off-by: Rosen Penev <rosenp@gmail.com>
2024-06-20 21:11:44 -07:00
Nick Wellnhofer
63ce5f9aed Make some globals const 2024-04-28 17:53:39 +02:00
Nick Wellnhofer
072facc49e encoding: Don't shrink input too early in xmlCharEncOutput
Some exotic encodings like ISO646-FR don't support '#' characters, so
encoding a character reference can actually fail. Don't skip the
offending input in this case so the error will be reported on the next
call.
2024-03-18 15:14:43 +01:00
Nick Wellnhofer
0821efc8ee encoding: Check whether encoding handlers support input/output
The "HTML" encoding handler doesn't support input which could lead to a
wrong error report.
2024-01-02 19:48:23 +01:00
Nick Wellnhofer
023aecc474 encoding: Support ASCII in xmlLookupCharEncodingHandler
Return our built-in ASCII handler. This was never implemented and
triggered the new and stricter error checks.
2023-12-13 23:58:45 +01:00
Nick Wellnhofer
bd5ad0308d encoding: Report malloc failures
Introduce new API functions that return a separate error code if a
memory allocation fails.

- xmlOpenCharEncodingHandler
- xmlLookupCharEncodingHandler

Fix a few places where malloc failures weren't reported.
2023-12-11 22:05:47 +01:00