444 Commits

Author SHA1 Message Date
Nick Wellnhofer
320f5084cd parser: Improve handling of encoding and IO errors
Make sure that xmlCharEncInput, xmlParserInputBufferPush and
xmlParserInputBufferGrow set the correct error code in the
xmlParserInputBuffer. Handle errors when calling these functions.
2023-04-30 21:31:54 +02:00
Nick Wellnhofer
1061537efd malloc-fail: Fix buffer overread with HTML doctype declarations
Found by OSS-Fuzz, see #344.
2023-03-26 22:42:13 +02:00
Nick Wellnhofer
7fbd454d9f parser: Grow input buffer earlier when reading characters
Make more bytes available after invoking CUR_CHAR or NEXT.
2023-03-21 21:35:53 +01:00
Nick Wellnhofer
04d1bedd8c parser: Rework shrinking of input buffers
Don't try to grow the input buffer in xmlParserShrink. This makes sure
that no memory allocations are made and the function always succeeds.

Remove unnecessary invocations of SHRINK. Invoke SHRINK at the end of
DTD parsing loops.

Shrink before growing.
2023-03-21 13:19:18 +01:00
Nick Wellnhofer
44ecefc8cc malloc-fail: Fix buffer overread after htmlParseScript
Found by OSS-Fuzz, see #344.
2023-03-20 15:53:42 +01:00
Nick Wellnhofer
067986fa67 parser: Fix regressions from previous commits
- Fix memory leak in xmlParseNmtoken.
- Fix buffer overread after htmlParseCharDataInternal.
2023-03-18 16:51:40 +01:00
Nick Wellnhofer
9ef2a9abf3 html: Rely on CUR_CHAR to grow the input buffer
- Remove useless invocations of GROW.
- Add some error checks.
- Fix invocations of SHRINK.
2023-03-17 14:14:04 +01:00
Nick Wellnhofer
62f199ed7d malloc-fail: Add error check in htmlParseHTMLAttribute
This function must return NULL if an error occurs.

Found by OSS-Fuzz, see #344.
2023-03-17 12:40:46 +01:00
Nick Wellnhofer
8090e58564 malloc-fail: Fix buffer overread in htmlParseScript
Found by OSS-Fuzz, see #344.
2023-03-17 12:27:07 +01:00
Nick Wellnhofer
ca2bfecea9 malloc-fail: Fix buffer overread when reading from input
Found by OSS-Fuzz, see #344.
2023-03-15 17:34:32 +01:00
Nick Wellnhofer
4b3452d171 html: Fix quadratic behavior in htmlParseTryOrFinish
Fix check for end of script content.

Found by OSS-Fuzz.
2023-03-15 17:02:46 +01:00
Nick Wellnhofer
14c62e0dd3 html: Use NEXTL in htmlParseHTMLAttribute
This is more efficient than NEXT.
2023-03-15 17:02:46 +01:00
Nick Wellnhofer
2099441f32 parser: Stop calling xmlParserInputShrink
Introduce xmlParserShrink which takes a parser context to simplify error
handling.
2023-03-13 17:51:13 +01:00
Nick Wellnhofer
cabde70f8b parser: Simplify calculation of available buffer space
2023-03-12 19:07:23 +01:00
Nick Wellnhofer
b75976e029 parser: Use size_t when subtracting input buffer pointers
Avoid integer overflows.
2023-03-12 19:06:19 +01:00
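A minimal sketch of the idea, not the code from this commit: the struct and function names below are hypothetical stand-ins for a parser input. The pointer difference is converted to size_t once the ordering of the two pointers is known, so the byte count is never squeezed into a narrower signed type.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a parser input; not libxml2's actual struct. */
typedef struct {
    const unsigned char *cur; /* current read position */
    const unsigned char *end; /* one past the last valid byte */
} InputSketch;

/* Number of unread bytes, kept as an unsigned size. */
static size_t
inputAvail(const InputSketch *in) {
    assert(in->cur <= in->end);          /* difference is non-negative */
    return (size_t) (in->end - in->cur); /* no narrowing to int */
}
```

Calling inputAvail on an input whose cur is 4 bytes into a 16-byte buffer yields 12.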
Nick Wellnhofer
9a6ca81612 parser: Check for integer overflow when updating checkIndex
Unfortunately, checkIndex is a long, not a size_t. Check for integer
overflow before updating the value.
2023-03-12 19:03:11 +01:00
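The check described above could look roughly like this. It is a sketch under assumptions: updateCheckIndex is a hypothetical helper, not a libxml2 function, and the index is passed in directly rather than read from a parser context.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/*
 * Add `consumed` to a long index without overflowing.
 * Returns 0 on success, -1 if the sum would exceed LONG_MAX
 * (in which case the index is left unchanged).
 */
static int
updateCheckIndex(long *checkIndex, size_t consumed) {
    assert(*checkIndex >= 0);
    /* Compare against the remaining headroom before adding. */
    if (consumed > (unsigned long) (LONG_MAX - *checkIndex))
        return -1;
    *checkIndex += (long) consumed;
    return 0;
}
```

Testing the headroom first matters because signed overflow is undefined behaviour in C, so the sum cannot be inspected after the fact.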
Nick Wellnhofer
bd63d730b8 html: Impose some length limits
Impose length limits on names, attribute values, PIs and comments,
similar to the XML parser.
2023-03-12 17:40:55 +01:00
Nick Wellnhofer
3eb6bf0386 parser: Stop calling xmlParserInputGrow
Introduce xmlParserGrow which takes a parser context to simplify error
handling.
2023-03-12 17:05:51 +01:00
Nick Wellnhofer
53d1cc98cf malloc-fail: Fix error code in htmlParseChunk
Found with libFuzzer, see #344.
2023-02-17 17:18:51 +01:00
Nick Wellnhofer
15b0ed0815 malloc-fail: Fix infinite loop in htmlParseDocTypeDecl
Found with libFuzzer, see #344.
2023-02-17 17:18:47 +01:00
Nick Wellnhofer
041789d9ec malloc-fail: Fix null deref in htmlnamePush
Found with libFuzzer, see #344.
2023-02-17 17:18:43 +01:00
Nick Wellnhofer
0ec9c91064 malloc-fail: Fix infinite loop in htmlParseStartTag
Found with libFuzzer, see #344.
2023-02-17 17:18:38 +01:00
Nick Wellnhofer
04c2955197 malloc-fail: Fix infinite loop in htmlParseContentInternal
Found with libFuzzer, see #344.
2023-02-17 17:18:34 +01:00
Nick Wellnhofer
f3e62035d8 malloc-fail: Fix memory leak in htmlCreatePushParserCtxt
Found with libFuzzer, see #344.
2023-02-17 17:18:29 +01:00
Nick Wellnhofer
fc256953d2 malloc-fail: Fix memory leak in htmlCreateMemoryParserCtxt
Found with libFuzzer, see #344.
2023-02-17 17:18:25 +01:00
Nick Wellnhofer
643b4e90eb malloc-fail: Fix infinite loop in htmlParseStartTag
Found with libFuzzer, see #344.
2023-02-17 17:16:52 +01:00
Nick Wellnhofer
59b3366178 error: Limit number of parser errors
Reporting errors is expensive and some abusive test cases can generate
an error for each invalid input byte. This causes the parser to spend
most of the time with error handling. Limit the number of errors and
warnings to 100.
2022-12-27 14:41:19 +01:00
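A rough sketch of such a cap, assuming a single shared counter. DiagCtxt and reportDiag are hypothetical names, and the real parser may track errors and warnings differently.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

#define MAX_REPORTED 100 /* limit named in the commit message */

/* Hypothetical context holding the running count of diagnostics. */
typedef struct {
    int nbReported;
} DiagCtxt;

/* Deliver msg unless the cap was reached. Returns 1 if delivered. */
static int
reportDiag(DiagCtxt *ctxt, const char *msg) {
    assert(ctxt != NULL);
    if (ctxt->nbReported >= MAX_REPORTED)
        return 0; /* drop cheaply, skipping the formatting work */
    ctxt->nbReported += 1;
    fprintf(stderr, "error: %s\n", msg);
    return 1;
}
```

With this shape, an input that triggers one diagnostic per invalid byte costs only a counter comparison after the first 100 reports.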
Alex Richardson
4b959ee168 Remove hacky heuristic from b2dc5675e94aa6b5557ba63f7d66b0f08dd17e4d
Checking whether the context is close to the parent context by hardcoding
250 is not portable (I noticed tests were failing on Morello since the value
is 288 there due to pointers being 128 bits). Instead we should ensure
that the XML_VCTXT_USE_PCTXT flag is not set in cases where the user data
is not actually a parser context (or ideally add a separate field but that
would be an ABI break).

From what I can see in the source, the XML_VCTXT_USE_PCTXT flag is only set if
the userData field points to a valid context, and if this is not the case
the flag should be cleared when changing userData rather than relying on
the offset between the two. Looking at the history, I think
d7cb33cf44aa688f24215c9cd398c1a26f0d25ff fixed most of the need for this
workaround, but it looks like there are a few more locations that need
updating. This commit changes two more places to set/clear/copy the
XML_VCTXT_USE_PCTXT flag, so this heuristic should no longer be needed.
I've also dropped two `= NULL` assignments in xmllint since they are not
needed after a call to memset().

There was also an uninitialized vctxt.flags (and other fields) in
`xmlShellValidate()`, which I've fixed by adding a memset() call.
2022-12-01 15:31:25 +00:00
Alex Richardson
c715ded086 Avoid creating an out-of-bounds pointer by rewriting a check
Creating a pointer beyond one past the end of an object is undefined
behaviour in C. While this code is unlikely to be miscompiled, running
UBSan on a CHERI-enabled system showed that such an out-of-bounds pointer
was being created.
2022-12-01 15:30:12 +00:00
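The general shape of this kind of rewrite, as a sketch rather than the actual diff: hasRoom is a hypothetical helper. Instead of forming cur + len and comparing it with end, the length is compared with the distance between two pointers that are already known to be valid.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Returns 1 if `len` bytes starting at `cur` fit before `end`.
 * `cur` and `end` must point into (or one past) the same array.
 */
static int
hasRoom(const char *cur, const char *end, size_t len) {
    assert(cur <= end);
    /* Avoided form: cur + len > end, which may create a pointer
     * more than one past the end of the array. */
    return len <= (size_t) (end - cur);
}
```

Both forms give the same answer whenever cur + len is a valid pointer, but only the subtraction is defined for every len.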
Nick Wellnhofer
c7a9b85cbb html: Improve parsing of nested lists
Allow ul/ol as immediate children of ul/ol. This is more in line with
the HTML5 spec.

Fixes #447.
2022-11-30 17:11:33 +01:00
Nick Wellnhofer
e414f82585 html: Fix htmlInitAutoClose documentation
2022-11-27 02:11:07 +01:00
Nick Wellnhofer
c93679381c html: Fix check for end of comment in push parser
Make sure to reset checkIndex. Handle case where "--" or "--!" is at the
end of the buffer. Fix "avail" check in htmlParseTryOrFinish.
2022-11-20 21:27:59 +01:00
Nick Wellnhofer
68a6518c45 parser: Rewrite push parser boundary checks
Remove inaccurate xmlParseCheckTransition check.

Remove non-incremental xmlParseGetLasts check.

Add functions that check for several boundary constructs more
accurately, keeping track of progress in ctxt->checkIndex.

Fixes #439.
2022-11-20 21:27:08 +01:00
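To illustrate what keeping track of progress can mean here, the following is a simplified sketch and not libxml2's implementation. ScanState and findTerminator are hypothetical, and the buffer is assumed to only grow between calls. The scan resumes where the previous attempt stopped, so bytes that cannot start a match are not rescanned each time more data is pushed.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical scan state, playing the role of a check index. */
typedef struct {
    size_t checkIndex; /* first offset not yet ruled out */
} ScanState;

/*
 * Search buf[0..len) for `term`, resuming at st->checkIndex.
 * Returns 1 and stores the offset in *pos if found, else 0.
 */
static int
findTerminator(ScanState *st, const char *buf, size_t len,
               const char *term, size_t *pos) {
    size_t tlen = strlen(term);
    size_t i = st->checkIndex;

    assert(tlen > 0);
    for (; i + tlen <= len; i++) {
        if (memcmp(buf + i, term, tlen) == 0) {
            st->checkIndex = 0; /* reset for the next construct */
            *pos = i;
            return 1;
        }
    }
    /* Remember the earliest offset where a match could still begin. */
    st->checkIndex = i;
    return 0;
}
```

For example, scanning the first 9 bytes of `<!-- a -->` for `-->` fails and leaves the index at 7. Once the tenth byte arrives, the next call starts at offset 7 and finds the terminator there.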
Nick Wellnhofer
6843fc726f Remove or annotate char casts
2022-09-01 04:31:30 +02:00
Nick Wellnhofer
2cac626976 Don't use sizeof(xmlChar) or sizeof(char)
2022-09-01 03:35:19 +02:00
Nick Wellnhofer
ad338ca737 Remove explicit integer casts
Remove explicit integer casts as final operation

- in assignments
- when passing arguments
- when returning values

Remove casts

- to the same type
- from certain range-bound values

The main motivation is that these explicit casts don't change the result
of operations and only render UBSan's implicit-conversion checks
useless. Removing these casts allows UBSan to detect cases where
truncation or sign-changes occur unexpectedly.

Document some explicit casts as truncating and add a few missing ones.
2022-09-01 02:33:57 +02:00
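A small illustration of the motivation, using hypothetical function names that do not come from the libxml2 sources. Both functions return the same value, but only the second leaves the conversion implicit, which is the kind that checks such as Clang's -fsanitize=implicit-conversion can report when bits are lost.

```c
#include <assert.h>

/* Explicit cast: the truncation is hidden from implicit-conversion checks. */
static unsigned char
lowByteExplicit(unsigned int v) {
    return (unsigned char) v;
}

/* No cast: same result, but a sanitizer can flag a lossy conversion. */
static unsigned char
lowByteImplicit(unsigned int v) {
    return v;
}

/* The cast does not change the computed value. */
static void
checkSameResult(unsigned int v) {
    assert(lowByteExplicit(v) == lowByteImplicit(v));
}
```

Where truncation is intended, keeping the cast and documenting it, as the commit does for some cases, states that intent explicitly.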
Nick Wellnhofer
65dc8a63ac Make xmlNewSAXParserCtxt take a const sax handler
Also improve documentation.
2022-09-01 00:17:45 +02:00
Nick Wellnhofer
0f568c0b73 Consolidate private header files
Private functions were previously declared

- in header files in the root directory
- in public headers guarded with IN_LIBXML
- in libxml.h
- redundantly in source files that used them.

Consolidate all private header files in include/private.
2022-08-26 02:11:56 +02:00
Nick Wellnhofer
58fc89e8a9 Deprecate internal parser functions
2022-08-25 21:04:57 +02:00
Nick Wellnhofer
a308c0cdf7 Deprecate old HTML SAX API
2022-08-25 21:04:57 +02:00
Nick Wellnhofer
9a82b94a94 Introduce xmlNewSAXParserCtxt and htmlNewSAXParserCtxt
Add API functions to create a parser context with a custom SAX handler
without having to mess with ctxt->sax manually.
2022-08-24 14:07:55 +02:00
Nick Wellnhofer
0a04db19fc Don't mess with parser options in htmlParseDocument
Don't set ctxt->html. This member should already be initialized.

Set ctxt->linenumbers in htmlCtxtUseOptions like the XML parser does.
2022-08-24 14:06:00 +02:00
Nick Wellnhofer
d45263a262 Remove useless call to htmlDefaultSAXHandlerInit
This function is already called from xmlInitParser.
2022-08-24 14:04:35 +02:00
Nick Wellnhofer
4b184240be Remove htmlDefaultSAXHandler from non-SAX1 build
This matches long-standing behavior of the XML counterpart.
2022-08-22 14:24:25 +02:00
Nick Wellnhofer
80bd34c3c6 Don't initialize SAX handler in htmlReadMemory
The SAX handler is already initialized when creating the parser
context.
2022-08-22 14:06:37 +02:00
Nick Wellnhofer
37cedc0b15 Fix htmlReadMemory mixing up XML and HTML functions
Also see fe6890e2.
2022-08-22 14:04:07 +02:00
Nick Wellnhofer
920753c4aa Don't use default SAX handler to report unrelated errors
2022-08-22 13:48:59 +02:00
Nick Wellnhofer
38f04779f7 Fix HTML parser with threads and --without-legacy
If the legacy functions are disabled, the default "V1" HTML SAX handler
isn't initialized in threads other than the main thread.
htmlInitParserCtxt would later use the empty V1 SAX handler, resulting
in NULL documents.

Change htmlInitParserCtxt to initialize the HTML SAX handler by calling
xmlSAX2InitHtmlDefaultSAXHandler. This removes the ability to change the
default handler but is more in line with the XML parser which
initializes the SAX handler by calling xmlSAXVersion, ignoring the V1
default handler.

Fixes #399.
2022-08-22 13:48:59 +02:00
Nick Wellnhofer
5b2d07a726 Use xmlStrlen in *CtxtReadDoc
xmlStrlen handles buffers larger than INT_MAX more gracefully.
2022-08-20 17:00:50 +02:00
Nick Wellnhofer
4ad71c2d72 Fix xmlCtxtReadDoc with encoding
xmlCtxtReadDoc used to create an input stream involving
xmlNewStringInputStream. This would create a stream without an input
buffer, causing problems with encodings (see #34).

After commit aab584dc3, an error was returned even with UTF-8 encodings
which happened to work before.

Make xmlCtxtReadDoc call xmlCtxtReadMemory which doesn't suffer from
these issues. Also fix htmlCtxtReadDoc.

Fixes #397.
2022-08-20 16:34:08 +02:00