Make sure that htmlParseStartTag doesn't terminate on characters for
which IS_CHAR_CH is false like control chars.
In htmlParseTryOrFinish, only switch to START_TAG if the next character
starts a valid name. Otherwise, htmlParseStartTag might return without
consuming all characters up to the final '>'.
Found by OSS-Fuzz.
The HTML push parser would look ahead for characters in "; >/" to
terminate an entity reference but actual parsing could stop earlier,
potentially resulting in quadratic runtime.
Parse char data and references alternately in htmlParseTryOrFinish
and only look ahead once for a terminating '<' character.
Found by OSS-Fuzz.
The parsing rules when looking for terminating chars or sequences in
the push parser differed from the actual parsing code. This could
result in the lookahead to overshoot and data being rescanned,
potentially leading to quadratic runtime.
Comments must never be handled during lookahead. Attribute values must
only be skipped for start tags and doctype declarations, not for end
tags, comments, PIs and script content.
The general assumption is that htmlCurrentChar only returns 0 if the
end of the input buffer is reached. The UTF-8 path already logged an
error if a zero byte U+0000 was found and returned a space character
instead. Make the ASCII code path do the same.
htmlParseTryOrFinish skips zero bytes at the beginning of a buffer, so
even if 0 was returned from htmlCurrentChar, the push parser would make
progress. But rescanning the input could cause performance problems.
The pull parser would abort parsing and now handles zero bytes in ASCII
mode the same way as the push parser or as in UTF-8 mode.
It would be better to return the replacement character U+FFFD instead,
but some of the client code assumes that the UTF-8 length of input and
output matches.
Reject sequences starting with a continuation byte as well as overlong
sequences like the XML parser.
Also fixes an infinite loop in connection with previous commit 50078922
since htmlCurrentChar would return 0 even if not at the end of the
buffer.
Found by OSS-Fuzz.
If htmlParseScript returns upon hitting an invalid character,
htmlParseLookupSequence will be called again with checkIndex reset to
zero, potentially resulting in quadratic runtime. Make sure that
htmlParseScript consumes all input in one go and simply skips over
invalid characters similar to htmlParseCharDataInternal.
Found by OSS-Fuzz.
Make sure that checkIndex is set when returning without match from
inside a comment. Also track parser state in htmlParseLookupChars.
Found by OSS-Fuzz.
Make xmlEncInputChunk and xmlEncOutputChunk return 0 on success and
never a positive value.
Make xmlCharEncFirstLineInt, xmlCharEncFirstLineInt and
xmlCharEncOutFunc return the number of bytes written.
- Bug 757711: heap-buffer-overflow in xmlFAParsePosCharGroup
<https://bugzilla.gnome.org/show_bug.cgi?id=757711>
- Bug 783015 - Integer-overflow in xmlFAParseQuantExact
<https://bugzilla.gnome.org/show_bug.cgi?id=783015>
(Regexptests): Add support for checking stderr output when
running regexp tests. This makes it possible to check in test
cases that fail and not see false-positive error output when
running the tests. Unlike other libxml2 test suites, if there
is no stderr output, no *.err file needs to be created.
Commit eeb99329 removed an important optimization avoiding quadratic
runtime when repeatedly scanning the input buffer for terminating
characters in the HTML push parser. The related bug is
https://bugzilla.gnome.org/show_bug.cgi?id=444994
Make sure that ctxt->checkIndex is always written and store additional
parser state in ctxt->inSubset which is unused in the HTML parser.
Found by OSS-Fuzz.
If charset conversion fails, reset the input pointers before reporting
the error and bailing out. Otherwise, the input pointers are left in an
invalid state which could lead to use-after-free and other memory
errors.
Similar to f9e7997e. Found by OSS-Fuzz.
When enabled via `./configure --enable-rebuild-docs`,
`make -C doc libxml2-api.xml` will invoke apibuild.py
to rebuild libxml2-api.xml from the sources.
But the code added in
9fa3200cb366c726f7c8ef234282603bb9e8816d made it error out with
```
Parsing ../parser.c
Parse Error: parsing type : expecting a name
('Got token ', ('sep', '('))
('Last token: ', ('sep', '('))
('Token queue: ', [('name', 'destructor'), ('sep', ')'), ('sep', ')')])
('Line 14689 end: ', '')
```
RVTs from libxslt are document nodes which are linked using the 'next'
pointer. These pointers must never be used to navigate the document
tree. Otherwise, random content from other RVTs could be returned
when evaluating XPath expressions.
It's interesting that this seemingly long-standing bug wasn't
discovered earlier. This issue could also cause severe performance
degradation.
Fixes https://gitlab.gnome.org/GNOME/libxslt/-/issues/37
Cast to signed type before subtraction to avoid unsigned integer
overflow. Also use ptrdiff_t to avoid potential integer truncation.
Found with libFuzzer and UBSan.
Commit 407b393d introduced a regression caused by xmlCharEncOutput
returning 0 in case of success instead of the number of bytes written.
Always use its return value for nbchars in xmlOutputBufferWrite.
Fixes#166.
When parsing the text declaration of external DTDs or entities, make
sure that parameter entities are not expanded. This also fixes a memory
leak in certain error cases.
The change to xmlSkipBlankChars assumes that the parser state is
maintained correctly when parsing external DTDs or parameter entities,
and might expose bugs in the code that were hidden previously.
Found by OSS-Fuzz.
This will be picked up OSS-Fuzz, limiting the maximum input size to
80 KB and hopefully avoiding timeouts. Some of the timeouts seem to be
related to our suboptimal handling of excessive entity expansion.
The new fuzzers support external entities and make this problem even
more prominent.
Just like IDs, IDREF attributes must be removed from the document's
refs table when they're freed by a reader. This bug is often hidden
because xmlAttr structs are reused and strings are stored in a
dictionary unless XML_PARSE_NODICT is specified.
Found by OSS-Fuzz.
- XML fuzzer
Currently tests the pull parser, push parser and reader, as well as
serialization. Supports splitting fuzz data into multiple documents
for things like external DTDs or entities. The seed corpus is built
from parts of the test suite.
- Regexp fuzzer
Seed corpus was statically generated from test suite.
- URI fuzzer
Tests parsing and most other functions from uri.c.