4923 Commits

Author SHA1 Message Date
Nick Wellnhofer
10d0947249 Fix .gitattributes
The files in 'test' and 'result' have mixed line endings, so disable
end-of-line conversion.
2020-07-23 20:46:42 +02:00
Nick Wellnhofer
173a0830dc Fix quadratic runtime when push parsing HTML start tags
Make sure that htmlParseStartTag doesn't terminate on characters for
which IS_CHAR_CH is false like control chars.

In htmlParseTryOrFinish, only switch to START_TAG if the next character
starts a valid name. Otherwise, htmlParseStartTag might return without
consuming all characters up to the final '>'.

Found by OSS-Fuzz.
2020-07-22 23:33:04 +02:00
David Kilzer
0e5c4fec15 Reset XML parser input before reporting errors
Apply changes to htmlParseChunk() in 13ba5b61 and 3f18e748 to
xmlParseChunk().
2020-07-19 14:10:33 +02:00
Nick Wellnhofer
6995eed077 Fix quadratic runtime when push parsing HTML entity refs
The HTML push parser would look ahead for characters in "; >/" to
terminate an entity reference but actual parsing could stop earlier,
potentially resulting in quadratic runtime.

Parse char data and references alternately in htmlParseTryOrFinish
and only look ahead once for a terminating '<' character.

Found by OSS-Fuzz.
2020-07-19 14:05:57 +02:00
Nick Wellnhofer
8e219b154e Fix HTML push parser lookahead
The parsing rules when looking for terminating chars or sequences in
the push parser differed from the actual parsing code. This could
cause the lookahead to overshoot and data to be rescanned,
potentially leading to quadratic runtime.

Comments must never be handled during lookahead. Attribute values must
only be skipped for start tags and doctype declarations, not for end
tags, comments, PIs and script content.
2020-07-15 16:44:36 +02:00
Nick Wellnhofer
e050062ca9 Make htmlCurrentChar always translate U+0000
The general assumption is that htmlCurrentChar only returns 0 if the
end of the input buffer is reached. The UTF-8 path already logged an
error if a zero byte U+0000 was found and returned a space character
instead. Make the ASCII code path do the same.

htmlParseTryOrFinish skips zero bytes at the beginning of a buffer, so
even if 0 was returned from htmlCurrentChar, the push parser would make
progress. But rescanning the input could cause performance problems.

The pull parser would abort parsing and now handles zero bytes in ASCII
mode the same way as the push parser or as in UTF-8 mode.

It would be better to return the replacement character U+FFFD instead,
but some of the client code assumes that the UTF-8 length of input and
output matches.
2020-07-15 16:10:13 +02:00
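The convention described in e050062ca9 (0 means end of input, an embedded zero byte is reported as a space) can be sketched for the ASCII path as follows. This is a minimal illustration, not the libxml2 code: `current_char` and its pointer arguments are hypothetical, and the error logging mentioned in the message is left out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not libxml2 API: return the current ASCII
 * character. 0 is reserved for "end of input", so a zero byte inside
 * the buffer is translated to a space instead. */
static int
current_char(const unsigned char *cur, const unsigned char *end)
{
    assert(cur != NULL && end != NULL);

    if (cur >= end)
        return 0;       /* the only case in which 0 is returned */
    if (*cur == 0)
        return ' ';     /* embedded zero byte: caller sees a space */
    return *cur;
}
```

With this contract a caller can loop until the function returns 0 without stalling on zero bytes in the middle of the buffer.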
Nick Wellnhofer
dfd4e33048 Rework control flow in htmlCurrentChar
Don't call xmlCurrentChar after switching encodings. Rearrange code
blocks and fall through to normal UTF-8 handling.
2020-07-15 16:10:13 +02:00
Nick Wellnhofer
922bebccdd Make 'xmllint --html --push -' read from stdin
2020-07-15 14:20:42 +02:00
Nick Wellnhofer
1493130ef2 Fix UTF-8 decoder in HTML parser
Reject sequences starting with a continuation byte as well as overlong
sequences like the XML parser.

Also fixes an infinite loop in connection with previous commit 50078922
since htmlCurrentChar would return 0 even if not at the end of the
buffer.

Found by OSS-Fuzz.
2020-07-15 12:54:25 +02:00
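The two checks named in 1493130ef2 (reject a leading continuation byte, reject overlong sequences) might look like this in a standalone decoder. It is a sketch under assumptions: `utf8_decode` is a hypothetical helper, not the function changed by the commit, and the surrogate and range checks are additions for completeness.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not libxml2 API: decode one UTF-8 sequence.
 * Returns the sequence length (1 to 4) and stores the code point in
 * *out, or returns -1 for malformed input. */
static int
utf8_decode(const unsigned char *s, size_t len, unsigned int *out)
{
    unsigned int c, min;
    int n, i;

    assert(s != NULL && out != NULL);
    if (len == 0)
        return -1;

    c = s[0];
    if (c < 0x80) {                 /* plain ASCII */
        *out = c;
        return 1;
    }
    if (c < 0xC0)                   /* 10xxxxxx: stray continuation byte */
        return -1;
    if (c < 0xE0)      { n = 2; c &= 0x1F; min = 0x80; }
    else if (c < 0xF0) { n = 3; c &= 0x0F; min = 0x800; }
    else if (c < 0xF8) { n = 4; c &= 0x07; min = 0x10000; }
    else
        return -1;                  /* 0xF8..0xFF never start a sequence */

    if (len < (size_t) n)
        return -1;                  /* truncated sequence */
    for (i = 1; i < n; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return -1;              /* expected a continuation byte */
        c = (c << 6) | (s[i] & 0x3F);
    }
    if (c < min)                    /* overlong: fits in a shorter form */
        return -1;
    if (c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
        return -1;                  /* out of range or surrogate */

    *out = c;
    return n;
}
```

Returning an error code rather than 0 for malformed bytes matters here because, as the message notes, a 0 result is otherwise read as end of buffer.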
Nick Wellnhofer
beb7d71a8f Remove misleading comments in xpath.c
Fixes #169
2020-07-13 12:41:19 +02:00
Nick Wellnhofer
500789224b Fix quadratic runtime when parsing HTML script content
If htmlParseScript returns upon hitting an invalid character,
htmlParseLookupSequence will be called again with checkIndex reset to
zero, potentially resulting in quadratic runtime. Make sure that
htmlParseScript consumes all input in one go and simply skips over
invalid characters similar to htmlParseCharDataInternal.

Found by OSS-Fuzz.
2020-07-13 12:19:24 +02:00
Andre Klapper
d6761e706f Update to Devhelp index file format version 2
Fixes #89
2020-07-13 12:18:24 +02:00
Markus Rickert
d514e2bd40 Set project language to C
2020-07-12 18:42:49 +02:00
Markus Rickert
5ddf02f2a5 Update config.h.cmake.in
2020-07-12 18:42:18 +02:00
Markus Rickert
8bec210d4d Add variable for working directory of XML Conformance Test Suite
2020-07-12 18:42:18 +02:00
Markus Rickert
270e165552 Add additional tests and XML Conformance Test Suite
2020-07-12 18:33:35 +02:00
Markus Rickert
e6ba4bd775 Add command line option for temp directory in runtest
2020-07-12 18:33:35 +02:00
Markus Rickert
40e7ceaaaf Ensure LF line endings for test files
2020-07-12 18:33:35 +02:00
Markus Rickert
9ecf5ad6b1 Enable runtests and testThreads
2020-07-12 18:33:35 +02:00
Nick Wellnhofer
3f18e7486d Reset HTML parser input before reporting error
Avoid use-after-free, similar to 13ba5b61. Also make sure that
xmlBufSetInputBaseCur sets valid pointers in case of buffer errors.

Found by OSS-Fuzz.
2020-07-11 14:39:52 +02:00
Nick Wellnhofer
3da8d947df Fix more quadratic runtime issues in HTML push parser
Make sure that checkIndex is set when returning without match from
inside a comment. Also track parser state in htmlParseLookupChars.

Found by OSS-Fuzz.
2020-07-09 16:08:38 +02:00
Nick Wellnhofer
741b0d0a8b Fix regression introduced with 477c7f6a
The 'inSubset' member is actually used by the SAX2 handlers. Store
extra parser state in 'hasPErefs'.
2020-07-07 12:57:01 +02:00
Nick Wellnhofer
fc842f6eba Limit regexp nesting depth
Enforce a maximum nesting depth of 50 for regular expressions. Avoids
stack overflows with deeply nested regexes.

Found by OSS-Fuzz.
2020-07-06 15:22:12 +02:00
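A nesting limit like the one in fc842f6eba is usually enforced by threading a depth counter through the recursive parser. The sketch below only handles nested parentheses; `parse_group`, `check_nesting` and the exact boundary (51 levels fail, 50 pass) are assumptions made for illustration, not the regexp code itself.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_DEPTH 50    /* the limit named in the commit message */

/* Hypothetical parser: consumes a run of balanced groups such as
 * "(()())". Returns 0 on success, -1 on a syntax error or when the
 * nesting exceeds MAX_DEPTH. */
static int
parse_group(const char **p, int depth)
{
    if (depth > MAX_DEPTH)
        return -1;              /* bail out before the stack grows further */

    while (**p == '(') {
        (*p)++;
        if (parse_group(p, depth + 1) < 0)
            return -1;
        if (**p != ')')
            return -1;          /* unbalanced group */
        (*p)++;
    }
    return 0;
}

/* Returns 0 if s consists only of balanced groups nested at most
 * MAX_DEPTH deep, -1 otherwise. */
static int
check_nesting(const char *s)
{
    assert(s != NULL);
    if (parse_group(&s, 0) < 0)
        return -1;
    return (*s == '\0') ? 0 : -1;
}
```

Because the check runs on entry to every recursive call, the recursion depth is bounded by a constant regardless of input size.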
Nick Wellnhofer
1e41e4fa8e Fix return values and documentation in encoding.c
Make xmlEncInputChunk and xmlEncOutputChunk return 0 on success and
never a positive value.

Make xmlCharEncFirstLineInt, xmlCharEncFirstLineInt and
xmlCharEncOutFunc return the number of bytes written.
2020-07-06 15:06:13 +02:00
David Kilzer
6b4717d61d Add regexp regression tests
- Bug 757711: heap-buffer-overflow in xmlFAParsePosCharGroup
  <https://bugzilla.gnome.org/show_bug.cgi?id=757711>
- Bug 783015 - Integer-overflow in xmlFAParseQuantExact
  <https://bugzilla.gnome.org/show_bug.cgi?id=783015>

(Regexptests): Add support for checking stderr output when
running regexp tests.  This makes it possible to check in test
cases that fail and not see false-positive error output when
running the tests.  Unlike other libxml2 test suites, if there
is no stderr output, no *.err file needs to be created.
2020-07-06 12:37:53 +02:00
Nick Wellnhofer
477c7f6aff Fix quadratic runtime in HTML parser
Commit eeb99329 removed an important optimization avoiding quadratic
runtime when repeatedly scanning the input buffer for terminating
characters in the HTML push parser. The related bug is

    https://bugzilla.gnome.org/show_bug.cgi?id=444994

Make sure that ctxt->checkIndex is always written and store additional
parser state in ctxt->inSubset which is unused in the HTML parser.

Found by OSS-Fuzz.
2020-07-06 12:17:20 +02:00
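The optimization restored by 477c7f6aff can be pictured as a search that remembers how far it has already looked. The following is a simplified sketch: `struct lookahead` and `lookup_char` are hypothetical stand-ins for the parser context and lookup routine, and the real parser tracks more state than a single offset.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the parser context. */
struct lookahead {
    size_t checkIndex;      /* bytes already scanned without a match */
};

/* Search buf[0..len) for ch, resuming where the previous call stopped.
 * Returns the index of the match, or -1 after recording the scan
 * position so the next call does not rescan the same bytes. */
static long
lookup_char(struct lookahead *la, const char *buf, size_t len, char ch)
{
    size_t i;

    assert(la != NULL && buf != NULL);

    for (i = la->checkIndex; i < len; i++) {
        if (buf[i] == ch) {
            la->checkIndex = 0;     /* found: next search starts fresh */
            return (long) i;
        }
    }
    la->checkIndex = len;           /* always written, even without a match */
    return -1;
}
```

If data arrives one byte at a time, each byte is examined once, so the total work stays linear. Forgetting to store the offset on any return path brings back the quadratic rescan.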
Nick Wellnhofer
f8329fdc23 Report error for invalid regexp quantifiers
2020-07-02 11:54:28 +02:00
Nick Wellnhofer
13ba5b619a Reset HTML parser input before reporting encoding error
If charset conversion fails, reset the input pointers before reporting
the error and bailing out. Otherwise, the input pointers are left in an
invalid state which could lead to use-after-free and other memory
errors.

Similar to f9e7997e. Found by OSS-Fuzz.
2020-06-28 13:21:50 +02:00
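The ordering that 13ba5b619a describes (repair the input pointers first, report the error second) can be sketched like this. All names are hypothetical, and pointing the input at a static empty string is one assumed way to reach a valid state, not necessarily what the library does.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical input descriptor with the usual three cursors. */
struct input {
    const char *base;
    const char *cur;
    const char *end;
};

/* Point all cursors at a static empty string so that later reads
 * stay in bounds even if the old buffer has been freed. */
static void
input_reset_empty(struct input *in)
{
    static const char empty[] = "";

    assert(in != NULL);
    in->base = in->cur = in->end = empty;
}

/* Hypothetical failure path: fix up the pointers, then report. */
static int
on_conversion_failure(struct input *in,
                      void (*report)(const struct input *))
{
    input_reset_empty(in);      /* 1. leave no dangling pointers */
    if (report != NULL)
        report(in);             /* 2. the handler may inspect the input */
    return -1;
}
```

The order matters because an error handler may read from the input, for example to print context around the failure.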
Nick Wellnhofer
1e7851b5ae Fix integer overflow in xmlFAParseQuantExact
Found by OSS-Fuzz.
2020-06-25 12:18:21 +02:00
Nick Wellnhofer
84bab955fe Fix return value of xmlC14NDocDumpMemory
Make sure to return -1 in case of buffer errors.

Fixes #174.
2020-06-24 20:07:32 +02:00
Martin Vidner
43a8836cde Fix rebuilding docs, by hiding __attribute__((...)) behind a macro.
When enabled via `./configure --enable-rebuild-docs`,
`make -C doc libxml2-api.xml` will invoke apibuild.py
to rebuild libxml2-api.xml from the sources.
But the code added in
9fa3200cb366c726f7c8ef234282603bb9e8816d made it error out with

```
Parsing ../parser.c
Parse Error: parsing type : expecting a name
('Got token ', ('sep', '('))
('Last token: ', ('sep', '('))
('Token queue: ', [('name', 'destructor'), ('sep', ')'), ('sep', ')')])
('Line 14689 end: ', '')
```
2020-06-24 19:55:52 +02:00
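Hiding the attribute behind a macro, as 43a8836cde does, means a source scanner only sees a plain identifier in the declaration. A minimal sketch, assuming GCC or Clang; the macro name `MY_ATTRIBUTE_DESTRUCTOR` and the function `module_cleanup` are hypothetical and chosen for this example only.

```c
/* Hypothetical macro: expands to the attribute where the compiler
 * supports it and to nothing elsewhere. */
#if defined(__GNUC__)
  #define MY_ATTRIBUTE_DESTRUCTOR __attribute__((destructor))
#else
  #define MY_ATTRIBUTE_DESTRUCTOR
#endif

static int cleanup_calls = 0;

/* Runs automatically at program exit on compilers that honor the
 * attribute; it can still be called by hand like any function. */
static void MY_ATTRIBUTE_DESTRUCTOR
module_cleanup(void)
{
    cleanup_calls++;
}
```

A tool that tokenizes the file without running the preprocessor no longer meets the nested parentheses of `__attribute__((...))` at the declaration site.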
Nick Wellnhofer
9f42f6baaa Don't follow next pointer on documents in xmlXPathRunStreamEval
RVTs from libxslt are document nodes which are linked using the 'next'
pointer. These pointers must never be used to navigate the document
tree. Otherwise, random content from other RVTs could be returned
when evaluating XPath expressions.

It's interesting that this seemingly long-standing bug wasn't
discovered earlier. This issue could also cause severe performance
degradation.

Fixes https://gitlab.gnome.org/GNOME/libxslt/-/issues/37
2020-06-24 15:33:38 +02:00
Nick Wellnhofer
c0440868c3 Copy xs:duration parser from libexslt
The duration parser in libexslt checks for integer overflows.
2020-06-23 16:20:28 +02:00
Nick Wellnhofer
18425d3ad5 Fix integer overflow in _xmlSchemaParseGYear
Found with libFuzzer and UBSan.
2020-06-23 16:20:28 +02:00
Nick Wellnhofer
070d635e77 Fix integer overflow when parsing {min,max}Occurs
Clamp value to INT_MAX.

Found with libFuzzer and UBSan.
2020-06-23 16:20:28 +02:00
Nick Wellnhofer
50f18830e1 Fix another memory leak in xmlSchemaValAtomicType
Don't collapse language IDs twice.

Found with libFuzzer and ASan.
2020-06-23 16:20:28 +02:00
Nick Wellnhofer
eac1c7e2e5 Fuzz target for XML Schemas
This only tests the schema parser for now.
2020-06-23 16:20:27 +02:00
Nick Wellnhofer
ffd31dbefd Move entity recorder to fuzz.c
2020-06-21 12:15:46 +02:00
Nick Wellnhofer
681f094e5b Fix unsigned integer overflow in htmlParseTryOrFinish
Cast to signed type before subtraction to avoid unsigned integer
overflow. Also use ptrdiff_t to avoid potential integer truncation.

Found with libFuzzer and UBSan.
2020-06-15 21:25:22 +02:00
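The fix in 681f094e5b follows a general pattern: convert unsigned operands to a signed type before subtracting, so a "negative" difference stays negative instead of wrapping around. A hypothetical sketch, assuming both operands fit in `ptrdiff_t`:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: bytes left after `used` of `total` have been
 * consumed. Subtracting the size_t values directly would wrap to a
 * huge positive number whenever used > total. */
static ptrdiff_t
bytes_remaining(size_t total, size_t used)
{
    /* Assumption: both values are representable as ptrdiff_t. */
    assert(total <= (size_t) PTRDIFF_MAX && used <= (size_t) PTRDIFF_MAX);

    return (ptrdiff_t) total - (ptrdiff_t) used;
}
```

Using `ptrdiff_t` rather than `int` for the result also avoids truncation on platforms where pointers are wider than `int`.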
Nick Wellnhofer
31ca4a728c Fix integer overflow in htmlParseCharRef
Fixes #115.
2020-06-15 21:23:54 +02:00
Nick Wellnhofer
2f9382033e Fix undefined behavior in UTF16LEToUTF8
Don't perform arithmetic on null pointer.

Found with libFuzzer and UBSan.
2020-06-15 21:23:54 +02:00
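Avoiding arithmetic on a null pointer, as 2f9382033e does, usually takes one explicit check before the addition. The helper below is hypothetical and only shows the guard; it is not the converter code touched by the commit.

```c
#include <stddef.h>

/* Hypothetical helper: compute the end of a buffer. Adding an offset
 * to a null pointer is undefined behavior in C and is flagged by
 * UBSan, so the null case is handled before any arithmetic. */
static const unsigned char *
buffer_end(const unsigned char *buf, size_t len)
{
    if (buf == NULL)
        return NULL;        /* no arithmetic on a null pointer */
    return buf + len;
}
```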
Nick Wellnhofer
536f421d37 Fuzz target for HTML parser
2020-06-15 15:23:38 +02:00
Nick Wellnhofer
a697ed1e24 Fix return value of xmlCharEncOutput
Commit 407b393d introduced a regression caused by xmlCharEncOutput
returning 0 in case of success instead of the number of bytes written.
Always use its return value for nbchars in xmlOutputBufferWrite.

Fixes #166.
2020-06-15 15:23:38 +02:00
Nick Wellnhofer
af893a58c6 Update GitLab CI container
2020-06-11 16:08:16 +02:00
Nick Wellnhofer
a28f7d8789 Never expand parameter entities in text declaration
When parsing the text declaration of external DTDs or entities, make
sure that parameter entities are not expanded. This also fixes a memory
leak in certain error cases.

The change to xmlSkipBlankChars assumes that the parser state is
maintained correctly when parsing external DTDs or parameter entities,
and might expose bugs in the code that were hidden previously.

Found by OSS-Fuzz.
2020-06-10 14:25:19 +02:00
Nick Wellnhofer
487871b0e3 Fix undefined behavior in xmlXPathTryStreamCompile
&NULL[0] is undefined behavior.
2020-06-10 13:23:43 +02:00
Nick Wellnhofer
e98150d444 Add options file for xml fuzzer
This will be picked up by OSS-Fuzz, limiting the maximum input size to
80 KB and hopefully avoiding timeouts. Some of the timeouts seem to be
related to our suboptimal handling of excessive entity expansion.
The new fuzzers support external entities and make this problem even
more prominent.
2020-06-09 13:53:06 +02:00
Nick Wellnhofer
2af3c2a8b9 Fix use-after-free with validating reader
Just like IDs, IDREF attributes must be removed from the document's
refs table when they're freed by a reader. This bug is often hidden
because xmlAttr structs are reused and strings are stored in a
dictionary unless XML_PARSE_NODICT is specified.

Found by OSS-Fuzz.
2020-06-08 14:05:42 +02:00
Nick Wellnhofer
00ed736eec Add a couple of libFuzzer targets
- XML fuzzer
  Currently tests the pull parser, push parser and reader, as well as
  serialization. Supports splitting fuzz data into multiple documents
  for things like external DTDs or entities. The seed corpus is built
  from parts of the test suite.

- Regexp fuzzer
  Seed corpus was statically generated from test suite.

- URI fuzzer
  Tests parsing and most other functions from uri.c.
2020-06-05 13:53:11 +02:00
Nick Wellnhofer
2e8cc66d8f xmlParseBalancedChunkMemory must not be called with NULL doc
There is no way to avoid memory leaks without a document to hold the
namespace list.
2020-05-30 15:43:34 +02:00