libxml2

mirror of https://gitlab.gnome.org/GNOME/libxml2 synced 2025-03-28 21:33:13 +00:00

Author	SHA1	Message	Date
Nick Wellnhofer	71122421a1	html: Make implied <p> tags more deterministic libxml2's HTML parser adds <p> start tags in some situations. This behavior, which doesn't follow any standard, was added in 2000, see here: http://veillard.com/XML/messages/0655.html Text nodes that only contain whitespace don't imply a <p> tag, but the whitespace check cannot work reliably if we're parsing partial text data which can happen with both pull and push parser. The logic in `areBlanks` is hard to follow. The checks involving `CUR` depend on the position of the input pointer and seem dubious. It's also possible that the behavior changed inadvertently with a later commit. As a result, it's hard to come up with good test cases. We now process leading whitespace before creating implied tags. This is more in line with HTML5 and should avoid at least some issues with partial text data. For example, parsing the string "<head> x" used to result in: <html> <head></head> <body><p> x</p></body> </html> And now results in: <html> <head> </head> <body><p>x</p></body> </html> Except for the implied <p> tag, this matches HTML5.	2025-02-13 14:31:44 +01:00
Nick Wellnhofer	e1834745e0	html: Add character data tests	2024-10-06 18:13:05 +02:00
Nick Wellnhofer	5951179239	html: Parse named character references according to HTML5	2024-10-06 18:13:05 +02:00
Nick Wellnhofer	e395946194	html: Reenable buggy detection of XML declarations Switch to UTF-8 if a document starts with '<?xm' to match old behavior. Also enable this check in the push parser. Fixes #637.	2023-11-30 16:22:59 +01:00
Nick Wellnhofer	d7d0bc6581	SAX2: Ignore namespaces in HTML documents In commit 21ca8829, we started to ignore namespaces in HTML element names but we still called xmlSplitQName, effectively stripping the namespace prefix. This would cause elements like <o:p> to be parsed as <p>. Now we leave the name untouched. Fixes #508.	2023-03-31 17:08:43 +02:00
Nick Wellnhofer	e986d09cf5	Skip incorrectly opened HTML comments Commit 4fd69f3e fixed handling of '<' characters not followed by an ASCII letter. But a '<!' sequence followed by invalid characters should be treated as a bogus comment and skipped. Fixes #380.	2022-08-02 14:38:09 +02:00
Mike Dalessio	24cdc89006	test coverage for abruptly-closed comments These establish baseline behavior so that the subsequent commit is clear about the behavior it will modify.	2022-03-02 14:42:47 +00:00
Nick Wellnhofer	2732b23466	Fix regression parsing public IDs literals in HTML Fix regression introduced when reworking htmlParsePubidLiteral in commit 93ce33c2. Fixes #318.	2022-01-10 13:37:59 +01:00
Mike Dalessio	e28d9347bc	add test coverage for incorrectly-closed comments this establishes the baseline behavior so that subsequent commits which modify this behavior are clear about what's being changed.	2020-12-16 16:12:07 +01:00
Nick Wellnhofer	477c7f6aff	Fix quadratic runtime in HTML parser Commit eeb99329 removed an important optimization avoiding quadratic runtime when repeatedly scanning the input buffer for terminating characters in the HTML push parser. The related bug is https://bugzilla.gnome.org/show_bug.cgi?id=444994 Make sure that ctxt->checkIndex is always written and store additional parser state in ctxt->inSubset which is unused in the HTML parser. Found by OSS-Fuzz.	2020-07-06 12:17:20 +02:00
David Kilzer	85c112a082	Add test cases for bug 758518 test/HTML/758518-entity.html exposed a bug in pushParseTest() in runtest.c which assumed that an input file was at least 4 bytes long. That test case is only 3 bytes, so we now take the minimum of 4 bytes or the length of the test input. We also now use 'chunkSize' in place of the hard-coded value '1024' later in the function.	2017-06-12 18:26:11 +02:00
Pranjal Jumde	0bcd05c5cd	Heap-based buffer overread in htmlCurrentChar For https://bugzilla.gnome.org/show_bug.cgi?id=758606 * parserInternals.c: (xmlNextChar): Add a test to catch other issues on ctxt->input corruption proactively. For non-UTF-8 charsets, xmlNextChar() failed to check for the end of the input buffer and would continue reading. Fix this by pulling out the check for the end of the input buffer into common code, and return if we reach the end of the input buffer prematurely. * result/HTML/758606.html: Added. * result/HTML/758606.html.err: Added. * result/HTML/758606.html.sax: Added. * result/HTML/758606_2.html: Added. * result/HTML/758606_2.html.err: Added. * result/HTML/758606_2.html.sax: Added. * test/HTML/758606.html: Added test case. * test/HTML/758606_2.html: Added test case.	2016-05-23 15:01:07 +08:00
Pranjal Jumde	a820dbeac2	Bug 758605: Heap-based buffer overread in xmlDictAddString <https://bugzilla.gnome.org/show_bug.cgi?id=758605 > Reviewed by David Kilzer. * HTMLparser.c: (htmlParseName): Add bounds check. (htmlParseNameComplex): Ditto. * result/HTML/758605.html: Added. * result/HTML/758605.html.err: Added. * result/HTML/758605.html.sax: Added. * runtest.c: (pushParseTest): The input for the new test case was so small (4 bytes) that htmlParseChunk() was never called after htmlCreatePushParserCtxt(), thereby creating a false positive test failure. Fixed by using a do-while loop so we always call htmlParseChunk() at least once. * test/HTML/758605.html: Added.	2016-05-23 15:01:07 +08:00
Denis Pauk	a0cd075d94	HTML parser error with <noscript> in the <head> For https://bugzilla.gnome.org/show_bug.cgi?id=615785 When the <noscript> is found, <head> is closed and a <body> element is created. The real <body id="xxx"> gets skipped over, so I can't see any of the body's attributes. Just don't close <head> when encountering a <noscript> Add a regression test too	2012-05-11 19:31:12 +08:00
Denis Pauk	868d92da89	Add HTML parser support for HTML5 meta charset encoding declaration For https://bugzilla.gnome.org/show_bug.cgi?id=655218 http://www.w3.org/TR/2011/WD-html5-20110525/semantics.html#the-meta-element """ The charset attribute specifies the character encoding used by the document. This is a character encoding declaration. If the attribute is present in an XML document, its value must be an ASCII case-insensitive match for the string "UTF-8" (and the document is therefore forced to use UTF-8 as its encoding). """ However, while <meta http-equiv="Content-Type" content="text/html; charset=utf8"> works, <meta charset="utf8"> does not. While libxml2 HTML parser is not tuned for HTML5, this is a simple addition Also added a testcase	2012-05-10 15:34:57 +08:00
Daniel Veillard	a57ba4ce96	* HTMLparser.c: fix an HTML parsing error on large data sections reported by Mike Day * test/HTML/utf8bug.html result/HTML/utf8bug.html.err result/HTML/utf8bug.html.sax result/HTML/utf8bug.html: add the reproducer to the test suite daniel svn path=/trunk/; revision=3797	2008-09-25 16:06:18 +00:00
Daniel Veillard	48519092e5	* HTMLparser.c: fixing HTML entities in attributes parsing bug #362552 * result/HTML/entities2.html* test/HTML/entities2.html: added to the regression suite Daniel	2006-10-17 15:56:35 +00:00
Daniel Veillard	b990008f05	* HTMLparser.c: script HTML parser error fix, corrects bug #319715 * result/HTML/53867* test/HTML/53867.html: added test from Michael Day to the regression suite Daniel	2005-10-25 12:36:29 +00:00
Daniel Veillard	358fef4b1e	* HTMLparser.c: applied UTF-8 script parsing bug #310229 fix from Jiri Netolicky * result/HTML/script2.html* test/HTML/script2.html: added the test case from the regression suite Daniel	2005-07-13 16:37:38 +00:00
Daniel Veillard	597f1c1f34	* HTMLparser.c: applied patch from James Bursa fixing an html parsing bug in push mode * result/HTML/repeat.html* test/HTML/repeat.html: added the test to the regression suite Daniel	2005-07-03 23:00:18 +00:00
Daniel Veillard	fc484dd0a0	* HTMLparser.c: added support for HTML PIs #156087 * test/HTML/python.html result/HTML/python.html*: added specific tests Daniel	2004-10-22 14:34:23 +00:00
Daniel Veillard	ce02dbc430	* HTMLparser.c: Mikhail Sogrine pointed out a bug in HTML parsing, applied his patch * result/HTML/attrents.html result/HTML/attrents.html.err result/HTML/attrents.html.sax test/HTML/attrents.html: added the test and result case provided by Mikhail Sogrine Daniel	2002-10-22 19:14:58 +00:00
Daniel Veillard	957fdcf2a3	* HTMLparser.c test/HTML/lt.html result/HTML/lt.html*: handle the case of < in quoted attributes, Bastian Kleineidam Daniel	2001-11-06 22:50:19 +00:00
Daniel Veillard	f0c5376a03	- HTMLtree.c: when in a pre element no formatting space should be added. - test/HTML/pre.html result/HTML/pre.html*: added a regression test Daniel	2001-06-07 16:07:07 +00:00
Daniel Veillard	f69bb4b5bf	- HTMLparser.c: Closed bug #54891 - result/HTML/cf_128.html* test/HTML/cf_128.html: added the test to the suite forgot to commit this one yesterday - encoding.h hash.c nanoftp.h parser.h tree.h uri.h xlink.h xpointer.c: applied a documentation patch from LotR and filled in a few missing descriptions Daniel	2001-05-19 13:24:56 +00:00
Daniel Veillard	7eda8452f8	- HTMLparser.c HTMLtree.[ch] SAX.c testHTML.c tree.c: fixed HTML support for SCRIPT and STYLE with help from Bjorn Reese - test/HTML/* result/HTML/*: added simple testcase and updated the existing ones. Daniel	2000-10-14 23:38:43 +00:00
Daniel Veillard	87b9539573	Large sync between my W3C base and Gnome's one: - parser.[ch]: added xmlGetFeaturesList() xmlGetFeature() and xmlAddFeature() - tree.[ch]: added xmlAddChildList() - xmllint.c: MAP_FAILED macro test - parser.h: added xmlParseCtxtExternalEntity() - valid.c: applied bug fixes removed warning - tree.c: added CDATA block to elements content - testSAX.c: cleanup of output - testHTML.c: added SAX testing - encoding.c: better error recovery - SAX.c, parser.c: fixed one of the external entity processing of the OASis testsuite - Makefile.am: added HTML SAX regression tests - configure.in: bumped to 2.2.2 - test/HTML/ result/HTML: added a few of HTML tests, and added the SAX results Daniel	2000-08-12 21:12:04 +00:00
Daniel Veillard	71f93fca5a	Added a bunch of missing testsuite-related files, Daniel.	2000-07-14 14:54:24 +00:00
Daniel Veillard	5cb5ab8d94	- release 1.8.2 - HTML handling improvement - new tree handling functions - default namespace on attribute bug fixed - libxml use for C++ fixed (for good this time !) Daniel	1999-12-21 15:35:29 +00:00
Daniel Veillard	3500838f65	BUG FIXED #2784 HTML parsing/output improvements Rebuilt, updated the docs Improvement of regression scripts, make testall should look clean Released as 1.7.4	1999-10-25 13:15:52 +00:00
Daniel Veillard	7c1206fc06	Revamped HTML parsing, lots of bug fixes for HTML stuff, Added xmlValidGetValidElements and xmlValidGetPotentialChildren, Completed and cleaned up the tests, Added doc for new modules gnome-xml-xmlmemory.html and gnome-xml-nanohttp.html, Daniel	1999-10-14 09:10:25 +00:00
Daniel Veillard	b05deb7f5f	Huge commit: 1.5.0, XML validation, Xpath, bugfixes, examples .... Daniel	1999-08-10 19:04:08 +00:00
Daniel Veillard	82150d8a99	HTML parsing, output is now correct, added HTMLtests target and testcases, Daniel	1999-07-07 07:32:15 +00:00
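
Commit 477c7f6aff in the list above mentions an optimization that avoids quadratic runtime when the push parser repeatedly scans its input buffer for a terminating character, and says that ctxt->checkIndex must always be written. The general idea can be sketched as follows. This is an illustration only, not libxml2's code: the `Scanner` type, its fields other than the borrowed name `checkIndex`, and the functions `scan_for` and `feed_byte_by_byte` are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical scanner state; not a libxml2 type. */
typedef struct {
    const char *buf;   /* all input received so far */
    size_t len;        /* number of valid bytes in buf */
    size_t checkIndex; /* bytes already known not to contain the terminator */
    size_t steps;      /* bytes examined so far, to make the cost visible */
} Scanner;

/* Look for `term`, resuming where the previous failed scan stopped.
 * Returns the index of the terminator, or -1 if it has not arrived yet. */
static long scan_for(Scanner *s, char term) {
    assert(s->checkIndex <= s->len);
    for (size_t i = s->checkIndex; i < s->len; i++) {
        s->steps++;
        if (s->buf[i] == term) {
            s->checkIndex = 0; /* start fresh for the next search */
            return (long)i;
        }
    }
    /* Always record progress, even on failure, so that the next call
     * does not rescan bytes that were already checked. */
    s->checkIndex = s->len;
    return -1;
}

/* Worst case for a naive scanner: input arrives one byte at a time. */
static long feed_byte_by_byte(const char *data, size_t n, char term,
                              size_t *steps) {
    Scanner s = { data, 0, 0, 0 };
    long found = -1;
    for (size_t avail = 1; avail <= n && found < 0; avail++) {
        s.len = avail; /* one more byte became available */
        found = scan_for(&s, term);
    }
    *steps = s.steps;
    return found;
}
```

With the saved index, each byte is examined once no matter how the input is split into chunks. If `checkIndex` were left at zero after a failed scan, feeding n bytes one at a time would examine on the order of n squared bytes.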

33 Commits
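
Commits 85c112a082 and a820dbeac2 above describe two fixes to pushParseTest() in runtest.c: the initial read is now the minimum of 4 bytes and the input length, and a do-while loop guarantees that htmlParseChunk() is called at least once even for very small inputs. A rough sketch of that control flow, under stated assumptions, is shown below. It does not call libxml2: `FakeParser`, `create_parser`, `feed_chunk`, `min_size`, and `push_feed` are hypothetical stand-ins that only record what they were given.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a push parser; it only records what it was fed. */
typedef struct {
    size_t initBytes;  /* bytes passed when the "parser" was created */
    size_t chunkCalls; /* number of feed_chunk() calls */
    size_t chunkBytes; /* total bytes passed to feed_chunk() */
    int terminated;    /* set once the final chunk was signalled */
} FakeParser;

static void create_parser(FakeParser *p, const char *data, size_t len) {
    (void)data;
    p->initBytes = len;
    p->chunkCalls = 0;
    p->chunkBytes = 0;
    p->terminated = 0;
}

static void feed_chunk(FakeParser *p, const char *data, size_t len,
                       int terminate) {
    (void)data;
    p->chunkCalls++;
    p->chunkBytes += len;
    if (terminate)
        p->terminated = 1;
}

static size_t min_size(size_t a, size_t b) { return a < b ? a : b; }

/* Feed `input` to the stand-in parser in chunks of at most `chunkSize`. */
static void push_feed(FakeParser *p, const char *input, size_t size,
                      size_t chunkSize) {
    assert(chunkSize > 0);
    /* Never assume the input has at least 4 bytes. */
    size_t cur = min_size(4, size);
    create_parser(p, input, cur);
    /* do-while: the chunk function runs at least once, even when the
     * initial bytes already covered the whole input. */
    do {
        size_t n = min_size(chunkSize, size - cur);
        int last = (cur + n == size);
        feed_chunk(p, input + cur, n, last);
        cur += n;
    } while (cur < size);
}
```

For a 3-byte input this makes exactly one `feed_chunk` call with zero bytes and the terminate flag set, which is the case a plain while loop would skip.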