33 Commits

Author SHA1 Message Date
Nick Wellnhofer
71122421a1 html: Make implied <p> tags more deterministic
libxml2's HTML parser adds <p> start tags in some situations. This
behavior, which doesn't follow any standard, was added in 2000, see
here: http://veillard.com/XML/messages/0655.html

Text nodes that only contain whitespace don't imply a <p> tag, but the
whitespace check cannot work reliably if we're parsing partial text data
which can happen with both pull and push parser.

The logic in `areBlanks` is hard to follow. The checks involving `CUR`
depend on the position of the input pointer and seem dubious. It's also
possible that the behavior changed inadvertently with a later commit.
As a result, it's hard to come up with good test cases.

We now process leading whitespace before creating implied tags. This is
more in line with HTML5 and should avoid at least some issues with
partial text data.

For example, parsing the string "<head>   x" used to result in:

<html>
<head></head>
<body><p>   x</p></body>
</html>

And now results in:

<html>
<head>   </head>
<body><p>x</p></body>
</html>

Except for the implied <p> tag, this matches HTML5.
2025-02-13 14:31:44 +01:00
Nick Wellnhofer
e1834745e0 html: Add character data tests 2024-10-06 18:13:05 +02:00
Nick Wellnhofer
5951179239 html: Parse named character references according to HTML5 2024-10-06 18:13:05 +02:00
Nick Wellnhofer
e395946194 html: Reenable buggy detection of XML declarations
Switch to UTF-8 if a document starts with '<?xm' to match old behavior.
Also enable this check in the push parser.

Fixes #637.
2023-11-30 16:22:59 +01:00
Nick Wellnhofer
d7d0bc6581 SAX2: Ignore namespaces in HTML documents
In commit 21ca8829, we started to ignore namespaces in HTML element
names but we still called xmlSplitQName, effectively stripping the
namespace prefix. This would cause elements like <o:p> being parsed
as <p>. Now we leave the name untouched.

Fixes #508.
2023-03-31 17:08:43 +02:00
Nick Wellnhofer
e986d09cf5 Skip incorrectly opened HTML comments
Commit 4fd69f3e fixed handling of '<' characters not followed by an
ASCII letter. But a '<!' sequence followed by invalid characters should
be treated as bogus comment and skipped.

Fixes #380.
2022-08-02 14:38:09 +02:00
Mike Dalessio
24cdc89006 test coverage for abruptly-closed comments
These establish baseline behavior so that the subsequent commit is
clear about the behavior it will modify.
2022-03-02 14:42:47 +00:00
Nick Wellnhofer
2732b23466 Fix regression parsing public IDs literals in HTML
Fix regression introduced when reworking htmlParsePubidLiteral in
commit 93ce33c2.

Fixes #318.
2022-01-10 13:37:59 +01:00
Mike Dalessio
e28d9347bc add test coverage for incorrectly-closed comments
this establishes the baseline behavior so that subsequent commits
which modify this behavior are clear about what's being changed.
2020-12-16 16:12:07 +01:00
Nick Wellnhofer
477c7f6aff Fix quadratic runtime in HTML parser
Commit eeb99329 removed an important optimization avoiding quadratic
runtime when repeatedly scanning the input buffer for terminating
characters in the HTML push parser. The related bug is

    https://bugzilla.gnome.org/show_bug.cgi?id=444994

Make sure that ctxt->checkIndex is always written and store additional
parser state in ctxt->inSubset which is unused in the HTML parser.

Found by OSS-Fuzz.
2020-07-06 12:17:20 +02:00
David Kilzer
85c112a082 Add test cases for bug 758518
test/HTML/758518-entity.html exposed a bug in pushParseTest() in
runtest.c which assumed that an input file was at least 4 bytes long.
That test case is only 3 bytes, so we now take the minimum of 4 bytes
or the length of the test input.  We also now use 'chunkSize' in place
of the hard-coded value '1024' later in the function.
2017-06-12 18:26:11 +02:00
Pranjal Jumde
0bcd05c5cd Heap-based buffer overread in htmlCurrentChar
For https://bugzilla.gnome.org/show_bug.cgi?id=758606

* parserInternals.c:
(xmlNextChar): Add an test to catch other issues on ctxt->input
corruption proactively.
For non-UTF-8 charsets, xmlNextChar() failed to check for the end
of the input buffer and would continuing reading.  Fix this by
pulling out the check for the end of the input buffer into common
code, and return if we reach the end of the input buffer
prematurely.
* result/HTML/758606.html: Added.
* result/HTML/758606.html.err: Added.
* result/HTML/758606.html.sax: Added.
* result/HTML/758606_2.html: Added.
* result/HTML/758606_2.html.err: Added.
* result/HTML/758606_2.html.sax: Added.
* test/HTML/758606.html: Added test case.
* test/HTML/758606_2.html: Added test case.
2016-05-23 15:01:07 +08:00
Pranjal Jumde
a820dbeac2 Bug 758605: Heap-based buffer overread in xmlDictAddString <https://bugzilla.gnome.org/show_bug.cgi?id=758605>
Reviewed by David Kilzer.

* HTMLparser.c:
(htmlParseName): Add bounds check.
(htmlParseNameComplex): Ditto.
* result/HTML/758605.html: Added.
* result/HTML/758605.html.err: Added.
* result/HTML/758605.html.sax: Added.
* runtest.c:
(pushParseTest): The input for the new test case was so small
(4 bytes) that htmlParseChunk() was never called after
htmlCreatePushParserCtxt(), thereby creating a false positive
test failure.  Fixed by using a do-while loop so we always call
htmlParseChunk() at least once.
* test/HTML/758605.html: Added.
2016-05-23 15:01:07 +08:00
Denis Pauk
a0cd075d94 HTML parser error with <noscript> in the <head>
For https://bugzilla.gnome.org/show_bug.cgi?id=615785
When the <noscript> is found, <head> is closed and a <body> element is created.
The real <body id="xxx"> gets skipped over, so I can't see any of the
body's attributes.
Just don't close <head> when encountering a <noscript>
Add a regression test too
2012-05-11 19:31:12 +08:00
Denis Pauk
868d92da89 Add HTML parser support for HTML5 meta charset encoding declaration
For https://bugzilla.gnome.org/show_bug.cgi?id=655218

http://www.w3.org/TR/2011/WD-html5-20110525/semantics.html#the-meta-element

"""
The charset attribute specifies the character encoding used by the document.
This is a character encoding declaration. If the attribute is present in an XML
document, its value must be an ASCII case-insensitive match for the string
"UTF-8" (and the document is therefore forced to use UTF-8 as its
encoding).
"""

However, while <meta http-equiv="Content-Type" content="text/html;
charset=utf8"> works, <meta charset="utf8"> does not.

While libxml2 HTML parser is not tuned for HTML5, this is a simple
addition

Also added a testcase
2012-05-10 15:34:57 +08:00
Daniel Veillard
a57ba4ce96 fix an HTML parsing error on large data sections reported by Mike Day add
* HTMLparser.c: fix an HTML parsing error on large data sections
  reported by Mike Day
* test/HTML/utf8bug.html result/HTML/utf8bug.html.err
  result/HTML/utf8bug.html.sax result/HTML/utf8bug.html: add the
  reproducer to the test suite
daniel

svn path=/trunk/; revision=3797
2008-09-25 16:06:18 +00:00
Daniel Veillard
48519092e5 fixing HTML entities in attributes parsing bug #362552 added to the
* HTMLparser.c: fixing HTML entities in attributes parsing bug #362552
* result/HTML/entities2.html* test/HTML/entities2.html: added to
  the regression suite
Daniel
2006-10-17 15:56:35 +00:00
Daniel Veillard
b990008f05 script HTML parser error fix, corrects bug #319715 added test from Michael
* HTMLparser.c: script HTML parser error fix, corrects bug #319715
* result/HTML/53867* test/HTML/53867.html: added test from Michael Day
  to the regression suite
Daniel
2005-10-25 12:36:29 +00:00
Daniel Veillard
358fef4b1e applied UTF-8 script parsing bug #310229 fix from Jiri Netolicky added the
* HTMLparser.c: applied UTF-8 script parsing bug #310229 fix from
  Jiri Netolicky
* result/HTML/script2.html* test/HTML/script2.html: added the test
  case from the regression suite
Daniel
2005-07-13 16:37:38 +00:00
Daniel Veillard
597f1c1f34 applied patch from James Bursa fixing an html parsing bug in push mode
* HTMLparser.c: applied patch from James Bursa fixing an html parsing
  bug in push mode
* result/HTML/repeat.html* test/HTML/repeat.html: added the test to the
  regression suite
Daniel
2005-07-03 23:00:18 +00:00
Daniel Veillard
fc484dd0a0 added support for HTML PIs #156087 added specific tests Daniel
* HTMLparser.c: added support for HTML PIs #156087
* test/HTML/python.html result/HTML/python.html*: added specific tests
Daniel
2004-10-22 14:34:23 +00:00
Daniel Veillard
ce02dbc430 Mikhail Sogrine pointed out a bug in HTML parsing, applied his patch added
* HTMLparser.c: Mikhail Sogrine pointed out a bug in HTML
  parsing, applied his patch
* result/HTML/attrents.html result/HTML/attrents.html.err
  result/HTML/attrents.html.sax test/HTML/attrents.html:
  added the test and result case provided by Mikhail Sogrine
Daniel
2002-10-22 19:14:58 +00:00
Daniel Veillard
957fdcf2a3 handle the case of < in quoted attributes, Bastian Kleineidam Daniel
* HTMLparser.c test/HTML/lt.html result/HTML/lt.html*:
  handle the case of < in quoted attributes, Bastian Kleineidam
Daniel
2001-11-06 22:50:19 +00:00
Daniel Veillard
f0c5376a03 - HTMLtree.c: when in a pre element no formatting space should
be added.
- test/HTML/pre.html result/HTML/pre.html*: added a regression test
Daniel
2001-06-07 16:07:07 +00:00
Daniel Veillard
f69bb4b5bf - HTMLparser.c: Closed bug #54891
- result/HTML/cf_128.html* test/HTML/cf_128.html: added the test
  to the suite
forgot to commit this one yesterday
- encoding.h hash.c nanoftp.h parser.h tree.h uri.h xlink.h xpointer.c:
  applied a documentation patch from LotR and filled in a few missing
  descriptions
Daniel
2001-05-19 13:24:56 +00:00
Daniel Veillard
7eda8452f8 - HTMLparser.c HTMLtree.[ch] SAX.c testHTML.c tree.c: fixed HTML
support for SCRIPT and STYLE with help from Bjorn Reese
- test/HTML/* result/HTML/*: added simple testcase and updated
  the existing ones.
Daniel
2000-10-14 23:38:43 +00:00
Daniel Veillard
87b9539573 Large sync between my W3C base and Gnome's one:
- parser.[ch]: added xmlGetFeaturesList() xmlGetFeature() and xmlAddFeature()
- tree.[ch]: added xmlAddChildList()
- xmllint.c: MAP_FAILED macro test
- parser.h: added xmlParseCtxtExternalEntity()
- valid.c: applied bug fixes removed warning
- tree.c: added CDATA block to elements content
- testSAX.c: cleanup of output
- testHTML.c: added SAX testing
- encoding.c: better error recovery
- SAX.c, parser.c: fixed one of the external entity processing of the OASis testsuite
- Makefile.am: added HTML SAX regression tests
- configure.in: bumped to 2.2.2
- test/HTML/ result/HTML: added a few of HTML tests, and added the SAX results

Daniel
2000-08-12 21:12:04 +00:00
Daniel Veillard
71f93fca5a Added a bunch of testsuite realted files missing, Daniel. 2000-07-14 14:54:24 +00:00
Daniel Veillard
5cb5ab8d94 - release 1.8.2 - HTML handling improvement - new tree handling functions
- release 1.8.2
- HTML handling improvement
- new tree handling functions
- default namespace on attribute bug fixed
- libxml use for C++ fixed (for good this time !)
Daniel
1999-12-21 15:35:29 +00:00
Daniel Veillard
3500838f65 BUG FIXED #2784 HTML parsing/output improvements Rebuilt, updated the docs
BUG FIXED #2784
HTML parsing/output improvements
Rebuilt, updated the docs
Improvement of regression scripts, make testall should look clean
Released as 1.7.4
1999-10-25 13:15:52 +00:00
Daniel Veillard
7c1206fc06 Revamped HTML parsing, lots of bug fixes for HTML stuff,
Added xmlValidGetValidElements and xmlValidGetPotentialChildren,
Completed and cleaned up the tests,
Added doc for new modules gnome-xml-xmlmemory.html and gnome-xml-nanohttp.html,
Daniel
1999-10-14 09:10:25 +00:00
Daniel Veillard
b05deb7f5f Huge commit: 1.5.0, XML validation, Xpath, bugfixes, examples .... Daniel 1999-08-10 19:04:08 +00:00
Daniel Veillard
82150d8a99 HTML parsing, output is now correct, added HTMLtests target and testcases, Daniel 1999-07-07 07:32:15 +00:00