Entries Tagged as 'Syntax'
Validator.nu HTML Parser Version 1.3.1 Released
March 9th, 2011 · No Comments
There is now a new release of the Validator.nu HTML Parser. The new release contains files that were missing from the previous release package by accident. It also contains one tree builder correctness fix and one error reporting improvement.
Tags: Syntax
Version 1.3 of the Validator.nu HTML Parser Released
January 13th, 2011 · No Comments
After over a year without proper releases, there is now a new release of the Validator.nu HTML Parser. There have been numerous changes to the HTML5 spec and, consequently, to the parser since the previous release. All users of the parser should update to the latest release in order to run a version that corresponds [...]
Tags: Syntax
XHTML5 in a nutshell
July 24th, 2010 · No Comments
The WHATWG Wiki portal has a nice section describing HTML vs. XHTML differences, as well as the specifics of a polyglot HTML document that can also be served as a valid XML document. I’d like to review what it takes to transform an HTML5 polyglot document into a valid XHTML5 document: it appears, [...]
Tags: Syntax · WHATWG · What's Next
HTML5 Rationale document
May 10th, 2010 · No Comments
I’ve started a page on the wiki to document the rationale for the decisions made about the HTML specification. There are two goals for this document:
- Explain why things are the way they are
- Explain the difference between multiple similar elements by providing example usages.
One person cannot possibly write the entire thing so [...]
Validator.nu HTML Parser 1.2.0
March 27th, 2009 · No Comments
I put together a new release of the Validator.nu HTML Parser. This is a highly recommended update for everyone who is using a previous version of the parser in an application.
- Fixed an issue where under rare circumstances attribute values were leaking into element content.
- Fixed a bug where isindex processing added attributes to all elements that were supposed to have no attributes.
- Implemented spec changes. (Too numerous to enumerate, but, as a highlight, framesets parse much better now.)
- Moved to WebKit-style foster parenting.
- Changed the API for tree builder subclasses again due to new constraints. If you have previously written your own tree builder subclass, you need to change it.
- Fixed the bundled XML serializer.
- Made it possible to generate a C++ version that does not leak memory from the Java source.
- Removed the C++ translator from the release. (Get it from SVN.)
Tags: Processing Model · Syntax
Supporting New Elements in IE
January 7th, 2009 · No Comments
Internet Explorer poses a small challenge when it comes to making use of the new elements introduced in HTML5. Among others, these include elements like section, article, header and footer. The problem is that due to the way parsing works in IE, these elements are not recognised properly and result in an anomalous DOM representation.
To illustrate, consider this simple document fragment:
<body>
<section>
<p>This is an example</p>
</section>
</body>
Strangely, IE 6, 7 and 8 all fail to parse the section element properly and the resulting DOM looks like this.
- BODY
  - SECTION
  - P
    - #text: This is an example
  - /SECTION
Notice how IE actually creates 2 empty elements. One named SECTION and the other named /SECTION. Yes, it really is parsing the end tag as a start tag for an unknown empty element.
There is a handy workaround available to address this problem, which was first revealed in a comment by Sjoerd Visscher.
The basic concept is that by using document.createElement(tagName) to create each of the unknown elements, the parser in IE then recognises those elements and parses them in a more reasonable and useful way. e.g. By using the following script:
document.createElement("section");
The resulting DOM for the fragment given above looks like this:
- BODY
  - section
    - P
      - #text: This is an example
This same technique works for all unknown elements in IE 6, 7 and 8. Note that there is a known bug that prevented this from working in IE 8 beta 2, but this has since been resolved in the latest non-public technical preview.
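To apply the workaround to more than one element, the same call can simply be repeated for each name. The following is a minimal sketch of that idea, not Remy Sharp's script: the helper name registerUnknownElements is made up for illustration, and the list contains only the four elements named in this post.

```javascript
// Hypothetical helper: the name and the element list are illustrative.
// "doc" is any object with a createElement(name) method, normally document.
function registerUnknownElements(doc, names) {
  var registered = [];
  for (var i = 0; i < names.length; i++) {
    // Creating the element once is enough; the node itself is discarded.
    doc.createElement(names[i]);
    registered.push(names[i]);
  }
  return registered;
}

// Only the elements mentioned in this post; extend the list as needed.
var html5Names = ["section", "article", "header", "footer"];

// In a browser, run this from a script in the head so that it executes
// before the parser reaches the elements in the body.
if (typeof document !== "undefined") {
  registerUnknownElements(document, html5Names);
}
```

The document object is passed in as a parameter only so that the helper can be exercised outside a browser; in a page you would call it with document directly.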
For convenience, Remy Sharp has written and published a simple script that provides this enhancement for all new elements in the current draft of HTML5, which you can download and use.
This script is not needed for other browsers. Opera 9, Firefox 3 and Safari 3 all parse unknown elements in a more reasonable way by default. Note, however, that Firefox 2 does suffer from some related problems, for which there is unfortunately no known solution. Since Firefox users upgrade faster than IE users, it is hoped that Firefox 2 won’t pose too much of a problem in the future.
Tags: Browsers · DOM · Elements · Events · Syntax
Google Tech Talk: HTML5 demos
September 26th, 2008 · No Comments
I gave a talk at Google on Monday demonstrating the various features of HTML5 that are implemented in browsers today. The video is now on YouTube, so now you too can watch and laugh at my lame presentation skills!
The segments of this talk are as follows. Some of the demos are available online for you to play with and are linked to from the following list:
- Introduction
- <video> (00:35)
- postMessage() (05:40)
- localStorage (15:20)
- sessionStorage (21:00)
- Drag and Drop API (29:05)
- onhashchange (37:30)
- Form Controls (40:50)
- <canvas> (56:55)
- Validation (1:07:20)
- Questions and Answers (1:09:35)
If you’re very interested in watching my typos, the high quality version of the video on the YouTube site is clear enough to see the text being typed. More details about the demos can be found on the corresponding demo page.
Tags: Browser API · Browsers · Conformance Checking · DOM · Elements · Events · Forms · Multimedia · Syntax · WHATWG
Validator.nu HTML Parser 1.1.0
August 25th, 2008 · 1 Comment
I have released a new version of the Validator.nu HTML Parser (an implementation of the HTML5 parsing algorithm in Java). The new release supports SVG and MathML subtrees, is faster than the old version, fixes bugs, is more portable and supports applications that want to do document.write().
The parser comes with a sample app that makes it possible to use XSLT programs written for XHTML5+SVG+MathML with text/html.
Warning! The internal APIs have changed. Please refer to the Upgrade Guide below.
Change Log
- Made the SAX, DOM and XOM parser entry point constructors default to altering the infoset instead of throwing when the input needs coercing to be an XML 1.0 4th ed. plus Namespaces infoset.
- Isolated Java IO dependent code from the parser core. The parser core now compiles on Google Web Toolkit.
- Refactored the tokenizer to use a switch branch per state instead of method per state.
- Made various performance tweaks to the tokenizer.
- Implemented support for MathML and SVG foreign content. (Note that the SVG part is based on spec text that has been commented out from the spec at the request of the SVG WG.)
- Made the parser suspendable after any input character.
- Made it possible for custom TreeBuilder subclasses to request parser suspension. (Applications wishing to implement document.write() should provide their own TreeBuilder subclass and a document.write()-aware replacement of the Driver class. Look in the gwt-src/ directory for sample code.)
- Made changes to the parser core to make it more suitable for mechanical translation into other object-oriented programming languages that have C-like control structures but not necessarily a garbage collector (with focus on targeting C++). This work is not complete.
- Made the HTML serializer do the right thing when input represents a conforming XHTML+SVG+MathML tree. (Results may be bad for non-conforming input trees.)
- Developed sample programs for converting between HTML5 and XHTML5 when the input is known to be conforming.
- Provided an XML serializer so that the sample code no longer depends on the Xalan serializer.
- Improved API documentation.
- Fixed bugs in the tokenizer, tree builder and the input stream character encoding decoder.
- Made coercion to an XML infoset work according to the HTML5 spec.
- Added ID uniqueness checking.
- Various other fixes.
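For readers unfamiliar with the "switch branch per state" style mentioned in the change log, here is a toy illustration. It is written in JavaScript for brevity, not in the parser's Java. The three states and the token shapes are invented for this sketch and have nothing to do with the real tokenizer's states or with the HTML5 tokenization rules.

```javascript
// Toy tokenizer: one loop, one switch, one branch per state.
// State names and token shapes are illustrative, not the parser's own.
var DATA = 0, TAG_OPEN = 1, TAG_NAME = 2;

function tokenize(input) {
  var state = DATA;
  var tokens = [];
  var buffer = "";
  for (var i = 0; i < input.length; i++) {
    var c = input.charAt(i);
    switch (state) {
      case DATA:
        if (c === "<") {
          // Emit any pending text before entering the tag.
          if (buffer) tokens.push({ type: "text", value: buffer });
          buffer = "";
          state = TAG_OPEN;
        } else {
          buffer += c;
        }
        break;
      case TAG_OPEN:
        // The first character after "<" starts the tag name.
        buffer = c;
        state = TAG_NAME;
        break;
      case TAG_NAME:
        if (c === ">") {
          tokens.push({ type: "tag", value: buffer });
          buffer = "";
          state = DATA;
        } else {
          buffer += c;
        }
        break;
    }
  }
  // Flush trailing text; an unterminated tag is dropped in this sketch.
  if (state === DATA && buffer) tokens.push({ type: "text", value: buffer });
  return tokens;
}
```

All of the state lives in a few local variables (state, buffer and the input position) rather than in a chain of method calls, which is the general shape the change log entry describes.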
Upgrade Guide from 1.0.7 to 1.1.0
In all cases, you need to check that your application does not break when it receives SVG or MathML subtrees.
- If you use the parser through the SAX, DOM or XOM API and do not pass an explicit XmlViolationPolicy to the constructor of HtmlParser, HtmlDocumentBuilder or HtmlBuilder:
  If you really wanted the old default behavior, you should now pass XmlViolationPolicy.FATAL to the constructor.
  If you did not really want to have fatal errors by default, you do not need to do anything, since ALTER_INFOSET is now the default.
- If you use the parser through the SAX, DOM or XOM API and do pass an explicit XmlViolationPolicy to the constructor of HtmlParser, HtmlDocumentBuilder or HtmlBuilder:
  You do not need to change your code to upgrade.
- If you have your own subclass of TreeBuilder:
  The abstract methods on TreeBuilder now have additional arguments for passing the namespace URI. You should upgrade your subclass to deal with the namespace URIs. (The URI is always an interned string, so you can use == to compare.)
  The entry point for passing in a SAX InputSource has moved from the Tokenizer class to the Driver class (in the io package), so you should change your references from Tokenizer to Driver.
- If you have your own implementation of TokenHandler:
  Please refer to the JavaDocs of TokenHandler. Also note the new separation of Tokenizer and Driver mentioned above.
Tags: Syntax
HTML5 Live DOM Viewer—Now in Your Browser
August 14th, 2008 · No Comments
Earlier, I blogged about running the Validator.nu HTML Parser inside Hixie’s Live DOM Viewer using the magic of the hosted mode of the Google Web Toolkit. Back then, a compiler bug in GWT 1.5 RC1 prevented the parser from running as JavaScript in the Web mode. Google has now released GWT 1.5 RC2, which contains a fix for the bug.
So without further ado, here’s Live DOM Viewer with an HTML5 parser running as JavaScript in your browser.
Try pasting in the SVG lion or some MathML in Firefox 3 and Opera 9.5.
Known problems:
- SVG use does not work in Firefox. Update: Fixed in Minefield nightlies.
- SVG does not render in Safari.
- IE does not support createElementNS and, thus, does not work at all.
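A script that depends on createElementNS can check for it up front instead of failing partway through. The following is a minimal sketch of such a check. Both helper names are made up for illustration and are not part of the Live DOM Viewer or the parser.

```javascript
// Namespace URI for SVG elements.
var SVG_NS = "http://www.w3.org/2000/svg";

// Hypothetical helper: reports whether the given document object
// can create namespaced elements at all.
function canCreateNamespacedElements(doc) {
  return !!doc && typeof doc.createElementNS === "function";
}

// Hypothetical helper: returns an SVG element, or null when the
// document object has no createElementNS method.
function createSvgElement(doc, localName) {
  if (!canCreateNamespacedElements(doc)) {
    return null;
  }
  return doc.createElementNS(SVG_NS, localName);
}
```

In a browser you would pass document as the first argument and fall back to some other behavior (or a message to the user) when null comes back.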
A big thanks to the GWT team for making this work!
Experience the HTML5 parsing algorithm in the Live DOM Viewer
June 30th, 2008 · No Comments
If you’ve investigated how browsers parse HTML, you’ve probably used Hixie’s Live DOM Viewer to see what happens. Wouldn’t it be cool, though, if you could experiment with the HTML5 parsing algorithm in the same UI? Well, now you can.
I was looking for a way to experiment with document.write() in the code base of the Validator.nu HTML Parser and I was looking for a way to let people see the parse tree output of the HTML5 parsing algorithm more easily. Instead of writing a test harness fully in Java, I thought it would be better to use the Live DOM Viewer and a browser engine as the test harness. The good news is that Google Web Toolkit makes it possible to put these pieces together, and the trunk of the Validator.nu HTML parser now comes with a document.write()-aware tokenizer driver and a tree builder subclass for GWT.
The bad news is that the Java-to-JavaScript compiler of GWT has a bug that blocks me from putting the result online as JavaScript. The Hosted Mode of GWT works, though.
Here’s how you can run the Validator.nu HTML Parser in the Live DOM Viewer locally in the Hosted Mode of GWT (on Mac or Linux):
- Check out the source: svn co http://svn.versiondude.net/whattf/htmlparser/trunk/ htmlparser
- Download and untar GWT 1.5 RC1
- On Linux, install libstdc++5 and a JDK (Ubuntu’s OpenJDK-based package worked for me).
- Edit the paths in HtmlParser-shell (Mac) or HtmlParser-linux (Linux) to point to the location of GWT.
- Run HtmlParser-shell (Mac) or HtmlParser-linux (Linux)
Known problems:
- The Linux version of GWT runs an outdated version of Gecko, and the rendered view doesn’t work. The DOM view does.
- The Mac version of GWT runs a Web Inspector-enabled version of WebKit, but SVG does not draw.
- document.write() semantics are right only for inline scripts.
- Copying and pasting using keyboard shortcuts doesn’t work. (Use the context menu.)
- On Linux, GWT prints a lot of harmless warnings about not finding annotations. (I don’t know why that happens. The annotations should be among translatables.)
- Gecko (used by GWT on Linux) doesn’t allow the creation of xmlns attributes in no namespace, so things stop working if you try to put an attribute called xmlns on HTML elements.
- The DOM view on Linux doesn’t report names with colons in them per the HTML5 spec.
(Aside: This code could have applicability beyond testing the parser. If the compiler bug were fixed or worked around, a script could document.write() a math element and an svg element to sniff whether they are parsed according to HTML5. If they aren’t, it could move aside load event handlers, document.write() <plaintext style='display:none'>, wait until DOMContentLoaded, load the already created html, head and body elements onto the tree builder stack and head pointer of the HTML5 parser, reparse the content of the plaintext element as HTML5 and call the load event handlers. See Philip Taylor’s proof of concept with S-expressions.)
Tags: Syntax