Entries Tagged as 'Syntax'
I put together a new release of the Validator.nu HTML Parser. This is a highly recommended update for everyone who is using a previous version the parser in an application.
- Fixed an issue where under rare circumstances attribute values were leaking into element content.
- Fixed a bug where
isindex processing added attributes to all elements that were supposed to have no attributes.
- Implemented spec changes. (Too numerous to enumerate, but, as a highlight, framesets parse much better now.)
- Moved to WebKit-style foster parenting.
- Changed the API for tree builder subclasses again due to new constraints. If you have previously written your own tree builder subclass, you need to change it.
- Fixed the bundled XML serializer.
- Made it possible to generate a C++ version that does not leak memory from the Java source.
- Removed the C++ translator from the release. (Get it from SVN.)
[Read more →]
Tags: Processing Model · Syntax
Internet Explorer poses a small challenge when it comes to making use of the new elements introduced in HTML5. Among others, these include elements like section, article, header and footer. The problem is that due to the way parsing works in IE, these elements are not recognised properly and result in an anomalous DOM representation.
To illustrate, consider this simple document fragment:
<body>
<section>
<p>This is an example</p>
</section>
</body>
Strangely, IE 6, 7 and 8 all fail to parse the section element properly and the resulting DOM looks like this.
BODY
SECTION
P
#text: This is an example
/SECTION
Notice how IE actually creates 2 empty elements. One named SECTION and the other named /SECTION. Yes, it really is parsing the end tag as a start tag for an unknown empty element.
There is a handy workaround available to address this problem, which was first revealed in a comment by Sjoerd Visscher.
The basic concept is that by using document.createElement(tagName) to create each of the unknown elements, the parser in IE then recognises those elements and parses them in a more reasonable and useful way. e.g. By using the following script:
document.createElement("section");
The resulting DOM for the fragment given above looks like this:
BODY
section
P
#text: This is an example
This same technique works for all unknown elements in IE 6, 7 and 8. Note that there is a known bug that prevented this from working in IE 8 beta 2, but this has since been resolved in the latest non-public technical preview.
For convenience, Remy Sharp has written and published a simple script that provides this enhancement for all new elements in the current draft of HTML5, which you can download and use.
This script is not needed for other browsers. Opera 9, Firefox 3 and Safari 3 all parse unknown elements in a more reasonable way by default. Note, however, that Firefox 2 does suffer from some related problems, for which there is unfortunately no known solution; but it is hoped that given the faster upgrade cycle for users of Firefox, relatively speaking compared with IE, Firefox 2 won’t pose too much of a problem in the future.
[Read more →]
Tags: Browsers · DOM · Elements · Events · Syntax
September 26th, 2008 · No Comments
I gave a talk at Google on Monday demonstrating the various features of HTML5 that are implemented in browsers today. The video is now on YouTube, so now you too can watch and laugh at my lame presentation skills!
The segments of this talk are as follows. Some of the demos are available online for you to play with and are linked to from the following list:
- Introduction
-
<video> (00:35)
-
postMessage() (05:40)
-
localStorage (15:20)
-
sessionStorage (21:00)
- Drag and Drop API (29:05)
-
onhashchange (37:30)
- Form Controls (40:50)
-
<canvas> (56:55)
- Validation (1:07:20)
- Questions and Answers (1:09:35)
If you’re very interested in watching my typos, the high quality version of the video on the YouTube site is clear enough to see the text being typed. More details about the demos can be found on the corresponding demo page.
[Read more →]
Tags: Browser API · Browsers · Conformance Checking · DOM · Elements · Events · Forms · Multimedia · Syntax · WHATWG
August 25th, 2008 · 1 Comment
I have released a new version of the Validator.nu HTML Parser (an implementation of the HTML5 parsing algorithm in Java). The new release supports SVG and MathML subtrees, is faster than the old version, fixes bugs, is more portable and supports applications that want to do document.write().
The parser comes with a sample app that makes it possible to use XSLT programs written for XHTML5+SVG+MathML with text/html.
Warning! The internal APIs have changed. Please refer to the Upgrade Guide below.
Change Log
- Made the SAX, DOM and XOM parser entry point constructors default to altering the infoset instead of throwing when the input needs coercing to be an XML 1.0 4th ed. plus Namespaces infoset.
- Isolated Java IO dependent code from the parser core. The parser core now compiles on Google Web Toolkit.
- Refactored the tokenizer to use a
switch branch per state instead of method per state.
- Made various performance tweaks to the tokenizer.
- Implemented support for MathML and SVG foreign content. (Note that the SVG part is based on spec text that has been commented out from the spec at the request of the SVG WG.)
- Made the parser suspendable after any input character.
- Made it possible for custom
TreeBuilder subclasses to request parser suspension. (Applications wishing to implement document.write() should provide their own TreeBuilder subclass and a document.write()-aware replacement of the Driver class. Look in the gwt-src/ directory for sample code.)
- Made changes to the parser core to make it more suitable for mechanical translation into other object-oriented programming languages that have C-like control structures but not necessarily a garbage collector (with focus on targeting C++). This work is not complete.
- Made the HTML serializer do the right thing when input represents a conforming XHTML+SVG+MathML tree. (Results may be bad for non-conforming input trees.)
- Developed sample programs for converting between HTML5 and XHTML5 when the input is known to be conforming.
- Provided an XML serializer so that the sample code no longer depends on the Xalan serializer.
- Improved API documentation.
- Fixed bugs in the tokenizer, tree builder and the input stream character encoding decoder.
- Made coercion to an XML infoset work according to the HTML5 spec.
- Added ID uniqueness checking.
- Various other fixes.
Upgrade Guide from 1.0.7 to 1.1.0
In all cases, you need to check that your application does not break when it receives SVG or MathML subtrees.
- If you use the parser through the SAX, DOM or XOM API and do not pass an explicit
XmlViolationPolicy to the constructor of HtmlParser, HtmlDocumentBuilder or HtmlBuilder:
-
If you really wanted the old default behavior, you should now pass XmlViolationPolicy.FATAL to the constructor.
If you did not really want to have fatal errors by default, you do not need to do anything, since ALTER_INFOSET is now the default.
- If you use the parser through the SAX, DOM or XOM API and do pass an explicit
XmlViolationPolicy to the constructor of HtmlParser, HtmlDocumentBuilder or HtmlBuilder:
-
You do not need to change your code to upgrade.
- If you have your own subclass of
TreeBuilder:
-
The abstract methods on TreeBuilder now have additional arguments for passing the namespace URI. You should upgrade your subclass to deal with the namespace URIs. (The URI is always an interned string, so you can use == to compare.)
The entry point for passing in a SAX InputSource has moved from the Tokenizer class to the Driver class (in the io package), so you should change your references from Tokenizer to Driver.
- If you have your own implementation of
TokenHandler:
-
Please refer to the JavaDocs of TokenHandler. Also note the new separation of Tokenizer and Driver mentioned above.
[Read more →]
Tags: Syntax
Earlier, I blogged about running the Validator.nu HTML Parser inside Hixie’s Live DOM Viewer using the magic of the hosted mode of the Google Web Toolkit. Back then, a compiler bug in GTW 1.5 RC1 prevented the parser from running as JavaScript in the Web mode. Google has now released GWT 1.5 RC2, which contains a fix for the bug.
So without further ado, here’s Live DOM Viewer with an HTML5 parser running as JavaScript in your browser.
Try pasting in the SVG lion or some MathML in Firefox 3 and Opera 9.5.
Known problems:
- SVG
use does not work in Firefox. Update: Fixed in Minefield nightlies.
- SVG does not render is Safari.
- IE does not support
createElementNS and, thus, does not work at all.
A big thanks for the GWT team for making this work!
[Read more →]
Tags: DOM · Syntax
If you’ve investigated how browsers parse HTML, you’ve probably used Hixie’s Live DOM Viewer to see what happens. Wouldn’t it be cool, though, if you could experiment with the HTML5 parsing algorithm in the same UI? Well, now you can.
I was looking for a way to experiment with document.write() in the code base of the Validator.nu HTML Parser and I was looking for a way to let people see the parse tree output of the HTML5 parsing algorithm more easily. Instead of writing a test harness fully in Java, I thought it would be better to use the Live DOM Viewer and a browser engine as the test harness. The good news is that Google Web Toolkit makes it possible to put these pieces together, and the trunk of the Validator.nu HTML parser now comes with a document.write()-aware tokenizer driver and a tree builder subclass for GWT.
The bad news is that the Java-to-JavaScript compiler of GWT has a bug that blocks me from putting the result online as JavaScript. The Hosted Mode of GWT, works, though.
Here’s how you can run the Validator.nu HTML Parser in the Live DOM Viewer locally in the Hosted Mode of GWT (on Mac or Linux):
- Check out the source: svn co http://svn.versiondude.net/whattf/htmlparser/trunk/ htmlparser
- Download and untar GWT 1.5 RC1
- On Linux, install libstdc++5 and a JDK (Ubuntu’s OpenJDK-based package worked for me).
- Edit the paths in
HtmlParser-shell (Mac) or HtmlParser-linux (Linux) to point to the location of GWT.
- Run
HtmlParser-shell (Mac) or HtmlParser-linux (Linux)
Known problems:
- The Linux version of GWT runs an outdated version of Gecko, and the rendered view doesn’t work. The DOM view does.
- The Mac version of GWT runs a Web Inspector-enabled version of WebKit, but SVG does not draw.
document.write() semantics are right only for inline scripts.
- Copying and pasting using keyboard shortcuts doesn’t work. (Use the context menu.)
- On Linux, GTW prints a lot of harmless warnings about not finding annotations. (I don’t know why that happens. The annotations should be among translatables.)
- Gecko (used by GTW on Linux) doesn’t allow the creation of xmlns attributes in no namespace, so things stop working if you try to put an attribute called
xmlns on HTML elements.
- The DOM view on Linux doesn’t report names with colons in them per the HTML5 spec.
(Aside: This code could have applicability beyond testing the parser. If the compiler bug were fixed or worked around, a script could document.write() a math element and an svg element to sniff if they are parsed according to HTML5 and if they aren’t, move aside load event handlers, document.write() <plaintext style='display:none'>, wait until DOMContentLoaded, load the the already created html, head and body elements onto the tree builder stack and head pointer of the HTML5 parser to and reparse the content of the plaintext element as HTML5 and call the load event handlers. See Philip Taylor’s proof of concept with S-expressions.)
[Read more →]
Tags: Syntax
One of the newly introduced features in HTML 5 is the ability to mark up reverse ordered lists. These are the same as ordered lists, but instead of counting up from 1, they instead count down towards 1. This can be used, for example, to count down the top 10 movies, music, or LOLCats, or anything else you want to present as a countdown list.
In previous versions of HTML, the only way to achieve this was to place a value attribute on each li element, with successively decreasing values.
<h3>Top 5 TV Series</h3>
<ol>
<li value="5">Friends
<li value="4">24
<li value="3">The Simpsons
<li value="2">Stargate Atlantis
<li value="1">Stargate SG-1
</ol>
The problem with that approach is that manually specifying each value can be time consuming to write and maintain, and the value attribute was not allowed in the HTML 4.01 or XHTML 1.0 Strict DOCTYPEs (although HTML 5 fixes that problem and allows the value attribute)
The new markup is very simple: just add a reversed attribute to the ol element, and optionally provide a start value. If there’s no start value provided, the browser will count the number of list items, and count down from that number to 1.
<h3>Greatest Movies Sagas of All Time</h3>
<ol reversed>
<li>Police Academy (Series)
<li>Harry Potter (Series)
<li>Back to the Future (Trilogy)
<li>Star Wars (Saga)
<li>The Lord of the Rings (Trilogy)
</ol>
Since there are 5 list items in that list, the list will count down from 5 to 1.
The reversed attribute is a boolean attribute. In HTML, the value may be omitted, but in XHTML, it needs to be written as: reversed="reversed".
The start attribute can be used to specify the starting number for the countdown, or the value attribute can be used on an li element. Subsequent list items will, by default, be numbered with the value of 1 less than the previous item.
The following example starts counting down from 100, but omits a few items from the middle of the list and resumes from 3.
<h3>Top 100 Logical Fallacies Used By Creationists</h3>
<ol reversed="reversed" start="100">
<li>False Dichotomy</li>
<li>Appeal to Ridicule</li>
<li>Begging the Question (Circular Logic)</li>
<!-- Items omitted here -->
<li value="3">Strawman</li>
<li>Bare Assertion Fallacy</li>
<li>Argumentum ad Ignorantiam</li>
</ol>
[Read more →]
Tags: Elements · Syntax
There have been lots and lots of e-mail on the public-html mailing list about making the alt attribute syntactically required in HTML5. At the core of this debate is on one hand using HTML5 validators to send a strong message about accessibility and on the other hand of avoiding a situation where a simplified and idealistic strong message leads to behavior that is counterproductive considering the goal of making the Web accessible. As a policy debate, it is similar to abstinence-only sex education debates.
A validator is a computer program and cannot tell if a textual alternative is appropriate for a given image in a given context. That’s why accessibility checking needs to be done by a person. A person may use a software tool to make the checking easier, but trusting on fully automated software to determine whether a page is accessible is misguided.
Given this basic problem, a policy that insists on the alt attribute always being present doesn’t necessarily lead to accessibility. In fact, considering that syntactic correctness and accessibility are different evaluation axes both in terms of computability and in terms of how HTML authors (other than accessibility advocates) tend to view things (judging from observations about the behavior of HTML authors who use validators), a policy that insists on the alt attribute being always present will likely cause people to put the attribute in there but with inappropriate content. In particular, putting an empty alt on images whose presence is important for understanding the context of other content is bad, because in that case the presence of those images is concealed from a non-graphical user. Also, a textual alternative that just says “image” is not an improvement over what, for example, Safari with VoiceOver says in the absence of alt, but would be worse than a smarter client-side heuristic.
Furthermore, there is a very real case where a textual alternative simply isn’t available to the HTML generator: a user uploads photos to a content management system and refuses to supply textual alternatives at the same moment. HTML 4 didn’t account for this case. In fact, requiring alt to under all circumstances assumes that markup is written by a person who knows what the images are at the time of writing markup. It doesn’t make sense to pretend that the case where the markup generator doesn’t have textual alternatives available doesn’t exist. The HTML 5 syntax needs to account for all use cases.
Expecting markup generators to knowingly emit markup that is not valid is not a winning proposition. Quoting me from 2006:
Authoring tools are judged by taking a page authored using the
tool and running it through the W3C Validator or, presumably in the
future, through an HTML5 conformance checker. Authoring tool makers
who
are capable of making their tool produce syntactically conforming
documents will want to do so and minimize the chance that the users of
their software tarnish the reputation of the tool in the eyes of
people
who use an automated test as a litmus test of authoring tool bogosity.
(People who test tools that way will outnumber the people who make a
more profound analysis due to the “validate, validate, validate”
propaganda.)
To summarize: As a matter of principle, subjective checking or checking that is not applicable for all pages does not belong in the validation function. Practice is more important than principle, though. Baking the alt requirement into the validation function would be bad when the user of the validation function wants a clean report on syntax but isn’t as concerned with accessibility. It is bad for accessibility when authors put the simplest value that silences the validator into the attribute in order to make the validation report look clean, since doing so gives user agents like Safari with VoiceOver less information to work with. That’s why I think the requirement to have an alt attribute present doesn’t belong in the validation function also as a practical matter.
It turns out, though, that some people think of validation as a first step toward accessibility, even though syntactic correctness and accessibility really are different evaluation axes. They expect a validator to help them flag images that are lacking a textual alternative. Moreover, the alt issue seems to be taken as the single most important web accessibility issue with the rest of issues somewhere in the long tail. When there is a demand for validators to flag images without alt, validators probably should meet the demand.
To this end, I have developed a new feature for Validator.nu: Image Report. This new feature is not part of the validation function. It also doesn’t do exactly want people are asking of the syntax definition in the long e-mail thread. (It is not a new idea for a validator user interface to offer tools that help a human perform an assessment about the page outside the validation function. For example, the W3C Validator has offered a “Show Document Outline” feature, which is also on file as a request for enhancement for Validator.nu.)
The new feature tries to address the issue of finding missing textual alternatives but it also seeks to address the issue of faulty textual alternatives. Furthermore, it seeks to address these in a way that doesn’t induce people to write bad textual alternatives in order to make the report look cleaner.
When you turn the feature on, it always lists all the images. There is no textual alternative you can fake to make the list look shorter. Instead, there are four categories and you can only change the category in which an image appears.
This has the benefit of removing the badge hunting problem: people trying to silence the validator without actually raising the quality of their page. However, it also has the benefit that the user can review the textual alternatives for appropriateness and the user can review that the right images have been marked as omitted from non-graphical presentation. Since this tool addresses more problems than simply making alt required on the syntax level, I believe this solution is much better than furiously staying entrenched in the status quo of HTML 4 validation, fearing so much a step backwards as to being too afraid to explore steps forward.
Finally, it should be noted that this feature is, by necessity, itself inaccessible to people who cannot view bitmap images. Yet, I think it is legitimate for this feature to be implemented with an HTML user interface. Also, this feature itself is a case where the generator of the user interface markup has no knowledge of the content of the images it is presenting to the user. Hence, it is itself an example of omitting the alt attribute. It would be truly ironic, if the syntax definition of HTML5 prevented Validator.nu from being self-validating.
[Read more →]
Tags: Conformance Checking · Syntax
There is now a new release of the Validator.nu HTML Parser. Change highlights:
- Adds optional support for heuristic encoding sniffing using the ICU4J sniffer, jchardet or both.
- Adds support for rewinding and reparsing when becoming confident about the character encoding and the tentative encoding was wrong.
- Performs encoding name matching per spec instead of using the JDK mechanism.
- Implements spec changes up until just before SVG and MathML support. (Those will merit 1.1 or something.)
- Warning: The semantics of the doctype token have changed in case you have your own token handler (unlikely).
[Read more →]
Tags: Processing Model · Syntax
Due to implementation details, the HTML5 facet of Validator.nu used to ignore the content of obsolete elements such as center, because obsolete elements were simply unknown. This wasn’t particularly useful when assessing the HTML5-upgradeability of an existing design that wrapped everything in center, for example.
The HTML5 facet of Validator.nu now knows about obsolete container elements that existed as deprecated in HTML 4.01. This means that center is still an error, but the contents are now checked as HTML5.
Also, Validator.nu now allows legacy-style internal encoding declarations per the latest Editor’s Draft.
[Read more →]
Tags: Conformance Checking · Syntax