2010-09-20

Technical study: multi-page HTML rendering

What we need

How hard would that be to render a single Novelang document over multiple HTML pages? Better ask: how cool would that be? Think about Novelang documentation taking one single huge page. This makes non-linear reading quite uncomfortable. Of course, multi-page rendering should work for both batch and interactive mode.

Technical implications

For batch rendering, there can be a simple approach. Xalan (XSLT rendering engine) offers the redirect extension for redirecting output into a given file.

<xsl:template match="/doc/foo">
  <redirect:write select="@file">
    <foo-out>
      <xsl:apply-templates/>
    </foo-out>
  </redirect:write>
</xsl:template>

Unfortunately, this is not suitable for interactive rendering. For interactive rendering, the endering process must known both:

  • The requested page, through a URL aware of the page (as sub-part of the whole document).
  • The whole document, because we may need to render links to other chapters or whatever.

The same need arises for batch rendering but with Xalan’s Redirect extension mentioned above, the whole logic gets buried inside the XSLT (which probably makes it quite complex).

Obviously, we need a Renderer to work the same way for batch and interactive rendering, e. g. there should be no special handling of interactive or batch rendering in the XSL stylesheet. (But multi-page rendering would require a special stylesheet anyways, at least for generating navigation.)

While XSLT-based rendering is the most common case in Novelang, it’s better to think about the general contract of a org.novelang.rendering.Renderer. As it already does, the Renderer should spit bytes into a java.io.OutputStream with no knowledge wether it is a file or a socket. The job of creating the output (which means chosing a file name in the case of batch rendering) is left to some upstream object opening the OutputStream. Currently, this is done by org.novelang.batch.DocumentGenerator or org.novelang.daemon.DocumentHandler which both end by calling DocumentProducer, passing it the OutputStream.

So rendering stage needs additional logic. Interactive rendering implies to extract the requested page from the URL. Batch rendering implies to find the list of pages to create corresponding files on the filesystem.

New Renderer contract

There can’t be unique way to split a document into pages, so we have new responsabilties for our Renderer:Given a document tree (as a org.novelang.common.SyntacticTree) it calculates a list of page identifiers.Given the same document tree, plus a page identifier, it renders the corresponding page to an OutputStream.With something like a single empty page identifier, we should get the same single-page rendering as we have now.

For an XSLT-based Renderer, we should embed page identifiers generation in the same XSL stylesheet (as a part of already-discussed stylesheet metadata ):

<xsl:stylesheet [namespaces blah blah] >

  <nlm:multipage>
    <!-- 
      Some XSL tranformations here,
      starting from  element. 
    -->
  </nlm:multipage>

  ...

Before page rendering occurs, Novelang asks the Renderer for page identifiers. The default XSL-based Renderer applies the content of the element (if there is one) as a stylesheet on the whole document tree. Then it obtains a list of page identifiers as follows:

<pages>
  <page name="Home" >/n:opus
  <page name="ChapterOne" >/n:opus/n:level[1]
  <page name="ChapterTwo" >/n:opus/n:level[2]
<pages>

Of course each page name is unique. In order to achieve this with no tweak, the document tree may embed unique identifiers by extending the semantic of n:implicit-identifier or by adding a new n:unique-identifier element. Node paths seem easy to generate .

Now for each page, Novelang creates the corresponding file out of the page name. If the stylesheet in the did chose filesystem-friendly names, those will be used verbatim (otherwise we may apply some variant of URL encoding). And, for each page, Novelang calls the Renderer with the whole document tree again, and passes additional metadata elements to tell the Renderer which page it is rendering. Input XML looks like this:

<n:opus>
  <n:meta>
    <n:page-name>ChapterOne
    <n:page-path>/n:opus/n:level[1]
  </n:meta>
</n:opus>

This should be enough for the Renderer to figure how to render only the page of interest. It might need to peek elsewhere in the document tree (like for a footer with a copyright notice, or find other chapter names for a navigation bar).

Mix with other features (present or future)

There is an additional role for node identifiers: they might help to “enhance” internal links by adding the prefix corresponding to the target page. (The internal link feature is yet in inception phase. It just seems easier to implement it right after multipage rendering.)

Unique page names

Novelang’s Fragment Identifier is the perfect candidate to generate page identifers. Unfortunately, composite identifier contain the \ character. Should we escape it, or mix it with some weird pseudo-directory feature? But maybe it’s time to remove relative identifiers which never proved useful, and don’t guarantee identifer uniqueness, anyways.

It’s easy to create a new element by adding a simple counter to a colliding identifier. The value for some given document fragment may change across several generations, when adding fragments with colliding identifiers. This won’t be a problem for internal links (links defined by the document itself) prohibit usage of unique identifier. Remember: unique identifiers are only for pure HTML links.

If there is a chance that a foreign HTML documents links to the HTML anchor defined by the unique identifier (in a pure WWWW – World Wide Web Way) then document author should use explicit identifiers.

New URL scheme

With single-page rendering, the rendered document has the same name as the source document (with the difference of the extension). Multi-page adds a new “dimension”. Because the name of the page may collide with another document’s name, the name of the originating document prefixes the page name. Let’s look at different options:

/main/documentation~syntax.html
/main/documentation!syntax.html
/main/documentation,syntax.html
/main/documentation^syntax.html
/main/documentation--syntax.html

Let’s see which character we could use (only checked on Mac OS X, to do: check on Windows):

Character Escaped? Comments
~ No Already used for Novelang meta pages.
- No Already used for Novelang identifiers.
^ No Meaningless in that context.
# No Fragment in URL.
, No Hard to distinguish from full stop . character.
! No Hard to read.
_ No Too common in file names.
+ No Used in URL encoding. Usage unrelated to “plus” meaning.
% No Used in URL encoding.
= Yes Used in URL encoding. Usage unrelated to “equality” meaning.
$ Yes Overused.
; Yes Hard to read.
| Yes Hard to read.
' Yes Hard to read.
& Yes Already used for URL parameters.
? Yes Already used for URL parameters, DOS wildcard.
@ Yes Inverted meaning if page name appears second.
{ Yes Weird because unpaired. Meaningful otherwise.
§ Yes Mac OS X console doesn’t like it.
: - Path separator on Unix.

The “Escaped?” column means, it requires escaping on Mac OS X console.

Finally, it turns out that -- looks the best, especially with a variable-width font like in Mac OS X Finder or Windows Explorer.

Special case: if the page identifier was blank, the page separator doesn’t appear so we would still have:

/main/documentation.html

This naming scheme also implies that all pages appear flatly in the same directory. This should help when resolving resource names.

No comments: