2009-07-26

Rendering document source samples

A nice feature in the documentation would be to show the Novelang source and the rendering result at the same time. There are several ways to achieve this:

— Duplicate the source code in the Novelang document. One is escaped, one is not. The latter gets rendered in the document itself, in a n:paragraphs-inside-angled-bracket-pairs element with a special tag. For now this won’t work in many cases, like levels or lines of literal.

— Reference a screenshot of a previous rendering. This is the most stupid solution because it’s boring to do and hard to keep up-to-date.

— Be clever and generate the image dynamically from the source snippet.

Rendering tools

How to render a PDF fragment into an embeddable image?

IcePDF claims to be open source but the license doesn’t appear on the Web site and downloading the product requires registration. Anyway, the Java WebStart’ed demo doesn’t display anything except a pair of messages telling it’s a trial version. This behavior was observed on Mac OS X 10.5 and Java 6.

PDFRenderer is available under LGPL. The project seems a bit asleep for now; it looks like a dump-everything-to-the-community effect of Sun’s policy last years. PDFRenderer does a nice job with many PDF, but Novelang-generated ones appear severely broken!

PDFBox is licensed under the Apache License, but contains license notices from Adobe (for AFM fonts) and Sun (for JAI). A close look at PDFBox-7.3.jar shows it embeds those AFM fonts.

Since PDFBox-7.3 doesn’t work (spits an exception), let’s check a snapshot out! This is revision 795516 or something. The build goes well, and image generation doesn’t crash. But the text in images appears seriously damaged! And the font doesn’t look correct. The original was created using Linux Libertine; images contain a Helvetica-like which may not have the same metrics. And all text in non-proportional fonts doesn’t appear at all.

Should I give up my dream of finding an OSS solution for rendering images out from FOP-generated PDF documents? Debugging FOP or PDFRenderer looks like a lot of work. And, while it’s easier to get perfect control on PDF rendering, HTML rendering may be enough for creating the samples.

So here comes Flying Saucer to the rescue. It’s pure Java XHTML renderer which supports CSS 2.1. I’ve used it already and I know it works. The “inheritable” nature of CSS means I can tweak the output a bit (reducing margins and page width) while reusing the default CSS stylesheet.

Finally, all this product review turns to be nonsense, because FOP is supposed to generate images directly ! Insanely great!

Integration to Novelang

Here comes hard stuff. Including external resources depends if the document is self-contained (PDF) or multipart (HTML), and if document is generated by generator (batch) or HTTP dæmon (interactive). As a self-contained document, PDF is generated the same way wether it’s a batch or interactive context.

The FO stylesheet may manage image embedding into the PDF, thus avoiding to spread complexity elsewhere. For SVG, the fo:instream-foreign-object allows direct inclusion of the XML. For images, the architecturally-simple approach would be to write a FOP extension taking the code snippet as parameter, then inserting the rendered image into the Area Tree .

Using external files only makes sense when generating an HTML documents, because we’re pretty sure in this case that user agent won’t request the image before it can read its address from the HTML. For PDF documents, the temporary file must exist before running the FO stylesheet, so it would require some kind of ugly pre-processing.

External files are generated “once-for-all” in batch mode. But, in interactive mode, how long should they live? And does it make sense to write files on the filesystem while the resource could be dynamically generated?

Dynamic resources could be kept in some session-scoped cache. This is how it would work:No need to cache the generated image, only the source snippet. This allows deferred generation.The HTTP session contains several cache areas, one per document name.When a fresh document is generated, reset the whole cache area for this document name.During XSLT processing, call an XSL extension that feeds the cache with snippets.Given a snippet, the cache returns some kind of identifier to be inserted as a link in resulting HTML.A special resource handler (at HTTP dæmon level) queries the cache with the identifiers.If the cache has such a snippet, then it returns it for rendering.

The XSL extension called by the stylesheet could trigger two different behavior, wether it’s dæmon or interactive mode:Use the caching stuff as above.Just write the image file on the filesystem.

How to invalidate the cache? Session expiration is not enough: if several documents exist in the same session, some may become unused therefore causing excessive memory consumption. To avoid this, turn reference to “old” documents to soft references so the JVM would clean them upon memory demand.

Bad behavior would occuring when trying to load an image inside an HTML page after the cache got cleaned by some way (session expiration or memory reclaim) and prior to refreshing the whole page. This sounds like a tolerable annoyance.

Conclusion(s)

I wanted to avoid coding whatever looks like a cache for as long as possible. Now there is a case where caching is linked to a feature out of the performance scope. Anyways, the cache described above is a “toy” cache. Real caching would take the whole resource graph (source documents, images, stylesheets and so one) in account.

Dynamically-generated images could also make sense for rendering ASCII Math for Web browsers which don’t support SVG.

As often, a bit of additional comfort requires a lot of work.

2 comments:

Laurent Caillette said...

Also: JPodRenderer from the JPod suite. These are PDF manipulation and rendering suite. The core product may be solid, but documentation is quite thin.

Laurent Caillette said...

Jeremias Märki (FOP committer) has written a tool for inserting PDF as images.