The Novelang blog: July 2009

2009-07-26

Novelang-0.31.1 released!

This release brings minor fixes around the new block-after-tilde feature. It can be downloaded here. See the documentation for details.

Novelang has no semantic markup; instead it creates an AST (Abstract Syntax Tree) to feed a stylesheet with. This allows creating document-specific idioms, to be handled at the stylesheet level. Here is one.

Starting from a source document like this:

<<
[INFO] This is an info block.
>>

<<
[WARNING] Beware of "this" paragraph.

(This warning spreads on several paragraphs.)
>>

We want these blocks to appear in a special manner (for example within a frame, with a special icon in the margin). Here is how to achieve this:

<?xml version="1.0"?>
<xsl:stylesheet
   xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0"
   xmlns:n="http://novelang.org/book-xml/1.0"
>
 <xsl:import href="default-html.xsl" />
 <xsl:import href="punctuation-FR.xsl" />

 <xsl:template match="/" >
   <xsl:apply-imports/>
 </xsl:template>

 <xsl:template match="n:paragraphs-inside-angled-bracket-pairs" >
   <xsl:choose>
     <xsl:when 
         test="n:paragraph-regular[1]/n:block-inside-square-brackets[1]='WARNING'" 
     >
       <blockquote>
         <b>WARNING</b><br/>
         <xsl:apply-templates />
       </blockquote>
     </xsl:when>
     <xsl:when 
         test="n:paragraph-regular[1]/n:block-inside-square-brackets[1]='INFO'" 
     >
       <blockquote>
         <b>INFO</b><br/>
         <xsl:apply-templates />
       </blockquote>
     </xsl:when>
     <xsl:otherwise>
       <blockquote>
         <xsl:apply-templates />
       </blockquote>
     </xsl:otherwise>
   </xsl:choose>

 </xsl:template>

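 <!-- Swallow the [WARNING] / [INFO] marker itself: the label is already rendered by the template above. -->
 <xsl:template 
     match="n:block-inside-square-brackets[ text()='WARNING' or text()='INFO' ]" 
 />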

</xsl:stylesheet>

Rendering document source samples

A nice feature in the documentation would be to show the Novelang source and the rendering result at the same time. There are several ways to achieve this:

— Duplicate the source code in the Novelang document. One copy is escaped, one is not. The latter gets rendered in the document itself, in an n:paragraphs-inside-angled-bracket-pairs element with a special tag. For now this won’t work in many cases, like levels or lines of literal.

— Reference a screenshot of a previous rendering. This is the most stupid solution because it’s boring to do and hard to keep up-to-date.

— Be clever and generate the image dynamically from the source snippet.

Rendering tools

How to render a PDF fragment into an embeddable image?

IcePDF claims to be open source, but the license doesn’t appear on the Web site and downloading the product requires registration. Anyway, the Java WebStart’ed demo doesn’t display anything except a pair of messages saying it’s a trial version. This behavior was observed on Mac OS X 10.5 with Java 6.

PDFRenderer is available under the LGPL. The project seems a bit asleep for now; it looks like a dump-everything-to-the-community effect of Sun’s policy in recent years. PDFRenderer does a nice job with many PDFs, but Novelang-generated ones appear severely broken!

PDFBox is licensed under the Apache License, but contains license notices from Adobe (for AFM fonts) and Sun (for JAI). A close look at PDFBox-7.3.jar shows it embeds those AFM fonts.

Since PDFBox-7.3 doesn’t work (it spits out an exception), let’s check out a snapshot! This is revision 795516 or something. The build goes well, and image generation doesn’t crash. But the text in the images appears seriously damaged! And the font doesn’t look correct. The original was created using Linux Libertine; the images contain a Helvetica-like font which may not have the same metrics. And text in non-proportional fonts doesn’t appear at all.

Should I give up my dream of finding an OSS solution for rendering images from FOP-generated PDF documents? Debugging FOP or PDFRenderer looks like a lot of work. And, while it’s easier to get perfect control over PDF rendering, HTML rendering may be enough for creating the samples.

So here comes Flying Saucer to the rescue. It’s a pure-Java XHTML renderer which supports CSS 2.1. I’ve used it already and I know it works. The “inheritable” nature of CSS means I can tweak the output a bit (reducing margins and page width) while reusing the default CSS stylesheet.

Finally, this whole product review turns out to be nonsense, because FOP is supposed to generate images directly! Insanely great!

Integration to Novelang

Here comes the hard stuff. How to include external resources depends on whether the document is self-contained (PDF) or multipart (HTML), and on whether it is generated by the batch generator or by the HTTP dæmon (interactive). As a self-contained document, a PDF is generated the same way whether the context is batch or interactive.

The FO stylesheet may manage image embedding into the PDF, which avoids spreading complexity elsewhere. For SVG, fo:instream-foreign-object allows direct inclusion of the XML. For images, the architecturally simple approach would be to write a FOP extension that takes the code snippet as a parameter, then inserts the rendered image into the Area Tree.

Using external files only makes sense when generating HTML documents, because in this case we’re pretty sure that the user agent won’t request the image before it has read its address from the HTML. For PDF documents, the temporary file must exist before running the FO stylesheet, so it would require some kind of ugly pre-processing.

External files are generated “once-for-all” in batch mode. But, in interactive mode, how long should they live? And does it make sense to write files on the filesystem while the resource could be dynamically generated?

Dynamic resources could be kept in some session-scoped cache. This is how it would work:

— No need to cache the generated image, only the source snippet. This allows deferred generation.

— The HTTP session contains several cache areas, one per document name.

— When a fresh document is generated, reset the whole cache area for this document name.

— During XSLT processing, call an XSL extension that feeds the cache with snippets.

— Given a snippet, the cache returns some kind of identifier to be inserted as a link in resulting HTML.

— A special resource handler (at HTTP dæmon level) queries the cache with the identifiers.

— If the cache has such a snippet, then it returns it for rendering.
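The cache described here could be sketched in Java roughly as follows. This is only a minimal illustration, not Novelang code: the class name `SnippetCache`, its methods and the `snippet-N` identifier scheme are all assumptions made up for the example. One instance would live in each HTTP session.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/**
 * Hypothetical session-scoped cache: one area per document name,
 * holding source snippets only (images get rendered on demand).
 */
final class SnippetCache {

    // Document name -> (identifier -> source snippet).
    private final Map<String, Map<String, String>> areas = new HashMap<>();

    /** Called when a fresh document is generated: drops every snippet of that document. */
    synchronized void reset(String documentName) {
        areas.put(documentName, new HashMap<>());
    }

    /** Called from the XSL extension: stores the snippet, returns an identifier to use in a link. */
    synchronized String register(String documentName, String snippet) {
        Map<String, String> area = areas.computeIfAbsent(documentName, name -> new HashMap<>());
        String identifier = "snippet-" + area.size(); // Unique within one area (assumed scheme).
        area.put(identifier, snippet);
        return identifier;
    }

    /** Called by the resource handler: returns the snippet to render, if it is still cached. */
    synchronized Optional<String> find(String documentName, String identifier) {
        Map<String, String> area = areas.get(documentName);
        return area == null ? Optional.empty() : Optional.ofNullable(area.get(identifier));
    }
}
```

An empty result from `find` corresponds to the case where the page refers to a snippet that has already been dropped.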

The XSL extension called by the stylesheet could trigger two different behaviors, depending on whether it runs in interactive (dæmon) or batch mode:

— Interactive: use the caching stuff as above.

— Batch: just write the image file on the filesystem.

How to invalidate the cache? Session expiration is not enough: if several documents exist in the same session, some may become unused, causing excessive memory consumption. To avoid this, turn references to “old” documents into soft references so the JVM can clear them when memory runs short.

Bad behavior would occur when trying to load an image inside an HTML page after the cache got cleaned in some way (session expiration or memory reclaim) and before the whole page is refreshed. This sounds like a tolerable annoyance.
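A minimal sketch of the soft-reference idea, using the standard `java.lang.ref.SoftReference` class. The `DocumentAreas` class and its methods are hypothetical, and so is the policy: here, any document other than the most recently generated one counts as “old”.

```java
import java.lang.ref.SoftReference;
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical holder of cache areas: the most recently generated document
 * is strongly referenced, older ones only softly, so the JVM may reclaim them.
 */
final class DocumentAreas<A> {

    private final Map<String, SoftReference<A>> oldAreas = new HashMap<>();
    private String currentName;
    private A currentArea; // Strong reference: never reclaimed while current.

    /** Makes the given document the current one, demoting the previous one to "old". */
    synchronized void use(String documentName, A area) {
        if (currentName != null && !currentName.equals(documentName)) {
            oldAreas.put(currentName, new SoftReference<>(currentArea));
        }
        oldAreas.remove(documentName);
        currentName = documentName;
        currentArea = area;
    }

    /** Returns the area of the document, or null if unknown or reclaimed by the JVM. */
    synchronized A get(String documentName) {
        if (documentName.equals(currentName)) {
            return currentArea;
        }
        SoftReference<A> reference = oldAreas.get(documentName);
        if (reference == null) {
            return null;
        }
        A area = reference.get();
        if (area == null) {
            oldAreas.remove(documentName); // The referent was cleared; forget the stale entry.
        }
        return area;
    }
}
```

The JVM guarantees that softly reachable objects are cleared before it throws an `OutOfMemoryError`; exactly when they get cleared before that point is up to the garbage collector.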

Conclusion(s)

I wanted to avoid coding anything that looks like a cache for as long as possible. Now there is a case where caching is tied to a feature, outside the scope of performance. Anyway, the cache described above is a “toy” cache. Real caching would take the whole resource graph (source documents, images, stylesheets and so on) into account.

Dynamically-generated images could also make sense for rendering ASCII Math for Web browsers which don’t support SVG.

As often, a bit of additional comfort requires a lot of work.

First Novelang demo in the enterprise world!

Last week I gave an introduction to Novelang at a software publisher that is looking for a collaborative tool for writing its product documentation. The attendees were the internal IT manager, two technical writers and an IT intern. The technical writers were enthusiastic, since Novelang is a huge leap from FrameMaker and Robohelp, which is what they’re working with for now. They had a look at Scenari and Nuxeo. While both are open-source products, the setup fees seemed hugely overpriced.

(In my own humble opinion, Scenari is overengineered crap with an excessively complex graphical editor. But it has nice slides to explain the “What You See Is What You Mean” concept. I spent no time looking at Nuxeo for now.)

The technical writers really liked that Novelang never asks for more information than the very minimum required (I guess FrameMaker doesn’t do that). I had a chance to show off how Novelang handles whitespace, non-breakable spaces, zero-width spaces, indentation, line breaks and separators, along with the automatic handling of punctuation typography through a customizable stylesheet. After spending countless hours on those obscure cases, it’s great to discover that I’m not the only guy to whom this matters (oh, by the way, both tech writers are girls).

Now they’re evaluating the product, but I already got some feedback.

With somebody looking over my shoulder, I realized how ugly the Novelang installation looks. I unzipped the Novelang archive, set a pair of system properties, and wrote a .bat file at the root of their working directory, all with my bare hands.

The “local webserver” concept is confusing. For most people, a webserver is a remote host. This leads to some confusion about collaborative features. They were tempted to store the source document on a shared network drive. On the other hand, they’re not used to source control tools like CVS (the one in use in their company).

The lack of a graphical editor looks strange to people who are not accustomed to technical writing.

There is no “guided tour” document nor “cheat sheet” to give an overview of every feature and how to use it best.

I had a deep look at the documents they’re producing for now. The content seems to fit in Novelang syntax. They have some very big tables with one column full of content like lists. For this, a level-based structure seems more appropriate than Novelang’s cell rows.

Having image resolution for PDF hardcoded to 300 dpi won’t work here. At least they need a command-line option until stylesheet metadata gets implemented.

A big requirement is the already-discussed index feature.

To be continued!

2009-07-22

Liberation fonts

Liberation is a superb font set under the GPL+ license, which allows redistribution while not extending the GPL license to the documents produced with the fonts. Definitely a must-have. It's a 3-family set (Serif, Sans, Mono), each with the 4 combinations of bold and italic. At first glance they look more elegant than Times + Arial + Courier.

2009-07-12

Novelang-0.31.0 released!

So sweet! This release brings better control over whitespace suppression. See documentation for details.

Removing unwanted spaces (continued)

This is about some kind of brand new operator: it groups all words, blocks and punctuation signs which are not separated by a space.

A great feature of Novelang is to apply standard typographic rules, especially when there is punctuation. The problem is, sometimes you can’t apply those rules in a blunt manner.

Consider these cases: on the left, what’s in the source document; in the middle, the default rendering; on the right, a hack when there is one.

Source document	Default rendering	Hack
`imprimé(e)s`	imprimé (e) s	`imprimé(e)s`
`F.B.I.`	F. B. I. (superfluous spaces)	`F.B.I`
`computer//ing//`	computer ing	No hack available

Default space insertion makes it all wrong. I tried to fix it by detecting proximity (lack of spaces) between casual words and blocks inside grave accents. But if we add other cases like full stops, blocks inside solidus pairs and blocks inside parentheses, we end up with many complex transformations which just break existing whitespace addition for the common case.

The solution is something more generic. I’m thinking about a special character which groups everything that follows until there is a space, a line break or the end of the document. This character would be the tilde ~ because it looks like a kind of elastic ligature.

So, with a source document like this:

~computer//ing//

We get an AST (Abstract Syntax Tree) like this:

+ block-after-tilde
  + word "computer"
  + block-inside-pair-of-solidus "ing"
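
To make the grouping concrete, here is a toy Java sketch that turns one tilde-prefixed chunk into the tree shown above. It is not the Novelang parser: the class `TildeBlockSketch` is hypothetical, it only knows plain words and blocks inside solidus pairs, and it assumes the chunk has already been cut at the first space or line break.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy parser (hypothetical) for one chunk such as "~computer//ing//". */
final class TildeBlockSketch {

    /** Returns the lines of the tree for the given chunk. */
    static List<String> parse(String chunk) {
        if (!chunk.startsWith("~")) {
            throw new IllegalArgumentException("Expected a leading tilde: " + chunk);
        }
        List<String> tree = new ArrayList<>();
        tree.add("+ block-after-tilde");
        int position = 1; // Skip the tilde itself.
        while (position < chunk.length()) {
            if (chunk.startsWith("//", position)) {
                // A block inside a pair of solidus runs until the closing "//".
                int end = chunk.indexOf("//", position + 2);
                if (end < 0) {
                    throw new IllegalArgumentException("Unclosed solidus pair: " + chunk);
                }
                tree.add("  + block-inside-pair-of-solidus \""
                        + chunk.substring(position + 2, end) + "\"");
                position = end + 2;
            } else {
                // Anything else is a word, up to the next "//" or the end of the chunk.
                int end = chunk.indexOf("//", position);
                if (end < 0) {
                    end = chunk.length();
                }
                tree.add("  + word \"" + chunk.substring(position, end) + "\"");
                position = end;
            }
        }
        return tree;
    }

    public static void main(String[] args) {
        parse("~computer//ing//").forEach(System.out::println);
    }
}
```

Because both children belong to the same block-after-tilde node, a stylesheet can tell that no space should be inserted between them.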

But we still miss the feature of adding zero-width spaces when needed. How to express this? Since zero-width spaces only make sense inside a group with no space, we can reuse the tilde character safely.

This:

~A.L.L.~O.F~'E.M.

… becomes:

+ block-after-tilde
  + subblock
    + word "A"
    + punctuation-sign full-stop
    + word "L"
    + punctuation-sign full-stop
    + word "L"
    + punctuation-sign full-stop
  + subblock
    + word "O"
    + punctuation-sign full-stop
    + word "F"
    + punctuation-sign full-stop
  + subblock
    + apostrophe-wordmate
    + word "E"
    + punctuation-sign full-stop
    + word "M"

And this is enough for the stylesheet to find where to insert zero-width spaces.
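
As a rough illustration of that last step, the Java sketch below splits a block-after-tilde into its subblocks at each inner tilde and joins them with U+200B (zero-width space). In Novelang this would happen in the stylesheet, working on the AST; the `TildeGroups` class here is a hypothetical stand-in that works on the raw source text.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: subblocks of a block-after-tilde, joined by zero-width spaces. */
final class TildeGroups {

    static final String ZERO_WIDTH_SPACE = "\u200B";

    /** Splits a chunk such as "~A.L.L.~O.F~'E.M." into its subblocks. */
    static List<String> subblocks(String chunk) {
        if (!chunk.startsWith("~")) {
            throw new IllegalArgumentException("Expected a leading tilde: " + chunk);
        }
        if (chunk.chars().anyMatch(Character::isWhitespace)) {
            throw new IllegalArgumentException("A group ends at the first whitespace: " + chunk);
        }
        List<String> result = new ArrayList<>();
        for (String part : chunk.substring(1).split("~")) {
            if (!part.isEmpty()) {
                result.add(part); // Each inner tilde starts a new subblock.
            }
        }
        return result;
    }

    /** Renders the chunk with no visible space, allowing line breaks between subblocks. */
    static String render(String chunk) {
        return String.join(ZERO_WIDTH_SPACE, subblocks(chunk));
    }

    public static void main(String[] args) {
        System.out.println(subblocks("~A.L.L.~O.F~'E.M."));
    }
}
```

The rendered text looks the same as the source without its tildes, but a renderer may now break the line between subblocks.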