New beasts: delimiters and levels

To my mild surprise, the renaming of all tree nodes went smoothly and all existing stylesheets were quickly put back to work, thanks to the XPath verifier. The token names are now beautifully consistent, except for a few which I left out of scope until I got some insight about them. These are: CHAPTER, SECTION, TITLE. While Novelang claims to have no semantic markup, how can we ignore the fact that chapters and sections have a highly structural effect?

On closer look, it appears that chapters and sections are not quite the same thing, depending on whether we take them at parsing stage or at rendering stage.

At parsing stage, chapters and sections are delimiters which may be followed by some text (becoming the title). They exist at the same level as paragraphs. It’s only after the whole tree is parsed that Novelang creates the hierarchy, by re-hierarchizing the tree before passing it to the rendering stage.

At rendering stage, chapters and sections become containers (a source document declaring a chapter then a paragraph is not “flat” anymore, as the chapter now contains the paragraph). This all looks the same as parsing stage, except that a Book may define new chapters and sections on its own, like with insert $createchapters. My guess is, with a rendering system supporting recursive processing, it would be a waste to limit ourselves to a fixed hierarchy.
Picture some big documents with numerous title levels, between five and fifteen. Creating a standard template under a tool like MS-Word doesn’t work well. Forcing each top-level title to appear on a blank page is a waste for small documents. But when several top-level titles are allowed on the same page, big documents become unreadable, with deep titles even smaller than the text body. Nor is it a solution to hack title depth by starting at, let’s say, the third level, because automatic numbering would cause our first chapter to be numbered “0.0.1”. On the other hand, a programmatic templating system like FOP would support such hacks.
With different names for nodes at parsing and rendering stage, we make the thing clearer. At parsing stage, we have delimiters. Delimiters are the new beast. They look like a list item but may contain fewer things (for now the only restriction is that they cannot contain a URL immediately following the equals signs). They will be processed in a different manner. Anyway, the “delimiter” name is not supposed to appear at rendering stage.

Let’s use DELIMITER as a radix. Because we need to remain consistent with the rest, we’ll continue the names with the characters the delimiter contains, so we have DELIMITER_DOUBLE_EQUAL_SIGN and DELIMITER_TRIPLE_EQUAL_SIGN. More levels are probably bad news (you should split your text in smaller parts) but the naming scales up to “octuple”. We introduce a new convention: node names that don’t appear at rendering stage are suffixed by a low line (“_”). The low line as a prefix is already taken for node names which don’t appear in the parser. Finally our delimiter nodes are DELIMITER_TRIPLE_EQUAL_SIGN_ and so on.

But what about the title? Even if its content is close to a paragraph’s, the title is structurally different. For its name, I’d like something like “text for a delimiter” but “text” carries no structural meaning. DELIMITING_TEXT_ is better but not so good, as it suggests that the text itself is the delimiter. Let’s keep it until something better comes up.

Now there is an obvious choice for the name of the nodes containing other nodes: _LEVEL (“level” is a palindrome, by the way). The title becomes _LEVEL_DESCRIPTION. Yes, this starts to look like semantic markup but it’s just reflecting the reality, because the processing of the delimiter gives something of a higher meaning.

A consequence of dropping the chapter / section difference is some loss of information. Source documents which define top-level sections are now structurally indistinguishable from source documents defining top-level chapters with no section under them. This looks like a good thing, because these differences are supposed to be managed at Book level.
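The re-hierarchizing step described above can be pictured with a small sketch. This is not Novelang's actual code: the Node class, the ROOT name, the idea of storing the title directly on the delimiter node, and the rehierarchize method are all assumptions made for illustration. It only shows how a flat sequence of delimiters and paragraphs could become nested _LEVEL containers, using the number of equals signs as depth.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

final class LevelBuilder {

  /** Hypothetical minimal tree node: a name, optional text, and children. */
  static final class Node {
    final String name ;
    final String text ;
    final List< Node > children = new ArrayList<>() ;

    Node( String name, String text ) {
      this.name = name ;
      this.text = text ;
    }

    @Override
    public String toString() {
      final StringBuilder sb = new StringBuilder( name ) ;
      if( text != null ) sb.append( ":" ).append( text ) ;
      if( ! children.isEmpty() ) sb.append( children ) ;
      return sb.toString() ;
    }
  }

  /** Number of equals signs for a delimiter node, 0 for any other node. */
  static int depthOf( Node node ) {
    switch( node.name ) {
      case "DELIMITER_DOUBLE_EQUAL_SIGN_" : return 2 ;
      case "DELIMITER_TRIPLE_EQUAL_SIGN_" : return 3 ;
      default : return 0 ;
    }
  }

  /** Turns a flat list of delimiters and paragraphs into nested _LEVEL nodes. */
  static Node rehierarchize( List< Node > flat ) {
    final Node root = new Node( "ROOT", null ) ; // Assumed root name.
    final Deque< Node > open = new ArrayDeque<>() ;
    final Deque< Integer > depths = new ArrayDeque<>() ;
    open.push( root ) ;
    depths.push( 0 ) ;
    for( Node node : flat ) {
      final int depth = depthOf( node ) ;
      if( depth == 0 ) {
        // Not a delimiter: it belongs to the innermost open level.
        open.peek().children.add( node ) ;
        continue ;
      }
      // Close every open level as deep as, or deeper than, the new one.
      while( depths.peek() >= depth ) {
        open.pop() ;
        depths.pop() ;
      }
      final Node level = new Node( "_LEVEL", null ) ;
      if( node.text != null ) {
        level.children.add( new Node( "_LEVEL_DESCRIPTION", node.text ) ) ;
      }
      open.peek().children.add( level ) ;
      open.push( level ) ;
      depths.push( depth ) ;
    }
    return root ;
  }

  public static void main( String[] args ) {
    final List< Node > flat = Arrays.asList(
        new Node( "DELIMITER_DOUBLE_EQUAL_SIGN_", "One" ),
        new Node( "PARAGRAPH_REGULAR", "p1" ),
        new Node( "DELIMITER_TRIPLE_EQUAL_SIGN_", "Sub" ),
        new Node( "PARAGRAPH_REGULAR", "p2" ),
        new Node( "DELIMITER_DOUBLE_EQUAL_SIGN_", "Two" )
    ) ;
    System.out.println( rehierarchize( flat ) ) ;
  }
}
```

Since the delimiter nodes are consumed while building the levels, nothing named “delimiter” survives to the rendering stage, which matches the low-line suffix convention.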

Novelang-0.15.0 released!

Latest release of Novelang available here. Main new feature: stylesheets containing XPath expressions that refer to non-existing grammar tokens are detected and rejected. Fixed some annoying bugs. See "Status and history" for details. Enjoy! c.


New naming scheme for nodes

The detection of incorrect XPath expressions in XSL files now works (it is in the master branch). It’s based on code generation for the NodeKind class, which describes every supported node name as produced by the parser, and this is expected to be of great help during the upcoming huge refactoring of the naming scheme of the nodes.

While Novelang grammar contains no semantic information, it has semantic-like markup. Text like //hi there// is supposed to become italics because of the slanting evocation caused by the solidus (or “forward slash”) character. But here is the lie: while the stylesheet processes an emphasis node, its output is whatever the author wished, including something not related to emphasis at all. The grammar is just wrong to claim its node is about emphasis, because the choice of making it appear as emphasis (through italics) is out of grammar’s scope.

The new naming scheme of the nodes intends to make the intent clearer: Novelang grammar carries no semantic meaning. The meaning is given by the stylesheet. In the case of the text between the two pairs of solidus, all that the grammar knows for sure is… well, it’s about two pairs of solidus. Just before diving into the gory details, here is a taste of the new naming scheme: EMPHASIS would become BLOCK_INSIDE_SOLIDUS_PAIRS and inside the XSL stylesheet it would be n:block-inside-solidus-pairs.
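To make the correspondence between grammar node names and stylesheet element names concrete, here is a minimal sketch. The rule (lower-case the name, replace each low line by a hyphen, prepend the n: prefix) is only inferred from the single example above, and the class and method names are made up for the illustration.

```java
import java.util.Locale;

final class NodeNames {

  /**
   * Derives the XSL element name from a grammar node name.
   * Assumption: the rule is inferred from one example only.
   */
  static String toXslElementName( String nodeName ) {
    return "n:" + nodeName.toLowerCase( Locale.ROOT ).replace( '_', '-' ) ;
  }

  public static void main( String[] args ) {
    System.out.println( toXslElementName( "BLOCK_INSIDE_SOLIDUS_PAIRS" ) ) ;
  }
}
```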

Finding a consistent (and extensible) naming scheme is not easy because of plenty of overlapping cases. Many terms need clarification and sometimes consistency may impact some structural aspects.

Let’s start with the paragraph. The paragraph is a very central object which helps finding out two families of nodes: those taking place inside a paragraph, and others (which define the paragraph itself, or define stuff that may contain a paragraph).

Let’s say a paragraph is a sequence of characters which does not contain two consecutive line breaks. This draws interesting questions.

Should a standalone URL appear enclosed in a paragraph? If a URL appears standalone, it reflects the author’s will to make it appear as a paragraph so we’ll enclose it into a paragraph (and the definition of “paragraph” gets clearer!).

Is a big list item a paragraph? A big list item could be renamed in order to contain the “paragraph” word. For consistency, the “small list item” would become something like “embedded item”; we just lose brevity here.

There is a temptation to embed the list item node inside a PARAGRAPH block, as we do for URL. The stylesheet could rely on paragraph’s parenthood (the list node) to determine it’s a list item. And then we have only one PARAGRAPH node, not two distinct cases. But in practice, stylesheet writers will make two distinct cases every time because the two have different indentation and so on. So we really need two flavors of the “paragraph” node.

PARAGRAPH_REGULAR is a good name for the regular paragraph, hinting there can be not-so-regular ones.

PARAGRAPH_ITEM makes sense, as starting with PARAGRAPH tells structural things about the node. On the other hand, it can be understood as if paragraphs were holding items. So let’s be true and use PARAGRAPH_AS_LIST_ITEM.

For a source document like this:



--- item1

--- item2

We end up with such a node structure:

|   +-- WORD
|   +-- URL
+-- LIST
    |   +-- WORD
        +-- WORD

Now that we are clear about paragraphs, let’s consider the case of paragraphs tied together by a paired delimiter (a delimiter including a start marker and an end marker). This is what current “blockquote” does. The delimiters are a whole line starting with “<<” (less-than sign) and ending with “>>” (greater-than sign). The “blockquote” may only contain paragraphs. As we reserve the word “block” for another usage to be explained later, we must find a way to tell there is an enclosed sequence of paragraphs. A prefix like SEQUENCE_OF_PARAGRAPHS is not so bad because it puts the emphasis on the word “sequence”. But PARAGRAPH_SEQUENCE doesn’t carry a plural form so it doesn’t look stupid in case of only one paragraph. On the other hand, names describing a paragraph (PARAGRAPH_REGULAR and PARAGRAPH_AS_LIST_ITEM) start with the word “paragraph”, causing some confusion. Finally, PARAGRAPHS is best because if we drop the plural vs. singular thing, we avoid the lengthy “sequence” word with the same meaning. Now that we have a radix for the node name, we just add a suffix to describe the delimiter. After all, it would make sense to have similar structures with different markup (as we have for stuff inside a paragraph). Since the delimiter is a pair of angled brackets, just tell it. We end up with PARAGRAPHS_INSIDE_ANGLED_BRACKET_PAIRS.

I had a strong debate with myself: should I use IN (shorter) or INSIDE (more explicit)? The “inside” word is very clear about a block contained by something. Later, when creating names around the “block” word, we’ll see that a construct like “block inside” is less ambiguous than “block in” that may look like a verb.

In ANGLED_BRACKET_PAIRS the name of the delimiter is left in singular. The word “pair” is used because “double” is required for “double quote”. “Double quote” is the Unicode name of the character, and it’s a Novelang standard to always use Unicode names. So we can’t use “double” to say there are many delimiters and we use “pair” instead. Telling there are several pairs (plural) is ok because we can’t honestly imagine how there could be more than two.

Novelang’s current “literal” looks a lot like “blockquote” (just three angled brackets instead of two). But literal doesn’t care about any paragraph structure. It’s just uninterpreted lines, including line breaks as they are. That’s a thing to know when writing the stylesheet: it hints there will be no subnode to process. In this case, LITERAL should appear inside the node name. But, as we stated for paragraphs, it’s important to highlight the structural implications of the node. So we end up putting the “line aspect” first and we get LINES_OF_LITERAL as a radix. Adding the suffix, we end up with LINES_OF_LITERAL_INSIDE_ANGLED_BRACKET_TRIPLETS. The suffix here is questionable because I don’t see any reason to offer another kind of delimiter for literals. So let’s keep LINES_OF_LITERAL in the end.

There is another kind of node that may contain other nodes, especially paragraphs: “chapter” and “section”. There is material for a discussion because if both become sections and sections become nestable, we could do amazing things, especially if the depth of sections can be adjusted at Book level. But we don’t need to solve every problem today and we leave this to another discussion.

Now let’s look at what happens inside a paragraph. All subelements acting like a container inside a paragraph (like parentheses) are called blocks. “Block” is a good word because it is short, and it’s not wasted here because it will appear a lot. In order to follow the emerging rule of telling about structure first, we’ll use the prefix BLOCK.

For stuff inside parentheses and square brackets (and curly braces in the near future), something like BLOCK_INSIDE_PARENTHESIS is clear enough.

For paired delimiters like double hyphen (for -- interpolated clause --) or double solidus, it’s right to say there are two pairs of something. So BLOCK_INSIDE_SOLIDUS_PAIRS looks reasonable.

Current “interpolated clause” has a special case when it has a “silent end” (like --this-_.). It’s useful for making only the first dash character appear, where a dumb punctuation sign would have given up the level of control provided by the parser. In this case, it’s hard to claim there are two pairs of hyphens. BLOCK_INSIDE_2_HYPHENS_THEN_HYPHEN_LOW_LINE is accurate, while not very concise. THEN has the special role of telling that delimiters are asymmetrical, describing the first delimiter then the second.

For double quotes, rules stated above still apply (no chance here) and we have BLOCK_INSIDE_DOUBLE_QUOTES.

For current “superscript”, there is only one opening delimiter. The closing delimiter is implicit with the end of the contained word (super^script) so we don’t have exactly a block. But rules above still apply and we get WORD_AFTER_CIRCUMFLEX_ACCENT.

Punctuation signs are left unchanged: by now we have a PUNCTUATION_SIGN node enclosing a node representing the sign itself (SIGN_COMMA, SIGN_PERIOD…).

Now here is the summary of old node names vs. new ones:
















Now I’m having a look at some ideas I jotted down on this blog for extending the Novelang grammar. It was in: http://novelang.blogspot.com/2008/07/some-ideas-for-novelang-syntax.html. The new naming scheme seems to scale!

Please note the use of AND for describing the first delimiter. For “++=” we have 2_PLUS_SIGNS_AND_EQUAL_SIGN_PAIRS where AND means “immediately followed by”. There is no hint the delimiters are symmetrical.

^^ small caps ^^          
__- single underline -__ 
++= double strike =++     



Novelang-0.14.0 released!

Latest release of Novelang available here. Refactored ANTLR grammar for supporting some planned features. Known regression: some useful characters missing from font list. Enjoy!

Grammar refactoring complete!

Wow, I just checked the result of a major grammar refactoring into the master branch! Now I'll ship a new version with just those changes to make sure that nothing breaks in existing documents. But what's next? I have many options to consider.

Split the project down to smaller pieces

With one big project with lots of Java code depending on the ANTLR-generated parser, playing with the grammar and deactivating some rules breaks the whole project, so it becomes difficult to find out what's wrong. During the refactoring, I split the sources into several projects: common stuff, parser stuff, rest of the stuff. Then I could focus on the grammar itself. Keeping such a split would be great for future experiments. On the other hand the Ant build would become increasingly complex. Maven is the tool for working with several subprojects but the migration has a cost. Maybe I should defer it until the next time I need it.

Tree nodes renaming

With usage, it appears that Novelang grammar has nothing to do with semantic markup. It supports semantic-like markup (like "//foo//" where double solidus pictures slanted characters which are like italics) but the flexible nature of stylesheets definitely prevents freezing the meaning of grammar constructs. For this reason, the tree node which is now named emphasis should be renamed into something like block-in-double-solidus-pair.

Stylesheet consistency check

Changing the name of the tree nodes will break existing stylesheets. But checking if stylesheets use correct node names would ease the pain a lot. Such a check could be made by parsing attributes like match="n:chapter" and raising errors when an unknown node name appears (in the "n:" namespace).

Automatic detection of node names in the grammar

It's the same principle as above, applied to Java code. For now, tree node names defined in the grammar are duplicated in the NodeKind class. The NodeKind class is updated manually. By generating the corresponding code, there would be no need to update it manually.

Fix list of supported characters

Class SupportedCharacters is broken because of changes introduced by ANTLR-3.1.1. The fix could get trivial if we generate some Java code in the same way as for node names (described above).

List items

That was one of the main features justifying this refactoring, remember? There are plenty of lists to be envisaged. There are two main families of lists: big lists, whose items behave like paragraphs with a special introducer, and small lists, whose items are separated by a single line break and therefore may appear inside a paragraph.
--- Big list item with hyphens

### Big list item with number signs

*** Big list item with asterisks

This is a paragraph.
- Small list item with hyphen
  - Some sub-item
  - Sub-item, again
Number sign hints the renderer to generate numbered list.
Interpreting indentation is left to tree-mangling and rendering.
  # Small list item with number sign
  # Number two
  # Number three
Asterisk mean almost the same as hyphens. Do we need them?
* Small list item with asterisk
* Yet another one item with asterisk
URL

Another main feature is support for URLs inside paragraphs, and for URLs having a title.
This is a paragraph embedding two URL. Here is the first:
  "This is the title of second URL"
Conclusion

Adding code generation from the ANTLR grammar, then the stylesheet checker, seems the first thing to do in order to secure existing features and build on a solid basis. Generating code from the existing grammar (for enumerating supported node names and characters) may be complex as it may involve ancillary classes. This could be a reason to Mavenize the project.
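As an illustration of the stylesheet consistency check discussed in this post, here is a rough sketch. It is not the checker that ended up in Novelang: the class name, the method name and the regular expressions are assumptions, and a serious implementation would rather use an XML parser and a proper XPath tokenizer. It only handles double-quoted match attributes, scans them for names in the "n:" namespace, and reports those missing from a set of known node names (which could come from the generated NodeKind class).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class StylesheetNodeNameCheck {

  // Captures the value of a double-quoted match attribute.
  private static final Pattern MATCH_ATTRIBUTE =
      Pattern.compile( "\\bmatch\\s*=\\s*\"([^\"]*)\"" ) ;

  // Captures the local name of anything prefixed by "n:".
  private static final Pattern NODE_REFERENCE =
      Pattern.compile( "\\bn:([A-Za-z][A-Za-z0-9_-]*)" ) ;

  /** Returns node names used in match attributes but absent from known names. */
  static List< String > findUnknownNodeNames( String xsl, Set< String > known ) {
    final List< String > unknown = new ArrayList<>() ;
    final Matcher attribute = MATCH_ATTRIBUTE.matcher( xsl ) ;
    while( attribute.find() ) {
      final Matcher reference = NODE_REFERENCE.matcher( attribute.group( 1 ) ) ;
      while( reference.find() ) {
        final String name = reference.group( 1 ) ;
        if( ! known.contains( name ) && ! unknown.contains( name ) ) {
          unknown.add( name ) ;
        }
      }
    }
    return unknown ;
  }

  public static void main( String[] args ) {
    final Set< String > known =
        new HashSet<>( Arrays.asList( "chapter", "title" ) ) ;
    final String xsl =
        "<xsl:template match=\"n:chapter\" />"
        + "<xsl:template match=\"n:chapter/n:titel\" />" ;
    System.out.println( findUnknownNodeNames( xsl, known ) ) ;
  }
}
```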


Ongoing grammar refactoring

As refactoring goes on, it's time to answer some questions. The main new features (URLs and list items inside a paragraph) have a great impact on the whole design. Both may take place inside a paragraph; both start (and end) with a line break. That's a major change, as in the former design the paragraph was a single piece as long as there were no two contiguous line breaks.

While it's not a part of the grammar itself (as in the Novelang.g file), URLs support being "decorated" with a preceding double-quoted block, or an angled-bracketed block. So we should consider that URLs inherently spread over many lines and expose a double-quoted block.

List items inside a paragraph are called "small list items". They cannot spread over more than one line (no break inside). For this reason, a small list item may not contain a paragraph. So we have a new beast: text blocks which don't contain line breaks as paragraphs do. For this reason, they cannot contain URLs or small lists. We'll call monoblock a text block with no line break. We'll call spreadblock a text block with line breaks (as it may spread over several lines).

We can't have URLs inside small lists. But that's ok because there is another thing called "big list items", which are plain paragraphs with a special indicator at their start.

For now (master branch), titles for sections and chapters are casual paragraphs. I wonder if it now makes sense to have URLs or small lists inside a title.

Another limit I put arbitrarily is to forbid URLs and small lists inside double-quoted blocks. This sounds right for the URL because of the double-quoted block that may decorate a URL (it's logically impossible to have a double-quoted block inside a double-quoted block without an additional delimiter). For the small list, it's more for typographical sanity. Does it make sense to extend this to any asymmetrical delimiter (emphasis, interpolated clause...)? Sometimes it's useful to emphasize a whole paragraph, including its URLs and small lists.

Because I'd like Novelang grammar to be twisted and abused in any technically-feasible way (for creating idioms), it doesn't make sense to forbid titles to be paragraphs, nor to forbid asymmetrically-delimited blocks (except double-quoted blocks) to spread over several lines and contain URLs and small list items. Depending on the stylesheet, things like this could make sense:
=== This is a section with embedded small list items:
- item1
- item2
- item3

This is a paragraph.
Rendering may look like:
This is a section with embedded small list items: item1, item2, item3.
Same for URLs. Because URL href must stand on its own line, this encourages adding a display text:
=== "This is URL display text"
In the end, it seems that technically feasible things lead to a well-designed grammar! Thanks to ANTLR for helping me to express the grammar so clearly.


Tree manipulation languages

As a background task I'm thinking about how to improve the Book definition. For now, Novelang books are a sequence of imperative tasks:
insert file:the-preamble.nlp 

insert file:. 

Apart from the mapstylesheets command, the insert command is all about inserting trees in the main document. Default behavior is to create one tree from a Part file. The $recurse command creates many trees (one per Part file) from a given directory. The $style command creates a style node right under the root of the inserted tree. One of the planned features of Novelang is to build Books from identifiers, which are subtrees.

As it's all manipulating trees, one can note that the XSLT stylesheets transforming a Book tree into a PDF or an HTML document are all about manipulating trees, too, but they act at a different stage: once the content of the document is well-defined. And XSLT works by producing a changed version of one input tree; it doesn't build a tree from multiple sources. As XSLT does a good job, I googled "tree manipulation language" to see if there is something useful here, at least to take inspiration from.

TXL

I found TXL, which seems backed by serious research. Unfortunately it doesn't come as an embeddable Java library. TXL scripts define an input grammar for building up the tree, then rules for creating / adding / moving / deleting nodes. It looks sweet for experimenting with languages.

Tregex and Tsurgeon

Tregex is a regex-like language for extracting nodes from a tree. Tsurgeon is a Tregex extension for manipulating the trees extracted by Tregex. The Tregex language itself looks good. Given a tree like this:
  / |  \
The following Tregex expression means something like "Call 'np' the NP node that has an NP child immediately followed by a sibling PP node, itself immediately followed by another sibling PP node, this last node being called 'pp2' by the way":
NP=np < (NP $+ (PP $+ PP=pp2))
Then here is a Tsurgeon expression :
adjoin (NP=new_np NP@) np
move pp2 >- new_np
Applying the Tregex and Tsurgeon expressions to the tree above gives a new tree like this:
     /  \  
   NP    PP
  / \ 
Tregex and Tsurgeon are bundled in a Java library.

I don't like their design of the Tree class

I won't dispute the fact that their Tree class should be mutable, at least because this may save memory with some algorithms. But the Tree class is a concrete class declaring more than one hundred public methods. Most of them could have been part of a utility class. You are dishonestly invited to compare to Novelang's Tree definition!

Conclusion

As far as I can see, there is much work done on tree manipulation languages. While they make it possible to do anything on a given tree, they are a very special jargon that won't help non-geeks to create Novelang Book files. So I should find other areas to investigate if I want new ideas in this area.



Leo is some kind of text editor with the ability to aggregate files and parse special processing directives. It can be seen as a tool to create and manage graphs of text fragments. Where Leo shines is in extracting code fragments, using its own directives. There is some likeness with Novelang, which has books including parts. I don't want anything like a graphical front-end for Novelang now but there may be some ideas to steal from Leo.


Compulsive font browsing: FontExplorerX

The best font browser I met at this time: Linotype FontExplorerX. I'm playing with the Mac version and it rocks! Top features:
  • Can visualize non-installed fonts.
  • Many font formats supported.
  • Displays a whole font family at once. Also recognizes Optical fonts.
  • Sparse selection for multiple previews. Supports multiple windows for font detail. That's incredibly useful for comparisons!
  • Rich preview options, like custom text and ligature activation.
  • Detailed font information with huge zoom factor, baseline display, x-height and bounding box.
  • Unicode support.
  • Virtual font folders.
  • Detects font conflicts.
  • More than font viewing: may intercept system calls to obtain fonts from some applications like InDesign or XPress.
It's a commercial product with an option to buy fonts on Linotype's site, but it is downloadable and usable for free. It's a beautiful professional tool. Real font lovers like me will find it more addictive than YouTube!


LyX: the WYSIWYM document processor

I'm always glad to see there are clever projects providing alternatives to boring WYSIWYG word processors. LyX is one of those. It defines itself as a "What You See Is What You Mean" document processor. LyX is a graphical environment for typing LaTeX-like documents. Text displays nicely with various font sizes and so on, but does not attempt to match the final result exactly, so it can display useful tags and so on. The LyX documentation coins a few interesting terms: "WYSIWYM", "document processor" instead of "word processor" and "finger painting" for this boring way to set style at character level under a graphical user interface. LyX requires TeX. Check the demo video for mathematical equations, it rocks.


Novelang-0.13.0 released!

Latest release of Novelang available here. Better detection of text inconsistencies (lexer now reports problems). Enhanced font list: doesn’t break when there is no custom font at all. Cleaned up samples directory: most of the stuff there was for tests, and moved to src/test-resources. Enjoy!


Refactoring Novelang grammar

I started working on a refactoring of the Novelang grammar and it's a big job. By the way I switched to ANTLR-3.1.1, the latest version of ANTLR. The development occurs in the ANTLR-3.1.1 branch.

ANTLR is the greatest tool for generating parsers. With version 3.1, it supports grammar imports, which means a complex grammar can be split into smaller files. With careful design it would be possible to let third-party developers extend Novelang grammar, as ANTLR supports rule overriding. Alas, ANTLR-3.1.1 doesn't work well with multiple import levels so I'm keeping one huge grammar file for now. You can have a look at current Novelang grammar (master branch).

For the end-user, the biggest feature brought by this refactoring is support for "monoline" text items. Basically this is for stuff delimited by a pair of line breaks and that may stand in the middle of a paragraph. For now, Novelang only recognizes URLs when delimited by two pairs of line breaks.
(This is some paragraph before the URL.)


(This is some paragraph after the URL.)
Recognizing a URL as a "monoline" text item would allow something like this:
This is a paragraph.
...Same paragraph, continued.
That's a lot more natural. The URL still must start at the start of the line, because it's much easier to copy from the text editor. I previously discussed URL syntax here. Coding Horror blog has a nice post that should deter anyone from including URLs in plain text with no machine-understandable delimiter. The full-blown URL syntax supports URL decorations like this:
Go to 
  "Novelang website" 
  [Novelang website on Sourceforge.net]
and see all useful links.
The quoted and bracketed text blocks are optional and provide display text and alternate text. I can see no reasonable way to support them at grammar level. The best way to handle them is at tree-mangling stage (reordering the Abstract Syntax Tree generated by the parser). This means, the parser-generated AST should include nodes describing whitespace and line breaks. Support for monoline items is helpful (necessary?) for supporting lists. As previously discussed, here is how I want to write a list:
Here is a list on two levels:
* First item
* Second item
  * First subitem
  * Second subitem
* Third item
...And the paragraph continues here.
As for URL decoration, the grouping of list items is made at tree-mangling stage. Because indentation matters, whitespaces in the AST should tell how big they are. A list which can appear inside a paragraph will be called a small list. There is the need for another kind of list where items are paragraphs, to be called big list. The symbols for designating list items ("*", "#", "-", "---",...) are left to another discussion.
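Grouping list items by indentation at tree-mangling stage could look like the sketch below. It is only an assumption about how this might work: it operates on raw lines rather than on AST nodes carrying whitespace sizes, it supposes that every line is a list item made of spaces, a one-character marker, then the text, and the Item class and group method are invented for the example.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

final class SmallListGrouper {

  /** Hypothetical list item holding its text and its nested items. */
  static final class Item {
    final String text ;
    final List< Item > subitems = new ArrayList<>() ;
    Item( String text ) { this.text = text ; }
  }

  /** Counts leading spaces. */
  static int indentOf( String line ) {
    int i = 0 ;
    while( i < line.length() && line.charAt( i ) == ' ' ) i++ ;
    return i ;
  }

  /** Nests each item under the closest preceding item with a smaller indent. */
  static List< Item > group( List< String > lines ) {
    final List< Item > top = new ArrayList<>() ;
    final Deque< Item > parents = new ArrayDeque<>() ;
    final Deque< Integer > indents = new ArrayDeque<>() ;
    for( String line : lines ) {
      final int indent = indentOf( line ) ;
      // Skip the one-character marker ("*", "#", "-") after the indent.
      final Item item = new Item( line.substring( indent + 1 ).trim() ) ;
      // Forget items which are at the same depth or deeper.
      while( ! indents.isEmpty() && indents.peek() >= indent ) {
        parents.pop() ;
        indents.pop() ;
      }
      if( parents.isEmpty() ) {
        top.add( item ) ;
      } else {
        parents.peek().subitems.add( item ) ;
      }
      parents.push( item ) ;
      indents.push( indent ) ;
    }
    return top ;
  }

  public static void main( String[] args ) {
    final List< Item > items = group( Arrays.asList(
        "* First item",
        "* Second item",
        "  * First subitem",
        "  * Second subitem",
        "* Third item"
    ) ) ;
    for( Item item : items ) {
      System.out.println( item.text + " (" + item.subitems.size() + " subitems)" ) ;
    }
  }
}
```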


URW Fonts

Ok, I swear I'll take a break from fonts for a while. But first I have to post this link about URW fonts. URW fonts are a free library of good-quality fonts under the GNU General Public License. Here's the stuff: http://www.ghostscript.com/awki/GhostPCL

I already wrote that Novelang should not come with its own set of fonts but creating nice-looking documents with base 14 fonts (vanilla Times + Helvetica + Courier) just sounds too hard for me. In the same post I wrote that Bitstream Vera Fonts were looking good and, well, I'm not so sure now. But URW fonts are just great. They reproduce a set of standard, classical fonts, with little marvels like Garamond, or Palladio (which looks close to Palatino). For a project like Novelang, it's just a perfect gift. Free fonts of decent quality are pretty rare, and these ones seem to have consistent baselines and spacing across the collection (more tests needed to be sure).

One potential problem for the end-user is the GPL license. This is fine for Novelang (which is GPL'ed, too) but if you create a PDF using those fonts, then your document becomes GPL'ed and it may not be what you want. It seems there is a special exception for some distributions of those fonts, but I could not grab the downloadable file.

Novelang-0.12.0 released!

Latest release of Novelang available here. Font list looks better than ever! See release notes for details.


Stylesheet metadata and caching

As shown by previous posts on this blog, I've already spent a lot of time on fonts and how FOP deals with them. For now (before releasing Novelang-0.12.0) the result is far from perfect but I should move on to other topics in order to keep the product well-balanced. So today I'm blogging for the pleasure of writing purely speculative things, you've been warned. Outstanding defects of the current solution are:
  • FOP caches font metadata in ~/.fop/fop-fonts.cache. Font deletion doesn't seem to be detected, so it's better to always delete the cache file at application startup.
  • Font changes are not detected while application runs. This is Novelang's fault (in RenderingConfiguration) as it always returns the same FopFactory instance.
  • Fonts defined as application parameters. Should be defined at stylesheet level in order to get a finer grain.
The last point relates to the previously-discussed point on how to enrich an XSL file with FOP-specific metadata. It could include a link to the hyphenation directory, as hyphenation rules may depend on some characters defined in the stylesheet (like the apostrophe).

For now, when the user touches one of her / his stylesheets, the next document rendition takes the changes into account, and that's the way it should be. Fonts or hyphenation directories are set at FopFactory level, which currently lives as long as the whole Novelang application. For supporting "live" changes in XSLs, Novelang should re-create a FopFactory each time. What happens if reading font metrics takes much time? Remember: FOP has a cache for this and I guess there is a reason.

I know that "premature optimization is the root of all evil" so caching has low priority for Novelang but for my own personal comfort I must know how to perform caching. The goal here is to cache a whole FopFactory (possibly several ones, for several documents rendered concurrently) for reuse when the configuration is the same.

From my Cocoon days, I remember the excellent caching system. Each cacheable resource involved in the making of the final document has a CacheValidity object that tells whether another CacheValidity is valid with regard to the current one. Here is my own flavor of CacheValidity using Generics (the Javadoc of the original is here):
public interface CacheValidity< T > {
  boolean isValid( T other ) ;
}
I'm not sure on how to use Generics in this case but anyways I'll see. Now here is how to get a new resource from a cache. Let's say this is a method of an object holding an instance of a CacheableFile. Synchronization is omitted for brevity.
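public String getContent( File file ) {
  // Snapshot of the file as it is now, to compare with the cached one.
  final FileCacheValidity current = 
      new FileCacheValidity( file ) ;
  if( ! cache.getValidity().isValid( current ) ) {
    // The file changed since it was cached: rebuild the cache entry.
    cache = createCache( file ) ;
  }
  return cache.getCachedContent() ;
}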
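To make the idea a bit more concrete, here is a minimal sketch of what a file-based validity could look like. Everything beyond the CacheValidity interface is an assumption: FileCacheValidity is only named in the snippet above, and comparing the last-modified timestamp and the file length is just one plausible choice, not what Cocoon or Avalon actually do.

```java
import java.io.File;

/** The interface from the post, parameterized by the type it compares with. */
interface CacheValidity<T> {
  boolean isValid(T other);
}

/**
 * Hypothetical file-based validity: a snapshot of a file's timestamp and size.
 * Two snapshots are considered equivalent when both values agree.
 */
final class FileCacheValidity implements CacheValidity<FileCacheValidity> {
  private final long lastModified;
  private final long length;

  /** Takes the snapshot from the file system. */
  FileCacheValidity(File file) {
    this(file.lastModified(), file.length());
  }

  /** Takes the snapshot from explicit values (handy for tests). */
  FileCacheValidity(long lastModified, long length) {
    this.lastModified = lastModified;
    this.length = length;
  }

  @Override
  public boolean isValid(FileCacheValidity other) {
    return lastModified == other.lastModified && length == other.length;
  }
}
```

Binding the type parameter to the implementing class itself is one way to answer the Generics question: a file validity can only be compared with another file validity, and the compiler enforces it.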
From the simple and clever CacheValidity object, it is possible to implement various caching strategies in a transparent manner. It is also possible to compose them, like using a temporal cache for avoiding any refresh during a given time interval, or implementing a directory cache from a list of file caches. More exotic CacheValidity objects may represent a whole XML fragment corresponding to the XSL metadata. If it is left untouched by the user and files are untouched, too (which should happen most often) then the FopFactory may be used again. I don't know if it's worth reusing Avalon code (it's a lot of mess with no Generics) but it was definitely worth a look. So what's the lesson here? If I had started to look at caching issues from the beginning, attempting to make Novelang fit around Avalon or whatever, I would have lost the focus and started recoding Cocoon (which is the historical reason for building Avalon). But I started Novelang because I was unhappy with Cocoon. So I'll probably have a close look at Avalon, and rewrite caching stuff the way I feel. This may sound like horribly stupid "Not Invented Here" syndrome but my own experience on incremental developments shows that value is not in the code itself, but in the understanding of the problems. (That's why many organizations worship somewhat crappy home-grown frameworks: not because of their intrinsic value, but because it's the proof they could bring many people thinking the same way.) Joel Spolsky develops this point of view under the light of the competitive advantage.


Web fonts

My recent efforts for supporting custom fonts for PDF documents greatly increased my attention to all font-related stuff. Here is an article from Ars Technica about the revival of Web fonts, with CSS linking to downloadable fonts. While several formats are competing, there is a clear trend here. I don't see anything more to do for Novelang than serving a static file. Or am I missing something? Anyways, deploying fonts on the Web will be a huge fest of copyright issues. Downloadable PDFs with embedded fonts may raise the same issues but it's a marginal case for now. Sidenote: I had a look at the demo page in the Ars article. Camino 1.6.4 displays replacement fonts but Safari 3.1.2 does its job. In addition, Safari's Web Inspector displays the font list with a nice preview.


General-purpose text processing library

Using Novelang to produce real-world documents (I mean: more than Novelang documentation), I discovered how convenient it is for producing custom idioms without touching the main grammar. I mean: Novelang syntax supports well-known artefacts like quotes, parentheses, square brackets, punctuation signs, chapter headers, and so on. The text gets abstracted into a tree-like structure which is processed by a stylesheet that may be a custom one. The default stylesheet recognizes the "bracketed" item of the structure and outputs brackets around the text inside the "bracketed" tree fragment and everything looks fine. Now consider the case where:
  1. Your text doesn't need square brackets.
  2. You need to express something else, like a special name with special typographical effect.
Quickly, you start attributing a new effect to the square brackets. Because it corresponds to a new meaning, you just started building your own semantic markup. And, let me say it again: without touching the main grammar. It's even possible to assign different semantics to different parts of a document. From a Book you can tag an inserted Part with a special style:
insert file:mybibliography.nlp
Then the content of the Part has a style element containing the "bibliography" string. So the stylesheet may use a special template to process entries like this, where italics inside a section don't mean it's italics, but the text to sort the author list on:
=== Paul //Graham//

On Lisp [Prentice Hall]

=== Allen //Holub//

Taming Java Threads [APress]
That's incredibly lightweight compared to semantic markups like DocBook's one. The magic only comes from:
  • The choice to avoid too-specific markup whenever possible.
  • The choice of a distinct presentation layer.
With this in mind I see a chance to turn parts of Novelang into a general-purpose text processing library, with a pluggable presentation layer.


Novelang-0.11.0 released!

Latest release of Novelang available here. Multiple font directories, no need for temporary font metric files. Command-line arguments supersede system properties. See release notes for details. Update: I wrote that Sourceforge shell service was down, but it has been shut down permanently. So I'll have to change the script for uploading documentation.


Inferring fonts characteristics

Now I'm trying to display a nicer font listing. FOP does a great job, reading font files and extracting font name, style, and weight. A font name is disconnected from the font file name (though before Novelang-0.11.0 it was not the case). A font name should correspond to a typeface, which is a family of fonts. For the "Linux Libertine" font name, there can be several variants, like roman or bold+italic. But when taking a closer look at the information provided by FOP there are "virtual" font names, corresponding to the font variant of a given file, or an abbreviated name. Let's consider the files of the Linux Libertine typeface:
  • LinuxLibertine.ttf
  • LinuxLibertine-Italic.ttf
  • LinuxLibertine-Bold.ttf
  • LinuxLibertine-Bold-Italic.ttf
  • LinuxLibertine-SmallCaps.ttf
Beware of the trick: there are four files corresponding to standard style / boldness combinations plus the small capitals which can be considered as a separate font. In FOP's terminology, a font-triplet associates a font name, a style (normal / italic) and a weight (normal, bold, extra-bold...). Each triplet has a priority meaning (I guess) that triplets with higher priority should be used first when resolving a font triplet. From the five files FOP extracts the following font-triplets:
"Linux Libertine" italic, bold, p=12

"Linux Libertine" italic, normal, p=7

"Linux Libertine C" normal, normal, p=7

"Linux Libertine" normal, bold, p=5

"Linux Libertine Bold Italic" normal, normal, p=0

"LinLibertineBI" normal, normal, p=0

"Linux Libertine Bold" normal, normal, p=0

"LinLibertineB" normal, normal, p=0

"Linux Libertine Italic" normal, normal, p=0

"LinLibertineI" normal, normal, p=0

"Linux Libertine Capitals" normal, normal, p=0

"LinLibertineC" normal, normal, p=0

"Linux Libertine" normal, normal, p=0

"LinLibertine" normal, normal, p=0
This looks quite messy. Using this raw data, the font listing would reveal 14 fonts instead of the 5 expected. That is because FOP focuses on resolving font variants given a name, a style and a boldness, while each font file may contain more than one font name. Novelang has to take FOP's information and turn it upside-down to obtain a human-readable font list. First, sort all triplets by priority (like above). Let's say that all triplets with a priority greater than 0 define "good" font names: font names that are shared between triplets can be used safely to choose font variants (while there is no chance to get a variant from font names that already describe a variant, like "LinLibertineI"). Let's call those names the "clean names". In the list above we get the following clean names: "Linux Libertine" and "Linux Libertine C". Then it is easy to craft a structure like this:
"Linux Libertine"
  italic, bold, LinuxLibertine-Bold-Italic.ttf
  italic, normal, LinuxLibertine-Italic.ttf
  normal, bold, LinuxLibertine-Bold.ttf
"Linux Libertine C"
  normal, normal, LinuxLibertine-SmallCaps.ttf
The "Linux Libertine, normal, normal" font-triplet is missing. Using the clean name "Linux Libertine" it is easy to find it among the font-triplets with priority zero. If looking for perfection we can try to locate a better name for "Linux Libertine C". How? Once the clean names are established, we look for singletons in the set of font-triplets with priority greater than zero. For each of those elements, we replace the clean name by an "outstanding name" which is the longest name in the set of font-triplets with priority zero with the same font file (LinuxLibertine-SmallCaps.ttf). So now we have something like this:
"Linux Libertine"
  italic, bold, LinuxLibertine-Bold-Italic.ttf
  italic, normal, LinuxLibertine-Italic.ttf
  normal, bold, LinuxLibertine-Bold.ttf
  normal, normal, LinuxLibertine.ttf
"Linux Libertine Capitals"
  normal, normal, LinuxLibertine-SmallCaps.ttf
Now there is the temptation to show all available font names in the list, like "Linux Libertine C" as an alias for "Linux Libertine Capitals". While this would increase the complexity of the algorithm, I don't see how useful this would be. Anyways, the algorithm described above may require additional work, considering messy fonts of the real world.
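The steps above can be sketched in a few lines of Java. This is only a sketch under stated assumptions: TripletInfo is a hypothetical value object pairing a font-triplet with its font file (it is not a FOP class), FontFamilies.infer is a made-up name, and the "outstanding name" only replaces the clean name when it is longer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical value object: a font-triplet plus the file it comes from. */
final class TripletInfo {
  final String name;
  final String style;
  final String weight;
  final int priority;
  final String file;

  TripletInfo(String name, String style, String weight, int priority, String file) {
    this.name = name;
    this.style = style;
    this.weight = weight;
    this.priority = priority;
    this.file = file;
  }
}

final class FontFamilies {

  /** Groups triplets under a human-readable family name. */
  static Map<String, List<TripletInfo>> infer(List<TripletInfo> triplets) {
    // Names of triplets with a priority above zero are the "clean names".
    Map<String, List<TripletInfo>> byCleanName = new TreeMap<>();
    for (TripletInfo t : triplets) {
      if (t.priority > 0) {
        byCleanName.computeIfAbsent(t.name, k -> new ArrayList<>()).add(t);
      }
    }
    Map<String, List<TripletInfo>> result = new TreeMap<>();
    for (Map.Entry<String, List<TripletInfo>> entry : byCleanName.entrySet()) {
      String cleanName = entry.getKey();
      List<TripletInfo> variants = new ArrayList<>(entry.getValue());
      boolean singleton = variants.size() == 1;
      String singletonFile = variants.get(0).file;
      String displayName = cleanName;
      for (TripletInfo t : triplets) {
        if (t.priority > 0) {
          continue;
        }
        if (t.name.equals(cleanName)) {
          // A priority-zero triplet bearing the clean name fills a missing variant.
          if (!hasVariant(variants, t)) {
            variants.add(t);
          }
        } else if (singleton && t.file.equals(singletonFile)
            && t.name.length() > displayName.length()) {
          // Singleton family: prefer the longest name attached to the same file.
          displayName = t.name;
        }
      }
      result.put(displayName, variants);
    }
    return result;
  }

  private static boolean hasVariant(List<TripletInfo> variants, TripletInfo candidate) {
    for (TripletInfo v : variants) {
      if (v.style.equals(candidate.style) && v.weight.equals(candidate.weight)) {
        return true;
      }
    }
    return false;
  }
}
```

Fed with the fourteen triplets listed earlier, this groups four variants under "Linux Libertine" and one under "Linux Libertine Capitals". Aliases are deliberately dropped, in line with the doubts expressed above about their usefulness.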


Blogging while Sourceforge is down

I've uploaded Novelang-0.11.0.zip on Sourceforge but I'm stuck while their shell service is down: it's what I'm using to upload and unzip the Novelang website. So I'm refraining from advertising that this new version is available, as it wouldn't appear in the documentation available online. This release is not a shiny one: I just cleaned up some mess and added some tests. But now, both the daemon and the still-undocumented batch tool read command-line parameters the same way. These parameters follow the "--option=value" form. They supersede system properties, which are no less than crappy global variables making automated tests hard to write. System properties also make troubleshooting more difficult, as a badly-spelled system property fails silently. With the help of a command-line argument parsing tool, an unsupported option raises an exception at program startup, on a fail-fast basis. In order to make the resulting configuration more understandable, the log shows how the value was set:
INFO  n.c.ConfigurationTools - Recognized user-defined 
  directory '/.../Novelang/samples/hyphenation' 
  (from option: --hyphenation-dir, Directory containing 
  hyphenation files).
INFO  n.configuration.ConfigurationTools - Creating 
  DaemonConfiguration from default value [8080] 
  (option not set: --port, TCP port for daemon).
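For illustration, here is a rough sketch of the fail-fast behavior and of the precedence between command-line options, system properties and defaults. It does not reproduce the parsing tool Novelang actually uses: CommandLineSketch, parse and resolve are hypothetical names, and hand-written parsing stands in for the real library.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

final class CommandLineSketch {

  /**
   * Parses arguments of the form --option=value.
   * Throws at once on a malformed argument or an unsupported option (fail-fast).
   */
  static Map<String, String> parse(String[] args, Set<String> supportedOptions) {
    Map<String, String> options = new LinkedHashMap<>();
    for (String arg : args) {
      int equalsIndex = arg.indexOf('=');
      if (!arg.startsWith("--") || equalsIndex < 3) {
        throw new IllegalArgumentException("Malformed argument: " + arg);
      }
      String name = arg.substring(2, equalsIndex);
      if (!supportedOptions.contains(name)) {
        throw new IllegalArgumentException("Unsupported option: --" + name);
      }
      options.put(name, arg.substring(equalsIndex + 1));
    }
    return options;
  }

  /** Command-line option first, then system property, then default value. */
  static String resolve(
      String optionName,
      Map<String, String> options,
      String systemPropertyName,
      String defaultValue) {
    String fromCommandLine = options.get(optionName);
    if (fromCommandLine != null) {
      return fromCommandLine;
    }
    String fromSystemProperty = System.getProperty(systemPropertyName);
    return fromSystemProperty != null ? fromSystemProperty : defaultValue;
  }
}
```

A misspelled option such as --prot=8081 stops the program at startup, whereas a misspelled system property would just be ignored.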
A big temptation during this refactoring was to add new features. One frustrating moment was the handling of multiple font directories because of greater ambitions. But finally I managed to keep this development round short, the code got better, and having just one default /fonts directory works well in many cases. I've thought about a few potentially useful options:
  • --serve-shutdown (daemon only): by now this HTTP request shuts the daemon down: /~shutdown.html. This should be disabled by default, and enabled only with --serve-shutdown option.
  • --serve-remote (daemon only): by now any remote computer may access the daemon (unless there is some firewall preventing it). The default behavior should be to restrict access to localhost, unless explicitly stated otherwise.
  • --flatten-output (batch only): by now the batch tool renders documents with the same path as source documents. This may cause annoying tricks to get generated files.
  • --sources-dir (batch and daemon): the directory to resolve document sources from.


URL for querying metadata

So I'm thinking again on how to list all the fonts available from a given document, considering the stylesheet as the best place to define font directories. This may not be such a futile exercise as it draws the question about document metadata and the URI syntax to query it. By now the best I found is:
I like it because:
  • URI parser can detect that "~fonts" makes no sense unless the MIME type is PDF.
  • It doesn't mess the URI parameters which are about the document itself (not its metadata).
  • It's easy to extend with other functions like word count or whatever, with no risk to create incompatible options.
I took a different way for error pages: these URIs look like
But meanwhile I had to find a workaround for displaying directories: a pseudo -.html document. So we could get a unified way to display metadata through some kind of "service" (including errors): /~fonts/my.pdf /~error/broken.pdf
Or, with Safari: /~fonts/my.pdf/-.html /~error/broken.pdf/-.html
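To check that such a scheme stays easy to parse, here is a small sketch splitting a request path into a "service" and a document path. MetadataRequest and its fields are hypothetical names, and the handling of the trailing /-.html is only my reading of the Safari workaround above.

```java
/** Hypothetical result of parsing a path like /~fonts/my.pdf. */
final class MetadataRequest {
  /** Service name such as "fonts" or "error", or null for a plain document. */
  final String service;
  /** Path of the document the request is about. */
  final String documentPath;

  private MetadataRequest(String service, String documentPath) {
    this.service = service;
    this.documentPath = documentPath;
  }

  static MetadataRequest parse(String path) {
    String service = null;
    String document = path;
    if (path.startsWith("/~")) {
      int slash = path.indexOf('/', 2);
      if (slash > 2) {
        // "/~fonts/my.pdf" gives service "fonts" and document "/my.pdf".
        service = path.substring(2, slash);
        document = path.substring(slash);
      }
    }
    // Drop the pseudo-document suffix, if any.
    String suffix = "/-.html";
    if (service != null && document.endsWith(suffix)) {
      document = document.substring(0, document.length() - suffix.length());
    }
    return new MetadataRequest(service, document);
  }
}
```

A path like /~fonts.pdf has no second slash, so it still falls through as a plain (pseudo-)document and the existing URLs keep working.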


How to display font listing

Yes, this is yet another post about fonts. As I found how to get more information on available fonts, I'm gathering some ideas on the best way to display them. By "font" I mean the combination of a font family (like "Verdana"), a weight (like "extra bold"), and a style (like "italic"). Each font has a font family name, and is backed by one file (though there can be several fonts in one file, like with OpenType fonts). FOP calls such a combination of font family, weight and style a "font triplet". So it's important to name the fonts clearly, giving all the characteristics of the font triplet, and the name of the font file (with a path relative to the project's root). There should be a warning when the same font triplet is defined more than once, which could be some small red symbol. A nice feature is to display characters that are supported in source documents. This is partially supported by now (the SupportedCharacters doesn't get'em all). Many fonts are documented in a way like this using a table; a table is nice because empty cells show missing characters. There should be a small text showing how the font renders. A sentence with all roman alphabet letters ("the quick brown fox jumps over the lazy dog") is not enough because it contains no accent. The best language-insensitive display I've seen is the mixed-case alphabet ("AaBbCcDdEeFfGgHh..."). Because such a table takes much space, we can show one font per page. Because there will be many pages, the first page should list available fonts by family, with hyperlinks. As information about broken fonts becomes available, those should be listed on the first page, preferably with a red symbol aside.


FOP and fonts, the story goes on

Now I've a better understanding of how FOP handles fonts and how to get its precious information about font list, duplicates, and failures. In the PrintRendererConfigurator, the #buildFontListFromConfiguration static method does (almost) all the job. It takes the following input parameters:
  • A Configuration object with a <renderer> as root element.
  • A URL (as a String) to resolve relative font URLs with. Null is supported.
  • A FontResolver, which can be a DefaultFontResolver.
  • A boolean set to true if an exception should be thrown if an error is found.
  • A FontCache instance or null if caching is disabled.
As it is a static method, #buildFontListFromConfiguration can be called from everywhere with a fresh FontCache instance. The latter is useful as it gathers failed fonts. A fresh cache instance is needed, because cached data may survive the JVM. The cache saves itself in a ~/.fop/fop-fonts.cache file, holding font descriptions. Font descriptions seem to be only invalidated when FOP attempts to load a font for "real" (at rendering time). So when hitting the cache in a test program, it sometimes returned font descriptions that shouldn't have been there. The FontResolver requires a FOUserAgent which is created from a FopFactory. The FopFactory itself is created from a Configuration which contains the <renderer> elements, so there should be some instance reuse here. I've found a private method somewhere which logs font duplicates but I can't find it again to see if there was any hook around (didn't seem so). Anyways it will be cleaner to sort out font triplets with the same value. Printing the font triplets on the console, I noted they take the right font name, whatever the font file is. Adios, proprietary font naming convention!

Font listing revisited

In a previous post, I found some good reasons to embed font directory list inside a stylesheet. With such an approach, there is no centralized declaration of font list, so font listing with http://localhost:8080/~fonts.pdf becomes unavailable. That's bad news, since font listing is incredibly useful to debug documents, especially when there are broken fonts (the Web is full of them, waiting to be downloaded). With font list inside the stylesheet, font listing requires the stylesheet itself as a parameter. This breaks current URL scheme. By now it's possible to tell Novelang to use a given stylesheet through the book itself, or with a URL parameter. So it makes sense to use document URL as a parameter for the font listing. Obviously, this is a common use case. I'm thinking about something like:
This is less elegant than current solution but if there are many font directories (like one per font) this helps to reduce the list length to what's really used. The /~fonts.pdf pseudo-document may stay useful, listing all the fonts under the project directory, using a deep directory scan.

'External' directories

Previous post highlighted that Novelang should not allow a reference to a directory out of its project. We'll call such a directory an external directory. The reason is, Novelang could be used (in a distant future) as an embedded component in a Web application where users upload their own source documents and stylesheets. A malicious stylesheet could exploit some special FOP behavior to embed a file that it is not supposed to, like a password file, or just another user's document. By now Novelang just filters HTTP queries, especially those for directory listing. There is no check on the path of images or fonts that FOP tries to embed. Enforcing file access restriction is a great subject by itself. How to handle resource access, depending on the current Novelang project? How to test security in general? Those points arise as I'm writing, but the initial topic of this post is: how to let a project access a directory out of its scope, let's say, in case of multiple projects sharing the same data like fonts on a privately-owned local filesystem? This may be achieved using Un*x symbolic links, depending on whether Java supports them. A more portable solution could be to set a system option like:
"System option" means it is defined outside of a Novelang book (through command-line or system properties). Then, one can reference such directories as any other directory inside the project using variable expansion:
insert file:${extdir:greeting}/salute.nlp
Too bad! Until now I avoided variable expansion which makes everything unreadable. Variable expansion makes sense if you want to restrict access to images in a given context, while not giving access to greetings. This doesn't make sense. After all, it's enough to give access to some external directories with no other kind of ceremony:
Then we let a Novelang book or stylesheet reference them:
insert file:../common/text/salute.nlp
By the way, this could be done using filesystem permissions, but they are not portable across systems. Anyways, as I don't see many use cases, implementing such a feature has the lowest priority by now.

Opening access to FOP configuration?

As I'm getting closer and closer to supporting multiple font directories, the problem arises of how to define them. It seems logical to extend the current convention, passing several paths to the novelang.fonts.dir VM argument, separated by the platform's path separator. On Un*x it would look like:
But the path separator highlights that font definition becomes system-dependent (it's a semicolon on Windows). And anyways defining the fonts in the command line is illogical as fonts are part of the rendering. So I'm thinking of embedding font directory names as XSLT metadata (this idea was already mentioned). I explored the possibility to embed the whole FOP configuration itself, which is XML, also. But opening direct access to FOP configuration would leave the opportunity to do weird things:
  • Font cache configuration.
  • Default page settings. This makes only sense when configuration is accessed by multiple XSLT.
  • Title of the PDF document. The probable need to get this title from a source document (like the Book) would make this approach redundant.
On the other hand, stuff like hyphenation directories, ICC profiles and target resolution for bitmap images make sense. But one good reason to let Novelang keep hands on everything passed to FOP is to ensure that every directory is a subdirectory of current project, therefore preventing security threats. So we could have something like:
  <nmeta:fop version="1.0" >
      <renderer mime="application/pdf" >
          <directory recursive="true" >
        <filterList type="image" >



This really looks like FOP configuration (see FOP documentation for details), but what's not shown is all forbidden stuff. So we end up with the best of the two worlds.
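The "every directory is a subdirectory of current project" check could be sketched like this. ProjectScope and isInsideProject are hypothetical names, and the sketch only normalizes paths textually: it does not resolve symbolic links, which a real check would probably need to do (for example with Path#toRealPath).

```java
import java.nio.file.Path;

final class ProjectScope {

  /**
   * Tells whether a directory declared in the metadata stays under the project root.
   * Relative candidates are resolved against the root; ".." segments are collapsed
   * before the comparison. Symbolic links are NOT resolved in this sketch.
   */
  static boolean isInsideProject(Path projectRoot, Path candidate) {
    Path root = projectRoot.toAbsolutePath().normalize();
    Path resolved = root.resolve(candidate).normalize();
    // Path#startsWith compares whole name elements, so "/project-other"
    // is not considered to be inside "/project".
    return resolved.startsWith(root);
  }
}
```

With such a check, a directory like fonts/serif passes while ../common/fonts is rejected before anything reaches FOP.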


FOP and fonts

FOP makes me feel dumb because it is great. In a previous post I already mentioned that FOP's font handling was better than I first thought, so I could have been wrong forcing a custom font naming. As a matter of fact, FOP holds a list of fonts inside the FontCache of its FopFactory. Novelang instantiates the FopFactory so it has full hands on it. Using a Java debugger, let's look at what a FontCache contains.
fontMap = {java.util.HashMap}
  [0] = {java.util.HashMap$Entry} 
    key: java.lang.String
    value: org.apache.fop.fonts.CachedFontInfo
      lastModified = 1042610200000
      metricsFile = {java.lang.String}
          "file:/…/.fop-font-metrics-14045 \
      embedFile = {java.lang.String}
      kerning = true
      fontTriplets = {java.util.ArrayList} 
        [0] = {org.apache.fop.fonts.FontTriplet} 
          name = {java.lang.String} 
          style = {java.lang.String} "normal"
          weight = 400
          priority = 0
          key = {java.lang.String}
failedFontMap = {java.util.HashMap}
Sweet! Here is everything I need:
  • Font name.
  • Font style, "italic" or "normal".
  • Weight. Not just "normal" or "bold" but "light" and "extra-bold".
  • Priority for dealing with duplicates
  • A list of fonts which could not be read (failedFontMap).
The moderately bad news is, fontMap and failedFontMap are private fields but I see no reason to not use dirty reflection here. The example above is biased as it was created from a Novelang-generated font list, so I'll have to investigate a bit more to see how FOP deals with:
  • Failed fonts.
  • Font name different from font file name.
  • Multiple directories (including nested ones).
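The "dirty reflection" mentioned above would look more or less like this. To keep the sketch self-contained it reads a private field of a stand-in class instead of FOP's FontCache; only the failedFontMap field name comes from the debugger dump, while StandInCache and PrivateFieldReader are made up for the example.

```java
import java.lang.reflect.Field;
import java.util.HashMap;
import java.util.Map;

/** Stand-in for a class with a private map, NOT the real FOP FontCache. */
final class StandInCache {
  private final Map<String, String> failedFontMap = new HashMap<>();

  StandInCache() {
    failedFontMap.put("broken.ttf", "could not be read");
  }
}

final class PrivateFieldReader {

  /** Reads a private field declared by the target's own class. */
  static Object read(Object target, String fieldName) {
    try {
      Field field = target.getClass().getDeclaredField(fieldName);
      // Bypasses the "private" modifier; may be refused by a security policy.
      field.setAccessible(true);
      return field.get(target);
    } catch (ReflectiveOperationException e) {
      // A renamed or removed field should be reported loudly.
      throw new IllegalStateException("Cannot read field " + fieldName, e);
    }
  }
}
```

The obvious drawback is that a field renamed in a later FOP release only shows up at run time, hence the exception carrying the field name.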
To sum up, FOP provides all I need to make Novelang code cleaner and bring following enhancements:
  • Multiple directories.
  • List of failed fonts.
  • Warning in case of duplicates.
  • Fonts sorted by font name.
  • Throw away temporary .fop-font-metrics directory.
  • Cache of font descriptions handled by FOP itself (this is the meaning of the lastModified field in the FontCache).
  • Support more font types. By letting FOP do its job we let its FontFileFinder recognize the following font files: *.ttf for TrueType, *.pfb for Type 1. The *.otf suffix also appears and those fonts may be treated as TTF.

Novelang-0.10.0 released!

Latest release of Novelang available here. Coolest feature: barcode generation for PDF! Sounds like a gadget at this stage of development but I needed it somewhere else and it didn't cripple the architecture. See release notes for details. Enjoy!


Problem with 'œ' and 'Œ' characters

By now, French users of Novelang willing to type "œ" and "Œ" need to type "«oelig»" and "«OElig»" (yes, angled quotes included). That's especially boring for Mac users who are eager to just type Alt-o and Shift-Alt-O. The Unicode specification makes œ and Œ ('LATIN SMALL LIGATURE OE' and 'LATIN CAPITAL LIGATURE OE') part of the Latin Extended-A block. All other letters with French accents are part of Latin-1 Supplement. Unfortunately, the commonly-favoured ISO-8859-1 encoding doesn't include "œ" and "Œ". As a consequence, while those characters may appear in a text editor configured to save files in ISO-8859-1 encoding, they'll appear as question marks when reopening the document. The Latin-1 supplement seems to offer characters that look the same: 'STRING TERMINATOR' (U+009C) and 'PARTIAL LINE BACKWARD' (U+008C). But I don't think it's a good idea to use them as their name suggests they have another purpose. Googling on "latin-extended-b iso-8859-1" I discovered this page listing all differences between ANSI (aka Windows-1252), Mac Roman and ISO-8859-1. Very useful! It seems that ISO-8859-1 was not such a clever choice, but I can't find any multiplatform 8-bit encoding including every commonly used French character.
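The question marks are easy to reproduce with the JDK alone. Here is a small sketch (OeLigatureCheck is a made-up name) showing that ISO-8859-1 cannot encode "œ" and what happens on a save-and-reopen round trip:

```java
import java.nio.charset.StandardCharsets;

final class OeLigatureCheck {

  /** Tells whether ISO-8859-1 has a code for the given character. */
  static boolean latin1CanEncode(char c) {
    return StandardCharsets.ISO_8859_1.newEncoder().canEncode(c);
  }

  /**
   * Simulates saving then reopening a text in ISO-8859-1.
   * String#getBytes silently replaces unmappable characters by the
   * charset's replacement byte, which is '?' for ISO-8859-1.
   */
  static String roundTripLatin1(String text) {
    byte[] saved = text.getBytes(StandardCharsets.ISO_8859_1);
    return new String(saved, StandardCharsets.ISO_8859_1);
  }

  public static void main(String[] args) {
    // Unicode escapes keep this source file independent of its own encoding.
    System.out.println(latin1CanEncode('\u00e9'));       // e with acute accent
    System.out.println(latin1CanEncode('\u0153'));       // oe ligature
    System.out.println(roundTripLatin1("c\u0153ur"));    // "coeur" with the ligature
  }
}
```

The accented "é" survives the round trip, the ligature does not.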


git's "detached head"

git is just the best version control tool, ever, period. Most of the time, it's straightforward. Sometimes you have to really understand what happens. A few days ago I synchronized with another repository on a USB key. Then I resumed my work. While requesting git status I saw many times that there was no current branch. I decided to fix that by forcing the current branch, issuing a git checkout master. Then all the work I did since the synchronization appeared to be lost on both the git repository and my local filesystem! As those commits happened, I was pretty sure they were somewhere inside my git repository. Reading the git doc carefully, I learned that I was working with a "detached head" (poor me).
It is sometimes useful to be able to checkout a commit that is not at the tip of one of your branches. [...] The state you are in while your HEAD is detached is not recorded by any branch (which is natural --- you are not on any branch). What this means is that you can discard your temporary commits and merges by switching back to an existing branch.
Clearly, I messed up the merge in some way. The way to recover was mentioned: the "reflog" kept track of every change (including those not attached to a branch) until a git prune or a git gc. Here is what I did. First, read the reflog and find the last "lost" commit with my bare eyes:
$ git log -g --after=2008-08-14
Appeared to be: 991ee3ebc11e1dc3434fab4c22e261b7e0711346. This time I created a branch:
$ git branch rescue_2008-08-23 
$ git checkout rescue_2008-08-23
Now get back every "lost" stuff with one single command (commits are chained):
$ git checkout 991ee3ebc11e1dc3434fab4c22e261b7e0711346
This looked good. I committed inside the "rescue_2008-08-23" and switched back to the "master" branch:
$ git commit -a
$ git checkout master
Merge happened seamlessly as a "fast forward", wow!
$ git merge rescue_2008-08-23

Updating 13d16f3..31626f8
xxx: needs update
Fast forward
[every "lost" change listed here, there were many!]
This mess happened because I had some problems during the merge I didn't try to understand. Next time if I get such problems I'll do all the mess in a new branch of the target repository, and then perform a second merge.



I just threw Barcode4J's library files into Novelang and in the next version you'll be able to include various kinds of beautiful barcodes in your PDFs. I could say: "look how powerful I am!" but as a fervent Novelang blogreader you know now who's deserving the fame. PDF-embedded barcodes in a FO stylesheet require such a namespace declaration:
<xsl:stylesheet version="1.0"
And the barcode itself looks like this:
      <barcode:barcode message="L loves L!">
Of course the "L loves L!" message could be replaced by something more serious like an EAN-13 barcode (the one used for ISBNs). In this case the <barcode:datamatrix> element becomes <barcode:ean-13> but you get the idea (datamatrix looks very pretty). Barcode4J's documentation is excellent so at best I would do some copy-paste. Just one piece of advice of mine: in order to avoid FOP warnings you should add the barcode: namespace in front of each element. Regarding Novelang's development roadmap, this barcode feature may look a bit alien but I was in need of it and it doesn't cripple Novelang architecture or grammar at all, just a few more files in the lib/ directory. The only problem I ran into was some missing SVG-related classes that appeared to be in the xml-apis-ext.jar file in Batik-1.7 that I renamed into batik-xml-apis-ext-1.7.jar. It's ok now and I won't complain if projects like FOP or Batik come with many jar files that help to understand what's doing what. Barcode4J is definitely sweet and makes your project shine. Long live its developers.

Book configuration as rendering parameter

By now the ?stylesheet=... request parameter proved especially useful to render a single Part file, using a stylesheet that can be globally defined at Book level (with the mapstylesheets command). As I blogged, the Book file heads towards carrying more and more configuration stuff. So, a ?book-configuration=... request parameter would make sense, in order to reuse the Book's configuration, like stylesheet, encoding, hyphenation language and probably more. Because a Book could define Part-specific configuration, those must be taken into account if the Part is one of those included by the Book.

About handling multiple languages

In my previous post I raised the subject of multiple languages. In order to keep things simple, one Part file should have one language, no more. The Book file provides the context for file encoding and hyphenation language.
insert file:in-english.nlp 
The language could become a parameter to pass to every renderer (especially FO stylesheets!).

Hyphenation at work

With Novelang-0.9.0 comes hyphenation support for PDFs. Basically, it's just passing a directory containing hyphenation files to FOP (the PDF generator). When trying to make hyphenation work for real, I hit several problems. It was with the fr.xml hyphenation rules, which fortunately come under the GPL license. Hyphenation takes care of apostrophes. That's because with a "remain character count" of three it is correct to hyphenate a word like "l'attrait" like this: "l'at-trait". The fr.xml was quite clear, with many occurrences of the APOSTROPHE character (U+0027) which is also called "single quote" and looks symmetrical. But hyphenation occurs after the FO-generating XSL replaced the <apostrophe-wordmate> element by the RIGHT SINGLE QUOTATION MARK character (U+2019) which looks better than APOSTROPHE, but was not understood by hyphenation rules, causing a potential hyphenation bug on every word with a "relooked" apostrophe. I spent much time trying to hack the rules which were correct, and finally the solution was to replace every APOSTROPHE by RIGHT SINGLE QUOTATION MARK (the &#x2019; XML entity). Because hyphenation worked better, it changed the word distribution and raised another problem: some proper nouns got hyphenated. FOP documentation tells about an <exceptions> element containing words to not hyphenate at all. First it didn't work and I had to trace into FOP code to find out that every word in the exception list should be lower-cased. So Novelang could support:
  • An exception list declared in the Book file itself.
  • Automatic replacement of the quoting character.
As the quoting character may vary (as it is defined in the stylesheet) this implies a metadata mechanism with the stylesheet exposing which character it uses. Such a mechanism would be useful for plenty of other things, like expected image resolution for automatic resampling. When there is no licensing issue preventing the distribution of the hyphenation file, there could be built-in files providing standard stuff (including the easy-to-forget hyphenation.dtd). Generating temporary files may seem inelegant but it makes debugging easier than in-memory structures and playing with custom URL protocols. Hyphenation would get really simple for French users! Now this opens another interesting question: how to handle documents with several languages?
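The two automatic treatments could be as small as the sketch below. HyphenationTweaks and its methods are hypothetical names; the sketch works on plain strings and leaves aside how Novelang would read the pattern file or learn the quoting character from the stylesheet.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class HyphenationTweaks {

  /** Replaces every APOSTROPHE (U+0027) by the stylesheet's quoting character. */
  static String replaceApostrophes(String patterns, char quotingCharacter) {
    return patterns.replace('\'', quotingCharacter);
  }

  /**
   * Lower-cases the words of an exception list, since the words of
   * the exception list have to be lower-cased to be taken into account.
   */
  static List<String> normalizeExceptions(List<String> words) {
    List<String> normalized = new ArrayList<>();
    for (String word : words) {
      // Locale.ROOT avoids surprises from the platform's default locale.
      normalized.add(word.toLowerCase(Locale.ROOT));
    }
    return normalized;
  }
}
```

Calling replaceApostrophes(patterns, '\u2019') covers the RIGHT SINGLE QUOTATION MARK case described above; another stylesheet could pass another character.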



I've been proud of Novelang's support for custom fonts and now I'm reading this: FOP, the PDF generator used by Novelang, is able to scan multiple directories for fonts. It is even able to use system fonts (based on a directory scan, though). http://xmlgraphics.apache.org/fop/0.95/fonts.html#register I still believe that using system fonts is error-prone. I wonder if FOP is able to aggregate in the same family various files representing different weights and styles. Anyway, Novelang needs a font list for the font listing, which is really useful, especially when there are some broken fonts somewhere. But FOP should hold such a list somewhere internally. Now I'm tempted to give access to the FOP configuration file in some way instead of making FOP "transparent". Such transparency fails anyway, because the naming convention for fonts (Xxxx-bold-italic.ttf) was influenced by how FOP works.


Novelang-0.9.0 released!

Latest release of Novelang available from here. Coolest features of this release:
  • Custom fonts.
  • Hyphenation support.
  • New $style parameter for the insert function.
  • Superscript.
See release notes (in the "Status" chapter) for details. Enjoy!


GPL-friendly fonts: Bitstream-Vera

It's not the purpose of Novelang to come with its own set of fonts, and this could turn licensing into a mess. But testing requires a set of valid fonts and, since the project is public, I'm reluctant to put a font I have no right to distribute under version control. Luckily, I discovered the Bitstream Vera fonts, which come with a GPL-compatible license (they're used in a Linux-related project). Those fonts are designed for screen display rather than printing, but they behave quite well and provide a very complete family, which is perfect for various tests. If you're interested you can download them from here. Nice work, guys!

Extending standard stylesheet functions

The eXtensible Stylesheet Language for Transformations comes with its own set of functions, but Xalan, the XSLT processor shipped with Novelang, supports additional functions. Here is how it works, for a function that converts numbers into words. First we define a static method (the simplest approach) in some Java class, doing the conversion we need. Parameters are: the number, the name of the language, and whether we want lower case, upper case or capitals.
package novelang.rendering.xslt;

public class XsltFunctions {

  public static String numberAsText(
      Object numberObject,
      Object localeNameObject,
      Object caseObject
  ) {
  // ...

The class must appear in the Novelang classpath (Java developers know what it means). In the stylesheet we add a special namespace that we call "nlx" like "NoveLang eXtensions":
Here is what the call to convert a number into words looks like:
      select="nlx:numberAsText(43,'EN','capital')" />
Of course function calls (like position()) can replace our literal number ("43"). The complete example is here and also contains a nice trick for hiding page numbers when they are not welcome. This function is useful for giving a special touch to lists or chapter numbers but we can imagine many other uses.
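To give an idea of what the body of such a function may look like, here is a minimal sketch. It is not the actual Novelang code: the NumberAsTextSketch class is a hypothetical stand-in, it only knows English numbers from 0 to 99, and the "upper" case name is an assumption (only "capital" appears in the call above). Parameters are declared as Object, like in the signature above, because the XSLT processor may hand over a Double or a String.

```java
import java.util.Locale;

/** Hypothetical stand-in for the real conversion function. */
final class NumberAsTextSketch {

  private static final String[] UNITS = {
      "zero", "one", "two", "three", "four", "five", "six", "seven",
      "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
      "fifteen", "sixteen", "seventeen", "eighteen", "nineteen" };

  private static final String[] TENS = {
      "", "", "twenty", "thirty", "forty", "fifty",
      "sixty", "seventy", "eighty", "ninety" };

  static String numberAsText(
      Object numberObject,
      Object localeNameObject,
      Object caseObject
  ) {
    // Accept a Number (like a Double) as well as a String.
    int number = numberObject instanceof Number
        ? ((Number) numberObject).intValue()
        : (int) Double.parseDouble(String.valueOf(numberObject));

    // Outside of what this sketch supports, fall back to digits.
    String localeName = String.valueOf(localeNameObject);
    if (!"EN".equalsIgnoreCase(localeName) || number < 0 || number > 99) {
      return String.valueOf(number);
    }

    String words = number < 20
        ? UNITS[number]
        : TENS[number / 10] + (number % 10 == 0 ? "" : "-" + UNITS[number % 10]);

    // "upper" is an assumed name; anything unknown means lower case.
    String caseName = String.valueOf(caseObject);
    if ("upper".equalsIgnoreCase(caseName)) {
      return words.toUpperCase(Locale.ROOT);
    }
    if ("capital".equalsIgnoreCase(caseName)) {
      return Character.toUpperCase(words.charAt(0)) + words.substring(1);
    }
    return words;
  }

  public static void main(String[] args) {
    System.out.println(numberAsText(43, "EN", "capital"));
  }
}
```

Falling back to digits for unsupported languages or ranges is a design choice of the sketch: a stylesheet then still renders something sensible instead of failing.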

Custom fonts in PDFs!

Custom font support in PDFs now works and will be available in the next release. Basically, here is all you have to do:
  • Create a fonts directory at the root of your Novelang project.
  • Copy all the True Type fonts (.ttf files) you need there, suffixing the font name with bold.ttf, italic.ttf, and bold-italic.ttf, according to the corresponding style and weight.
  • Check that all fonts are healthy by requesting the font listing from Novelang with your Web browser.
  • Enjoy, and set the font-family attribute in your stylesheets.
The URL for listing fonts is:
The listing displays all recognized fonts with their name, the file name, and most of the characters supported by the Novelang grammar. The font directory may be set explicitly with the novelang.fonts.dir system property. FOP, the PDF generator, needs to create a font metrics file for each font. By default these files are created in a fop-metrics directory under the Novelang project root, but another place can be chosen by setting the novelang.fop.fontmetrics.dir system property. If there is something wrong with any font in the directory, there is no custom font at all for now, and the error message only appears in the log file. If you download free fonts from the Internet you will quickly learn that many are full of bugs and missing letters (especially accented ones), so I recommend adding them one by one and restarting Novelang each time.
FOP has its own limitations when dealing with fonts. True Type fonts define regular, bold, italic and bold + italic as four different fonts. Operating systems and desktop applications show the families as a whole (but you know, they lie all the time). In order to remain platform-independent, Novelang uses the simplest convention. At first it seemed inconvenient to require a copy of every needed font, but now it appears to me as a good thing. People using publishing tools often complain about a missing font, or a buggy one. Making font files a part of your Novelang project, with the same sharing and backup strategy as for content and stylesheets, is a clear way to gain robustness. Possible usability improvements:
  • List broken fonts in the font listing.
  • Support several font directories.
  • Support True Type Collections (FOP does that).
  • Support Type One fonts (FOP does that).
  • Detect font file change on the disk and therefore provide an updated listing. Restarting Novelang after adding a font wouldn't be required anymore.
  • Use a temporary directory for font metrics files.
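The file naming convention described in this post can be illustrated with a small sketch. It assumes a hyphen between the family name and the suffix (as in Xxxx-bold-italic.ttf); the FontFileName class is a hypothetical name and says nothing about how Novelang really scans the fonts directory.

```java
import java.util.Locale;

/** Hypothetical reading of the font file naming convention. */
final class FontFileName {

  final String family;
  final boolean bold;
  final boolean italic;

  private FontFileName(String family, boolean bold, boolean italic) {
    this.family = family;
    this.bold = bold;
    this.italic = italic;
  }

  /** Splits a name like "Vera-bold-italic.ttf" into family, weight and style. */
  static FontFileName parse(String fileName) {
    if (!fileName.toLowerCase(Locale.ROOT).endsWith(".ttf")) {
      throw new IllegalArgumentException("Not a True Type font file: " + fileName);
    }
    String base = fileName.substring(0, fileName.length() - ".ttf".length());

    // Check the longest suffix first: "-bold-italic" also ends with "-italic".
    if (base.endsWith("-bold-italic")) {
      return new FontFileName(strip(base, "-bold-italic"), true, true);
    }
    if (base.endsWith("-italic")) {
      return new FontFileName(strip(base, "-italic"), false, true);
    }
    if (base.endsWith("-bold")) {
      return new FontFileName(strip(base, "-bold"), true, false);
    }
    return new FontFileName(base, false, false);
  }

  private static String strip(String base, String suffix) {
    return base.substring(0, base.length() - suffix.length());
  }

  public static void main(String[] args) {
    FontFileName font = parse("Vera-bold-italic.ttf");
    System.out.println(font.family + " bold=" + font.bold + " italic=" + font.italic);
  }
}
```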


Decorations revisited

In the previous post about URLs we extended the decoration concept to URLs. We stated that some indentation tells "this is a decoration for the construct right below". As this impacts the way identifiers are defined, here is an updated example with 4-space indentations.
== Chapter one

    @tag-1 @tag-2
=== Section one

Lorem ipsum dolor sit amet, consectetuer adipiscing elit.
Nunc vulputate, elit ac accumsan sodales, libero nisi 
euismod erat, a semper dolor turpis non pede. 
Donec sem ligula, congue id, porta et, tincidunt quis, 
eros. Praesent ipsum. Ut at urna. Proin cursus condimentum
risus. Fusce at lacus tincidunt 
    [Visit Novelang website]
metus tristique dictum. Pellentesque habitant morbi 
tristique senectus et netus et malesuada fames ac turpis 
egestas. Suspendisse potenti. Aliquam id quam. Quisque 
pellentesque est vitae est. Morbi faucibus ornare ligula. 
Pellentesque sed mi non elit vehicula ullamcorper. 
Pellentesque habitant morbi tristique senectus et netus 
et malesuada fames ac turpis egestas. Nunc vel eros nec 
leo mollis adipiscing.

Pellentesque mollis, quam et tincidunt vulputate, ligula 
lectus ullamcorper lacus, non sagittis lorem lectus ut 
tellus.Praesent diam mi, convallis et, pharetra sed, 
tempus in, lacus. Integer aliquet, augue ac vestibulum 
sollicitudin, ligula erat molestie eros, sed feugiat diam 
felis et odio.

=== Section two

In orci elit, porta id, volutpat ac, ornare sit amet, 
felis. Mauris vel ipsum eget mi gravida pellentesque. 
Vestibulum et pede et mi lobortis cursus. 
Phasellus fermentum, odio non auctor placerat, nisi pede 
aliquam nisi, in ultrices leo mi vitae risus. 
Lorem ipsum dolor sit amet,  
    "consectetuer adipiscing elit"
url-ref: \\novelang-website
. Quisque eu neque ac lectus consectetuer pharetra. Nulla 
rhoncus elementum mi. Phasellus vitae diam. Class aptent 
taciti sociosqu ad litora torquent per conubia nostra, per 
inceptos himenaeos. Sed bibendum, sem nec consectetuer 
laoreet, ante felis aliquam metus, non placerat nunc erat 
vitae dolor. 

URL syntax

By now URLs must appear as a standalone paragraph. This is correct:
Go to:

http://novelang.sourceforge.net

And see all useful links.
But this is incorrect:
Go to http://novelang.sourceforge.net and see all useful links.
There is a good reason to keep the URL on its own line: most text editors make it easy to copy a whole line of text, so there is less chance of dropping some characters when moving the URL inside the text or copy-pasting it to a Web browser. Aside from this, "http", ":" and "//" are legal Novelang grammar constructs (as word, punctuation sign, and start of italics, respectively), so a hint on where to find a URL makes the Novelang grammar much simpler. I've looked at the way Markdown and WikiCreole define hyperlinks. Both require too many delimiters; I prefer to leverage the fact that a URL sits on its own line. The good thing to keep from Markdown is labelling (give a label to some URL and reuse it later through this label); one day Novelang will do the same through identifiers. So let's say this could become legal:
Go to 
http://novelang.sourceforge.net
and see all useful links.
There are at least two features missing: text for URL and advisory title (the one appearing in a tooltip). URL text and advisory title are about "decorating" some Novelang construct. This was previously discussed for identifiers. We could get something like:
Go to 
  "Novelang website" 
  [Novelang website on Sourceforge.net]
http://novelang.sourceforge.net
and see all useful links.
I don't want to use new delimiters for URL text and advisory title, so as not to turn the Novelang grammar into some new flavor of XML. So let's say double quotes are for URL text and square brackets for advisory title. The good news is that this notation is consistent with the way paragraphs are decorated with identifiers. This strengthens the meaning of indentation as "here stands metadata stuff for the thing right below". So we're reversing the previous decision to put chapter and section decorations below the header. It's amazing to see how keeping a grammar consistent carries the shockwave of small changes over long distances: here it was about adding URL text, and now we're revising the way we write chapter and section headers.
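The current rule (a URL must be a standalone paragraph) is easy to illustrate outside of the grammar. The sketch below is not how Novelang's ANTLR grammar does it: StandaloneUrls is a hypothetical class, it treats blank lines as paragraph separators, and it only knows http and https, which is an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical illustration of the "URL as standalone paragraph" rule. */
final class StandaloneUrls {

  // Assumption: only http and https, and a URL never contains whitespace.
  private static final Pattern URL = Pattern.compile("https?://\\S+");

  /** Returns the paragraphs of the source that consist of one URL and nothing else. */
  static List<String> find(String source) {
    List<String> urls = new ArrayList<>();
    // Paragraphs are separated by at least one blank line.
    for (String paragraph : source.split("\\n\\s*\\n")) {
      String trimmed = paragraph.trim();
      if (URL.matcher(trimmed).matches()) {
        urls.add(trimmed);
      }
    }
    return urls;
  }

  public static void main(String[] args) {
    String correct =
        "Go to:\n\nhttp://novelang.sourceforge.net\n\nAnd see all useful links.";
    String incorrect =
        "Go to http://novelang.sourceforge.net and see all useful links.";
    System.out.println(find(correct));
    System.out.println(find(incorrect));
  }
}
```

Because "http", ":" and "//" inside a paragraph are never looked at, they remain plain word, punctuation sign and start of italics.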


Novelang-0.8.0 released!

Latest release of Novelang can be downloaded here. Coolest features of this release: See release notes (in the "Status" chapter) for details.



Priorities are:
  1. Bug fixing.
  2. Documentation.
  3. Error handling.
  4. New features.
Stability is the key feature for adoption. Novelang will grow slowly, and won't advertise a lot until all the features I think necessary are present. I don't want to harass my testers with bugs or missing features I already know about. Short-term improvements:
  • Report location on every error.
  • Try to recover on unmatched delimiter (like missing closing parenthesis).
  • Document some tricks.
Requirements for 1.0:
  • Better URLs: inside paragraphs, alt and text properties.
  • Fix potential punctuation problems.
  • Lists, ordered and unordered.
  • Images. May turn to an infinite feature list -- be careful!
  • Curly braces and angle brackets (used for footnotes and index entries).
  • Identifiers. These are needed for generating the table of contents.
  • Bold, small caps, superscript, subscript, a few levels of headers below section.
  • "Beautiful" PDF generation with a look inspired by Manning's books, table of contents, index and so on.
After release 1.0 there are two different paths to follow in parallel (will depend on feedback I suppose):
  • Improve content generation.
  • Open Novelang to other developers, as an embeddable / extensible software component.
Content generation improvements include:
  • Identifier-based inclusions.
  • Multi-document output (useful for generating web sites with several pages).
  • Resource scan for automatic copy in batch mode.
  • Some optimizations for speed / memory consumption.
Componentizing Novelang means:
  • Remove the dependency on Jetty and rely on the pure Servlet API.
  • Pluggable tree manipulation functions. For now such a mechanism is used internally but it deserves to be opened. Would require some Generics to support a custom Environment class.
  • Extensible grammars. Thanks to ANTLR 3.1 it will be possible to write a grammar reusing parts of an existing one. So developers could write their own additions to Novelang's standard grammar, while ANTLR performs all consistency checks. By the way, making a grammar evolve is not a quiet game.
  • Extensible grammars mean redefining the token list.
  • Component weaving with Guice. Guice is the coolest way to assemble components which are, basically, functions.
  • Configurable escape codes, and whitespace triggers.
Novelang as component can be advertised through some plugin for a tool like Maven or Eclipse.


Directory listing

I just finished the directory listing feature and it seems terribly addictive. Let's say you started the Novelang HTTP daemon from $NOVELANG_HOME. The samples directory is full of samples. Given a URL like http://localhost:8080/samples, your browser displays a page listing all Novelang documents, including those in subdirectories.
All lines are links to subdirectories and documents. There is also a link to the parent directory, as long as it's not a parent of the content root itself (for security reasons). For a consistent URL scheme, a directory listing URL ends with "/". In the example above, the browser is redirected to http://localhost:8080/samples/ (note the trailing solidus). There is another trick required by Safari. Safari doesn't take the MIME type of the document into account, just the resource extension. No matter how loud you say "it's HTML, stupid" it tries to download the file instead of displaying the page. So Safari is handled as a special case and redirected to a URL like http://localhost:8080/samples/-.html. Yeah, it sucks. I chose the "-" name because it's not a valid filename so it won't conflict with document sources (it's perfectly legal to have an "index.nlp" file). There are many possible improvements:
  • Show directories containing no Novelang documents in a dimmed color (not showing them at all could be confusing).
  • Add a link to every supported format (first, PDF).
  • Add breadcrumbs like / > samples > served
  • Add some metadata like number of files and the date of the last modification.
  • Display files in the same directory on several columns.
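The redirection rules for directory listings can be summed up in a few lines. This is only a sketch under assumptions: DirectoryListingRedirect is a hypothetical class, and spotting Safari by looking for "Safari" in the User-Agent header is a crude heuristic chosen for illustration, not necessarily what Novelang does.

```java
/** Hypothetical summary of the redirection rules for directory listings. */
final class DirectoryListingRedirect {

  /** Crude heuristic (an assumption): look for "Safari" in the User-Agent header. */
  static boolean isSafari(String userAgent) {
    return userAgent != null && userAgent.contains("Safari");
  }

  /**
   * Returns the path the browser should end up on when it requests
   * the listing of a directory.
   */
  static String target(String requestPath, String userAgent) {
    // A directory listing always ends with a trailing solidus.
    String withSlash = requestPath.endsWith("/") ? requestPath : requestPath + "/";
    // Safari only trusts the extension, so give it one.
    return isSafari(userAgent) ? withSlash + "-.html" : withSlash;
  }

  public static void main(String[] args) {
    System.out.println(target("/samples", "Mozilla/5.0 Firefox/3.0"));
    System.out.println(target("/samples", "Mozilla/5.0 AppleWebKit/525.18 Safari/525.20"));
  }
}
```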


Novelang-0.7.0 released!

Version 0.7.0 is hot! You can download it here. It comes with a complete redesign of literal and character escaping.
  • Literal blocks are still here, much improved as they support any character on the inside.
  • Hard inline literal, corresponding to "technical" text inside plain text, like code citation. Renderers will use monospace font. Every character will appear as it is.
  • Soft inline literal works the same way hard inline literal does, but it should not be rendered in a different manner than casual text. Soft inline literal is a convenient answer for supporting almost any character and disabling standard formatting that occurs with punctuation, while avoiding conflict with other style delimiters.
  • Character escape is a last-resort option for displaying characters used as delimiters for one of the literal forms described above.
There were previous posts describing how this should work. During the implementation, there were minor adjustments so refer to the documentation.


Character escaping

I just fixed a few bugs, now literal form supports nested less-than / greater-than signs, except if there are three greater-than signs in sequence at the beginning of a line. Very sweet (at least for Novelang documentation) to make this a correct literal block (starting with '<<<' and ending with '>>>', both on the beginning of the line):
>> >
This dramatically reduces the need for character escaping. Of course there will always be some weird language to quote with three greater-than signs at the beginning of a line. And there may be other weird characters in a non-supported encoding. So we're hitting the character escaping problem again. In the refactoring-characterescape branch I already pushed a new character escaping based on the tilde '~' character, but having a non-symmetrical delimiter makes the document source much less readable. Of course this is because I'm using character escaping as a workaround until I implement better literals. But that unreadable stuff is like a warning that the tilde character is inappropriate. And I realize that it's commonly used in programming languages, so it would have to be escaped in literals. Gets tedious when you copy-paste from your favorite programming language. As a Mac user I'm a bit stuck with their keyboard layout, but I think that left and right pointing double angle quotation marks (don't laugh, it's the official Unicode name) are OK. Instead of this:
I'm about to switch to this:
The benefit is obvious when there are several escaped characters to juxtapose:
is better than
On a Mac AZERTY keyboard the two characters are obtained with Alt-7 and Shift-Alt-7. There must be something similar on other platforms (Windows, QWERTY). Anyways this doesn't have to be used often so it's ok to use a weird character that doesn't appear in common text or programming language. It would be then possible to document Novelang correctly by giving a sample of literal like this:
Some literal here.
Or even like this:
Escape character like this: «lpdaqm»escapecode«rpdaqm».
Of course lpdaqm and rpdaqm stand for "left (respectively right) pointing double angle quotation mark". I prefer to avoid acronyms but this name is really too crazy.
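To see how symmetrical delimiters could be processed, here is a small sketch. It only knows the two escape codes named in this post (lpdaqm and rpdaqm); the EscapeCodes class is hypothetical, and leaving unknown codes untouched is an assumption about error handling, not a decision.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical processing of escape codes between « and ». */
final class EscapeCodes {

  // Only the two codes named in the post; \u00AB is « and \u00BB is ».
  private static final Map<String, String> CODES = Map.of(
      "lpdaqm", "\u00AB",
      "rpdaqm", "\u00BB");

  private static final Pattern ESCAPE = Pattern.compile("\u00AB([a-z]+)\u00BB");

  /** Replaces every known escape code by the character it stands for. */
  static String unescape(String text) {
    Matcher matcher = ESCAPE.matcher(text);
    StringBuffer result = new StringBuffer();
    while (matcher.find()) {
      // Unknown codes are left as they are (an assumption).
      String replacement = CODES.getOrDefault(matcher.group(1), matcher.group());
      matcher.appendReplacement(result, Matcher.quoteReplacement(replacement));
    }
    matcher.appendTail(result);
    return result.toString();
  }

  public static void main(String[] args) {
    System.out.println(unescape("\u00ABlpdaqm\u00BBescapecode\u00ABrpdaqm\u00BB"));
  }
}
```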


Impressive XSL-FO resource

I was looking for a way to make the name of the current chapter appear in a PDF header. This is called a "running header". I found Dave Pawson's site on XSLT, DocBook, and Braille. The FO section contains very serious stuff, well above all other tutorials! The running header requires no trick. It's a standard FO feature: define a marker corresponding to the current chapter title or whatever (fo:marker) and retrieve it from the header definition (fo:retrieve-*).

Novelang-0.6.0 is there!

Now you can do all sorts of amazing things with stylesheets, as explained in the documentation. There is also a nicer default stylesheet for PDF. Check out PDF version of Novelang documentation!


New feature: selectable stylesheets

I just checked into GitHub the code for selectable stylesheets. Until now, a Novelang project could define its own stylesheets through the custom stylesheet mechanism. While Novelang can render PDF and HTML with its own built-in stylesheets, every user probably needs to define their own. When rendering a document, Novelang attempts to find the appropriate stylesheet:
  1. In the directory given by novelang.stylesheet.dir system property, if defined.
  2. In a style directory under the directory from which Novelang was launched (corresponding to user.dir).
  3. Inside Novelang-x.x.x.jar under the /style directory.
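The lookup order above can be sketched as follows. It assumes that a missing file at one step simply lets the search fall through to the next step; the StylesheetLocator class and the "classpath:" notation for the built-in stylesheet are made up for the illustration.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of the stylesheet lookup order. */
final class StylesheetLocator {

  /** Files to try, in order: the configured directory first, then ./style. */
  static List<Path> fileCandidates(
      String stylesheetDirProperty, String userDir, String stylesheetName) {
    List<Path> candidates = new ArrayList<>();
    if (stylesheetDirProperty != null) {
      candidates.add(Paths.get(stylesheetDirProperty, stylesheetName));
    }
    candidates.add(Paths.get(userDir, "style", stylesheetName));
    return candidates;
  }

  /** Returns the first existing file as a URI, or the built-in stylesheet. */
  static String locate(
      String stylesheetDirProperty, String userDir, String stylesheetName) {
    for (Path candidate
        : fileCandidates(stylesheetDirProperty, userDir, stylesheetName)) {
      if (Files.isRegularFile(candidate)) {
        return candidate.toUri().toString();
      }
    }
    // Made-up notation for "inside the jar, under /style".
    return "classpath:/style/" + stylesheetName;
  }

  public static void main(String[] args) {
    System.out.println(locate(
        System.getProperty("novelang.stylesheet.dir"),
        System.getProperty("user.dir"),
        "pdf.xsl"));
  }
}
```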
"Appropriate stylesheet" means a stylesheet corresponding to the MIME type of the requested document: pdf.xsl for a PDF document, html.xsl for an HTML document. That was not flexible enough because the same document of the same MIME type may deserve multiple renderings, like "miser printing", "visually impaired" and "tree-killer". That's where selectable stylesheets come to the rescue. With selectable stylesheets, you give the name of the stylesheet to use. This can be done at query level, or at book level. Let's say this is your project layout, with two stylesheets under the style directory:
After launching Novelang HTTP daemon, you can use the stylesheet query parameter to override any other stylesheet name:
Please note the html-beautiful.xsl path is still relative to the directory containing custom stylesheets! Another place to set stylesheet names is the Book file. Since a Book doesn't know how it will be rendered, you can define a stylesheet for multiple document MIME types. The book.nlb would look like this:

insert file:chapter-1.nlp

insert file:chapter-2.nlp
I've not tested subdirectories yet but they are supposed to work. Keep in mind: they will be relative to the directory containing your stylesheets. Supporting multiple stylesheets is a necessary step before providing nice built-in stylesheets to be tried with documents of your own.