The Novelang blog: October 2008

2008-10-23

URW Fonts

OK, I swear I'll take a break from fonts for a while. But first I have to post this link about the URW fonts, a free library of good-quality fonts under the GNU General Public License. Here's the stuff: http://www.ghostscript.com/awki/GhostPCL

I already wrote that Novelang should not come with its own set of fonts, but creating nice-looking documents with the base 14 fonts (vanilla Times + Helvetica + Courier) just sounds too hard for me. In the same post I wrote that the Bitstream Vera fonts were looking good and, well, I'm not so sure now. But the URW fonts are just great. They reproduce a set of standard, classical fonts, with little marvels like Garamond, or Palladio (which looks close to Palatino). For a project like Novelang, it's just a perfect gift. Free fonts of decent quality are pretty rare, and these seem to have consistent baselines and spacing across the collection (more tests are needed to be sure).

One potential problem for the end user is the GPL. This is fine for Novelang (which is GPL'ed, too) but if you create a PDF using those fonts, then your document becomes GPL'ed, which may not be what you want. It seems there is a special exception for some distributions of those fonts, but I could not grab the downloadable file.

Novelang-0.12.0 released!

Latest release of Novelang available here. Font list looks better than ever! See release notes for details.

2008-10-21

Stylesheet metadata and caching

As shown by previous posts on this blog, I've already spent a lot of time on fonts and how FOP deals with them. For now (before releasing Novelang-0.12.0) the result is far from perfect, but I should move on to other topics in order to keep the product well-balanced. So today I'm blogging for the pleasure of writing purely speculative things; you've been warned. Outstanding defects of the current solution are:

FOP caches font metadata in ~/.fop/fop-fonts.cache. Font deletion doesn't seem to be detected, so it's better to always delete the cache file at application startup.
Font changes are not detected while the application runs. This is Novelang's fault (in RenderingConfiguration) as it always returns the same FopFactory instance.
Fonts are defined as application parameters. They should be defined at stylesheet level in order to get a finer grain.

The last point relates to the previously discussed question of how to enrich an XSL file with FOP-specific metadata. The metadata could include a link to the hyphenation directory, as hyphenation rules may depend on some characters defined in the stylesheet (like the apostrophe).

For now, when the user touches one of her / his stylesheets, the next document rendition takes the changes into account, and that's the way it should be. Fonts and hyphenation directories, however, are set at FopFactory level, and the FopFactory currently lives as long as the whole Novelang application. To support "live" changes in XSLs, Novelang should re-create a FopFactory each time. What happens if reading font metrics takes a long time? Remember: FOP has a cache for this and I guess there is a reason.

I know that "premature optimization is the root of all evil", so caching has low priority for Novelang, but for my own personal comfort I must know how to perform caching. The goal here is to cache a whole FopFactory (possibly several, for several documents rendered concurrently) for reuse when the configuration is the same.

From my Cocoon days, I remember the excellent caching system. Each cacheable resource involved in the making of the final document has a CacheValidity object that tells whether another CacheValidity is still valid with respect to the current one. Here is my own flavor of CacheValidity using Generics (the Javadoc of the original is here):

public interface CacheValidity< T > {
 boolean isValid( T other ) ;
}

I'm not sure how to use Generics in this case, but anyway, I'll see. Now here is how to get a resource from a cache. Let's say this is a method of an object holding an instance of a CacheableFile in its cache field. Synchronization is omitted for brevity.

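public String getContent( File file ) {
  // Snapshot of the file's current state, to compare with the cached one.
  final FileCacheValidity current =
      new FileCacheValidity( file ) ;
  // Rebuild the cache entry when the file changed since it was cached.
  if( ! cache.getValidity().isValid( current ) ) {
    cache = createCache( file ) ;
  }
  return cache.getCachedContent() ;
}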
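The FileCacheValidity used above is not shown in this post. Here is a minimal sketch of what it could look like, assuming (this is only an assumption, not what Cocoon or Novelang actually does) that a file counts as unchanged when its path, last-modification time and length are all the same. The interface is repeated so the sketch compiles on its own.

```java
import java.io.File;

/** The generic interface from the post, repeated so the sketch is self-contained. */
interface CacheValidity<T> {
  boolean isValid(T other);
}

/** Hypothetical validity: a snapshot of a file's path, modification time and length. */
final class FileCacheValidity implements CacheValidity<FileCacheValidity> {
  private final String path;
  private final long lastModified;
  private final long length;

  FileCacheValidity(File file) {
    // Capture the state at construction time; later changes do not alter the snapshot.
    this.path = file.getAbsolutePath();
    this.lastModified = file.lastModified();
    this.length = file.length();
  }

  /** The cached snapshot stays valid if the newer snapshot describes the same state. */
  @Override
  public boolean isValid(FileCacheValidity other) {
    return path.equals(other.path)
        && lastModified == other.lastModified
        && length == other.length;
  }
}
```

Comparing timestamps and lengths is cheap but can miss an edit that keeps both unchanged; a content hash would be stricter, at the price of reading the file.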

From the simple and clever CacheValidity object, it is possible to implement various caching strategies in a transparent manner. It is also possible to compose them, like using a temporal cache to avoid any refresh during a given time interval, or implementing a directory cache from a list of file caches. More exotic CacheValidity objects may represent a whole XML fragment corresponding to the XSL metadata. If the user leaves it untouched and the files are untouched, too (which should be the most frequent case), then the FopFactory may be used again. I don't know if it's worth reusing Avalon code (it's a lot of mess with no Generics) but it was definitely worth a look.

So what's the lesson here? If I had started to look at caching issues from the beginning, attempting to make Novelang fit around Avalon or whatever, I would have lost focus and started recoding Cocoon (which is the historical reason for building Avalon). But I started Novelang because I was unhappy with Cocoon. So I'll probably have a close look at Avalon, and rewrite the caching stuff the way I feel. This may sound like a horribly stupid "Not Invented Here" syndrome, but my own experience of incremental development shows that the value is not in the code itself, but in the understanding of the problems. (That's why many organizations worship somewhat crappy home-grown frameworks: not because of their intrinsic value, but because they are the proof that the organization could bring many people to think the same way.) Joel Spolsky develops this point of view in the light of competitive advantage.
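Coming back to the composition idea mentioned above, here is a minimal sketch of a temporal validity and of a composite validity built from a list of parts (as a directory cache could be built from file validities). TimedValidity and CompositeValidity are hypothetical names, not Cocoon or Avalon classes; the bounded type parameter is just one possible answer to the Generics question.

```java
import java.util.ArrayList;
import java.util.List;

/** The generic interface from the post, repeated so the sketch is self-contained. */
interface CacheValidity<T> {
  boolean isValid(T other);
}

/** Hypothetical temporal validity: valid while the newer snapshot is not too much younger. */
final class TimedValidity implements CacheValidity<TimedValidity> {
  private final long createdAtMillis;
  private final long maxAgeMillis;

  TimedValidity(long createdAtMillis, long maxAgeMillis) {
    this.createdAtMillis = createdAtMillis;
    this.maxAgeMillis = maxAgeMillis;
  }

  @Override
  public boolean isValid(TimedValidity other) {
    return other.createdAtMillis - createdAtMillis < maxAgeMillis;
  }
}

/** Hypothetical composite: valid only if every part is valid against its counterpart. */
final class CompositeValidity<V extends CacheValidity<V>>
    implements CacheValidity<CompositeValidity<V>> {
  private final List<V> parts;

  CompositeValidity(List<V> parts) {
    this.parts = new ArrayList<>(parts); // defensive copy
  }

  @Override
  public boolean isValid(CompositeValidity<V> other) {
    if (parts.size() != other.parts.size()) {
      return false; // for example, a file was added to or removed from the directory
    }
    for (int i = 0; i < parts.size(); i++) {
      if (!parts.get(i).isValid(other.parts.get(i))) {
        return false;
      }
    }
    return true;
  }
}
```

The composite compares parts by position, so it assumes both snapshots list their parts in the same order (sorted file names, say).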

2008-10-20

Web fonts

My recent efforts to support custom fonts in PDF documents greatly increased my attention to all font-related stuff. Here is an article from Ars Technica about the revival of Web fonts, with CSS linking to downloadable fonts. While several formats are competing, there is a clear trend here. I don't see anything more to do for Novelang than serving a static file. Or am I missing something? Anyways, deploying fonts on the Web will be a huge fest of copyright issues. Downloadable PDFs with embedded fonts may raise the same issues but it's a marginal case for now. Sidenote: I had a look at the demo page in the Ars article. Camino 1.6.4 displays replacement fonts but Safari 3.1.2 does its job. In addition, Safari's Web Inspector displays the font list with a nice preview.

2008-10-19

General-purpose text processing library

Using Novelang to produce real-world documents (I mean: more than the Novelang documentation), I discovered how convenient it is for producing custom idioms without touching the main grammar. I mean: Novelang syntax supports well-known artefacts like quotes, parentheses, square brackets, punctuation signs, chapter headers, and so on. The text gets abstracted into a tree-like structure which is processed by a stylesheet that may be a custom one. The default stylesheet recognizes the "bracketed" item of the structure and outputs brackets around the text inside the "bracketed" tree fragment and everything looks fine. Now consider the case where:

Your text doesn't need square brackets.
You need to express something else, like a special name with special typographical effect.

Quickly, you start attributing a new effect to the square brackets. Because it corresponds to a new meaning, you just started building your own semantic markup. And, let me say it again: without touching the main grammar. It's even possible to assign different semantics to different parts of a document. From a Book you can tag an inserted Part with a special style:

insert file:mybibliography.nlp
  $style=bibliography

Then the content of the Part has a style element containing the "bibliography" string. So the stylesheet may use a special template to process entries like these, where italics inside a section don't mean italics but mark the text to sort the author list on:

=== Paul //Graham//

On Lisp [Prentice Hall]

=== Allen //Holub//

Taming Java Threads [APress]

That's incredibly lightweight compared to semantic markup like DocBook's. The magic only comes from:

The choice to avoid too-specific markup whenever possible.
The choice of a distinct presentation layer.

With this in mind I see a chance to turn parts of Novelang into a general-purpose text processing library, with a pluggable presentation layer.
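To illustrate the bibliography example above: in Novelang the sorting would be done by a stylesheet template working on the document tree. The Java sketch below is only a stand-in for that logic, working on raw heading lines with a regular expression; BibliographySort and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class BibliographySort {

  // Text between a pair of double slashes, which normally means italics.
  private static final Pattern ITALICS = Pattern.compile("//(.+?)//");

  /** Returns the italicized part of a heading, or the whole heading if there is none. */
  static String sortKey(String heading) {
    Matcher matcher = ITALICS.matcher(heading);
    return matcher.find() ? matcher.group(1) : heading;
  }

  /** Returns a copy of the headings, ordered by their sort key. */
  static List<String> sortHeadings(List<String> headings) {
    List<String> sorted = new ArrayList<>(headings);
    sorted.sort(Comparator.comparing(BibliographySort::sortKey));
    return sorted;
  }
}
```

With the two entries above, "=== Paul //Graham//" sorts before "=== Allen //Holub//" because the keys are "Graham" and "Holub".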

2008-10-03

Novelang-0.11.0 released!

Latest release of Novelang available here. Multiple font directories, no need for temporary font metric files. Command-line arguments supersede system properties. See release notes for details. Update: I wrote that the Sourceforge shell service was down, but it has been shut down permanently. So I'll have to change the script for uploading documentation.

2008-10-01

Inferring fonts characteristics

Now I'm trying to display a nicer font listing. FOP does a great job, reading font files and extracting font name, style, and weight. A font name is disconnected from the font file name (though this was not the case before Novelang-0.11.0). A font name should correspond to a typeface, which is a family of fonts. For the "Linux Libertine" font name, there can be several variants, like roman or bold+italic. But a closer look at the information provided by FOP reveals "virtual" font names, corresponding to the font variant of a given file, or to an abbreviated name. Let's consider the files of the Linux Libertine typeface:

LinuxLibertine.ttf
LinuxLibertine-Italic.ttf
LinuxLibertine-Bold.ttf
LinuxLibertine-Bold-Italic.ttf
LinuxLibertine-SmallCaps.ttf

Beware of the trick: there are four files corresponding to the standard style / boldness combinations, plus the small capitals which can be considered as a separate font. In FOP's terminology, a font-triplet associates a font name, a style (normal / italic) and a weight (normal, bold, extra-bold...). Each triplet has a priority, meaning (I guess) that triplets with higher priority should be used first when resolving a font-triplet. From the five files FOP extracts the following font-triplets:

"Linux Libertine" italic, bold, p=12
LinuxLibertine-Bold-Italic.ttf

"Linux Libertine" italic, normal, p=7
LinuxLibertine-Italic.ttf

"Linux Libertine C" normal, normal, p=7
LinuxLibertine-SmallCaps.ttf

"Linux Libertine" normal, bold, p=5
LinuxLibertine-Bold.ttf

"Linux Libertine Bold Italic" normal, normal, p=0
LinuxLibertine-Bold-Italic.ttf

"LinLibertineBI" normal, normal, p=0
LinuxLibertine-Bold-Italic.ttf

"Linux Libertine Bold" normal, normal, p=0
LinuxLibertine-Bold.ttf

"LinLibertineB" normal, normal, p=0
LinuxLibertine-Bold.ttf

"Linux Libertine Italic" normal, normal, p=0
LinuxLibertine-Italic.ttf

"LinLibertineI" normal, normal, p=0
LinuxLibertine-Italic.ttf

"Linux Libertine Capitals" normal, normal, p=0
LinuxLibertine-SmallCaps.ttf

"LinLibertineC" normal, normal, p=0
LinuxLibertine-SmallCaps.ttf

"Linux Libertine" normal, normal, p=0
LinuxLibertine.ttf

"LinLibertine" normal, normal, p=0
LinuxLibertine.ttf

This looks quite messy. Using this raw data, the font listing would reveal 14 fonts instead of the 5 expected. That is because FOP focuses on resolving font variants given a name, a style and a boldness, while each font file may contain more than one font name. Novelang has to take FOP's information and turn it upside down to obtain a human-readable font list.

First, sort all triplets by priority (like above). Let's say that all triplets with a priority greater than 0 define "good" font names: font names that are shared between triplets can be used safely to choose font variants (while there is no chance to get a variant from font names that already describe a variant, like "LinLibertineI"). Let's call those names the "clean names". In the list above we get the following clean names: "Linux Libertine" and "Linux Libertine C". Then it is easy to craft a structure like this:

"Linux Libertine"
  italic, bold, LinuxLibertine-Bold-Italic.ttf
  italic, normal, LinuxLibertine-Italic.ttf
  normal, bold, LinuxLibertine-Bold.ttf
"Linux Libertine C"
  normal, normal, LinuxLibertine-SmallCaps.ttf

The "Linux Libertine, normal, normal" font-triplet is missing. Using the clean name "Linux Libertine" it is easy to find it among the font-triplets with priority zero. If we are looking for perfection we can try to locate a better name for "Linux Libertine C". How? Once the clean names are established, we look for singletons in the set of font-triplets with priority greater than zero, that is, clean names with only one such triplet. For each of those, we replace the clean name by an "outstanding name", which is the longest name in the set of font-triplets with priority zero and the same font file (here LinuxLibertine-SmallCaps.ttf). So now we have something like this:

"Linux Libertine"
  italic, bold, LinuxLibertine-Bold-Italic.ttf
  italic, normal, LinuxLibertine-Italic.ttf
  normal, bold, LinuxLibertine-Bold.ttf
  normal, normal, LinuxLibertine.ttf
"Linux Libertine Capitals"
  normal, normal, LinuxLibertine-SmallCaps.ttf

Now there is the temptation to show all available font names in the list, like "Linux Libertine C" as an alias for "Linux Libertine Capitals". While this would increase the complexity of the algorithm, I don't see how useful this would be. Anyways, the algorithm described above may require additional work, considering messy fonts of the real world.
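The algorithm described in this post can be sketched in Java as follows. This is only a sketch under assumptions: Triplet and FontListing are hypothetical classes (not FOP's or Novelang's own), the data is the Linux Libertine list above typed in by hand, and real-world fonts may need more care.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical holder for one font-triplet, its priority and its font file. */
final class Triplet {
  final String name, style, weight, file;
  final int priority;

  Triplet(String name, String style, String weight, int priority, String file) {
    this.name = name;
    this.style = style;
    this.weight = weight;
    this.priority = priority;
    this.file = file;
  }
}

final class FontListing {

  /** Maps each display name to its variants, formatted as "style, weight, file". */
  static Map<String, List<String>> build(List<Triplet> triplets) {
    // Sort by descending priority; List.sort is stable, so ties keep their input order.
    List<Triplet> sorted = new ArrayList<>(triplets);
    sorted.sort((a, b) -> Integer.compare(b.priority, a.priority));

    // Names carried by triplets with priority > 0 are the "clean names".
    Map<String, List<Triplet>> byCleanName = new LinkedHashMap<>();
    for (Triplet t : sorted) {
      if (t.priority > 0) {
        byCleanName.computeIfAbsent(t.name, k -> new ArrayList<>()).add(t);
      }
    }

    Map<String, List<String>> listing = new LinkedHashMap<>();
    for (Map.Entry<String, List<Triplet>> entry : byCleanName.entrySet()) {
      String cleanName = entry.getKey();
      List<Triplet> variants = new ArrayList<>(entry.getValue());
      String displayName = cleanName;
      if (variants.size() == 1) {
        // Singleton: use the longest priority-zero name that shares the font file.
        String longest = null;
        for (Triplet t : sorted) {
          if (t.priority == 0 && t.file.equals(variants.get(0).file)
              && (longest == null || t.name.length() > longest.length())) {
            longest = t.name;
          }
        }
        if (longest != null) {
          displayName = longest;
        }
      }
      // Priority-zero triplets bearing the clean name supply the missing variants.
      for (Triplet t : sorted) {
        if (t.priority == 0 && t.name.equals(cleanName)) {
          variants.add(t);
        }
      }
      List<String> lines = listing.computeIfAbsent(displayName, k -> new ArrayList<>());
      for (Triplet v : variants) {
        lines.add(v.style + ", " + v.weight + ", " + v.file);
      }
    }
    return listing;
  }

  /** The fourteen Linux Libertine triplets listed earlier in this post. */
  static List<Triplet> sample() {
    String n = "normal";
    return Arrays.asList(
        new Triplet("Linux Libertine", "italic", "bold", 12, "LinuxLibertine-Bold-Italic.ttf"),
        new Triplet("Linux Libertine", "italic", n, 7, "LinuxLibertine-Italic.ttf"),
        new Triplet("Linux Libertine C", n, n, 7, "LinuxLibertine-SmallCaps.ttf"),
        new Triplet("Linux Libertine", n, "bold", 5, "LinuxLibertine-Bold.ttf"),
        new Triplet("Linux Libertine Bold Italic", n, n, 0, "LinuxLibertine-Bold-Italic.ttf"),
        new Triplet("LinLibertineBI", n, n, 0, "LinuxLibertine-Bold-Italic.ttf"),
        new Triplet("Linux Libertine Bold", n, n, 0, "LinuxLibertine-Bold.ttf"),
        new Triplet("LinLibertineB", n, n, 0, "LinuxLibertine-Bold.ttf"),
        new Triplet("Linux Libertine Italic", n, n, 0, "LinuxLibertine-Italic.ttf"),
        new Triplet("LinLibertineI", n, n, 0, "LinuxLibertine-Italic.ttf"),
        new Triplet("Linux Libertine Capitals", n, n, 0, "LinuxLibertine-SmallCaps.ttf"),
        new Triplet("LinLibertineC", n, n, 0, "LinuxLibertine-SmallCaps.ttf"),
        new Triplet("Linux Libertine", n, n, 0, "LinuxLibertine.ttf"),
        new Triplet("LinLibertine", n, n, 0, "LinuxLibertine.ttf"));
  }
}
```

Calling FontListing.build(FontListing.sample()) gives two entries, "Linux Libertine" with four variants and "Linux Libertine Capitals" with one, which is the structure shown above.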