2008-09-29

Blogging while Sourceforge is down

I've uploaded Novelang-0.11.0.zip to Sourceforge, but I'm stuck while their shell service is down: it's what I use to upload and unzip the Novelang website. So I'm refraining from announcing that this new version is available as long as it doesn't appear in the online documentation. This release is not a shiny one: I just cleaned up some mess and added some tests. But now, both the daemon and the still-undocumented batch tool read command-line parameters the same way. These parameters follow the "--option=value" form. They supersede system properties, which are nothing more than crappy global variables that make automated tests hard to write. System properties also make troubleshooting more difficult, as a misspelled system property fails silently. With the help of a command-line argument parsing tool, an unsupported option raises an exception at program startup, on a fail-fast basis. To make the resulting configuration more understandable, the log shows how each value was set:
INFO  n.c.ConfigurationTools - Recognized user-defined 
  directory '/.../Novelang/samples/hyphenation' 
  (from option: --hyphenation-dir, Directory containing 
  hyphenation files).
INFO  n.configuration.ConfigurationTools - Creating 
  DaemonConfiguration from default value [8080] 
  (option not set: --port, TCP port for daemon).
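To make the fail-fast idea concrete, here is a minimal Java sketch. It assumes a hand-rolled parser, since the parsing tool used by Novelang isn't named in this post. The class name OptionSketch is hypothetical; only the --port and --hyphenation-dir options and the 8080 default come from the log above.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch, not Novelang's actual parser.
final class OptionSketch {

  // Option names taken from the log excerpt; a real tool would declare more.
  static final Set<String> SUPPORTED =
      new TreeSet<>(Arrays.asList("port", "hyphenation-dir"));

  // Parses "--name=value" arguments and fails fast on anything unexpected.
  static Map<String, String> parse(String... args) {
    Map<String, String> values = new LinkedHashMap<>();
    for (String arg : args) {
      int equals = arg.indexOf('=');
      if (!arg.startsWith("--") || equals < 3) {
        throw new IllegalArgumentException("Expected --option=value but got: " + arg);
      }
      String name = arg.substring(2, equals);
      if (!SUPPORTED.contains(name)) {
        throw new IllegalArgumentException("Unsupported option: --" + name);
      }
      values.put(name, arg.substring(equals + 1));
    }
    return values;
  }

  // Returns the parsed port, or the default 8080 when --port is not set.
  static int port(Map<String, String> values) {
    String raw = values.get("port");
    return raw == null ? 8080 : Integer.parseInt(raw);
  }
}
```

A misspelled option such as --prot=9090 throws at startup instead of being silently ignored, which is the whole point compared to system properties.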
A big temptation during this refactoring was to add new features. One frustrating moment was the handling of multiple font directories, for which I had greater ambitions. But in the end I managed to keep this development round short, the code got better, and having just one default /fonts directory works well in many cases. I've thought about a few potentially useful options:
  • --serve-shutdown (daemon only): for now this HTTP request shuts the daemon down: /~shutdown.html. This should be disabled by default, and enabled only with the --serve-shutdown option.
  • --serve-remote (daemon only): for now any remote computer may access the daemon (unless some firewall prevents it). The default behavior should be to restrict access to localhost, unless explicitly stated otherwise.
  • --flatten-output (batch only): for now the batch tool renders documents with the same path as source documents. This may require annoying tricks to retrieve the generated files.
  • --sources-dir (batch and daemon): the directory to resolve document sources from.
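The localhost restriction proposed for --serve-remote could be sketched like this in Java. This is only an illustration of the policy, not daemon code: the class and method names are hypothetical, and how the daemon obtains the caller's address is left out.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

// Hypothetical sketch of the proposed --serve-remote policy, not Novelang code.
final class RemoteAccessSketch {

  // Accepts the request if remote access is enabled or the caller is the local machine.
  static boolean isAllowed(String remoteIp, boolean serveRemote) {
    if (serveRemote) {
      return true;
    }
    try {
      // With a literal IP address, getByName only validates the format (no DNS lookup).
      return InetAddress.getByName(remoteIp).isLoopbackAddress();
    } catch (UnknownHostException e) {
      return false;
    }
  }
}
```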

2008-09-26

URL for querying metadata

So I'm thinking again about how to list all the fonts available from a given document, considering the stylesheet as the best place to define font directories. This may not be such a futile exercise, as it raises the question of document metadata and the URI syntax to query it. So far the best I've found is:
/~fonts/my.pdf
I like it because:
  • The URI parser can detect that "~fonts" makes no sense unless the MIME type is PDF.
  • It doesn't mess with the URI parameters, which are about the document itself (not its metadata).
  • It's easy to extend with other functions like word count or whatever, with no risk of creating incompatible options.
I took a different way for error pages: these URIs look like
/broken.pdf/error.html
But meanwhile I had to find a workaround for displaying directories: a pseudo -.html document. So we could get a unified way to display metadata through some kind of "service" (including errors):
/~fonts/my.pdf
/~error/broken.pdf
Or, with Safari:
/~fonts/my.pdf/-.html
/~error/broken.pdf/-.html
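Here is a rough Java sketch of how such "service" URIs could be told apart from plain document requests. It is only an assumption about the parsing, not the actual Novelang URI parser: the class name and the regular expression are hypothetical, and the service name is not validated against the MIME type.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch, not the actual Novelang URI parser.
final class MetadataUriSketch {

  // "/~service/document/path", optionally followed by the "/-.html" pseudo-document.
  private static final Pattern SERVICE_URI =
      Pattern.compile("/~([a-z]+)(/.+?)(/-\\.html)?");

  // Returns {service, documentPath}, or null when the path is not a service request.
  static String[] parse(String path) {
    Matcher matcher = SERVICE_URI.matcher(path);
    if (!matcher.matches()) {
      return null;
    }
    return new String[] { matcher.group(1), matcher.group(2) };
  }
}
```

With this pattern, /~fonts/my.pdf and /~fonts/my.pdf/-.html both resolve to the "fonts" service applied to /my.pdf, while /my.pdf is left alone.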

2008-09-25

How to display font listing

Yes, this is yet another post about fonts. As I found out how to get more information on available fonts, I'm gathering some ideas on the best way to display them. By "font" I mean the combination of a font family (like "Verdana"), a weight (like "extra bold"), and a style (like "italic"). Each font has a font family name, and is backed by one file (though there can be several fonts in one file, like with OpenType fonts). FOP calls such a combination of font family, weight and style a "font triplet". So it's important to name the fonts clearly, giving all the characteristics of the font triplet, and the name of the font file (with a path relative to the project's root). When two font files declare the same triplet there should be a duplicate warning, which could be some small red symbol. A nice feature is to display the characters that are supported in source documents. This is only partially supported for now (SupportedCharacters doesn't get 'em all). Many fonts are documented this way, using a table; a table is nice because empty cells show missing characters. There should be a small text showing how the font renders. A sentence with all the letters of the roman alphabet ("the quick brown fox jumps over the lazy dog") is not enough because it contains no accented character. The best language-insensitive display I've seen is a mixed-case alphabet ("AaBbCcDdEeFfGgHh..."). Because such a table takes much space, we can show one font per page. Because there will be many pages, the first page should list available fonts by family, with hyperlinks. As information about broken fonts becomes available, those should be listed on the first page, preferably with a red symbol beside them.

2008-09-18

FOP and fonts, the story goes on

Now I have a better understanding of how FOP handles fonts and how to get its precious information about the font list, duplicates, and failures. In the PrintRendererConfigurator, the #buildFontListFromConfiguration static method does (almost) all the job. It takes the following input parameters:
  • A Configuration object with a <renderer> as root element.
  • A URL (as a String) to resolve relative font URLs with. Null is supported.
  • A FontResolver, which can be a DefaultFontResolver.
  • A boolean set to true if an exception should be thrown if an error is found.
  • A FontCache instance or null if caching is disabled.
As it is a static method, #buildFontListFromConfiguration can be called from anywhere with a fresh FontCache instance. The latter is useful as it gathers failed fonts. A fresh cache instance is needed, because cached data may outlive the JVM. The cache saves itself in a ~/.fop/fop-fonts.cache file, holding font descriptions. Font descriptions seem to be invalidated only when FOP attempts to load a font for "real" (at rendering time). So when hitting the cache in a test program, it sometimes returned font descriptions that shouldn't have been there. The FontResolver requires a FOUserAgent, which is created from a FopFactory. The FopFactory itself is created from a Configuration which contains the <renderer> elements, so there should be some instance reuse here. I found a private method somewhere which logs font duplicates, but I can't find it again to see if there was any hook around (it didn't seem so). Anyway, it will be cleaner to sort out font triplets with the same value myself. Printing the font triplets on the console, I noted they take the right font name, whatever the font file is called. Adios, proprietary font naming convention!
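Sorting out font triplets with the same value could look like the following Java sketch. It deliberately uses plain strings instead of FOP classes, so nothing here is FOP's API: the class name, the method, and the font file names in the usage note are hypothetical. It assumes each font is described by a triplet key of the form "name,style,weight" plus the file it comes from.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch using plain strings, not FOP's FontTriplet class.
final class TripletDuplicatesSketch {

  // Each input pair is {tripletKey, fontFile}. Returns, sorted by key,
  // only the triplet keys declared by more than one font file.
  static Map<String, List<String>> duplicates(String[][] keyAndFile) {
    Map<String, List<String>> byKey = new TreeMap<>();
    for (String[] pair : keyAndFile) {
      byKey.computeIfAbsent(pair[0], key -> new ArrayList<>()).add(pair[1]);
    }
    // Drop every triplet backed by a single file: those are not duplicates.
    byKey.values().removeIf(files -> files.size() < 2);
    return byKey;
  }
}
```

Feeding it two made-up files a.ttf and b.ttf that both declare "Foo,normal,400" yields one entry listing both files, which is what a duplicate warning needs.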

Font listing revisited

In a previous post, I found some good reasons to embed the font directory list inside a stylesheet. With such an approach, there is no centralized declaration of the font list, so font listing with http://localhost:8080/~fonts.pdf becomes unavailable. That's bad news, since font listing is incredibly useful to debug documents, especially when there are broken fonts (the Web is full of them, waiting to be downloaded). With the font list inside the stylesheet, font listing requires the stylesheet itself as a parameter. This breaks the current URL scheme. For now it's possible to tell Novelang to use a given stylesheet through the book itself, or with a URL parameter. So it makes sense to use the document URL as a parameter for the font listing. Obviously, this is a common use case. I'm thinking about something like:
/my.pdf?listfonts
/my.pdf?stylesheet=mypdf.xsl&listfonts
This is less elegant than the current solution, but if there are many font directories (like one per font) this helps to reduce the list length to what's really used. The /~fonts.pdf pseudo-document may stay useful, listing all the fonts under the project directory, using a deep directory scan.
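Detecting the listfonts flag next to a stylesheet parameter could be sketched like this in Java. It is an assumption about how the query string might be handled, not existing Novelang code: the class and method names are hypothetical, and URL decoding is left out for brevity.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch, not Novelang's request handling.
final class FontListingQuerySketch {

  // Splits "a=b&flag" into a map; a flag without '=' maps to an empty string.
  static Map<String, String> parseQuery(String query) {
    Map<String, String> parameters = new LinkedHashMap<>();
    if (query == null || query.isEmpty()) {
      return parameters;
    }
    for (String part : query.split("&")) {
      int equals = part.indexOf('=');
      if (equals < 0) {
        parameters.put(part, "");
      } else {
        parameters.put(part.substring(0, equals), part.substring(equals + 1));
      }
    }
    return parameters;
  }

  // True when the request asks for the font listing of the document.
  static boolean wantsFontListing(String query) {
    return parseQuery(query).containsKey("listfonts");
  }
}
```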

'External' directories

The previous post highlighted that Novelang should not allow a reference to a directory outside of its project. We'll call such a directory an external directory. The reason is, Novelang could be used (in a distant future) as an embedded component in a Web application where users upload their own source documents and stylesheets. A malicious stylesheet could exploit some special FOP behavior to embed a file that it is not supposed to, like a password file, or just another user's document. For now Novelang just filters HTTP queries, especially those for directory listing. There is no check on the path of images or fonts that FOP tries to embed. Enforcing file access restriction is a great subject by itself. How to handle resource access, depending on the current Novelang project? How to test security in general? Those points arise as I'm writing, but the initial topic of this post is: how to let a project access a directory outside its scope, let's say, in case of multiple projects sharing the same data, like fonts, on a privately-owned local filesystem? This may be achieved using Un*x symbolic links, provided Java supports them. A more portable solution could be to set a system option like:
external.allfonts=../shared/all-the-fonts
external.logos=../shared/images/my-logo
external.greetings=../shared/text
"System option" means it is defined outside of a Novelang book (through command-line or system properties). Then, one can reference such directories as any other directory inside the project using variable expansion:
insert file:${extdir:greetings}/salute.nlp
Too bad! Until now I avoided variable expansion, which makes everything unreadable. Variable expansion would only make sense if we wanted to grant access to images in a given context while withholding access to greetings, and that is not a realistic need. After all, it's enough to give access to some external directories with no other kind of ceremony:
externaldirectory=../common
externaldirectory=../../Shared/images
Then we let a Novelang book or stylesheet reference them:
insert file:../common/text/salute.nlp
By the way, this could be done using the filesystem's permissions, but they are not portable across systems. Anyway, as I don't see many use cases, implementing such a feature has the lowest priority for now.
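The check implied by the externaldirectory idea could be sketched like this in Java. It rests on several assumptions that are mine: references are resolved against the project directory, external directories are declared relative to it, and the class and method names are hypothetical. It only normalizes paths and does not follow symbolic links.

```java
import java.nio.file.Path;
import java.util.List;

// Hypothetical sketch, not Novelang code.
final class DirectoryAccessSketch {

  // True if 'requested' (relative to the project directory) stays under the
  // project directory or under one of the declared external directories.
  static boolean isAllowed(Path projectDir, List<String> externalDirs, String requested) {
    Path project = projectDir.toAbsolutePath().normalize();
    Path target = project.resolve(requested).normalize();
    if (target.startsWith(project)) {
      return true;
    }
    for (String external : externalDirs) {
      // Path.startsWith compares whole name elements, so "../commonX" does not match "../common".
      if (target.startsWith(project.resolve(external).normalize())) {
        return true;
      }
    }
    return false;
  }
}
```

With externaldirectory=../common, a reference to ../common/text/salute.nlp passes while ../../etc/passwd is rejected.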

Opening access to FOP configuration?

As I'm getting closer and closer to supporting multiple font directories, the problem arises of how to define them. It seems logical to extend the current convention, passing several paths to the novelang.fonts.dir VM argument, separated by the platform's path separator. On Un*x it would look like:
-Dnovelang.fonts.dirs=my-fonts:other/fonts
But the path separator highlights that font definition becomes system-dependent (it's a semicolon on Windows). And anyway, defining the fonts on the command line is illogical, as fonts are part of the rendering. So I'm thinking of embedding font directory names as XSLT metadata (this idea was already mentioned). I explored the possibility of embedding the whole FOP configuration itself, which is XML, too. But opening direct access to the FOP configuration would leave the opportunity to do weird things:
  • Font cache configuration.
  • Default page settings. This only makes sense when the configuration is shared by multiple XSLT stylesheets.
  • Title of the PDF document. The probable need to get this title from a source document (like the Book) would make this approach redundant.
On the other hand, stuff like hyphenation directories, ICC profiles and target resolution for bitmap images make sense. But one good reason to let Novelang keep control over everything passed to FOP is to ensure that every directory is a subdirectory of the current project, therefore preventing security threats. So we could have something like:
<xsl:stylesheet
    version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:fo="http://www.w3.org/1999/XSL/Format"
    xmlns:n="http://novelang.org/book-xml/1.0"
    xmlns:nmeta="http://novelang.org/meta-xsl/1.0"
>
  <nmeta:fop version="1.0" >
    <target-resolution>72</target-resolution>
    <renderers>
      <renderer mime="application/pdf" >
        <fonts>
          <directory recursive="true" >
            my/fonts
          </directory>
        </fonts>
        <output-profile>
          profiles/EuropeISOCoatedFOGRA27.icc
        </output-profile>
        <filterList>
          <value>null</value>
        </filterList>
        <filterList type="image" >
          <value>flate</value>
          <value>ascii-85</value>
        </filterList>
      </renderer>
    </renderers>
  </nmeta:fop>

    ...

</xsl:stylesheet>

This really looks like FOP configuration (see FOP documentation for details), but everything that's not shown here is forbidden. So we end up with the best of both worlds.
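Reading the font directories back from such a stylesheet could be sketched with the JDK's DOM parser. This is only a sketch under assumptions: the nmeta:fop element is a proposal in this post, not an implemented feature, and the class and method names are hypothetical. A real implementation would also reject every element outside an allowed list and check each directory against the project root.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical sketch, not Novelang code.
final class StylesheetMetadataSketch {

  static final String NMETA_NAMESPACE = "http://novelang.org/meta-xsl/1.0";

  // Returns the trimmed text of every <directory> found under an <nmeta:fop> element.
  static List<String> fontDirectories(String stylesheetXml) {
    try {
      DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
      factory.setNamespaceAware(true);
      Document document = factory.newDocumentBuilder().parse(
          new ByteArrayInputStream(stylesheetXml.getBytes(StandardCharsets.UTF_8)));
      List<String> directories = new ArrayList<>();
      NodeList fopElements = document.getElementsByTagNameNS(NMETA_NAMESPACE, "fop");
      for (int i = 0; i < fopElements.getLength(); i++) {
        NodeList found = ((Element) fopElements.item(i)).getElementsByTagName("directory");
        for (int j = 0; j < found.getLength(); j++) {
          directories.add(found.item(j).getTextContent().trim());
        }
      }
      return directories;
    } catch (Exception e) {
      throw new IllegalArgumentException("Cannot read stylesheet metadata", e);
    }
  }
}
```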

2008-09-16

FOP and fonts

FOP makes me feel dumb because it is great. In a previous post I already mentioned that FOP's font handling was better than I first thought, so I may have been wrong to force a custom font naming. As a matter of fact, FOP holds a list of fonts inside the FontCache of its FontFactory. Novelang instantiates the FontFactory so it has full control over it. Using a Java debugger, let's look at what a FontCache contains.
fontMap = {java.util.HashMap}
  [0] = {java.util.HashMap$Entry} 
    key: java.lang.String
        "file:/…/fonts/URWGothicL-BookObli.ttf"
    value: org.apache.fop.fonts.CachedFontInfo
      lastModified = 1042610200000
      metricsFile = {java.lang.String}
          "file:/…/.fop-font-metrics-14045 \
               .temp/URWGothicL-BookObli.xml"
      embedFile = {java.lang.String}
          "file:/…/fonts/URWGothicL-BookObli.ttf"
      kerning = true
      fontTriplets = {java.util.ArrayList} 
        [0] = {org.apache.fop.fonts.FontTriplet} 
          name = {java.lang.String} 
              "URWGothicL-BookObli"
          style = {java.lang.String} "normal"
          weight = 400
          priority = 0
          key = {java.lang.String}
              "URWGothicL-BookObli,normal,400"
        […]
  […]
failedFontMap = {java.util.HashMap}
Sweet! Here is everything I need:
  • Font name.
  • Font style, "italic" or "normal".
  • Weight. Not just "normal" or "bold" but also "light" and "extra-bold".
  • Priority for dealing with duplicates.
  • A list of fonts which could not be read (failedFontMap).
The moderately bad news is, fontMap and failedFontMap are private fields, but I see no reason not to use dirty reflection here. The example above is biased as it was created from a Novelang-generated font list, so I'll have to investigate a bit more to see how FOP deals with:
  • Failed fonts.
  • Font name different from font file name.
  • Multiple directories (including nested ones).
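For the record, the "dirty reflection" trick looks like this in Java. The sketch runs against a stand-in class with a private map, not against FOP's FontCache, so it doesn't prove anything about FOP itself. The class names and the font path inside the stand-in are made up.

```java
import java.lang.reflect.Field;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: StandInCache only mimics a class with a private map.
final class PrivateFieldSketch {

  static final class StandInCache {
    private final Map<String, String> failedFontMap = new HashMap<>();

    StandInCache() {
      // Made-up entry, for illustration only.
      failedFontMap.put("file:/fonts/Broken.ttf", "unreadable");
    }
  }

  // Reads a private field by name, bypassing Java access checks.
  static Object readPrivateField(Object owner, String fieldName) {
    try {
      Field field = owner.getClass().getDeclaredField(fieldName);
      field.setAccessible(true);
      return field.get(owner);
    } catch (ReflectiveOperationException e) {
      throw new IllegalStateException("Cannot read field " + fieldName, e);
    }
  }
}
```

The obvious drawback: the field name is a plain string, so a renamed field in a later FOP release would only fail at runtime.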
To sum up, FOP provides all I need to make Novelang code cleaner and bring the following enhancements:
  • Multiple directories.
  • List of failed fonts.
  • Warning in case of duplicates.
  • Fonts sorted by font name.
  • Throw away temporary .fop-font-metrics directory.
  • Cache of font descriptions handled by FOP itself (this is the meaning of the lastModified field in the FontCache).
  • Support more font types. By letting FOP do its job we let its FontFileFinder recognize the following font files: *.ttf for TrueType, *.pfb for Type 1. The *.otf suffix also appears, and those fonts may be treated as TTF.

Novelang-0.10.0 released!

The latest release of Novelang is available here. Coolest feature: barcode generation for PDF! It sounds like a gadget at this stage of development, but I needed it somewhere else and it didn't cripple the architecture. See release notes for details. Enjoy!

2008-09-03

Problem with 'œ' and 'Œ' characters

For now, French users of Novelang willing to type "œ" and "Œ" need to type "«oelig»" and "«OElig»" (yes, angled quotes included). That's especially tedious for Mac users, who are eager to just type Alt-o and Shift-Alt-O. The Unicode specification makes œ and Œ ('LATIN SMALL LIGATURE OE' and 'LATIN CAPITAL LIGATURE OE') part of the Latin Extended-A block. All other letters with French accents are part of Latin-1 Supplement. Unfortunately, the commonly-favoured ISO-8859-1 encoding doesn't include "œ" and "Œ". As a consequence, while those characters may appear in a text editor configured to save files in ISO-8859-1 encoding, they'll appear as question marks when reopening the document. The Latin-1 supplement seems to offer characters that look the same: 'STRING TERMINATOR' (U+009C) and 'PARTIAL LINE BACKWARD' (U+008C). These are control codes; they only look like œ and Œ because Windows-1252 maps the bytes 0x9C and 0x8C to the two ligatures. So I don't think it's a good idea to use them, as their names suggest they have another purpose. Googling on "latin-extended-b iso-8859-1" I discovered this page listing all differences between ANSI (aka Windows-1252), Mac Roman and ISO-8859-1. Very useful! It seems that ISO-8859-1 was not such a clever choice, but I can't find any multiplatform 8-bit encoding including every commonly used French character.
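The question-mark effect is easy to reproduce in Java, using only the standard ISO-8859-1 charset. This small sketch is not Novelang code, and the class and method names are hypothetical. The ligature is written as the \u0153 escape so the sample doesn't depend on the encoding of the source file itself.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch, not Novelang code.
final class OeLigatureSketch {

  // True if the character has a byte representation in ISO-8859-1.
  static boolean fitsLatin1(char c) {
    return StandardCharsets.ISO_8859_1.newEncoder().canEncode(c);
  }

  // Simulates saving a text as ISO-8859-1 and reopening it:
  // characters outside the charset are replaced by '?' when encoding.
  static String saveAndReopenAsLatin1(String text) {
    byte[] saved = text.getBytes(StandardCharsets.ISO_8859_1);
    return new String(saved, StandardCharsets.ISO_8859_1);
  }
}
```

Saving and reopening "c\u0153ur" gives "c?ur", while an "é" (U+00E9) survives the round trip.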