2008-08-30

Hyphenation at work

Novelang-0.9.0 brings hyphenation support for PDFs. Basically, it just passes a directory containing hyphenation files to FOP (the PDF generator).

When trying to make hyphenation work for real, I hit several problems. I was using the fr.xml hyphenation rules, which fortunately come under the GPL license.

Hyphenation takes apostrophes into account. That's because, with a "remain character count" of three, it is correct to hyphenate a word like "l'attrait" like this: "l'at-trait". The fr.xml file was quite clear about it, with many occurrences of the APOSTROPHE character (U+0027), which is also called "single quote" and looks symmetrical. But hyphenation occurs after the FO-generating XSL has replaced the <apostrophe-wordmate> element with the RIGHT SINGLE QUOTATION MARK character (U+2019). It looks better than APOSTROPHE but was not understood by the hyphenation rules, causing a potential hyphenation bug on every word with a "relooked" apostrophe. I spent much time trying to hack the rules, which were correct all along, and finally the solution was to replace every APOSTROPHE in the rules with RIGHT SINGLE QUOTATION MARK (the &#x2019; XML character reference).

Because hyphenation worked better, it changed the word distribution and raised another problem: some proper nouns got hyphenated. The FOP documentation mentions an <exceptions> element containing words that must not be hyphenated at all. At first it didn't work, and I had to trace into the FOP code to find out that every word in the exception list must be lower-cased.

So Novelang could support:
  • An exception list declared in the Book file itself.
  • Automatic replacement of the quoting character.
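Both features can be sketched as a small preprocessing step over the hyphenation file. The Python below is only an illustration, not Novelang code: the function name `adapt_hyphenation_rules` is hypothetical, the sample rules are made up (they are not taken from fr.xml), and it assumes the rules live in <classes>, <exceptions> and <patterns> elements under a single root.

```python
import xml.etree.ElementTree as ET

APOSTROPHE = "\u0027"
RIGHT_SINGLE_QUOTATION_MARK = "\u2019"


def adapt_hyphenation_rules(xml_text, never_hyphenate=(),
                            quote=RIGHT_SINGLE_QUOTATION_MARK):
    """Return the rules with extra exceptions and the stylesheet's quote."""
    root = ET.fromstring(xml_text)

    # Create <exceptions> if missing, placing it just before <patterns>.
    exceptions = root.find("exceptions")
    if exceptions is None:
        exceptions = ET.Element("exceptions")
        patterns = root.find("patterns")
        children = list(root)
        position = children.index(patterns) if patterns is not None else len(children)
        root.insert(position, exceptions)

    # Exception words are lower-cased, as described in the post.
    words = set((exceptions.text or "").split())
    words.update(word.lower() for word in never_hyphenate)
    exceptions.text = "\n" + "\n".join(sorted(words)) + "\n"

    # Replace APOSTROPHE only inside element text, never in XML syntax.
    for tag in ("classes", "exceptions", "patterns"):
        element = root.find(tag)
        if element is not None and element.text:
            element.text = element.text.replace(APOSTROPHE, quote)

    return ET.tostring(root, encoding="unicode")


# Made-up sample, for illustration only.
SAMPLE = """<hyphenation-info>
<classes>
aA
'
</classes>
<patterns>
l'a1
</patterns>
</hyphenation-info>"""

adapted = adapt_hyphenation_rules(SAMPLE, ["Novelang", "D'Artagnan"])
```

Replacing the character only in element text avoids touching single quotes that may delimit attributes or appear in the XML declaration of a real file. Note that this sketch drops any DOCTYPE declaration when it serializes the result.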
As the quoting character may vary (it is defined in the stylesheet), this implies a metadata mechanism through which the stylesheet exposes the character it uses. Such a mechanism would be useful for plenty of other things, like the expected image resolution for automatic resampling.

When no licensing issue prevents distributing the hyphenation file, Novelang could ship built-in files providing the standard stuff (including the easy-to-forget hyphenation.dtd). Generating temporary files may seem inelegant, but it makes debugging easier than in-memory structures and playing with custom URL protocols. Hyphenation would get really simple for French users!

Now this opens another interesting question: how to handle documents with several languages?
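The temporary-file idea mentioned above might look like the following sketch. It is an assumption about how such a step could work, not a description of Novelang: the function name `write_hyphenation_directory` and the one-file-per-language naming ("fr.xml") are hypothetical, and Python is used only for brevity.

```python
import shutil
import tempfile
from pathlib import Path


def write_hyphenation_directory(rules_by_language, dtd_path=None):
    """Write one rules file per language into a fresh temporary directory."""
    directory = Path(tempfile.mkdtemp(prefix="hyphenation-"))
    for language, xml_text in rules_by_language.items():
        # Files stay on disk, so they can be inspected when debugging.
        (directory / f"{language}.xml").write_text(xml_text, encoding="utf-8")
    if dtd_path is not None:
        # Ship the easy-to-forget DTD next to the rules.
        shutil.copy(dtd_path, directory / "hyphenation.dtd")
    return directory


directory = write_hyphenation_directory({"fr": "<hyphenation-info/>"})
```

The returned directory is what would then be handed to FOP as the hyphenation directory. Keying the files by language code leaves room for several languages, though it does not answer how a single document would switch between them.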
