The Novelang blog: February 2009

2009-02-27

Images

I just checked a first working version of images into GitHub. It’s far from complete, but with a source document like this:

image:foo/bar.jpg

… the stylesheet receives a n:image element containing foo/bar.jpg. How beautiful. It is supposed to work the way URLs do and offer a uniform syntax. In addition, it’s more or less used elsewhere, see http://www.wikicreole.org/wiki/ImagesReasoning.

But it’s crap. It’s wrong, it’s inconsistent, it will not be comfortable.

URLs exist on their own line because this makes copy-paste easy. No operating system supports image: as a protocol. And URLs are links to another resource, while image:... represents the resource itself.

Other wikis need the image: prefix because they accept a / in the middle of the content. Novelang requires the solidus to appear as a literal. Therefore something like screenshots/preferences.png cannot be confused with three words with punctuation signs or symbols in between.

Current image support takes paths relative to the project root, but a path relative to the current Part file is more comfortable in some cases. If the image is in the same directory, we need to make the solidus character appear, so we’ll have ./preferences.png (instead of preferences.png, which could be two dot-separated words).

The extension is enough to make the difference with other resources. Almost everybody knows that .jpg, .png, .gif, .svg are for images, and it leaves room for other stuff like .csv.
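The rule sketched above (a solidus somewhere in the path, plus a well-known extension) is simple enough to try out. Here is a minimal Python sketch; the function name and the exact extension set are assumptions made for illustration, not Novelang's actual grammar.

```python
import posixpath

# Extensions named in this post; the real list is an open question.
IMAGE_EXTENSIONS = {".jpg", ".png", ".gif", ".svg"}


def looks_like_image_path(line):
    """Return True if the line is a single token containing a solidus
    and ending with a known image extension (hypothetical rule)."""
    candidate = line.strip()
    # A path is one token: no whitespace allowed inside.
    if not candidate or any(ch.isspace() for ch in candidate):
        return False
    # The solidus is what separates a path from dot-separated words.
    if "/" not in candidate:
        return False
    extension = posixpath.splitext(candidate)[1].lower()
    return extension in IMAGE_EXTENSIONS
```

With this rule ./preferences.png and /photos/dog.jpg are recognized, while preferences.png (no solidus) and data/table.csv (another kind of resource) are not.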

In a previous post about tables, I stated that an image declaration would be too long to fit in a cell, but with relative paths and no image: prefix this has to be revised.

Images are definitely not the same thing as URLs, but the decorations should work the same way, with an identifier and a name.

Then, the identifier could replace the image, or provide some kind of reference. I’ll say more about identifiers another day but here’s a complete example using two different source files.

First file: we declare the image, decorating it with metadata.

  \dog-with-bone
  "My dog with its bone"
/photos/dog.jpg

Second file: using images declared above.

See a picture of my dog:

\dog-with-bone

... later in the text, we want some reference to the 
picture to appear, like its name, some hyperlink or 
a figure number (depending on the stylesheet). 

You have already seen my dog in -\dog-with-bone .

In the end, it seems we get the best of all worlds with a compact notation.

2009-02-21

Embedded maths

Musing on the Web I discovered those little gems: JEuclid, OpenOffice Math and ASCIIMathML.

JEuclid is a renderer for MathML, an XML-based representation of mathematical formulæ. JEuclid has a FOP plugin, which transforms an embedded MathML expression into a nice formula in the resulting PDF.

http://jeuclid.sourceforge.net

JEuclid claims it supports the .mml files exported by OpenOffice Math.

MathML is horribly verbose and not intended to be used by humans, but rather to help programs to interoperate.

OpenOffice Math is a math formula editor, bundled with OpenOffice. It’s partially WYSIWYG as it lets you type a formula in plain text and produces a preview in (quasi) real time. OpenOffice Math favors its .odf format but is able to save and load formulæ in MathML. Its text-based editor supports formulæ like this:

f(x)=sum from{n=0} to{infinity} 
  { { f^{(n)}(a) } over {n!} (x-a)^n }

You can see OpenOffice Math in action here: http://en.wikipedia.org/wiki/OpenOffice.org_Math

Used together, JEuclid and OpenOffice Math could make Novelang more attractive to TeX users, who have always been unbeatable when it comes to crafting beautiful graphics from text-based formulæ.

Novelang could learn to recognize a reference to a MathML file to be edited with OpenOffice:

When ``a > b`` we always have
math:my-formula.mml
bla bla blah.

In a perfect world, Novelang would support formulæ inside the source document (with a tweak to make a n:lines-of-literal appear inside a paragraph).

When ``a > b`` we always have
<<<math
f(x)= ...
>>>
blah blah blah.

This requires a translator from text-based formula to MathML. Such a translator is hard to find, especially with the OSS constraint.

Maybe I’ve found this rare beast, with ASCIIMathML. It’s a JavaScript-based translator designed to run inside a Web browser. The interactive demo is stunning!

http://mathcs.chapman.edu/~jipsen/mathml/asciimath.html

It recognizes TeX (same formula as above):

$f(x)=\sum_{n=0}^\infty\frac{f^{(n)}(a)}{n!}(x-a)^n$

… or its custom ASCII-based format. Here again, the same formula:

`f(x)=sum_(n=0)^oo(f^((n))(a))/(n!)(x-a)^n`

ASCIIMathML is released under the LGPL. Great. I just wonder if it works inside a Java-powered JavaScript interpreter.

Font survey

Gentium

According to its website, Gentium is a typeface family designed to enable the diverse ethnic groups around the world who use the Latin and Greek scripts to produce readable, high-quality publications.

http://scripts.sil.org/cms/scripts/page.php?site_id=nrsi&item_id=Gentium

The license is quite liberal. It avoids the “viral” effect of the GPL: embedding the Gentium font in a PDF you redistribute in digital form won’t make the GPL apply to the PDF.

http://scripts.sil.org/cms/scripts/page.php?site_id=nrsi&id=OFL

Bad news: the OFL doesn’t seem compatible with the GPL v3 (documentation is unclear). So you can use the Gentium fonts but you’ll have to download them with your bare hands, as they cannot be included in a Novelang distribution, which is released under GPL v3.

Junicode

Good news: the beautiful Junicode font will become part of Novelang. Junicode (short for Junius-Unicode) is a Unicode font for medievalists. It embeds various amazing characters like Old English letters and runes. But as a great classical serif font, it’s also a nice replacement for boring Times New Roman.

http://junicode.sourceforge.net

But the Junicode license, which is GPL v2, has no exception for embedding, so if you embed it in a PDF (which Novelang always does) your PDF will have to be redistributable under the GPL.

Linux Libertine

Better news: the Linux Libertine font, bundled in Novelang for a long time, is dual-licensed, under both GPL and OFL, so you can use it at no legal risk.

http://linuxlibertine.sourceforge.net/Libertine-EN.html

2009-02-20

Novelang-0.20.0 released!

Latest release of Novelang available here.

This version brings tables!

While source documents cannot embed any formatting information, it's possible to have very fine control at stylesheet level, with all the power of FO! The Novelang documentation does this for its PDF, by overriding n:cell... elements when it detects a special style (purposefully named character-escapes here). See pdf.xsl. The style is set from the novelang.nlb book file, which sets the structure of the whole Book.

See release notes for details.

Enjoy!

2009-02-19

Tables

Other wikis make tables a daunting subject, often because they try to embed crazy formatting instructions. You get a taste here: http://www.wikicreole.org/wiki/ListOfTableMarkups. At first I decided that tables wouldn’t be a concern for a long, long time because I felt I could live without them. But everybody needs to create tables, and simulating them with literals in a fixed-width font is not great. I reconsidered my position upon an external request that made me ask myself: after all, would it be so hard to implement the simplest, smallest, least controversial feature set to let people define tables?

Obviously this would start around the well-adopted vertical line separator. To keep things clear, no line break is allowed inside a table definition. Vertical lines need not be vertically aligned. So it would look like:

| row1, col1 | row1, col2 | row1, col3 | 
| row2, col1 |  row2, col2   | row2, col3 |

No headers, no justification, no span, no calculated fields, no neverending list of conflicting features. All complicated stuff must be done in the stylesheet or never. Then it becomes easy to support tables in Novelang grammar, and easy to write correct source documents using them.

There is still one potential conflict, however, with URLs (or images, which will use the same grammatical approach). In a Novelang source document, a URL must appear at the start of a line. But how to make it appear inside a row of cells while keeping the grammar readable? The answer is: no URL inside a table. A URL is a lengthy thing which doesn’t fit in a table cell, so there is another syntax to find. Meanwhile, users are free to handle particular cases with a context-specific hack in their stylesheet.

Because users are encouraged to write stylesheets on their own, the XML structure given in input has to be very consistent. Of course the structure will look like an HTML table (table containing rows which contain divisions). By avoiding semantic notations (e.g. drop n:emphasis and prefer n:block-inside-solidus-pairs) the XML offers a more structural view, letting the stylesheet give a meaning on its own.

There is the temptation to call a table a “table” but it would be lying. See this:

| item1 | item2 | item3 |

The first vertical line indicates the start of the first item, the last vertical line indicates the end of the list, and items are separated by vertical lines. The table row is just a picture in your mind. In fact, it’s no more than a list. The whole table is a list of lists. The default stylesheet will show it as a table, maybe with fancy headers, but it’s just one way to show lists of lists.

So which name could tell it’s a list of lists, while not completely hiding the fact it’s a table?

The word “cell” is a good start. Tables have cells, but we’re living in a universe where cells exist outside of tables. So our items get wrapped in n:cell elements.

For a row, n:cell-list is an obvious choice but it doesn’t tell about the “horizontality”. So we’ll prefer n:cell-row which is compatible with the table vocabulary, while staying close to the representation in the source document. Other XML elements tell about their delimiting character, like n:list-with-triple-hyphen so it’s tempting to consider n:cell-row-with-vertical-lines-between-cells for consistency.

The solution is to tell about the delimiter only in the element enclosing all the rows. (This is what embedded lists will intend to do, with a n:embedded-list-with-hyphen wrapping items with a more generic name.) Something like n:cell-rows-delimited-by-vertical-lines is a bit long and we don’t need to embed the whole specification in the name. n:cell-rows-with-vertical-lines is shorter and evocative enough.

To summarize, the XML representation of a table which is in fact a list of lists would look like this:

<n:cell-rows-with-vertical-lines>
  <n:cell-row>
    <n:cell>item1</n:cell>
    <n:cell>item2</n:cell>
    <n:cell>item3</n:cell>
  </n:cell-row>
</n:cell-rows-with-vertical-lines>
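To check that the "list of lists" view holds together, here is a minimal Python sketch that reads rows delimited by vertical lines and prints the XML structure shown above. The function names are made up for this example, and the namespace declaration for the n: prefix is left out; this is not Novelang's parser.

```python
from xml.sax.saxutils import escape


def parse_cell_rows(source):
    """Turn lines like '| a | b |' into a list of lists of cell texts."""
    rows = []
    for line in source.splitlines():
        line = line.strip()
        if not line:
            continue
        # Each row starts and ends with a vertical line.
        if len(line) < 2 or not (line.startswith("|") and line.endswith("|")):
            raise ValueError("not a cell row: %r" % line)
        # Vertical lines need not be aligned: just strip each cell.
        rows.append([cell.strip() for cell in line[1:-1].split("|")])
    return rows


def cell_rows_to_xml(rows):
    """Render the list of lists with the element names chosen in this post."""
    lines = ["<n:cell-rows-with-vertical-lines>"]
    for row in rows:
        lines.append("  <n:cell-row>")
        for cell in row:
            lines.append("    <n:cell>%s</n:cell>" % escape(cell))
        lines.append("  </n:cell-row>")
    lines.append("</n:cell-rows-with-vertical-lines>")
    return "\n".join(lines)


if __name__ == "__main__":
    print(cell_rows_to_xml(parse_cell_rows("| item1 | item2 | item3 |")))
```

Nothing in the parsing step knows about tables: rows and cells are plain nested lists, and only the rendering step picks the element names.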

Novelang-0.19.0 released!

Latest release of Novelang available here. This version enhances custom charset support, and brings a convenient interface to the batch document generator. See documentation for details. Enjoy!

2009-02-18

Batch charset transcoding

Here is a Bash script transcoding all .nlp files from ISO-8859-1 to UTF-8, including those in subdirectories.

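#!/bin/bash

# Transcode every .nlp file under the current directory, in place.
# -print0 and read -d '' keep file names containing spaces intact.
find . -type f -name "*.nlp" -print0 | while IFS= read -r -d '' i
do
  echo "$i"
  # Replace the original only if iconv succeeded.
  if iconv -f ISO-8859-1 -t UTF-8 "$i" > "$i-utf"; then
    mv "$i-utf" "$i"
  else
    rm -f "$i-utf"
  fi
done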

Of course it doesn’t perform ultra-clever, Novelang-friendly transcoding like changing «latin-capital-letter-o-with-double-acute» to Ő inside the sources, but I can’t beat its performance / price ratio.

Found on: http://niko.gramophon.com/index.php?op=ViewArticle&articleId=6380

2009-02-16

Novelang-0.18.0 released!

Latest release of Novelang available here.

This version brings experimental support for custom charsets for both source and rendered documents. See "Internationalization" chapter for details.

Enjoy!

2009-02-15

More charsets

While Novelang documentation and samples are in English, Novelang already supports French characters and typography very well. I must confess there has been no testing yet with charsets other than ISO-8859-1 (Western European), which is (almost) perfect for both French and English.

What happens when trying to add support for a new charset? This should be just a few additional declarations inside the grammar file. Here are Hungarian characters submitted by a reader of this blog:

ö ü ó ő ú é á ű í 
Ö Ü Ó Ő Ú É Á Ű Í

First, Novelang has to read those characters as they are, given the right charset.

According to Wikipedia, all those characters are part of the ISO-8859-2 (Central European) charset. See: http://en.wikipedia.org/wiki/ISO_8859-2. They don’t seem to belong to the Mac Roman charset. Of course, they fit perfectly into the UTF-16 charset.

This is confirmed by Smultron, a text editor which cleverly refuses to save a file containing characters which don’t belong to the declared charset.
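The same check can be done without a text editor. Here is a small Python sketch, relying on Python's built-in codecs (not on anything in Novelang), which lists the characters a given charset cannot encode. The function name is made up for this example.

```python
# The Hungarian characters quoted above, lower case then upper case.
HUNGARIAN = "öüóőúéáűí" + "ÖÜÓŐÚÉÁŰÍ"


def unsupported_characters(text, charset):
    """Return the characters of text that charset cannot encode."""
    missing = []
    for ch in text:
        try:
            ch.encode(charset)
        except UnicodeEncodeError:
            missing.append(ch)
    return missing


if __name__ == "__main__":
    for charset in ("iso-8859-1", "iso-8859-2", "mac_roman", "utf-16"):
        print(charset, unsupported_characters(HUNGARIAN, charset))
```

With Python's codecs, only the four letters with a double acute accent are missing from ISO-8859-1 and Mac Roman, and nothing is missing from ISO-8859-2 or UTF-16.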

Starting Novelang with -Dfile.encoding=ISO-8859-2, the two O with double acute are rejected as unknown but the other characters do pass, including the U with double acute.

Starting Novelang with -Dfile.encoding=UTF-16 gives a lot of mess because of the two bytes instead of one.

Starting Novelang with UTF-16 or ISO-8859-2 as default value inside the novelang.parser.Encoding class, all characters do pass. This requires recompiling Novelang.

What’s the problem with -Dfile.encoding system property? I can’t tell. Anyways, this is not the right place to set the charset of source documents because this property would apply to all other files, including configuration files. So the constant inside novelang.parser.Encoding should become a configurable thing.

How configurable? I can see several useful places to set the charset.

— For the whole Novelang daemon instance (almost like -Dfile.encoding).

— For a whole Book with a source-charset command.

— For a source document read from a book using insert command.

— For currently rendered document with a source-charset query parameter (making sense only for Parts).

This would enable Books with various charsets in their Parts. Great!

Now what happens at rendering time?

PDF may render well, given the right fonts. When specifying no font directory (with the --font-dirs command-line argument) there is a number sign # instead of the vowels with double acute accent. With the Linux Libertine font (shipped with Novelang) those characters appear as they should.

PDF is the easy case, because it uses Unicode during all its processing, and finally embeds the fonts.

HTML is more complicated and it will never show well if the user agent doesn’t provide the right font. Before that, HTML should embed the right character with the right charset.

Setting novelang.parser.Encoding to ISO-8859-2 is not enough. Characters appear correctly in HTML only with -Dfile.encoding=ISO-8859-2 set in addition. I guess this is needed for the streaming of transformation result.

Novelang passes rendered document charset as parameter to XSL stylesheets. This parameter by now reflects the constant value in novelang.parser.Encoding. Novelang’s default XSL stylesheet for HTML injects this parameter in the HTML header (meta/content/charset). What about adding a rendering-charset command for Books, and a rendering-charset query parameter? There are several things to consider.

— An XSL stylesheet defines the charset of the resulting HTML in content/charset. In the default stylesheet, this is where the value of our new rendering-charset should appear. But users may write stylesheets that don’t use this parameter.

— An XSL stylesheet may use characters outside of its own charset. This is made possible using character escapes (like &mdash;), possibly through XML entity inclusion. Novelang comes with ISOlat1.pen, ISOnum.pen, ISOpub.pen standard entity sets.

— An XSL stylesheet may render characters outside of the charset of the resulting document. This would obviously be the case with Books made of documents with various charsets, but this is already the case with escaped characters like the OE ligature, a character which does not appear in ISO-8859-1. The current implementation would perform unneeded escapes if rendering a document in a charset which supports the OE ligature.

From the last case, we can state this rule: “when feeding the XSL stylesheet with text, characters which are not supported by the charset of the resulting HTML must be escaped”. By now, Novelang does a bit of this inside the novelang.rendering.HtmlWriter. When feeding the stylesheet with text, the HtmlWriter calls novelang.parser.Escape#escapeHtml which avoids wrecking HTML with literal &, < and >. But escaping also occurs for the OE ligature, which plagues French users because it is not a part of ISO-8859-1! So we know where to plug character escaping, but this should occur depending on the charset of the resulting HTML.

How could Novelang know the charset used by the output of an XSL stylesheet? The rendering-charset parameter described above offers a straightforward solution. What about some XSL metadata to express the specific charset needed by a stylesheet? Use cases are quite obscure and XSL metadata is not a simple thing. Go for the rendering-charset parameter!

In order to keep HTML source readable, character entities would be named character entities, instead of numerical ones. Sometimes – like for O with double acute – the named entity doesn’t exist, but whenever possible, something like &Egrave; is definitely more readable than &#200;. In order to keep all definitions in the same place, the named entity could appear alongside the character declaration in the Novelang grammar. It just implies some extra parsing in the SupportedCharactersGenerator.
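The escaping rule discussed above (escape only what the target charset cannot hold, and prefer named entities) can be sketched in a few lines of Python. This is an illustration using Python's standard HTML 4 entity table, not the code of novelang.parser.Escape; the function name is hypothetical.

```python
from html.entities import codepoint2name


def escape_for_charset(text, charset):
    """Escape markup characters, plus any character that the target
    charset cannot encode, preferring named entities."""
    out = []
    for ch in text:
        if ch in "&<>":
            # Always escaped, whatever the charset: &amp; &lt; &gt;
            out.append("&%s;" % codepoint2name[ord(ch)])
            continue
        try:
            ch.encode(charset)
            out.append(ch)  # The charset supports it: no escape needed.
        except UnicodeEncodeError:
            name = codepoint2name.get(ord(ch))
            # Fall back to a numeric entity when no name exists.
            out.append("&%s;" % name if name else "&#%d;" % ord(ch))
    return "".join(out)


if __name__ == "__main__":
    print(escape_for_charset("cœur", "iso-8859-1"))  # c&oelig;ur
    print(escape_for_charset("cœur", "utf-8"))       # cœur
    print(escape_for_charset("Ő", "iso-8859-1"))     # &#336;
```

The OE ligature is escaped only when rendering to ISO-8859-1, and the O with double acute falls back to a numeric entity because the HTML 4 table has no name for it.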

As I’m writing this post with Novelang, I realize there is no generic way to escape characters which are part of the grammar. The use case here is obvious: my source document is in ISO-8859-1 and I want those Hungarian characters to appear in the text. The novelang.parser.Escape class holds hardcoded definitions for character escapes with Unicode names (like euro-sign) and even HTML entity names as shortcuts (oelig being an alias of latin-small-ligature-oe). The novelang.parser.Escape class could feed its table from values in SupportedCharacters which is kept in sync with the grammar.

Note: the two terms charset and encoding are almost synonyms. Because W3C and ISO seem to prefer the term charset, Novelang should use it too. This saves the more generic term encoding for other uses, while charset is clearly scoped to characters. The -Dfile.encoding system property is just badly named (and its effect is not restricted to files). See: http://en.wikipedia.org/wiki/Character_encoding.

2009-02-11

Novelang-0.17.0 released!

Latest release of Novelang available here. Just introduced unlimited level depth. There was no special need for this feature: a depth of 2 is enough most of the time. As Linus Torvalds said: "More than 3 levels of indentation means you're screwed". But while developing embedded lists I found that a generic way to handle levels removed an ambiguity. Please note that built-in stylesheets don't handle a depth greater than 2 for now. Enjoy!

Sourceforge: useful links when releasing

As SourceForge greatly improved its services, especially for remote shell and file upload, there are plenty of things to learn again. Here is the documentation on SSH client, File Release System, File management service. It's good to note that only rsync/SSH, SFTP and SCP are recommended above 20MB. rsync supports resume but there doesn't seem to be an Ant task for emulating it. SCP'ing file release:

scp Novelang-VERSION.zip USERNAME@frs.sourceforge.net:uploads

SCP'ing documentation (while in target/documentation/site):

 scp * USERNAME,PROJECTNAME@frs.sourceforge.net:/home/groups/n/no/novelang/htdocs

Opening an SSH session:

 ssh -t USERNAME,PROJECTNAME@shell.sourceforge.net create

Session info with timeleft and shutdown. Unzip javadoc after uploading it, from project home:

 unzip -o -d htdocs/javadoc javadoc.zip

Now, all of this can be automated in the build scripts!

2009-02-09

Multiple outputs

The HTML documentation of Novelang is made of one huge HTML page. That’s inconvenient. In a perfect world there would be a means to split the document into several pages. Curious about how this could be done, I found that Xalan 2.7.1 provides an extension which helps a lot with this, through the Redirect class. http://xml.apache.org/xalan-j/extensionslib.html#redirect The syntax is clean: just wrap the XSL expression with a redirect:write element. I think this extension is too permissive, as you can specify absolute file paths. At least it’s a good starting point for writing a new one, which would accept only logical names for files.

2009-02-08

Of inconsistent depth for levels and embedded lists

Now I’m working on the tree rehierarchization, which means taking flat items and giving them a tree-like structure. This applies to embedded lists and levels. Take a Novelang source document like this:

Blah.

== Depth 1

Boo.

=== Depth 2

Yuck.

The parser converts this into an almost flat structure where text preceded by a sequence of equal signs becomes a level introducer. So we get this intermediary structure:

n:part
 +-- n:paragraph-regular "Blah."
 +-- level-introducer "==", "Depth 1"
 +-- n:paragraph-regular "Boo."
 +-- level-introducer "===", "Depth 2"
 +-- n:paragraph-regular "Yuck."

A rather hidden step is the hierarchizer, which converts level introducers (which also convey indentation information) into plain levels. The hierarchizer’s result looks like this:

n:part
 +-- n:paragraph-regular "Blah."
 +-- n:level
      +-- n:level-title "Depth 1"
      +-- n:paragraph-regular "Boo."
      +-- n:level
           +-- n:level-title "Depth 2"
           +-- n:paragraph-regular "Yuck."
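For the curious, the conversion from the flat structure to the tree fits in a small stack-based loop. Here is a minimal Python sketch; the tuple format of the flat items and the Node class are assumptions made for this example, and the real hierarchizer works on Novelang's own tree classes.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    text: str = ""
    children: list = field(default_factory=list)


def hierarchize(flat_items):
    """Nest flat items under n:level nodes.

    flat_items holds ("paragraph", text) or ("introducer", marker, title)
    tuples. Assumes markers have at least two equal signs ("==" is depth 1)
    and that depths are consistent.
    """
    root = Node("n:part")
    stack = [root]  # stack[d] is the currently open node at depth d
    for item in flat_items:
        if item[0] == "introducer":
            _, marker, title = item
            depth = len(marker) - 1
            del stack[depth:]  # close levels that are as deep or deeper
            level = Node("n:level", children=[Node("n:level-title", title)])
            stack[-1].children.append(level)
            stack.append(level)
        else:
            # Paragraphs go into the innermost open level.
            stack[-1].children.append(Node("n:paragraph-regular", item[1]))
    return root
```

Fed with the five flat items of the example above, it gives back a tree with the same shape as the hierarchizer's result.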

This all works the same with (yet unimplemented) embedded lists so we won’t have the discussion twice. Just keep in mind what embedded lists look like:

- depth 1
  - depth 2

Now what should happen if the source document looks like this? See:

=== Depth 2

== Depth 1

More generally, what should happen when a level introducer has no preceding level introducer of a smaller depth? A tempting approach is promoting the first item to the smallest depth. This looks smart but it would distort information. The example above becomes:

n:part
 +-- n:level
      +-- n:level-title "Depth 2"
 +-- n:level
      +-- n:level-title "Depth 1"

The hierarchizer could also create a level from nothing in order to keep the correct depth. This is information distortion as well:

n:part
 +-- n:level         // Created from nothing
      +-- n:level
           +-- n:level-title "Depth 2"
 +-- n:level
      +-- n:level-title "Depth 1"

Here is the real question: how could inconsistent level depth mean something? There could be some extreme cases when concatenating different Parts, but this should not pollute more general cases. Maybe the intent is to have a rendered document with small titles in some kind of introduction happening before the big title of some kind of chapter. Then it should be handled at stylesheet level. Yet the cleanest and simplest approach for handling depth inconsistencies for levels and embedded lists is to report an error.
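Reporting the error is cheap. Here is a minimal Python sketch of the check, working on bare depth numbers so it applies to levels and embedded lists alike. The rule it enforces (a depth may exceed the previous one by at most one) is my reading of "inconsistent" here, and the function name is made up for this example.

```python
def check_depths(depths):
    """Raise ValueError at the first item whose depth has no enclosing
    item exactly one step shallower."""
    open_depth = 0  # Nothing is open before the first item.
    for position, depth in enumerate(depths):
        if depth > open_depth + 1:
            raise ValueError(
                "item %d: depth %d has no enclosing item of depth %d"
                % (position, depth, depth - 1)
            )
        open_depth = depth
```

A sequence of depths 1, 2, 1 passes, while 2, 1 (the troublesome example above) is rejected at its first item.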

2009-02-05

JPdfUnit

JPdfUnit looks sweet. It checks that some PDF document has expected properties, like containing some text or embedding some fonts.