The Novelang blog: 2009

2009-12-27

Descriptors in HTML

Designators are identifiers and tags. Source documents may contain explicit designators, but since version 0.37.0 Novelang is smart enough to generate implicit designators from the text of source documents. The rules for implicit designators are as intuitive as possible, but it helps to show somewhere which implicit designators Novelang generated for you.

So the default HTML stylesheet introduces a new artefact called “descriptor”. It’s a text area surrounding a paragraph or a level title, which unfolds to display implicit designators and maybe other useful things in the future, like the location in the source document.

Looks like this:

The descriptor shows:

Implicit identifier: \\ThisIsASectionWithATitle_andStyleInsideTheTitle
Implicit tag: ThisIsASectionWithATitle
Implicit tag: andStyleInsideTheTitle

While working on layout and animation I found those links useful:

JQuery, the must-have JavaScript framework for doing everything with the browser’s DOM in a concise and elegant fashion.

Stuff About CSS floats. (As a bonus, I found this one about floatless layout, which may become useful one day.)

CSS popups are fine, but I gave up on this approach: the popup only appears while the mouse pointer is over a drop zone, which prevents copy-pasting.

This one about null HTML links also helped to drop bad ideas.

2009-12-22

Novelang-0.38.1 released!

Download Novelang-0.38.1 here!

Fixed occasional crash caused by Implicit Tags.

2009-12-21

Novelang-0.38.0 released!

Download Novelang-0.38.0 here!

Experimental support for Implicit Tags, deduced from the level’s title.

2009-12-14

Generating human-friendly designators

A Designator helps to locate text fragments. It is a generic term for Tags and Identifiers. With Novelang-0.37.0 come Implicit Identifiers, which make a level title behave as an Identifier. With explicit Identifiers, to reference a level from an insert command, you decorate the level with an Identifier like this:

  \\Preamble
== Preamble

This is a preamble, blah blah blah...

And this is how to insert only the Part with “Preamble” title in some Novelang book:

insert file:my-document.nlp \\Preamble

But why duplicate the word “Preamble”? As long as it doesn’t collide with another Identifier we should be able to write:

== Preamble

This is a preamble, blah blah blah...

… And use the insert command the same way.

Now with this feature available, it makes sense to support implicit Tags, too. When requesting a document containing only fragments tagged with @Preamble one could expect to see our level with the “Preamble” title. The need for Implicit Tags and Identifiers came from documents looking like this:

  \\Preamble
  @Preamble
== Preamble

...

Not so good for a typing-savvy tool, is it? So now we need a common rule to create Implicit Tags and Implicit Identifiers out of legal Novelang level titles.

There are some differences between Implicit Tags and Implicit Identifiers.

— Implicit Tags don’t appear in the list of explicitly defined Tags (in the n:meta/n:tags element).

— One given level title generates only one Implicit Identifier, but it may generate several Implicit Tags. This makes sense for long titles; the longer they are the less likely they are to appear several times in the rendered document. With a simple rule – like breaking on punctuation signs – a long title may generate several meaningful Tags.

Here are some generic rules for crafting Implicit Designators:

— Generate something as close as possible to what a human would write.

— Resolve to a limited set of characters that comply with the specification of a URL. For now, Tags appear in the URL-like document request as parameters. There is a chance to support Identifiers as document request parameters, too.

To make a long story short, the RFC lists diacriticless letters, digits and the "$-_.+!*'()," characters as legal parts of a URL. Note that support for punctuation signs is not uniform (! is supported but not ? and :). For this reason, we exclude punctuation signs. Same for paired delimiters. The asterisk, plus sign, and dollar sign don’t appear as document constructs (they may only appear under some escaped form), so we exclude them too. Only the low line _ and the hyphen-minus - remain.

Implicit Tags split on punctuation signs, while Implicit Identifiers must keep them by some means. By disallowing the low line in Tag syntax, we save it for punctuation sign replacement in Implicit Identifiers.

The hyphen-minus may replace the space character. But forcing camel case makes for shorter Designators, while keeping them quite readable. Camel case only happens when stripping whitespace between two adjacent words.

Samples:

Document source   Implicit Designator
aéœ               aeoe
x, yz             x_yz      -> 2 Tags: @x  @yz 
X, yz             X_yz      -> 2 Tags: @X  @yz 
v `0.1.2`         v0-1-2
Foo bar           FooBar
foo bar           fooBar
foO BAR           foOBAR
w (x yz)          w_xYz     -> 3 Tags: @w  @x  @yz
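
To make the rules above concrete, here is a minimal Python sketch that derives an Implicit Identifier from a level title. It is not Novelang's actual code: the names implicit_identifier, fold_diacritics, camel_case and the LIGATURES table are made up for the illustration, and it only covers plain words, punctuation signs and paired delimiters. The block-of-literal sample (v `0.1.2`) and the splitting into Tags are left out.

```python
import re
import unicodedata

# Assumed mapping: ligatures that Unicode normalization does not decompose.
LIGATURES = {"œ": "oe", "Œ": "OE", "æ": "ae", "Æ": "AE"}


def fold_diacritics(text: str) -> str:
    """Replace ligatures, then drop combining accents (é becomes e)."""
    for ligature, plain in LIGATURES.items():
        text = text.replace(ligature, plain)
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def camel_case(words: list[str]) -> str:
    """Join words, upper-casing only the first letter of each following word."""
    head, *tail = words
    return head + "".join(w[0].upper() + w[1:] for w in tail)


def implicit_identifier(title: str) -> str:
    """Hypothetical helper: derive an Implicit Identifier from a level title."""
    folded = fold_diacritics(title)
    # Punctuation signs and paired delimiters separate segments.
    segments = re.split(r"[^A-Za-z0-9\s]+", folded)
    # Whitespace inside a segment is stripped with camel case;
    # the low line stands for the punctuation between segments.
    parts = [camel_case(s.split()) for s in segments if s.split()]
    return "_".join(parts)
```

Under these assumptions, implicit_identifier("w (x yz)") gives w_xYz and implicit_identifier("foO BAR") gives foOBAR, as in the table.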

2009-12-13

Novelang-0.37.0 released!

Download Novelang-0.37.0 here!

Implicit identifiers, deduced from the level’s title. See the “Identifiers” chapter in Part syntax.

2009-09-27

Novelang-0.36.0 released!

Download Novelang-0.36.0 here!

New --style-dirs command line parameter (superseding --style-dir) for multiple style directories.

Minor identifier-related fixes and internal refactorings.

2009-09-20

Novelang-0.35.0 released!

Download Novelang-0.35.0 here!

Experimental support for identifiers. See “Identifiers” chapter in Part syntax, and “Insert Command” in “Book Files”.

2009-09-09

Novelang-0.34.1 released!

Download Novelang-0.34.1 here!

Fixed: bug preventing a Novelang release with a numbered version from starting.

Novelang-0.34.0 released!

Download Novelang-0.34.0 here!

New levelabove option for insert book command.

New sort option for insert book command.

New explodelevel batch command for splitting one document’s levels into several parts.

New --content-root command line argument for setting the directory where content files reside.

Fixed: paragraph as list did not support indented embedded list items.

2009-08-23

Novelang-0.33.1 released!

A new release of Novelang is available! Download it here and see documentation for details.

2009-08-21

Apostrophe and quotation marks

Finally, people getting serious about this.

2009-08-17

Novelang-0.33.0 released!

A new release of Novelang is available! Download it here and see documentation for details.

2009-08-08

Novelang-0.32.1 released!

A new release of Novelang is available! Download it here and see documentation for details.

2009-08-02

Novelang-0.32.0 released!

A new release of Novelang is available! Download it here and see documentation for details.

2009-07-26

Novelang-0.31.1 released!

This release brings minor fixes around the new block-after-tilde feature. It can be downloaded here. See documentation for details.

Idiom: custom blocks

Novelang has no semantic markup; instead it creates an AST (Abstract Syntax Tree) to feed a stylesheet with. This allows creating document-specific idioms, to be handled at stylesheet level. Here is one.

Starting from a source document like this:

<<
[INFO] This is an info block.
>>

<<
[WARNING] Beware of "this" paragraph.

(This warning spreads on several paragraphs.)
>>

We want lines of literal to appear in a special manner (like within a frame and with a special icon in the margin). Here is how to achieve this:

<?xml version="1.0"?>
<xsl:stylesheet
   xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0"
   xmlns:n="http://novelang.org/book-xml/1.0"
>
 <xsl:import href="default-html.xsl" />
 <xsl:import href="punctuation-FR.xsl" />

 <xsl:template match="/" >
   <xsl:apply-imports/>
 </xsl:template>

 <xsl:template match="n:paragraphs-inside-angled-bracket-pairs" >
   <xsl:choose>
     <xsl:when 
         test="n:paragraph-regular[1]/n:block-inside-square-brackets[1]='WARNING'" 
     >
       <blockquote>
         <b>WARNING</b><br/>
         <xsl:apply-templates />
       </blockquote>
     </xsl:when>
     <xsl:when 
         test="n:paragraph-regular[1]/n:block-inside-square-brackets[1]='INFO'" 
     >
       <blockquote>
         <b>INFO</b><br/>
         <xsl:apply-templates />
       </blockquote>
     </xsl:when>
     <xsl:otherwise>
       <blockquote>
         <xsl:apply-templates />
       </blockquote>
     </xsl:otherwise>
   </xsl:choose>

 </xsl:template>

 <xsl:template 
     match="n:block-inside-square-brackets[ text()='WARNING' or text()='INFO' ]" 
 />

</xsl:stylesheet>

Rendering document source samples

A nice feature in the documentation would be to show the Novelang source and the rendering result at the same time. There are several ways to achieve this:

— Duplicate the source code in the Novelang document. One is escaped, one is not. The latter gets rendered in the document itself, in a n:paragraphs-inside-angled-bracket-pairs element with a special tag. For now this won’t work in many cases, like levels or lines of literal.

— Reference a screenshot of a previous rendering. This is the most stupid solution because it’s boring to do and hard to keep up-to-date.

— Be clever and generate the image dynamically from the source snippet.

Rendering tools

How to render a PDF fragment into an embeddable image?

IcePDF claims to be open source but the license doesn’t appear on the Web site and downloading the product requires registration. Anyway, the Java WebStart’ed demo doesn’t display anything except a pair of messages telling it’s a trial version. This behavior was observed on Mac OS X 10.5 and Java 6.

PDFRenderer is available under LGPL. The project seems a bit asleep for now; it looks like a dump-everything-to-the-community effect of Sun’s policy in recent years. PDFRenderer does a nice job with many PDFs, but Novelang-generated ones appear severely broken!

PDFBox is licensed under the Apache License, but contains license notices from Adobe (for AFM fonts) and Sun (for JAI). A close look at PDFBox-7.3.jar shows it embeds those AFM fonts.

Since PDFBox-7.3 doesn’t work (spits an exception), let’s check a snapshot out! This is revision 795516 or something. The build goes well, and image generation doesn’t crash. But the text in images appears seriously damaged! And the font doesn’t look correct. The original was created using Linux Libertine; images contain a Helvetica-like font which may not have the same metrics. And no text in non-proportional fonts appears at all.

Should I give up my dream of finding an OSS solution for rendering images out of FOP-generated PDF documents? Debugging FOP or PDFRenderer looks like a lot of work. And, while it’s easier to get perfect control over PDF rendering, HTML rendering may be enough for creating the samples.

So here comes Flying Saucer to the rescue. It’s a pure-Java XHTML renderer which supports CSS 2.1. I’ve used it already and I know it works. The “inheritable” nature of CSS means I can tweak the output a bit (reducing margins and page width) while reusing the default CSS stylesheet.

Finally, this whole product review turns out to be nonsense, because FOP is supposed to generate images directly! Insanely great!

Integration to Novelang

Here comes the hard stuff. How to include external resources depends on whether the document is self-contained (PDF) or multipart (HTML), and whether it is generated by the generator (batch) or the HTTP dæmon (interactive). As a self-contained document, a PDF is generated the same way whether the context is batch or interactive.

The FO stylesheet may manage image embedding into the PDF, which avoids spreading complexity elsewhere. For SVG, fo:instream-foreign-object allows direct inclusion of the XML. For images, the architecturally simple approach would be to write a FOP extension taking the code snippet as a parameter, then inserting the rendered image into the Area Tree.

Using external files only makes sense when generating HTML documents, because in this case we’re pretty sure that the user agent won’t request the image before it can read its address from the HTML. For PDF documents, the temporary file must exist before running the FO stylesheet, so it would require some kind of ugly pre-processing.

External files are generated “once-for-all” in batch mode. But, in interactive mode, how long should they live? And does it make sense to write files on the filesystem while the resource could be dynamically generated?

Dynamic resources could be kept in some session-scoped cache. This is how it would work:

— No need to cache the generated image, only the source snippet. This allows deferred generation.

— The HTTP session contains several cache areas, one per document name.

— When a fresh document is generated, reset the whole cache area for this document name.

— During XSLT processing, call an XSL extension that feeds the cache with snippets.

— Given a snippet, the cache returns some kind of identifier to be inserted as a link in resulting HTML.

— A special resource handler (at HTTP dæmon level) queries the cache with the identifiers.

— If the cache has such a snippet, then it returns it for rendering.
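
Here is a minimal Python sketch of such a toy cache, just to pin the idea down. Everything in it is an assumption made for the illustration: the SnippetCache class, its method names, and the choice of a truncated SHA-1 digest as the identifier. Session handling and the actual image rendering are left out.

```python
import hashlib
from typing import Optional


class SnippetCache:
    """Hypothetical session-scoped cache: one area per document name."""

    def __init__(self) -> None:
        # document name -> {identifier -> source snippet}
        self._areas: dict[str, dict[str, str]] = {}

    def reset(self, document_name: str) -> None:
        """Called when a fresh document is generated."""
        self._areas[document_name] = {}

    def put(self, document_name: str, snippet: str) -> str:
        """Store the source snippet; return an identifier for the HTML link."""
        identifier = hashlib.sha1(snippet.encode("utf-8")).hexdigest()[:12]
        self._areas.setdefault(document_name, {})[identifier] = snippet
        return identifier

    def get(self, document_name: str, identifier: str) -> Optional[str]:
        """Return the snippet to render, or None if the cache was cleaned."""
        return self._areas.get(document_name, {}).get(identifier)
```

Only the snippet is stored, so the image itself can be rendered when the resource handler asks for it.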

The XSL extension called by the stylesheet could trigger two different behaviors, depending on whether it runs in interactive (dæmon) or batch mode:

— Use the caching stuff as above.

— Just write the image file on the filesystem.

How to invalidate the cache? Session expiration is not enough: if several documents exist in the same session, some may become unused, therefore causing excessive memory consumption. To avoid this, turn references to “old” documents into soft references so the JVM would clean them upon memory demand.

Bad behavior would occur when trying to load an image inside an HTML page after the cache got cleaned in some way (session expiration or memory reclaim) and prior to refreshing the whole page. This sounds like a tolerable annoyance.

Conclusion(s)

I wanted to avoid coding anything that looks like a cache for as long as possible. Now there is a case where caching is linked to a feature outside the performance scope. Anyway, the cache described above is a “toy” cache. Real caching would take the whole resource graph (source documents, images, stylesheets and so on) into account.

Dynamically-generated images could also make sense for rendering ASCII Math for Web browsers which don’t support SVG.

As often, a bit of additional comfort requires a lot of work.

First Novelang demo in the enterprise world!

Last week I gave an introduction to Novelang at a software publisher that is looking for a collaborative tool for writing its product documentation. The attendees were the internal IT manager, two technical writers and an IT intern. The technical writers were enthusiastic, since Novelang is a huge leap from FrameMaker and Robohelp – that’s what they’re working with for now. They had a look at Scenari and Nuxeo. While both are open-source products, the setup fees seemed hugely overpriced.

(In my own humble opinion, Scenari is overengineered crap with an excessively complex graphical editor. But it has nice slides to explain the “What You See Is What You Mean” concept. I spent no time looking at Nuxeo for now.)

The technical writers really liked that Novelang never asks for more information than the very minimum required (I guess FrameMaker doesn’t do that). I had a chance to show off how Novelang handles whitespaces, non-breakable spaces, zero-width spaces, indentation, line breaks and separators, and how it automatically handles punctuation typography through a customizable stylesheet. After spending countless hours on those obscure cases, it’s great to discover that I’m not the only guy to whom this matters (oh, by the way, both tech writers are girls).

Now they’re evaluating the product but I already got some feedback.

With somebody looking over my shoulder, I realized how ugly Novelang installation looks. I unzipped the Novelang archive, set a pair of system properties, and wrote a .bat file at the root of their working directory, all with my bare hands.

The “local webserver” concept is confusing. For most people, a webserver is a remote host. This leads to some confusion about collaborative features. They were tempted to store the source document on a shared network drive. On the other hand, they’re not used to source control tools like CVS (the one in use in their company).

The lack of a graphical editor looks strange to people who are not accustomed to technical writing.

There is no “guided tour” document nor “cheat sheet” to give an overview of every feature and how to use it best.

I had a deep look at the documents they’re producing for now. The content seems to fit in Novelang syntax. They have some very big tables with one column full of content like lists. For this, a level-based structure seems more appropriate than Novelang’s cell rows.

Having image resolution for PDF hardcoded to 300 dpi won’t work here. At least they need a command-line option until stylesheet metadata gets implemented.

A big requirement is the already-discussed index feature.

To be continued!

2009-07-22

Liberation fonts

Liberation is a superb font set under the GPL+ license, which allows redistribution while not extending the GPL license to the documents produced with the fonts. Definitely a must-have. It's a 3-family set (Serif, Sans, Mono), each with the 4 combinations of bold and italics. At first glance they look more elegant than Times + Arial + Courier.

2009-07-12

Novelang-0.31.0 released!

So sweet! This release brings better control over whitespace suppression. See documentation for details.

Removing unwanted spaces (continued)

This is about some kind of brand new operator: it groups all words, blocks and punctuation signs which are not separated by spaces.

A great feature of Novelang is to apply standard typographic rules, especially when there is punctuation. The problem is, sometimes you can’t apply those rules in a blunt manner.

Consider these cases: on the left, what’s in the source document, and on the right default rendering.

Source document	Default rendering	Hack
`imprimé(e)s`	imprimé (e) s	`imprimé(e)s`
`F.B.I.`	F. B. I. (superfluous spaces)	`F.B.I`
`computer//ing//`	computer ing	No hack available

Default space insertion makes it all wrong. I tried to fix it by detecting proximity (lack of spaces) between casual words and blocks inside grave accents. But when adding other cases like full stops, blocks inside solidus pairs and blocks inside parentheses, we end up with many complex transformations which just break existing whitespace addition for the common case.

The solution is something more generic. I’m thinking about a special character which groups everything that follows until there is a space, a line break or the end of the document. This character would be the tilde ~ because it looks like a kind of elastic ligature.

So, with source document like this:

~computer//ing//

We get an AST (Abstract Syntax Tree) like this:

+ block-after-tilde
  + word "computer"
  + block-inside-pair-of-solidus "ing"

But we still miss the feature of adding zero-width spaces when needed. How to express this? Since zero-width spaces only make sense inside a group with no space, we can reuse the tilde character safely.

This:

~A.L.L.~O.F~'E.M.

… becomes:

+ block-after-tilde
  + subblock
    + word "A"
    + punctuation-sign full-stop
    + word "L"
    + punctuation-sign full-stop
    + word "L"
    + punctuation-sign full-stop
  + subblock
    + word "O"
    + punctuation-sign full-stop
    + word "F"
    + punctuation-sign full-stop
  + subblock
    + apostrophe-wordmate
    + word "E"
    + punctuation-sign full-stop
    + word "M"

And this is enough for the stylesheet to find where to insert zero-width spaces.
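
As a rough illustration of that last step, here is a Python sketch working on the raw source text rather than on the AST. The function name render_tilde_block is hypothetical, and the sketch ignores everything else the stylesheet does (for instance styling a block inside a pair of solidus); it only shows where the zero-width spaces would go.

```python
ZERO_WIDTH_SPACE = "\u200b"


def render_tilde_block(source: str) -> str:
    """Join the subblocks of a block-after-tilde with zero-width spaces."""
    if not source.startswith("~"):
        raise ValueError("a block-after-tilde starts with a tilde")
    # Each further tilde inside the block starts a new subblock.
    subblocks = source[1:].split("~")
    return ZERO_WIDTH_SPACE.join(subblocks)
```

A block with a single leading tilde has only one subblock, so it comes out with no zero-width space at all.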

2009-06-25

Foreign characters

I like seeing all of those strange letters.

2009-06-21

Autotagging

The Tag feature is great. It turns out that I’m mainly using it to tag levels, and too often it leads to duplicate information like this:

  @Graphics
== Graphics

...

What I want is to simply write:

== Graphics

...

And then Novelang should guess that “Graphics” matches the Graphics tag. I call this “Autotagging”.

We need a few simple transformation rules to turn titles into tags.

“foo”	`@foo`
“Foo”	`@Foo`
“Foo bar”	`@Foobar`
“Foo, bar”	`@Foo @bar`
“Foo. Bar”	`@Foo @Bar`
“Foo-bar”	`@Foo-bar`

Automatically-generated tags wouldn’t be part of the tag list, but they would be used for the matching of a known tag with existing level titles.

Depending on the document, this behavior may produce a lot of noise so it requires explicit activation.
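
The transformation rules in the table can be sketched in a few lines of Python. This is only my reading of the six samples, not an implementation that exists anywhere: the autotags name is made up, and the set of punctuation signs to split on is an assumption.

```python
import re


def autotags(title: str) -> list[str]:
    """Hypothetical helper: split on punctuation, drop spaces inside each part."""
    # Assumed set of splitting signs; the hyphen-minus is kept inside tags.
    parts = re.split(r"[,.;:!?]+", title)
    return [re.sub(r"\s+", "", part) for part in parts if part.strip()]
```

So autotags("Foo, bar") yields the two tags Foo and bar, while autotags("Foo bar") yields the single tag Foobar.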

Of course, the default tag list (with checkboxes) has a new option to enable autotagging, with some JavaScript adding an autotag parameter in the URL.

2009-06-19

Novelang-0.30.0 released!

Latest release available here.

This release fixes a few bugs and brings some minor enhancements. See documentation for details.

2009-06-07

Book index

Start small, think big: while I don’t even provide a decent default stylesheet, I’m not afraid to blog about a tough subject: book index.

An index is a table at the end of a book, referencing pages or chapters containing a pertinent usage of a key word.

A very simple kind of index could look like this:

icons, 21, 136, 138

You can have ranges:

gender in language, 4-6

References avoid duplicates:

GUIs, see graphical user interfaces (GUIs)

Words can group on several levels (3 being the maximum):

keys
  capitalizing name of, 68
  typographic conventions for, 144, 147
  writing about, 142

Simplest representation

How could Novelang help to represent this?

The simplest thing to do is to use some kind of delimiter to tell that a word is an index entry. (As usual the meaning of the delimiter would be a matter of stylesheet.)

The {icon} is displayed on the corner.

Obviously, this doesn’t work, because we want the index entry to show a plural. So we invent some new kind of syntactic form representing a tuple (which symbols are used doesn’t matter at this stage).

The { icon | icons } is displayed in the corner.

Great, but how to model several levels of entries? We could make the source document look like this:

In this case we must
{ capitalize | { keys | capitalizing name of } }
the name of a key.

This feels just unreadable!

External index entry declaration

The trick: split index entry declaration from its complete definition, using several files. Complete definition would rely on a subset of Novelang grammar for source files. Example above becomes:

%% source document:

In this case we must
{ capitalize | capitalizing-name-of-key }
the name of a key.


%% index definition file:

capitalizing-name-of-key
- capitalizing name of
  - keys

This deserves a few explanations. The first capitalize in the source document is still what’s displayed. The capitalizing-name-of-key is the entry name, which is not supposed to be read by anybody other than the document writer – could be 123456 as well. The capitalizing name of in the index definition file is what to display in the index. The keys subitem is the parent item.

Because the name of the index entry has no semantic meaning, we can let Novelang generate it using simple replacement rules (spaces becoming hyphen-minus…). Explicit names are useful for special cases (like homonyms) but now we expect to be able to write:

%% source document:

In this case we must {capitalize} the name
of a key.


%% index definition file:

capitalize
- keys
  - capitalizing name of

Because the same index entry may match other pertinent words which are not exactly “capitalize”, we let the index entry support several names.

%% source document place 1:

In this case we must {capitalize} the name
of a key.


%% source document place 2:

Sometimes the name of a key should not be
{in capitals}.


%% index definition file:

in-capitals
capitalize
- keys
  - capitalizing name of

The index definition file could support lots of features.

— Some styling. There are rare cases (like Latin names) where italics are required.

— A kind of markup to tell which words to take into account in alphabetical sort.

— Multiple posting: the same keyword in the source document has several index entries. This can happen using several embedded list items.

— “See” and “See Also” references.

Reusing tags and identifiers

Now, entry names may give some feeling of déjà vu. We already have machine-processed names with tags (implemented) and identifiers (not implemented yet). Are entry names just redundant? As tags and identifiers apply to a whole paragraph or level, they don’t have the same level of precision when it comes to referring to the exact location of a word. And it turns out that referencing a range of paragraphs (or even chapters) is what we need for supporting page ranges.

The obvious way of handling page ranges would be to add a new “end of index entry range” marker, polluting the source document and making it look like LaTeX. But this is a convention of the Novelang grammar: avoid end delimiters whenever possible. Tags and identifiers provide a nice extension.

Here is how we use a tag in the index definition file. Instead of tagging, it refers to the tagged text as index entry name:

%% source document:

@gender
== Stylistic principles

=== Avoid jargon

 ...

=== Avoid sexist language

 ...

=== Common gender

 ...


%% index definition file:

@gender
- gender in language

This is not yet perfect because page ranges are bound to the scope of a tag/identifier. But some tree-mangling could detect that several tagged nodes appear consecutively:

%% source document:

== Stylistic principles

=== Avoid jargon

 ...

@gender
=== Avoid sexist language

 ...


@gender
=== Common gender

 ...

Tree-mangling could add a special marker as the last child of the last node of the consecutive tagged ones. Our tree would look like:

level
  + level-title     "Stylistic principles"
  + level
  |   + level-title "Avoid jargon"
  + level
  |   + level-title "Avoid sexist language"
  |   + tag         "gender"
  |   + paragraph
  |   |   + start-of-range "gender"
  |   |   + ...
  |   + paragraph
  + level
      + level-title "Common gender"
      + tag         "gender"
      + paragraph
      + paragraph
         + ...
         + end-of-range    "gender"
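
To see what this tree-mangling has to compute, here is a small Python sketch. It flattens the problem to a list of sibling levels, each with its set of tags, and returns where each run of consecutive levels sharing a tag starts and ends. The tag_ranges function and its data layout are assumptions made for the illustration; the real thing would work on the tree and insert the start-of-range and end-of-range nodes.

```python
def tag_ranges(levels: list[tuple[str, set[str]]]) -> list[tuple[str, int, int]]:
    """Return (tag, first index, last index) for each run of consecutive
    levels sharing a tag. Each level is a (title, tags) pair."""
    ranges = []
    open_runs: dict[str, int] = {}  # tag -> index where its current run started
    for index, (_title, tags) in enumerate(levels):
        # Close the runs whose tag is absent from this level.
        for tag in sorted(set(open_runs) - tags):
            ranges.append((tag, open_runs.pop(tag), index - 1))
        # Open a run for each tag that is not already running.
        for tag in tags:
            open_runs.setdefault(tag, index)
    # Close whatever is still open at the end of the document.
    last = len(levels) - 1
    for tag in sorted(open_runs):
        ranges.append((tag, open_runs[tag], last))
    return ranges
```

With the three sublevels of the example, the only range found is gender, from “Avoid sexist language” to “Common gender”.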

Foreseen weaknesses

This is still imperfect because page ranges are bound to the scope of tags/identifiers but this looks like a pretty good approximation.

Another possible drawback is the possibility of overlapping ranges or page numbers with several different tags/identifiers, or index entries in the middle. Because FOP provides no hook to deal with page numbers once they are known, Novelang could not fix a series of page numbers like 14-17, 15-18, 15, 16. It’s a long way to perfection.

Bibliography

All examples from A Style Guide for the Computer Industry, Sun Technical Publications, 1996.

2009-06-03

Novelang-0.29.0 released!

Latest release available here.

It brings various enhancements to whitespace handling, and the pretty color palette for tags.

2009-05-30

Image processing

In my endless quest for nice libraries to integrate into Novelang, I’ve been wondering about image processing. This makes sense for technical documentation with screen captures; often it is useful to do some rescale or fade. Compression is useful, too, but it should probably be left to the rendering stage.

When updating the captured image, you need to process the image again with your Gimp or whatever. This should be done automatically! I’m looking for a scripting language to do clever things. I’ve no idea on how to integrate it to Novelang – maybe some special files to avoid messing macro-instructions with content.

The language itself could be something like this:

  {
    rescale( 40% )
    fade( SOUTHWEST, 3px )
  }
  ./my-image.png

I want something clever with an explicit representation of pipeline processing. And, yes, it should be all in Java and with a GPL-compatible license. Am I asking too much here?

I’ve found an amazing piece of software: ImageJ, a public domain tool for image processing with a huge amount of macros and plugins. It seems widely used in science. Bad news: image transparency doesn't look like a great concern. The Alpha Channel plugin is the best I’ve found so far, with its rough edges.

NetKernel has a pipeline image processing feature that looks like what I want. But I don’t like their everything-is-a-String approach.

Maybe I shouldn’t be so ambitious, and just start hacking “the smallest thing that could possibly work” for solving my own problem instead of looking for a save-the-world solution.

By the way, this is an interactive rendering effect editor based on BeanShell that may ease some pain while hacking image filters.

2009-05-28

Space character and related stuff

Blocks of literal

Most of the time, the text inside blocks of literal should be kept in one piece. A blatant example is a numeric value and its unit.

— Compact several spaces into one for the same reason as above.

— Trim leading and trailing spaces. Otherwise they offer a suspicious means to override text layout.

— Replace spaces by no-break spaces.

With the low line character _ figuring the no-break space we’d like to obtain such transformation:

` 20   m  ` -> `20_m`

This means a long block of literal with several spaces (transformed into no-break spaces) could become very cumbersome and mess up the layout. So we need a hint to allow line breaks at some places. This can be done by splitting the big block of literal into several small ones, which are not separated by spaces.

With the vertical bar character | figuring the zero-width space we have such transformation:

   `Y.O.U.``A.R.E.``B.E.A.U.T.I.F.U.L` 
-> `Y.O.U.|A.R.E.|B.E.A.U.T.I.F.U.L`

See more about the zero-width space here. A quick test shows that FOP supports it.

Implementation will be done at tree-mangling level. A whitespace between two consecutive blocks of literal will be replaced by a special node meaning that a break is allowed here. The special node will be replaced by a zero-width space at rendering time.
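
Both transformations fit in a few lines of Python, working on plain strings instead of tree nodes. The function names are hypothetical, and the real no-break space (U+00A0) and zero-width space (U+200B) replace the _ and | stand-ins used in the samples above.

```python
import re

NO_BREAK_SPACE = "\u00a0"
ZERO_WIDTH_SPACE = "\u200b"


def normalize_literal(text: str) -> str:
    """Trim, compact runs of spaces, then make the remaining spaces unbreakable."""
    compacted = re.sub(r" +", " ", text.strip(" "))
    return compacted.replace(" ", NO_BREAK_SPACE)


def join_adjacent_literals(blocks: list[str]) -> str:
    """Adjacent blocks of literal allow a line break at each junction."""
    return ZERO_WIDTH_SPACE.join(normalize_literal(b) for b in blocks)
```

Here normalize_literal(" 20   m  ") keeps 20 and m together with a single no-break space, which is the first sample.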

Apostrophe

This technique could be useful to keep the apostrophe character stuck to a word when in last position. For now, Novelang does not take care of the whitespace after or before the apostrophe.

he's here     -> he’s here
houses' roofs -> houses’roofs
during '60    -> during’60

This is because whitespaces are used as separators, but don’t carry “real” information (except in a few cases, like indentation for embedded lists). Before discarding WHITESPACE nodes, the ones immediately preceding or following an apostrophe could become an EXPLICIT_WHITESPACE to be rendered as, yes, a space character.

2009-05-27

Pretty color palette for tags

Default representation of tags attempts to help locating them at a glance, with nice colors. “Nice colors” means a lot of care.

Defining a color palette from scratch is tricky. Colors must be dinstiguishable one from the other. They must spread evenly on the visible spectrum; but this is not easy because the visual effect depends on the display. For this reason, I use the 140 colors of the SVG specification (the same are used in the CSS spec). Much of hard work is done here, including finding pretty names.

But that’s not all. Because the small rectangle of the tag has text, too, there must be a foreground color. First I tried to compute it, using a simple algorithm (increasing Red, Green and Blue of 50% each and applying modulus 255). The text was always barely readable. Not really good.

Another problem was the choice of the color for each tag. I’ve chosen to pick the color of each tag in a predefined list. When all the colors have been set, we start from the start again. This round-robin algorithm for chosing colors is ok, but inside the 140 colors, many look quite the same. Colors like mistyrose and lavenderblush are very close, and if we have only 10 tags, it’s a pity to see two tags looking the same. So it makes sense to edit the color list in order to make the first one look very different. In addition, because those first colors will be picked up the most often, they must be in the same tone (mild saturation).
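
The round-robin choice itself is tiny. Here is a Python sketch, where assign_colors is a made-up name and the palette passed in stands for the hand-ordered list of color names:

```python
from itertools import cycle


def assign_colors(tags: list[str], palette: list[str]) -> dict[str, str]:
    """Give each tag the next palette color, starting over when the palette runs out."""
    # zip stops with the last tag; cycle repeats the palette endlessly.
    return dict(zip(tags, cycle(palette)))
```

With a three-color palette, the fourth tag gets the first color again, which is why the order of the first colors matters so much.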

If there are more than 10 or 20 tags, similar colors will eventually be unavoidable. But, since we display text (and a thin border) there is a foreground color to choose. This gives (140 × 139) 19460 possibilities! Of course background and foreground cannot be the same (hence the 139) and many possibilities are unreadable. But, given a color like white, those colors look quite similar: mintcream, honeydew, ghostwhite, floralwhite, seashell, azure, linen, aliceblue, cornsilk, oldlace, ivory, snow, whitesmoke. Wow!

Maybe there is a clever algorithm to detect which foreground colors give best contrast and distinguishability, but I didn’t find it. It seems much more convenient to let a human do the job.
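
For the record, one candidate for such an algorithm is the contrast ratio defined by the W3C accessibility guidelines (WCAG 2). This is not what Novelang does, only a sketch in Python of how background and foreground pairs could be scored; the function names are made up, and the RGB values are those of the standard named colors.

```python
def relative_luminance(rgb):
    """WCAG 2 relative luminance of an (R, G, B) color with 0-255 channels."""
    def linear(channel):
        c = channel / 255.0
        return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4

    r, g, b = (linear(c) for c in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def contrast_ratio(color_a, color_b):
    """WCAG 2 contrast ratio, from 1 (identical) to 21 (black on white)."""
    la, lb = relative_luminance(color_a), relative_luminance(color_b)
    lighter, darker = max(la, lb), min(la, lb)
    return (lighter + 0.05) / (darker + 0.05)


def best_foreground(background, candidates):
    """Name of the candidate color with the highest contrast on the background."""
    return max(candidates, key=lambda name: contrast_ratio(background, candidates[name]))


MISTYROSE = (255, 228, 225)
LAVENDERBLUSH = (255, 240, 245)
CANDIDATES = {"white": (255, 255, 255), "black": (0, 0, 0), "navy": (0, 0, 128)}

print(round(contrast_ratio(MISTYROSE, LAVENDERBLUSH), 2))  # close to 1: hard to tell apart
print(best_foreground(MISTYROSE, CANDIDATES))
```

A ratio like this measures readability, not prettiness, so a human eye is still needed for the final choice.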

Since editing some lines of code would require switching back and forth between the code editor and the web browser, I wrote a palette editor based on an HTML page. It looks like this:

It’s easy to change the order of appearance of a color with a drag and drop:

And, after clicking on one color, you set the foreground with an alt-click on the desired color.

Don’t forget to save using the Save button (File > Save in the Web browser’s menu won’t work). Yes, the color palette editor only runs on Firefox for now.

The color palette is located in:

src/main-resources/style/javascript/colors.htm

This new feature (and the beautiful color palette) will be available in the next release of Novelang (0.29.0).

2009-05-24

Novelang-0.28.0 released!

Latest release available here.

Now tags are handled as query parameters. This is much faster on big documents, and it works for every kind of document.

See documentation for details, and the list of other enhancements.

2009-05-17

Missing closing delimiters

Until now, a block with a missing closing delimiter was properly detected as an error, but the error message was ugly. See, for this:

Something -- missing

You got:

line 0:-1 mismatched input '' expecting HYPHEN_MINUS

Not a great deal here, but pretty annoying in a 1000-line long source document.

After a close look, determining where the error comes from turned out to be quite complex. Consider this case:

There " is ( something " missing

… The problem is obviously with the unclosed parenthesis. It’s easy to see (for a human) because parentheses are paired delimiters: there is an opening one and a closing one. The double quote " is a single delimiter, in the sense that it may be used for both opening and closing a block, depending on the context. In the example above, the Novelang parser started evaluating a parenthesized block, and the double quote looked like an unclosed block. How to handle this correctly?

— In order to avoid grammar bloat, the grammar emits some kind of events, telling it started parsing a block with such or such delimiter. The position of every token for a start delimiter is kept. If something goes wrong, the error message(s) will report the position of the unclosed delimiter.

— Event consistency check is scoped: if an unclosed delimiter is detected inside a paragraph, this should have no influence on the way unclosed delimiters are handled inside another paragraph.

— When trying to figure out which opening delimiter has no closing counterpart, the trick is to look at paired delimiters first. If something went wrong with paired delimiters, just report the errors about them. Otherwise, report errors with single delimiters.
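
The three points above can be sketched in a few lines of Python. This is a toy illustration working on the raw text of one paragraph, not the event mechanism of the real grammar. The function name is made up, positions use 1-based lines and 0-based columns (an assumption), and the fact that -- may also be closed by -_ is ignored.

```python
import re

PAIRED = {"(": ")", "[": "]"}      # distinct opening and closing delimiters
SINGLE = ('"', "--", "//")         # same token opens and closes a block
DELIMITER = re.compile(r'--|//|[()\[\]"]')


def unclosed_delimiters(paragraph, first_line=1):
    """Return (delimiter, line, column) for unclosed delimiters of one paragraph."""
    tokens = []
    for offset, text in enumerate(paragraph.splitlines()):
        for match in DELIMITER.finditer(text):
            tokens.append((match.group(), first_line + offset, match.start()))

    # Step 1: paired delimiters only. Unmatched closing delimiters are ignored here.
    stack = []
    for token in tokens:
        if token[0] in PAIRED:
            stack.append(token)
        elif token[0] in PAIRED.values():
            if stack and PAIRED[stack[-1][0]] == token[0]:
                stack.pop()
    if stack:
        return stack  # report these and say nothing about single delimiters

    # Step 2: single delimiters; each occurrence toggles between open and closed.
    still_open = {}
    for token in tokens:
        if token[0] in SINGLE:
            if token[0] in still_open:
                del still_open[token[0]]
            else:
                still_open[token[0]] = token
    return sorted(still_open.values(), key=lambda t: (t[1], t[2]))


print(unclosed_delimiters('There " is ( something " missing'))
```

Calling the function once per paragraph gives the scoping for free: a delimiter left open in one paragraph cannot disturb the next one.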

I just checked this new feature into Github and the results are pretty good. Given a source document like this (line numbers added for clarity):

1   ( s
2  t -- u
3  v )
4  
5  // w
6  x [ y
7  z //

Instead of a bunch of nonsense, Novelang now returns the following problems:

2:2: Missing delimiter. For '--' there should be a matching '--' or '-_'
7:4: no viable alternative at input '\n'
6:2: Missing delimiter. For '[' there should be a matching ']'

This will be available in the next version (0.28.0). Stay informed by reading this blog!

2009-05-10

Novelang-0.27.1 released!

Just a fix after I messed up the MIME type for rendered documents. As usual, available here.

Novelang-0.27.0 released!

Latest release available here.

Novelang-0.27.0 enhances the tag feature, with the standard HTML stylesheet displaying the list of user-defined tags. In a source document, tags are words preceded by an at sign @. Levels, paragraphs, paragraphs inside angled bracket pairs (aka blockquotes) and cell rows (aka tables) may be tagged.

  @javascript @performance
By now this feature all relies on Javascript 
running inside the Web browser.

HTML generated using default stylesheet renders tags like this, with a nice color set making tags distinguishable at a glance:

It is now possible to hide all the text which is not tagged, by selecting tags in a list which appears in the top-right corner of the HTML document, with a fixed position that keeps it always visible and a disclosure box which hides the list by default:

If a level or a set of paragraphs inside angled bracket pairs has at least one of the requested tags, it is displayed with all of its content. If a paragraph has at least one of the requested tags, it is displayed, along with all its parents (levels or set of paragraphs).

For now this feature relies entirely on Javascript running inside the Web browser. This doesn’t scale to big documents (with lots of paragraphs and levels). For one big document with HTML generation taking about 13 s, selecting one tag takes more than 70 s and triggers several “slow script” warnings.

A more suitable approach would be to trim the AST (Abstract Syntax Tree) server-side. This requires passing parameters to the query. Because this happens in pre-rendering processing, tag-based filtering would work for any format other than HTML for free.

There would be less to do in Javascript; it would just update the tag list in order to reflect document’s state.
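
To make the server-side idea concrete, here is a small Python sketch of trimming a tree according to the display rules given above. The Node class and the trim function are invented for the example and say nothing about Novelang's real AST classes.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical tree node: a level, a blockquote or a paragraph."""
    name: str
    tags: frozenset = frozenset()
    children: list = field(default_factory=list)


def trim(node, requested):
    """Return a trimmed copy of the tree, or None if nothing must be shown.

    A node bearing at least one requested tag is kept with all of its content.
    Otherwise it is kept only as the parent of children which are kept.
    """
    if node.tags & requested:
        return node
    kept = [t for t in (trim(child, requested) for child in node.children) if t is not None]
    if not kept:
        return None
    return Node(node.name, node.tags, kept)


document = Node("part", children=[
    Node("level-1", frozenset({"performance"}), [Node("p1"), Node("p2")]),
    Node("level-2", children=[Node("p3", frozenset({"javascript"})), Node("p4")]),
    Node("level-3", children=[Node("p5")]),
])

trimmed = trim(document, {"javascript", "performance"})
print([child.name for child in trimmed.children])
```

Since the trimmed tree is what every renderer would receive, nothing in this step is specific to HTML.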

2009-04-25

Novelang-0.26.0 released!

Latest release available here.

Outstanding new feature: tags. Now you can tag some pieces of the source documents with arbitrary labels:

  @my-tag  @foo
This is a tagged paragraph.

Novelang's default HTML stylesheet generates pretty colorful tags in the margin, and a tag summary at the end. It's intended to be for humans but stylesheets may take advantage of tags, too.

See documentation for details.

2009-04-14

Novelang-0.25.0 released!

Latest release available here.

This version brings a relaxed syntax for associating names with URLs:

Go to the "website"
http://novelang.sourceforge.net

See documentation for details.

2009-04-12

Novelang-0.24.0 released!

Latest release of Novelang available here. This version brings embedded lists. See documentation for details.

2009-04-09

Named URL: try again!

With Novelang-0.23.0 comes a new feature: named URL. The purpose is to associate some text to a URL, in order to let the stylesheet display something nicer than the URL itself. With source document like this:

This is a 
  "url"
http://url.net/my-very-long-path
.

… we get:

This is a url.

The rationale of this syntax is consistency with decorations. Decorations are source metadata conveniently placed before the decorated source, each appearing on its own line and with some indentation for visual comfort. It seemed a good idea to follow the same scheme.

But the syntax described above has many drawbacks.

First, it’s very space-consuming. The scarce resource is the vertical space inside the text editor, because you cannot stretch the display device in height (and long horizontal lines are hard to read). So text like this seems to waste space:

This is
  "url one"
http://url.net/1
and here is 
  "url two"
http://url.net/2
.

URLs must appear on their very own line, for already discussed reasons, so the full stop character at the end must stay as it is. The problem is with the blocks inside double quotes: they are supposed to remain short, so reserving a whole line for each of them is obviously a waste. In the end, I’d like to write text like this:

This is "url one"
http://url.net/1
and here is "url two"
http://url.net/2
.

One question then arises: how to distinguish a block associated with a URL from one which is not? After all you may need to display some text in double quotes right before some URL. First I thought about a new “attach” operator which would tell explicitly that some block inside double quotes is related to the following URL:

Stupid: "url" ~
http://url.net

This is not a good idea because experience shows that, 99 % of the time, the block is related to the following URL. So it makes no sense to make the most common case a special thing which breaks the consistency of the grammar. And what if the text of the URL should appear inside double quotes? Is there some new clever escape mechanism to invent?

The corresponding Abstract Syntax Tree is also broken, in the sense that n:external-link and n:link-title nodes do carry semantic meaning, while I claim everywhere that such meaning is confusing when the stylesheet defines an alternate meaning. The n:external-link was a clever idea to wrap the URL and the title in one single element, but I should find something else.

The solution

When a block inside double quotes, or a block inside square brackets, is located right before a URL, it becomes a child of the URL. Consider such a source document:

This is a ["url"]
http://url.net
.

… we get something like this (consistent with stylesheet’s rendering of block inside double quotes):

This is a “url”.

In the rare case where a block inside double quotes must appear verbatim, we “break” the proximity with some “invisible” character which is an empty block of literal inside grave accents. That’s a little weird but it’s not a problem as it should remain the exception:

That's a "url" ``
http://url.net

… so it renders like this:

That’s a “url” http://url.net.

Finally, the n:external-link disappears in favor of n:url. The n:link-title becomes a n:block-inside-double-quotes or n:block-inside-square-brackets. The text of the URL gets wrapped inside a n:url-literal and that’s all.

n:url
  + n:url-literal
  + n:block-inside-double-quotes
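
A rough Python sketch of that rewrite, working on a flat list of (node name, content) pairs. It assumes whitespace and line breaks are already discarded; the function name and the name used for the empty literal are made up, while the other node names follow the tree above.

```python
# Blocks that become children of a URL when they immediately precede it.
TITLE_BLOCKS = {"block-inside-double-quotes", "block-inside-square-brackets"}


def attach_titles(nodes):
    """Wrap each url-literal in a url node, adopting a block located right before it."""
    result = []
    for node in nodes:
        if node[0] == "url-literal":
            children = [node]
            if result and result[-1][0] in TITLE_BLOCKS:
                children.append(result.pop())
            result.append(("url", children))
        else:
            result.append(node)
    return result


named = [
    ("word", "That's"), ("word", "a"),
    ("block-inside-double-quotes", "url"),
    ("url-literal", "http://url.net"),
]
print(attach_titles(named)[-1])

# An empty literal between the block and the URL breaks the proximity.
# The node name "block-of-literal" is hypothetical.
verbatim = named[:3] + [("block-of-literal", "")] + named[3:]
print(attach_titles(verbatim)[-1])
```

In the second case the URL keeps only its literal, and the block inside double quotes stays where it was.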

2009-03-22

Novelang-0.23.0 released!

Latest release of Novelang available here. This version brings named URL (called "external-link"). See documentation for details.

2009-03-20

Wiki Creole grammar

Finally I’ve found a Wiki grammar for ANTLR: http://www.riehle.org/2008/01/09/an-ebnf-grammar-for-wiki-creole-10

That’s interesting to compare with Novelang’s one.

— Paragraphs are the same central thing.

— Rule names embed rendering-oriented meaning (like text_boldcontent).

— Some rules embed their own terminator (a list may end with an end_of_list).

— Some predicates look like hand-coded lookahead:

{ input.LA(1) != STAR || (input.LA(1) == STAR && input.LA(2) == STAR) }?

This grammar is just a skeleton; it doesn’t produce an AST or anything else.

2009-03-05

Novelang-0.22.0 released!

Latest release of Novelang available here. This version brings various fixes related to images. See documentation for details.

2009-03-03

SVG support

I knew it, I knew it… SVG support cannot be seamless, at least not at the first try. First I thought that Safari beta 4 was not SVG-enabled but it appears that it displays whole SVG files (with .svg extension in the URL). And, when embedded in an <object> tag, SVG may work, too, as shown here:

http://labs.silverorange.com/archive/2006/january/howtoinclude

So why are Novelang-generated pages not displaying embedded SVG? I noticed that every time I refreshed the page, Safari downloaded the .svg file. Could that be something about the MIME type? For the page above, Safari’s Web Inspector tells that Content-Type is text/xml. For the Novelang-generated page, there is no Content-Type. I should fix that first.

Camino (same rendering engine as Firefox) understands SVG well, but adds some ugly scrollbars. They disappear when the size is set manually (maybe adding a few pixels). Adding image size will require some SVG parsing, the same way raster images are loaded by the Part to get their true size.

Aside from this, the article on silverorange.com gives a nice trick for replacing SVG by a raster image: just embed the reference to the raster image inside the <object> element. So when Novelang receives a request for a nonexistent .png file, it could try to locate the same file but with a .svg extension and return the rasterized image. As a consequence, the batch mode should anticipate this and produce .png files for all .svg ones.
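
The lookup part of this fallback could look like the following Python sketch. It only decides which file to serve. The rasterization itself is left out, and the function name and its return convention are assumptions made for the example.

```python
from pathlib import Path


def resolve_image(root, requested):
    """Locate the file to serve for a requested image path.

    Returns (path, needs_rasterizing). When a .png is requested but missing,
    fall back to a .svg file with the same name, to be rasterized by the caller.
    """
    path = Path(root) / requested
    if path.is_file():
        return path, False
    if path.suffix.lower() == ".png":
        svg = path.with_suffix(".svg")
        if svg.is_file():
            return svg, True
    return None, False
```

The batch mode could walk the project with the same function and write a .png next to every .svg for which needs_rasterizing is true.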

Yes, the Web server is something to be invented again and again.

2009-03-02

Novelang-0.21.0 released!

Latest release of Novelang available here.

This version introduces support for raster and vector images. The source document references them through a path which can be either relative (from the document itself) or absolute (from the project root). And, yes, images may show inside table cells! Otherwise they must appear outside of paragraphs. Supported formats are .jpg, .png, .gif, .svg. SVG may or may not be rendered in HTML pages, depending on Web browser capability.

See documentation for details.

2009-02-27

Images

I just checked a first working version of images into github. It’s far from complete, but with such source document:

image:foo/bar.jpg

… the stylesheet receives a n:image element containing foo/bar.jpg. How beautiful. It is supposed to work the way URLs do and offer a uniform syntax. In addition, it’s more or less used elsewhere, see http://www.wikicreole.org/wiki/ImagesReasoning.

But it’s crap. It’s wrong, it’s inconsistent, it will not be comfortable.

URLs exist on their own line because this makes copy-paste easy. There is no operating system supporting image: as a protocol. And URLs are links to another resource, while image:... represents the resource itself.

Other wikis need the image: prefix because they accept a / in the middle of the content. Novelang requires the solidus to appear as literal. Therefore something like screenshots/preferences.png cannot be confused with three words with punctuation signs or symbols in between.

Current image support takes paths relative to the project root, but a path relative to current Part file is comfortable in some cases. If the image is in the same directory, we need to make the solidus character appear so we’ll have ./preferences.png (instead of preferences.png which could be two dot-separated words).

The extension is enough to make the difference with other resources. Almost everybody knows that .jpg, .png, .gif, .svg are for images, and it leaves room for other stuff like .csv.

In a previous post about tables, I stated that image declaration would be too long to fit in a cell, but with relative path and no image: prefix this has to be revised.

Images are definitely not the same thing as URL, but the decorations should work the same way, with an identifier and a name.

Then, the identifier could replace the image, or provide some kind of reference. I’ll tell more about identifiers another day but here’s a complete example using two different source files.

First file: we declare the image, decorating it with metadata.

  \dog-with-bone
  "My dog with its bone"
/photos/dog.jpg

Second file: using images declared above.

See a picture of my dog:

\dog-with-bone

... later in the text, we want some reference to the 
picture to appear, like its name, some hyperlink or 
a figure number (depending on the stylesheet). 

You have already seen my dog in -\dog-with-bone .

Finally, it seems we’ve the best of every world with a compact notation.

2009-02-21

Embedded maths

Browsing the Web, I discovered these little gems: JEuclid, Open Office Math and ASCIIMathML.

JEuclid is a renderer for MathML. MathML is an XML-based representation of mathematical formulæ. It has a FOP plugin, which transforms an embedded MathML expression into a nice formula in the resulting PDF.

http://jeuclid.sourceforge.net

JEuclid claims it supports the .mml files exported by Open Office Math.

MathML is horribly verbose and not intended to be used by humans, but rather to help programs to interoperate.

OpenOffice Math is a math formula editor, bundled with Open Office. It’s partially WYSIWYG as it lets you type a formula in plain text and produces a preview in (quasi) real time. Open Office Math favors its .odf format but is able to save and load formulæ in MathML. Its text-based editor supports formulæ like this:

f(x)=sum from{n=0} to{infinity} 
  { { f^{(n)}(a) } over {n!} (x-a)^n }

You can see OpenOffice Math in action here: http://en.wikipedia.org/wiki/OpenOffice.org_Math

Used together, JEuclid and OpenOffice Math could make Novelang more attractive to TeX users, who have always been unbeatable when it comes to crafting beautiful graphics from text-based formulæ.

Novelang could learn to recognize a reference to a MathML file to be edited with Open Office:

When ``a > b`` we always have
math:my-formula.mml
bla bla blah.

In a perfect world, Novelang would support formulæ inside the source document (with a tweak to make a n:lines-of-literal appear inside a paragraph).

When ``a > b`` we always have
<<<math
f(x)= ...
>>>
blah blah blah.

This requires a translator from text-based formula to MathML. Such a translator is hard to find, especially with the OSS constraint.

Maybe I’ve found this rare beast, with ASCIIMathML. It’s a Javascript-based translator designed to run inside a Web browser. The interactive demo is stunning!

http://mathcs.chapman.edu/~jipsen/mathml/asciimath.html

It recognizes TeX (same formula as above):

$f(x)=\sum_{n=0}^\infty\frac{f^{(n)}(a)}{n!}(x-a)^n$

… or its custom ASCII-based format. Here again, the same formula:

`f(x)=sum_(n=0)^oo(f^((n))(a))/(n!)(x-a)^n`

ASCIIMathML is released under the LGPL. Great. I just wonder if it works inside a Java-powered Javascript interpreter.

Font survey

Gentium

According to its website, Gentium is a typeface family designed to enable the diverse ethnic groups around the world who use the Latin and Greek scripts to produce readable, high-quality publications.

http://scripts.sil.org/cms/scripts/page.php?site_id=nrsi&item_id=Gentium

The license is quite liberal. It avoids the “viral” effect of GPL: embedding the Gentium font in a PDF you redistribute in digital form won’t make the GPL apply to the PDF.

http://scripts.sil.org/cms/scripts/page.php?site_id=nrsi&id=OFL

Bad news: the OFL license doesn’t seem compatible with the GPL v3 (documentation is unclear). So you can use Gentium fonts but you’ll have to download it with your bare hands, as it cannot fit in a Novelang distribution, which is released under GPL v3.

Junicode

Good news: the beautiful Junicode font will become part of Novelang. Junicode (short for Junius-Unicode) is a Unicode font for medievalists. It embeds various amazing characters like old English and runes. But as a great classical serif font, it’s a nice replacement for boring Times New Roman.

http://junicode.sourceforge.net

But the Junicode license, which is GPL v2, has no exception about embedding, so if you embed it in a PDF (which Novelang always does) your PDF will be redistributable under the GPL license.

Linux Libertine

Better news: the Linux Libertine font, bundled in Novelang for a long time, is dual-licensed, under both GPL and OFL, so you can use it at no legal risk.

http://linuxlibertine.sourceforge.net/Libertine-EN.html

2009-02-20

Novelang-0.20.0 released!

Latest release of Novelang available here.

This version brings tables!

While source documents don't allow embedding any formatting information, it's possible to have very fine control at stylesheet level, with all FO power! Novelang documentation does this for its PDF, by overriding n:cell... elements when it detects a special style (purposefully named character-escapes here). See pdf.xsl. The style is set from the novelang.nlb book file, which sets the structure of the whole Book.

See release notes for details.

Enjoy!

2009-02-19

Tables

Other wikis make tables a daunting subject, often because they try to embed crazy formatting instructions. You get a taste here: http://www.wikicreole.org/wiki/ListOfTableMarkups. First I decided that tables wouldn’t be a concern for a long, long time because I felt I could live without them. But everybody needs to create tables, and simulating them with literal in a fixed-width font is not great. I reconsidered my position upon an external request that made me ask myself: after all, would it be so hard to implement the simplest, smallest, least controversial feature set to let people define tables?

Obviously this would start around the well-adopted vertical line separator. To keep things clear, no line break is allowed inside a table definition. Vertical lines need not be vertically aligned. So it would look like:

| row1, col1 | row1, col2 | row1, col3 | 
| row2, col1 |  row2, col2   | row2, col3 |

No headers, no justification, no span, no calculated fields, no neverending list of conflicting features. All complicated stuff must be done in the stylesheet or never. Then it becomes easy to support tables in Novelang grammar, and easy to write correct source documents using them.

There is still one potential conflict, however, with URLs (or images, which will use the same grammatical approach). In a Novelang source document, the URL must appear at the start of a line. But how to make it appear inside a list of row cells while keeping the grammar readable? The answer is: no URL inside a table. A URL is a lengthy thing which doesn’t fit in a table cell, so there is another syntax to find. Meanwhile, users are free to handle particular cases with a context-specific hack in their stylesheet.

Because users are encouraged to write stylesheets on their own, the XML structure given in input has to be very consistent. Of course the structure will look like an HTML table (table containing rows which contain divisions). By avoiding semantic notations (e.g. drop n:emphasis and prefer n:block-inside-solidus-pairs) the XML offers a more structural view, letting the stylesheet give a meaning on its own.

There is the temptation to call a table a “table” but it would be lying. See this:

| item1 | item2 | item3 |

The first vertical line indicates the start of the first item, the last vertical line indicates the end of the list, and items are separated by vertical lines. The table row is just a picture in your mind. In fact, it’s no more than a list. The whole table is a list of lists. The default stylesheet will show it as a table, maybe with fancy headers, but it’s just one manner to show lists of lists.

So which name could tell it’s a list of lists, while not completely hiding the fact it’s a table?

The word “cell” is a good start. Tables have cells, but we’re living in a universe where cells exist outside of tables. So our items get wrapped in n:cell elements.

For a row, n:cell-list is an obvious choice but it doesn’t tell about the “horizontality”. So we’ll prefer n:cell-row which is compatible with the table vocabulary, while staying close to the representation in the source document. Other XML elements tell about their delimiting character, like n:list-with-triple-hyphen so it’s tempting to consider n:cell-row-with-vertical-lines-between-cells for consistency.

The solution is to tell about the delimiter only in the element enclosing all the rows. (This is what embedded lists will intend to do, with a n:embedded-list-with-hyphen wrapping items with a more generic name.) Something like n:cell-rows-delimited-by-vertical-lines is a bit long and we don’t need to embed the whole specification in the name. n:cell-rows-with-vertical-lines is shorter and evocative enough.

To summarize, the XML representation of a table which is in fact a list of lists would look like this:

<n:cell-rows-with-vertical-lines>
  <n:cell-row>
    <n:cell>item1</n:cell>
    <n:cell>item2</n:cell>
    <n:cell>item3</n:cell>
  </n:cell-row>
</n:cell-rows-with-vertical-lines>
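
To show how little is needed, here is a Python sketch which turns the source notation into this XML. It is an illustration, not Novelang's grammar: both function names are made up, a cell is plain text only, and the XML is built as a string so that no namespace URI has to be invented for the n: prefix.

```python
from xml.sax.saxutils import escape


def parse_cell_rows(source):
    """Parse lines like '| a | b |' into a list of rows, each a list of cell texts."""
    rows = []
    for line in source.splitlines():
        line = line.strip()
        if not line:
            continue
        if len(line) < 2 or not (line.startswith("|") and line.endswith("|")):
            raise ValueError(f"Not a row of cells: {line!r}")
        # Drop the first and last vertical lines; the others separate the cells.
        rows.append([cell.strip() for cell in line[1:-1].split("|")])
    return rows


def to_xml(rows):
    """Render the list of lists with the element names chosen above."""
    out = ["<n:cell-rows-with-vertical-lines>"]
    for row in rows:
        out.append("  <n:cell-row>")
        for cell in row:
            out.append(f"    <n:cell>{escape(cell)}</n:cell>")
        out.append("  </n:cell-row>")
    out.append("</n:cell-rows-with-vertical-lines>")
    return "\n".join(out)


print(to_xml(parse_cell_rows("| item1 | item2 | item3 |")))
```

The parser never needs to know it deals with a table: it only sees a list of lists, and unaligned vertical lines cost nothing.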

Novelang-0.19.0 released!

Latest release of Novelang available here. This version enhances custom charset support, and brings a convenient interface to the batch document generator. See documentation for details. Enjoy!

2009-02-18

Batch charset transcoding

Here is a Bash script transcoding all .nlp files from ISO-8859-1 to UTF-8, including those in subdirectories.

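#!/bin/bash
# Transcode every .nlp file under the current directory
# from ISO-8859-1 to UTF-8, in place.

# -print0 / read -d '' keep file names with spaces in one piece.
find . -type f -name "*.nlp" -print0 | while IFS= read -r -d '' i
do
  echo "$i"
  # Write to a temporary file; replace the original only if iconv succeeded.
  if iconv -f ISO-8859-1 -t UTF-8 "$i" > "$i-utf" ; then
    mv "$i-utf" "$i"
  else
    rm -f "$i-utf"
  fi
done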

Of course it doesn’t perform ultra-clever, Novelang-friendly transcoding like changing «latin-capital-letter-o-with-double-acute» to Ő inside the sources, but I can’t beat its performance / price ratio.

Found on: http://niko.gramophon.com/index.php?op=ViewArticle&articleId=6380

2009-02-16

Novelang-0.18.0 released!

Latest release of Novelang available here.

This version brings an experimental support for custom charset for both source and rendered documents. See "Internationalization" chapter for details.

Enjoy!

2009-02-15

More charsets

While Novelang documentation and samples are in English, Novelang already supports French characters and typography very well. I must confess there has been no testing yet with charsets other than ISO-8859-1 (Western European) which is (almost) perfect for both French and English.

What does happen when trying to add support for a new charset? This should be just a few additional declarations inside the grammar file. Here are Hungarian characters submitted by a reader of this blog:

ö ü ó ő ú é á ű í 
Ö Ü Ó Ő Ú É Á Ű Í

First, Novelang has to read those characters as they are, provided the right charset.

According to Wikipedia, all those characters are part of the ISO-8859-2 (Central European) charset. See: http://en.wikipedia.org/wiki/ISO_8859-2. They don’t seem to belong to the Mac Roman charset. Of course, they fit perfectly into the UTF-16 charset.

This is confirmed by Smultron, a text editor which cleverly refuses to save a file containing characters which don’t belong to declared charset.
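
The same check can be done in a few lines of Python, using the codecs of its standard library (so this tells what Python's codecs accept, not how Novelang reads files). The function name is made up for the example.

```python
HUNGARIAN = "ö ü ó ő ú é á ű í Ö Ü Ó Ő Ú É Á Ű Í"


def unencodable(text, charset):
    """List, in order of appearance, the characters the charset cannot represent."""
    missing = []
    for ch in text:
        if ch.isspace():
            continue
        try:
            ch.encode(charset)
        except UnicodeEncodeError:
            if ch not in missing:
                missing.append(ch)
    return missing


for charset in ("iso-8859-1", "iso-8859-2", "mac_roman", "utf-16"):
    print(charset, unencodable(HUNGARIAN, charset))
```

Only the vowels with a double acute accent are a problem for ISO-8859-1; everything else already fits.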

Starting Novelang with -Dfile.encoding=ISO-8859-2 the two O with double acute are rejected as unknown but other characters do pass, including the U with double acute.

Starting Novelang with -Dfile.encoding=UTF-16 gives a lot of mess because of the two bytes instead of one.

Starting Novelang with UTF-16 or ISO-8859-2 as default value inside the novelang.parser.Encoding class, all characters do pass. This requires recompiling Novelang.

What’s the problem with -Dfile.encoding system property? I can’t tell. Anyways, this is not the right place to set the charset of source documents because this property would apply to all other files, including configuration files. So the constant inside novelang.parser.Encoding should become a configurable thing.

How configurable? I can see several useful places to set the charset.

— For the whole Novelang daemon instance (almost like -Dfile.encoding).

— For a whole Book with a source-charset command.

— For a source document read from a book using insert command.

— For currently rendered document with a source-charset query parameter (making sense only for Parts).

This would enable Books with various charsets in their Parts. Great!

Now what happens at rendering time?

PDF may render well, given the right fonts. When specifying no font directory (with --font-dirs command-line argument) there is a number sign # instead of the vowels with double acute accent. With Linux Libertine font (shipped with Novelang) those characters appear as they should.

PDF is the easy case, because it uses Unicode during all its processing, and finally embeds the fonts.

HTML is more complicated and it will never show well if user agent doesn’t provide the right font. Before that, HTML should embed the right character with the right charset.

Setting novelang.parser.Encoding to ISO-8859-2 is not enough. Characters appear correctly in HTML only with -Dfile.encoding=ISO-8859-2 set in addition. I guess this is needed for the streaming of transformation result.

Novelang passes rendered document charset as parameter to XSL stylesheets. This parameter by now reflects the constant value in novelang.parser.Encoding. Novelang’s default XSL stylesheet for HTML injects this parameter in the HTML header (meta/content/charset). What about adding a rendering-charset command for Books, and a rendering-charset query parameter? There are several things to consider.

— An XSL stylesheet defines the charset of the resulting HTML in content/charset. In the default stylesheet, this is where the value of our new rendering-charset should appear. But users may write stylesheets that don’t use this parameter.

— An XSL stylesheet may use characters outside of its own charset. This is made possible using character escape (like &mdash;), possibly through XML entity inclusion. Novelang comes with ISOlat1.pen, ISOnum.pen, ISOpub.pen standard entity sets.

— An XSL stylesheet may render characters outside of the charset of resulting document. This would be obviously the case with Books made of documents with various charsets, but this is already the case with escaped characters like OE ligatured, a character which does not appear in ISO-8859-1. Current implementation would perform unneeded escapes if rendering a document in a charset which supports OE ligatured.

From the last case, we can state this rule: “when feeding the XSL stylesheet with text, characters which are not supported by the charset of resulting HTML must be escaped”. For now, Novelang does a bit of this inside the novelang.rendering.HtmlWriter. When feeding the stylesheet with text, the HtmlWriter calls novelang.parser.Escape#escapeHtml which avoids wrecking HTML with literal &, < and >. But escaping also occurs for OE ligatured, which plagues French users as it is not a part of ISO-8859-1! So we know where to plug character escaping, but this should occur depending on the charset of resulting HTML.

How could Novelang know the charset used by the output of an XSL stylesheet? The rendering-charset parameter described above offers a straightforward solution. What about some XSL metadata to express the specific charset needed by a stylesheet? Use cases are quite obscure and XSL metadata is not a simple thing. Go for the rendering-charset parameter!

In order to keep HTML source readable, character entities would be named character entities, instead of numerical ones. Sometimes – like for O with double acute – the named entity doesn’t exist, but whenever possible, something like &Egrave; is definitely more readable than &#200;. In order to keep all definitions at the same place, named entity could appear aside of character declaration in the Novelang grammar. It just implies some extra parsing in the SupportedCharactersGenerator.
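
As an illustration of charset-dependent escaping with a preference for named entities, here is a Python sketch. It is not the code of novelang.parser.Escape: the function name is made up, and the entity names come from Python's html.entities table instead of the Novelang grammar.

```python
from html.entities import codepoint2name

# Always escaped, whatever the charset, to keep the markup well-formed.
MARKUP = {"&": "&amp;", "<": "&lt;", ">": "&gt;"}


def escape_for_charset(text, charset):
    """Escape markup characters, plus any character the target charset lacks."""
    out = []
    for ch in text:
        if ch in MARKUP:
            out.append(MARKUP[ch])
            continue
        try:
            ch.encode(charset)
            out.append(ch)  # supported by the charset: leave it alone
        except UnicodeEncodeError:
            name = codepoint2name.get(ord(ch))
            # Prefer a readable named entity, fall back to a numerical one.
            out.append(f"&{name};" if name else f"&#{ord(ch)};")
    return "".join(out)


print(escape_for_charset("cœur & ő", "iso-8859-1"))
print(escape_for_charset("cœur & ő", "utf-8"))
```

With UTF-8 as rendering charset the OE ligature is left untouched, which removes the unneeded escapes mentioned earlier.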

As I’m writing this post with Novelang, I realize there is no generic way to escape characters which are part of the grammar. The use case here is obvious: my source document is in ISO-8859-1 and I want those Hungarian characters to appear in the text. The novelang.parser.Escape class holds hardcoded definitions for character escapes with Unicode names (like euro-sign) and even HTML entity names as shortcuts (oelig being an alias of latin-small-ligature-oe). The novelang.parser.Escape class could feed its table from values in SupportedCharacters which is kept in sync with the grammar.

Note: the two terms charset and encoding are almost synonyms. Because W3C and ISO seem to prefer the term charset, Novelang should use it. This saves the more generic term encoding for other uses, while charset is clearly scoped to characters. The -Dfile.encoding system property is just badly named (and its effect is not restricted to files). See: http://en.wikipedia.org/wiki/Character_encoding.

2009-02-11

Novelang-0.17.0 released!

Latest release of Novelang available here. Just introduced unlimited level depth. There was no special need for this feature: a depth of 2 is enough most of the time. As Linus Torvalds said: "More than 3 levels of indentation means you're screwed". But while developing embedded lists I found that a generic way to handle levels was removing an ambiguity. Please note that built-in stylesheets don't handle a depth greater than 2 for now. Enjoy!

SourceForge: useful links when releasing

As SourceForge greatly improved its services, especially for remote shell and file upload, there are plenty of things to learn again. Here is the documentation on SSH client, File Release System, File management service. It's good to note that only rsync/SSH, SFTP and SCP are recommended above 20MB. rsync supports resume but there doesn't seem to be an Ant task for it. SCP'ing file release:

scp Novelang-VERSION.zip USERNAME@frs.sourceforge.net:uploads

SCP'ing documentation (while in target/documentation/site):

 scp * USERNAME,PROJECTNAME@frs.sourceforge.net:/home/groups/n/no/novelang/htdocs

Opening an SSH session:

 ssh -t USERNAME,PROJECTNAME@shell.sourceforge.net create

Session info with timeleft and shutdown. Unzip javadoc after uploading it, from project home:

 unzip -o -d htdocs/javadoc javadoc.zip

Now, all of this can be automated in the build scripts!

2009-02-09

Multiple outputs

The HTML documentation of Novelang is made of one huge HTML page. That’s inconvenient. In a perfect world there would be a means to split the document into several pages. Curious about how this could be done, I found that Xalan 2.7.1 provides an extension which helps a lot with this through the Redirect class. http://xml.apache.org/xalan-j/extensionslib.html#redirect The syntax is clean: just wrap the XSL expression with an element provided by the extension. I think this extension is too permissive, as you can specify absolute file paths. At least it’s a good starting point for writing a new one that accepts only logical names for files.

2009-02-08

Of inconsistent depth for levels and embedded lists

Now I’m working on the tree rehierarchization, which means taking flat items and giving them a tree-like structure. This applies to embedded lists and levels. Take a Novelang source document like this:

Blah.

== Depth 1

Boo.

=== Depth 2

Yuck.

The parser converts this into an almost flat structure where text preceded by a sequence of equal signs becomes a level introducer. So we get this intermediary structure:

n:part
 +-- n:paragraph-regular "Blah."
 +-- level-introducer "==", "Depth 1"
 +-- n:paragraph-regular "Boo."
 +-- level-introducer "===", "Depth 2"
 +-- n:paragraph-regular "Yuck."

A rather hidden step is the hierarchizer, which converts level introducers (which also convey indentation information) into plain levels. The hierarchizer’s result looks like this:

n:part
 +-- n:paragraph-regular "Blah."
 +-- n:level
      +-- n:level-title "Depth 1"
      +-- n:paragraph-regular "Boo."
      +-- n:level
           +-- n:level-title "Depth 2"
           +-- n:paragraph-regular "Yuck."
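
The conversion from the flat structure to the nested one can be sketched with a stack of open levels. This is a minimal illustration under stated assumptions, not Novelang’s actual hierarchizer: the class names HierarchizerSketch, Item and Node are invented for the example, and the input is assumed to have consistent depths.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

/** Sketch: turn flat level introducers and paragraphs into nested levels. */
final class HierarchizerSketch {

  /** Flat item as seen after parsing; depth 0 stands for a plain paragraph. */
  static final class Item {
    final int depth;
    final String text;
    Item(int depth, String text) { this.depth = depth; this.text = text; }
    static Item paragraph(String text) { return new Item(0, text); }
    /** "==" is depth 1, "===" is depth 2, and so on. */
    static Item introducer(String equalSigns, String title) {
      return new Item(equalSigns.length() - 1, title);
    }
  }

  static final class Node {
    final String name;
    final String text; // null for container nodes
    final List<Node> children = new ArrayList<>();
    Node(String name, String text) { this.name = name; this.text = text; }
  }

  /** Assumes consistent depths; inconsistent input is not detected here. */
  static Node hierarchize(List<Item> items) {
    Node part = new Node("n:part", null);
    Deque<Node> openLevels = new ArrayDeque<>(); // innermost level on top
    for (Item item : items) {
      if (item.depth > 0) {
        // Close every open level that is as deep as, or deeper than, the new one.
        while (openLevels.size() >= item.depth) {
          openLevels.pop();
        }
        Node level = new Node("n:level", null);
        level.children.add(new Node("n:level-title", item.text));
        innermost(part, openLevels).children.add(level);
        openLevels.push(level);
      } else {
        innermost(part, openLevels).children.add(new Node("n:paragraph-regular", item.text));
      }
    }
    return part;
  }

  private static Node innermost(Node part, Deque<Node> openLevels) {
    return openLevels.isEmpty() ? part : openLevels.peek();
  }

  static void print(Node node, String indent) {
    System.out.println(indent + node.name + (node.text == null ? "" : " \"" + node.text + "\""));
    for (Node child : node.children) {
      print(child, indent + "  ");
    }
  }

  public static void main(String[] args) {
    print(hierarchize(Arrays.asList(
        Item.paragraph("Blah."),
        Item.introducer("==", "Depth 1"),
        Item.paragraph("Boo."),
        Item.introducer("===", "Depth 2"),
        Item.paragraph("Yuck."))), "");
  }
}
```

The size of the stack is the current depth, so closing levels is just a matter of popping until the new introducer fits.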

This all works the same with (yet unimplemented) embedded lists so we won’t have the discussion twice. Just keep in mind what embedded lists look like:

- depth 1
  - depth 2

Now what should happen if the source document looks like this?

=== Depth 2

== Depth 1

More generally, what should happen when a level introducer has no preceding level introducer of a smaller depth? A tempting approach is promoting the first item to the smallest depth. This looks smart but it would distort information. The example above becomes:

n:part
 +-- n:level
      +-- n:level-title "Depth 2"
 +-- n:level
      +-- n:level-title "Depth 1"

The hierarchizer could also create a level from nothing in order to keep the correct depth. This is information distortion as well:

n:part
 +-- n:level         // Created from nothing
      +-- n:level
           +-- n:level-title "Depth 2"
 +-- n:level
      +-- n:level-title "Depth 1"

Here is the real question: how could inconsistent level depth mean something? There could be some extreme cases when concatenating different Parts, but this should not pollute more general cases. Maybe the intent is to have a rendered document with small titles in some kind of introduction happening before the big title of some kind of chapter. Then it should be handled at stylesheet level. Yet the cleanest and simplest approach for handling depth inconsistencies for levels and embedded lists is to report an error.
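
Reporting the error boils down to one rule: an introducer may be at most one step deeper than the one before it, and the first one must be at depth 1. Here is a minimal sketch of that check. The class name DepthCheckSketch and the choice of IllegalArgumentException are assumptions made for the example, not Novelang’s actual error reporting.

```java
/** Sketch: reject level (or embedded list) depths that skip a parent. */
final class DepthCheckSketch {

  /**
   * Takes the depths of the introducers in document order, where "==" is depth 1.
   * Throws if an introducer has no open parent exactly one depth above it.
   */
  static void check(int... depths) {
    int previous = 0; // nothing open before the first introducer
    for (int i = 0; i < depths.length; i++) {
      int depth = depths[i];
      if (depth < 1) {
        throw new IllegalArgumentException("Introducer " + i + " has invalid depth " + depth);
      }
      if (depth > previous + 1) {
        throw new IllegalArgumentException(
            "Introducer " + i + " has depth " + depth
                + " but the deepest open level is " + previous);
      }
      previous = depth;
    }
  }

  public static void main(String[] args) {
    check(1, 2, 2, 1); // consistent: passes silently
    try {
      check(2, 1);     // "=== Depth 2" before "== Depth 1"
    } catch (IllegalArgumentException e) {
      System.out.println("Rejected: " + e.getMessage());
    }
  }
}
```

Comparing with the previous introducer is enough because, after an introducer of depth d, the open levels are exactly depths 1 to d.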

2009-02-05

JPdfUnit

JPdfUnit looks sweet. It checks that some PDF document has expected properties, like containing some text or embedding some fonts.

2009-01-05

Novelang-0.16.0 released!

Latest release of Novelang available here. This version introduces changes that may break existing documents and stylesheets.
- Brand new naming scheme for syntactic nodes!
- Chapters and sections don’t exist anymore. All that stylesheets will see are “levels”.
- Top-level delimiter (formerly “chapter”) in Part now starts with two equal signs == instead of three asterisks.
- Character escape is now based on Unicode name, plus optional HTML entity name.
- Option createchapter of the insert command renamed to createlevel.
- Updated documentation accordingly.
Enjoy! c.

2009-01-04

How to drop a feature : CSS for XML

As I was updating documentation for the upcoming Novelang-0.16.0 I suddenly got annoyed that Safari doesn’t display raw XML in a convenient manner. Camino does a far better job by assigning a default CSS and some JavaScript enabling element folding. “Let’s have some fun”, I said to myself. I quickly drafted a CSS applying directly to Novelang’s XML elements. Including the CSS was done with one line of code inside the XmlWriter. XML just needs a processing instruction like this:

<?xml-stylesheet type="text/css" href="/xml.css"?>

As I quickly discovered, this was the wrong approach. CSS has no means to add the name of the element itself through the :before and :after pseudo-elements, so for each XML element I would have to add the two selectors and copy-paste the name of the element. And giving the element names a special appearance is not possible, so the element tag would have the same appearance as the content it delimits. CSS for XML is fairly limited, indeed. Now I realize that raw XML is only useful for debugging stylesheets, so it’s not a good place for eye candy. If folding makes sense, it would be in the default HTML view, for getting an overview of generated levels.