The Novelang blog: May 2008

2008-05-22

Better character escaping

For now a character escape looks like this: &oelig; for "œ", &lt; for "<", and so on. Some escapes exist only for characters which are not otherwise supported as words or punctuation signs. You cannot write this, because a closing parenthesis requires a matching opening parenthesis:

Either 1) choice one 2) choice two.

The correct solution would be to support lists but for now there is the workaround:

Either 1&rp; choice one 2&rp; choice two.

That's pretty uncomfortable anyway. The current syntax (&character-name;) is easy to parse and looks like well-known HTML entities. It could even be documented with an HTML page showing all available escapes. For now, to find such escapes you have to look in the source code. Pretty ugly, right?

A better syntax could be: `oe for "œ", `< for "<". The problem lies with the two-letter escape `oe. With the current syntax (&character-name;) it's easy to resolve the escaped code from some Java-defined table. The same table can generate the list of escapes for the (not yet existing) online help. If there is an unknown number of letters in the escaped code, the ANTLR-generated lexer handles it perfectly, "eating" characters until they match a sequence it knows. But the ANTLR-generated lexer cannot tell, from a static definition, what the tokens it recognizes look like. The parser exposes token names (the tokenNames constant, a String[]) but the characters themselves (like `oe) appear nowhere because they are the result of a computation which can only occur at runtime. I've read somewhere, in the book or elsewhere, that custom code may perform a lookahead from inside the lexer. This may work.

There is the temptation to drop multi-letter support as œ, Œ and € seem to be the only cases, so let's say we can find another hack for them. Amazingly, œ and Œ are not part of the ISO-8859-1 character set, and while it's easy to get them on the keyboard of my Mac I'm not sure it's as easy on other platforms.

A problem with multi-letter escapes using a single-character delimiter is the possible mess with existing text when adding a new rule. The best option I can see for now is to use a single backquote for single-letter escapes and another non-colliding rule for multi-letter escapes. That may not be so simple, because resulting text like "<x>" has to be written `<x`>, so the `<x` fragment may be understood as a multi-character escape. Once again, the lexer can handle it well as long as no multi-character escape starts with a character used in a single-character escape, but we lose reflective abilities.

WikiCreole has some interesting discussion on character escapes, but it is about disabling whole markup features and I find it sometimes ambiguous. I'd like to keep the escape mechanism at the character level.
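To make the "Java-defined table" idea concrete, here is a minimal sketch of how the current &character-name; syntax could be resolved from one table that also produces the listing for online help. The class name EscapeTable, its methods and the exact set of entries are assumptions for illustration (the name "lt" in particular is a guess); this is not Novelang's actual code.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical escape table; names and entries are illustrative only. */
final class EscapeTable {

  // Insertion-ordered, so the generated help listing is stable.
  private static final Map<String, String> ESCAPES = new LinkedHashMap<>();
  static {
    ESCAPES.put("oelig", "\u0153"); // œ
    ESCAPES.put("lt", "<");         // assumed name for the "<" escape
    ESCAPES.put("rp", ")");         // right parenthesis, as in 1&rp;
  }

  // An escape is an ampersand, a name, then a semicolon.
  private static final Pattern ESCAPE =
      Pattern.compile("&([A-Za-z][A-Za-z0-9-]*);");

  /** Replaces every known escape; fails loudly on an unknown one. */
  static String unescape(String text) {
    Matcher matcher = ESCAPE.matcher(text);
    StringBuffer result = new StringBuffer();
    while (matcher.find()) {
      String replacement = ESCAPES.get(matcher.group(1));
      if (replacement == null) {
        throw new IllegalArgumentException("Unknown escape: " + matcher.group());
      }
      matcher.appendReplacement(result, Matcher.quoteReplacement(replacement));
    }
    matcher.appendTail(result);
    return result.toString();
  }

  /** Builds the escape listing from the same table, one escape per line. */
  static String describeAll() {
    StringBuilder help = new StringBuilder();
    for (Map.Entry<String, String> entry : ESCAPES.entrySet()) {
      help.append('&').append(entry.getKey()).append("; -> ")
          .append(entry.getValue()).append('\n');
    }
    return help.toString();
  }

  public static void main(String[] args) {
    System.out.println(unescape("Either 1&rp; choice one 2&rp; choice two."));
    System.out.print(describeAll());
  }
}
```

The point of the sketch is only that lookup and documentation share one source of truth, which is exactly what gets lost when the escapes live inside lexer rules.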

Better identifiers

The Book feature described in the previous post relies on identifiers for Chapters, Sections and Paragraphs. Now let's see how to set an identifier in a Part file. Paragraphs have special identifiers in the sense that they must be unique inside their Section, not inside the whole Book. But the semantics must remain the same -- just with a different marker. Decorations (additional information) for a Chapter / Section can take place right below the element.

*** My Chapter
#some-identifier-here

[Sections and paragraphs follow]

But Paragraphs don't have nested elements so the nesting rule is somewhat violated. Decoration for a Paragraph must look like this:

§some-paragraph-identifier
This is my paragraph.

Placing the decoration above the Chapter / Section delimiter would be consistent but I find it less readable:

#some-chapter-identifier
*** My Chapter

#some-section-identifier
=== My Section

§some-paragraph-identifier
My Paragraph.

There is another feature I never talked about: Tags. This is for adding even more decorations to those elements.

#some-chapter-identifier
@Tag-one @Tag-two
*** My Chapter

#some-section-identifier
@Tag-one
=== My Section

§some-paragraph-identifier
@Tag-three @Tag-four
My Paragraph.

That seems pretty unreadable now, but the case above is not realistic. And indentation can be added to differentiate the decorating stuff. A real-world example looks more like this:

  #some-chapter-identifier
  @Tag-one @Tag-two
*** My Chapter

  #some-section-identifier
  @Tag-one
=== My Section

  §some-paragraph-identifier
  @Tag-three @Tag-four
Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Pellentesque
tortor justo, blandit at, tincidunt sed, blandit et, dolor. Donec pede
massa, cursus sed, ornare et, volutpat non, mi. In pede. Nullam
hendrerit tellus vitae justo. Etiam et massa in dolor scelerisque
faucibus. Praesent libero ante, ultrices eu, egestas ac, interdum in,
dolor. Phasellus sit amet tellus eget metus ultricies suscipit. Nam sed
neque sit amet neque gravida bibendum. Cras elementum. Suspendisse
potenti.

I definitely don't like decorations above Chapters and Sections. Let's see how it looks when placing them below:

*** My Chapter
  #some-chapter-identifier
  @Tag-one @Tag-two

=== My Section
  #some-section-identifier
  @Tag-one
 
  §some-paragraph-identifier
  @Tag-three @Tag-four
Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Pellentesque
tortor justo, blandit at, tincidunt sed, blandit et, dolor. Donec pede
massa, cursus sed, ornare et, volutpat non, mi. In pede. Nullam
hendrerit tellus vitae justo. Etiam et massa in dolor scelerisque
faucibus. Praesent libero ante, ultrices eu, egestas ac, interdum in,
dolor. Phasellus sit amet tellus eget metus ultricies suscipit. Nam sed
neque sit amet neque gravida bibendum. Cras elementum. Suspendisse
potenti.

That's definitely better. Let's say that Paragraph identifiers are a marginal case. Let's say that it's pretty uncommon to have identifiers for both Chapters and Sections, because if you want to pick the whole Chapter it's unusual to also pick some Sections inside. And let's avoid identifiers whenever possible: that will reduce bloat and require less typing. Let's say a Section / Chapter title can be referenced as an identifier when it is unique in the whole Book scope. The ::add syntax supports almost any character, so a title with spaces and nested text (like parentheses) is OK. Now let's mix it all:

*** My Chapter
  @Tag-one @Tag-two

=== My Section
  #some-section-identifier @Tag-one

  §1
Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Pellentesque
tortor justo, blandit at, tincidunt sed, blandit et, dolor. Donec pede
massa, cursus sed, ornare et, volutpat non, mi. In pede. Nullam
hendrerit tellus vitae justo. Etiam et massa in dolor scelerisque
faucibus. Praesent libero ante, ultrices eu, egestas ac, interdum in,
dolor. Phasellus sit amet tellus eget metus ultricies suscipit. Nam sed
neque sit amet neque gravida bibendum. Cras elementum. Suspendisse
potenti.

That looks pretty readable to me.
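To check that the decoration lines stay easy to interpret, here is a minimal sketch that reads one decoration line (a # or § identifier, plus any number of @ tags). The Decoration class and its field names are hypothetical, and the rule that a line carries at most one identifier is my own assumption; the real grammar would live in the ANTLR parser.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical holder for one parsed decoration line. */
final class Decoration {
  final String identifier;       // null when the line has no identifier
  final boolean paragraphScoped; // true for the § marker, false for #
  final List<String> tags;

  Decoration(String identifier, boolean paragraphScoped, List<String> tags) {
    this.identifier = identifier;
    this.paragraphScoped = paragraphScoped;
    this.tags = tags;
  }

  /** Parses a line such as "  #some-section-identifier @Tag-one". */
  static Decoration parse(String line) {
    String identifier = null;
    boolean paragraphScoped = false;
    List<String> tags = new ArrayList<>();
    for (String token : line.trim().split("\\s+")) {
      if (token.isEmpty()) {
        continue; // blank line: nothing to decorate
      }
      char marker = token.charAt(0);
      String name = token.substring(1);
      if (name.isEmpty()) {
        throw new IllegalArgumentException("Missing name after marker: " + token);
      }
      if (marker == '#' || marker == '\u00A7') { // \u00A7 is the § sign
        if (identifier != null) {
          throw new IllegalArgumentException("More than one identifier: " + line);
        }
        identifier = name;
        paragraphScoped = marker == '\u00A7';
      } else if (marker == '@') {
        tags.add(name);
      } else {
        throw new IllegalArgumentException("Not a decoration: " + token);
      }
    }
    return new Decoration(identifier, paragraphScoped, tags);
  }

  public static void main(String[] args) {
    Decoration d = parse("  #some-section-identifier @Tag-one");
    System.out.println(d.identifier + " " + d.tags);
  }
}
```

Because each token announces its kind with its first character, the identifier and the tags can share one line, as in the mixed example above, without any extra delimiter.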

2008-05-17

The Book feature

I'm thinking about the new Book feature: a way to define how to fetch different parts from here and there and assemble them into something called a Book. Splitting the work into small files is more convenient for version tracking, collaborative work, more accurate search and so on. A Book is a separate file with its own syntax. It's a perfect place to hack the Parts in many ways, without polluting their structure.

Which kind of hacking? Parts are parsed into an AST (Abstract Syntax Tree) where Chapters, Sections and Words are well-identified nodes. Nodes can be added and removed, then handled at the rendering stage. One such hack is word counting: a function walks recursively through the whole tree, counting occurrences of the WORD node, then adds a METADATA node right under the root. Another hack is defining scoped styles (for one given Chapter or Section), so different pieces of text will look different while the underlying Part files stay clean and simple. In fact, the Book file holds the internal "logic" of the final document, while Part files hold simple content.

That's how I like to think about Novelang: a platform to parse, hack, then render trees representing text documents. Sure, that sounds much like Cocoon, which does the same thing using XML. But I already played with Cocoon and its generic approach adds a lot of burden to things that should remain simple, like picking parts of different files with the XPath syntax.

Of course AST hacking should be open to any developer, with a syntax for embedding custom functions while keeping the Book syntax unchanged. That's an additional constraint when defining the Book syntax.

First, I'd like to insert a file as it is, i.e. insert its AST in place. For this I introduce the :: symbol (double colon), announcing a function name. The ::insert function takes a single parameter: a URL.

::insert file://foo.nlp

I can also insert several files at once. As the insertion order may not be deterministic, I'll have to think about passing a parameter here, or adding a global sort option, but I won't solve all the problems now. So far we have this:

::insert file://*.nlp

Now I want to pick some elements inside a Part. Chapters and Sections support identifiers. This is the way to say "load this file and become aware of all Chapters and Sections which have an identifier":

::import file://foo.nlp

Of course wildcards are supported:

::import file://*.nlp

Now I create a Chapter which doesn't exist in any Part. As with the Part syntax, a Chapter title may have punctuation signs, parentheses and so on:

*** Chapter title

I want to change the style for this very Chapter, so I use another function which inserts a STYLE node in the AST. The style identifier must be a well-formed Word. The ::style function is understood as relative to the previous Chapter definition.

::style define-chapter-specific-style

I can include a Section the same way:

=== Section title

Now I add a Section whose identifier is "Some chapter or section without its title." (ending dot included). There should be one and only one Section / Chapter with such an identifier in all imported Parts.

::add Some chapter or section without its title.

If some Part element has a title, I may want to preserve it. Note the introduction of a function parameter (colon prefix):

::add :withtitle Some section

Now the tricky thing: adding Paragraphs. This implies finding a way to identify Paragraphs inside a Section, but that's about Part file syntax and won't be discussed here; let's just pretend it works. I must reference the Section (like above, except that the :withtitle parameter is illegal), then I add some valued parameters, one for each Paragraph.

::add Some Section
 :p  Some Paragraph  
 :p+ Some other

I'm proud of this syntax because it is unambiguous, even though Section and Paragraph identifiers may contain an unknown number of Words (as identifiers are Paragraphs). The AST looks like this:

 (MACRO
   (MACRO_NAME (WORD add))
   (PARAGRAPH (WORD Some) (WORD Chapter) (WORD or)  (WORD Section))
   (MACRO_PARAMETER p (PARAGRAPH (WORD paragraph-id1)))
   (MACRO_PARAMETER p+ (PARAGRAPH (WORD paragraph-id1)))
 )

This will add some burden to the function code, which has to interpret many things, but the goal was to provide a clean syntax with no delimiter like a closing brace to mark the end of a list. More generally, the Book syntax relies on the idea of declaring a tree of a known maximum depth, with well-known relationships between nodes (a Book contains Chapters containing Sections containing Paragraphs; and Functions may have parameters). So it's always clear which node a Function invocation refers to, and this avoids things like closing braces.
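Coming back to the word-counting hack mentioned earlier in this post, here is a minimal sketch of what such a tree function could look like. The AstNode class is a stand-in I made up for the example (the real tree comes from the ANTLR-generated parser), and the PART, CHAPTER and WORD_COUNT node names are assumptions; only WORD, PARAGRAPH and METADATA appear in the text above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical, simplified tree node; not the ANTLR tree type. */
final class AstNode {
  final String type;
  final String text; // null for nodes that only hold children
  final List<AstNode> children = new ArrayList<>();

  AstNode(String type, String text, AstNode... children) {
    this.type = type;
    this.text = text;
    this.children.addAll(Arrays.asList(children));
  }

  /** Walks the whole tree recursively, counting WORD nodes. */
  static int countWords(AstNode node) {
    int count = "WORD".equals(node.type) ? 1 : 0;
    for (AstNode child : node.children) {
      count += countWords(child);
    }
    return count;
  }

  /** Counts first, then adds a METADATA node right under the root. */
  static void addWordCount(AstNode root) {
    int count = countWords(root);
    AstNode wordCount = new AstNode("WORD_COUNT", String.valueOf(count));
    root.children.add(0, new AstNode("METADATA", null, wordCount));
  }

  public static void main(String[] args) {
    AstNode root = new AstNode("PART", null,
        new AstNode("CHAPTER", null,
            new AstNode("PARAGRAPH", null,
                new AstNode("WORD", "Lorem"),
                new AstNode("WORD", "ipsum"))));
    addWordCount(root);
    System.out.println(root.children.get(0).children.get(0).text);
  }
}
```

The count is stored under its own node type rather than as a WORD, so running the function again would not inflate the total.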

2008-05-12

Command-line arguments with JOpt Simple

Novelang started as an interactive tool because it was easier to use a Web browser for a quick preview of generated documents. But now that I want to publish documentation on the still-to-come website I need to automate document generation, because the Novelang website has to be generated with Novelang (or I'd be ashamed). So I have to support a command-line interface. For now it's a fairly simple one. Help prints when one of the -h, -? or --help options is present, even if other options are given. That seemed useful to me because when I type some long command and get lost in the options, I don't want to run it (starting dangerous things or polluting the console with error messages) nor hit Ctrl-C to request help at the next prompt -- and have to copy-paste it again. Here is the help message for now:

Usage:   Main [Options] <document [document2 [...]]>

<document>: logical name of a document to generate.
Starts with '/part/' or '/book/' (like URLs).
Valid names: /part/somedir/mydoc.html or /book/mybook.pdf

Option                      Description                                      
--------------------------- ------------------------------------------------------
-?, -h, --help              Shows this help message                          
-t, --targetDirectory <dir> Where generated files go to (default is './generated')

It's surprisingly hard to write concise help messages. I had to tweak the wording to make the -t option as short as possible and I'm still above 80 chars. Including line breaks and whitespace to make the "(default is [...]" message appear on another line doesn't work well, as the framework generating the help message understands this as one single line and generates many, many dashes in the table header. It could have limited the dashes to an underline of the "Option" and "Description" headers. I chose to explain document names right after the general command syntax because the option list "breaks" the layout somehow. On the other hand, if the option list grows a lot, the general syntax will scroll up out of view. But there is always the option to add a "verbose help" option later.

One special case arises when a document file name starts with a dash (-) and can therefore be confused with an option. A double dash (--) as the last option disambiguates this, so we can write the following (though I'm not sure Novelang supports file names starting with a dash, but that's another story):

Main -- -mybook.pdf
Main -t somewhere -- -mybook.pdf
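To show what the double dash and the "help wins" rule amount to, here is a hand-rolled sketch using only the JDK. It is not how JOpt Simple does it and it does not use its API; the class name CommandLineSketch and its fields are made up for the example, and the default directory is the one from the help message above.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical, simplified argument splitter (plain JDK, no JOpt Simple). */
final class CommandLineSketch {
  final boolean helpRequested;
  final String targetDirectory;
  final List<String> documents;

  CommandLineSketch(boolean helpRequested, String targetDirectory,
                    List<String> documents) {
    this.helpRequested = helpRequested;
    this.targetDirectory = targetDirectory;
    this.documents = documents;
  }

  static CommandLineSketch parse(String... args) {
    boolean help = false;
    String target = "./generated"; // default from the help message
    List<String> documents = new ArrayList<>();
    boolean optionsEnded = false;  // becomes true once "--" is seen
    for (int i = 0; i < args.length; i++) {
      String arg = args[i];
      if (optionsEnded || !arg.startsWith("-")) {
        documents.add(arg);        // after "--", a leading dash means nothing
      } else if (arg.equals("--")) {
        optionsEnded = true;
      } else if (arg.equals("-h") || arg.equals("-?") || arg.equals("--help")) {
        help = true;               // help is recorded whatever else is given
      } else if (arg.equals("-t") || arg.equals("--targetDirectory")) {
        if (i + 1 >= args.length) {
          throw new IllegalArgumentException("Missing directory after " + arg);
        }
        target = args[++i];
      } else {
        throw new IllegalArgumentException("Unknown option: " + arg);
      }
    }
    return new CommandLineSketch(help, target, documents);
  }

  public static void main(String[] args) {
    CommandLineSketch parsed = parse("-t", "somewhere", "--", "-mybook.pdf");
    System.out.println(parsed.targetDirectory + " " + parsed.documents);
  }
}
```

A caller would check helpRequested first and print the help message without generating anything, which gives the behavior described above for long commands typed half-way.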

Writing a nice command-line interface is always more complicated than it seems. Lucky me, I found JOpt Simple, an excellent framework that perfectly suited my needs and worked flawlessly for everything I needed, including double dash disambiguation. I just ran into some minor annoyances:

Esoteric Artistic Free License.
Maven-only distribution hidden somewhere on the website.
Doesn't use Java Generics (at least OptionSet#nonOptionArguments() should return a List<String>).

I'd also appreciate multiline option descriptions. Paul R. Holser, author of JOpt Simple, claims that his tool is better than competing ones, and he's not afraid to link to them from his project's home page. After repeated disappointments with Jakarta Commons-CLI and a few tries to grok the APIs of his competitors, I agree that JOpt Simple is far above the others.