2008-11-02

Refactoring Novelang grammar

I have started refactoring the Novelang grammar, and it's a big job. Along the way I switched to ANTLR-3.1.1, the latest version of ANTLR. Development happens in the ANTLR-3.1.1 branch.

ANTLR is the greatest tool for generating parsers. Since version 3.1 it supports grammar imports, which means a complex grammar can be split into smaller files. With careful design it would be possible to let third-party developers extend the Novelang grammar, as ANTLR supports rule overriding. Alas, ANTLR-3.1.1 doesn't work well with multiple import levels, so I'm keeping one huge grammar file for now. You can have a look at the current Novelang grammar (master branch).

For the end user, the biggest feature brought by this refactoring is support for "monoline" text items. Basically, a monoline item sits alone on its line, delimited by a line break on each side, and may stand in the middle of a paragraph. For now, Novelang only recognizes a URL when it is delimited by two pairs of line breaks, that is, with a blank line before and after it:
(This is some paragraph before the URL.)

http://novelang.sourceforge.net

(This is some paragraph after the URL.)
Recognizing a URL as a "monoline" text item would allow something like this:
This is a paragraph.
http://with-url-inside.com
...Same paragraph, continued.
That's a lot more natural. The URL must still start at the beginning of the line, because that makes it much easier to copy from the text editor. I previously discussed URL syntax here. The Coding Horror blog has a nice post that should deter anyone from including URLs in plain text with no machine-understandable delimiter. The full-blown URL syntax supports URL decorations like this:
Go to 
  "Novelang website" 
  [Novelang website on Sourceforge.net]
http://novelang.sourceforge.net
and see all useful links.
The quoted and bracketed text blocks are optional and provide display text and alternate text. I can see no reasonable way to support them at the grammar level. The best way to handle them is at the tree-mangling stage (reordering the Abstract Syntax Tree generated by the parser). This means the parser-generated AST should include nodes describing whitespace and line breaks.

Support for monoline items is helpful (necessary?) for supporting lists. As previously discussed, here is how I want to write a list:
Here is a list on two levels:
* First item
* Second item
  * First subitem
  * Second subitem
* Third item
...And the paragraph continues here.
As with URL decorations, the grouping of list items is done at the tree-mangling stage. Because indentation matters, whitespace nodes in the AST should record their width. A list that can appear inside a paragraph will be called a small list. There is also a need for another kind of list whose items are paragraphs, to be called a big list. The symbols for designating list items ("*", "#", "-", "---", ...) are left to another discussion.
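
To make the tree-mangling idea for small lists more concrete, here is a minimal sketch in Python. It is not Novelang code: the `ListItem` node and the `group_small_list` function are hypothetical names, the "*" marker is only one of the candidate symbols, and plain text lines stand in for the whitespace and item nodes a real parser would emit. It only shows how knowing the width of the leading whitespace is enough to rebuild the nesting.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ListItem:
    """Hypothetical tree node: one list item and its nested subitems."""
    text: str
    children: list[ListItem] = field(default_factory=list)


def group_small_list(lines: list[str], marker: str = "*") -> list[ListItem]:
    """Nest flat list lines into a tree using the width of leading whitespace."""
    roots: list[ListItem] = []
    # Stack of (indent width, item) for the items that are still "open".
    stack: list[tuple[int, ListItem]] = []
    for line in lines:
        stripped = line.lstrip(" ")
        if not stripped.startswith(marker + " "):
            raise ValueError(f"not a list item: {line!r}")
        indent = len(line) - len(stripped)
        item = ListItem(stripped[len(marker) + 1:].strip())
        # Close every open item that sits at the same depth or deeper.
        while stack and stack[-1][0] >= indent:
            stack.pop()
        if stack:
            # The nearest shallower item becomes the parent.
            stack[-1][1].children.append(item)
        else:
            roots.append(item)
        stack.append((indent, item))
    return roots


items = group_small_list([
    "* First item",
    "* Second item",
    "  * First subitem",
    "  * Second subitem",
    "* Third item",
])
```

With the list from the example above, `items` holds three top-level items and the two subitems end up as children of "Second item". The stack-based approach is just one simple option; it only compares indentation widths, so it does not care whether a level is two spaces or four.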

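The URL decorations discussed earlier could be handled in the same post-parsing spirit. The sketch below is an illustration only, under several assumptions: the `Url` node and the `attach_url_decorations` function are hypothetical names, lines of text stand in for AST nodes, and the quoted block is taken to be the display text while the bracketed block is taken to be the alternate text.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional, Union


@dataclass
class Url:
    """Hypothetical tree node for a URL and its optional decorations."""
    address: str
    display: Optional[str] = None    # assumed to come from the "quoted" line
    alternate: Optional[str] = None  # assumed to come from the [bracketed] line


def attach_url_decorations(lines: list[str]) -> list[Union[str, Url]]:
    """Fold quoted/bracketed lines that directly precede a URL into the URL node."""
    result: list[Union[str, Url]] = []
    for line in lines:
        # A URL is only recognized when it starts at the beginning of the line.
        if not line.startswith(("http://", "https://")):
            result.append(line)
            continue
        url = Url(line.strip())
        # Walk back over the lines just before the URL and absorb decorations.
        while result and isinstance(result[-1], str):
            previous = result[-1].strip()
            if (previous.startswith("[") and previous.endswith("]")
                    and url.alternate is None):
                url.alternate = previous[1:-1]
            elif (len(previous) >= 2 and previous[0] == previous[-1] == '"'
                    and url.display is None):
                url.display = previous[1:-1]
            else:
                break
            result.pop()
        result.append(url)
    return result


mangled = attach_url_decorations([
    "Go to ",
    '  "Novelang website" ',
    "  [Novelang website on Sourceforge.net]",
    "http://novelang.sourceforge.net",
    "and see all useful links.",
])
```

Here `mangled` contains the "Go to " line, one `Url` node carrying both decorations, and the closing line. Since both decorations are optional, a URL with no preceding quoted or bracketed line simply keeps `display` and `alternate` empty.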