2008-12-20

Grammar refactoring complete!

Wow, I just checked the result of a major grammar refactoring into the master branch! Now I'll ship a new version with just those changes to make sure that nothing breaks in existing documents. But what's next? I have many options to consider.

Split the project down to smaller pieces

With one big project where lots of Java code depends on the ANTLR-generated parser, playing with the grammar and deactivating some rules breaks the whole project, so it becomes difficult to find out what's wrong. During the refactoring, I split the sources into several projects: common stuff, parser stuff, rest of the stuff. Then I could focus on the grammar itself. Keeping such a split would be great for future experiments. On the other hand, the Ant build would become increasingly complex. Maven is the tool for working with several subprojects, but the migration has a cost. Maybe I should defer it until the next time I need it.

Tree nodes renaming

With usage, it appears that the Novelang grammar has nothing to do with semantic markup. It supports semantic-like markup (like "//foo//", where the double solidus pictures slanted characters, which are like italics), but the flexible nature of stylesheets means the meaning of grammar constructs should not be frozen. For this reason, the tree node which is now named emphasis should be renamed to something like block-in-double-solidus-pair.

Stylesheet consistency check

Changing the names of the tree nodes will break existing stylesheets. But checking whether stylesheets use correct node names would ease the pain a lot. Such a check could be made by parsing attributes like match="n:chapter" and raising an error when an unknown node name appears (in the "n:" namespace).

Automatic detection of node names in the grammar

It's the same principle as above, applied to Java code. For now, tree node names defined in the grammar are duplicated in the NodeKind class, which is updated manually. Generating the corresponding code would remove the need to update it manually.

Fix list of supported characters

Class SupportedCharacters is broken because of changes introduced by ANTLR-3.1.1. The fix could become trivial if we generate some Java code in the same way as for node names (described above).

List items

That was one of the main features justifying this refactoring, remember? There are plenty of lists to consider. There are two main families of lists: big lists, whose items behave like paragraphs with a special introducer, and small lists, whose items are separated by a single line break and therefore may appear inside a paragraph.
--- Big list item with hyphens

### Big list item with number signs

*** Big list item with asterisks

This is a paragraph.
- Small list item with hyphen
  - Some sub-item
  - Sub-item, again
The number sign hints the renderer to generate a numbered list.
Interpreting indentation is left to tree-mangling and rendering.
  # Small list item with number sign
  # Number two
  # Number three
Asterisks mean almost the same as hyphens. Do we need them?
* Small list item with asterisk
* Yet another one item with asterisk

URL

Another main feature is support for URLs inside paragraphs, including URLs with a title.
This is a paragraph embedding two URLs. Here is the first:
http://foo1.com
  "This is the title of second URL"
http://foo2.com

Conclusion

Adding code generation from the ANTLR grammar, then a stylesheet checker, seems the first thing to do in order to secure existing features and build on a solid basis. Generating code from the existing grammar (for enumerating supported node names and characters) may be complex, as it may involve ancillary classes. This could be a reason to Mavenize the project.
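
To make the stylesheet check more concrete, here is a minimal sketch of what it could look like. This is not Novelang code: the class name, the method name and the set of known node names are all made up for illustration. It also assumes that scanning the stylesheet text with regular expressions is good enough, where a real check would probably go through an XML parser.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Novelang.
public class StylesheetNodeNameCheck {

    // Captures the value of every match="..." attribute in the stylesheet text.
    private static final Pattern MATCH_ATTRIBUTE =
        Pattern.compile("\\bmatch\\s*=\\s*\"([^\"]*)\"");

    // Captures every name carrying the "n:" prefix inside a match expression.
    private static final Pattern NODE_REFERENCE =
        Pattern.compile("\\bn:([A-Za-z_][A-Za-z0-9_.-]*)");

    /** Returns, in order of appearance, the "n:" names that are not in knownNames. */
    public static List<String> findUnknownNodeNames(String stylesheet, Set<String> knownNames) {
        List<String> unknown = new ArrayList<>();
        Matcher attribute = MATCH_ATTRIBUTE.matcher(stylesheet);
        while (attribute.find()) {
            Matcher reference = NODE_REFERENCE.matcher(attribute.group(1));
            while (reference.find()) {
                String name = reference.group(1);
                if (!knownNames.contains(name)) {
                    unknown.add(name);
                }
            }
        }
        return unknown;
    }

    public static void main(String[] args) {
        // Assumed node names; the real list would come from the grammar.
        Set<String> known = new HashSet<>(Arrays.asList("chapter", "paragraph"));
        String stylesheet =
            "<xsl:template match=\"n:chapter\"/>\n"
            + "<xsl:template match=\"n:chapter/n:emphasis\"/>\n";
        System.out.println(findUnknownNodeNames(stylesheet, known)); // prints [emphasis]
    }
}
```

If the set of known names were generated from the grammar instead of typed by hand, the same list could feed both this check and the NodeKind class.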
