2008-12-30

New beasts: delimiters and levels

To my mild surprise, the renaming of all tree nodes went smoothly and all existing stylesheets were quickly put back to work, thanks to the XPath verifier. The token names are now beautifully consistent, except for a few that I left out of scope until I got some insight about them: CHAPTER, SECTION and TITLE. While Novelang claims to have no semantic markup, how can we ignore the fact that chapters and sections have a highly structural effect?

On closer inspection, chapters and sections are not quite the same thing depending on whether we look at them at parsing stage or at rendering stage. At parsing stage, chapters and sections are delimiters which may be followed by some text (which becomes the title). They exist at the same level as paragraphs. It’s only after the whole tree is parsed that Novelang creates the hierarchy, by re-hierarchizing the tree before passing it to the rendering stage. At rendering stage, chapters and sections become containers (a source document declaring a chapter then a paragraph is not “flat” anymore, as the chapter now contains the paragraph). All of this looks the same as at parsing stage, except that a Book may define new chapters and sections on its own, as with insert $createchapters.

My guess is that, with a rendering system supporting recursive processing, it would be a waste to limit ourselves to a fixed hierarchy. Picture some big documents with numerous title levels, between five and fifteen. Creating a standard template with a tool like MS-Word doesn’t work well. Forcing each top-level title to start on a blank page is a waste for small documents. But when several top-level titles are allowed on the same page, big documents become unreadable, with deep titles even smaller than the body text. Hacking the title depth by starting at, let’s say, the third level is not a solution either, because automatic numbering would cause our first chapter to be numbered “0.0.1”. On the other hand, a programmatic templating system like FOP would support such hacks.

Using different names for nodes at parsing stage and at rendering stage makes things clearer. At parsing stage, we have delimiters. Delimiters are the new beast. They look like a list item but may contain fewer things (for now the only restriction is that they cannot contain a URL immediately following the equals signs). They will be processed in a different manner. Anyway, the “delimiter” name is not supposed to appear at rendering stage.

Let’s use DELIMITER as a radix. Because we need to remain consistent with the rest, we continue the name with the characters the delimiter is made of, so we have DELIMITER_DOUBLE_EQUAL_SIGN and DELIMITER_TRIPLE_EQUAL_SIGN. More levels are probably bad news (you should split your text into smaller parts) but the naming scales up to “octuple”. We introduce a new convention: node names that don’t appear at rendering stage are suffixed with a low line (“_”). The low line as a prefix is already taken for node names which don’t appear in the parser. Finally our delimiter nodes are DELIMITER_TRIPLE_EQUAL_SIGN_ and so on.

But what about the title? Even if its content is close to a paragraph’s, the title is structurally different. For its name, I’d like something like “text for a delimiter” but “text” carries no structural meaning. DELIMITING_TEXT_ is better but not so good, as it suggests that the text itself is the delimiter. Let’s keep it until something better comes up.

Now there is an obvious choice for the name of the nodes containing other nodes: _LEVEL (“level” is a palindrome, by the way). The title becomes _LEVEL_DESCRIPTION. Yes, this starts to look like semantic markup but it just reflects reality, because processing the delimiter produces something of a higher meaning.

A consequence of dropping the chapter / section difference is some loss of information. Source documents which define top-level sections are now structurally indistinguishable from source documents defining top-level chapters with no section under them. This looks like a good thing, because these differences are supposed to be managed at Book level.
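
The re-hierarchizing step described above can be sketched as follows. This is a minimal Python illustration and not Novelang's actual code: the tuple-based node representation, the ROOT node and the function name rehierarchize are assumptions made for the example, and the depth of a delimiter is assumed to be its number of equals signs. Only the _LEVEL and _LEVEL_DESCRIPTION names come from this post.

```python
# Hypothetical sketch: flat parser output -> nested _LEVEL nodes.
# A delimiter is ("DELIMITER", depth, title_or_None); any other tuple is a leaf.

def rehierarchize(flat_nodes):
    """Nest nodes under _LEVEL containers according to delimiter depth."""
    root = ("ROOT", [])
    # Stack of (depth, children_list); the root sits at depth 0.
    stack = [(0, root[1])]
    for node in flat_nodes:
        if node[0] == "DELIMITER":
            _, depth, title = node
            # Close every open level that is as deep as, or deeper than, the new one.
            while stack[-1][0] >= depth:
                stack.pop()
            children = []
            if title:
                children.append(("_LEVEL_DESCRIPTION", title))
            stack[-1][1].append(("_LEVEL", children))
            stack.append((depth, children))
        else:
            # Paragraphs and other nodes go into the innermost open level.
            stack[-1][1].append(node)
    return root
```

With this sketch, a document using only triple equals signs produces the same shape as one using only double equals signs, which matches the loss of information mentioned above.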

Novelang-0.15.0 released!

Latest release of Novelang available here. Main new feature: stylesheets containing XPath expressions that refer to nonexistent grammar tokens are detected and rejected. Fixed some annoying bugs. See "Status and history" for details. Enjoy!

2008-12-28

New naming scheme for nodes

The detection of incorrect XPath expressions in XSL files now works (it is in the master branch). It’s based on code generation for the NodeKind class, which describes every supported node name as produced by the parser, and this is expected to be of great help during the upcoming huge refactoring of the naming scheme of the nodes.
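
To give an idea of what such a check involves, here is a minimal Python sketch. It is not the real verifier (which is Java and relies on the generated NodeKind class). The set of known names, the function name, and the regex-based extraction are all assumptions for illustration; a real implementation would parse the XPath expressions properly. It also assumes that a grammar name like BLOCK_INSIDE_SOLIDUS_PAIRS appears in stylesheets as n:block-inside-solidus-pairs.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical subset of node names; in Novelang these would come from NodeKind.
KNOWN_NODE_NAMES = {"PARAGRAPH_REGULAR", "BLOCK_INSIDE_SOLIDUS_PAIRS", "URL"}

# Element names written with the "n:" prefix inside XPath expressions.
N_NAME = re.compile(r"\bn:([A-Za-z][A-Za-z0-9-]*)")

def unknown_node_names(xsl_text, known=KNOWN_NODE_NAMES):
    """Return n:-prefixed names found in match/select attributes that are not known."""
    allowed = {name.lower().replace("_", "-") for name in known}
    unknown = []
    for element in ET.fromstring(xsl_text).iter():
        for attribute in ("match", "select"):
            expression = element.get(attribute, "")
            for name in N_NAME.findall(expression):
                if name not in allowed and name not in unknown:
                    unknown.append(name)
    return unknown
```

A stylesheet would then be rejected as soon as unknown_node_names returns a non-empty list.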

While Novelang grammar contains no semantic information, it has semantic-like markup. Text like //hi there// is supposed to become italics because of the slanting evocation caused by the solidus (or “forward slash”) character. But here is the lie: while the stylesheet processes an emphasis node, its output is whatever the author wished, including something not related to emphasis at all. The grammar is just wrong to claim its node is about emphasis, because the choice of making it appear as emphasis (through italics) is out of the grammar’s scope.

The new naming scheme of the nodes aims to make the intent clearer: Novelang grammar carries no semantic meaning. The meaning is given by the stylesheet. In the case of the text between the two pairs of solidus, all the grammar knows for sure is… well, that it’s about two pairs of solidus. Before diving into the gory details, here is a taste of the new naming scheme: EMPHASIS would become BLOCK_INSIDE_SOLIDUS_PAIRS and inside the XSL stylesheet it would be n:block-inside-solidus-pairs.

Finding a consistent (and extensible) naming scheme is not easy because of plenty of overlapping cases. Many terms need clarification and sometimes consistency may impact some structural aspects.

Let’s start with the paragraph. The paragraph is a very central object which helps identify two families of nodes: those taking place inside a paragraph, and the others (which define the paragraph itself, or define stuff that may contain a paragraph).

Let’s say a paragraph is a sequence of characters which does not contain two consecutive line breaks. This raises interesting questions.

Should a standalone URL appear enclosed in a paragraph? If a URL appears standalone, it reflects the author’s will to make it appear as a paragraph so we’ll enclose it into a paragraph (and the definition of “paragraph” gets clearer!).

Is a big list item a paragraph? A big list item could be renamed in order to contain the word “paragraph”. For consistency, the “small list item” would become something like “embedded item”; we just lose brevity here.

There is a temptation to embed the list item node inside a PARAGRAPH block, as we do for URL. The stylesheet could rely on the paragraph’s parent (the list node) to determine that it’s a list item. Then we would have only one PARAGRAPH node, not two distinct cases. But in practice, stylesheet writers will make two distinct cases every time because the two have different indentation and so on. So we really need two flavors of the “paragraph” node.

PARAGRAPH_REGULAR is a good name for the regular paragraph, hinting there can be not-so-regular ones.

PARAGRAPH_ITEM makes sense, as starting with PARAGRAPH tells structural things about the node. On the other hand, it can be understood as if paragraphs were holding items. So let’s be honest and use PARAGRAPH_AS_LIST_ITEM.

For a source document like this:

Hello

http://novelang.sf.net

--- item1

--- item2

We end up with such a node structure:

+-- PARAGRAPH_REGULAR
|   +-- WORD
+-- PARAGRAPH_REGULAR
|   +-- URL
+-- LIST
    +-- PARAGRAPH_AS_LIST_ITEM
    |   +-- WORD
    +-- PARAGRAPH_AS_LIST_ITEM
        +-- WORD
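
As a rough illustration of how the source above maps to this structure, here is a small Python sketch. It is not the ANTLR grammar: the function names, the tuple representation, and the simplistic handling of words and URLs (splitting on whitespace, testing for an http:// prefix) are assumptions made for the example.

```python
import re

def parse_words(text):
    """Turn a piece of text into WORD and URL leaves (very simplified)."""
    return [("URL", t) if t.startswith("http://") else ("WORD", t)
            for t in text.split()]

def parse_part(source):
    """Split a source into paragraph-level nodes, grouping big list items in a LIST."""
    nodes = []
    list_items = None  # Children of the LIST currently being filled, if any.
    # A paragraph ends where two consecutive line breaks occur.
    for chunk in re.split(r"\n\s*\n", source.strip()):
        if chunk.startswith("--- "):
            if list_items is None:
                list_items = []
                nodes.append(("LIST", list_items))
            list_items.append(("PARAGRAPH_AS_LIST_ITEM", parse_words(chunk[4:])))
        else:
            list_items = None  # A regular paragraph closes the current list.
            nodes.append(("PARAGRAPH_REGULAR", parse_words(chunk)))
    return nodes
```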

Now that we are clear about paragraphs, let’s consider the case of paragraphs tied together by a paired delimiter (a delimiter including a start marker and an end marker). This is what the current “blockquote” does. The delimiters are a whole line starting with “<<” (less-than signs) and ending with “>>” (greater-than signs). The “blockquote” may only contain paragraphs. As we reserve the word “block” for another usage to be explained later, we must find a way to tell there is an enclosed sequence of paragraphs. A prefix like SEQUENCE_OF_PARAGRAPHS is not so bad because it puts the emphasis on the word “sequence”. But PARAGRAPH_SEQUENCE doesn’t carry a plural form so it doesn’t look stupid in case of only one paragraph. On the other hand, names describing a single paragraph (PARAGRAPH_REGULAR and PARAGRAPH_AS_LIST_ITEM) also start with the word “paragraph”, which causes some confusion. Finally, PARAGRAPHS is best: if we drop the plural vs. singular concern, we avoid the lengthy “sequence” word while keeping the same meaning. Now that we have a radix for the node name, we just add a suffix to describe the delimiter. After all, it would make sense to have similar structures with different markup (as we have for stuff inside a paragraph). Since the delimiter is a pair of angled brackets, just say it. We end up with PARAGRAPHS_INSIDE_ANGLED_BRACKET_PAIRS.

I had a strong debate with myself: should I use IN (shorter) or INSIDE (more explicit)? The word “inside” is very clear about a block contained by something. Later, when creating names around the word “block”, we’ll see that a construct like “block inside” is less ambiguous than “block in”, which may look like a verb.

In ANGLED_BRACKET_PAIRS the name of the delimiter is left in the singular. The word “pair” is used because “double” is required for “double quote”. “Double quote” is the Unicode name of the character, and it’s a Novelang standard to always use Unicode names. So we can’t use “double” to say there are many delimiters and we use “pair” instead. Telling there are several pairs (plural) is OK because we can’t honestly figure how there could be more than two.

Novelang’s current “literal” looks a lot like “blockquote” (just three angled brackets instead of two). But literal doesn’t care about any paragraph structure. It’s just uninterpreted lines, including line breaks as they are. That’s a thing to know when writing the stylesheet: it hints there will be no subnode to process. In this case, LITERAL should appear inside the node name. But, as we stated for paragraphs, it’s important to highlight the structural implications of the node. So we end up putting the “line aspect” first and we get LINES_OF_LITERAL as a radix. Adding the suffix, we end up with LINES_OF_LITERAL_INSIDE_ANGLED_BRACKET_TRIPLETS. The suffix here is questionable because I don’t see any reason to support another delimiter for literals. So finally let’s keep LINES_OF_LITERAL.

There is another kind of node that may contain other nodes, especially paragraphs: “chapter” and “section”. There is matter for a discussion because if both become sections and sections become nestable, we could do amazing things, especially if the depth of sections can be adjusted at Book level. But we don’t need to solve every problem today and we leave this to another discussion.

Now let’s look at what happens inside a paragraph. All subelements acting like a container inside a paragraph (like parentheses) are called blocks. “Block” is a good word because it is short, and it’s not wasted here because it will appear a lot. In order to follow the emerging rule of telling about structure first, we’ll use the prefix BLOCK.

For stuff inside parentheses and square brackets (and curly braces in the near future), something like BLOCK_INSIDE_PARENTHESIS is clear enough.

For paired delimiters like double hyphen (for -- interpolated clause --) or double solidus, it’s right to say there are two pairs of something. So BLOCK_INSIDE_SOLIDUS_PAIRS looks reasonable.

The current “interpolated clause” has a special case when it has a “silent end” (like --this-_.). It’s useful for making only the first dash character appear, while a dumb punctuation sign would have given up the level of control provided by the parser. In this case, it’s hard to claim there are two pairs of hyphens. BLOCK_INSIDE_2_HYPHENS_THEN_HYPHEN_LOW_LINE is accurate, though not very concise. THEN has the special role of telling that the delimiters are asymmetrical, describing the first delimiter then the second.

For double quotes, rules stated above still apply (no chance here) and we have BLOCK_INSIDE_DOUBLE_QUOTES.

For the current “superscript”, there is only one opening delimiter. The closing delimiter is implicit with the end of the contained word (super^script) so we don’t have exactly a block. But the rules above still apply and we get WORD_AFTER_CIRCUMFLEX_ACCENT.

Punctuation signs are left unchanged: for now we have a PUNCTUATION_SIGN node enclosing a node representing the sign itself (SIGN_COMMA, SIGN_PERIOD…).

Now here is the summary of old node names vs. new ones:

CHAPTER                      
  -> CHAPTER

SECTION                      
  -> SECTION

PARAGRAPH_PLAIN              
  -> PARAGRAPH_REGULAR

PARAGRAPH_SPEECH             
  -> PARAGRAPH_AS_LIST_ITEM

BLOCKQUOTE                   
  -> PARAGRAPHS_INSIDE_ANGLED_BRACKET_PAIRS

LITERAL                      
  -> LINES_OF_LITERAL

EMPHASIS                     
  -> BLOCK_INSIDE_SOLIDUS_PAIRS

QUOTES                       
  -> BLOCK_INSIDE_DOUBLE_QUOTES

PARENTHESIS                  
  -> BLOCK_INSIDE_PARENTHESIS

SQUARE_BRACKETS              
  -> BLOCK_INSIDE_SQUARE_BRACKETS

INTERPOLATEDCLAUSE           
  -> BLOCK_INSIDE_HYPHEN_PAIRS

INTERPOLATEDCLAUSE_SILENTEND 
  -> BLOCK_INSIDE_2_HYPHENS_THEN_HYPHEN_LOW_LINE

SOFT_INLINE_LITERAL          
  -> BLOCK_OF_LITERAL_INSIDE_GRAVE_ACCENTS

HARD_INLINE_LITERAL          
  -> BLOCK_OF_LITERAL_INSIDE_GRAVE_ACCENT_PAIRS

SUPERSCRIPT                  
  -> WORD_AFTER_CIRCUMFLEX_ACCENT 

Now I’m having a look at some ideas I jotted down on this blog for extending the Novelang grammar. It was in: http://novelang.blogspot.com/2008/07/some-ideas-for-novelang-syntax.html. The new naming scheme seems to scale!

Please note the use of AND for describing the first delimiter. For “++=” we have 2_PLUS_SIGNS_AND_EQUAL_SIGN_PAIRS where AND means “immediately followed by”. There is no hint the delimiters are symmetrical.

^^ small caps ^^          
  -> BLOCK_INSIDE_CIRCUMFLEX_ACCENT_PAIRS
  
__- single underline -__ 
  -> BLOCK_INSIDE_2_LOW_LINES_AND_HYPHEN_PAIRS
  
++= double strike =++     
  -> BLOCK_INSIDE_2_PLUS_SIGNS_AND_EQUAL_SIGN_PAIRS

sub_script                
  -> WORD_AFTER_LOW_LINE
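
The naming rule for paired delimiters is regular enough to be sketched in a few lines of Python. This is only an illustration of the convention, not something Novelang does: the character-name table and both function names are assumptions, and the sketch covers only the paired delimiters listed in this post (it does not handle THEN-style delimiters, single-character delimiters or literals).

```python
from itertools import groupby

# Character names as spelled in this post; hypothetical lookup table.
CHARACTER_NAMES = {
    "/": "SOLIDUS",
    "-": "HYPHEN",
    "^": "CIRCUMFLEX_ACCENT",
    "_": "LOW_LINE",
    "+": "PLUS_SIGN",
    "=": "EQUAL_SIGN",
}

def paired_delimiter_suffix(opening):
    """Describe the opening delimiter of a paired block, e.g. '++=' or '//'."""
    # Run-length encode the delimiter: '++=' -> [('+', 2), ('=', 1)].
    runs = [(char, len(list(group))) for char, group in groupby(opening)]
    if len(runs) == 1 and runs[0][1] == 2:
        # A doubled single character is simply "pairs" of that character.
        return CHARACTER_NAMES[runs[0][0]] + "_PAIRS"
    parts = []
    for char, count in runs:
        name = CHARACTER_NAMES[char]
        parts.append(f"{count}_{name}S" if count > 1 else name)
    # AND means "immediately followed by".
    return "_AND_".join(parts) + "_PAIRS"

def block_name(opening):
    return "BLOCK_INSIDE_" + paired_delimiter_suffix(opening)
```

For example, block_name("++=") gives BLOCK_INSIDE_2_PLUS_SIGNS_AND_EQUAL_SIGN_PAIRS, as in the list above.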

2008-12-20

Novelang-0.14.0 released!

Latest release of Novelang available here. Refactored ANTLR grammar for supporting some planned features. Known regression: some useful characters missing from font list. Enjoy!

Grammar refactoring complete!

Wow, I just checked the result of a major grammar refactoring into the master branch! Now I'll ship a new version with just those changes to make sure that nothing breaks in existing documents. But what's next? I have many options to consider.

Split the project down to smaller pieces

With one big project where lots of Java code depends on the ANTLR-generated parser, playing with the grammar and deactivating some rules breaks the whole project, so it becomes difficult to find out what's wrong. During the refactoring, I split the sources into several projects: common stuff, parser stuff, rest of the stuff. Then I could focus on the grammar itself. Keeping such a split would be great for future experiments. On the other hand, the Ant build would become increasingly complex. Maven is the tool for working with several subprojects but the migration has a cost. Maybe I should defer it until the next time I need it.

Tree nodes renaming

With usage, it appears that Novelang grammar has nothing to do with semantic markup. It supports semantic-like markup (like "//foo//" where the double solidus evokes slanted characters, which look like italics) but the flexible nature of stylesheets definitely prevents freezing the meaning of grammar constructs. For this reason, the tree node which is now named emphasis should be renamed to something like block-in-double-solidus-pair.

Stylesheet consistency check

Changing the names of the tree nodes will break existing stylesheets. But checking whether stylesheets use correct node names would ease the pain a lot. Such a check could be made by parsing attributes like match="n:chapter" and raising an error when an unknown node name appears (in the "n:" namespace).

Automatic detection of node names in the grammar

It's the same principle as above, applied to Java code. For now, tree node names defined in the grammar are duplicated in the NodeKind class, which is updated manually. By generating the corresponding code, there would be no need to update it manually.

Fix list of supported characters

Class SupportedCharacters is broken because of changes introduced by ANTLR-3.1.1. The fix could become trivial if we generate some Java code in the same way as for node names (described above).

List items

That was one of the main features justifying this refactoring, remember? There are plenty of lists to be envisaged. There are two main families of lists: big lists, whose items behave like paragraphs with a special introducer, and small lists, whose items are separated by a single line break and therefore may appear inside a paragraph.

--- Big list item with hyphens

### Big list item with number signs

*** Big list item with asterisks

This is a paragraph.
- Small list item with hyphen
  - Some sub-item
  - Sub-item, again
Number sign hints the renderer to generate numbered list.
Interpreting indentation is left to tree-mangling and rendering.
  # Small list item with number sign
  # Number two
  # Number three
Asterisks mean almost the same as hyphens. Do we need them?
* Small list item with asterisk
* Yet another one item with asterisk

URL

Another main feature is support for URLs inside paragraphs and URLs having a title.

This is a paragraph embedding two URL. Here is the first:
http://foo1.com
  "This is the title of second URL"
http://foo2.com

Conclusion

Adding code generation from the ANTLR grammar, then the stylesheet checker, seems the first thing to do in order to secure existing features and build on a solid basis. Generating code from the existing grammar (for enumerating supported node names and characters) may be complex as it may involve ancillary classes. This could be a reason to Mavenize the project.

2008-12-09

Ongoing grammar refactoring

As refactoring goes on, it's time to answer some questions. The main new features (URLs and list items inside a paragraph) have a great impact on the whole design. Both may take place inside a paragraph; both start (and end) with a line break. That's a major change, as in the former design the paragraph was a single piece as long as there were no two contiguous line breaks.

While it's not a part of the grammar itself (as in the Novelang.g file), URLs support being "decorated" with a preceding double-quoted block, or an angled-bracketed block. So we should consider that URLs inherently spread over many lines and expose a double-quoted block.

List items inside a paragraph are called "small list items". They cannot spread over more than one line (no break inside). For this reason, a small list item may not contain a paragraph. So we have a new beast: text blocks which don't contain line breaks as paragraphs do. For this reason, they cannot contain URLs or small lists. We'll call monoblock a text block with no line break. We'll call spreadblock a text block with line breaks (as it may spread over several lines). We can't have URLs inside small lists. But that's OK because there is another thing called "big list items": a plain paragraph with a special indicator at its start.

For now (master branch), titles for sections and chapters are ordinary paragraphs. I wonder if it now makes sense to have URLs or small lists inside a title?

Another limit I set arbitrarily is to forbid URLs and small lists inside double-quoted blocks. This sounds right for the URL because of the double-quoted block that may decorate a URL (it's logically impossible to have a double-quoted block inside a double-quoted block without an additional delimiter). For the small list, it's more for typographical sanity. Does it make sense to extend this to any asymmetrical delimiter (emphasis, interpolated clause...)? Sometimes it's useful to emphasize a whole paragraph, including its URLs and small lists.

Because I'd like Novelang grammar to be twisted and abused in any technically feasible way (for creating idioms), it doesn't make sense to forbid titles to be paragraphs, nor to forbid asymmetrically-delimited blocks (except double-quoted blocks) to spread over several lines and contain URLs and small list items. Depending on the stylesheet, things like this could make sense:
=== This is a section with embedded small list items:
- item1
- item2
- item3

This is a paragraph.
Rendering may look like:
This is a section with embedded small list items: item1, item2, item3.
Same for URLs. Because URL href must stand on its own line, this encourages adding a display text:
=== "This is URL display text"
http://foo.com
In the end, it seems that technically feasible things lead to a well-designed grammar! Thanks to ANTLR for helping me to express the grammar so clearly.
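
To make the line-based rules above more concrete, here is a small Python sketch that tags the lines of a spreadblock. It is not how the ANTLR grammar works. The node names TEXT and SMALL_LIST_ITEM, the function name and the tuple shapes are assumptions for illustration. A URL is recognized by its http:// prefix only, and a double-quoted line is treated as a URL title only when the next line is a URL.

```python
def classify_lines(paragraph):
    """Tag each line of a spreadblock as TEXT, SMALL_LIST_ITEM or URL."""
    result = []
    pending_title = None  # Content of a quoted line waiting for its URL.
    for raw in paragraph.splitlines():
        line = raw.strip()
        if line.startswith("http://"):
            # A URL stands on its own line, optionally decorated by the previous line.
            result.append(("URL", line, pending_title))
            pending_title = None
            continue
        if pending_title is not None:
            # The quoted line was not followed by a URL: keep it as plain text.
            result.append(("TEXT", f'"{pending_title}"'))
            pending_title = None
        if len(line) > 1 and line.startswith('"') and line.endswith('"'):
            pending_title = line[1:-1]
        elif line.startswith("- "):
            result.append(("SMALL_LIST_ITEM", line[2:]))
        else:
            result.append(("TEXT", line))
    if pending_title is not None:
        result.append(("TEXT", f'"{pending_title}"'))
    return result
```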

2008-12-05

Tree manipulation languages

As a background task I'm thinking about how to improve the Book definition. For now, Novelang books are a sequence of imperative tasks:
insert file:the-preamble.nlp 
  $style=preamble

insert file:. 
  $recurse 
  $createchapter

mapstylesheets 
    $pdf=my-pdf-stylesheet.xsl
Apart from the mapstylesheets command, the insert command is all about inserting trees in the main document. The default behavior is to create one tree from a Part file. The $recurse command creates many trees (one per Part file) from the given directory. The $style command creates a style node right under the root of the inserted tree. One of the planned features of Novelang is to build Books from identifiers, which are subtrees.

As it's all about manipulating trees, one can note that the XSLT stylesheets transforming a Book tree into a PDF or an HTML document are all about manipulating trees, too, but they act at a different stage: once the content of the document is well-defined. And XSLT works by producing a changed version of one input tree; it doesn't build a tree from multiple sources. As XSLT does a good job, I googled "tree manipulation language" to see if there is something useful out there, at least to take inspiration from.

TXL

I found TXL, which seems backed by serious research. Unfortunately it doesn't come as an embeddable Java library. TXL scripts define an input grammar for building up the tree, then rules for creating / adding / moving / deleting nodes. It looks sweet for experimenting with languages.

Tregex and Tsurgeon

Tregex is a regex-like language for extracting nodes from a tree. Tsurgeon is a Tregex extension for manipulating the trees extracted by Tregex. The Tregex language itself looks good. Given a tree like this:
    NP
  / |  \
NP  PP  PP
The following Tregex expression means something like "Call 'np' the NP node that has an NP child immediately followed by a PP sibling, itself immediately followed by another PP sibling, which should be called 'pp2' by the way":
NP=np < (NP $+ (PP $+ PP=pp2))
Then here is a Tsurgeon expression :
adjoin (NP=new_np NP@) np
move pp2 >- new_np
Applying the Tregex and Tsurgeon expressions to the tree above gives a new tree like this:
      NP
     /  \  
   NP    PP
  / \ 
NP   PP
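
For readers who don't know Tregex, the effect of this one rewrite can be reproduced by hand in a few lines of Python. This is only an illustration of this single example: it does not use or emulate the Tregex/Tsurgeon library, and the Node class and both function names are made up for the sketch.

```python
class Node:
    def __init__(self, label, children=None):
        self.label = label
        self.children = children or []

    def to_tuple(self):
        return (self.label, [child.to_tuple() for child in self.children])

def find_pp2(np):
    """Return the second PP if np is an NP with consecutive children NP, PP, PP."""
    if np.label != "NP":
        return None
    kids = np.children
    for i in range(len(kids) - 2):
        if [kid.label for kid in kids[i:i + 3]] == ["NP", "PP", "PP"]:
            return kids[i + 2]
    return None

def rewrite(np):
    """Put a new NP above np, then move pp2 to be the last child of that new NP."""
    pp2 = find_pp2(np)
    if pp2 is None:
        return np  # Pattern not found: leave the tree unchanged.
    np.children.remove(pp2)       # Detach pp2 from its current parent.
    return Node("NP", [np, pp2])  # The new NP holds the old np, then pp2.
```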
Tregex and Tsurgeon are bundled in a Java library. I don't like their design of the Tree class. I don't dispute the fact that their Tree class should be mutable, if only because this may save memory with some algorithms. But the Tree class is a concrete class declaring more than one hundred public methods. Most of them could have been part of a utility class. You are dishonestly invited to compare with Novelang's Tree definition!

Conclusion

As far as I can see, there is much work done on tree manipulation languages. While they make it possible to do anything on a given tree, they are a very special jargon that won't help non-geeks to create Novelang Book files. So I should look elsewhere if I want new ideas on this topic.

2008-12-01

Leo

Leo is some kind of text editor with the ability to aggregate files and parse special processing directives. It can be seen as a tool to create and manage graphs of text fragments. Where Leo shines is in extracting code fragments, using its own directives. There is some likeness to Novelang, which has books including parts. I don't want anything like a graphical front-end for Novelang now but there may be some ideas to steal from Leo.