2008-05-22

Better character escaping

Currently a character escape looks like this: &oelig; for "œ", &lt; for "<", and so on. For now, character escapes make sense for characters which are not supported as words or punctuation signs. You cannot write this, because a closing parenthesis requires a matching opening parenthesis:
Either 1) choice one 2) choice two.
The correct solution would be to support lists, but for now there is a workaround:
Either 1&rp; choice one 2&rp; choice two.
That's pretty uncomfortable anyway. The current syntax (&character-name;) is easy to parse and looks like the well-known HTML entities. It could even be documented with an HTML page showing all available escapes. For now, to find such escapes you have to look in the source code. Pretty ugly, right?

A better syntax could be: `oe for "œ", `< for "<". The problem lies with the two-letter escape `oe. With the current syntax (&character-name;) it's easy to resolve the escaped code from some Java-defined table. The same table can generate the list of escapes for the (not yet existing) online help. If there is an unknown number of letters in the escaped code, the ANTLR-generated lexer handles it perfectly, "eating" characters until they match a sequence it knows. But the ANTLR-generated lexer cannot tell, from a static definition, what the tokens it recognizes look like. The parser exposes token names (the tokenNames constant, a String[]) but the characters themselves (like `oe) appear nowhere, because they are the result of a computation which can only occur at runtime. I've read somewhere, in the book or elsewhere, that custom code may perform a lookahead from inside the lexer. This may work.

There is the temptation to drop multi-letter support, as œ, Œ and € seem to be the only cases, so let's say we can find another hack for them. Surprisingly, œ and Œ are not part of the ISO-8859-1 character set, and while it's easy to get them on the keyboard of my Mac, I'm not sure it's as easy on other platforms.

A problem with multi-letter escapes using a single-character delimiter is that adding a new rule may mess up existing text. The best option I can see for now is to use a single backquote for single-letter escapes and another non-colliding rule for multi-letter escapes. That may not be so simple, because resulting text like "<x>" has to be written `<x`>, so the `<x` fragment may be understood as a multi-character escape. Once again, the lexer can handle it well as long as no multi-character escape starts with a character used in a single-character escape, but we lose reflective abilities.

WikiCreole has some interesting discussion on character escaping, but it is about disabling whole markup features and I find it sometimes ambiguous. I'd like to keep the escape mechanism at character level.
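To make the "Java-defined table" idea concrete, here is a minimal sketch of what such a table could look like. It is not the actual code: the class name EscapeTable and its methods are hypothetical, and the table only holds the three escapes mentioned above. It shows the two uses discussed (resolving an &character-name; escape and listing all escapes for a help page), plus a check for the collision rule that a backquote syntax would need: no multi-character code may start with a character that is also a single-character code.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical escape table; the name and contents are assumptions for illustration. */
final class EscapeTable {

    // Escape name -> replacement text. LinkedHashMap keeps insertion order for the listing.
    private static final Map<String, String> ESCAPES = new LinkedHashMap<>();
    static {
        ESCAPES.put("oelig", "\u0153"); // œ
        ESCAPES.put("lt", "<");
        ESCAPES.put("rp", ")");
    }

    private EscapeTable() { }

    /** Resolves "&name;" to its replacement text, or returns null if it is not a known escape. */
    static String resolve(String escape) {
        if (escape.length() < 3 || !escape.startsWith("&") || !escape.endsWith(";")) {
            return null;
        }
        return ESCAPES.get(escape.substring(1, escape.length() - 1));
    }

    /** One line per escape, usable as raw material for a help page. */
    static List<String> helpListing() {
        List<String> lines = new ArrayList<>();
        for (Map.Entry<String, String> entry : ESCAPES.entrySet()) {
            lines.add("&" + entry.getKey() + "; -> " + entry.getValue());
        }
        return lines;
    }

    /**
     * For a backquote-style syntax: returns the multi-character codes whose first
     * character is also a single-character code (e.g. "<x" collides with "<").
     * Lengths are counted in UTF-16 units, which is enough for this sketch.
     */
    static List<String> prefixCollisions(Collection<String> codes) {
        Set<String> singles = new HashSet<>();
        for (String code : codes) {
            if (code.length() == 1) {
                singles.add(code);
            }
        }
        List<String> collisions = new ArrayList<>();
        for (String code : codes) {
            if (code.length() > 1 && singles.contains(code.substring(0, 1))) {
                collisions.add(code);
            }
        }
        return collisions;
    }

    public static void main(String[] args) {
        for (String line : helpListing()) {
            System.out.println(line);
        }
    }
}
```

With a table like this the list of escapes stays available without running the lexer, which is exactly what gets lost when the escapes only exist as lexer rules. The prefixCollisions check could run in a unit test whenever a new backquote code is added.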
