The Novelist: random text generation

Novelang already does all the typesetting for you. What’s next? Writing text, of course! That is the job of the just-started Novelist subproject, which aims to generate big documents for testing Novelang under heavy load.

Based on French metrics, random text looks like this:

Uomuecto eaufues xuner ig ocanerr, ebanu otpaa. Uuse, on eian aibtd, rttaintlufe elvettarrh, yrn enemlcmlun, ebcazepuer madscg, êiiovemtt teeost eseeerde? Fetn eearréetcs emrseoss icia ntmvesrud. Aoasro cênit ctainetda aèugedet css eali, unero aaie eneoden, nrortio. Oovlod; tfsmenco méttsna, eesdis uoeaeanao rcuent, desungtt av au oneerao, dxuaste umeinétniu lccdeiilne rliùearde veyiritisac yàslu. Iinmseuo odiapqied cmiiapearlo ebnjtus uauueis, libginmasa edrc emaèi sllieyr sode!

It is based on a simplistic distribution algorithm. Word counts and letter counts are drawn from a uniform distribution over a predefined range (something like 5-20 for words and 2-12 for letters). Letters come from a frequency table giving the percentage of appearance of each letter.
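The algorithm described above can be sketched in a few lines of Python. This is only an illustration, not Novelist's actual code: the function names are made up, the frequency table is a truncated set of rough illustrative values, and the sketch assumes that "5-20 for words" means words per sentence and "2-12 for letters" means letters per word.

```python
import random

# Hypothetical, truncated frequency table: rough percentage of appearance
# per letter. Illustrative values only, not Novelist's real table.
LETTER_FREQUENCIES = {
    "e": 15.0, "a": 8.0, "s": 8.0, "i": 7.5, "t": 7.0,
    "n": 7.0, "r": 6.5, "u": 6.0, "l": 5.5, "o": 5.5,
}


def random_word(rng, min_letters=2, max_letters=12):
    """Build one word: uniform length, letters weighted by frequency."""
    length = rng.randint(min_letters, max_letters)  # both bounds inclusive
    letters = rng.choices(
        list(LETTER_FREQUENCIES),
        weights=list(LETTER_FREQUENCIES.values()),
        k=length,
    )
    return "".join(letters)


def random_sentence(rng, min_words=5, max_words=20):
    """Build one sentence: uniform word count, capitalized, ending with a period."""
    count = rng.randint(min_words, max_words)
    words = [random_word(rng) for _ in range(count)]
    return " ".join(words).capitalize() + "."


if __name__ == "__main__":
    # A fixed seed makes the generated document reproducible between test runs.
    print(random_sentence(random.Random(42)))
```

Passing the random generator explicitly keeps runs reproducible, which matters when a generated document triggers a bug that has to be replayed.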

While the result doesn’t look much like real text, it’s good enough to stress basic parsing and typesetting.

There has been a lot of research on text analysis, first for cryptography, then for natural language analysis and Web crawling. Among the resulting techniques, there is a nifty one: n-grams, which describe all the different letter sequences of a fixed length in a given text. The demo on Wolfram Alpha is gorgeous. It shows how fast combinations grow: a simple sentence like “ceramics come from” contains 69 3-grams. Google’s n-grams database (ranging from 1-grams to 5-grams) weighs 24 GiB gzipped and contains nearly 1 billion 3-grams. Amazingly, this number doesn’t increase that much for 4-grams and 5-grams.
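As a minimal sketch of the idea, letter n-grams can be counted by sliding a window of fixed length over the text. The function name is hypothetical, and the sketch treats spaces like any other character; tools differ in how they handle spaces and word boundaries, so their counts may not match this one.

```python
from collections import Counter


def letter_ngrams(text, n):
    """Count every run of n consecutive characters in text (overlapping windows)."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


if __name__ == "__main__":
    counts = letter_ngrams("banana", 3)
    # "banana" yields the windows: ban, ana, nan, ana
    print(counts["ana"])  # 2
    print(len(counts))    # 3 distinct 3-grams
```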


Novelang-0.41.1 released!

Download Novelang-0.41.1 here!
  • Fixed a bug where Promoted Tags were not detected under some circumstances.


Novelang-0.41.0 released!

Download Novelang-0.41.0 here!
  • New feature: Promoted Tags. Implicit Tags matching Explicit Tags become Promoted Tags.
  • Support lines of literal inside paragraphs inside angled bracket pairs.
  • Minor enhancements on HTML default stylesheet.


Novelang-0.40.1 released!

Download Novelang-0.40.1 here!
  • Fixed display bug on generated documentation.

Novelang-0.40.0 released!

Download Novelang-0.40.0 here!
  • Brand new stylesheet for HTML.


HTML default stylesheet improvements

A new default HTML stylesheet will be available soon. It should improve Novelang usability a lot. Key features are:

  • A better look.
  • Scaling up with metadata-oriented features.

If a picture is worth a thousand words:

Fluid layout

The new layout supports horizontal resizing. The column for rendered text may span from 500 to 1000 pixels.

Lines of literal (<pre> tag) wrap if they are too long. Wrapping only occurs with the white-space: pre-line style, which discards indentation by default. To prevent this, some JavaScript replaces every space character inside a <pre> with a non-breaking space immediately followed by a zero-width space. This gives clean-looking wrapping, but text copied to the clipboard contains unwanted characters.
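The stylesheet does this with JavaScript in the page itself. The Python sketch below only illustrates the string transformation; the function names are hypothetical, and restore_spaces is an assumed helper (not part of Novelang) showing how copied text could be cleaned up afterwards.

```python
NBSP = "\u00a0"  # non-breaking space: preserved where a plain space would collapse
ZWSP = "\u200b"  # zero-width space: invisible, but gives a line-break opportunity


def make_wrappable(literal):
    """Replace every plain space so that indentation survives and lines can still wrap."""
    return literal.replace(" ", NBSP + ZWSP)


def restore_spaces(copied):
    """Undo make_wrappable, e.g. on text pasted back from the clipboard."""
    return copied.replace(NBSP + ZWSP, " ")


if __name__ == "__main__":
    line = "    return x + y"
    wrapped = make_wrappable(line)
    print(" " in wrapped)                   # False: no plain space left
    print(restore_spaces(wrapped) == line)  # True: the round trip is lossless
```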

Overall look

Titles are indented. This is a compromise with the Descriptor feature (described later).

Line spacing is constant, even between two paragraphs, or between a paragraph and an embedded list. There is a slight loss of information (it may be hard to see where a paragraph begins) but this improves readability overall.


Font choice has a huge impact on the overall look. Choosing fonts is hard, because font rendering is hardly ever the same across Web browsers. Font readability also changes a lot, depending on line spacing, contrast, and the other fonts around.

The convention is: a serif font for the rendered document, sans-serif for extra information like actions and tags. Literal (<pre> and <code>) shows in a fixed-width font, which is serif, too.

After experimenting with a lot of combinations, the winners are:

  • Palatino Linotype for the rendered document. This font is a bit more readable than Times New Roman. It’s available on all platforms, and looks gorgeous with appropriate contrast (dark gray on light gray instead of black on white).
  • Lucida Grande, with Tahoma as second choice, for extra information. Lucida Grande is highly readable (it was chosen as the default for Mac OS X), but sophisticated enough not to look “poor” beside Palatino.
  • Courier New for literal. It is not new at all, but it blends harmoniously with text in Palatino.

Those fonts display much better on Mac OS X, or with Safari on Windows XP.


Descriptors appeared in Novelang-0.39.0 as an experimental feature. They now display with a nice fade and animation, in order to preserve the user’s visual landmarks. Each Descriptor has a vertical bar that helps to see its scope. This vertical bar only shows when the Descriptor is disclosed.

Descriptor disclosers now appear close to the Tag column. This avoids polluting the left margin.

Scalable lists for metadata

On big documents, there can be so many tags that they don’t fit in the height of a Web browser’s window. But most of the time they all fit, so it’s convenient to have all of them at a fixed position. How to deal with the exception without hurting the common case? Having a second scrollbar in a browser’s frame looks confusing. But the scrollbar has a great feature: it shows that some items are out of sight. One trick could be displaying a huge popup, but this probably means a lot of work for a poor result.

Finally, the solution is a fade to gray at the end of the list, showing that not all items are displayed. A tiny button “unpins” the tag list from the top of the Web browser’s window and lets it go to the document’s beginning. So, when entering the “1% case”, we still have a standard behavior.

Here is the Tag tab in its default pinned state (note the fade at the bottom and scrollbar position):

Unpinning causes it to scroll with the rest of the document:

In addition to Tags, there will be more metadata, like Identifiers, in the (hopefully near) future. Tags show up under a tab bar where it’s easy to add new tabs.