2008-07-01

Encoding(s)

Attempting to run Novelang-0.5.0 the way the documentation said (java -jar Novelang-0.5.0.jar), I discovered that some characters (especially those with accents) were not rendered as they should be. As I'm working on Mac OS X, I shouldn't have been surprised by what the system properties dump showed:
  file.encoding = MacRoman
I couldn't see the mismatch in my development environment, as it was "kindly" (and rather stealthily) forcing the file.encoding system property of each new process to the encoding I had set for editing my files. If you, happy early adopter, hit such a problem and were too shy to post on the Novelang User list, try this:
  java -Dfile.encoding=ISO-8859-1 -jar Novelang-0.5.0.jar
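To check what a given JVM actually ends up with, a few lines of Java are enough. This is only a minimal sketch and not part of Novelang: the class name DefaultEncodingCheck is made up for illustration. How the file.encoding property maps to the effective default charset can vary with the platform and the Java version, so the sketch prints both.

```java
import java.nio.charset.Charset;

// Hypothetical helper, not part of Novelang.
class DefaultEncodingCheck {

    /** Returns a one-line summary of the encoding settings this JVM runs with. */
    static String describe() {
        // The raw system property, as it appears in a system properties dump.
        String property = System.getProperty("file.encoding");
        // The charset the JVM uses when no explicit charset is given.
        Charset effective = Charset.defaultCharset();
        return "file.encoding = " + property
                + ", default charset = " + effective.name();
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

Running it once plainly and once with -Dfile.encoding=ISO-8859-1 shows whether the command-line switch is taken into account on your machine.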
For now, Novelang expects all files to be in ISO-8859-1 (aka ISO Latin 1). This is defined as a constant somewhere, and passed through method calls. I thought this would make Novelang insensitive to the file.encoding system property, but obviously I missed something. No doubt I'll find out what sooner or later, but this led me to more interesting reflections.
The file encoding must be known in order to convert the bytes of a file into Java's internal 16-bit Unicode characters. The Novelang grammar defines very precisely the characters it accepts: a subset of ISO-8859-1. But that doesn't mean the document source file has to be encoded this way! The "é" character also exists in the MacRoman encoding, so the file.encoding property must be taken for what it is: just a hint for reading Unicode characters from a stripped-down format. This led me to the following conclusions:
  • The Novelang grammar defines every supported character independently of any encoding, since characters are defined in Unicode. This guarantees a lot of fun with Greeks, Germans, Russians...
  • If a single encoding for all files is acceptable, then the file.encoding system property provides the simplest approach. This is the current option (because of a bug preventing ISO-8859-1 from being forced as the default).
  • Per-file encoding would be a must. A Book function would do, saying "from now on, read Part files with encoding xxx". An HTTP query parameter could provide such a hint when previewing Parts with special encodings in a Web browser.
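The conclusions above can be sketched in a few lines of Java. This is an illustration under assumptions, not Novelang's actual code: the PartReader class and its read method are hypothetical names. UTF-8 stands in for MacRoman as the "other" encoding, because MacRoman is not among the charsets every JVM is guaranteed to provide. The idea is that a per-file hint wins, and the JVM default is only the fallback.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical reader, not part of Novelang.
class PartReader {

    /**
     * Reads a whole file as text. Uses the per-file charset hint when given,
     * and falls back to the JVM default charset when the hint is null.
     */
    static String read(Path file, Charset hint) throws IOException {
        Charset charset = (hint != null) ? hint : Charset.defaultCharset();
        return new String(Files.readAllBytes(file), charset);
    }

    public static void main(String[] args) throws IOException {
        String text = "caf\u00e9"; // "café", escaped so the source encoding does not matter

        Path latin1 = Files.createTempFile("part-latin1", ".txt");
        Path utf8 = Files.createTempFile("part-utf8", ".txt");
        // Same characters, different bytes on disk: 0xE9 versus 0xC3 0xA9 for "é".
        Files.write(latin1, text.getBytes(StandardCharsets.ISO_8859_1));
        Files.write(utf8, text.getBytes(StandardCharsets.UTF_8));

        // With the right hint, both files yield the same Unicode string.
        System.out.println(read(latin1, StandardCharsets.ISO_8859_1).equals(text));
        System.out.println(read(utf8, StandardCharsets.UTF_8).equals(text));
        // With the wrong hint, the accented character is mangled.
        System.out.println(read(utf8, StandardCharsets.ISO_8859_1).equals(text));

        Files.delete(latin1);
        Files.delete(utf8);
    }
}
```

A Book function or an HTTP query parameter would then only have to supply the hint argument. The grammar never sees bytes, only the decoded Unicode characters.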
I strongly recommend Joel Spolsky's excellent paper about character encoding and Unicode.
