Testing for Unicode Safety

February 11, 2009 at 10:30 AM | Unicode

After yesterday's post, Greg suggested I write another on how to test for Unicode safety. Unfortunately, I've got some bad news: it's hard.

You never know when some developer, somewhere, will unintentionally encode or decode something the wrong way (for example, log("request for %s", unicode(url))).
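To make that failure concrete: in Python 2, unicode(url) on a byte string quietly decodes it with the default ASCII codec, so it works fine right up until a non-ASCII URL arrives. The sketch below spells that implicit step out explicitly (which also lets it run on Python 3). The raw_url value, and the assumption that the bytes are UTF-8, are made up for illustration.

```python
# Bytes as they might arrive off the wire (hypothetical value, assumed UTF-8).
raw_url = b"/caf\xc3\xa9"


def implicit_decode(raw):
    """Mimic what Python 2's unicode(raw) does by default: decode as ASCII."""
    return raw.decode("ascii")


try:
    implicit_decode(raw_url)
except UnicodeDecodeError as exc:
    # This is the error that would surface from inside the log() call.
    failure = exc
else:
    failure = None

# Naming the encoding explicitly is what makes the conversion safe.
decoded = raw_url.decode("utf-8")
```

The point isn't the specific codec, just that the conversion in the logging call was never a deliberate decision.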

But there is hope!

In my experience, almost all Unicode-related issues follow one of two patterns: someone using str or unicode incorrectly, or code which unexpectedly encodes/decodes a string.

The first is easy to check for: grep through the code for str( and unicode(.
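If grep isn't handy (or you want the check in a test or commit hook), a rough Python equivalent might look like this. The name find_conversions is hypothetical, and the regex is deliberately naive: it will also flag matches inside comments and string literals, so treat the output as a list of places to review rather than a list of bugs.

```python
import re

# Matches calls to the built-ins str( and unicode(, but skips names that
# merely end in them (my_str() and attribute calls (obj.str().
CALL = re.compile(r"(?<![\w.])(str|unicode)\(")


def find_conversions(source):
    """Return (line number, builtin name, stripped line) for each suspect call."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), 1):
        for match in CALL.finditer(line):
            hits.append((lineno, match.group(1), line.strip()))
    return hits
```

Feed it the contents of each .py file in the tree and review every hit by hand.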

The second is harder to check for, and requires an understanding of the code base: every point where the code interacts with other parts of the system (filesystem, database, network) must be found and checked.
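One way to make those boundary points easier to audit is to funnel them through a pair of small helpers, so every encode and decode is explicit and greppable. This is only a sketch: the names decode_input and encode_output are hypothetical, and UTF-8 is an assumption that has to be checked against each real boundary.

```python
TEXT = type(u"")  # unicode on Python 2, str on Python 3


def decode_input(raw, encoding="utf-8"):
    """Turn bytes arriving from a file, socket or database into text."""
    if isinstance(raw, TEXT):
        return raw  # already decoded, leave it alone
    return raw.decode(encoding)


def encode_output(text, encoding="utf-8"):
    """Turn text into bytes just before it leaves the application."""
    if not isinstance(text, TEXT):
        # Refuse byte strings instead of letting them be re-encoded implicitly.
        raise TypeError("expected text, got %r" % type(text))
    return text.encode(encoding)
```

With helpers like these, "checking the boundaries" becomes a matter of making sure nothing reads or writes without going through them.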

Finally, it isn't a bad idea to throw some Unicode into the test suite. Instead of calling mock users 'user0' and 'user1', call them u'\u03bcs\xeb\u044f' (u"μsëя")*. Keep a central "database" of these sorts of strings, so it's easy for developers who don't normally write in Cyrillic to use Cyrillic characters in their code (I keep my own personal list at http://wolever.net/~wolever/wiki/unicode_audit -- a URL I can now type from memory).

One word of caution, though: you're asking for a world of pain if you actually think you can commit UTF-8 encoded text -- any number of things will break (Subversion may helpfully fail, your editor may helpfully re-encode the file, your unenlightened developers will complain about funny question marks in their code, etc.). Instead, have a central file which defines these "canned test strings" using escaped Python strings (i.e., u'\u03bc...'), then import that into your test suite:

from app.tests import i18n
...
def test_user():
    u = User(name=i18n.user)
    ...
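The app.tests.i18n module imported above could be as small as the following sketch. Only the name user comes from the example; filename, email and as_utf8 are hypothetical additions. Everything is written with escapes, and even the comments avoid non-ASCII characters, so the file stays plain ASCII.

```python
# Hypothetical contents of app/tests/i18n.py.
# Only escape sequences are used, so the file itself is plain ASCII and
# survives editors and version control untouched.

# Greek mu, "s", e-with-diaeresis, Cyrillic ya
user = u"\u03bcs\xeb\u044f"

# Hypothetical extra entries built from the same string.
filename = user + u".txt"
email = user + u"@example.com"


def as_utf8(text):
    """Return the UTF-8 bytes of a canned string, for tests that need raw bytes."""
    return text.encode("utf-8")
```

Tests then never need a non-ASCII character typed into them; they just refer to i18n.user, i18n.filename and so on.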

Alternatively, tar up all the offensive files, then write a script to un-tar them when they are needed**.
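The tarball approach can be sketched with the standard tarfile module. This is not the script used for DrProject, just a minimal illustration: pack_fixtures and unpack_fixtures are hypothetical names, the archive is kept in memory for brevity, and it assumes a Python whose tarfile handles Unicode member names and a filesystem that accepts them.

```python
import io
import os
import tarfile


def pack_fixtures(files):
    """Build a tar archive (as bytes) from a {name: content bytes} dict."""
    buf = io.BytesIO()
    # The pax format stores member names as UTF-8.
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.PAX_FORMAT) as tar:
        for name, data in sorted(files.items()):
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def unpack_fixtures(tar_bytes, dest):
    """Write every regular file in the archive into dest; return the names."""
    names = []
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r") as tar:
        for member in tar.getmembers():
            if not member.isfile():
                continue
            # Keep only the base name so fixtures cannot escape dest.
            name = os.path.basename(member.name)
            with open(os.path.join(dest, name), "wb") as out:
                out.write(tar.extractfile(member).read())
            names.append(name)
    return names
```

The archive itself is an opaque binary blob as far as the repository is concerned, so the awkward filenames only exist on disk while the tests run.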

So, to sum it up:

  • Make sure your developers grok (or, at least, understand) Unicode and encodings
  • Make sure your code uses str and unicode safely
  • Make sure your entry and exit points (filesystem, database, network) are covered
  • Make it really easy to include Unicode in tests

Then maybe, if you're lucky, all those inconsiderate people who have the audacity to ask for more than 128 characters will be able to use your application :-)

*: A good choice because it's easy for ignorant North Americans like myself to see that it's correct.

**: This is how I tested DrProject's handling of Unicode filenames which are checked into the Subversion repository.