Wednesday, February 11. 2009
Testing for Unicode Safety

After yesterday's post, Greg suggested I write another on how to test for Unicode safety... And unfortunately I've got some bad news: it's hard. You never know when some developer, somewhere, will unintentionally encode or decode something the wrong way (for example, [...]).

But there is hope! In my experience, almost all Unicode-related issues follow the same pattern: someone using [...]

The first is easy to check for: grep through the code for [...]

The second is harder to check for, and requires an understanding of the code base: all of the points where the code interacts with other parts of the system (filesystem, database, network) must be found and checked.

Finally, it isn't a bad idea to throw some Unicode into the test suite. Instead of calling mock users 'user0', 'user1', call them u'\u03bcs\xeb\u044f' (u"μsëя")*. Keep a central "database" of these sorts of strings, so it's easy for developers who don't normally write in Cyrillic to use Cyrillic characters in their code (I keep my own personal list at http://wolever.net/~wolever/wiki/unicode_audit -- a URL I can now type from memory).

One word of caution, though: you're asking for a world of pain if you actually think you can commit UTF-8 encoded text -- any number of things will break (subversion may helpfully fail, your editor may helpfully re-encode the file, your unenlightened developers will complain about funny question marks in their code, etc...). Instead, have a central file which defines these "canned test strings" using escaped Python strings (i.e., u'\u03bc...') then import that into your test suite:
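A minimal sketch of such a central file and a test that uses it. The constant names (MU_S_E_YA, CANNED_STRINGS), the extra sample strings and the make_greeting function under test are all hypothetical; only the u'\u03bcs\xeb\u044f' string comes from the text above. Every non-ASCII character is written as an escape, so the source file itself stays plain ASCII.

```python
# canned_strings.py (hypothetical name): the one place test code gets
# its non-ASCII sample data from. Everything is escaped, so this file
# is plain ASCII on disk.

# Greek mu, 's', e-with-diaeresis, Cyrillic ya
MU_S_E_YA = u'\u03bcs\xeb\u044f'

CANNED_STRINGS = {
    'username': MU_S_E_YA,
    'latin1_only': u'caf\xe9',          # fits in Latin-1
    'cyrillic': u'\u043c\u0438\u0440',  # "mir"
}


def make_greeting(username):
    """Hypothetical function under test."""
    return u'Hello, %s!' % username


def test_greeting_with_unicode_username():
    # In a real suite: from canned_strings import CANNED_STRINGS
    name = CANNED_STRINGS['username']
    greeting = make_greeting(name)
    assert name in greeting
    # The result must survive an explicit round trip through UTF-8.
    assert greeting.encode('utf-8').decode('utf-8') == greeting


test_greeting_with_unicode_username()
```

Keeping the strings in one module means a developer only has to import a name, never type or paste the characters themselves.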
Or tar up all the offensive files, then write a script to un-tar them when they are needed**. So, to sum it up:
Then maybe, if you're lucky, all those inconsiderate people who have the audacity to ask for more than 127 characters will be able to use your application.

*: A good choice because it's easy for ignorant North Americans like myself to see that it's correct.

**: This is how I tested DrProject's handling of Unicode filenames which are checked into the Subversion repository.

Tuesday, February 10. 2009
str(...): 'yer probably doin' it wrong.

Unicode is an ugly beast... And until people start standardizing on Python 3k*, we're going to have to live with the eccentricities of Python 2's strings.

But, fear not! There is (at least some) hope. By changing a few patterns in the way you code, you can alleviate the bulk of Unicode-related problems.

First, using the str function. In just about every case, if you're using the str function, you're probably doing it wrong. Let me demonstrate:
Cool, we can hash things then print out the hash:
But, wait... What happens if the thing we're hashing isn't a string (even though it can be represented as a string):
Oh no! Ok, let's fix the code:
Great -- we can hash numbers now:
And, for most people who only speak English, this is a perfect place to stop. After all, everything is a [...]

Well... No. What happens if the input is a [...]
Crap. Where the heck is 'ascii' coming from? Well, it's a long story (which I've covered over at Encoding and Decoding Text in Python), but basically the [...]

"Alright...", you're probably thinking, "If the problem is with [...]" Ok, let's see what happens.
Yup, that's right -- you just can't win.

What's happening here? Well, this time, the [...]

Confused yet? So how can we save ourselves from all this insanity? Actually, it's not too difficult:
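One way to sketch the idea, written for Python 3 (where str is already Unicode text and bytes holds encoded data) rather than the Python 2 discussed here. The helper names to_text and hash_thing are hypothetical, and the choice of MD5 and UTF-8 is an assumption made for illustration. The pattern: turn the value into text first, then encode it with an explicitly named codec at the one point where bytes are required.

```python
import hashlib


def to_text(thing, encoding='utf-8'):
    """Return `thing` as Unicode text.

    Bytes are decoded with an explicitly named codec; everything else
    is converted with str(), which is Unicode text on Python 3.
    """
    if isinstance(thing, bytes):
        return thing.decode(encoding)
    return str(thing)


def hash_thing(thing):
    """Hash any value by way of its UTF-8 encoded text form."""
    text = to_text(thing)
    # Hash functions work on bytes, so encode explicitly instead of
    # relying on any implicit default codec.
    return hashlib.md5(text.encode('utf-8')).hexdigest()


print(hash_thing('hello'))
print(hash_thing(42))
print(hash_thing(u'\u03bcs\xeb\u044f'))
```

With this shape, text, numbers and already-encoded bytes all reach the hash through the same explicit encode step, so no default codec is ever consulted.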
And what about those [...]

So there you have it. A quick and fairly easy way to avoid many of your encoding-related problems.

If you're still not quite feeling comfortable with all of this, though, take a read over Joel Spolsky's "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)".

*: My newly installed Debian 4 machine is still running Python 2.4 (released November 2004)... So that wait might be a while.

Thursday, May 1. 2008
Encoding and Decoding Text in Python (or: "I didn't ask you to use the 'ascii' codec!")

When dealing with Unicode in Python, it doesn't take long to get the dreaded [...]

You never see it coming. It doesn't make any sense. You didn't even ask for [...]

So what's the deal? I'm glad you asked. I will demonstrate:
If you guessed that [...]

Now, if we want to do anything useful with this data, it needs to be decoded:
We have just taken an encoded hunk of data and decoded it to get a useful hunk of data.
Now we can take that useful hunk of data (the English in 7-bit ASCII), do something useful with it (in this case, replace 'world' with 'Marguerite'), and finally encode the data. So how does all this relate back to Unicode and ascii error messages? I have used base64 encoded data here, but the same concept applies when dealing with Unicode data:
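A sketch of that parallel, assuming Python 3 and UTF-8 as the encoding. The base64 payload is simply 'hello world' encoded for illustration, and the Unicode sample text is made up. Both halves follow the same three steps: decode, work on the decoded value, encode again.

```python
import base64

# --- base64: decode, modify, re-encode ---
encoded = b'aGVsbG8gd29ybGQ='            # 'hello world' in base64
decoded = base64.b64decode(encoded)      # the useful hunk of data
changed = decoded.replace(b'world', b'Marguerite')
reencoded = base64.b64encode(changed)

# --- UTF-8: exactly the same three steps ---
utf8_bytes = b'caf\xc3\xa9 world'        # encoded data, e.g. read from a file
text = utf8_bytes.decode('utf-8')        # now a real Unicode string
text = text.replace(u'world', u'Marguerite')
utf8_again = text.encode('utf-8')        # encoded again for output

print(base64.b64decode(reencoded))
print(text == u'caf\xe9 Marguerite')
```

In both cases the replace step only makes sense on the decoded value; the encoded form is just transport.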
(Of course, in the Real World, you've got to figure out which encoding was used on the data (UTF-8, Latin1, etc)... But that's a topic for another post.)

Ok, back to the [...]

I know what you're saying, "but I never asked Python to decode anything! I'm just trying to turn it into unicode!"

Two questions arise here: First, "Where is the 'ascii' coming from?" Second, "How do I make it work?"

To answer the first question, it's important to think about what's happening when the call to [...]

So how can you make it work? Tell [...]
(Now, as I mentioned before, figuring out which encoding to use is another huge problem... But I'll leave that for another day.)

Another problem I run into quite often is this:
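Python 3 refuses this mistake outright, so the sketch below spells out the hidden step by hand. It rests on the understanding that Python 2, when asked to encode an already-encoded str, first decodes it with the default 'ascii' codec; the variable names and sample value are illustrative.

```python
encoded = u'\u03bcs\xeb\u044f'.encode('utf-8')   # already-encoded bytes

# Python 2 would accept encoded.encode('utf-8'), but only by silently
# running the equivalent of this decode first:
try:
    encoded.decode('ascii').encode('utf-8')
    error = None
except UnicodeDecodeError as exc:
    error = exc

print(type(error).__name__)

# The fix is to decode with the codec that was actually used.
text = encoded.decode('utf-8')
```

Seen this way, a decode error raised from an encode call is less mysterious: the failure happens in the implicit first step, not in the encoding that was asked for.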
And, by now, the cause of this should be painfully obvious: I've given Python an encoded string, so I should be decoding it, not encoding it again.

But why the confusing error message? Well, I'm not entirely sure, but my guess is that the UTF-8 encoder expects a [...]

Is there any end to this insanity?!

Yes! Python 3000 will have two distinct classes: one for strings, one for hunks of data. Whenever data is read, it will come in as a "hunk of data". It will have to be explicitly decoded to a string before it can be used as such. Hopefully that will make life a little bit less painful.
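A short sketch of how that separation looks in Python 3 as released, where the text class is str and the "hunk of data" class is bytes. The sample value is illustrative.

```python
data = b'\xce\xbcs\xc3\xab\xd1\x8f'     # UTF-8 bytes, as read from disk or a socket
text = data.decode('utf-8')             # explicit decode gives a str

assert isinstance(data, bytes) and isinstance(text, str)
assert len(data) == 7 and len(text) == 4

# Mixing the two is an error rather than an implicit 'ascii' conversion.
try:
    data + text
    mixed_ok = True
except TypeError:
    mixed_ok = False

# bytes has no encode() at all, so "encoding twice" cannot happen.
has_encode = hasattr(data, 'encode')

print(mixed_ok, has_encode)
```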
