Unicode is an ugly beast... And until people start standardizing on Python 3k*,
we're going to have to live with the eccentricities of Python 2's strings.
But, fear not! There is (at least some) hope. By changing a few patterns in the
way you code, you can alleviate the bulk of Unicode-related problems.
First, using the str function. In just about every case, if you're
using the str function, you're probably doing it wrong.
Let me demonstrate:
from hashlib import sha256
def hash(to_hash):
    hash = sha256(to_hash).hexdigest()
    print to_hash, ":", hash
Cool, we can hash things then print out the hash:
>>> hash('ohai')
ohai : e84712238709398f6d349dc2250b0efca4b72d8c2bfb7b74339d30ba94056b14
But, wait... What happens if the thing we're hashing isn't a string (even though
it can be represented as a string):
>>> class Person:
...     name = 'David'
...     def __str__(self):
...         return self.name
...
>>> hash(Person())
...
TypeError: new() argument 1 must be string or read-only buffer, not instance
Oh no! Ok, let's fix the code:
from hashlib import sha256
def hash(to_hash):
    to_hash = str(to_hash) # Convert the object to a string before we hash it
    hash = sha256(to_hash).hexdigest()
    print to_hash, ":", hash
Great -- we can hash people now:
>>> hash(Person())
David : a6b54c20a7b96eeac1a911e6da3124a560fe6dc042ebf270e3676e7095b95652
And, for most people who only speak English, this is a perfect place to stop.
After all, everything is a str, right?
Well... No. What happens if the input is a unicode object?
>>> p = Person()
>>> # This person had the audacity to give themselves a name containing
>>> # non-ascii symbols, so we represent it with a unicode object
>>> p.name = u'I\xf1t\xebrn\xe2ti\xf4n\xe0liz\xe6ti\xf8n'
>>> hash(p)
...
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf1' in position
1: ordinal not in range(128)
Crap. Where the heck is 'ascii' coming from?
Well, it's a long story (which I've covered over at Encoding and Decoding Text
in Python), but basically __str__ has handed back a unicode object
(u'I\xf1...'), and str is trying to turn it into a str by encoding it with the
system's default encoding... Which, in this case, is ascii.
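To see that failure in isolation, here is a minimal sketch. It spells out the 'ascii' codec explicitly, standing in for the implicit default that str falls back on in Python 2, and it is written so that it should run unchanged on Python 2 and Python 3.3+. The variable names are just for illustration.

```python
import sys

name = u'I\xf1t\xebrn\xe2ti\xf4n\xe0liz\xe6ti\xf8n'

# The interpreter-wide default: normally 'ascii' on Python 2, 'utf-8' on Python 3.
default_encoding = sys.getdefaultencoding()

# Encoding with the ascii codec blows up on the first non-ascii character...
try:
    name.encode('ascii')
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# ...while explicitly naming a codec that can represent the text works fine.
utf8_bytes = name.encode('utf-8')
```

The point is that nothing is wrong with the text itself; the only thing that changed between the failure and the success is which codec was used.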
"Alright...", you're probably thinking, "If the problem is with unicode, maybe
I could just replace that call to str with a call to unicode"
Ok, let's see what happens.
from hashlib import sha256
def hash(to_hash):
    to_hash = unicode(to_hash) # Convert the object to a unicode before we hash it
    hash = sha256(to_hash).hexdigest()
    print to_hash, ":", hash
>>> # this time, though, the person's name has come from The Internet, so it
>>> # is not yet a unicode object
>>> p.name = 'I\xc3\xb1t\xc3\xabrn\xc3\xa2ti\xc3\xb4n\xc3\xa0liz\xc3\xa6ti\xc3\xb8n'
>>> hash(p)
...
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1:
ordinal not in range(128)
Yup, that's right -- you just can't win.
What's happening here? Well, this time, the unicode function is trying to
decode the input ('I\xc3...') into a unicode object... But, because the input
isn't valid 7-bit ascii (again, the system's default), it explodes. Crap.
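The mirror image of the earlier error can be reproduced the same way. This is a small sketch, again naming the 'ascii' codec explicitly to stand in for Python 2's implicit default, and again meant to run on both Python 2 and Python 3.3+.

```python
# UTF-8 encoded bytes, exactly as they might arrive from The Internet.
raw = b'I\xc3\xb1t\xc3\xabrn\xc3\xa2ti\xc3\xb4n\xc3\xa0liz\xc3\xa6ti\xc3\xb8n'

# Decoding as ascii fails on the first byte above 0x7f (the 0xc3 at position 1)...
try:
    raw.decode('ascii')
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True

# ...but decoding with the encoding the bytes were actually written in succeeds.
text = raw.decode('utf-8')
```

So unicode isn't broken either; it just has no way of knowing which encoding the bytes are in unless you tell it.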
Confused yet?
So how can we save ourselves from all this insanity?
Actually, it's not too difficult:
Convert every single string, without exception, to unicode as soon as it
enters the system. For example, if you are writing a web application,
GET and POST variables should be converted to unicode as soon as they are
read from the environment:
for (key, value) in environment.get_vars:
    request.GET[to_unicode(key)] = to_unicode(value)
Only convert the unicode objects back to str strings when you absolutely
must. For example, when they are written to a file:
log_file.write("New user '%s' created" %(to_str(p.name)))
Think hard before you call str or unicode. Each time your fingers type
"s", "t", flashing lights and sirens should go off in your head, reminding
you to make sure that the object you are stringifying could never, ever, ever
possibly contain unicode.
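Put together, the first two rules look something like the sketch below: decode once on the way in, work with text in the middle, encode once on the way out. The helper names read_input and write_output are made up for this example, UTF-8 is assumed at both edges, and an in-memory io.BytesIO stands in for a real log file. It should run on both Python 2 and Python 3.3+.

```python
import io

def read_input(raw_bytes, encoding='utf-8'):
    """Decode bytes from the outside world as soon as they arrive."""
    return raw_bytes.decode(encoding)

def write_output(stream, text, encoding='utf-8'):
    """Encode text only at the moment it leaves the program."""
    stream.write(text.encode(encoding))

# Bytes in, as they might be read from a socket or a form submission.
raw_name = b'I\xc3\xb1t\xc3\xabrn\xc3\xa2ti'

# From here on, everything is text; no implicit codecs get involved.
name = read_input(raw_name)
message = u"New user '%s' created" % name

# Bytes out, written to something that behaves like a binary-mode file.
log_file = io.BytesIO()
write_output(log_file, message)
```

Because the formatting step only ever sees text, there is nowhere in the middle for the default encoding to sneak in.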
And what about those to_unicode and to_str functions? What should they look
like? Well, probably something like this:
import locale

def to_unicode(text):
    """ Convert, at all costs, 'text' to a `unicode` object.

        Note: as a last-ditch effort, this function tries to decode the text
              as latin1... Which will always succeed. If you expect to get
              text encoded with latin[2-9] or some other character set, this
              may not be desirable.

        >>> to_unicode(u'I\xf1t\xebrn\xe2ti')
        u'I\xf1t\xebrn\xe2ti'
        >>> to_unicode('I\xc3\xb1t\xc3\xabrn\xc3\xa2ti')
        u'I\xf1t\xebrn\xe2ti'
        >>> class Foo:
        ...     def __str__(self):
        ...         return 'foo'
        ...
        >>> f = Foo()
        >>> to_unicode(f)
        u'foo'
        >>> f.__unicode__ = lambda: u'bar'
        >>> to_unicode(f)
        u'bar'
        """
    # Already unicode: nothing to do
    if isinstance(text, unicode):
        return text

    # The object knows how to represent itself as unicode
    if hasattr(text, '__unicode__'):
        return text.__unicode__()

    # Otherwise get a byte string and guess its encoding, most likely first
    text = str(text)

    try:
        return unicode(text, 'utf-8')
    except UnicodeError:
        pass

    try:
        return unicode(text, locale.getpreferredencoding())
    except UnicodeError:
        pass

    # latin1 maps every byte to a character, so this cannot fail
    return unicode(text, 'latin1')
def to_str(text):
    """ Convert 'text' to a `str` object.

        >>> to_str(u'I\xf1t\xebrn\xe2ti')
        'I\xc3\xb1t\xc3\xabrn\xc3\xa2ti'
        >>> to_str(42)
        '42'
        >>> to_str('ohai')
        'ohai'
        >>> class Foo:
        ...     def __str__(self):
        ...         return 'foo'
        ...
        >>> f = Foo()
        >>> to_str(f)
        'foo'
        >>> f.__unicode__ = lambda: u'I\xf1t\xebrn\xe2ti'
        >>> to_str(f)
        'I\xc3\xb1t\xc3\xabrn\xc3\xa2ti'
        """
    # Already a str: nothing to do
    if isinstance(text, str):
        return text

    # Prefer the object's unicode representation if it has one
    if hasattr(text, '__unicode__'):
        text = text.__unicode__()

    # Anything that isn't unicode by now (ints, objects with only __str__)
    # can safely go through str
    if not isinstance(text, unicode):
        return str(text)

    # Encode explicitly, so the ascii default never gets a look in
    return text.encode('utf-8')
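Which brings us back to the hash function from the start. On Python 2, with the helpers above, the fix is to call sha256(to_str(to_hash)) instead of sha256(str(to_hash)). The sketch below shows the same idea in a form that should run on both Python 2 and Python 3.3+; hash_text and text_type are names invented for this example, and UTF-8 is an assumed choice of encoding.

```python
from hashlib import sha256

# `unicode` on Python 2, `str` on Python 3.
text_type = type(u'')

def hash_text(to_hash, encoding='utf-8'):
    """Return the hex SHA-256 digest of a text string or a byte string."""
    # Hashes work on bytes, so text is encoded explicitly, right here,
    # rather than being left to whatever the default encoding happens to be.
    if isinstance(to_hash, text_type):
        to_hash = to_hash.encode(encoding)
    return sha256(to_hash).hexdigest()

# The same name, once as text and once as UTF-8 encoded bytes.
from_text = hash_text(u'I\xf1t\xebrn\xe2ti')
from_bytes = hash_text(b'I\xc3\xb1t\xc3\xabrn\xc3\xa2ti')
```

Both calls produce the same digest, because the only conversion that happens is the explicit one at the edge.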
So there you have it.
A quick and fairly easy way to avoid many of your encoding-related problems.
If you're still not quite feeling comfortable with all of this, though, take a
read over Joel Spolsky's The Absolute Minimum Every Software Developer
Absolutely, Positively Must Know About Unicode and Character Sets (No
Excuses!).
*: My newly installed Debian 4 machine is still running Python 2.4 (released
December 2004)... So that wait might be a while.