Unicode is an ugly beast... And until people start standardizing on Python 3k*, we're going to have to live with the eccentricities of Python 2's strings.
But, fear not! There is (at least some) hope. By changing a few patterns in the way you code, you can alleviate the bulk of Unicode-related problems.
First: the str function. In just about every case, if you're using the str function, you're probably doing it wrong.
Let me demonstrate:
    from hashlib import sha256

    def hash(to_hash):
        hash = sha256(to_hash).hexdigest()
        print to_hash, ":", hash
Cool, we can hash things then print out the hash:
    >>> hash('ohai')
    ohai : e84712238709398f6d349dc2250b0efca4b72d8c2bfb7b74339d30ba94056b14
But, wait... What happens if the thing we're hashing isn't a string (even though it can be represented as a string):
    >>> class Person:
    ...     name = 'David'
    ...     def __str__(self):
    ...         return self.name
    ...
    >>> hash(Person())
    ...
    TypeError: new() argument 1 must be string or read-only buffer, not instance
Oh no! Ok, let's fix the code:
    from hashlib import sha256

    def hash(to_hash):
        to_hash = str(to_hash) # Convert the object to a string before we hash it
        hash = sha256(to_hash).hexdigest()
        print to_hash, ":", hash
Great -- we can hash arbitrary objects now:
    >>> hash(Person())
    David : a6b54c20a7b96eeac1a911e6da3124a560fe6dc042ebf270e3676e7095b95652
And, for most people who only speak English, this is a perfect place to stop.
After all, everything is a str, right?
Well... No. What happens if the input is a unicode object?
    >>> p = Person()
    >>> # This person had the audacity to give themselves a name containing
    >>> # non-ascii symbols, so we represent it with a unicode object
    >>> p.name = u'I\xf1t\xebrn\xe2ti\xf4n\xe0liz\xe6ti\xf8n'
    >>> hash(p)
    ...
    UnicodeEncodeError: 'ascii' codec can't encode character u'\xf1' in position 1: ordinal not in range(128)
Crap. Where the heck is 'ascii' coming from?
Well, it's a long story (which I've covered over at Encoding and Decoding Text), but basically the __str__ method of the unicode object (u'I\xf1...') is trying to encode the unicode object using the system's default encoding... Which, in this case, is ascii.
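The failure can be reproduced without hashlib at all. This sketch (which behaves the same under Python 2 and 3) shows it's the implicit encode with the ascii codec, not the hashing, that explodes:

```python
# Encoding a non-ASCII unicode string with the 'ascii' codec fails --
# this is exactly what str() does implicitly in Python 2.
name = u'I\xf1t\xebrn\xe2ti\xf4n\xe0liz\xe6ti\xf8n'

try:
    name.encode('ascii')
    failed = False
except UnicodeEncodeError:
    failed = True

# Naming an encoding that can actually represent the text works fine:
encoded = name.encode('utf-8')
```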
"Alright...", you're probably thinking, "If the problem is with
I could just replace that call to
str with a call to
Ok, let's see what happens.
    from hashlib import sha256

    def hash(to_hash):
        to_hash = unicode(to_hash) # Convert the object to a unicode before we hash it
        hash = sha256(to_hash).hexdigest()
        print to_hash, ":", hash

    >>> # this time, though, the person's name has come from The Internet, so it
    >>> # is not yet a unicode object
    >>> p.name = 'I\xc3\xb1t\xc3\xabrn\xc3\xa2ti\xc3\xb4n\xc3\xa0liz\xc3\xa6ti\xc3\xb8n'
    >>> hash(p)
    ...
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1: ordinal not in range(128)
Yup, that's right -- you just can't win.
What's happening here? Well, this time, the unicode function is trying to decode the input ('I\xc3...') into a unicode object... But, because the input isn't valid 7-bit ascii (again, the system's default), it explodes. Crap.
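Again, hashlib is innocent here; the decode is what fails. A minimal sketch (same behaviour under Python 2 and 3, where the byte string carries an explicit b prefix):

```python
# Decoding UTF-8 bytes with the 'ascii' codec (Python 2's default)
# blows up on the first byte >= 0x80.
raw = b'I\xc3\xb1t\xc3\xabrn\xc3\xa2ti\xc3\xb4n\xc3\xa0liz\xc3\xa6ti\xc3\xb8n'

try:
    raw.decode('ascii')
    failed = False
except UnicodeDecodeError:
    failed = True

# Naming the correct encoding recovers the original text:
decoded = raw.decode('utf-8')
```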
So how can we save ourselves from all this insanity?
Actually, it's not too difficult:
Convert every single string, without exception, to unicode as soon as it enters the system. For example, if you are writing a web application, GET and POST variables should be converted to unicode as soon as they are read from the environment:
    for (key, value) in environment.get_vars:
        request.GET[to_unicode(key)] = to_unicode(value)
Only convert the unicode objects back to str strings when you absolutely must. For example, when they are written to a file:
    log_file.write("New user '%s' created" % (to_str(p.name)))
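As an aside -- this is my addition, not part of the advice above -- the standard library's codecs.open accepts an encoding, so unicode objects can be written to such a file directly, without an explicit to_str call (the log file name here is purely illustrative):

```python
import codecs
import os
import tempfile

# Open the log file with an explicit encoding; the returned wrapper
# encodes unicode objects on write, so no manual to_str() is needed.
path = os.path.join(tempfile.mkdtemp(), 'app.log')
log_file = codecs.open(path, 'w', encoding='utf-8')
log_file.write(u"New user '%s' created" % u'I\xf1t\xebrn\xe2ti')
log_file.close()
```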
Think hard before you call str or unicode. Each time your fingers type "s", "t", "r", flashing lights and sirens should go off in your head, reminding you to make sure that the object you are stringifying could never, ever, ever possibly contain unicode.
And what about those to_unicode and to_str functions? What should they look like? Well, probably something like this:
    import locale

    def to_unicode(text):
        """ Convert, at all costs, 'text' to a `unicode` object.

            Note: as a last-ditch effort, this function tries to decode the
            text as latin1... Which will always succeed. If you expect to get
            text encoded with latin[2-9] or some other character set, this
            may not be desirable.

            >>> to_unicode(u'I\xf1t\xebrn\xe2ti')
            u'I\xf1t\xebrn\xe2ti'
            >>> to_unicode('I\xc3\xb1t\xc3\xabrn\xc3\xa2ti')
            u'I\xf1t\xebrn\xe2ti'
            >>> class Foo:
            ...     def __str__(self):
            ...         return 'foo'
            ...
            >>> f = Foo()
            >>> to_unicode(f)
            u'foo'
            >>> f.__unicode__ = lambda: u'bar'
            >>> to_unicode(f)
            u'bar'
            >>> """
        if isinstance(text, unicode):
            return text
        if hasattr(text, '__unicode__'):
            return text.__unicode__()
        text = str(text)
        try:
            return unicode(text, 'utf-8')
        except UnicodeError:
            pass
        try:
            return unicode(text, locale.getpreferredencoding())
        except UnicodeError:
            pass
        return unicode(text, 'latin1')

    def to_str(text):
        """ Convert 'text' to a `str` object, encoding unicode as utf-8.

            >>> to_str(u'I\xf1t\xebrn\xe2ti')
            'I\xc3\xb1t\xc3\xabrn\xc3\xa2ti'
            >>> to_str(42)
            '42'
            >>> to_str('ohai')
            'ohai'
            >>> class Foo:
            ...     def __str__(self):
            ...         return 'foo'
            ...
            >>> f = Foo()
            >>> to_str(f)
            'foo'
            >>> f.__unicode__ = lambda: u'I\xf1t\xebrn\xe2ti'
            >>> to_str(f)
            'I\xc3\xb1t\xc3\xabrn\xc3\xa2ti'
            >>> """
        if isinstance(text, str):
            return text
        if hasattr(text, '__unicode__'):
            text = text.__unicode__()
        if isinstance(text, unicode):
            return text.encode('utf-8')
        return str(text)
So there you have it. A quick and fairly easy way to avoid many of your encoding-related problems.
If you're still not quite feeling comfortable with all of this, though, take a read over Joel Spolsky's The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
*: My newly installed Debian 4 machine is still running Python 2.4 (released December 2004)... So that wait might be a while.