str(...): 'yer probably doin' it wrong.
February 10, 2009 at 11:13 AM | Python, Unicode
Unicode is an ugly beast... And until people start standardizing on Python 3k*, we're going to have to live with the eccentricities of Python 2's strings.
But, fear not! There is (at least some) hope. By changing a few patterns in the way you code, you can alleviate the bulk of Unicode-related problems.
First, using the str function. In just about every case, if you're using the str function, you're probably doing it wrong.
Let me demonstrate:
from hashlib import sha256

def hash(to_hash):
    hash = sha256(to_hash).hexdigest()
    print to_hash, ":", hash
Cool, we can hash things then print out the hash:
>>> hash('ohai')
ohai : e84712238709398f6d349dc2250b0efca4b72d8c2bfb7b74339d30ba94056b14
But, wait... What happens if the thing we're hashing isn't a string (even though it can be represented as a string):
>>> class Person:
...     name = 'David'
...     def __str__(self):
...         return self.name
...
>>> hash(Person())
...
TypeError: new() argument 1 must be string or read-only buffer, not instance
Oh no! Ok, let's fix the code:
from hashlib import sha256

def hash(to_hash):
    to_hash = str(to_hash) # Convert the object to a string before we hash it
    hash = sha256(to_hash).hexdigest()
    print to_hash, ":", hash
Great -- we can hash objects that aren't strings now:
>>> hash(Person())
David : a6b54c20a7b96eeac1a911e6da3124a560fe6dc042ebf270e3676e7095b95652
And, for most people who only speak English, this is a perfect place to stop.
After all, everything is a str, right?
Well... No. What happens if the input is a unicode object?
>>> p = Person()
>>> # This person had the audacity to give themselves a name containing
>>> # non-ascii symbols, so we represent it with a unicode object
>>> p.name = u'I\xf1t\xebrn\xe2ti\xf4n\xe0liz\xe6ti\xf8n'
>>> hash(p)
...
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf1' in position
1: ordinal not in range(128)
Crap. Where the heck is 'ascii' coming from?
Well, it's a long story (which I've covered over at Encoding and Decoding Text in Python), but basically the __str__ method of the unicode object (u'I\xf1...') is trying to encode the unicode object using the system's default encoding... Which, in this case, is ascii.
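To see that implicit step in isolation, here is a minimal sketch that does by hand the encode that str attempts behind the scenes. It assumes Python 2.6 or later (for the `except ... as` syntax), and `encode_or_report` is a name made up for this example, not something from the standard library.

```python
def encode_or_report(text, encoding):
    """Try to encode `text`; return the bytes, or the error's class name."""
    try:
        return text.encode(encoding)
    except UnicodeEncodeError as error:
        return type(error).__name__

name = u'I\xf1t\xebrn\xe2ti'

# 'ascii' has no byte for u'\xf1', so this mirrors the failure above.
ascii_result = encode_or_report(name, 'ascii')

# Naming an encoding that can represent every character succeeds.
utf8_result = encode_or_report(name, 'utf-8')
```

The only difference between the two calls is the encoding, which is exactly the choice that str makes for you without asking.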
"Alright...", you're probably thinking, "If the problem is with unicode, maybe
I could just replace that call to str with a call to unicode"
Ok, let's see what happens.
from hashlib import sha256

def hash(to_hash):
    to_hash = unicode(to_hash) # Convert the object to a unicode before we hash it
    hash = sha256(to_hash).hexdigest()
    print to_hash, ":", hash
>>> # this time, though, the person's name has come from The Internet, so it
>>> # is not yet a unicode object
>>> p.name = 'I\xc3\xb1t\xc3\xabrn\xc3\xa2ti\xc3\xb4n\xc3\xa0liz\xc3\xa6ti\xc3\xb8n'
>>> hash(p)
...
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1:
ordinal not in range(128)
Yup, that's right -- you just can't win.
What's happening here? Well, this time, the unicode function is trying to
decode the input ('I\xc3...') into a unicode object... But, because the input
isn't valid 7-bit ascii (again, the system's default), it explodes. Crap.
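The same thing can be shown going the other way. This is a small sketch, assuming Python 2.6 or later (for the `b''` prefix) and assuming the bytes really are UTF-8, that compares the decode unicode attempts implicitly with one where the encoding is spelled out.

```python
# UTF-8 bytes, as they might arrive from the network.
raw = b'I\xc3\xb1t\xc3\xabrn\xc3\xa2ti'

try:
    raw.decode('ascii')        # roughly what unicode(raw) does implicitly
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False           # byte 0xc3 is outside the 7-bit range

# Saying which encoding the bytes are in makes the decode succeed.
decoded = raw.decode('utf-8')
```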
Confused yet?
So how can we save ourselves from all this insanity?
Actually, it's not too difficult:
1. Convert every single string, without exception, to unicode as soon as it enters the system. For example, if you are writing a web application, GET and POST variables should be converted to unicode as soon as they are read from the environment:
       for (key, value) in environment.get_vars:
           request.GET[to_unicode(key)] = to_unicode(value)
2. Only convert the unicode objects back to str strings when you absolutely must. For example, when they are written to a file:
       log_file.write("New user '%s' created" %(to_str(p.name)))
3. Think hard before you call str or unicode. Each time your fingers type "s", "t", flashing lights and sirens should go off in your head, reminding you to make sure that the object you are str-ing could never, ever, ever possibly contain unicode.
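Here is a rough end-to-end sketch of the first two rules, applied to the hashing example from earlier. It assumes the incoming bytes are UTF-8 and Python 2.6 or later; `read_name` and `fingerprint` are hypothetical names invented for this illustration.

```python
from hashlib import sha256

def read_name(raw_bytes, encoding='utf-8'):
    """First rule: decode at the boundary, as soon as the bytes arrive."""
    return raw_bytes.decode(encoding)

def fingerprint(name, encoding='utf-8'):
    """Second rule: encode only at the point where bytes are required."""
    return sha256(name.encode(encoding)).hexdigest()

incoming = b'I\xc3\xb1t\xc3\xabrn\xc3\xa2ti'   # e.g. a POST value
name = read_name(incoming)                     # unicode from here on
greeting = u'Hello, ' + name                   # safe: unicode + unicode
digest = fingerprint(name)                     # bytes exist only inside here
```

Everything between the two functions works on unicode objects only, so there is no place left for an implicit ascii conversion to sneak in.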
And what about those to_unicode and to_str functions? What should they look
like? Well, probably something like this:
import locale

def to_unicode(text):
    r""" Convert, at all costs, 'text' to a `unicode` object.

    Note: as a last-ditch effort, this function tries to decode the text
          as latin1... Which will always succeed. If you expect to get
          text encoded with latin[2-9] or some other character set, this
          may not be desirable.

    >>> to_unicode(u'I\xf1t\xebrn\xe2ti')
    u'I\xf1t\xebrn\xe2ti'
    >>> to_unicode('I\xc3\xb1t\xc3\xabrn\xc3\xa2ti')
    u'I\xf1t\xebrn\xe2ti'
    >>> class Foo:
    ...     def __str__(self):
    ...         return 'foo'
    ...
    >>> f = Foo()
    >>> to_unicode(f)
    u'foo'
    >>> f.__unicode__ = lambda: u'bar'
    >>> to_unicode(f)
    u'bar'
    """
    if isinstance(text, unicode):
        return text
    if hasattr(text, '__unicode__'):
        return text.__unicode__()
    text = str(text)
    try:
        return unicode(text, 'utf-8')
    except UnicodeError:
        pass
    try:
        return unicode(text, locale.getpreferredencoding())
    except UnicodeError:
        pass
    return unicode(text, 'latin1')

def to_str(text):
    r""" Convert 'text' to a `str` object.

    >>> to_str(u'I\xf1t\xebrn\xe2ti')
    'I\xc3\xb1t\xc3\xabrn\xc3\xa2ti'
    >>> to_str(42)
    '42'
    >>> to_str('ohai')
    'ohai'
    >>> class Foo:
    ...     def __str__(self):
    ...         return 'foo'
    ...
    >>> f = Foo()
    >>> to_str(f)
    'foo'
    >>> f.__unicode__ = lambda: u'I\xf1t\xebrn\xe2ti'
    >>> to_str(f)
    'I\xc3\xb1t\xc3\xabrn\xc3\xa2ti'
    """
    if isinstance(text, str):
        return text
    if hasattr(text, '__unicode__'):
        text = text.__unicode__()
    # Every unicode object also has a __str__ method (which would encode as
    # ascii), so check the type directly rather than using hasattr.
    if isinstance(text, unicode):
        return text.encode('utf-8')
    return str(text)
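The fallback chain in to_unicode is the part most worth understanding, so here it is again as a stripped-down sketch. It assumes Python 2.6 or later, skips the locale step for brevity, and `decode_with_fallbacks` is a hypothetical name used only for this illustration.

```python
def decode_with_fallbacks(raw, encodings=('utf-8', 'latin-1')):
    """Return (text, encoding_used) for the first encoding that decodes `raw`."""
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError('none of the encodings could decode the input')

# Valid UTF-8 is decoded on the first attempt.
utf8_text, utf8_used = decode_with_fallbacks(b'I\xc3\xb1t')

# A lone 0xf1 byte is not valid UTF-8, so latin-1 picks it up instead.
latin_text, latin_used = decode_with_fallbacks(b'I\xf1t')
```

latin-1 maps every possible byte to a character, which is why it never fails. It is also why it can quietly hand back the wrong text if the bytes were really in some other encoding, so treat it as a last resort.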
So there you have it.
A quick and fairly easy way to avoid many of your encoding-related problems.
If you're still not quite feeling comfortable with all of this, though, take a read over Joel Spolsky's The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
*: My newly installed Debian 4 machine is still running Python 2.4 (released November 2004)... So that wait might be a while.