str(...): 'yer probably doin' it wrong.

February 10, 2009 at 11:13 AM | Python, Unicode

Unicode is an ugly beast... And until people start standardizing on Python 3k*, we're going to have to live with the eccentricities of Python 2's strings.

But, fear not! There is (at least some) hope. By changing a few patterns in the way you code, you can alleviate the bulk of Unicode-related problems.

First, using the str function. In just about every case, if you're using the str function, you're probably doing it wrong.

Let me demonstrate:

from hashlib import sha256
def hash(to_hash):
  hash = sha256(to_hash).hexdigest()
  print to_hash, ":", hash

Cool, we can hash things then print out the hash:

>>> hash('ohai')
ohai : e84712238709398f6d349dc2250b0efca4b72d8c2bfb7b74339d30ba94056b14

But, wait... What happens if the thing we're hashing isn't a string (even though it can be represented as a string):

>>> class Person:
...   name = 'David'
...   def __str__(self):
...     return self.name
...
>>> hash(Person())
...
TypeError: new() argument 1 must be string or read-only buffer, not instance

Oh no! Ok, let's fix the code:

from hashlib import sha256
def hash(to_hash):
  to_hash = str(to_hash) # Convert the object to a string before we hash it
  hash = sha256(to_hash).hexdigest()
  print to_hash, ":", hash

Great -- we can hash objects now:

>>> hash(Person())
David : a6b54c20a7b96eeac1a911e6da3124a560fe6dc042ebf270e3676e7095b95652

And, for most people who only speak English, this is a perfect place to stop. After all, everything is a str, right?

Well... No. What happens if the input is a unicode object?

>>> p = Person()
>>> # This person had the audacity to give themselves a name containing
>>> # non-ascii symbols, so we represent it with a unicode object
>>> p.name = u'I\xf1t\xebrn\xe2ti\xf4n\xe0liz\xe6ti\xf8n'
>>> hash(p)
...
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf1' in position
1: ordinal not in range(128)

Crap. Where the heck is 'ascii' coming from?

Well, it's a long story (which I've covered over at Encoding and Decoding Text in Python), but basically str is taking the unicode object that __str__ returned (u'I\xf1...') and trying to encode it using the system's default encoding... Which, in this case, is ascii.
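
If you want to poke at that failure without going through str at all, the hidden encode step can be written out by hand. This is only a sketch: encode_with_default is a made-up name, and it assumes the default encoding is ascii (which is what a stock Python 2 uses).

```python
name = u'I\xf1t\xebrn\xe2ti\xf4n\xe0liz\xe6ti\xf8n'

def encode_with_default(text, default_encoding='ascii'):
    # Hypothetical stand-in for the encode step that str() performs on a
    # unicode object; 'ascii' is assumed to be the system default.
    return text.encode(default_encoding)

try:
    encode_with_default(name)
    error_position = None
except UnicodeEncodeError as error:
    # error.start is the index of the first character that could not be encoded
    error_position = error.start

# Naming an encoding that can represent every character works fine
utf8_bytes = encode_with_default(name, 'utf-8')
```

The position reported by the exception is the same "position 1" that shows up in the traceback above.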

"Alright...", you're probably thinking, "If the problem is with unicode, maybe I could just replace that call to str with a call to unicode"

Ok, let's see what happens.

from hashlib import sha256
def hash(to_hash):
  to_hash = unicode(to_hash) # Convert the object to a unicode before we hash it
  hash = sha256(to_hash).hexdigest()
  print to_hash, ":", hash

>>> # this time, though, the person's name has come from The Internet, so it
>>> # is not yet a unicode object
>>> p.name = 'I\xc3\xb1t\xc3\xabrn\xc3\xa2ti\xc3\xb4n\xc3\xa0liz\xc3\xa6ti\xc3\xb8n'
>>> hash(p)
...
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1:
ordinal not in range(128)

Yup, that's right -- you just can't win :-(

What's happening here? Well, this time, the unicode function is trying to decode the input ('I\xc3...') into a unicode object... But, because the input isn't valid 7-bit ascii (again, the system's default), it explodes. Crap.
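
Here is the decode side of the same story, spelled out explicitly. Again just a sketch, assuming ascii is the default encoding and that the bytes really are UTF-8 (the variable names are mine, not part of any API):

```python
raw_name = b'I\xc3\xb1t\xc3\xabrn\xc3\xa2ti\xc3\xb4n\xc3\xa0liz\xc3\xa6ti\xc3\xb8n'

try:
    # Roughly what unicode(raw_name) attempts when ascii is the default
    raw_name.decode('ascii')
    bad_byte_position = None
except UnicodeDecodeError as error:
    # error.start is the index of the first byte that could not be decoded
    bad_byte_position = error.start

# Telling Python which encoding the bytes are actually in works
decoded_name = raw_name.decode('utf-8')
```

The bytes were never the problem; the missing piece was knowing (and saying) which encoding they were in.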

Confused yet?

So how can we save ourselves from all this insanity?

Actually, it's not too difficult:

  1. Convert every single string, without exception, to unicode as soon as it enters the system. For example, if you are writing a web application, GET and POST variables should be converted to unicode as soon as they are read from the environment:

    for (key, value) in environment.get_vars:
        request.GET[to_unicode(key)] = to_unicode(value)

  2. Only convert the unicode objects back to str strings when you absolutely must. For example, when they are written to a file:

    log_file.write("New user '%s' created" %(to_str(p.name)))

  3. Think hard before you call str or unicode. Each time your fingers type "s", "t", flashing lights and sirens should go off in your head, reminding you to make sure that the object you are stringifying could never, ever, ever possibly contain unicode.
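
Put together, the three rules look something like the sketch below. The helper names (read_form_value, write_log_line) are hypothetical, the input is assumed to be UTF-8, and an in-memory buffer stands in for a real log file:

```python
import io

def read_form_value(raw_value, encoding='utf-8'):
    # Input boundary: bytes from the outside world become text right away.
    # (Hypothetical helper; assumes the client sent UTF-8.)
    return raw_value.decode(encoding)

def write_log_line(log_file, text, encoding='utf-8'):
    # Output boundary: text becomes bytes only at the moment it is written.
    log_file.write(text.encode(encoding))

name = read_form_value(b'I\xc3\xb1t\xc3\xabrn\xc3\xa2ti')

# Everything in between works on text only, so nothing is implicitly converted
message = u"New user '%s' created" % name

log_file = io.BytesIO()  # stands in for a real file opened in binary mode
write_log_line(log_file, message)
```

Notice that no call to str appears anywhere in the middle: the encoding is named exactly twice, once on the way in and once on the way out.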

And what about those to_unicode and to_str functions? What should they look like? Well, probably something like this:

import locale
def to_unicode(text):
    r""" Convert, at all costs, 'text' to a `unicode` object.

        Note: as a last-ditch effort, this function tries to decode the text
              as latin1... Which will always succeed.  If you expect to get
              text encoded with latin[2-9] or some other character set, this
              may not be desirable.

        >>> to_unicode(u'I\xf1t\xebrn\xe2ti')
        u'I\xf1t\xebrn\xe2ti'
        >>> to_unicode('I\xc3\xb1t\xc3\xabrn\xc3\xa2ti')
        u'I\xf1t\xebrn\xe2ti'
        >>> class Foo:
        ...   def __str__(self):
        ...       return 'foo'
        ...
        >>> f = Foo()
        >>> to_unicode(f)
        u'foo'
        >>> f.__unicode__ = lambda: u'bar'
        >>> to_unicode(f)
        u'bar'
        >>> """

    if isinstance(text, unicode):
        return text

    if hasattr(text, '__unicode__'):
        return text.__unicode__()

    text = str(text)

    try:
        return unicode(text, 'utf-8')
    except UnicodeError:
        pass

    try:
        return unicode(text, locale.getpreferredencoding())
    except UnicodeError:
        pass

    return unicode(text, 'latin1')


def to_str(text):
    r""" Convert 'text' to a `str` object.

        >>> to_str(u'I\xf1t\xebrn\xe2ti')
        'I\xc3\xb1t\xc3\xabrn\xc3\xa2ti'
        >>> to_str(42)
        '42'
        >>> to_str('ohai')
        'ohai'
        >>> class Foo:
        ...     def __str__(self):
        ...         return 'foo'
        ...
        >>> f = Foo()
        >>> to_str(f)
        'foo'
        >>> f.__unicode__ = lambda: u'I\xf1t\xebrn\xe2ti'
        >>> to_str(f)
        'I\xc3\xb1t\xc3\xabrn\xc3\xa2ti'
        >>> """
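    if isinstance(text, str):
        return text

    if hasattr(text, '__unicode__'):
        text = text.__unicode__()
    elif not isinstance(text, unicode):
        # Call __str__ directly: str() would try to ascii-encode a unicode result
        text = text.__str__()

    if isinstance(text, unicode):
        # Encode explicitly instead of relying on the ascii default
        return text.encode('utf-8')

    return text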
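
The order of the fallbacks in to_unicode matters, and the docstring's "will always succeed" claim about latin1 is easy to check for yourself. The sketch below uses a trimmed-down, hypothetical decode_with_fallback (not the function above) that leaves out the locale step:

```python
def decode_with_fallback(raw_bytes, encodings=('utf-8',)):
    # Hypothetical, trimmed-down version of the fallback idea: try each
    # encoding in order, then fall back to latin1.
    for encoding in encodings:
        try:
            return raw_bytes.decode(encoding)
        except UnicodeError:
            pass
    # latin1 maps every byte 0-255 to a code point, so this cannot fail
    return raw_bytes.decode('latin1')

from_utf8 = decode_with_fallback(b'I\xc3\xb1t')   # valid UTF-8
from_latin1 = decode_with_fallback(b'I\xf1t')     # not valid UTF-8

# Skipping the UTF-8 attempt still "succeeds", but with the wrong characters
mojibake = decode_with_fallback(b'I\xc3\xb1t', encodings=())

# Every possible byte value decodes under latin1
all_chars = bytes(bytearray(range(256))).decode('latin1')
```

That is the catch with a fallback that cannot fail: it also cannot tell you when it guessed wrong, which is why it goes last.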

So there you have it. A quick and fairly easy way to avoid many of your encoding-related problems :-)

If you're still not quite feeling comfortable with all of this, though, take a read over Joel Spolsky's The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).

*: My newly installed Debian 4 machine is still running Python 2.4 (released November 2004)... So that wait might be a while.