Encoding and Decoding Text in Python (or: "I didn't ask you to use the 'ascii' codec!")

May 01, 2008 at 01:27 PM | Unicode

When dealing with Unicode in Python, it doesn't take long to get the dreaded 'ascii' codec can't decode byte 0xc3 in position 2: ordinal not in range(128).

You never see it coming. It doesn't make any sense. You didn't even ask for ascii!

So what's the deal?

I'm glad you asked. I will demonstrate:

>>> s = file("data").read()
>>> s
'SGVsbG8sIHdvcmxkIQ==\n'

If you guessed that s is a hunk of base64 encoded data, you'd be right! Give yourself a gold star.

Now, if we want to do anything useful with this data, it needs to be decoded:

>>> s.decode('base64')
'Hello, world!'

We have just taken an encoded hunk of data and decoded it to get a useful hunk of data.

>>> s.decode('base64').replace('world', 'Marguerite')
'Hello, Marguerite!'
>>> _.encode('base64')
'SGVsbG8sIE1hcmd1ZXJpdGUh\n'

Now we can take that useful hunk of data (the English in 7-bit ASCII), do something useful with it (in this case, replace 'world' with 'Marguerite'), and finally encode the data.
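The str.decode('base64') shortcut used above only exists in Python 2. As a rough sketch of the same round trip on Python 3 (an assumption on my part, since the examples here are Python 2), the standard base64 module does the job. The variable names are purely illustrative.

```python
import base64

# The same payload that was read from the "data" file above.
encoded = b"SGVsbG8sIHdvcmxkIQ==\n"

# Decode the base64 bytes into the raw bytes they represent.
raw = base64.b64decode(encoded)

# Do something useful with the decoded data.
edited = raw.replace(b"world", b"Marguerite")

# Encode the result again before storing or sending it.
reencoded = base64.b64encode(edited)

print(raw)
print(edited)
print(reencoded)
```

Note that b64encode does not append the trailing newline that the Python 2 'base64' codec adds.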

So how does all this relate back to Unicode and ascii error messages?

I have used base64 encoded data here, but the same concept applies when dealing with Unicode data:

  1. Hunk of opaque data comes in (but we know that it contains some sort of Unicode text)
  2. Hunk of opaque data is decoded, creating a unicode object
  3. The unicode object is used for something useful
  4. The unicode object is encoded and saved (to disk, to a database, or sent to a browser)

(Of course, in the Real World, you first have to figure out which encoding was used on the data: UTF-8, Latin-1, etc. But that's a topic for another post.)
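The four steps above can be sketched as follows. This is written for Python 3 (an assumption; the examples in this post are Python 2), where the unicode object is simply called str. The function name shout and the choice of UTF-8 are hypothetical, picked only for illustration.

```python
def shout(raw, encoding="utf-8"):
    """Decode bytes, work on the text, then encode the result."""
    text = raw.decode(encoding)      # step 2: bytes -> text
    text = text.upper()              # step 3: do something useful
    return text.encode(encoding)     # step 4: text -> bytes


incoming = b"Ol\xc3\xa1, mundo!"     # step 1: opaque bytes arrive
outgoing = shout(incoming)
print(outgoing)
```

The point of the shape is that decoding and encoding happen only at the edges, and everything in between works on text.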

OK, back to the 'ascii' codec can't decode byte 0xc3 in position 2: ordinal not in range(128) error. It should be fairly clear that this error comes up because Python is trying to decode a bunch of bytes as 7-bit ASCII, but some of them are out of that range (i.e., they have a value over 127).
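To see the error in isolation, here is a small sketch that asks for the ascii codec explicitly and then inspects the exception. It is written for Python 3 (an assumption; the rest of this post uses Python 2), and the variable names are illustrative only.

```python
data = b"Ol\xc3\xa1, mundo!"

try:
    data.decode("ascii")
except UnicodeDecodeError as err:
    # err.start is the index of the first byte the codec rejected.
    position = err.start
    bad_byte = data[position]
    message = str(err)

print(message)
print(position, hex(bad_byte))
```

The "position 2" in the message is simply the index of the first byte above 127.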

I know what you're saying: "But I never asked Python to decode anything! I'm just trying to turn it into unicode!"

>>> unicode("Ol\xc3\xa1, mundo!")
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 2: ordinal not in range(128)

Two questions arise here: First, "Where is the 'ascii' coming from?" Second, "How do I make it work?"

To answer the first question, it's important to think about what's happening when the call to unicode(...) is made. The unicode function accepts an encoded string, decodes it, and creates a unicode object. In this case, though, we haven't given the function any indication of which decoder it should use, so it falls back to Python's default encoding (the value returned by sys.getdefaultencoding()), which is ascii.
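If you want to check where that fallback comes from, the default codec can be inspected directly. A minimal sketch; on Python 2 this reports 'ascii', while on Python 3 it reports 'utf-8'.

```python
import sys

# The codec the interpreter uses when it has to pick one on its own.
# Python 2 reports 'ascii' here; Python 3 reports 'utf-8'.
default_codec = sys.getdefaultencoding()
print(default_codec)
```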

So how can you make it work? Tell unicode which encoding to use:

>>> unicode("Ol\xc3\xa1, mundo!", 'utf8')
u'Ol\xe1, mundo!'

(Now, as I mentioned before, figuring out which encoding to use is another huge problem. But I'll leave that for another day.)
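Here is a rough Python 3 version of the same fix (an assumption; the example above is Python 2), along with what happens when the wrong encoding is named. The variable names are illustrative only.

```python
data = b"Ol\xc3\xa1, mundo!"

right = data.decode("utf-8")      # the bytes really are UTF-8
also_right = str(data, "utf-8")   # same result, constructor form
wrong = data.decode("latin-1")    # no error, but the text is garbled

print(repr(right))
print(repr(wrong))
```

Latin-1 accepts any byte value, so the wrong guess does not raise an error. It quietly produces two characters where there should be one, which is why knowing the real encoding matters.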

Another problem I run into quite often is this:

>>> "Ol\xc3\xa1, mundo!".encode('utf8')
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 2: ordinal not in range(128)

And, by now, the cause of this should be painfully obvious: I've given Python an encoded string, so I should be decoding it, not encoding it again.

But why the confusing error message? The UTF-8 encoder expects a unicode object, so when .encode() is called on a byte string, Python 2 first tries to convert the input (in this case, "Ol\xc3...") to Unicode using the default 'ascii' codec. That hidden decode is what fails, which is why a call to encode reports a decode error.
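That implicit decode-then-encode behaviour can be emulated in a few lines. This sketch is written for Python 3 (an assumption, since Python 3 no longer does this on its own), and py2_style_encode is a hypothetical helper, not a real library function.

```python
def py2_style_encode(data, encoding):
    """Mimic Python 2's str.encode(): decode with ascii first, then encode."""
    text = data.decode("ascii")   # the hidden step that uses the default codec
    return text.encode(encoding)


# Pure ASCII input survives the hidden decode.
plain = py2_style_encode(b"Hello", "utf-8")
print(plain)

# Anything above 127 fails before the encoder is ever reached.
try:
    py2_style_encode(b"Ol\xc3\xa1, mundo!", "utf-8")
except UnicodeDecodeError as err:
    failure = str(err)
    print(failure)
```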

Is there any end to this insanity?!

Yes! Python 3000 will have two distinct classes: one for strings (str) and one for hunks of data (bytes). Whenever raw data is read, it will come in as a "hunk of data", and it will have to be explicitly decoded to a string before it can be used as such. Hopefully that will make life a little bit less painful.
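In Python 3 as released, those two classes are str and bytes. A short sketch of how the separation rules out the mistakes shown earlier (variable names are illustrative only):

```python
data = b"Ol\xc3\xa1, mundo!"   # bytes: a hunk of data
text = data.decode("utf-8")    # str: actual text

# bytes has no encode() method, so the double-encoding mistake cannot happen.
has_encode = hasattr(data, "encode")

# Mixing the two types fails loudly instead of guessing a codec.
try:
    text + data
except TypeError as err:
    mix_error = type(err).__name__

print(has_encode, mix_error)
```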
