Python 2.X's str.format is unsafe

September 22, 2011 at 07:33 PM | Python, Unicode

I posted a tweet today when I learned that Python's %-string-formatting isn't actually a special case - the str class just implements the __mod__ method.
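To make that concrete, here is a tiny sketch of my own (the variable names are just for illustration) showing that the % operator and an explicit call to __mod__ give the same result:

```python
template = "Hello %s"

# The % operator is ordinary operator overloading: it calls __mod__ on the
# format string, passing the replacements as the argument.
via_operator = template % ("world",)
via_method = template.__mod__(("world",))
```

Both names end up bound to the same formatted string.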

One side effect of this is that a few people commented that %-formatting is to be replaced with .format formatting... So I'd like to take this opportunity to explain why .format string formatting is unsafe in Python 2.X.

With %-formatting, if the format string is a str while one of the replacements is a unicode, the result will be unicode:

>>> "Hello %s" %(u"world", )
u'Hello world'

However, .format will always return the same type of string (str or unicode) as the format string:

>>> "Hello {}".format(u"world")
'Hello world'

This is a problem in Python 2.X because unqualified string literals are instances of str, and the implicit encoding of unicode arguments will almost certainly explode at the least opportune moments:

>>> "Hello {}".format(u"\u263a")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\u263a' in position 0: ordinal not in range(128)

Of course, one possible solution to this is remembering to prefix all string literals with u:

>>> u"Hello {}".format(u"\u263a")
u'Hello \u263a'

But I prefer to simply use %-style formatting, because then I don't need to remember anything:

>>> "Hello %s" %(u"\u263a", )
u'Hello \u263a'
>>> print _.encode('utf-8')
Hello ☺

Of course, as you've probably noticed, this means that the format string is being implicitly decoded to unicode... But since my string literals generally don't contain non-ASCII characters it's not much of an issue.

Note that this is not a problem in Py 3k because string literals are unicode.
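If you do want to keep using .format on Python 2, one possible workaround is to make sure the template is text before formatting. This is only a rough sketch: uformat is a hypothetical helper name of my own, and it assumes byte-string templates only ever contain ASCII.

```python
def uformat(template, *args, **kwargs):
    """Format `template`, decoding it first if it is a byte string.

    Hypothetical helper: assumes byte-string templates are pure ASCII.
    """
    if isinstance(template, bytes):
        # Make the implicit conversion explicit, and do it on the template
        # (which we control) rather than on the arguments (which we don't).
        template = template.decode("ascii")
    return template.format(*args, **kwargs)

greeting = uformat(b"Hello {}", u"\u263a")
```

The design choice is the same one %-formatting makes for you: the result is always text, so a non-ASCII argument can't blow up the call.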


The no-good very-bad &#151;

January 22, 2011 at 01:59 PM | Unicode

In today's instalment of Adventures in Unicode, we meet the sneaky &#151;.

When a web browser encounters &#151;, it renders an em-dash (—). However, when &#151; is decoded to Unicode (U+0097, since 0x97 == 151), encoded to UTF-8 (\xc2\x97), written to a file, then opened with exactly the same web browser, the browser renders…

cue ominous music

Nothing!

Nothing is rendered because U+0097 is actually the END OF GUARDED AREA control character[0]… So it shouldn't be rendered.

So why is &#151; being rendered? Because of our old friend, the Windows-1252 encoding, where character 151 is an em-dash. When the browser sees &#151;, it helpfully assumes that the author is an idiot[1] and wanted an em-dash to be displayed instead of a control character[2].
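The two readings of the number 151 can be compared directly. A small sketch (the variable names are mine, purely for illustration):

```python
reference = 151  # the number in the character reference

# Read literally, the reference names code point U+0097, a control character.
literal = u"\u0097"

# Read as a single Windows-1252 byte, the same number is an em-dash (U+2014).
as_cp1252 = bytes(bytearray([reference])).decode("cp1252")

# Saving the literal character as UTF-8 produces the two bytes mentioned above.
utf8_bytes = literal.encode("utf-8")
```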

What can be done?

I have been using a function which looks like this:

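import re

# Characters in this range may really be Windows-1252 bytes in disguise.
_fix_mixed_unicode_re = re.compile(u"([\x7F-\xFF]+)")
def fix_mixed_unicode(mixed_unicode):
    assert isinstance(mixed_unicode, unicode)
    def handle_match(match):
        # Turn the characters back into single bytes, then decode as Windows-1252.
        return match.group(0).encode("raw_unicode_escape").decode("1252")
    return _fix_mixed_unicode_re.sub(handle_match, mixed_unicode)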

It accepts a unicode string, assumes any characters between 127 and 255 are actually Windows-1252 encoded, encodes them as bytes, then decodes those bytes as Windows-1252, yielding a correct unicode string.
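One caveat with the function above: a handful of byte values are undefined in Python's cp1252 codec, so decoding can raise. Below is a more cautious variant, offered as a sketch and not as a drop-in replacement. fix_c1_controls is a hypothetical name, and it only touches U+0080 through U+009F, on the assumption that U+00A0 through U+00FF already mean the same thing in Latin-1 and Windows-1252.

```python
import re

# Only the C1 control range differs between Latin-1 and Windows-1252.
_c1_re = re.compile(u"[\u0080-\u009f]")

def fix_c1_controls(text):
    """Map stray C1 control characters to their Windows-1252 meaning."""
    def handle_match(match):
        char = match.group(0)
        try:
            return char.encode("latin-1").decode("cp1252")
        except UnicodeDecodeError:
            # Bytes 0x81, 0x8d, 0x8f, 0x90 and 0x9d are undefined in
            # Python's cp1252 codec, so leave those characters untouched.
            return char
    return _c1_re.sub(handle_match, text)

fixed = fix_c1_controls(u"before\u0097after")
```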

[0]: Which is represented by a line that looks very similar to an em-dash

[1]: A generally safe assumption.

[2]: It should be noted that this happens regardless of the document's encoding.


Testing for Unicode Safety

February 11, 2009 at 10:30 AM | Unicode

After yesterday's post, Greg suggested I write another on how to test for Unicode safety... And unfortunately I've got some bad news: it's hard.

You never know when some developer, somewhere, will unintentionally encode or decode something the wrong way (for example, log("request for %s", unicode(url))).

But there is hope!

In my experience, almost all Unicode-related issues follow one of two patterns: someone using str or unicode incorrectly, or code which unexpectedly encodes/decodes a string.

The first is easy to check for: grep through the code for str( and unicode(.
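If grep isn't handy, the same check can be roughed out in a few lines of Python. This is a crude sketch: find_suspect_calls is a hypothetical name, and the regular expression will also flag matches inside comments and string literals.

```python
import re

# Match calls to the built-ins, but not names such as to_str( or obj.str(.
_suspect_call_re = re.compile(r"(?<![\w.])(str|unicode)\(")

def find_suspect_calls(source):
    """Return (line_number, line) pairs that call str( or unicode(."""
    hits = []
    for number, line in enumerate(source.splitlines(), 1):
        if _suspect_call_re.search(line):
            hits.append((number, line.strip()))
    return hits

sample = 'name = str(user)\nlog("request for %s", unicode(url))\nto_str(x)\n'
hits = find_suspect_calls(sample)
```

Every hit still needs a human to decide whether the call is actually safe.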

The second is harder to check for, and requires an understanding of the code base: all of the points where the code interacts with other parts of the system (filesystem, database, network) must be found and checked.

Finally, it isn't a bad idea to throw some Unicode into the test suite. Instead of calling mock users 'user0', 'user1', call them u'\u03bcs\xeb\u044f' (u"μsëя")*. Keep a central "database" of these sorts of strings, so it's easy for developers who don't normally write in Cyrillic to use Cyrillic characters in their code (I keep my own personal list at http://wolever.net/~wolever/wiki/unicode_audit -- a URL I can now type from memory).

One word of caution, though: you're asking for a world of pain if you actually think you can commit UTF-8 encoded text -- any number of things will break (subversion may helpfully fail, your editor may helpfully re-encode the file, your unenlightened developers will complain about funny question marks in their code, etc...). Instead, have a central file which defines these "canned test strings" using escaped Python strings (i.e., u'\u03bc...') then import that into your test suite:

from app.tests import i18n
...
def test_user():
    u = User(name=i18n.user)
    ...

Or tar up all the offensive files, then write a script to un-tar them when they are needed**.
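The central file of canned test strings might look something like the sketch below. The file layout is hypothetical, and every name apart from user is made up for illustration; the point is only that the source stays pure ASCII.

```python
# Hypothetical contents of app/tests/i18n.py. Only escapes are used, so the
# file itself never contains a non-ASCII byte.
user = u"\u03bcs\xeb\u044f"            # Greek, ASCII, Latin-1 and Cyrillic
filename = u"r\xe9sum\xe9-\u044f.txt"  # for tests that touch the filesystem

# Handy for tests that want to loop over every canned string.
all_strings = [user, filename]
```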

So, to sum it up:

  • Make sure your developers grok (or, at least, understand) Unicode and encodings
  • Make sure your code uses str and unicode safely
  • Make sure your exit points are covered
  • Make it really easy to include Unicode in tests

Then maybe, if you're lucky, all those inconsiderate people who have the audacity to ask for more than 127 characters will be able to use your application :-)

*: A good choice because it's easy for ignorant North Americans like myself to see that it's correct.

**: This is how I tested DrProject's handling of Unicode filenames which are checked into the Subversion repository.


str(...): 'yer probably doin' it wrong.

February 10, 2009 at 11:13 AM | Python, Unicode

Unicode is an ugly beast... And until people start standardizing on Python 3k*, we're going to have to live with the eccentricities of Python 2's strings.

But, fear not! There is (at least some) hope. By changing a few patterns in the way you code, you can alleviate the bulk of Unicode-related problems.

First, using the str function. In just about every case, if you're using the str function, you're probably doing it wrong.

Let me demonstrate:

from hashlib import sha256
def hash(to_hash):
  hash = sha256(to_hash).hexdigest()
  print to_hash, ":", hash

Cool, we can hash things then print out the hash:

>>> hash('ohai')
ohai : e84712238709398f6d349dc2250b0efca4b72d8c2bfb7b74339d30ba94056b14

But, wait... What happens if the thing we're hashing isn't a string (even though it can be represented as a string):

>>> class Person:
...   name = 'David'
...   def __str__(self):
...     return self.name
...
>>> hash(Person())
...
TypeError: new() argument 1 must be string or read-only buffer, not instance

Oh no! Ok, let's fix the code:

from hashlib import sha256
def hash(to_hash):
  to_hash = str(to_hash) # Convert the object to a string before we hash it
  hash = sha256(to_hash).hexdigest()
  print to_hash, ":", hash

Great -- we can hash a Person now:

>>> hash(Person())
David : a6b54c20a7b96eeac1a911e6da3124a560fe6dc042ebf270e3676e7095b95652

And, for most people who only speak English, this is a perfect place to stop. After all, everything is a str, right?

Well... No. What happens if the input is a unicode object?

>>> p = Person()
>>> # This person had the audacity to give themselves a name containing
>>> # non-ascii symbols, so we represent it with a unicode object
>>> p.name = u'I\xf1t\xebrn\xe2ti\xf4n\xe0liz\xe6ti\xf8n'
>>> hash(p)
...
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf1' in position
1: ordinal not in range(128)

Crap. Where the heck is 'ascii' coming from?

Well, it's a long story (which I've covered over at Encoding and Decoding Text in Python), but basically the __str__ method of the unicode object (u'I\xf1...') is trying to encode the unicode object using the system's default encoding... Which, in this case, is ascii.
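The implicit step can be written out by hand. A minimal sketch, assuming (as on Python 2) that the default encoding is ascii; the variable names are mine:

```python
name = u'I\xf1t\xebrn\xe2ti\xf4n\xe0liz\xe6ti\xf8n'

# On Python 2, str(name) behaves like name.encode('ascii'), because 'ascii'
# is the default encoding there.
try:
    name.encode('ascii')
    error = None
except UnicodeEncodeError as exc:
    error = exc

# Naming an encoding that can represent the text avoids the problem.
encoded = name.encode('utf-8')
```

The exception records the offending position, which matches the "position 1" in the traceback above.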

"Alright...", you're probably thinking, "If the problem is with unicode, maybe I could just replace that call to str with a call to unicode"

Ok, let's see what happens.

from hashlib import sha256
def hash(to_hash):
  to_hash = unicode(to_hash) # Convert the object to a unicode before we hash it
  hash = sha256(to_hash).hexdigest()
  print to_hash, ":", hash

>>> # this time, though, the person's name has come from The Internet, so it
>>> # is not yet a unicode object
>>> p.name = 'I\xc3\xb1t\xc3\xabrn\xc3\xa2ti\xc3\xb4n\xc3\xa0liz\xc3\xa6ti\xc3\xb8n'
>>> hash(p)
...
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1:
ordinal not in range(128)

Yup, that's right -- you just can't win :-(

What's happening here? Well, this time, the unicode function is trying to decode the input ('I\xc3...') into a unicode object... But, because the input isn't valid 7-bit ascii (again, the system's default), it explodes. Crap.

Confused yet?

So how can we save ourselves from all this insanity?

Actually, it's not too difficult:

  1. Convert every single string, without exception, to unicode as soon as it enters the system. For example, if you are writing a web application, GET and POST variables should be converted to unicode as soon as they are read from the environment:

    for (key, value) in environment.get_vars:
        request.GET[to_unicode(key)] = to_unicode(value)

  2. Only convert the unicode objects back to str strings when you absolutely must. For example, when they are written to a file:

    log_file.write("New user '%s' created" %(to_str(p.name)))

  3. Think hard before you call str or unicode. Each time your fingers type "s", "t", "r", flashing lights and sirens should go off in your head, reminding you to make sure that the object you are stringifying could never, ever, ever possibly contain unicode.
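The first two rules in the list above amount to "decode on the way in, encode on the way out". Here is a minimal sketch of that shape, using in-memory streams in place of real files and assuming the input is UTF-8. read_names and write_log are hypothetical names, not part of any framework.

```python
import io

def read_names(stream, encoding="utf-8"):
    """Decode at the boundary: bytes come in, unicode goes out."""
    return [line.decode(encoding).rstrip(u"\n") for line in stream]

def write_log(stream, names, encoding="utf-8"):
    """Encode at the boundary: unicode comes in, bytes go out."""
    for name in names:
        message = u"New user '%s' created\n" % (name,)
        stream.write(message.encode(encoding))

incoming = io.BytesIO(b"I\xc3\xb1t\xc3\xabrn\n")
names = read_names(incoming)

log_file = io.BytesIO()
write_log(log_file, names)
```

Everything between the two functions only ever sees unicode objects, which is the whole point.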

And what about those to_unicode and to_str functions? What should they look like? Well, probably something like this:

import locale
def to_unicode(text):
    """ Convert, at all costs, 'text' to a `unicode` object.

        Note: as a last-ditch effort, this function tries to decode the text
              as latin1... Which will always succeed.  If you expect to get
              text encoded with latin[2-9] or some other character set, this
              may not be desirable.

        >>> to_unicode(u'I\xf1t\xebrn\xe2ti')
        u'I\xf1t\xebrn\xe2ti'
        >>> to_unicode('I\xc3\xb1t\xc3\xabrn\xc3\xa2ti')
        u'I\xf1t\xebrn\xe2ti'
        >>> class Foo:
        ...   def __str__(self):
        ...       return 'foo'
        ...
        >>> f = Foo()
        >>> to_unicode(f)
        u'foo'
        >>> f.__unicode__ = lambda: u'bar'
        >>> to_unicode(f)
        u'bar'
        >>> """

    if isinstance(text, unicode):
        return text

    if hasattr(text, '__unicode__'):
        return text.__unicode__()

    text = str(text)

    try:
        return unicode(text, 'utf-8')
    except UnicodeError:
        pass

    try:
        return unicode(text, locale.getpreferredencoding())
    except UnicodeError:
        pass

    return unicode(text, 'latin1')


def to_str(text):
    """ Convert 'text' to a `str` object.

        >>> to_str(u'I\xf1t\xebrn\xe2ti')
        'I\xc3\xb1t\xc3\xabrn\xc3\xa2ti'
        >>> to_str(42)
        '42'
        >>> to_str('ohai')
        'ohai'
        >>> class Foo:
        ...     def __str__(self):
        ...         return 'foo'
        ...
        >>> f = Foo()
        >>> to_str(f)
        'foo'
        >>> f.__unicode__ = lambda: u'I\xf1t\xebrn\xe2ti'
        >>> to_str(f)
        'I\xc3\xb1t\xc3\xabrn\xc3\xa2ti'
        >>> """
    if isinstance(text, str):
        return text

    if hasattr(text, '__unicode__'):
        text = text.__unicode__()

    # Encode unicode explicitly: unicode.__str__ would use the ascii codec.
    if isinstance(text, unicode):
        return text.encode('utf-8')

    return str(text)

So there you have it. A quick and fairly easy way to avoid many of your encoding-related problems :-)

If you're still not quite feeling comfortable with all of this, though, take a read over Joel Spolsky's The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

*: My newly installed Debian 4 machine is still running Python 2.4 (released November 2004)... So that wait might be a while.


Encoding and Decoding Text in Python (or: "I didn't ask you to use the 'ascii' codec!")

May 01, 2008 at 01:27 PM | Unicode

When dealing with Unicode in Python, it doesn't take long to get the dreaded 'ascii' codec can't decode byte 0xc3 in position 2: ordinal not in range(128).

You never see it coming. It doesn't make any sense. You didn't even ask for ascii!

So what's the deal?

I'm glad you asked. I will demonstrate:

>>> s = file("data").read()
>>> s
'SGVsbG8sIHdvcmxkIQ==\n'

If you guessed that s is a hunk of base64 encoded data, you'd be right! Give yourself a gold star.

Now, if we want to do anything useful with this data, it needs to be decoded:

>>> s.decode('base64')
'Hello, world!'

We have just taken an encoded hunk of data and decoded it to get a useful hunk of data.

>>> s.decode('base64').replace('world', 'Marguerite')
'Hello, Marguerite!'
>>> _.encode('base64')
'SGVsbG8sIE1hcmd1ZXJpdGUh\n'

Now we can take that useful hunk of data (the English in 7-bit ASCII), do something useful with it (in this case, replace 'world' with 'Marguerite'), and finally encode the data.
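The 'base64' string codec used above only exists on Python 2 strings. For what it's worth, the same round trip can be sketched with the standard base64 module, which works on byte strings (the variable names are mine):

```python
import base64

s = b"SGVsbG8sIHdvcmxkIQ==\n"

# Decode the opaque hunk of data into something useful...
decoded = base64.b64decode(s)

# ...do something useful with it...
replaced = decoded.replace(b"world", b"Marguerite")

# ...and encode it again. Unlike the 'base64' codec, b64encode does not
# append a trailing newline.
encoded = base64.b64encode(replaced)
```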

So how does all this relate back to Unicode and ascii error messages?

I have used base64 encoded data here, but the same concept applies when dealing with Unicode data:

  1. Hunk of opaque data comes in (but we know that it contains some sort of Unicode text)
  2. Hunk of opaque data is decoded, creating a unicode object
  3. The unicode object is used for something useful
  4. The unicode object is encoded and saved (to disk, to a database, or sent to a browser)

(of course, in the Real World, you've got to figure out which encoding was used on the data (UTF-8, Latin1, etc)... But that's a topic for another post.)
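The four steps above can be sketched like this, assuming we already know the incoming data is UTF-8 (which, in the Real World, is the hard part):

```python
# 1. A hunk of opaque data comes in.
raw = b"Ol\xc3\xa1, mundo!"

# 2. It is decoded, creating a unicode object.
text = raw.decode("utf-8")

# 3. The unicode object is used for something useful.
shouted = text.upper()

# 4. The result is encoded again before it leaves the program.
out = shouted.encode("utf-8")
```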

Ok, back to the 'ascii' codec can't decode byte 0xc3 in position 2: ordinal not in range(128) error. It should be fairly clear that this error is coming up because Python is trying to decode a bunch of bytes as 7-bit ASCII, but some of them are out of that range (that is, they have a value over 127).

I know what you're saying, "but I never asked Python to decode anything! I'm just trying to turn it into unicode!"

>>> unicode("Ol\xc3\xa1, mundo!")
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 2: ordinal not in range(128)

Two questions arise here: First, "Where is the 'ascii' coming from?" Second, "How do I make it work?"

To answer the first question, it's important to think about what's happening when the call to unicode(...) is made. The unicode function accepts an encoded string, decodes it, and creates a unicode object. In this case, though, we haven't given the function any indication of which decoder it should use, so it falls back to Python's default encoding (what sys.getdefaultencoding() returns), which on Python 2 is ascii.

So how can you make it work? Tell unicode which encoding to use:

>>> unicode("Ol\xc3\xa1, mundo!", 'utf8')
u'Ol\xe1, mundo!'

(now, as I mentioned before, figuring out which encoding to use is another huge problem... But I'll leave that for another day)

Another problem I run into quite often is this:

>>> "Ol\xc3\xa1, mundo!".encode('utf8')
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 2: ordinal not in range(128)

And, by now, the cause of this should be painfully obvious: I've given Python an encoded string, so I should be decoding it, not encoding it again.

But why the confusing error message? Well, I'm not entirely sure, but my guess is that the UTF-8 encoder expects a unicode object, so it tries to convert the input (in this case, "Ol\xc3...") to Unicode before encoding it.
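That guess can be checked by writing the hidden step out explicitly. A sketch, assuming the default encoding is ascii as it is on Python 2:

```python
raw = b"Ol\xc3\xa1, mundo!"

# Roughly what Python 2 does for raw.encode('utf8'): decode with the
# default codec first, then encode the result.
try:
    raw.decode("ascii").encode("utf8")
    error = None
except UnicodeDecodeError as exc:
    error = exc

# Decoding with the codec the bytes were actually written in works fine.
text = raw.decode("utf8")
```

The failure happens in the decode half, which is why an encode call ends up reporting a decode error.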

Is there any end to this insanity?!

Yes! Python 3000 will have two distinct classes: one for strings, one for hunks of data. Whenever data is read, it will come in as a "hunk of data". It will have to be explicitly decoded to a string before it can be used as such. Hopefully that will make life a little bit less painful.
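A rough sketch of what that separation looks like in practice, assuming a Python 3 interpreter (on Python 2 the concatenation below would attempt an implicit decode instead of raising):

```python
data = b"Ol\xc3\xa1, mundo!"  # a hunk of data (bytes)

# Mixing data and text is a TypeError on Python 3, not an implicit decode.
try:
    data + u"!"
    mixed_ok = True
except TypeError:
    mixed_ok = False

# An explicit decode is the only way to get a string out of the data.
text = data.decode("utf-8")
```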
