Python 2.X's str.format is unsafe

September 22, 2011 at 07:33 PM | Python, Unicode | View Comments

I posted a tweet today when I learned that Python's %-string-formatting isn't actually a special case - the str class just implements the __mod__ method.

One side effect of this is that a few people commented that %-formatting is to be replaced with .format formatting... So I'd like to take this opportunity to explain why .format string formatting is unsafe in Python 2.X.

With %-formatting, if the format string is a str while one of the replacements is a unicode the result will be unicode:

>>> "Hello %s" %(u"world", )
u'Hello world'

However, .format will always return the same type of string (str or unicode) as the format string:

>>> "Hello {}".format(u"world")
'Hello world'

This is a problem in Python 2.X because unqualified string literals are instances of str, and the implicit encoding of unicode arguments will almost certainly explode at the least opportune moments:

>>> "Hello {}".format(u"\u263a")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\u263a' in position 0: ordinal not in range(128)

Of course, one possible solution to this is remembering to prefix all string literals with u:

>>> u"Hello {}".format(u"\u263a")
u'Hello \u263a'

But I prefer to simply use %-style formatting, because then I don't need to remember anything:

>>> "Hello %s" %(u"\u263a", )
u'Hello \u263a'
>>> print _.encode('utf-8')
Hello ☺

Of course, as you've probably noticed, this means that the format string is being implicitly decoded to unicode... But since my string literals generally don't contain non-ASCII characters it's not much of an issue.

Note that this is not a problem in Py 3k because string literals are unicode.