In MySQL land, "latin1" isn't actually latin1

March 20, 2012 at 02:41 AM | Unicode

Lesson learned: in MySQL land, "latin1" isn't actually latin1 — it's cp1252[0].

The consequence? Magic. Everything will appear to work until a connection character encoding is specified, a SELECT INTO OUTFILE is issued, or you start to notice that unicode data is taking up two or three times more disk space than it reasonably should.

More specifically: when no connection character set is specified, MySQL defaults to using "latin1". Additionally, programmers will occasionally send utf8-encoded data over a MySQL connection without setting the connection's character set… which leads to unexpected results under the conditions described above.

For example, imagine that the string u"☃" is encoded as utf8 ("\xe2\x98\x83") and sent to MySQL over a connection using the cp1252 character set (the default if no SET CHARACTER SET command is issued). MySQL will receive these three bytes, then decode them as cp1252, yielding three unicode code points: u"\xe2\u02dc\u0192". These three code points are then stored to disk using the column's character set (for example, if the column's character set is utf8, the bytes "\xc3\xa2\xcb\x9c\xc6\x92" will be written to disk):

>>> u"☃"
u'\u2603'
>>> _.encode("utf8")
"\xe2\x98\x83"
>>> _.decode("cp1252")
u"\xe2\u02dc\u0192"
>>> _.encode("utf8")
"\xc3\xa2\xcb\x9c\xc6\x92"

Next, when that string is sent back to a client, the bytes are read from disk and decoded using the column's character set: "\xc3\xa2\xcb\x9c\xc6\x92" decodes to u"\xe2\u02dc\u0192". This string is then encoded using the connection's character set and the resulting bytes are sent back to the client: u"\xe2\u02dc\u0192" encodes to "\xe2\x98\x83" — the "correct" utf8 bytes:

>>> "\xc3\xa2\xcb\x9c\xc6\x92".decode("utf8")
u"\xe2\u02dc\u0192"
>>> _.encode("cp1252")
"\xe2\x98\x83"
>>> _.decode("utf8")
u'\u2603'
>>> print _
☃

And the client will continue to see the "correct" utf8 bytes until the last "encode as cp1252" step is omitted… For example, because the connection's character set has changed, or because the SELECT INTO OUTFILE command is issued[1].

In cases when the last "encode as cp1252" step is omitted, results will seem very strange. For example, if the SET CHARACTER SET binary command is issued (to simulate a SELECT INTO OUTFILE), the bytes "\xc3\xa2\xcb\x9c\xc6\x92" will be returned, and similar things will happen if the connection encoding is set to utf8.

Note also that six bytes are being used to store what should be three utf8 bytes.

With the luxury of planning and foresight, this madness could have been avoided by:

  • Issuing SET CHARACTER SET utf8 at the start of connections (see the sketch after this list).
  • Ensuring that (unless there is a good reason not to), databases have DEFAULT CHARACTER SET utf8.
  • Ensuring that only utf8 bytes are sent to the database.
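For example, with the Python MySQLdb driver the first point can be handled once, at connect time (a minimal sketch; the host, credentials, and query are placeholders):

import MySQLdb

# charset="utf8" makes the driver issue the equivalent of SET NAMES utf8,
# and use_unicode=True returns text columns as unicode objects instead of
# raw bytes.
conn = MySQLdb.connect(host="localhost", user="me", passwd="s3kr3t",
                       db="mydb", charset="utf8", use_unicode=True)
cursor = conn.cursor()
cursor.execute("SELECT name FROM users")
print cursor.fetchone()[0]  # a correctly-decoded unicode object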

But, as is so often the case, the particular data which led to this discovery were generated by a PHP application that is out of my control... So for now, I will be living with .decode("utf8").encode("cp1252").decode("utf8").
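Wrapped up as a function, so at least the madness lives in one place (a sketch; fix_mojibake is my name for it, nothing standard):

def fix_mojibake(raw_bytes):
    # Interpret the stored bytes as utf8, re-encode the resulting code
    # points as cp1252 (recovering the original utf8 bytes), then decode
    # those bytes as utf8.
    return raw_bytes.decode("utf8").encode("cp1252").decode("utf8")

>>> fix_mojibake("\xc3\xa2\xcb\x9c\xc6\x92")
u'\u2603'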

(thanks to Taavi Burns, who pointed out that MySQL assumes "latin1" means "cp1252", making it possible to solve my original problem)

[0]: as documented here: http://dev.mysql.com/doc/refman/5.0/en/charset-we-sets.html (fun fact: at the top of the Wikipedia entry for 8859-1 (latin1), there is the notice: “For the character encoding commonly mislabeled as "ISO-8859-1", see Windows-1252”).

[1]: SELECT INTO OUTFILE uses the column's encoding, not the connection's: http://dev.mysql.com/doc/refman/5.0/en/select-into.html


Python 2.X's str.format is unsafe

September 22, 2011 at 07:33 PM | Python, Unicode

I posted a tweet today when I learned that Python's %-string-formatting isn't actually a special case: the str class just implements the __mod__ method.

One side effect of this was that a few people commented that %-formatting is slated to be replaced by .format formatting... So I'd like to take this opportunity to explain why .format string formatting is unsafe in Python 2.X.

With %-formatting, if the format string is a str while one of the replacements is a unicode, the result will be unicode:

>>> "Hello %s" %(u"world", )
u'Hello world'

However, .format will always return the same type of string (str or unicode) as the format string:

>>> "Hello {}".format(u"world")
'Hello world'

This is a problem in Python 2.X because unqualified string literals are instances of str, and the implicit encoding of unicode arguments will almost certainly explode at the least opportune moments:

>>> "Hello {}".format(u"\u263a")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\u263a' in position 0: ordinal not in range(128)

Of course, one possible solution to this is remembering to prefix all string literals with u:

>>> u"Hello {}".format(u"\u263a")
u'Hello \u263a'

But I prefer to simply use %-style formatting, because then I don't need to remember anything:

>>> "Hello %s" %(u"\u263a", )
u'Hello \u263a'
>>> print _.encode('utf-8')
Hello ☺

Of course, as you've probably noticed, this means that the format string is being implicitly decoded to unicode... But since my string literals generally don't contain non-ASCII characters, it's not much of an issue.
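(For completeness, here is what happens when a str format string does contain non-ASCII bytes: the implicit decode blows up in exactly the same way:)

>>> "Hello \xe2\x98\x83 %s" %(u"world", )
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 6: ordinal not in range(128)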

Note that this is not a problem in Py 3k because string literals are unicode.


The no-good very-bad &#151;

January 22, 2011 at 01:59 PM | Unicode

In today's instalment of Adventures in Unicode, we meet the sneaky &#151;.

When a web browser encounters &#151;, it renders an em-dash (—). However, when &#151; is decoded to Unicode (U+0097; 97 in base 16 == 151 in base 10), encoded to UTF-8 (\xc2\x97), written to a file, then opened with exactly the same web browser, the browser renders…

cue ominous music

Nothing!

Nothing is rendered because U+0097 is actually the END OF GUARDED AREA control character[0]… So it shouldn't be rendered.

So why is &#151; being rendered? Because of our old friend, the Windows-1252 encoding, where character 151 is an em-dash. So when the browser sees &#151;, it helpfully assumes that the author is an idiot[1] who wanted an em-dash to be displayed instead of a control character[2].
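(A quick check at a Python prompt confirms the cp1252 story:)

>>> import unicodedata
>>> "\x97".decode("cp1252")
u'\u2014'
>>> unicodedata.name(_)
'EM DASH'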

What can be done?

I have been using a function which looks like this:

import re

# Runs of code points in 0x7F-0xFF are probably mis-decoded Windows-1252
_fix_mixed_unicode_re = re.compile(u"([\x7f-\xff]+)")

def fix_mixed_unicode(mixed_unicode):
    assert isinstance(mixed_unicode, unicode)
    def handle_match(match):
        # Back to raw bytes, then decode those bytes as Windows-1252
        return match.group(0).encode("raw_unicode_escape").decode("cp1252")
    return _fix_mixed_unicode_re.sub(handle_match, mixed_unicode)

It accepts a unicode string, assumes that any characters between 127 and 255 are actually mis-decoded Windows-1252 bytes, encodes them back to bytes, then decodes those bytes as cp1252, yielding a correct unicode string.
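For example:

>>> fix_mixed_unicode(u"an em-dash: \x97")
u'an em-dash: \u2014'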

[0]: Which is represented by a line that looks very similar to an em-dash

[1]: A generally safe assumption.

[2]: It should be noted that this happens regardless of the document's encoding.


Testing for Unicode Safety

February 11, 2009 at 10:30 AM | Unicode

After yesterday's post, Greg suggested I write another on how to test for Unicode safety... And unfortunately I've got some bad news: it's hard.

You never know when some developer, somewhere, will unintentionally encode or decode something the wrong way (for example, log("request for %s", unicode(url))).

But there is hope!

In my experience, almost all Unicode-related issues come in one of two flavours: someone using str or unicode incorrectly, or code which unexpectedly encodes/decodes a string.

The first is easy to check for: grep through the code for str( and unicode(.

The second is harder to check for, and requires an understanding of the whole code base: all of the points where the code interacts with other parts of the system (filesystem, database, network) must be found and checked.

Finally, it isn't a bad idea to throw some Unicode into the test suite. Instead of calling mock users 'user0', 'user1', call them u'\u03bcs\xeb\u044f' (u"μsëя")*. Keep a central "database" of these sorts of strings, so it's easy for developers who don't normally write in Cyrillic to use Cyrillic characters in their code (I keep my own personal list at http://wolever.net/~wolever/wiki/unicode_audit -- a url I can now type from memory).

One word of caution, though: you're asking for a world of pain if you actually think you can commit UTF-8 encoded text -- any number of things will break (subversion may helpfully fail, your editor may helpfully re-encode the file, your unenlightened developers will complain about funny question marks in their code, etc...). Instead, have a central file which defines these "canned test strings" using escaped Python strings (ie, u'\u03bc...') then import that into your test suite:

from app.tests import i18n
...
def test_user():
    u = User(name=i18n.user)
    ...
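The central file itself (a hypothetical app/tests/i18n.py; the module and attribute names are mine) can be as simple as:

# app/tests/i18n.py -- canned non-ASCII test strings, kept as escaped
# literals so no tool in the pipeline ever sees raw non-ASCII bytes.
user = u'\u03bcs\xeb\u044f'       # Greek mu, s, e-with-diaeresis, Cyrillic ya
filename = u'r\xe9sum\xe9.txt'    # "resume" with acute accents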

Or tar up all the offensive files, then write a script to un-tar them when they are needed**.

So, to sum it up:

  • Make sure your developers grok (or, at least, understand) Unicode and encodings
  • Make sure your code uses str and unicode safely
  • Make sure your exit points are covered
  • Make it really easy to include Unicode in tests

Then maybe, if you're lucky, all those inconsiderate people who have the audacity to ask for more than 127 characters will be able to use your application :-)

*: A good choice because it's easy for ignorant North Americans like myself to see that it's correct.

**: This is how I tested DrProject's handling of Unicode filenames which are checked into the Subversion repository.


str(...): 'yer probably doin' it wrong.

February 10, 2009 at 11:13 AM | Python, Unicode

Unicode is an ugly beast... And until people start standardizing on Python 3k*, we're going to have to live with the eccentricities of Python 2's strings.

But, fear not! There is (at least some) hope. By changing a few patterns in the way you code, you can alleviate the bulk of Unicode-related problems.

First up: the str function. In just about every case, if you're using the str function, you're probably doing it wrong.

Let me demonstrate:

from hashlib import sha256
def hash(to_hash):
  hash = sha256(to_hash).hexdigest()
  print to_hash, ":", hash

Cool, we can hash things then print out the hash:

>>> hash('ohai')
ohai : e84712238709398f6d349dc2250b0efca4b72d8c2bfb7b74339d30ba94056b14

But, wait... What happens if the thing we're hashing isn't a string (even though it can be represented as a string):

>>> class Person:
...   name = 'David'
...   def __str__(self):
...     return self.name
...
>>> hash(Person())
...
TypeError: new() argument 1 must be string or read-only buffer, not instance

Oh no! Ok, let's fix the code:

from hashlib import sha256
def hash(to_hash):
  to_hash = str(to_hash) # Convert the object to a string before we hash it
  hash = sha256(to_hash).hexdigest()
  print to_hash, ":", hash

Great -- we can hash arbitrary objects now:

>>> hash(Person())
David : a6b54c20a7b96eeac1a911e6da3124a560fe6dc042ebf270e3676e7095b95652

And, for most people who only speak English, this is a perfect place to stop. After all, everything is a str, right?

Well... No. What happens if the input is a unicode object?

>>> p = Person()
>>> # This person had the audacity to give themselves a name containing
>>> # non-ascii symbols, so we represent it with a unicode object
>>> p.name = u'I\xf1t\xebrn\xe2ti\xf4n\xe0liz\xe6ti\xf8n'
>>> hash(p)
...
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf1' in position
1: ordinal not in range(128)

Crap. Where the heck is 'ascii' coming from?

Well, it's a long story (which I've covered over at Encoding and Decoding Text in Python), but basically the __str__ method of the unicode object (u'I\xf1...') is trying to encode the unicode object using the system's default encoding... Which, in this case, is ascii.
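(You can check that default for yourself:)

>>> import sys
>>> sys.getdefaultencoding()
'ascii'
>>> str(u'I\xf1t')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf1' in position 1: ordinal not in range(128)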

"Alright...", you're probably thinking, "If the problem is with unicode, maybe I could just replace that call to str with a call to unicode"

Ok, let's see what happens.

from hashlib import sha256
def hash(to_hash):
  to_hash = unicode(to_hash) # Convert the object to a unicode before we hash it
  hash = sha256(to_hash).hexdigest()
  print to_hash, ":", hash

>>> # this time, though, the person's name has come from The Internet, so it
>>> # is not yet a unicode object
>>> p.name = 'I\xc3\xb1t\xc3\xabrn\xc3\xa2ti\xc3\xb4n\xc3\xa0liz\xc3\xa6ti\xc3\xb8n'
>>> hash(p)
...
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1:
ordinal not in range(128)

Yup, that's right -- you just can't win :-(

What's happening here? Well, this time, the unicode function is trying to decode the input ('I\xc3...') into a unicode object... But, because the input isn't valid 7-bit ascii (again, the system's default), it explodes. Crap.

Confused yet?

So how can we save ourselves from all this insanity?

Actually, it's not too difficult:

  1. Convert every single string, without exception, to unicode as soon as it enters the system. For example, if you are writing a web application, GET and POST variables should be converted to unicode as soon as they are read from the environment:

    for (key, value) in environment.get_vars: request.GET[to_unicode(key)] = to_unicode(value)

  2. Only convert the unicode objects back to str strings when you absolutely must. For example, when they are written to a file (see also the codecs sketch after this list):

    log_file.write("New user '%s' created" %(to_str(p.name)))

  3. Think hard before you call str or unicode. Each time your fingers type "s", "t", "r", flashing lights and sirens should go off in your head, reminding you to make sure that the object you are stringifying could never, ever, ever possibly contain unicode.
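For file output in particular, the standard codecs module will do the conversion at the boundary for you (a sketch; the log file name is a placeholder):

import codecs

# codecs.open returns a file-like object which encodes on write (and
# decodes on read), so the rest of the program only handles unicode.
log_file = codecs.open("app.log", "a", "utf-8")
log_file.write(u"New user '%s' created\n" %(p.name, ))
log_file.close()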

And what about those to_unicode and to_str functions? What should they look like? Well, probably something like this:

import locale
def to_unicode(text):
    """ Convert, at all consts, 'text' to a `unicode` object.

        Note: as a last-ditch effort, this function tries to decode the text
              as latin1... Which will always succeed.  If you expect to get
              text encoded with latin[2-9] or some other character set, this
              may not be desirable.

        >>> to_unicode(u'I\xf1t\xebrn\xe2ti')
        u'I\xf1t\xebrn\xe2ti'
        >>> to_unicode('I\xc3\xb1t\xc3\xabrn\xc3\xa2ti')
        u'I\xf1t\xebrn\xe2ti'
        >>> class Foo:
        ...   def __str__(self):
        ...       return 'foo'
        ...
        >>> f = Foo()
        >>> to_unicode(f)
        u'foo'
        >>> f.__unicode__ = lambda: u'bar'
        >>> to_unicode(f)
        u'bar'
        >>> """

    if isinstance(text, unicode):
        return text

    if hasattr(text, '__unicode__'):
        return text.__unicode__()

    text = str(text)

    try:
        return unicode(text, 'utf-8')
    except UnicodeError:
        pass

    try:
        return unicode(text, locale.getpreferredencoding())
    except UnicodeError:
        pass

    return unicode(text, 'latin1')


def to_str(text):
    """ Convert 'text' to a `str` object.

        >>> to_str(u'I\xf1t\xebrn\xe2ti')
        'I\xc3\xb1t\xc3\xabrn\xc3\xa2ti'
        >>> to_str(42)
        '42'
        >>> to_str('ohai')
        'ohai'
        >>> class Foo:
        ...     def __str__(self):
        ...         return 'foo'
        ...
        >>> f = Foo()
        >>> to_str(f)
        'foo'
        >>> f.__unicode__ = lambda: u'I\xf1t\xebrn\xe2ti'
        >>> to_str(f)
        'I\xc3\xb1t\xc3\xabrn\xc3\xa2ti'
        >>> """
    if isinstance(text, str):
        return text

    # Note: unicode instances don't provide a __unicode__ method, so check
    # for unicode explicitly rather than relying on the hasattr test below
    # (falling through to __str__ would try to encode as ascii and explode).
    if not isinstance(text, unicode) and hasattr(text, '__unicode__'):
        text = text.__unicode__()

    if isinstance(text, unicode):
        return text.encode('utf-8')

    return str(text)

So there you have it. A quick and fairly easy way to avoid many of your encoding-related problems :-)

If you're still not quite feeling comfortable with all of this, though, take a read over Joel Spolsky's The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

*: My newly installed Debian 4 machine is still running Python 2.4 (released December 2004)... So that wait might be a while.
