In MySQL land, "latin1" isn't actually latin1

March 20, 2012 at 02:41 AM | Unicode | View Comments

Lesson learned: in MySQL land, "latin1" isn't actually latin1 — it's cp1252[0].

The the consequence? Magic. Everything will appear to work until a connection character encoding is specified, a SELECT INTO OUTFILE is issued, or you start to realize that unicode data are taking up two or three times more disk space than they reasonably should.

More specifically: when no connection character set is specified, MySQL defaults to using "latin1". Additionally, programmers will occasionally send utf8 encoded data over a MySQL connection without setting the connections' character set… Which leads to unexpected results under the conditions described above.

For example, imagine that the string u"☃" is encoded as utf8 ("\xe2\x98\x83") and sent to MySQL over a connection using the cp1252 character set (the default if no SET CHARACTER SET command is issued). MySQL will receive these three bytes, then decoded them as cp1252, yielding three unicode code points: u"\xe2\u02dc\u0192". These three code points are then stored to disk using the column's character set (for example, if the column's character set is utf8, the bytes "\xc3\xa2\xcb\x9c\xc6\x92" will be written to disk):

>>> u"☃"
u'\u2603'
>>> _.encode("utf8")
"\xe2\x98\x83"
>>> _.decode("cp1252")
u"\xe2\u02dc\u0192"
>>> _.encode("utf8")
"\xc3\xa2\xcb\x9c\xc6\x92"

Next, when that string is sent back to a client, the bytes are read from disk and decoded using the column's character set: "\xc3\xa2\xcb\x9c\xc6\x92" decodes to u"\xe2\u02dc\u0192". This string is then encoded using the connections character set and the resulting bytes are sent back to the client: u"\xe2\u02dc\u0192" encodes to "\xe2\x98\x83" — the "correct" utf8 bytes:

>>> "\xc3\xa2\xcb\x9c\xc6\x92".decode("utf8")
u"\xe2\u02dc\u0192"
>>> _.encode("cp1252")
"\xe2\x98\x83"
>>> _.decode("utf8")
u'\u2603'
>>> print _
☃

And the client will continue to see the "correct" utf8 bytes until the last "encode as cp1252" step is omitted… For example, because the connection's character set has changed, or because the SELECT INTO OUTFILE command is issued[1].

In cases when the last "encode as cp1252" step is omitted, results will seem very strange. For example, if the SET CHARACTER SET binary command is issued (to simulate a SELECT INTO OUTFILE), the bytes "\xc3\xa2\xcb\x9c\xc6\x92" will be returned, and similar things will happen if the connection encoding is set to utf8.

Note also that six bytes are being used to store a three utf8 bytes.

With the luxury of planning and foresight, this madness could have been avoided by:

  • Issuing SET CHARACTER SET utf8 at the start of connections.
  • Ensuring that (unless there is a good reason not to), databases have DEFAULT CHARACTER SET utf8.
  • Ensuring that only utf8 bytes are sent to the database.

But, as is so often the case, the particular data which lead to this discovery were generated by a PHP application that is out of my control... So for now, I will be living with .decode("utf8").encode("cp1252").decode("utf8").

(thanks to Taavi Burns, who pointed out that MySQL assumes "latin1" means "cp1252", making it possible to solve my original problem)

[0]: as documented here: http://dev.mysql.com/doc/refman/5.0/en/charset-we-sets.html (fun fact: the at the top of the Wikipedia entry for 8859-1 (latin1), there is the notice: “For the character encoding commonly mislabeled as "ISO-8859-1", see Windows-1252”).

[1]: SELECT INTO OUTFILE uses the column's encoding, not the connection's: http://dev.mysql.com/doc/refman/5.0/en/select-into.html