The no-good very-bad &#151;

January 22, 2011 at 01:59 PM | Unicode

In today's instalment of Adventures in Unicode, we meet the sneaky &#151;.

When a web browser encounters &#151;, it renders an em-dash (—). However, when &#151; is decoded to Unicode (U+0097, since 0x97 == 151), encoded to UTF-8 (\xc2\x97), written to a file, then opened with exactly the same web browser, the browser renders…
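The decoding steps above can be reproduced in a few lines. This is a minimal sketch in Python 3 (an assumption; the post itself uses Python 2), and the variable names are illustrative only:

```python
# Naive decoding of the numeric reference &#151;: treat 151 as a code point.
code_point = 151
assert code_point == 0x97

naive = chr(code_point)             # U+0097, a C1 control character
utf8_bytes = naive.encode("utf-8")  # the bytes that end up in the file

print(repr(naive))   # '\x97'
print(utf8_bytes)    # b'\xc2\x97'
```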

cue ominous music

Nothing!

Nothing is rendered because U+0097 is actually the END OF GUARDED AREA control character[0]… So it shouldn't be rendered.

So why is &#151; being rendered? Because of our old friend, the Windows-1252 encoding, where character 151 is an em-dash. So when the browser sees &#151;, it helpfully assumes that the author is an idiot[1] and wanted an em-dash to be displayed instead of a control character[2].
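The two readings of 151 can be compared directly. The sketch below again assumes Python 3, whose standard `html.unescape` happens to apply the same Windows-1252 remapping to numeric references as browsers do, so it serves here as a stand-in for the browser:

```python
import html
import unicodedata

# Byte 151 (0x97) decoded as Windows-1252 is U+2014 EM DASH.
em_dash = bytes([0x97]).decode("cp1252")
print(unicodedata.name(em_dash))       # EM DASH

# The code point U+0097 itself is a control character (category Cc).
print(unicodedata.category("\x97"))    # Cc

# html.unescape remaps the reference to the Windows-1252 character.
print(html.unescape("&#151;") == em_dash)   # True
```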

What can be done?

I have been using a function which looks like this:

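import re

# Python 2 code: matches runs of characters in the range 0x7F-0xFF.
_fix_mixed_unicode_re = re.compile("([\x7F-\xFF]+)")

def fix_mixed_unicode(mixed_unicode):
    assert isinstance(mixed_unicode, unicode)
    def handle_match(match):
        # Turn the matched characters back into single bytes, then decode
        # those bytes as Windows-1252.
        return match.group(0).encode("raw_unicode_escape").decode("1252")
    return _fix_mixed_unicode_re.sub(handle_match, mixed_unicode)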

It accepts a unicode string, then assumes any characters between 127 and 255 are actually Windows-1252 encoded, so it encodes them as bytes, then decodes those bytes as Windows-1252, yielding a correct unicode string.
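For readers on Python 3, here is a rough sketch of the same idea. It is not a drop-in port: the name `fix_mixed_unicode_py3` is hypothetical, and the match is narrowed to U+0080 through U+009F, since those are the only characters where Latin-1 and Windows-1252 disagree. It also leaves alone the five byte values that Windows-1252 does not define, which Python's `cp1252` codec refuses to decode.

```python
import re

# Only U+0080..U+009F differ between Latin-1 and Windows-1252, so this
# sketch matches one such character at a time.
_c1_char_re = re.compile("[\x80-\x9f]")

def fix_mixed_unicode_py3(text):
    """Reinterpret C1 control characters in a str as Windows-1252 bytes."""
    if not isinstance(text, str):
        raise TypeError("expected str")

    def handle_match(match):
        ch = match.group(0)
        try:
            # One character -> one byte -> the Windows-1252 character.
            return ch.encode("latin-1").decode("cp1252")
        except UnicodeDecodeError:
            # 0x81, 0x8D, 0x8F, 0x90 and 0x9D are undefined in cp1252.
            return ch

    return _c1_char_re.sub(handle_match, text)

print(fix_mixed_unicode_py3("a\x97b") == "a\u2014b")   # True
```

Characters from U+00A0 upwards are passed through untouched, so ordinary accented Latin-1 text is unaffected.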

[0]: Which is represented by a line that looks very similar to an em-dash

[1]: A generally safe assumption.

[2]: It should be noted that this happens regardless of the document's encoding.