In today's instalment of Adventures in Unicode, we meet the sneaky &#151;.
When a web browser encounters &#151;, it renders an em-dash (—). However, when &#151; is decoded to Unicode (U+0097, 97₁₆ == 151₁₀), encoded to UTF-8 (\xc2\x97), written to a file, then opened with exactly the same web browser, the browser renders…
cue ominous music
Nothing!
Nothing is rendered because U+0097 is actually the END OF GUARDED AREA control character[0]… So it shouldn't be rendered.
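The round trip described above can be checked in a few lines. This is only a minimal Python 3 sketch, and the variable names are illustrative:

```python
import unicodedata

code_point = 151                      # the number inside &#151;
assert code_point == 0x97             # 97 in hex is 151 in decimal

ch = chr(code_point)                  # the character U+0097
utf8 = ch.encode("utf-8")             # the two bytes that end up in the file
category = unicodedata.category(ch)   # "Cc" is the control-character category

print(repr(ch), utf8, category)       # prints: '\x97' b'\xc2\x97' Cc
```

The "Cc" category is the standard library's way of saying that this code point is a control character, not something with a visible glyph.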
So why is &#151; being rendered as an em-dash? Because of our old friend, the Windows-1252 encoding, where character 151 is an em-dash. So when the browser sees &#151;, it helpfully assumes that the author is an idiot[1] and wanted an em-dash to be displayed instead of a control character[2].
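Both halves of that explanation can be checked from Python. The sketch below assumes Python 3.4 or later, where the standard library's html.unescape follows the HTML5 rules for numeric character references. The variable names are illustrative:

```python
import html

# Byte 151 (0x97) in Windows-1252 is the em-dash, U+2014.
from_cp1252 = bytes([151]).decode("cp1252")

# html.unescape makes the same substitution a browser does,
# instead of returning the control character U+0097.
from_entity = html.unescape("&#151;")

print(from_cp1252 == "\u2014", from_entity == "\u2014")  # prints: True True
```

By contrast, chr(151) yields the raw control character, which is the path a naive decoder takes.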
What can be done?
I have been using a function which looks like this:
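    import re

    # Any run of characters in the range U+007F to U+00FF.
    _fix_mixed_unicode_re = re.compile(u"([\x7F-\xFF]+)")

    def fix_mixed_unicode(mixed_unicode):
        assert isinstance(mixed_unicode, unicode)  # Python 2 text type
        def handle_match(match):
            # Turn the run back into single bytes, then decode them as Windows-1252.
            return match.group(0).encode("raw_unicode_escape").decode("1252")
        return _fix_mixed_unicode_re.sub(handle_match, mixed_unicode)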
It accepts a unicode string and assumes that any characters between 127 and 255 are actually Windows-1252 encoded: it encodes them as bytes, then decodes those bytes as 1252, yielding a correct unicode string.
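For readers on Python 3, where the text type is str and not unicode, a port might look like the sketch below. It is not the function above verbatim, and differs in two ways:

- It only touches U+0080 through U+009F, the one range where Windows-1252 and Latin-1 disagree.
- It leaves a character alone when Python's cp1252 codec has no mapping for that byte, where a plain decode would raise.

The names fix_mixed_unicode_py3, _C1_RE and _as_cp1252 are made up for this example.

```python
import re

# Hypothetical Python 3 port; all names here are illustrative.
_C1_RE = re.compile("[\u0080-\u009f]")

def _as_cp1252(match):
    ch = match.group(0)
    try:
        # Code points below 256 map one-to-one onto Latin-1 bytes.
        return ch.encode("latin-1").decode("cp1252")
    except UnicodeDecodeError:
        # 0x81, 0x8D, 0x8F, 0x90 and 0x9D are undefined in Python's cp1252 codec.
        return ch

def fix_mixed_unicode_py3(text):
    if not isinstance(text, str):
        raise TypeError("expected str")
    return _C1_RE.sub(_as_cp1252, text)

print(fix_mixed_unicode_py3("a\u0097b") == "a\u2014b")  # prints: True
```

Narrowing the range does not change the result for the other characters, since Windows-1252 and Latin-1 agree on 127 and on 160 through 255.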
[0]: Which is represented by a line that looks very similar to an em-dash…
[1]: A generally safe assumption.
[2]: It should be noted that this happens regardless of the document's encoding.