In today's instalment of Adventures in Unicode, we meet the sneaky &#151;.
When a web browser encounters &#151;, it renders an em-dash (—). However, when &#151; is decoded to Unicode (U+0097, 97₁₆ == 151₁₀), encoded to UTF-8 (\xc2\x97), written to a file, then opened with exactly the same web browser, the browser renders…
cue ominous music
Nothing is rendered, because U+0097 is actually the END OF GUARDED AREA control character… So it shouldn't be rendered.
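The round trip described above can be checked in a few lines. This is a small Python 3 sketch (the variable names are only illustrative), assuming the reference is decoded literally as code point 151:

```python
import unicodedata

code_point = 0x97                 # 97 in base 16...
assert code_point == 151          # ...is 151 in base 10

char = chr(code_point)            # the character U+0097
utf8 = char.encode("utf-8")       # its UTF-8 encoding

print(utf8)                       # b'\xc2\x97'
print(unicodedata.category(char)) # Cc, i.e. a control character
```

The `Cc` category is why a browser has nothing to draw once the file really contains U+0097.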
So why is &#151; being rendered? Because of our old friend, the Windows-1252 encoding, where character 151 is an em-dash. So when the browser sees &#151;, it helpfully assumes that the author is an idiot and wanted an em-dash to be displayed instead of a control character.
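The two interpretations of 151 can be compared side by side. In this Python 3 sketch, the standard library's `html.unescape` is used as a stand-in for the browser's behaviour (it follows the HTML5 rules for numeric character references); what any particular browser does internally is an assumption here:

```python
import html

EM_DASH = "\u2014"

# Byte 151 interpreted as Windows-1252 is an em-dash.
assert bytes([151]).decode("cp1252") == EM_DASH

# An HTML5-style parser applies the same mapping to the reference.
assert html.unescape("&#151;") == EM_DASH

# Code point 151 taken literally is a different character entirely.
assert chr(151) != EM_DASH
assert chr(151) == "\x97"
```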
What can be done?
I have been using a function which looks like this:
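    import re

    # Runs of code points 127-255, assumed to be Windows-1252 bytes in disguise.
    _fix_mixed_unicode_re = re.compile("([\x7F-\xFF]+)")

    def fix_mixed_unicode(mixed_unicode):
        assert isinstance(mixed_unicode, unicode)

        def handle_match(match):
            # Turn the code points back into raw bytes, then decode as Windows-1252.
            return match.group(0).encode("raw_unicode_escape").decode("1252")

        return _fix_mixed_unicode_re.sub(handle_match, mixed_unicode)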
It accepts a unicode string, then assumes any characters between 127 and 255 are actually Windows-1252 encoded, so it encodes them as bytes, then decodes those bytes as 1252, yielding a correct unicode string.
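The function above is Python 2 (it relies on the `unicode` type). For Python 3, the same idea might be sketched as below. This is an adaptation rather than the original: the name `fix_mixed_text` is made up, `latin-1` stands in for `raw_unicode_escape` (the two produce the same bytes for code points below 256), and characters are handled one at a time so that the few bytes Python's cp1252 codec treats as undefined (0x81, for example) are left alone instead of raising an error.

```python
import re

# Single code points in the range 127-255, as in the original pattern.
_HIGH_CHAR_RE = re.compile("[\x7f-\xff]")


def fix_mixed_text(text: str) -> str:
    """Reinterpret code points 127-255 as Windows-1252 bytes (hypothetical helper)."""

    def handle_match(match: re.Match) -> str:
        char = match.group(0)
        try:
            # Code point -> single byte -> Windows-1252 character.
            return char.encode("latin-1").decode("cp1252")
        except UnicodeDecodeError:
            # Byte has no Windows-1252 meaning in Python's codec; keep it as-is.
            return char

    return _HIGH_CHAR_RE.sub(handle_match, text)


# U+0097 becomes an em-dash (U+2014); ordinary Latin-1 letters are unchanged.
assert fix_mixed_text("before\x97after") == "before\u2014after"
assert fix_mixed_text("caf\xe9") == "caf\xe9"
```

Characters from 160 to 255 come out unchanged because Windows-1252 and Latin-1 agree there; only the 128 to 159 range is actually remapped.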
: Which is represented by a line that looks very similar to an em-dash…
: A generally safe assumption.
: It should be noted that this happens regardless of the document's encoding.