In today's instalment of Adventures in Unicode, we meet the sneaky &#151;.

When a web browser encounters &#151;, it renders an em-dash (—). However, when &#151; is decoded to Unicode (U+0097, 0x97 == 151), encoded to UTF-8 (\xc2\x97), written to a file, then opened with exactly the same web browser, the browser renders…

cue ominous music

Nothing is rendered, because U+0097 is actually the END OF GUARDED AREA control character… so it shouldn't be rendered.

So why is &#151; rendered as an em-dash? Because of our old friend, the Windows-1252 encoding, where character 151 is an em-dash. When the browser sees &#151;, it helpfully assumes that the author is an idiot and wanted an em-dash to be displayed instead of a control character.
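To make the two readings of code point 151 concrete, here is a small sketch in Python 3 (the function further down is Python 2, so treat this as an illustration rather than code from the original workflow). The variable names are mine. It relies on `html.unescape` in the standard library following the same HTML5 rule that browsers apply to numeric references in the 0x80-0x9F range.

```python
import html
import unicodedata

# Literal reading: code point 151 is U+0097, a C1 control character.
literal = chr(151)
literal_category = unicodedata.category(literal)  # "Cc" means control character
literal_utf8 = literal.encode("utf-8")            # the bytes that end up in the file

# Windows-1252 reading: byte 151 (0x97) is an em-dash, U+2014.
as_cp1252 = bytes([151]).decode("cp1252")

# HTML5 reading of the numeric reference, as done by browsers and html.unescape.
as_html = html.unescape("&#151;")

print(repr(literal), literal_category, literal_utf8)
print(repr(as_cp1252), repr(as_html))
```

The literal reading yields an invisible control character, while both the Windows-1252 and the HTML reading yield U+2014, which is why the entity displays as a dash but the decoded-and-saved file does not.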
What can be done?
I have been using a function which looks like this:
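    import re

    # Runs of characters in the range U+007F to U+00FF.
    _fix_mixed_unicode_re = re.compile(u"([\x7F-\xFF]+)")

    def fix_mixed_unicode(mixed_unicode):
        assert isinstance(mixed_unicode, unicode)

        def handle_match(match):
            # Turn the matched characters back into single bytes, then
            # decode those bytes as Windows-1252.
            return match.group(0).encode("raw_unicode_escape").decode("1252")

        return _fix_mixed_unicode_re.sub(handle_match, mixed_unicode)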
It accepts a unicode string, then assumes any characters between 127 and 255 are actually Windows-1252 encoded, so it encodes them as bytes, then decodes those bytes as Windows-1252 (the "1252" codec), yielding a correct unicode string.
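For anyone on Python 3, where the `unicode` type no longer exists, the same idea might be sketched as follows. This is my adaptation, not the function I actually use. It assumes that encoding with "latin-1" is an acceptable stand-in for "raw_unicode_escape" (the two produce the same bytes for characters up to U+00FF). It also leaves a character untouched when its byte is undefined in Python's cp1252 codec (0x81, 0x8D, 0x8F, 0x90, 0x9D), a case in which a plain decode would raise an exception.

```python
import re

# Same range as the original: runs of characters from U+007F to U+00FF.
_MIXED_RE = re.compile("[\x7f-\xff]+")


def fix_mixed_unicode(mixed):
    """Reinterpret characters U+007F..U+00FF as Windows-1252 bytes."""
    if not isinstance(mixed, str):
        raise TypeError("expected str, got %s" % type(mixed).__name__)

    def handle_match(match):
        fixed = []
        for ch in match.group(0):
            try:
                # One character -> one byte -> the Windows-1252 character.
                fixed.append(ch.encode("latin-1").decode("cp1252"))
            except UnicodeDecodeError:
                # Byte has no meaning in cp1252; keep the character as-is.
                fixed.append(ch)
        return "".join(fixed)

    return _MIXED_RE.sub(handle_match, mixed)


print(repr(fix_mixed_unicode("before\x97after")))
```

Only the characters between U+0080 and U+009F actually change, because Latin-1 and Windows-1252 agree everywhere else in the matched range.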
: Which is represented by a line that looks very similar to an em-dash…
: A generally safe assumption.
: It should be noted that this happens regardless of the document's encoding.
Out of curiosity, I polled Twitter, asking people how often they interact with their version control systems. These are the responses I got, along with (my guess at) the version control system each person uses:
- 5 - 10 times a day. (?)
- Lots of ‘working directory diff’, then a bit less for ‘commit’ and a lot less for ‘push’. (dvcs)
- Not nearly as much as I should. (git)
- Every few hours lately I wake up from some code daydreaming and do a diff to see how much has gone on, then "hg record" a few times. (hg)
- When I have something that represents a "single" change; could be anywhere from two minutes/one line to afternoon/minor new module. Diff and revert are also minor but nontrivial contributors to VCS interaction while coding. Those happen approximately randomly. (git, cvs)
- When I want to switch contexts (different branch or project), or when things are reasonably functional and I want to break things. In practice, about 2-8 times per full work session. (git)
- Trying to do it much, much more. (?)
- Depends on the team and VCS. Ideally, if I'm using a DVCS and coding heavily, probably hourly... But right now at work, since I'm dealing with a centralized P4 and a kludgy dev toolset, probably once a day. (dvcs, p4)
- Often: I branch like a madman, commit when I am satisfied, and push/merge when i am complete. (git)
- Every 10-15 minutes, typically. (git)
- Usually one to three times a day. Should be more, but I don't do enough TDD these days, so commits are larger. (?)
- I commit approximately hourly and diff/stat about as often. Revert once a day. (hg)
- Very frequently. (hg)
- Usually once a day, when I'm done with my tasks and all tests pass. But I'm the only one touching the code so I never have to update or merge. (?)
And my answer would be "every 15-30 minutes".
Anyway, that's all. Thanks to everyone who replied; I found the results interesting.