In today's instalment of Adventures in Unicode, we meet the sneaky &#151;.
When a web browser encounters &#151;, it renders an em-dash (—). However, when &#151; is decoded to Unicode (U+0097, 0x97 == 151), encoded to UTF-8 (\xc2\x97), written to a file, then opened with exactly the same web browser, the browser renders…
cue ominous music
Nothing!
Nothing is rendered because U+0097 is actually the END OF GUARDED AREA control character[0], so it shouldn't be rendered.
So why is &#151; being rendered? Because of our old friend, the Windows-1252 encoding, where character 151 is an em-dash. When the browser sees &#151;, it helpfully assumes that the author is an idiot[1] and wanted an em-dash to be displayed instead of a control character[2].
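The two readings of 151 can be compared side by side. This is a small sketch in Python 3 (an assumption on my part; the function further down is written for Python 2), and the variable names are only illustrative. It relies on `html.unescape`, which follows the HTML5 rules for numeric character references and so behaves like the browser described above.

```python
import html
import unicodedata

code_point = 151  # the number in &#151;

# Read literally, 151 is U+0097, a C1 control character.
as_unicode = chr(code_point)
as_utf8 = as_unicode.encode("utf-8")

# Read as a Windows-1252 byte, 151 (0x97) is an em-dash, U+2014.
as_cp1252 = bytes([code_point]).decode("cp1252")

# html.unescape applies the same Windows-1252 remapping a browser does.
as_browser = html.unescape("&#151;")

print(repr(as_unicode), unicodedata.category(as_unicode))
print(repr(as_utf8))
print(repr(as_cp1252), repr(as_browser))
```

The category printed for U+0097 is `Cc` (control), which is why nothing visible appears when the UTF-8 bytes are written to a file and opened again.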
What can be done?
I have been using a function which looks like this:
import re

_fix_mixed_unicode_re = re.compile("([\x7F-\xFF]+)")

def fix_mixed_unicode(mixed_unicode):
    assert isinstance(mixed_unicode, unicode)
    def handle_match(match):
        # Turn code points 127-255 back into single bytes, then
        # reinterpret those bytes as Windows-1252.
        return match.group(0).encode("raw_unicode_escape").decode("1252")
    return _fix_mixed_unicode_re.sub(handle_match, mixed_unicode)
It accepts a unicode string, then assumes any characters between 127 and 255 are actually Windows-1252 encoded, so it encodes them as bytes, then decodes those bytes as 1252, yielding a correct unicode string.
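For anyone on Python 3, the same idea might look roughly like the sketch below. It is not a drop-in port, and it departs from the function above in two ways that are my own choices: it only touches code points 0x80-0x9F (0x7F and 0xA0-0xFF decode to the same characters in Windows-1252 and Latin-1 anyway), and it leaves a character alone when Python's cp1252 codec has no mapping for it (0x81, 0x8D, 0x8F, 0x90 and 0x9D), where the original would raise UnicodeDecodeError.

```python
import re

# C1 control range: controls in Unicode, printable in Windows-1252.
_c1_re = re.compile("[\x80-\x9f]")


def fix_mixed_unicode(mixed_unicode):
    assert isinstance(mixed_unicode, str)

    def handle_match(match):
        char = match.group(0)
        try:
            # Back to a single byte, then reinterpret as Windows-1252.
            return char.encode("latin-1").decode("cp1252")
        except UnicodeDecodeError:
            # No Windows-1252 mapping for this byte; keep it unchanged.
            return char

    return _c1_re.sub(handle_match, mixed_unicode)


print(repr(fix_mixed_unicode("before \x97 after")))
```

Matching one character at a time means a single unmapped byte does not stop its neighbours from being repaired.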
[0]: Which is represented by a line that looks very similar to an em-dash…
[1]: A generally safe assumption.
[2]: It should be noted that this happens regardless of the document's encoding.
Poll results: how often do programmers interact with their version control system?
January 05, 2011 at 08:12 PM | Version control
Out of curiosity, I polled Twitter, asking how often people interact with their version control systems. These are the responses I got, along with (my guess at) the version control system each respondent uses:
- 5 - 10 times a day. (?)
- Lots of ‘working directory diff’, then a bit less for ‘commit’ and a lot less for ‘push’. (dvcs)
- Not nearly as much as I should. (git)
- Every few hours lately I wake up from some code daydreaming and do a diff to see how much has gone on, then "hg record" a few times. (hg)
- When I have something that represents a "single" change; could be anywhere from two minutes/one line to afternoon/minor new module. Diff and revert are also minor but nontrivial contributors to VCS interaction while coding. Those happen approximately randomly. (git, cvs)
- When I want to switch contexts (different branch or project), or when things are reasonably functional and I want to break things. In practice, about 2-8 times per full work session. (git)
- Trying to do it much, much more. (?)
- Depends on the team and VCS. Ideally, if I'm using a DVCS and coding heavily, probably hourly... But right now at work, since I'm dealing with a centralized P4 and a kludgy dev toolset, probably once a day. (dvcs, p4)
- Often: I branch like a madman, commit when I am satisfied, and push/merge when i am complete. (git)
- Every 10-15 minutes, typically. (git)
- Usually one to three times a day. Should be more, but I don't do enough TDD these days, so commits are larger. (?)
- I commit approximately hourly and diff/stat about as often. Revert once a day. (hg)
- Very frequently. (hg)
- Usually once a day, when I'm done with my tasks and all tests pass. But I'm the only one touching the code so I never have to update or merge. (?)
And my answer would be "every 15-30 minutes".
Anyway, that's all. Thanks to everyone who replied; I found the results interesting.