The no-good very-bad &#151;

January 22, 2011 at 01:59 PM | Unicode

In today's instalment of Adventures in Unicode, we meet the sneaky &#151;.

When a web browser encounters &#151;, it renders an em-dash (—). However, when &#151; is decoded to Unicode (U+0097, since 0x97 == 151), encoded to UTF-8 (\xc2\x97), written to a file, then opened with exactly the same web browser, the browser renders…

cue ominous music

Nothing!

Nothing is rendered because U+0097 is actually the END OF GUARDED AREA control character[0]… So it shouldn't be rendered.
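To make the chain above concrete, here is a small sketch in Python 3 (an assumption on my part; the function later in this post is Python 2). It takes the reference at face value as code point 151 and shows the resulting code point, its UTF-8 bytes, and its Unicode general category. The variable names are only for illustration.

```python
import unicodedata

# Taking "&#151;" literally gives code point 151, i.e. U+0097.
ch = chr(151)

code_point = "U+%04X" % ord(ch)      # the code point in the usual notation
utf8_bytes = ch.encode("utf-8")      # the two bytes that end up in the file
category = unicodedata.category(ch)  # "Cc" is the category for control characters

print(code_point, utf8_bytes, category)
```

The category is what matters here: a control character has no glyph of its own, so there is nothing for the browser to draw.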

So why is &#151; being rendered? Because of our old friend, the Windows-1252 encoding, where character 151 is an em-dash. When the browser sees &#151;, it helpfully assumes that the author is an idiot[1] and wanted an em-dash to be displayed instead of a control character[2].
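The two readings of "151" can be compared side by side. This is a rough Python 3 sketch, not a model of any particular browser: it relies on the standard library's `html.unescape`, which is documented as following the HTML5 rules for character references and so makes the same Windows-1252 substitution.

```python
import html

literal = chr(151)                         # U+0097, the control character
as_cp1252 = bytes([151]).decode("cp1252")  # byte 0x97 read as Windows-1252
browser_like = html.unescape("&#151;")     # reference resolved HTML5-style

print(repr(literal), repr(as_cp1252), repr(browser_like))
```

Both `as_cp1252` and `browser_like` come out as U+2014 (the em-dash), while `literal` stays the invisible control character.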

What can be done?

I have been using a function which looks like this:

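import re

# Python 2. Matches runs of code points 127-255, which may really be Windows-1252 bytes.
_fix_mixed_unicode_re = re.compile(u"([\x7F-\xFF]+)")

def fix_mixed_unicode(mixed_unicode):
    assert isinstance(mixed_unicode, unicode)
    def handle_match(match):
        # Turn the code points back into raw bytes, then decode them as Windows-1252.
        return match.group(0).encode("raw_unicode_escape").decode("1252")
    return _fix_mixed_unicode_re.sub(handle_match, mixed_unicode)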

It accepts a unicode string and assumes that any characters between 127 and 255 are really Windows-1252 bytes. It encodes those characters back to raw bytes, then decodes the bytes as 1252, yielding a correct unicode string.
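For anyone on Python 3, a possible port might look like the sketch below. It is not the function above, and it makes two changes that are my own assumptions: it only touches U+0080 to U+009F (the range where Windows-1252 and Unicode actually disagree), and it leaves a character alone when Python's cp1252 codec has no mapping for that byte (0x81, for example) instead of raising.

```python
import re

# U+0080..U+009F are C1 control characters in Unicode,
# but mostly printable characters in Windows-1252.
_c1_re = re.compile("[\x80-\x9f]")

def fix_mixed_unicode(text):
    """Reinterpret stray C1 control characters as Windows-1252 characters."""
    def handle_match(match):
        ch = match.group(0)
        try:
            # latin-1 maps code points 0-255 straight to the same byte values.
            return ch.encode("latin-1").decode("cp1252")
        except UnicodeDecodeError:
            # This byte is undefined in cp1252; keep the original character.
            return ch
    return _c1_re.sub(handle_match, text)

print(repr(fix_mixed_unicode("no-good\x97very-bad")))
```

Characters from U+00A0 upwards are identical in Latin-1 and Windows-1252, so skipping them does not change the result for ordinary accented text.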

[0]: Which is represented by a line that looks very similar to an em-dash

[1]: A generally safe assumption.

[2]: It should be noted that this happens regardless of the document's encoding.


Poll results: how often do programmers interact with their version control system?

January 05, 2011 at 08:12 PM | Version control

Out of curiosity, I polled Twitter, asking programmers how often they interact with their version control systems. These are the responses I got, along with (my guess at) the version control system each person uses:

  • 5 - 10 times a day. (?)
  • Lots of ‘working directory diff’, then a bit less for ‘commit’ and a lot less for ‘push’. (dvcs)
  • Not nearly as much as I should. (git)
  • Every few hours lately I wake up from some code daydreaming and do a diff to see how much has gone on, then "hg record" a few times. (hg)
  • When I have something that represents a "single" change; could be anywhere from two minutes/one line to afternoon/minor new module. Diff and revert are also minor but nontrivial contributors to VCS interaction while coding. Those happen approximately randomly. (git, cvs)
  • When I want to switch contexts (different branch or project), or when things are reasonably functional and I want to break things. In practice, about 2-8 times per full work session. (git)
  • Trying to do it much, much more. (?)
  • Depends on the team and VCS. Ideally, if I'm using a DVCS and coding heavily, probably hourly... But right now at work, since I'm dealing with a centralized P4 and a kludgy dev toolset, probably once a day. (dvcs, p4)
  • Often: I branch like a madman, commit when I am satisfied, and push/merge when I am complete. (git)
  • Every 10-15 minutes, typically. (git)
  • Usually one to three times a day. Should be more, but I don't do enough TDD these days, so commits are larger. (?)
  • I commit approximately hourly and diff/stat about as often. Revert once a day. (hg)
  • Very frequently. (hg)
  • Usually once a day, when I'm done with my tasks and all tests pass. But I'm the only one touching the code so I never have to update or merge. (?)

And my answer would be "every 15-30 minutes".

Anyway, that's all. Thanks to everyone who replied; I found the results interesting.
