<?xml version="1.0" encoding="UTF-8"?>
<feed
  xmlns="http://www.w3.org/2005/Atom"
  xmlns:thr="http://purl.org/syndication/thread/1.0"
  xml:lang="en"
   >
  <title type="text">Code Kills</title>
  <subtitle type="text"></subtitle>

  <updated>2012-01-22T21:05:58Z</updated>
  <generator uri="http://blogofile.com/">Blogofile</generator>

  <link rel="alternate" type="text/html" href="http://blog.codekills.net" />
  <id>http://blog.codekills.net/feed/atom/</id>
  <link rel="self" type="application/atom+xml" href="http://blog.codekills.net/feed/atom/" />
  <entry>
    <author>
      <name>David Wolever</name>
      <uri>http://blog.codekills.net</uri>
    </author>
    <title type="html">Python 2.X&#39;s str.format is unsafe</title>
    <link rel="alternate" type="text/html" href="http://blog.codekills.net/2011/09/22/python-2.x's-str.format-is-unsafe" />
    <id>http://blog.codekills.net/2011/09/22/python-2.x's-str.format-is-unsafe</id>
    <updated>2011-09-22T23:33:00Z</updated>
    <published>2011-09-22T23:33:00Z</published>
    <category scheme="http://blog.codekills.net" term="Python" />
    <category scheme="http://blog.codekills.net" term="Unicode" />
    <summary type="html">Python 2.X&#39;s str.format is unsafe</summary>
    <content type="html" xml:base="http://blog.codekills.net/2011/09/22/python-2.x's-str.format-is-unsafe">&lt;div class=&#34;document&#34;&gt;
&lt;p&gt;I posted &lt;a class=&#34;reference external&#34; href=&#34;http://twitter.com/wolever/status/116966636606603264&#34;&gt;a tweet&lt;/a&gt;
today when I learned that Python&#39;s %-string-formatting isn&#39;t actually a special
case - the &lt;tt class=&#34;docutils literal&#34;&gt;str&lt;/tt&gt; class just implements the &lt;tt class=&#34;docutils literal&#34;&gt;__mod__&lt;/tt&gt; method.&lt;/p&gt;
&lt;p&gt;One side effect of this is that a few people commented that %-formatting is to
be replaced with &lt;tt class=&#34;docutils literal&#34;&gt;.format&lt;/tt&gt; formatting... So I&#39;d like to take this opportunity
to explain why &lt;tt class=&#34;docutils literal&#34;&gt;.format&lt;/tt&gt; string formatting &lt;strong&gt;is unsafe&lt;/strong&gt; in Python 2.X.&lt;/p&gt;
&lt;p&gt;With %-formatting, if the format string is a &lt;tt class=&#34;docutils literal&#34;&gt;str&lt;/tt&gt; while one of the
replacements is a &lt;tt class=&#34;docutils literal&#34;&gt;unicode&lt;/tt&gt; the result will be &lt;tt class=&#34;docutils literal&#34;&gt;unicode&lt;/tt&gt;:&lt;/p&gt;
&lt;pre class=&#34;literal-block&#34;&gt;
&amp;gt;&amp;gt;&amp;gt; &amp;quot;Hello %s&amp;quot; %(u&amp;quot;world&amp;quot;, )
u&#39;Hello world&#39;
&lt;/pre&gt;
&lt;p&gt;However, &lt;tt class=&#34;docutils literal&#34;&gt;.format&lt;/tt&gt; will always return the same type of string (&lt;tt class=&#34;docutils literal&#34;&gt;str&lt;/tt&gt; or
&lt;tt class=&#34;docutils literal&#34;&gt;unicode&lt;/tt&gt;) as the format string:&lt;/p&gt;
&lt;pre class=&#34;literal-block&#34;&gt;
&amp;gt;&amp;gt;&amp;gt; &amp;quot;Hello {}&amp;quot;.format(u&amp;quot;world&amp;quot;)
&#39;Hello world&#39;
&lt;/pre&gt;
&lt;p&gt;This is a problem in Python 2.X because unqualified string literals are
instances of &lt;tt class=&#34;docutils literal&#34;&gt;str&lt;/tt&gt;, and the implicit encoding of &lt;tt class=&#34;docutils literal&#34;&gt;unicode&lt;/tt&gt; arguments will
almost certainly explode at the least opportune moments:&lt;/p&gt;
&lt;pre class=&#34;literal-block&#34;&gt;
&amp;gt;&amp;gt;&amp;gt; &amp;quot;Hello {}&amp;quot;.format(u&amp;quot;\u263a&amp;quot;)
Traceback (most recent call last):
  File &amp;quot;&amp;lt;stdin&amp;gt;&amp;quot;, line 1, in &amp;lt;module&amp;gt;
UnicodeEncodeError: &#39;ascii&#39; codec can&#39;t encode character u&#39;\u263a&#39; in position 0: ordinal not in range(128)
&lt;/pre&gt;
&lt;p&gt;Of course, one possible solution to this is remembering to prefix all string
literals with &lt;tt class=&#34;docutils literal&#34;&gt;u&lt;/tt&gt;:&lt;/p&gt;
&lt;pre class=&#34;literal-block&#34;&gt;
&amp;gt;&amp;gt;&amp;gt; u&amp;quot;Hello {}&amp;quot;.format(u&amp;quot;\u263a&amp;quot;)
u&#39;Hello \u263a&#39;
&lt;/pre&gt;
&lt;p&gt;But I prefer to simply use %-style formatting, because then I don&#39;t need to
remember anything:&lt;/p&gt;
&lt;pre class=&#34;literal-block&#34;&gt;
&amp;gt;&amp;gt;&amp;gt; &amp;quot;Hello %s&amp;quot; %(u&amp;quot;\u263a&amp;quot;, )
u&#39;Hello \u263a&#39;
&amp;gt;&amp;gt;&amp;gt; print _.encode(&#39;utf-8&#39;)
Hello ☺
&lt;/pre&gt;
&lt;p&gt;Of course, as you&#39;ve probably noticed, this means that the format string is
being implicitly decoded to unicode... But since my string literals generally
don&#39;t contain non-ASCII characters it&#39;s not much of an issue.&lt;/p&gt;
&lt;p&gt;Note that this is &lt;strong&gt;not&lt;/strong&gt; a problem in Py 3k because string literals are
&lt;tt class=&#34;docutils literal&#34;&gt;unicode&lt;/tt&gt;.&lt;/p&gt;
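&lt;p&gt;A minimal sketch of that last point, assuming a Python 3 interpreter (the variable name &lt;tt class=&#34;docutils literal&#34;&gt;greeting&lt;/tt&gt; is only for illustration): the format string is already text, so there is no implicit encode step left to fail.&lt;/p&gt;
&lt;pre class=&#34;literal-block&#34;&gt;
```python
# Python 3 only: an unprefixed literal is text (str), not bytes.
greeting = "Hello {}".format("\u263a")

# The result is text and still contains the non-ASCII character.
print(type(greeting).__name__)        # str
print(greeting == "Hello \u263a")     # True
print(ascii(greeting))                # 'Hello \u263a'
```
&lt;/pre&gt;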
&lt;/div&gt;
</content>
  </entry>
  <entry>
    <author>
      <name>David Wolever</name>
      <uri>http://blog.codekills.net</uri>
    </author>
    <title type="html">The no-good very-bad &amp;amp;#151;</title>
    <link rel="alternate" type="text/html" href="http://blog.codekills.net/2011/01/22/the-no-good-very-bad---151-" />
    <id>http://blog.codekills.net/2011/01/22/the-no-good-very-bad---151-</id>
    <updated>2011-01-22T18:59:00Z</updated>
    <published>2011-01-22T18:59:00Z</published>
    <category scheme="http://blog.codekills.net" term="Unicode" />
    <summary type="html">The no-good very-bad &amp;amp;#151;</summary>
    <content type="html" xml:base="http://blog.codekills.net/2011/01/22/the-no-good-very-bad---151-">

&lt;p&gt;In today&#39;s instalment of &lt;em&gt;Adventures in Unicode&lt;/em&gt;, we meet the sneaky &lt;code&gt;&amp;amp;#151;&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;When a web browser encounters &lt;code&gt;&amp;amp;#151;&lt;/code&gt;, it renders an em-dash (—). However, when &lt;code&gt;&amp;amp;#151;&lt;/code&gt; is decoded to Unicode (&lt;code&gt;U+0097&lt;/code&gt;, 97&lt;sub&gt;16&lt;/sub&gt; == 151&lt;sub&gt;10&lt;/sub&gt;), encoded to UTF-8 (&lt;code&gt;\xc2\x97&lt;/code&gt;), written to a file, then opened with exactly the same web browser, the browser renders…&lt;/p&gt;
&lt;p&gt;&lt;em&gt;cue ominous music&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Nothing!&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Nothing is rendered because &lt;code&gt;U+0097&lt;/code&gt; is &lt;em&gt;actually&lt;/em&gt; the &lt;a href=&#34;http://www.fileformat.info/info/unicode/char/97/index.htm&#34;&gt;END OF GUARDED AREA&lt;/a&gt; control character[0]… So it &lt;em&gt;shouldn&#39;t&lt;/em&gt; be rendered.&lt;/p&gt;
&lt;p&gt;So why is &lt;code&gt;&amp;amp;#151;&lt;/code&gt; being rendered as an em-dash? Because of our old friend, the &lt;a href=&#34;http://en.wikipedia.org/wiki/Windows-1252&#34;&gt;Windows-1252&lt;/a&gt; encoding, where character 151 &lt;em&gt;is&lt;/em&gt; an em-dash. So when the browser sees &lt;code&gt;&amp;amp;#151;&lt;/code&gt;, it helpfully assumes that the author is an idiot[1] and wanted an em-dash to be displayed instead of a control character[2].&lt;/p&gt;
&lt;p&gt;What can be done?&lt;/p&gt;
&lt;p&gt;I have been using a function which looks like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import re

_fix_mixed_unicode_re = re.compile(&#34;([\x7F-\xFF]+)&#34;)
def fix_mixed_unicode(mixed_unicode):
    assert isinstance(mixed_unicode, unicode)
    def handle_match(match):
        return match.group(0).encode(&#34;raw_unicode_escape&#34;).decode(&#34;1252&#34;)
    return _fix_mixed_unicode_re.sub(handle_match, mixed_unicode)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;It accepts a &lt;code&gt;unicode&lt;/code&gt; string and assumes that any characters between 127 and 255 are actually Windows-1252 encoded: it encodes them as bytes, then decodes those bytes as &lt;code&gt;1252&lt;/code&gt;, yielding a correct unicode string.&lt;/p&gt;
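&lt;p&gt;For readers on Python 3, the same idea might be sketched as below. This is an adaptation under stated assumptions, not the function I actually run: it narrows the range to U+0080 through U+009F (for everything from 160 to 255, Latin-1 and Windows-1252 agree, so the result is unchanged), and it passes &lt;code&gt;errors=&#34;replace&#34;&lt;/code&gt; because a handful of byte values are unassigned in Python&#39;s &lt;code&gt;cp1252&lt;/code&gt; codec and would otherwise raise. The name &lt;code&gt;_c1_run_re&lt;/code&gt; is made up for this example.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;
```python
import re

# U+0080..U+009F are C1 control codes in Unicode, but the same byte values
# are printable characters (dashes, curly quotes, ...) in Windows-1252.
_c1_run_re = re.compile("[\x80-\x9f]+")

def fix_mixed_unicode(text):
    """Reinterpret any C1 control characters in `text` as Windows-1252."""
    def handle_match(match):
        # latin-1 maps code point N to byte N, recovering the original bytes.
        raw = match.group(0).encode("latin-1")
        # Unassigned cp1252 bytes become U+FFFD instead of raising.
        return raw.decode("cp1252", errors="replace")
    return _c1_run_re.sub(handle_match, text)

fixed = fix_mixed_unicode("caf\xe9 \x97 done")
print(ascii(fixed))   # 'caf\xe9 \u2014 done'
```
&lt;/code&gt;&lt;/pre&gt;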
&lt;p&gt;[0]: Which is represented by a line that &lt;a href=&#34;http://www.wolframalpha.com/input/?i=unicode%20151&#34;&gt;looks very similar to an em-dash&lt;/a&gt;…&lt;/p&gt;
&lt;p&gt;[1]: A generally safe assumption.&lt;/p&gt;
&lt;p&gt;[2]: It should be noted that this happens regardless of the document&#39;s encoding.&lt;/p&gt;
</content>
  </entry>
  <entry>
    <author>
      <name>David Wolever</name>
      <uri>http://blog.codekills.net</uri>
    </author>
    <title type="html">Testing for Unicode Safety</title>
    <link rel="alternate" type="text/html" href="http://blog.codekills.net/2009/02/11/testing-for-unicode-safety" />
    <id>http://blog.codekills.net/2009/02/11/testing-for-unicode-safety</id>
    <updated>2009-02-11T15:30:00Z</updated>
    <published>2009-02-11T15:30:00Z</published>
    <category scheme="http://blog.codekills.net" term="Unicode" />
    <summary type="html">Testing for Unicode Safety</summary>
    <content type="html" xml:base="http://blog.codekills.net/2009/02/11/testing-for-unicode-safety">

&lt;p&gt;After &lt;a href=&#34;http://blog.codekills.net/archives/45-str...-yer-probably-doin-it-wrong..html&#34;&gt;yesterday&#39;s post&lt;/a&gt;, &lt;a href=&#34;http://blog.third-bit.com/&#34;&gt;Greg&lt;/a&gt; suggested I write another on how to test for Unicode safety... And unfortunately I&#39;ve got some bad news: it&#39;s hard.&lt;/p&gt;
&lt;p&gt;You never know when some developer, somewhere, will unintentionally encode or decode something the wrong way (&lt;a href=&#34;https://www.drproject.org/DrProject/ticket/1627&#34;&gt;for example&lt;/a&gt;, &lt;code&gt;log(&#34;request for %s&#34;, unicode(url))&lt;/code&gt;).&lt;/p&gt;
&lt;p&gt;But there is hope!&lt;/p&gt;
&lt;p&gt;In my experience, almost all Unicode-related issues follow one of two patterns: someone using &lt;code&gt;str&lt;/code&gt; or &lt;code&gt;unicode&lt;/code&gt; incorrectly, or code which unexpectedly encodes/decodes a string.&lt;/p&gt;
&lt;p&gt;The first is easy to check for: grep through the code for &lt;code&gt;str(&lt;/code&gt; and &lt;code&gt;unicode(&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;The second is harder to check for, and requires an understanding of the code base: all of the points where the code interacts with other parts of the system (filesystem, database, network) must be found and checked.&lt;/p&gt;
&lt;p&gt;Finally, it isn&#39;t a bad idea to throw some Unicode into the test suite.  Instead of calling mock users &#39;user0&#39;, &#39;user1&#39;, call them u&#39;\u03bcs\xeb\u044f&#39; (u&#34;μsëя&#34;)*.  Keep a central &#34;database&#34; of these sorts of strings, so it&#39;s easy for developers who don&#39;t normally write in Cyrillic to use Cyrillic characters in their code (I keep my own personal list at &lt;a href=&#34;http://wolever.net/~wolever/wiki/unicode_audit&#34;&gt;http://wolever.net/~wolever/wiki/unicode_audit&lt;/a&gt; -- a URL I can now type from memory).&lt;/p&gt;
&lt;p&gt;One word of caution, though: you&#39;re asking for a world of pain if you actually think you can &lt;em&gt;commit&lt;/em&gt; UTF-8 encoded text -- any number of things will break (Subversion may helpfully fail, your editor may helpfully re-encode the file, your unenlightened developers will complain about funny question marks in their code, etc...).  Instead, have a central file which defines these &#34;canned test strings&#34; using escaped Python strings (i.e., u&#39;\u03bc...&#39;) then import that into your test suite:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;from app.tests import i18n
...
def test_user():
    u = User(name=i18n.user)
    ...
&lt;/code&gt;&lt;/pre&gt;
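&lt;p&gt;As a rough sketch of what such a central file could contain (everything here is hypothetical: the names &lt;code&gt;user&lt;/code&gt; and &lt;code&gt;filename&lt;/code&gt;, and the stand-in &lt;code&gt;User&lt;/code&gt; class, are assumptions for illustration), the point is simply that the source stays pure ASCII while the values do not. It is written so that it also runs on Python 3:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;
```python
# Hypothetical contents of app/tests/i18n.py. Every string is written with
# escapes, so the file itself contains nothing but ASCII.
user = u"\u03bcs\xeb\u044f"       # Greek mu, "s", e-diaeresis, Cyrillic ya
filename = u"r\xe9sum\xe9.txt"    # characters from the Latin-1 range

class User(object):
    """Stand-in for the application's real user class (an assumption)."""
    def __init__(self, name):
        self.name = name

def test_user():
    u = User(name=user)
    # Round-tripping through UTF-8 must not lose or mangle anything.
    assert u.name.encode("utf-8").decode("utf-8") == user
    assert len(u.name) == 4

test_user()
```
&lt;/code&gt;&lt;/pre&gt;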
&lt;p&gt;Or tar up all the offensive files, then write a script to un-tar them when they are needed**.&lt;/p&gt;
&lt;p&gt;So, to sum it up:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Make sure your developers grok (or, at least, understand) Unicode and encodings&lt;/li&gt;
&lt;li&gt;Make sure your code uses &lt;code&gt;str&lt;/code&gt; and &lt;code&gt;unicode&lt;/code&gt; safely&lt;/li&gt;
&lt;li&gt;Make sure your exit points are covered&lt;/li&gt;
&lt;li&gt;Make it really easy to include Unicode in tests&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Then maybe, if you&#39;re lucky, all those inconsiderate people who have the audacity to ask for more than 127 characters will be able to use your application &lt;img src=&#34;/templates/default/img/emoticons/smile.png&#34; alt=&#34;:-)&#34; style=&#34;display: inline; vertical-align: bottom;&#34; class=&#34;emoticon&#34; /&gt;&lt;/p&gt;
&lt;p&gt;*: A good choice because it&#39;s easy for ignorant North Americans like myself to see that it&#39;s correct.&lt;/p&gt;
&lt;p&gt;**: This is how I tested DrProject&#39;s handling of Unicode filenames which are checked into the Subversion repository.&lt;/p&gt;
</content>
  </entry>
  <entry>
    <author>
      <name>David Wolever</name>
      <uri>http://blog.codekills.net</uri>
    </author>
    <title type="html">str(...): &#39;yer probably doin&#39; it wrong.</title>
    <link rel="alternate" type="text/html" href="http://blog.codekills.net/2009/02/10/str(...)--'yer-probably-doin'-it-wrong." />
    <id>http://blog.codekills.net/2009/02/10/str(...)--'yer-probably-doin'-it-wrong.</id>
    <updated>2009-02-10T16:13:00Z</updated>
    <published>2009-02-10T16:13:00Z</published>
    <category scheme="http://blog.codekills.net" term="Python" />
    <category scheme="http://blog.codekills.net" term="Unicode" />
    <summary type="html">str(...): &#39;yer probably doin&#39; it wrong.</summary>
    <content type="html" xml:base="http://blog.codekills.net/2009/02/10/str(...)--'yer-probably-doin'-it-wrong.">

&lt;p&gt;Unicode is an ugly beast... And until people start standardizing on Python 3k*,
we&#39;re going to have to live with the eccentricities of Python 2&#39;s strings.&lt;/p&gt;
&lt;p&gt;But, fear not! There is (at least some) hope. By changing a few patterns in the
way you code, you can alleviate the bulk of Unicode-related problems.&lt;/p&gt;
&lt;p&gt;First, using the &lt;tt&gt;str&lt;/tt&gt; function.  In just about every case, if you&#39;re
using the &lt;tt&gt;str&lt;/tt&gt; function, you&#39;re probably doing it wrong.&lt;/p&gt;
&lt;p&gt;Let me demonstrate:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;from hashlib import sha256
def hash(to_hash):
  hash = sha256(to_hash).hexdigest()
  print to_hash, &#34;:&#34;, hash
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Cool, we can hash things then print out the hash:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;gt;&amp;gt;&amp;gt; hash(&#39;ohai&#39;)
ohai : e84712238709398f6d349dc2250b0efca4b72d8c2bfb7b74339d30ba94056b14
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;But, wait... What happens if the thing we&#39;re hashing isn&#39;t a string (even though
it can be represented as a string):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;gt;&amp;gt;&amp;gt; class Person:
...   name = &#39;David&#39;
...   def __str__(self):
...     return self.name
...
&amp;gt;&amp;gt;&amp;gt; hash(Person())
...
TypeError: new() argument 1 must be string or read-only buffer, not instance
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Oh no!  Ok, let&#39;s fix the code:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;from hashlib import sha256
def hash(to_hash):
  to_hash = str(to_hash) # Convert the object to a string before we hash it
  hash = sha256(to_hash).hexdigest()
  print to_hash, &#34;:&#34;, hash
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Great -- we can hash arbitrary objects now:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;gt;&amp;gt;&amp;gt; hash(Person())
David : a6b54c20a7b96eeac1a911e6da3124a560fe6dc042ebf270e3676e7095b95652
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;And, for most people who only speak English, this is a perfect place to stop.
After all, everything is a &lt;code&gt;str&lt;/code&gt;, right?&lt;/p&gt;
&lt;p&gt;Well... No.  What happens if the input is a &lt;code&gt;unicode&lt;/code&gt; object?&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;gt;&amp;gt;&amp;gt; p = Person()
&amp;gt;&amp;gt;&amp;gt; # This person had the audacity to give themselves a name containing
&amp;gt;&amp;gt;&amp;gt; # non-ascii symbols, so we represent it with a unicode object
&amp;gt;&amp;gt;&amp;gt; p.name = u&#39;I\xf1t\xebrn\xe2ti\xf4n\xe0liz\xe6ti\xf8n&#39;
&amp;gt;&amp;gt;&amp;gt; hash(p)
...
UnicodeEncodeError: &#39;ascii&#39; codec can&#39;t encode character u&#39;\xf1&#39; in position
1: ordinal not in range(128)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Crap.  Where the heck is &#39;ascii&#39; coming from?&lt;/p&gt;
&lt;p&gt;Well, it&#39;s a long story (which I&#39;ve covered over at &lt;a href=&#34;http://blog.codekills.net/archives/38-Encoding-and-Decoding-Text-in-Python-or-I-didnt-ask-you-to-use-the-ascii-codec!.html&#34;&gt;Encoding and Decoding Text
in
Python&lt;/a&gt;),
but basically the &lt;code&gt;__str__&lt;/code&gt; method of the unicode object (u&#39;I\xf1...&#39;) is trying to
encode the unicode object using the system&#39;s default encoding... Which, in this
case, is ascii.&lt;/p&gt;
&lt;p&gt;&#34;Alright...&#34;, you&#39;re probably thinking, &#34;If the problem is with &lt;code&gt;unicode&lt;/code&gt;, maybe
I could just replace that call to &lt;code&gt;str&lt;/code&gt; with a call to &lt;code&gt;unicode&lt;/code&gt;&#34;&lt;/p&gt;
&lt;p&gt;Ok, let&#39;s see what happens.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;from hashlib import sha256
def hash(to_hash):
  to_hash = unicode(to_hash) # Convert the object to a unicode before we hash it
  hash = sha256(to_hash).hexdigest()
  print to_hash, &#34;:&#34;, hash

&amp;gt;&amp;gt;&amp;gt; # this time, though, the person&#39;s name has come from The Internet, so it
&amp;gt;&amp;gt;&amp;gt; # is not yet a unicode object
&amp;gt;&amp;gt;&amp;gt; p.name = &#39;I\xc3\xb1t\xc3\xabrn\xc3\xa2ti\xc3\xb4n\xc3\xa0liz\xc3\xa6ti\xc3\xb8n&#39;
&amp;gt;&amp;gt;&amp;gt; hash(p)
...
UnicodeDecodeError: &#39;ascii&#39; codec can&#39;t decode byte 0xc3 in position 1:
ordinal not in range(128)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Yup, that&#39;s right -- you just can&#39;t win &lt;img src=&#34;/templates/default/img/emoticons/sad.png&#34; alt=&#34;:-(&#34; style=&#34;display: inline; vertical-align: bottom;&#34; class=&#34;emoticon&#34; /&gt;&lt;/p&gt;
&lt;p&gt;What&#39;s happening here?  Well, this time, the &lt;code&gt;unicode&lt;/code&gt; function is trying to
decode the input (&#39;I\xc3...&#39;) into a unicode object... But, because the input
isn&#39;t valid 7-bit ascii (again, the system&#39;s default), it explodes. Crap.&lt;/p&gt;
&lt;p&gt;Confused yet?&lt;/p&gt;
&lt;p&gt;So how can we save ourselves from all this insanity?&lt;/p&gt;
&lt;p&gt;Actually, it&#39;s not too difficult:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Convert every single string, without exception, to unicode as soon as they
enter the system.  For example, if you are writing a web application,
GET and POST variables should be converted to &lt;code&gt;unicode&lt;/code&gt; as soon as they are
read from the environment:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;for (key, value) in environment.get_vars:
    request.GET[to_unicode(key)] = to_unicode(value)&lt;/code&gt;&lt;/pre&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Only convert the &lt;code&gt;unicode&lt;/code&gt; objects back to &lt;code&gt;str&lt;/code&gt; strings when you absolutely
must.  For example, when they are written to a file:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;log_file.write(&#34;New user &#39;%s&#39; created&#34; %(to_str(p.name)))&lt;/code&gt;&lt;/pre&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Think hard before you call &lt;code&gt;str&lt;/code&gt; or &lt;code&gt;unicode&lt;/code&gt;. Each time your fingers type
&#34;s&#34;, &#34;t&#34;, flashing lights and sirens should go off in your head, reminding
you to make sure that the object you are &lt;code&gt;str&lt;/code&gt;ing could never, ever, ever
possibly contain unicode.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;And what about those &lt;code&gt;to_unicode&lt;/code&gt; and &lt;code&gt;to_str&lt;/code&gt; functions?  What should they look
like?  Well, probably something like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import locale
def to_unicode(text):
    &#34;&#34;&#34; Convert, at all costs, &#39;text&#39; to a `unicode` object.

        Note: as a last-ditch effort, this function tries to decode the text
              as latin1... Which will always succeed.  If you expect to get
              text encoded with latin[2-9] or some other character set, this
              may not be desirable.

        &amp;gt;&amp;gt;&amp;gt; to_unicode(u&#39;I\xf1t\xebrn\xe2ti&#39;)
        u&#39;I\xf1t\xebrn\xe2ti&#39;
        &amp;gt;&amp;gt;&amp;gt; to_unicode(&#39;I\xc3\xb1t\xc3\xabrn\xc3\xa2ti&#39;)
        u&#39;I\xf1t\xebrn\xe2ti&#39;
        &amp;gt;&amp;gt;&amp;gt; class Foo:
        ...   def __str__(self):
        ...       return &#39;foo&#39;
        ...
        &amp;gt;&amp;gt;&amp;gt; f = Foo()
        &amp;gt;&amp;gt;&amp;gt; to_unicode(f)
        u&#39;foo&#39;
        &amp;gt;&amp;gt;&amp;gt; f.__unicode__ = lambda: u&#39;bar&#39;
        &amp;gt;&amp;gt;&amp;gt; to_unicode(f)
        u&#39;bar&#39;
        &amp;gt;&amp;gt;&amp;gt; &#34;&#34;&#34;

    if isinstance(text, unicode):
        return text

    if hasattr(text, &#39;__unicode__&#39;):
        return text.__unicode__()

    text = str(text)

    try:
        return unicode(text, &#39;utf-8&#39;)
    except UnicodeError:
        pass

    try:
        return unicode(text, locale.getpreferredencoding())
    except UnicodeError:
        pass

    return unicode(text, &#39;latin1&#39;)


def to_str(text):
    &#34;&#34;&#34; Convert &#39;text&#39; to a `str` object.

        &amp;gt;&amp;gt;&amp;gt; to_str(u&#39;I\xf1t\xebrn\xe2ti&#39;)
        &#39;I\xc3\xb1t\xc3\xabrn\xc3\xa2ti&#39;
        &amp;gt;&amp;gt;&amp;gt; to_str(42)
        &#39;42&#39;
        &amp;gt;&amp;gt;&amp;gt; to_str(&#39;ohai&#39;)
        &#39;ohai&#39;
        &amp;gt;&amp;gt;&amp;gt; class Foo:
        ...     def __str__(self):
        ...         return &#39;foo&#39;
        ...
        &amp;gt;&amp;gt;&amp;gt; f = Foo()
        &amp;gt;&amp;gt;&amp;gt; to_str(f)
        &#39;foo&#39;
        &amp;gt;&amp;gt;&amp;gt; f.__unicode__ = lambda: u&#39;I\xf1t\xebrn\xe2ti&#39;
        &amp;gt;&amp;gt;&amp;gt; to_str(f)
        &#39;I\xc3\xb1t\xc3\xabrn\xc3\xa2ti&#39;
        &amp;gt;&amp;gt;&amp;gt; &#34;&#34;&#34;
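    if isinstance(text, str):
        return text

    if hasattr(text, &#39;__unicode__&#39;):
        text = text.__unicode__()

    # Encode unicode explicitly: calling __str__ on a unicode object would
    # fall back to the implicit &#39;ascii&#39; codec and explode on non-ASCII text.
    if isinstance(text, unicode):
        return text.encode(&#39;utf-8&#39;)

    # Anything else (numbers, objects defining __str__, ...)
    return str(text)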
&lt;/code&gt;&lt;/pre&gt;
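&lt;p&gt;For comparison, here is how the same pair of helpers might look on Python 3, where the two types are &lt;code&gt;str&lt;/code&gt; (text) and &lt;code&gt;bytes&lt;/code&gt;. This is only a sketch under that assumption; the names &lt;code&gt;to_text&lt;/code&gt; and &lt;code&gt;to_bytes&lt;/code&gt; are invented for the example, and the fallback order (UTF-8, then the locale&#39;s preferred encoding, then latin1) simply mirrors the function above.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;
```python
import locale

def to_text(value):
    """Convert `value` to text (`str`), guessing an encoding for bytes."""
    if isinstance(value, str):
        return value
    if not isinstance(value, (bytes, bytearray)):
        return str(value)          # numbers, objects with __str__, ...
    raw = bytes(value)
    for encoding in ("utf-8", locale.getpreferredencoding(False)):
        try:
            return raw.decode(encoding)
        except (UnicodeError, LookupError):
            pass
    # Last-ditch effort: latin1 maps every byte to a character.
    return raw.decode("latin-1")

def to_bytes(value):
    """Convert `value` to `bytes`, encoding text as UTF-8."""
    if isinstance(value, bytes):
        return value
    return to_text(value).encode("utf-8")

print(ascii(to_text(b"I\xc3\xb1t\xc3\xabrn")))   # 'I\xf1t\xebrn'
print(to_bytes("I\xf1t\xebrn"))                   # b'I\xc3\xb1t\xc3\xabrn'
```
&lt;/code&gt;&lt;/pre&gt;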
&lt;p&gt;So there you have it.
A quick and fairly easy way to avoid many of your encoding-related problems &lt;img src=&#34;/templates/default/img/emoticons/smile.png&#34; alt=&#34;:-)&#34; style=&#34;display: inline; vertical-align: bottom;&#34; class=&#34;emoticon&#34; /&gt;&lt;/p&gt;
&lt;p&gt;If you&#39;re still not quite feeling comfortable with all of this, though, take a
read over &lt;a href=&#34;http://www.joelonsoftware.com/articles/Unicode.html&#34;&gt;Joel Spolsky&#39;s The Absolute Minimum Every Software Developer
Absolutely, Positively Must Know About Unicode and Character Sets (No
Excuses!)&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;*: My newly installed Debian 4 machine is still running Python 2.4 (released
December 2004)... So that wait might be a while.&lt;/p&gt;
</content>
  </entry>
  <entry>
    <author>
      <name>David Wolever</name>
      <uri>http://blog.codekills.net</uri>
    </author>
    <title type="html">Encoding and Decoding Text in Python (or: &#34;I didn&#39;t ask you to use the &#39;ascii&#39; codec!&#34;)</title>
    <link rel="alternate" type="text/html" href="http://blog.codekills.net/2008/05/01/encoding-and-decoding-text-in-python-(or---i-didn't-ask-you-to-use-the-'ascii'-codec!-)" />
    <id>http://blog.codekills.net/2008/05/01/encoding-and-decoding-text-in-python-(or---i-didn't-ask-you-to-use-the-'ascii'-codec!-)</id>
    <updated>2008-05-01T17:27:00Z</updated>
    <published>2008-05-01T17:27:00Z</published>
    <category scheme="http://blog.codekills.net" term="Unicode" />
    <summary type="html">Encoding and Decoding Text in Python (or: &#34;I didn&#39;t ask you to use the &#39;ascii&#39; codec!&#34;)</summary>
    <content type="html" xml:base="http://blog.codekills.net/2008/05/01/encoding-and-decoding-text-in-python-(or---i-didn't-ask-you-to-use-the-'ascii'-codec!-)">

&lt;p&gt;When dealing with Unicode in Python, it doesn&#39;t take long to get the dreaded &lt;code&gt;&#39;ascii&#39; codec can&#39;t decode byte 0xc3 in position 2: ordinal not in range(128)&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;You never see it coming.  It doesn&#39;t make any sense. You didn&#39;t even ask for &lt;code&gt;ascii&lt;/code&gt;!&lt;/p&gt;
&lt;p&gt;So what&#39;s the deal?&lt;/p&gt;
&lt;p&gt;I&#39;m glad you asked.  I will demonstrate:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;gt;&amp;gt;&amp;gt; s = file(&#34;data&#34;).read()
&amp;gt;&amp;gt;&amp;gt; s
&#39;SGVsbG8sIHdvcmxkIQ==\n&#39;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If you guessed that &lt;code&gt;s&lt;/code&gt; is a hunk of base64 encoded data, you&#39;d be right! Give yourself a gold star.&lt;/p&gt;
&lt;p&gt;Now, if we want to do anything useful with this data, it needs to be &lt;b&gt;decoded&lt;/b&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;gt;&amp;gt;&amp;gt; s.decode(&#39;base64&#39;)
&#39;Hello, world!&#39;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We have just taken an &lt;b&gt;encoded&lt;/b&gt; hunk of data and &lt;b&gt;decoded&lt;/b&gt; it to get a useful hunk of data.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;gt;&amp;gt;&amp;gt; s.decode(&#39;base64&#39;).replace(&#39;world&#39;, &#39;Marguerite&#39;)
&#39;Hello, Marguerite!&#39;
&amp;gt;&amp;gt;&amp;gt; _.encode(&#39;base64&#39;)
&#39;SGVsbG8sIE1hcmd1ZXJpdGUh\n&#39;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now we can take that useful hunk of data (the English in 7-bit ASCII), do something useful with it (in this case, replace &#39;world&#39; with &#39;Marguerite&#39;), and finally &lt;b&gt;encode&lt;/b&gt; the data.&lt;/p&gt;
&lt;p&gt;So how does all this relate back to Unicode and ascii error messages?&lt;/p&gt;
&lt;p&gt;I have used base64 encoded data here, but the same concept applies when dealing with &lt;b&gt;Unicode&lt;/b&gt; data:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Hunk of opaque data comes in (but we know that it contains some sort of Unicode text)&lt;/li&gt;
&lt;li&gt;Hunk of opaque data is &lt;b&gt;decoded&lt;/b&gt;, creating a &lt;code&gt;unicode&lt;/code&gt; object&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;unicode&lt;/code&gt; object is used for something useful&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;unicode&lt;/code&gt; object is &lt;b&gt;encoded&lt;/b&gt; and saved (to disk, to a database, or sent to a browser)&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;&lt;small&gt;(of course, in the Real World, you&#39;ve got to figure out which encoding was used on the data (UTF-8, Latin1, etc)... But that&#39;s a topic for another post.)&lt;/small&gt;&lt;/p&gt;
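&lt;p&gt;The four steps can be sketched in a few lines. This example is written for Python 3 (where the &#34;hunk of opaque data&#34; is a &lt;code&gt;bytes&lt;/code&gt; object), it assumes the encoding is already known to be UTF-8, and the function name &lt;code&gt;replace_name&lt;/code&gt; is made up for illustration:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;
```python
def replace_name(raw, encoding="utf-8"):
    """Decode incoming bytes, work on the text, then encode the result."""
    text = raw.decode(encoding)                   # step 2: decode
    text = text.replace("mundo", "Marguerite")    # step 3: do something useful
    return text.encode(encoding)                  # step 4: encode for output

incoming = b"Ol\xc3\xa1, mundo!"                  # step 1: opaque data arrives
outgoing = replace_name(incoming)
print(outgoing)   # b'Ol\xc3\xa1, Marguerite!'
```
&lt;/code&gt;&lt;/pre&gt;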
&lt;p&gt;Ok, back to the &lt;code&gt;&#39;ascii&#39; codec can&#39;t decode byte 0xc3 in position 2: ordinal not in range(128)&lt;/code&gt; error.  It should be fairly clear that this error is coming up because Python is trying to decode a bunch of bytes as 7-bit ASCII, but some of them are out of that range (i.e., they have a value over 127).&lt;/p&gt;
&lt;p&gt;I know what you&#39;re saying, &#34;but I never asked Python to decode anything!  I&#39;m just trying to turn it into unicode!&#34;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;gt;&amp;gt;&amp;gt; unicode(&#34;Ol\xc3\xa1, mundo!&#34;)
Traceback (most recent call last):
  File &#34;&amp;lt;stdin&amp;gt;&#34;, line 1, in ?
UnicodeDecodeError: &#39;ascii&#39; codec can&#39;t decode byte 0xc3 in position 2: ordinal not in range(128)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Two questions arise here:  First, &#34;Where is the &#39;ascii&#39; coming from?&#34;  Second, &#34;How do I make it work?&#34;&lt;/p&gt;
&lt;p&gt;To answer the first question, it&#39;s important to think about what&#39;s happening when the call to &lt;code&gt;unicode(...)&lt;/code&gt; is made.  The &lt;code&gt;unicode&lt;/code&gt; function accepts an encoded string, &lt;b&gt;decodes&lt;/b&gt; it, and creates a &lt;code&gt;unicode&lt;/code&gt; object.  In this case, though, we haven&#39;t given the function any indication of which decoder it should use, so it falls back to Python&#39;s default encoding: &lt;code&gt;ascii&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;So how can you make it work?  Tell &lt;code&gt;unicode&lt;/code&gt; which encoding to use:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;gt;&amp;gt;&amp;gt; unicode(&#34;Ol\xc3\xa1, mundo!&#34;, &#39;utf8&#39;)
u&#39;Ol\xe1, mundo!&#39;
&lt;/code&gt;&lt;/pre&gt;
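&lt;p&gt;The same contrast can be reproduced on Python 3 by naming the codec explicitly in both cases. This is just a sketch; the variable names are for illustration only. The position reported by the exception is the index of the first byte that &lt;code&gt;ascii&lt;/code&gt; cannot handle:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;
```python
data = b"Ol\xc3\xa1, mundo!"

# Decoding with the wrong codec fails at the first byte above 127...
try:
    data.decode("ascii")
    failed_at = None
except UnicodeDecodeError as exc:
    failed_at = exc.start

# ...while naming the right codec succeeds.
text = data.decode("utf-8")

print(failed_at)     # 2
print(ascii(text))   # 'Ol\xe1, mundo!'
```
&lt;/code&gt;&lt;/pre&gt;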
&lt;p&gt;&lt;small&gt;(now, as I mentioned before, figuring out which encoding to use is another huge problem... But I&#39;ll leave that for another day)&lt;/small&gt;&lt;/p&gt;
&lt;p&gt;Another problem I run into quite often is this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;gt;&amp;gt;&amp;gt; &#34;Ol\xc3\xa1, mundo!&#34;.encode(&#39;utf8&#39;)
Traceback (most recent call last):
  File &#34;&amp;lt;stdin&amp;gt;&#34;, line 1, in ?
UnicodeDecodeError: &#39;ascii&#39; codec can&#39;t decode byte 0xc3 in position 2: ordinal not in range(128)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;And, by now, the cause of this should be painfully obvious: I&#39;ve given Python an &lt;i&gt;encoded&lt;/i&gt; string, so I should be &lt;i&gt;decoding&lt;/i&gt; it, not encoding it again.&lt;/p&gt;
&lt;p&gt;But why the confusing error message?  Because encoders operate on &lt;code&gt;unicode&lt;/code&gt; objects: when &lt;code&gt;encode&lt;/code&gt; is called on a &lt;code&gt;str&lt;/code&gt;, Python first implicitly decodes the input (in this case, &#34;Ol\xc3...&#34;) to Unicode using the default &lt;code&gt;ascii&lt;/code&gt; codec, and it is that hidden decode step which fails.&lt;/p&gt;
&lt;h3&gt;Is there any end to this insanity?!&lt;/h3&gt;
&lt;p&gt;Yes!  Python 3000 will have two distinct classes: one for strings, one for hunks of data.  Whenever data is read, it will come in as a &#34;hunk of data&#34;.  It will have to be explicitly decoded to a string before it can be used as such.  Hopefully that will make life a little bit less painful.&lt;/p&gt;
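&lt;p&gt;A small sketch of what that separation looks like in practice, assuming a Python 3 interpreter (the variable names are illustrative): the two types refuse to mix, and the misleading &#34;encode an already-encoded string&#34; mistake from above is no longer even expressible.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;
```python
raw = b"Ol\xc3\xa1, mundo!"    # bytes: a "hunk of data"
text = raw.decode("utf-8")     # str: text, after an explicit decode

# Mixing the two types is an error instead of a silent ascii conversion.
try:
    raw + text
    mixed_ok = True
except TypeError:
    mixed_ok = False

# bytes cannot be encoded again, and str cannot be decoded.
bytes_can_encode = hasattr(raw, "encode")
str_can_decode = hasattr(text, "decode")

print(mixed_ok, bytes_can_encode, str_can_decode)   # False False False
```
&lt;/code&gt;&lt;/pre&gt;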
&lt;h3&gt;See also:&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;The obligatory link to &lt;a href=&#34;http://www.joelonsoftware.com/articles/Unicode.html&#34;&gt;Joel Spolsky&#39;s The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;http://www.python.org/dev/peps/pep-3112/&#34;&gt;PEP 3112 - Byte literals in Python 3000&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</content>
  </entry>
</feed>

