<?xml version="1.0" encoding="utf-8" ?>

<rss version="2.0" 
   xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
   xmlns:admin="http://webns.net/mvcb/"
   xmlns:dc="http://purl.org/dc/elements/1.1/"
   xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
   xmlns:wfw="http://wellformedweb.org/CommentAPI/"
   xmlns:content="http://purl.org/rss/1.0/modules/content/"
   >
<channel>
    <title>Code Kills - Unicode</title>
    <link>http://blog.codekills.net/</link>
    <description></description>
    <dc:language>en</dc:language>
    <admin:errorReportsTo rdf:resource="mailto:david@wolever.net" />
    <generator>Serendipity 1.1.3 - http://www.s9y.org/</generator>
    <pubDate>Sun, 05 Apr 2009 03:03:49 GMT</pubDate>

    <image>
        <url>http://blog.codekills.net/templates/default/img/s9y_banner_small.png</url>
        <title>RSS: Code Kills - Unicode - </title>
        <link>http://blog.codekills.net/</link>
        <width>100</width>
        <height>21</height>
    </image>

<item>
    <title>Testing for Unicode Safety</title>
    <link>http://blog.codekills.net/archives/46-Testing-for-Unicode-Safety.html</link>
            <category>Unicode</category>
    
    <comments>http://blog.codekills.net/archives/46-Testing-for-Unicode-Safety.html#comments</comments>
    <wfw:comment>http://blog.codekills.net/wfwcomment.php?cid=46</wfw:comment>

    <slash:comments>2</slash:comments>
    <wfw:commentRss>http://blog.codekills.net/rss.php?version=2.0&amp;type=comments&amp;cid=46</wfw:commentRss>
    

    <author>david@wolever.net (David Wolever)</author>
    <content:encoded>
    &lt;p&gt;After &lt;a href=&quot;http://blog.codekills.net/archives/45-str...-yer-probably-doin-it-wrong..html&quot;&gt;yesterday&#039;s post&lt;/a&gt;, &lt;a href=&quot;http://blog.third-bit.com/&quot;&gt;Greg&lt;/a&gt; suggested I write another on how to test for Unicode safety... And unfortunately I&#039;ve got some bad news: it&#039;s hard.&lt;/p&gt;

&lt;p&gt;You never know when some developer, somewhere, will unintentionally encode or decode something the wrong way (&lt;a href=&quot;https://www.drproject.org/DrProject/ticket/1627&quot;&gt;for example&lt;/a&gt;, &lt;code&gt;log(&quot;request for %s&quot;, unicode(url))&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;But there is hope!&lt;/p&gt;

&lt;p&gt;In my experience, almost all Unicode-related issues follow one of two patterns: someone using &lt;code&gt;str&lt;/code&gt; or &lt;code&gt;unicode&lt;/code&gt; incorrectly, or code which unexpectedly encodes/decodes a string.&lt;/p&gt;

&lt;p&gt;The first is easy to check for: grep through the code for &lt;code&gt;str(&lt;/code&gt; and &lt;code&gt;unicode(&lt;/code&gt;.&lt;/p&gt;
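
&lt;p&gt;As a rough illustration of that grep, here is a minimal sketch in Python. The function name &lt;code&gt;find_conversion_calls&lt;/code&gt; and the sample source are made up for this example, and the pattern also matches calls inside comments and string literals, so treat the hits as places to review rather than as bugs:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import re

# Matches a call to str( or unicode( that is not part of a longer name
# such as my_str( and is not a method call such as obj.str(
CONVERSION_CALL = re.compile(r"(?:^|[^\w.])(str|unicode)\(")


def find_conversion_calls(source):
    """Return (line_number, line) pairs for lines that call str() or unicode()."""
    hits = []
    for number, line in enumerate(source.splitlines(), 1):
        if CONVERSION_CALL.search(line):
            hits.append((number, line.strip()))
    return hits


# Hypothetical source text; in practice this would be read from a file
sample = '''
name = str(user)
label = unicode(title)
total = my_str(count)
'''
for number, line in find_conversion_calls(sample):
    print("%d: %s" % (number, line))
&lt;/code&gt;&lt;/pre&gt;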

&lt;p&gt;The second is harder to check for, and requires an understanding of the code base: all of the points where the code interacts with other parts of the system (filesystem, database, network) must be found and checked.&lt;/p&gt;

&lt;p&gt;Finally, it isn&#039;t a bad idea to throw some Unicode into the test suite.  Instead of calling mock users &#039;user0&#039; and &#039;user1&#039;, call them u&#039;\u03bcs\xeb\u044f&#039; (u&quot;μsëя&quot;)*.  Keep a central &quot;database&quot; of these sorts of strings, so it&#039;s easy for developers who don&#039;t normally write in Cyrillic to use Cyrillic characters in their code (I keep my own personal list at &lt;a href=&quot;http://wolever.net/~wolever/wiki/unicode_audit&quot;&gt;http://wolever.net/~wolever/wiki/unicode_audit&lt;/a&gt; -- a URL I can now type from memory).&lt;/p&gt;

&lt;p&gt;One word of caution, though: you&#039;re asking for a world of pain if you actually think you can &lt;em&gt;commit&lt;/em&gt; UTF-8 encoded text -- any number of things will break (Subversion may helpfully fail, your editor may helpfully re-encode the file, your unenlightened developers will complain about funny question marks in their code, etc.).  Instead, have a central file which defines these &quot;canned test strings&quot; using escaped Python strings (i.e., u&#039;\u03bc...&#039;), then import that into your test suite:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;from app.tests import i18n
...
def test_user():
    u = User(name=i18n.user)
    ...
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Or tar up all the offending files, then write a script to un-tar them when they are needed**.&lt;/p&gt;
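
&lt;p&gt;For the first approach, the central file of canned test strings might look something like the sketch below. The path &lt;code&gt;app/tests/i18n.py&lt;/code&gt; is only inferred from the import above, and every name other than &lt;code&gt;user&lt;/code&gt; is made up for illustration. Because each string is written with escapes, the file on disk stays plain ASCII:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# app/tests/i18n.py (assumed location): canned non-ASCII test strings,
# written with escapes so the source file itself stays 7-bit ASCII.

# Greek, Latin-1 and Cyrillic characters that read like "user"
user = u'\u03bcs\xeb\u044f'

# Hypothetical extra entries; these names are illustrative only
filename = u'r\xe9sum\xe9.txt'
city = u'\u6771\u4eac'
&lt;/code&gt;&lt;/pre&gt;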

&lt;p&gt;So, to sum it up:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Make sure your developers grok (or, at least, understand) Unicode and encodings&lt;/li&gt;
&lt;li&gt;Make sure your code uses &lt;code&gt;str&lt;/code&gt; and &lt;code&gt;unicode&lt;/code&gt; safely&lt;/li&gt;
&lt;li&gt;Make sure your exit points are covered&lt;/li&gt;
&lt;li&gt;Make it really easy to include Unicode in tests&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Then maybe, if you&#039;re lucky, all those inconsiderate people who have the audacity to ask for more than 127 characters will be able to use your application &lt;img src=&quot;http://blog.codekills.net/templates/default/img/emoticons/smile.png&quot; alt=&quot;:-)&quot; style=&quot;display: inline; vertical-align: bottom;&quot; class=&quot;emoticon&quot; /&gt;&lt;/p&gt;

&lt;p&gt;*: A good choice because it&#039;s easy for ignorant North Americans like myself to see that it&#039;s correct.&lt;/p&gt;

&lt;p&gt;**: This is how I tested DrProject&#039;s handling of Unicode filenames which are checked into the Subversion repository.&lt;/p&gt;
 
    </content:encoded>

    <pubDate>Wed, 11 Feb 2009 10:30:47 -0500</pubDate>
    <guid isPermaLink="false">http://blog.codekills.net/archives/46-guid.html</guid>
    
</item>
<item>
    <title>str(...): 'yer probably doin' it wrong.</title>
    <link>http://blog.codekills.net/archives/45-str...-yer-probably-doin-it-wrong..html</link>
            <category>Python</category>
            <category>Unicode</category>
    
    <comments>http://blog.codekills.net/archives/45-str...-yer-probably-doin-it-wrong..html#comments</comments>
    <wfw:comment>http://blog.codekills.net/wfwcomment.php?cid=45</wfw:comment>

    <slash:comments>1</slash:comments>
    <wfw:commentRss>http://blog.codekills.net/rss.php?version=2.0&amp;type=comments&amp;cid=45</wfw:commentRss>
    

    <author>david@wolever.net (David Wolever)</author>
    <content:encoded>
    &lt;p&gt;Unicode is an ugly beast... And until people start standardizing on Python 3k*,
we&#039;re going to have to live with the eccentricities of Python 2&#039;s strings.&lt;/p&gt;

&lt;p&gt;But, fear not! There is (at least some) hope. By changing a few patterns in the
way you code, you can alleviate the bulk of Unicode-related problems.&lt;/p&gt;

&lt;p&gt;First, using the &lt;tt&gt;str&lt;/tt&gt; function.  In just about every case, if you&#039;re
using the &lt;tt&gt;str&lt;/tt&gt; function, you&#039;re probably doing it wrong.&lt;/p&gt;

&lt;p&gt;Let me demonstrate:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;from hashlib import sha256
def hash(to_hash):
  hash = sha256(to_hash).hexdigest()
  print to_hash, &quot;:&quot;, hash
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Cool, we can hash things then print out the hash:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;&amp;gt;&amp;gt;&amp;gt; hash(&#039;ohai&#039;)
ohai : e84712238709398f6d349dc2250b0efca4b72d8c2bfb7b74339d30ba94056b14
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;But, wait... What happens if the thing we&#039;re hashing isn&#039;t a string (even though
it can be represented as a string):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;&amp;gt;&amp;gt;&amp;gt; class Person:
...   name = &#039;David&#039;
...   def __str__(self):
...     return self.name
...
&amp;gt;&amp;gt;&amp;gt; hash(Person())
...
TypeError: new() argument 1 must be string or read-only buffer, not instance
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Oh no!  Ok, let&#039;s fix the code:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;from hashlib import sha256
def hash(to_hash):
  to_hash = str(to_hash) # Convert the object to a string before we hash it
  hash = sha256(to_hash).hexdigest()
  print to_hash, &quot;:&quot;, hash
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Great -- we can hash a &lt;code&gt;Person&lt;/code&gt; now:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;&amp;gt;&amp;gt;&amp;gt; hash(Person())
David : a6b54c20a7b96eeac1a911e6da3124a560fe6dc042ebf270e3676e7095b95652
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;And, for most people who only speak English, this is a perfect place to stop.
After all, everything is a &lt;code&gt;str&lt;/code&gt;, right?&lt;/p&gt;

&lt;p&gt;Well... No.  What happens if the input is a &lt;code&gt;unicode&lt;/code&gt; object?&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;&amp;gt;&amp;gt;&amp;gt; p = Person()
&amp;gt;&amp;gt;&amp;gt; # This person had the audacity to give themselves a name containing
&amp;gt;&amp;gt;&amp;gt; # non-ascii symbols, so we represent it with a unicode object
&amp;gt;&amp;gt;&amp;gt; p.name = u&#039;I\xf1t\xebrn\xe2ti\xf4n\xe0liz\xe6ti\xf8n&#039;
&amp;gt;&amp;gt;&amp;gt; hash(p)
...
UnicodeEncodeError: &#039;ascii&#039; codec can&#039;t encode character u&#039;\xf1&#039; in position
1: ordinal not in range(128)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Crap.  Where the heck is &#039;ascii&#039; coming from?&lt;/p&gt;

&lt;p&gt;Well, it&#039;s a long story (which I&#039;ve covered over at &lt;a href=&quot;http://blog.codekills.net/archives/38-Encoding-and-Decoding-Text-in-Python-or-I-didnt-ask-you-to-use-the-ascii-codec!.html&quot;&gt;Encoding and Decoding Text in Python&lt;/a&gt;),
but basically the &lt;code&gt;__str__&lt;/code&gt; method of the &lt;code&gt;unicode&lt;/code&gt; object (u&#039;I\xf1...&#039;) is trying to
encode the object using Python&#039;s default encoding... Which, in this
case, is ascii.&lt;/p&gt;
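
&lt;p&gt;That implicit step can be reproduced by hand. A minimal sketch, assuming the default encoding is ascii as it is here: encoding the same name with the ascii codec explicitly fails in the same way, while naming an encoding that can represent the text works.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;name = u'I\xf1t\xebrn\xe2ti\xf4n\xe0liz\xe6ti\xf8n'

try:
    # Roughly what str(name) does behind the scenes when the default
    # encoding is ascii
    name.encode('ascii')
except UnicodeEncodeError as error:
    message = str(error)
    print(message)

# Naming an encoding that can represent the text works fine
encoded = name.encode('utf-8')
&lt;/code&gt;&lt;/pre&gt;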

&lt;p&gt;&quot;Alright...&quot;, you&#039;re probably thinking, &quot;If the problem is with &lt;code&gt;unicode&lt;/code&gt;, maybe
I could just replace that call to &lt;code&gt;str&lt;/code&gt; with a call to &lt;code&gt;unicode&lt;/code&gt;&quot;&lt;/p&gt;

&lt;p&gt;Ok, let&#039;s see what happens.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;from hashlib import sha256
def hash(to_hash):
  to_hash = unicode(to_hash) # Convert the object to a unicode before we hash it
  hash = sha256(to_hash).hexdigest()
  print to_hash, &quot;:&quot;, hash

&amp;gt;&amp;gt;&amp;gt; # this time, though, the person&#039;s name has come from The Internet, so it
&amp;gt;&amp;gt;&amp;gt; # is not yet a unicode object
&amp;gt;&amp;gt;&amp;gt; p.name = &#039;I\xc3\xb1t\xc3\xabrn\xc3\xa2ti\xc3\xb4n\xc3\xa0liz\xc3\xa6ti\xc3\xb8n&#039;
&amp;gt;&amp;gt;&amp;gt; hash(p)
...
UnicodeDecodeError: &#039;ascii&#039; codec can&#039;t decode byte 0xc3 in position 1:
ordinal not in range(128)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Yup, that&#039;s right -- you just can&#039;t win &lt;img src=&quot;http://blog.codekills.net/templates/default/img/emoticons/sad.png&quot; alt=&quot;:-(&quot; style=&quot;display: inline; vertical-align: bottom;&quot; class=&quot;emoticon&quot; /&gt;&lt;/p&gt;

&lt;p&gt;What&#039;s happening here?  Well, this time, the &lt;code&gt;unicode&lt;/code&gt; function is trying to
decode the input (&#039;I\xc3...&#039;) into a unicode object... But, because the input
isn&#039;t valid 7-bit ascii (again, Python&#039;s default), it explodes. Crap.&lt;/p&gt;

&lt;p&gt;Confused yet?&lt;/p&gt;

&lt;p&gt;So how can we save ourselves from all this insanity?&lt;/p&gt;

&lt;p&gt;Actually, it&#039;s not too difficult:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Convert every single string, without exception, to &lt;code&gt;unicode&lt;/code&gt; as soon as it
enters the system.  For example, if you are writing a web application,
GET and POST variables should be converted to &lt;code&gt;unicode&lt;/code&gt; as soon as they are
read from the environment:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;for (key, value) in environment.get_vars:
    request.GET[to_unicode(key)] = to_unicode(value)
&lt;/code&gt;&lt;/pre&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Only convert the &lt;code&gt;unicode&lt;/code&gt; objects back to &lt;code&gt;str&lt;/code&gt; strings when you absolutely
must.  For example, when they are written to a file:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;log_file.write(&quot;New user &#039;%s&#039; created&quot; %(to_str(p.name)))
&lt;/code&gt;&lt;/pre&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Think hard before you call &lt;code&gt;str&lt;/code&gt; or &lt;code&gt;unicode&lt;/code&gt;. Each time your fingers type
&quot;s&quot;, &quot;t&quot;, flashing lights and sirens should go off in your head, reminding
you to make sure that the object you are &lt;code&gt;str&lt;/code&gt;ing could never, ever, ever
possibly contain unicode.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;And what about those &lt;code&gt;to_unicode&lt;/code&gt; and &lt;code&gt;to_str&lt;/code&gt; functions?  What should they look
like?  Well, probably something like this:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import locale
def to_unicode(text):
    &quot;&quot;&quot; Convert, at all costs, &#039;text&#039; to a `unicode` object.

        Note: as a last-ditch effort, this function tries to decode the text
              as latin1... Which will always succeed.  If you expect to get
              text encoded with latin[2-9] or some other character set, this
              may not be desirable.

        &amp;gt;&amp;gt;&amp;gt; to_unicode(u&#039;I\xf1t\xebrn\xe2ti&#039;)
        u&#039;I\xf1t\xebrn\xe2ti&#039;
        &amp;gt;&amp;gt;&amp;gt; to_unicode(&#039;I\xc3\xb1t\xc3\xabrn\xc3\xa2ti&#039;)
        u&#039;I\xf1t\xebrn\xe2ti&#039;
        &amp;gt;&amp;gt;&amp;gt; class Foo:
        ...   def __str__(self):
        ...       return &#039;foo&#039;
        ...
        &amp;gt;&amp;gt;&amp;gt; f = Foo()
        &amp;gt;&amp;gt;&amp;gt; to_unicode(f)
        u&#039;foo&#039;
        &amp;gt;&amp;gt;&amp;gt; f.__unicode__ = lambda: u&#039;bar&#039;
        &amp;gt;&amp;gt;&amp;gt; to_unicode(f)
        u&#039;bar&#039;
        &amp;gt;&amp;gt;&amp;gt; &quot;&quot;&quot;

    if isinstance(text, unicode):
        return text

    if hasattr(text, &#039;__unicode__&#039;):
        return text.__unicode__()

    text = str(text)

    try:
        return unicode(text, &#039;utf-8&#039;)
    except UnicodeError:
        pass

    try:
        return unicode(text, locale.getpreferredencoding())
    except UnicodeError:
        pass

    return unicode(text, &#039;latin1&#039;)


def to_str(text):
    &quot;&quot;&quot; Convert &#039;text&#039; to a `str` object.

        &amp;gt;&amp;gt;&amp;gt; to_str(u&#039;I\xf1t\xebrn\xe2ti&#039;)
        &#039;I\xc3\xb1t\xc3\xabrn\xc3\xa2ti&#039;
        &amp;gt;&amp;gt;&amp;gt; to_str(42)
        &#039;42&#039;
        &amp;gt;&amp;gt;&amp;gt; to_str(&#039;ohai&#039;)
        &#039;ohai&#039;
        &amp;gt;&amp;gt;&amp;gt; class Foo:
        ...     def __str__(self):
        ...         return &#039;foo&#039;
        ...
        &amp;gt;&amp;gt;&amp;gt; f = Foo()
        &amp;gt;&amp;gt;&amp;gt; to_str(f)
        &#039;foo&#039;
        &amp;gt;&amp;gt;&amp;gt; f.__unicode__ = lambda: u&#039;I\xf1t\xebrn\xe2ti&#039;
        &amp;gt;&amp;gt;&amp;gt; to_str(f)
        &#039;I\xc3\xb1t\xc3\xabrn\xc3\xa2ti&#039;
        &amp;gt;&amp;gt;&amp;gt; &quot;&quot;&quot;
    if isinstance(text, str):
        return text

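    if hasattr(text, &#039;__unicode__&#039;):
        text = text.__unicode__()

    # Encode unicode explicitly: calling __str__ on a unicode object would
    # fall back to the ascii codec and fail on non-ascii text
    if isinstance(text, unicode):
        return text.encode(&#039;utf-8&#039;)

    return str(text)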
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;So there you have it.
A quick and fairly easy way to avoid many of your encoding-related problems &lt;img src=&quot;http://blog.codekills.net/templates/default/img/emoticons/smile.png&quot; alt=&quot;:-)&quot; style=&quot;display: inline; vertical-align: bottom;&quot; class=&quot;emoticon&quot; /&gt;&lt;/p&gt;

&lt;p&gt;If you&#039;re still not quite feeling comfortable with all of this, though, take a
read over &lt;a href=&quot;http://www.joelonsoftware.com/articles/Unicode.html&quot;&gt;Joel Spolsky&#039;s The Absolute Minimum Every Software Developer
Absolutely, Positively Must Know About Unicode and Character Sets (No
Excuses!)&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;*: My newly installed Debian 4 machine is still running Python 2.4 (released
November 2004)... So that wait might be a while.&lt;/p&gt;
 
    </content:encoded>

    <pubDate>Tue, 10 Feb 2009 11:13:19 -0500</pubDate>
    <guid isPermaLink="false">http://blog.codekills.net/archives/45-guid.html</guid>
    
</item>
<item>
    <title>Encoding and Decoding Text in Python (or: &quot;I didn't ask you to use the 'ascii' codec!&quot;)</title>
    <link>http://blog.codekills.net/archives/38-Encoding-and-Decoding-Text-in-Python-or-I-didnt-ask-you-to-use-the-ascii-codec!.html</link>
            <category>Unicode</category>
    
    <comments>http://blog.codekills.net/archives/38-Encoding-and-Decoding-Text-in-Python-or-I-didnt-ask-you-to-use-the-ascii-codec!.html#comments</comments>
    <wfw:comment>http://blog.codekills.net/wfwcomment.php?cid=38</wfw:comment>

    <slash:comments>0</slash:comments>
    <wfw:commentRss>http://blog.codekills.net/rss.php?version=2.0&amp;type=comments&amp;cid=38</wfw:commentRss>
    

    <author>david@wolever.net (David Wolever)</author>
    <content:encoded>
    &lt;p&gt;When dealing with Unicode in Python, it doesn&#039;t take long to get the dreaded &lt;code&gt;&#039;ascii&#039; codec can&#039;t decode byte 0xc3 in position 2: ordinal not in range(128)&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;You never see it coming.  It doesn&#039;t make any sense. You didn&#039;t even ask for &lt;code&gt;ascii&lt;/code&gt;!&lt;/p&gt;

&lt;p&gt;So what&#039;s the deal?&lt;/p&gt;

&lt;p&gt;I&#039;m glad you asked.  I will demonstrate:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;&amp;gt;&amp;gt;&amp;gt; s = file(&quot;data&quot;).read()
&amp;gt;&amp;gt;&amp;gt; s
&#039;SGVsbG8sIHdvcmxkIQ==\n&#039;
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;If you guessed that &lt;code&gt;s&lt;/code&gt; is a hunk of base64 encoded data, you&#039;d be right! Give yourself a gold star.&lt;/p&gt;

&lt;p&gt;Now, if we want to do anything useful with this data, it needs to be &lt;b&gt;decoded&lt;/b&gt;:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;&amp;gt;&amp;gt;&amp;gt; s.decode(&#039;base64&#039;)
&#039;Hello, world!&#039;
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;We have just taken an &lt;b&gt;encoded&lt;/b&gt; hunk of data and &lt;b&gt;decoded&lt;/b&gt; it to get a useful hunk of data.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;&amp;gt;&amp;gt;&amp;gt; s.decode(&#039;base64&#039;).replace(&#039;world&#039;, &#039;Marguerite&#039;)
&#039;Hello, Marguerite!&#039;
&amp;gt;&amp;gt;&amp;gt; _.encode(&#039;base64&#039;)
&#039;SGVsbG8sIE1hcmd1ZXJpdGUh\n&#039;
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Now we can take that useful hunk of data (the English in 7-bit ASCII), do something useful with it (in this case, replace &#039;world&#039; with &#039;Marguerite&#039;), and finally &lt;b&gt;encode&lt;/b&gt; the data.&lt;/p&gt;

&lt;p&gt;So how does all this relate back to Unicode and ascii error messages?&lt;/p&gt;

&lt;p&gt;I have used base64 encoded data here, but the same concept applies when dealing with &lt;b&gt;Unicode&lt;/b&gt; data:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Hunk of opaque data comes in (but we know that it contains some sort of Unicode text)&lt;/li&gt;
&lt;li&gt;Hunk of opaque data is &lt;b&gt;decoded&lt;/b&gt;, creating a &lt;code&gt;unicode&lt;/code&gt; object&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;unicode&lt;/code&gt; object is used for something useful&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;unicode&lt;/code&gt; object is &lt;b&gt;encoded&lt;/b&gt; and saved (to disk, to a database, or sent to a browser)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;small&gt;(of course, in the Real World, you&#039;ve got to figure out which encoding was used on the data (UTF-8, Latin1, etc)... But that&#039;s a topic for another post.)&lt;/small&gt;&lt;/p&gt;
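
&lt;p&gt;The four steps above can be sketched in a few lines. This is only an illustration: the bytes are hard-coded rather than read from anywhere, and UTF-8 is assumed as the encoding both on the way in and on the way out.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# 1. A hunk of opaque data comes in (hard-coded here; assumed to be UTF-8)
raw = b'Ol\xc3\xa1, mundo!'

# 2. Decode it, creating a unicode object
text = raw.decode('utf-8')

# 3. Do something useful with the unicode object
greeting = text.replace(u'mundo', u'Marguerite')

# 4. Encode it again before saving or sending it
output = greeting.encode('utf-8')
&lt;/code&gt;&lt;/pre&gt;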

&lt;p&gt;Ok, back to the &lt;code&gt;&#039;ascii&#039; codec can&#039;t decode byte 0xc3 in position 2: ordinal not in range(128)&lt;/code&gt; error.  It should be fairly clear that this error is coming up because Python is trying to decode a bunch of bytes as 7-bit ASCII, but some of them are out of that range (i.e., they have a value over 127).&lt;/p&gt;

&lt;p&gt;I know what you&#039;re saying, &quot;but I never asked Python to decode anything!  I&#039;m just trying to turn it into unicode!&quot;&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;&amp;gt;&amp;gt;&amp;gt; unicode(&quot;Ol\xc3\xa1, mundo!&quot;)
Traceback (most recent call last):
  File &quot;&amp;lt;stdin&amp;gt;&quot;, line 1, in ?
UnicodeDecodeError: &#039;ascii&#039; codec can&#039;t decode byte 0xc3 in position 2: ordinal not in range(128)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Two questions arise here:  First, &quot;Where is the &#039;ascii&#039; coming from?&quot;  Second, &quot;How do I make it work?&quot;&lt;/p&gt;

&lt;p&gt;To answer the first question, it&#039;s important to think about what&#039;s happening when the call to &lt;code&gt;unicode(...)&lt;/code&gt; is made.  The &lt;code&gt;unicode&lt;/code&gt; function accepts an encoded string, &lt;b&gt;decodes&lt;/b&gt; it, and creates a &lt;code&gt;unicode&lt;/code&gt; object.  In this case, though, we haven&#039;t given the function any indication of which decoder it should use, so it falls back to Python&#039;s default encoding: &lt;code&gt;ascii&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;So how can you make it work?  Tell &lt;code&gt;unicode&lt;/code&gt; which encoding to use:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;&amp;gt;&amp;gt;&amp;gt; unicode(&quot;Ol\xc3\xa1, mundo!&quot;, &#039;utf8&#039;)
u&#039;Ol\xe1, mundo!&#039;
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;&lt;small&gt;(now, as I mentioned before, figuring out which encoding to use is another huge problem... But I&#039;ll leave that for another day)&lt;/small&gt;&lt;/p&gt;

&lt;p&gt;Another problem I run into quite often is this:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;&amp;gt;&amp;gt;&amp;gt; &quot;Ol\xc3\xa1, mundo!&quot;.encode(&#039;utf8&#039;)
Traceback (most recent call last):
  File &quot;&amp;lt;stdin&amp;gt;&quot;, line 1, in ?
UnicodeDecodeError: &#039;ascii&#039; codec can&#039;t decode byte 0xc3 in position 2: ordinal not in range(128)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;And, by now, the cause of this should be painfully obvious: I&#039;ve given Python an &lt;i&gt;encoded&lt;/i&gt; string, so I should be &lt;i&gt;decoding&lt;/i&gt; it, not encoding it again.&lt;/p&gt;

&lt;p&gt;But why the confusing error message?  Well, I&#039;m not entirely sure, but my guess is that the UTF-8 encoder expects a &lt;code&gt;unicode&lt;/code&gt; object, so it tries to convert the input (in this case, &quot;Ol\xc3...&quot;) to Unicode before encoding it.&lt;/p&gt;
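
&lt;p&gt;That guess matches how Python 2 behaves: calling &lt;code&gt;encode&lt;/code&gt; on a &lt;code&gt;str&lt;/code&gt; first decodes it with the default encoding to get a &lt;code&gt;unicode&lt;/code&gt; object, and it is that hidden decode which fails. A minimal sketch that spells out the two steps by hand, assuming the default encoding is ascii:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;data = b'Ol\xc3\xa1, mundo!'

try:
    # The hidden first step: decode with the default codec (ascii) ...
    as_unicode = data.decode('ascii')
    # ... and only then encode with the codec that was asked for
    result = as_unicode.encode('utf8')
except UnicodeDecodeError as error:
    message = str(error)
    print(message)
&lt;/code&gt;&lt;/pre&gt;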

&lt;h3&gt;Is there any end to this insanity?!&lt;/h3&gt;

&lt;p&gt;Yes!  Python 3000 will have two distinct classes: one for strings, one for hunks of data.  Whenever data is read, it will come in as a &quot;hunk of data&quot;.  It will have to be explicitly decoded to a string before it can be used as such.  Hopefully that will make life a little bit less painful.&lt;/p&gt;
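
&lt;p&gt;For a taste of what that looks like, here is a small sketch that runs on Python 3 only. It assumes the two classes are named &lt;code&gt;bytes&lt;/code&gt; and &lt;code&gt;str&lt;/code&gt;, as they are in PEP 3112 and the Python 3 releases, and reuses the greeting from the examples above:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;data = b'Ol\xc3\xa1, mundo!'    # a hunk of data: bytes
text = data.decode('utf-8')     # an explicit decode gives a string: str

try:
    # Mixing the two is refused outright; there is no silent ascii decode
    data + text
    mixing_allowed = True
except TypeError:
    mixing_allowed = False

print(type(data).__name__, type(text).__name__, mixing_allowed)
&lt;/code&gt;&lt;/pre&gt;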

&lt;h3&gt;See also:&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;The obligatory link to &lt;a href=&quot;http://www.joelonsoftware.com/articles/Unicode.html&quot;&gt;Joel Spolsky&#039;s The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://www.python.org/dev/peps/pep-3112/&quot;&gt;PEP 3112 - Byte literals in Python 3000&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
 
    </content:encoded>

    <pubDate>Thu, 01 May 2008 13:27:00 -0400</pubDate>
    <guid isPermaLink="false">http://blog.codekills.net/archives/38-guid.html</guid>
    
</item>

</channel>
</rss>