dsaiztc
11/23/2015 - 9:57 AM

unicode

Unicode in Python

Original source: http://farmdev.com/talks/unicode/

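Typical error when a byte string containing non-ASCII bytes is decoded with the default ASCII codec:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc4 in position 10: ordinal not in range(128)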
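
As a minimal sketch (not part of the original talk), the error above can be reproduced by explicitly decoding a byte string as ASCII when it holds a non-ASCII byte at index 10. The sample bytes are made up for illustration. In Python 2 the same error usually shows up when a str is decoded implicitly, for example when it is concatenated with a unicode object.

```python
# Hypothetical sample data: ten ASCII bytes followed by the UTF-8 bytes for U+0105.
data = b'0123456789\xc4\x85'

try:
    data.decode('ascii')          # byte 0xc4 at index 10 is outside range(128)
    message = None
except UnicodeDecodeError as exc:
    message = str(exc)            # keep the error text for inspection

print(message)
```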

Important methods

s.decode(encoding)
Converts <type 'str'> (bytes) to <type 'unicode'> (text).
u.encode(encoding)
Converts <type 'unicode'> (text) to <type 'str'> (bytes).
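
A small round trip may make the direction of each method easier to remember. This is only a sketch with made-up sample bytes; it is written so that it should run on both Python 2 and Python 3, where the types are called bytes and str instead.

```python
# Hypothetical sample: the UTF-8 bytes for U+0105 followed by the letter "b".
raw = b'\xc4\x85b'

text = raw.decode('utf-8')      # bytes -> text (str -> unicode in Python 2)
encoded = text.encode('utf-8')  # text -> bytes (unicode -> str in Python 2)

# The byte string is longer than the text: U+0105 takes two bytes in UTF-8.
byte_count = len(raw)
char_count = len(text)
round_trip_ok = (encoded == raw)
```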

unicode() vs. decode(): the unicode() constructor also accepts objects other than strings. For a byte string, however, unicode(s, encoding) and s.decode(encoding) are mostly equivalent (source: Stack Overflow).


.encode([encoding], [errors='strict'])

The errors parameter is the same as that of the unicode() constructor, with one additional possibility: besides 'strict', 'ignore', and 'replace', you can also pass 'xmlcharrefreplace', which replaces unencodable characters with XML character references (source: Python docs). Example: u.encode('ascii', 'replace')
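
To compare the error handlers side by side, here is a short sketch. The sample name is an assumption chosen because it contains one character (U+0107) that ASCII cannot represent.

```python
name = u'Ivan Krsti\u0107'  # hypothetical sample text; U+0107 is not in ASCII

replaced = name.encode('ascii', 'replace')            # unencodable character becomes '?'
ignored = name.encode('ascii', 'ignore')              # unencodable character is dropped
xml_refs = name.encode('ascii', 'xmlcharrefreplace')  # becomes the reference '&#263;'

try:
    name.encode('ascii')   # errors defaults to 'strict', so this raises
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True
```

Note that 'replace' and 'ignore' lose information, while 'xmlcharrefreplace' keeps it in a form an XML or HTML consumer can recover.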

Solution

  1. Decode early. Decode to <type 'unicode'> ASAP.
  2. Unicode everywhere. Work only with <type 'unicode'> objects inside the program.
  3. Encode late. Encode to <type 'str'> only when you write to disk or print.
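
# Convert obj to a unicode object if it is a byte string (Python 2).
# Unicode objects and non-string objects are returned unchanged.
def to_unicode_or_bust(obj, encoding='utf-8'):
    if isinstance(obj, basestring):
        if not isinstance(obj, unicode):
            # obj is a <type 'str'>: decode it with the given encoding
            obj = unicode(obj, encoding)
    return obj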
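
The three steps can be sketched end to end as follows. The helper to_text is a hypothetical, version-neutral variant of to_unicode_or_bust (it checks for bytes, which is an alias of str in Python 2), and the sample data is made up; treat it as an illustration rather than the talk's own code.

```python
def to_text(obj, encoding='utf-8'):
    # Hypothetical variant of to_unicode_or_bust: decode byte strings,
    # return everything else (text, numbers, None, ...) unchanged.
    if isinstance(obj, bytes):
        return obj.decode(encoding)
    return obj

incoming = b'Ivan Krsti\xc4\x87'         # bytes arriving from a file or socket

text = to_text(incoming)                 # 1. decode early
shouted = text.upper()                   # 2. unicode everywhere: operate on text only
output_bytes = shouted.encode('utf-8')   # 3. encode late, right at the output boundary
```

Working on text in step 2 matters: upper() on the decoded text handles the non-ASCII letter correctly, which it could not do reliably on the raw UTF-8 bytes.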

Shortcuts

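import codecs
# Open for writing: unicode objects passed to f.write() are encoded to UTF-8.
f = codecs.open('/tmp/ivan_utf8.txt', 'w', encoding='utf-8')
# Open for reading: f.read() returns unicode objects decoded from UTF-8.
f = codecs.open('/tmp/ivan_utf8.txt', 'r', encoding='utf-8')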
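
A complete write-then-read round trip with codecs.open might look like the sketch below. The temporary directory and the sample text are assumptions made so the example does not touch a real file; the final binary read is only there to show which bytes ended up on disk.

```python
import codecs
import os
import tempfile

# Hypothetical path in a fresh temporary directory, so nothing real is overwritten.
path = os.path.join(tempfile.mkdtemp(), 'ivan_utf8.txt')
text = u'Ivan Krsti\u0107'

f = codecs.open(path, 'w', encoding='utf-8')
f.write(text)            # text is encoded to UTF-8 on the way out
f.close()

f = codecs.open(path, 'r', encoding='utf-8')
read_back = f.read()     # bytes are decoded back to text on the way in
f.close()

with open(path, 'rb') as raw_file:
    raw = raw_file.read()  # the bytes actually stored on disk
```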