Ranting and Roaring

2004/06/09

Encoding and Decoding Python Strings

This is a repost of this and this from my day blog. I’m just setting the stage here for a longer post about internationalized characters in Python.

Here’s how the encode and decode functions work against (octet/normal) strings and Unicode strings in Python:

  • string.encode(x) converts an octet string, assumed to be in 7-bit ASCII, to another octet string that can be interpreted as codeset ‘x’. This is the weirdo operation: it doesn’t really do much except raise an exception if the string contains high-bit data.
  • string.decode(x) converts an octet string, assumed to be in codeset ‘x’, to a unicode string.
  • unicode.encode(x) converts a unicode string (which is the universal solvent) into an octet string that can be interpreted as codeset ‘x’.
  • unicode.decode(x) doesn’t exist.

So: encode(x) always produces octets that can be interpreted as codeset ‘x’; decode(x) reverses the process and converts octets (that are assumed to be in codeset ‘x’) into a unicode string.
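
To make the “weirdo operation” concrete, here is a rough sketch that spells out the hidden ASCII step by hand. The helper name octets_encode and the sample data are my own inventions for illustration, not anything built into Python. The b'' literals are there so the snippet also runs on newer interpreters, where the octet string type is called bytes and has no encode method of its own.

```python
def octets_encode(octets, codeset):
    """Hypothetical helper: mimic encode() on an octet string.

    The octets are first decoded as 7-bit ASCII, and the resulting
    unicode string is then encoded into the requested codeset.
    """
    return octets.decode('ascii').encode(codeset)


plain = b'hello'
# Pure ASCII data survives the ASCII step untouched.
print(octets_encode(plain, 'utf-8') == plain)

high_bit = b'caf\xe9'  # 0xE9 has the high bit set, so it is not 7-bit ASCII
try:
    octets_encode(high_bit, 'utf-8')
except UnicodeDecodeError as exc:
    # The failure comes from the implicit ASCII decode, not from the encode.
    print('rejected: ' + exc.reason)
```

Note that the error is a decode error even though the call looks like an encode, which is a good hint about what is going on underneath.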

In reality, this is what you’ll be doing in your Python programs:

  • When you have octet strings, use “decode” to get the Unicode string.
  • When you have a unicode string, use “encode” to get the octet string.
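
The two rules above, as a minimal round trip. The sample word and the choice of UTF-8 and Latin-1 are assumptions picked for illustration; substitute whatever codeset your data is really in. The b'' and u'' prefixes keep the snippet runnable on both old and new interpreters.

```python
# Octets as they might arrive from a file or socket: the UTF-8 bytes
# for "cafe" with an acute accent on the final letter.
utf8_octets = b'caf\xc3\xa9'

# Rule 1: octet string -> unicode string with decode().
text = utf8_octets.decode('utf-8')
assert text == u'caf\xe9'
assert len(utf8_octets) == 5  # five octets...
assert len(text) == 4         # ...but only four characters

# Rule 2: unicode string -> octet string with encode(), in any codeset
# that can represent the characters.
latin1_octets = text.encode('latin-1')
assert latin1_octets == b'caf\xe9'

# Encoding back to the original codeset gives back the original octets.
assert text.encode('utf-8') == utf8_octets
```

Going from UTF-8 to Latin-1 passes through the unicode string in the middle; there is no direct octets-to-octets conversion for non-ASCII data.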
