This is a repost of this and this from my day blog. I’m just setting the stage here for a longer post about internationalized characters in Python.
Here’s how the encode and decode functions work against (octet/normal) strings and Unicode strings in Python:
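- string.encode(x) converts an octet string, assumed to be in 7-bit ASCII, to another octet string that can be interpreted as codeset ‘x’. This is the weirdo operation: Python first implicitly decodes the octets as ASCII and then encodes the result as ‘x’, so for plain ASCII input and an ASCII-compatible codeset you get the same bytes back, and for anything with high-bit data you get a UnicodeDecodeError.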
- string.decode(x) converts an octet string, assumed to be codeset ‘x’, to a unicode string.
- unicode.encode(x) converts a unicode string (which is the universal solvent) into an octet string that can be interpreted as codeset ‘x’.
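- unicode.decode(x) does exist in Python 2, but you almost never want it: it implicitly encodes the unicode string to ASCII first and then decodes the result as codeset ‘x’, so it raises a UnicodeEncodeError on any non-ASCII character.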
So: encode(x) always produces octets that can be interpreted as codeset ‘x’; decode(x) reverses the process and converts octets (that are assumed to be in codeset ‘x’) into a unicode string.
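Here is a small sketch of the two useful directions, plus what happens when high-bit data meets ASCII. One assumption to flag loudly: the snippet is written for Python 3, where the octet string is called bytes and the Unicode string is called str, rather than the Python 2 names used above. The sample word and variable names are just for illustration.

```python
# Assumes Python 3: bytes is the octet string, str is the Unicode string.
octets = b"caf\xc3\xa9"            # UTF-8 octets for the word "café"

# decode(x): octets assumed to be in codeset x -> Unicode string
text = octets.decode("utf-8")

# encode(x): Unicode string -> octets that can be interpreted as codeset x
latin1_octets = text.encode("latin-1")
utf8_octets = text.encode("utf-8")

# High-bit data cannot pass through ASCII in either direction.
try:
    octets.decode("ascii")
    decode_error = None
except UnicodeDecodeError as err:
    decode_error = err

try:
    text.encode("ascii")
    encode_error = None
except UnicodeEncodeError as err:
    encode_error = err

# In Python 3 the two odd combinations are not available at all.
bytes_has_encode = hasattr(octets, "encode")
str_has_decode = hasattr(text, "decode")
```

The same four characters come out as five octets in UTF-8 and four in Latin-1, which is the whole point: the Unicode string is the neutral form, and the codeset only matters at the moment you turn it into octets or back.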
In reality, this is what you’ll be doing in your Python programs:
- When you have octet strings, use “decode” to get the Unicode string.
- When you have a unicode string, use “encode” to get the octet string.
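In a program, those two rules usually turn into a pattern of decoding at the input edge, working on the Unicode string, and encoding at the output edge. Below is a minimal sketch of that pattern, again assuming Python 3 naming (bytes and str); the function name upper_octets and its codeset parameter are hypothetical, made up for this example.

```python
# Assumes Python 3. upper_octets is a hypothetical helper, not a library function.
def upper_octets(raw: bytes, codeset: str = "utf-8") -> bytes:
    """Uppercase text that arrives and leaves as octets in the given codeset."""
    text = raw.decode(codeset)       # octets in -> Unicode string
    shouted = text.upper()           # do the real work on the Unicode string
    return shouted.encode(codeset)   # Unicode string -> octets out


utf8_result = upper_octets(b"caf\xc3\xa9")               # UTF-8 input
latin1_result = upper_octets(b"caf\xe9", "latin-1")      # Latin-1 input
```

Doing the uppercase step on the raw octets instead would leave the accented character untouched, because bytes.upper() only knows about ASCII letters.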