Unicode in Python is not a beginner-level topic, and understanding it well requires a proper introduction. For the sake of this post, we are only going to provide a few code snippets that demonstrate the conversion of Unicode to string and vice versa. For detailed information about Unicode in Python, it is recommended to check the following article. Let us get started...
Encoding vs decoding
In text processing, encoding is converting a sequence of characters into a special format for efficient processing, storage and transmission. Decoding is the conversion of encoded data back to its original form. There are various encoding schemes that one can use. Unicode is the king of all: it is the standard character set used on the web, in operating systems, browsers and a lot more.
Unicode defines a unique integer number (called a code point) for every character. When stored or transmitted, a code point is represented by one or more code units (a code unit is 1 byte in UTF-8 and 2 bytes in UTF-16). Unicode has its own encoding schemes. The most popular scheme is UTF-8, which is what we use in the examples.
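To make the difference between code points and code units concrete, here is a minimal Python 3 sketch. The variable names are only illustrative, and utf-16-le is chosen so that no byte order mark is added to the output.

```python
# Three characters: Latin 'A', Greek small mu, Greek capital delta
text = "A\u03bc\u0394"

# Each character maps to exactly one code point (an integer)
code_points = [ord(ch) for ch in text]
print(code_points)  # [65, 956, 916]

# UTF-8 uses 1-byte code units: 'A' needs one, the Greek letters need two each
utf8_bytes = text.encode("utf-8")
print(len(utf8_bytes))  # 5

# UTF-16 uses 2-byte code units: each of these characters needs one unit
utf16_bytes = text.encode("utf-16-le")
print(len(utf16_bytes))  # 6
```

The same three code points take a different number of bytes depending on the encoding scheme, which is why the length of a string and the length of its encoded form usually differ.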
Python 2 vs Python 3
There is a big difference between Python 2 and 3 when dealing with Unicode, so we should be very careful about the terminology we use. In Python 3, a string (type str) is always Unicode. In Python 2, a string (type str) is a dumb stream of bytes that can be in any encoding; text is Unicode only if we explicitly mark it as such (ex. u"hello", which has type unicode).
Going back to the title of this post, since it is a frequently searched term: does it make sense to say unicode to string? Well, in Python 2.x, the encoding process converts a unicode string (ex. u"hello") to the str type (str means bytes), but in Python 3.x there is no separate unicode data type; instead there is the str type, which is always Unicode. So, in Python 3.x there is no unicode to string conversion; there is, however, a conversion from Unicode (the str data type) to bytes, which is the encoding process. I highly recommend that you check the detailed article about Unicode mentioned earlier (you can also find it in the references).
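The Python 3 side of this distinction can be checked directly. The snippet below is a small sketch that assumes Python 3.3 or later, where the u prefix is accepted again for compatibility.

```python
text = "hello"                # str: a sequence of Unicode code points
data = text.encode("utf-8")   # bytes: a sequence of raw bytes

print(type(text))  # <class 'str'>
print(type(data))  # <class 'bytes'>

# The u prefix is still allowed, but it produces an ordinary str
legacy = u"hello"
print(type(legacy) is str)  # True

# Only str has encode(); only bytes has decode()
print(hasattr(text, "decode"))  # False
print(hasattr(data, "encode"))  # False
```

In other words, in Python 3 the direction of each conversion is fixed by the type: str goes to bytes through encode(), and bytes goes back to str through decode().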
Encoding and decoding in Python
Encoding in Python is converting code points to a byte stream. We do that by calling the encode() method on the Unicode string. Decoding, on the other hand, is the opposite process: converting the byte stream back to code points. We do that by calling the decode() method on the byte stream. Let us now cheat and copy paste from the article mentioned earlier…
# Define a Unicode string
x = u'A\u03BC\u0394'
# Encode to byte stream
y = x.encode('utf-8')
# Decode byte stream back to Unicode
z = y.decode('utf-8')
If you run the code snippet above, you should get something like...
Note that the length of the original Unicode string (x) is 3 because we have 3 characters in the string; however, the length of the output byte stream variable (y) is 5 because we have 5 bytes in the byte stream. It is very important to keep in mind what data type we are dealing with, otherwise it is easy to lose track. If we run the same code in Python 3.x, we get something like…
Is the output different between Python 2.x and 3.x? Yes, it is. Instead of type unicode we have str, because strings in Python 3.x are Unicode by default. What else? The encoded output (byte stream) is printed in raw format to the screen, which makes more sense than what Python 2.x prints. If we count the escape sequences that start with \x in the above output, there are 4, plus the one-byte letter A; that is why the length of the encoded output is 5 and not 3.
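For readers who want to reproduce the Python 3 result, here is a hedged sketch of the same round trip with explicit print calls. The print statements are an addition for illustration and are not part of the original snippet.

```python
# Define a Unicode string ('A', Greek small mu, Greek capital delta)
x = "A\u03bc\u0394"
# Encode to a byte stream
y = x.encode("utf-8")
# Decode the byte stream back to Unicode
z = y.decode("utf-8")

print(type(x), len(x))  # <class 'str'> 3
print(type(y), len(y))  # <class 'bytes'> 5
print(y)                # b'A\xce\xbc\xce\x94'
print(type(z), len(z))  # <class 'str'> 3
print(z == x)           # True
```

The raw bytes show the letter A as a single byte followed by four \x escapes, two for each Greek letter, which accounts for the length of 5.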
- Unicode in Python can be confusing because it is handled differently in Python 2.x vs 3.x
- Python 2.x defines immutable strings of type str for bytes data
- Python 3.x strings (type str) are Unicode by default
- To encode a Unicode string to a byte stream, use the string's encode() method. The byte stream's decode() method does the opposite operation
That is all for today. For questions and feedback, please use the comments section below.