Python Unicode strings tutorial

Introduction

Unicode is an important topic in computing, but it can be confusing, and even more so in the context of the Python programming language. The aim of this article is to clear up that confusion and put things into perspective. Beginners often struggle because they approach Unicode without a proper understanding of the underlying concepts. In my opinion, understanding Unicode is not hard, but one has to be familiar with some basic concepts first. For example:

  • Be comfortable with number representations (e.g. binary, hexadecimal)
  • Be familiar with raw storage units (e.g. bits, bytes, integers)
  • Understand the abstract meaning of encoding and decoding (e.g. byte streams, code points)
  • Realize that there are multiple encoding schemes out there (e.g. ASCII, UTF-8)
  • Be aware of language specifics (e.g. Python 2.x, Python 3.x)
  • Pay attention to the terms used (e.g. character, code point, glyph)

As you can see, this is just a short list off the top of my head, and there could be more. Unless you are crystal clear about what all of that means, Unicode can be confusing and frustrating. With that said, we are going to start with the basics and build on them to reach the desired understanding.

Let us get started...

Globalization (g11n) vs Internationalization (i18n) vs Localization (l10n)

Globalization in software development is a complex topic in its own right. This is not the right venue to discuss internationally ready applications in depth, so we are only going to provide a high-level view to see how Unicode fits into the big picture.

Software globalization is the process of adapting a product to regional and language differences. It is the combination of internationalization and localization. Internationalization refers to software design that enables a product to be world ready without major engineering changes, primarily by maintaining one source code base for all supported locales or markets. Localization, on the other hand, is the process of adapting a software product to a particular region or language. It involves translating the application GUI and taking cultural differences into account.

So, how is g11n related to Unicode? In short, Unicode is the foundational standard for software globalization, as it enables us to represent the complex writing systems of the world. To better understand how Unicode works, let us briefly discuss a few relevant concepts…

Raw storage units

Talking about bits and bytes may sound like entry level computer science. I do not mind that, simply because it can be difficult for beginners to connect the dots if they do not refresh their memory about raw storage units.

For now, let us just recall that a bit is a single binary digit (0 or 1, i.e. a boolean) and a byte is 8 bits. Larger units are built from several bytes: the size of a word depends on the machine (commonly 32 or 64 bits), a 32 bit integer occupies 4 bytes and a 64 bit integer occupies 8 bytes. Why do we need to know that? Unicode uses integer numbers as code points. What is a code point? We will talk about that later.

Stay tuned...

Numbering systems

We humans use the decimal (base 10) numbering system. Nobody knows for sure why humans adopted it. Some think it is because we have 10 fingers, others think it is due to ease of use. Modern computers, on the other hand, use the binary numbering system (no wonder, since a transistor has two states: on or off).

Let us take an example...

Binary number 1101 (reading the digits from right to left) = 1×2^0 + 0×2^1 + 1×2^2 + 1×2^3 = 1×1 + 0×2 + 1×4 + 1×8 = 13

As you can see, the value of each digit depends on its position. Binary is a piece of cake for computers. Unfortunately for programmers, dealing with long strings of ones and zeros is confusing. The solution is to use hexadecimal (base 16) as a shorthand. In hexadecimal notation, digits go from 0 to 9 and then from A to F.

Let us take an example...

Binary 1101011101011010
# Split it into 4 groups of 4 digits each
1101    0111    0101    1010
# Decimal value of each group
13      7       5       10
# In hexadecimal 13 is D and 10 is A
# So binary 1101011101011010 is D75A in hexadecimal
# It is often written as 0xD75A
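
The same conversions can be checked in Python itself. The snippet below is only a sketch that uses the built-in int(), bin(), hex() and format() functions; the variable names are chosen purely for illustration.

```python
# Parse binary strings into integers (base 2)
small = int("1101", 2)
large = int("1101011101011010", 2)

# Format the integers back into other bases
small_as_binary = bin(small)               # text form with the 0b prefix
large_as_hex = hex(large)                  # text form with the 0x prefix
large_as_hex_upper = format(large, "04X")  # upper case, no prefix

print(small)               # 13
print(small_as_binary)     # 0b1101
print(large_as_hex)        # 0xd75a
print(large_as_hex_upper)  # D75A
```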

As we will see later, Unicode code points are integers in the range 0 to 0x10FFFF, and UTF-8 stores each of them in 1 to 4 bytes. Writing a number of up to 32 binary digits is not convenient, so we use hexadecimal instead. That is all we need from numbering systems for now.

Let us continue...

Encoding vs decoding

Encoding and decoding are general terms that are used not only in text processing but also in other areas such as communication, networking and storage. You have probably dealt with encoding without noticing it. For example, image, audio, video and compression formats are all types of encoding in one way or another.

So, what is encoding and decoding in the context of text processing?

Encoding is converting a sequence of characters (a character is any symbol) into a special format (e.g. a byte stream in ASCII or in a Unicode encoding) for efficient processing, storage and transmission. Decoding is the opposite process: converting the encoded data back to its original form.

Encoding schemes

There are various ways to convert text into a byte stream. Each particular way of doing that conversion is called an encoding scheme. This article focuses on the Unicode encoding schemes because they are powerful and widespread, but there are other legacy encoding schemes that are still in use.

For example...

  • ASCII, an encoding scheme that uses 7 bits per character, for a total of 128 characters
  • ISO-8859-1, a legacy single byte standard that can represent up to 256 characters. It is suitable for some Western European languages
  • Windows-1252, a single byte character encoding of the Latin alphabet

At this point, we are ready to dive into our main topic, which is Unicode (in Python).

Unicode

We mentioned in the previous section that there are legacy encoding schemes that are still widely used. Before Unicode, however, there was no single agreed upon scheme to map characters to numbers. Existing schemes were incompatible with each other and had limitations in representing many international writing systems and languages. On top of that, there was no central authority to manage a common standard.

Unicode (i.e. universal code) was designed to clean up the mess of legacy encodings by creating a universal character set, instead of extending ASCII yet again and adding to the chaos. Today, Unicode is so widely adopted that it is the basis of the World Wide Web. Browsers, search engines, modern operating systems, laptops and mobile devices all rely on Unicode to process text. Without exaggeration, Unicode is one of the most significant global software technologies of recent decades. So what is the basic idea behind Unicode?

Unicode defines a unique integer number (called a code point) for every character, regardless of platform, device, application or language. In a given encoding, a code point is represented by one or more code units (a code unit is 1 byte in UTF-8 and 2 bytes in UTF-16). A sequence of one or more code points displayed as a single graphical unit is called a grapheme (e.g. a base letter followed by a combining accent). A glyph is an image, typically stored in a font, that represents a grapheme (or simply a character). Enough Unicode terminology for now, just keep these terms in mind when dealing with Unicode.
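
The terms above can be explored from a Python 3 prompt. The following is a minimal sketch, not a complete treatment of graphemes; the sample characters and variable names are chosen for illustration, and unicodedata is the standard library module.

```python
import unicodedata

# A code point is just an integer assigned to a character
code_point_of_a = ord("A")   # 65, i.e. U+0041
mu = chr(0x03BC)             # the character whose code point is U+03BC

# The same code point needs a different number of code units
# depending on the encoding
clef = "\U0001D11E"          # MUSICAL SYMBOL G CLEF
utf8_units = len(clef.encode("utf-8"))            # 1-byte code units
utf16_units = len(clef.encode("utf-16-le")) // 2  # 2-byte code units

# One grapheme can be made of several code points:
# "e" followed by COMBINING ACUTE ACCENT is displayed as one symbol
decomposed = "e\u0301"
composed = unicodedata.normalize("NFC", decomposed)

print(hex(code_point_of_a), hex(ord(mu)))  # 0x41 0x3bc
print(utf8_units, utf16_units)             # 4 2
print(len(decomposed), len(composed))      # 2 1
```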

Unicode has its own encoding schemes. The most popular scheme is UTF-8 which is what we are focusing on in this article...

What is UTF-8

UTF stands for Unicode Transformation Format, and the number 8 means that 8 bit blocks (also called code units) are used to represent a character. UTF-8 is a variable length encoding scheme, which means a code point is encoded as 1, 2, 3 or 4 bytes. The first few bits of each byte tell the decoder whether it is a single byte character, the first byte of a multibyte character or a continuation byte. UTF-8 is one of the most widely used encoding schemes and it has many advantages, for example...

  • Backward compatibility with ASCII (ASCII characters are encoded as the same single byte)
  • Variable length encoding for efficient storage
  • Code length of 1-4 bytes, which is enough to represent every Unicode code point (only a fraction of the code space is currently assigned)
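
The variable length and the role of the leading bits can be observed directly. This is a small sketch, assuming Python 3; utf8_bits is a hypothetical helper written for this illustration, not a library function.

```python
def utf8_bits(character):
    """Return the UTF-8 bytes of one character as 8-digit binary strings."""
    return [format(byte, "08b") for byte in character.encode("utf-8")]

# Characters that need 1, 2, 3 and 4 bytes respectively
samples = ["A", "\u03bc", "\u20ac", "\U0001F600"]
for character in samples:
    print("U+%04X" % ord(character), utf8_bits(character))

# U+0041 ['01000001']
# U+03BC ['11001110', '10111100']
# U+20AC ['11100010', '10000010', '10101100']
# U+1F600 ['11110000', '10011111', '10011000', '10000000']
```

A first byte that starts with 0 is a single byte character, while 110, 1110 and 11110 announce a sequence of 2, 3 and 4 bytes. Every continuation byte starts with 10.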

Converting a code point into a byte stream is not covered in this article. If you are interested, you may check the following article.

Unicode prefix characters

While dealing with Unicode in Python, you may encounter the following notations (U+, u', b', \x, \u, \U). They are used when writing Unicode strings and byte strings. Let us quickly clarify them...

  • U+ is followed by a hex number to denote a given Unicode code point. For example U+0041 is the Unicode code point for the letter A
  • u' prefix to denote a Unicode string in Python 2
  • b' prefix to denote a byte stream
  • \x followed by 2 hex digits (1 byte)
  • \u followed by 4 hex digits (a 16 bit value)
  • \U followed by 8 hex digits (a 32 bit value)
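
As a quick illustration of these escapes, here is a short sketch assuming Python 3, where plain string literals are Unicode. The variable names are made up for the example.

```python
# All three escapes below produce the same one-character string "A"
a_from_x = "\x41"
a_from_u = "\u0041"
a_from_big_u = "\U00000041"

# Code points above U+FFFF need the 8-digit form
grinning_face = "\U0001F600"

# In a bytes literal, \x denotes a raw byte value, not a code point
mu_bytes = b"\xce\xbc"
mu = mu_bytes.decode("utf-8")

print(a_from_x == a_from_u == a_from_big_u)  # True
print(len(grinning_face))                    # 1
print(mu == "\u03bc")                        # True
```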

Python script file encoding

In order to include non-ASCII characters in the source code of a Python 2 script, the first or second line of the script should declare the encoding that the source file uses (Python 3 assumes UTF-8 by default), for example…

# -*- coding: utf-8 -*-

# The line above is needed because 
# we are going to use a non ASCII 
# character in the Python source code file

Long introduction, right? But it is worth it. Let us now jump into Unicode in Python...

UTF-8 in Python 2.x

Python 2.x defines the following types...

  • str data (also called bytes data or ASCII data), which is immutable (cannot be modified)
  • bytearray data, which is like str but mutable
  • unicode data, which is a sequence of code points

Let us demonstrate that with a few examples.

Example 1

u = "Python"
# This should print: <type 'str'>
print(type(u))

v = b"Python"
# This should print: <type 'str'>
print(type(v))

w = u"Python"
# This should print: <type 'unicode'>
print(type(w))

x = unicode("Python")
# This should print: <type 'unicode'>
print(type(x))

y = bytearray("Python")
# This should print: <type 'bytearray'>
print(type(y))

z = bytearray(b"Python")
# This should print: <type 'bytearray'>
print(type(z))

Example 2

# -*- coding: utf-8 -*-

x = 'Some characters : \u03BC : \u0394'
# This should print: Some characters : \u03BC : \u0394
# because x is a string of type str which is bytes data
# in Python 2. \u03BC and \u0394 are not going to be
# interpreted as Unicode characters
print(x)
# This should print: <type 'str'>
print(type(x))

# Prefix the string with u
w = u"Some characters : \u03BC : \u0394"
# This should print: Some characters : μ : Δ
print(w)
# This should print: <type 'unicode'>
print(type(w))

y = b'Python'
# This should print: Python and the b prefix has no
# effect in Python 2
print(y)
# This should print: <type 'str'>
print(type(y))

z = bytearray(b'Python')
# This should print: Python
print(z)
# This should print: <type 'bytearray'>
print(type(z))
# This should print: Python2
print("Python" + b"2")

UTF-8 in Python 3.x

Similarly, Python 3 defines the following data types...

  • bytes: immutable bytes data
  • bytearray: mutable bytes data
  • str: Unicode data

As you can see, Python 3.x is stricter about the difference between bytes and Unicode. The developer is forced to be clear about the data type being used. Let us take a few examples...

# -*- coding: utf-8 -*-

x = 'Some characters : \u03BC : \u0394'
# This should print: Some characters : μ : Δ
# As you can see, in Python 3 strings are
# Unicode by default so \u03BC and \u0394
# got interpreted as Unicode characters
print(x)
# This should print: <class 'str'>
print(type(x))

y = b'Python'
# This should print: b'Python'
print(y)
# This should print: <class 'bytes'>
print(type(y))

z = bytearray(b'Python')
# This should print: bytearray(b'Python')
print(z)
# This should print: <class 'bytearray'>
print(type(z))

# This should raise a TypeError. The exact message
# depends on the Python 3 version, for example:
# can only concatenate str (not "bytes") to str
# This is not ok in Python 3 because
# we cannot concatenate strings and bytes.
# Recall that strings in Python 3 are Unicode
print("Python" + b"3")
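
The error above goes away once both operands have the same type. Here is a minimal sketch of the two options, assuming Python 3; the variable names are illustrative and ASCII is used only because the sample data is plain ASCII.

```python
text = "Python"
data = b"3"

# Option 1: decode the bytes and work with text
as_text = text + data.decode("ascii")

# Option 2: encode the text and work with bytes
as_bytes = text.encode("ascii") + data

print(as_text)   # Python3
print(as_bytes)  # b'Python3'
```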

Difference between bytes and bytearray in Python

There is no real difference between byte strings and byte arrays, except that byte strings are immutable and byte arrays are mutable. If that is the case, then why does Python have both? One explanation is that some applications perform poorly with immutable strings. Take IO operations that involve buffering: for each addition to an immutable buffer, new memory has to be allocated and the existing content has to be copied for the concatenation. This is a slow process, which is why a mutable byte array comes to the rescue.
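
The difference in mutability can be seen in a few lines. This is a small sketch, assuming Python 3, with made-up sample data.

```python
immutable = b"Python"
try:
    immutable[0] = ord("J")  # bytes objects do not support item assignment
    changed = True
except TypeError:
    changed = False

mutable = bytearray(b"Python")
mutable[0] = ord("J")        # modify one byte in place
mutable.extend(b" rocks")    # grow the same object instead of building a new one

print(changed)  # False
print(mutable)  # bytearray(b'Jython rocks')
```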

Encoding and decoding in Python

Encoding in Python is converting code points to a byte stream. We do that using the encode() method of a Unicode string. Decoding, on the other hand, is the opposite process, which is converting the byte stream back to code points. We do that using the decode() method of a byte string. Take a look at the following example (Python 2.x)…

# Define a Unicode string
x = u'A\u03BC\u0394'
print(x)
print(type(x))
print(len(x))

# Encode to byte stream
y = x.encode('utf-8')
print(y)
print(type(y))
print(len(y))

# Decode byte stream back to Unicode
z = y.decode('utf-8')
print(z)
print(type(z))
print(len(z))

If you run the code snippet above, you should get something like...

AμΔ
<type 'unicode'>
3
AμΔ
<type 'str'>
5
AμΔ
<type 'unicode'>
3

Note that the length of the original Unicode string (x) is 3 because it has 3 characters, whereas the length of the output byte stream (y) is 5 because it has 5 bytes. It is very important to keep in mind what data type we are dealing with, otherwise it is easy to lose track. If we run the same code in Python 3.x, we get something like…

AμΔ
<class 'str'>
3
b'A\xce\xbc\xce\x94'
<class 'bytes'>
5
AμΔ
<class 'str'>
3

What is the difference? Instead of type unicode we have str, because strings in Python 3.x are Unicode by default. What else? The encoded output (byte stream) is printed in raw format, which makes more sense than the Python 2.x output. If we count the \x escape sequences in the above output, there are 4 of them (one per byte), plus the one byte letter A. That is why the length of the encoded output is 5 and not 3.
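
The byte count can also be broken down per character. The following sketch assumes Python 3 and reuses the same three-character string, written with escapes.

```python
text = "A\u03bc\u0394"

# Number of UTF-8 bytes needed by each character
bytes_per_character = [len(character.encode("utf-8")) for character in text]

print(len(text))                  # 3
print(bytes_per_character)        # [1, 2, 2]
print(sum(bytes_per_character))   # 5
print(len(text.encode("utf-8")))  # 5
```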

Difference between unicode() and encode() in Python

The built-in function unicode() does the opposite of the encode() method. It is another way to decode a byte stream back to Unicode code points. Here is an example in Python 2.x:

# Define a Unicode string x
x = u'Python is cool'
print(type(x))

# Encode x to str
y = x.encode('utf-8')
print(type(y))

# Decode y back to Unicode
z = y.decode('utf-8')
print(type(z))

# We can also decode y as follows
w = unicode(y, 'utf-8')
print(type(w))

# This should print: True
print (z == x)
# This should print: True
print (w == x)

In Python 3.x we can do the same as follows...

# Define a Unicode string x
x = 'Python is cool'
print(type(x))

# Encode x to bytes
y = x.encode('utf-8')
print(type(y))

# Decode y back to Unicode
z = y.decode('utf-8')
print(type(z))

# We can also decode y as follows
w = str(y, 'utf-8')
print(type(w))

# This should print: True
print (z == x)
# This should print: True
print (w == x)
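
One pitfall is worth noting when using str() for decoding in Python 3: the encoding argument matters. The sketch below, with illustrative variable names, shows what happens when it is left out.

```python
data = "Python is cool".encode("utf-8")

decoded = str(data, "utf-8")  # decodes the bytes into text
not_decoded = str(data)       # no decoding, just the printable form of the bytes

print(decoded)      # Python is cool
print(not_decoded)  # b'Python is cool'
```

Because the second form silently produces a string that starts with b', calling decode() or always passing the encoding is the safer habit.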

Implicit conversion in Python 2.x

Operations that involve both bytes and Unicode data can have undesired effects due to implicit conversion. In this case, Python converts the bytes data into Unicode by decoding the byte stream using the default encoding scheme (ASCII).

In the example below, we are concatenating a Python 2.x str string with a Unicode string. The str string contains only ASCII characters, so the concatenation works fine.

# -*- coding: utf-8 -*-
x = "Python " + u"is cool"
# This should print: Python is cool
print(x)

Now, we are going to use a non-ASCII character in the str string. In the example below, an exception is going to be raised by the concatenation.

# -*- coding: utf-8 -*-
x = "μ" + u"is cool"
# This is going to raise an exception:
# UnicodeDecodeError: 
# 'ascii' codec can't decode byte 0xce in position 0: 
# ordinal not in range(128)
print(x)

String comparison can also lead to implicit conversion. Let us take an example…

# -*- coding: utf-8 -*-
x = "Python2" == u"Python3"
# This should print: False
print(x)

Now let us try a non-ASCII character in the comparison…

# -*- coding: utf-8 -*-
x = "μ" == u"μ"
# This will print a False and a warning is generated:
# UnicodeWarning: Unicode equal comparison failed 
# to convert both arguments to Unicode - interpreting them as being unequal
print(x)

In Python 2.x, a Unicode string can be encoded to a byte stream using the encode() method. What happens if we call this method on a str string? The Python interpreter first tries to convert the str string to Unicode (by decoding it as ASCII) and then calls the method. This implicit conversion can cause errors. Let us see how…

# -*- coding: utf-8 -*-
x = "Python"
y = x.encode('utf-8')
# This should print: Python
print(y)

# -*- coding: utf-8 -*-
x = "μ"
y = x.encode('utf-8')
# This is going to generate an error:
# UnicodeDecodeError: 
# 'ascii' codec can't decode byte 0xce in position 0: 
# ordinal not in range(128)
print(y)
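
A common way to avoid these surprises is to decode bytes explicitly as soon as they enter the program and to encode only when bytes are needed again. The following sketch is written so that it should run unchanged on Python 2.7 and Python 3.3+; the sample data and variable names are made up for illustration.

```python
# -*- coding: utf-8 -*-
raw = b"\xce\xbc"  # the UTF-8 bytes of the Greek letter mu

# Decode explicitly instead of relying on the implicit ASCII conversion
text = raw.decode("utf-8") + u" is cool"

# Encode only when bytes are needed again
encoded = text.encode("utf-8")

print(text == u"\u03bc is cool")       # True
print(encoded == b"\xce\xbc is cool")  # True
```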

Reading and writing Unicode files in Python

Reading and writing from/to files becomes a challenge when dealing with Unicode if we do not pay attention to the type of data or the Python version. Let us take some examples to clarify that...

Python 2.x reading and writing file

The following code snippet is going to fail because the Unicode string (y) cannot be encoded using ASCII (the default encoding).

# -*- coding: utf-8 -*-

# Define str string
x = " Python is cool "
# Define Unicode string
y = u"μ"

# open file for writing
f = open('file.dat', 'w')

# This statement is going to fail with 
# the following error message:
# UnicodeEncodeError: 
# 'ascii' codec can't encode character u'\u03bc' in position 0: 
# ordinal not in range(128)
f.write(y)
f.write(x)
f.write(y)

# Close file
f.close()

Here is how to solve that…

# -*- coding: utf-8 -*-

# Define str string
x = " Python is cool "
# Define Unicode string
y = u"μ"

# open file for writing
f = open('file.dat', 'w')

# Write y x y to the file
f.write(y.encode('utf-8'))
f.write(x)
f.write(y.encode('utf-8'))

# Close file
f.close()

To read the file that we just created earlier…

# Open the file for reading
f = open('file.dat', 'r')
# Read the entire file to a string
z = f.read()
# Close file
f.close()
# This should print the content of the file
print(z)
# The type of the created string is str and NOT unicode
print(type(z))

Let us try to do the same thing in Python 3.

Python 3.x reading and writing file

# -*- coding: utf-8 -*-

# Define a Unicode string
x = u"μ Python is cool μ"
# Open the file for writing. The encoding is passed explicitly
# because the default depends on the platform
f = open('file.dat', 'w', encoding='utf-8')
# Write the string to file
f.write(x)
f.close()

# Open the file for reading
f = open('file.dat', 'r', encoding='utf-8')
# Read the file to a string
z = f.read()
# This should print: μ Python is cool μ
print(z)
f.close()

# Open the file for reading in binary mode
f = open('file.dat', 'rb')
# Read the file to a bytes object
z = f.read()
# This should print: b'\xce\xbc Python is cool \xce\xbc'
print(z)
f.close()
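
If the same script has to run on both Python 2.7 and Python 3.x, one option is io.open(), which accepts an encoding argument in both versions. The snippet below is a sketch under that assumption; the temporary directory is used only to keep the example self-contained.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile

text = u"\u03bc Python is cool \u03bc"

# A temporary path, so that the example does not touch existing files
path = os.path.join(tempfile.mkdtemp(), "file.dat")

# Text mode with an explicit encoding: write and read Unicode strings
with io.open(path, "w", encoding="utf-8") as f:
    f.write(text)

with io.open(path, "r", encoding="utf-8") as f:
    read_back = f.read()

# Binary mode: the raw UTF-8 bytes stored on disk
with io.open(path, "rb") as f:
    raw = f.read()

print(read_back == text)                           # True
print(raw == b"\xce\xbc Python is cool \xce\xbc")  # True
```

The with statement closes the file automatically, even if an exception is raised while reading or writing.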

That is all for today, now it is time to summarize the key points...

Summary

  • Unicode in Python can be confusing because it is handled differently in Python 2.x vs 3.x
  • To understand Unicode, the underlying concepts need to be crystal clear
  • Unicode is foundational when it comes to software globalization
  • Encoding in Unicode is converting character code points to byte streams for efficient storage, processing and transmission, while decoding is the opposite process
  • UTF-8 is one of the most popular Unicode encoding schemes. It is backward compatible with ASCII, efficient for storage and can represent every Unicode character. It is the basis for modern software internationalization
  • Python 2.x defines immutable strings of type str for bytes data. It also defines the bytearray data type that can be modified for better performance in some cases
  • Python 3.x strings are Unicode by default. It also defines the bytes and bytearray types
  • To encode a Unicode string to a byte stream use the encode() method. The decode() method does the opposite operation
  • In Python 2.x one has to be careful with string operations such as concatenation and comparison due to implicit conversion. In Python 3.x str and bytes are never converted implicitly
  • Reading and writing Unicode data from and to disk files requires due attention
