Python compare strings

Introduction

It is hard, if not impossible, to find a real-world application that does not compare strings. String comparison shows up in database lookups, file searches, contact sorting and many other scenarios. It is not hard in Python, as today's code snippets demonstrate. Let us get started...

Python string identity vs string equality

Like any other data, a string occupies memory, so we need to distinguish its content from its identity. Every Python object has a value and an identity, which the built-in id() returns as an integer (in CPython, this is the object's memory address). We can compare strings by value or by identity. In other words, two strings can be equal without being the same object, meaning they are stored in two different memory locations. Let us clarify that...

# Define 3 string variables
x = "hello"
y = "hello"
z = "".join(['h','e','l','l','o'])

# Get id of each string
xid = id(x)
yid = id(y)
zid = id(z)

# Print string id
print("x id   : {}".format(xid))
print("y id   : {}".format(yid))
print("z id   : {}".format(zid))

# Check for equality
print("x == y : {}".format(x == y))
print("x == z : {}".format(x == z))

# Check if they are identical
print("x is y : {}".format(x is y))
print("x is z : {}".format(x is z))

If you run the code snippet above, you should get something like...

x id   : 4433136640
y id   : 4433136640
z id   : 4433137120
x == y : True
x == z : True
x is y : True
x is z : False

As you can see, x and y have the same id, which means both names refer to the same object in memory. They are equal in value and they are identical. This happens because CPython reuses a single object for identical string literals (an optimization called interning); it is an implementation detail, not a guarantee. On the other hand, x and z are equal but not identical, because join() builds a new string object at run time. In short...

  • Use (is) for identity testing
  • Use == for equality testing
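As a hedged illustration of the two bullets above: whether two equal strings share an id depends on interpreter optimizations, so only == is dependable when comparing text. The sketch below uses the standard sys.intern function to force two equal strings onto one shared object. The variable names are illustrative only.

```python
import sys

# Two equal strings built at run time
a = "".join(["h", "e", "l", "l", "o"])
b = "".join(["h", "e", "l", "l", "o"])

# Equality compares contents, so it does not care where the objects live
print(a == b)            # True

# sys.intern returns one canonical object per distinct string value
ia = sys.intern(a)
ib = sys.intern(b)
print(ia is ib)          # True
```

Interning is an optimization for special cases such as very large numbers of repeated keys. For everyday comparisons, == is the operator to reach for.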

Python string comparison operators

Python provides various operators to perform string comparison. Let us review some of those...

  • Equal ==
  • Not equal != (Python 2 also accepts <>, which was removed in Python 3)
  • Greater than >
  • Less than <
  • Greater than or equal >=
  • Less than or equal <=

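But wait a minute, how can we compare strings as if they were numbers? The answer is that Python compares strings lexicographically: character by character, using each character's Unicode code point (which equals the ASCII value for ASCII characters). If one string is a prefix of the other, the shorter one is considered smaller. Take a look at the following examples...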

# False
print("Python2" == "Python3")
# True
print("Red" != "Green")
# True in Python 2 only; <> is a SyntaxError in Python 3,
# so use != instead
print("Sun" != "Moon")
# True
print("Cat" < "Dog")
# False
print("Code" > "Coder")
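One practical consequence of code point ordering, shown here as a small sketch: uppercase ASCII letters have lower code points than lowercase ones, so a plain sort puts "Banana" before "apple". The sample names are made up for illustration, and passing key=str.lower is one common way (not the only one) to get a case-insensitive order.

```python
names = ["apple", "Banana", "cherry"]

# Default ordering compares code points, so "B" (66) sorts before "a" (97)
default_order = sorted(names)
print(default_order)     # ['Banana', 'apple', 'cherry']

# Sorting on lowercased copies gives a case-insensitive ordering
ignore_case = sorted(names, key=str.lower)
print(ignore_case)       # ['apple', 'Banana', 'cherry']
```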

In Python, to get the code point of a character (its ASCII code for ASCII characters), use the function ord(character). To go the other way, chr(code) returns the character for a given code. Here is an example…

# This should print: A
print(chr(65))
# This should print: 65
print(ord('A'))
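To make the lexicographic rule concrete, here is a minimal sketch that reimplements string ordering by hand with ord(). The function name compare_by_code_points is hypothetical and exists only for illustration; in real code the built-in operators already do this.

```python
def compare_by_code_points(a, b):
    """Return -1, 0 or 1 depending on how a orders relative to b."""
    # Walk both strings in parallel and stop at the first differing character
    for ch_a, ch_b in zip(a, b):
        if ord(ch_a) != ord(ch_b):
            return -1 if ord(ch_a) < ord(ch_b) else 1
    # Every shared position matched, so the shorter string sorts first
    return (len(a) > len(b)) - (len(a) < len(b))


print(compare_by_code_points("Cat", "Dog"))        # -1, "C" (67) < "D" (68)
print(compare_by_code_points("Code", "Coder"))     # -1, "Code" is a prefix
print(compare_by_code_points("Python", "Python"))  # 0
```

This explains the earlier results: "Cat" < "Dog" is decided by the first character, and "Code" > "Coder" is False because a prefix always sorts before the longer string.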

Modifying a string

Strings in Python are immutable, so "modifying" a string really creates a new string object, and the variable ends up with a different id. Take a look at the following example…

# Define a string
string = "Python"
# Print ID
print("xid = {}".format(hex(id(string))))
# Modify string
string = string + "3"
# Print ID
print("xid = {}".format(hex(id(string))))

If you run the code snippet above, you should get two different string ids.
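The same behavior can be seen from another angle. The sketch below (variable names are illustrative) keeps a second name pointing at the original string, shows that concatenation leaves that object untouched, and confirms that changing a character in place is rejected.

```python
original = "Python"
alias = original             # a second name bound to the same object

original = original + "3"    # builds a new string and rebinds 'original'
print(original)              # Python3
print(alias)                 # Python (the first object is untouched)

# In-place modification is not allowed on str objects
try:
    alias[0] = "J"
    mutated = True
except TypeError:
    mutated = False
print(mutated)               # False
```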

Let us now talk about unicode strings...

Strings are cool until we deal with unicode strings; at that point we need to be careful. Please note that Unicode is beyond the scope of this article, and I advise the reader to refer to the Python reference for more details. For the sake of this article, we are going to include the bare minimum to get you started...

Python unicode string comparison

Let us define a few terms...

  • Encoding is converting a sequence of characters into a sequence of bytes for efficient processing, storage and transmission. Decoding is the opposite process: converting encoded data back to the original characters
  • There are various ways to convert text into a byte stream. Each particular way is called an encoding scheme (e.g. ASCII, UTF-8)
  • Unicode defines a unique integer (called a code point) for every character regardless of platform, device, application or language. Unicode has its own encoding schemes. The most popular one is UTF-8
  • There is a big difference in handling unicode strings between Python 2.x and 3.x. A string in Python 2.x can be of type str (a sequence of bytes) or unicode (a sequence of code points). Here is an example...
# Strings in Python 2.x are of type: str
# In other words, a stream of bytes or
# ASCII codes
x = 'text'
# This should print: <type 'str'>
print(type(x))

# In Python 2.x, you can define
# a unicode string as follows
y = unicode('text')
# This should print: <type 'unicode'>
print(type(y))

# You can also define a unicode string
# by prefixing the string with u
z = u'text'
# This should print: <type 'unicode'>
print(type(z))

# Encoding a unicode string with some
# encoding scheme (UTF-8 in this case) converts
# its code points into a stream of bytes (i.e. str)
w = z.encode('UTF-8')
# This should print: <type 'str'>
print(type(w))

If that is the case, then how can we compare strings in Python 2.x? Take a look at the following example...

# -*- coding: utf-8 -*-

# The line above is needed because 
# we are going to use a non ASCII 
# character in the Python source code file

# Define a unicode string
x = u"Python"
# Define a byte string (type str)
y = "Python"

# Comparing apples and oranges (unicode vs str)
# This should print True because Python 2 implicitly
# decodes y as ASCII, but that is still not
# the right way to do the comparison
print(x == y)
# This is the right way to do the comparison
# because we are comparing bytes with bytes
# This should print True
print(x.encode('UTF-8') == y)

# Now x and y contain a non ASCII character
x = u"Pythonę"
y = "Pythonę"
# This prints False and generates the following warning:
# Unicode equal comparison failed to convert both
# arguments to Unicode - interpreting them as being unequal
print(x == y)
# This should print True (the source file is saved as UTF-8)
print(x.encode('UTF-8') == y)

In Python 2.x, try not to compare strings of type str with strings of type unicode. Make sure you are comparing strings of the same type.

  • On the other hand, a string in Python 3.x is unicode by default. Let us take an example...

# -*- coding: utf-8 -*-

# The line above is optional in Python 3, which
# already assumes UTF-8 source files, but it is
# harmless to keep

# String literals in Python 3.x are unicode by default
x = "Pythonę"
y = "Pythonę"

# This should print: True because x is the same as y
print(x == y)
# This should print: False because encode() returns
# a bytes object, and bytes never compare equal to str
print(x.encode('UTF-8') == y)

# Let us see why the above comparison statement
# returns False. Just print the compared values

# This should print: b'Python\xc4\x99'
print(x.encode('UTF-8'))
# This should print: Pythonę
print(y)
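If you do end up with a bytes object on one side and a str on the other in Python 3, bring both to the same type before comparing. The following is a minimal sketch, assuming the bytes are known to be UTF-8 encoded; the variable names are illustrative.

```python
text = "Pythonę"
raw = text.encode("utf-8")          # bytes: b'Python\xc4\x99'

# Mixed types: a bytes object is never equal to a str
mixed = (raw == text)
print(mixed)                        # False

# Option 1: decode the bytes, then compare str with str
as_text = (raw.decode("utf-8") == text)
print(as_text)                      # True

# Option 2: encode the str, then compare bytes with bytes
as_bytes = (raw == text.encode("utf-8"))
print(as_bytes)                     # True

# The two forms differ in length: 7 characters vs 8 bytes
print(len(text), len(raw))          # 7 8
```

Decoding with the wrong scheme can raise UnicodeDecodeError or produce different characters, so the encoding has to be known rather than guessed.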

Python multiline string comparison

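Multiline strings compare exactly like regular strings, so compare them directly with ==. Every character counts, including trailing spaces and newline characters. If the comparison unexpectedly fails, split both strings on new lines and compare them line by line to find where they differ. Here is an example of a direct comparison...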

x = """Python is 
cool"""

y = """Python is 
cool"""

# This should print: True
print(x == y)
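When a direct comparison returns False and the reason is not obvious, a line by line check can locate the difference. Here is one possible sketch. The helper name first_difference is hypothetical, and the sample strings differ only by a trailing space on the first line.

```python
def first_difference(a, b):
    """Return the 1-based number of the first differing line, or None."""
    lines_a = a.splitlines()
    lines_b = b.splitlines()
    # Compare the lines both strings have in common
    for number, (line_a, line_b) in enumerate(zip(lines_a, lines_b), start=1):
        if line_a != line_b:
            return number
    # One string has extra lines after the shared part
    if len(lines_a) != len(lines_b):
        return min(len(lines_a), len(lines_b)) + 1
    return None


p = "Python is\ncool"
q = "Python is \ncool"            # note the trailing space on line 1

print(p == q)                     # False
print(first_difference(p, q))     # 1
print(first_difference(p, p))     # None
```

For richer reports, the standard library's difflib module can produce full diffs, but a short loop like this is often enough to spot stray whitespace.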

Time to summarize...

Summary

  • String comparison is a commonly used Python language feature. It can be used in database lookup, searching for files on disk, sorting contacts, etc.
  • Strings can be equal in value but stored in different memory locations. To test for equality use ==, and to test whether they are the same object use the (is) operator
  • Be extra careful when comparing unicode strings. Python 2.x and Python 3.x are different in handling unicode. For example, in Python 2.x it is not a good practice to compare a string of type str with a string of type unicode

That is it for today, thanks for visiting. Please leave a comment if you have a question
