Difference between re.search and re.match in Python
Table of contents
- Introduction
- What is a regular expression?
- What is Python raw string?
- Python RE module
- Python regular expressions syntax summary
- Regular expressions flags
- re.match
- re.search
- Why use re.match?
- Regular expression compilation
- re.match and re.search examples
- re.fullmatch
- Match objects
- re.findall
- re.finditer
- re.split
- re.sub
- Summary
- References
Introduction
Welcome to a new Python code snippets post. Today, we are going to talk about the regular expressions module (i.e. the RE module). The goal of this post is to clarify the main difference between the re.search and re.match operations. We will also briefly discuss other operations provided by the RE module. Let us get started…
What is a regular expression?
A regular expression is a text string that describes a search pattern. You are probably familiar with wildcards when listing or searching for files on Unix or Windows. A regular expression is a similar concept, but it is more powerful than wildcards. Regular expressions can be used for searching (e.g. extracting information from log files) and for validation. For example, one of the most popular scenarios is email validation on a web form. Before we dive into the topic, we will explain what a Python raw string is, as we are going to use raw strings in our examples.
What is Python raw string?
When dealing with regular expression patterns in Python, you may encounter patterns that start with the letter (r) as in the following example:
    Pat = r'\d$'
The pattern above matches a digit at the end of a string, but what does the letter (r) denote? Both Python string literals and regular expressions use the backslash (\) to escape special (i.e. meta) characters. The (r) prefix marks a raw string in Python. Raw strings do not apply any special treatment to backslashes. This Python feature is convenient for regular expressions because it makes patterns easier to read and less error prone. For example, instead of writing the pattern as '\\d' we can write r'\d', exactly as it appears in regular expression syntax. Very handy! Let us proceed…
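To make the raw string behavior concrete, here is a minimal sketch. The variable names are only for illustration and do not come from the rest of this post.

```python
# A raw string keeps backslashes literally; a normal string needs them doubled.
raw_pattern = r'\d$'
escaped_pattern = '\\d$'

# Both spellings produce the same three characters: backslash, d, dollar sign.
print(raw_pattern == escaped_pattern)  # True
print(len(raw_pattern))                # 3

# In a normal string '\n' is a single newline character. In a raw string
# it is two characters: a backslash followed by the letter n.
print(len('\n'), len(r'\n'))           # 1 2
```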
Python RE module
The Python RE module offers primitive regular expression operations for text matching and searching, and more besides, as we will see later. re.match and re.search are among the most important ones. re.match checks for a match only at the beginning of a string, while re.search checks for a match anywhere in the string.
Python regular expressions syntax summary
Regular expression patterns use special characters (metacharacters) to indicate special meaning. If a special character needs to be matched literally, we escape it with a backslash. Here is a short list of special characters and sequences. You can check the reference section for more details…
- \d a digit
- \D a non-digit
- \s a whitespace character (space, tab, newline)
- \S a non-whitespace character
- \w a word character (letter, digit or underscore)
- \W anything but a word character
- . any character except a newline
- \b a word boundary
- + 1 or more
- ? 0 or 1
- * 0 or more
- $ end of string
- ^ beginning of string
- | either or
- [] a set or range of characters, for example [a-z]
- {x} exactly x repetitions of the preceding element
- \n new line
- \t tab
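A few of the entries above are easier to see in action. The following is a small sketch on a made-up sample string. Note in particular that \b marks a word boundary rather than consuming a character.

```python
import re

sample = "cat concat 42_a"

# \b is a word boundary, so only the standalone word "cat" is found.
whole_words = re.findall(r'\bcat\b', sample)
print(whole_words)   # ['cat']

# Without \b the pattern also matches inside "concat".
anywhere = re.findall(r'cat', sample)
print(anywhere)      # ['cat', 'cat']

# \w+ matches runs of letters, digits and underscores.
words = re.findall(r'\w+', sample)
print(words)         # ['cat', 'concat', '42_a']

# \d{2} matches exactly two digits.
two_digits = re.findall(r'\d{2}', sample)
print(two_digits)    # ['42']
```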
Regular expressions flags
When searching or matching, regular expression operations can take optional flags or modifiers. The following two modifiers are the most commonly used ones…
- re.M (re.MULTILINE) multiline: ^ and $ also match at the beginning and end of each line
- re.I (re.IGNORECASE) ignore case
For the full list you may check the reference section. Let us now begin with the first regular expression operation, re.match…
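As a quick illustration of the two flags, the sketch below runs the same pattern with and without them. The sample text is invented for this example.

```python
import re

text = "first line\nSecond line"

# Without flags the pattern is case sensitive and ^ only matches
# at the very beginning of the string.
no_flags = re.search(r'^second', text)
print(no_flags)             # None

# re.I ignores case and re.M lets ^ match at the start of every line.
with_flags = re.search(r'^second', text, re.M | re.I)
print(with_flags.group())   # Second
```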
re.match
re.match matches an expression at the beginning of a string. If a match is found, a match object is returned; otherwise None is returned. A multiline input string (for example, a triple-quoted string that spans several lines) does not change this behavior, even with the re.M flag: re.match always tries to match at the beginning of the string, never at the beginning of a later line. In regular expression syntax, the special character (^) matches the beginning of a string, so adding it to a pattern used with re.match is redundant. The syntax for the re.match operation is as follows…
    mobj = re.match(pat, str, flags)
where
- pat: regular expression pattern to match
- str: string in which to search for the pattern
- flags: one or more modifiers, for example re.M|re.I
re.search
re.search attempts to find the first occurrence of the pattern anywhere in the input string, rather than only at the beginning. If the search is successful, re.search returns a match object; otherwise it returns None. The syntax for the re.search operation is as follows…
    mobj = re.search(pat, str, flags)
where
- pat: regular expression pattern to search for
- str: string in which to search for the pattern
- flags: one or more modifiers, for example re.M|re.I
Why use re.match?
Now we know the difference between re.match and re.search, but the question is: why do we need re.match if we can achieve the same result using re.search with an anchored pattern? There is no definitive answer to this question. However, re.match is provided as a convenience and it states the intention of the match operation explicitly.
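The relationship between the two operations can be sketched as follows. The sample text is made up, and \A (start of string) is used for the anchored search because, unlike ^, it is not affected by the multiline flag.

```python
import re

text = "say hello"

# re.match only looks at the beginning of the string.
print(re.match(r'hello', text))            # None

# re.search scans the whole string and finds "hello" at index 4.
print(re.search(r'hello', text).start())   # 4

# Anchoring the search with \A restricts it to the start of the
# string, which gives the same results as re.match here.
print(re.search(r'\Ahello', text))         # None
print(re.search(r'\Asay', text).group())   # say
print(re.match(r'say', text).group())      # say
```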
Regular expression compilation
Regular expression compilation produces a Python object that can be used to do all sorts of regular expression operations. What is the benefit of that, as long as we can use re.match and re.search directly? This technique is convenient when we want to use a regular expression more than once: the pattern is written and compiled in one place, which can make the code more efficient and more readable. The syntax is as follows…
    obj = re.compile(pattern, flags)
    # Now you can call
    mat = obj.match(string)
    # The above call is equivalent to
    re.match(pattern, string, flags)
Here is an example…
    # Import RE module
    import re

    # Compile regular expression
    obj = re.compile('hi', re.M)

    # Use the object to match
    mat = obj.match('hi how are you?')
    print(mat)

    # Do the match directly using re.match
    mat = re.match('hi', 'hi how are you?', re.M)
    print(mat)
If you run the code snippet above, you should get a match in both cases. Now, let us jump into more match and search examples…
re.match and re.search examples
    # Import regular expressions module
    import re

    # Example multi line string
    lines = """Hello
    HelloWorld"""

    # This will match Hello at the beginning of the string
    # The print statement will print something like
    # <re.Match object; span=(0, 5), match='Hello'>
    print(re.match('Hello', lines))

    # Even though the second line starts with HelloWorld
    # the following match statement will not return
    # a match because the string lines does not start with
    # HelloWorld
    print(re.match('HelloWorld', lines))

    # The following two statements will not match HelloWorld
    # at the beginning of the second line even if we use the
    # multiline modifier. Using ^ is not going to make it
    # any different
    print(re.match('HelloWorld', lines, re.MULTILINE))
    print(re.match('^HelloWorld', lines, re.MULTILINE))

    # This will find HelloWorld in lines as re.search
    # scans the entire string
    print(re.search('HelloWorld', lines))

    # This will also match HelloWorld at the beginning
    # of the second line. Note that if we remove the
    # multiline modifier it will not find anything
    print(re.search('^HelloWorld', lines, re.MULTILINE))

    # Compile a regular expression pattern (ending with ello)
    # using multiline modifier
    m = re.compile('ello$', re.MULTILINE)

    # This will not return a match. The first line ends
    # with ello but it does not start with it
    print(m.match(lines))

    # The match method of a compiled pattern can take an
    # optional pos parameter indicating where matching starts
    # The following match statement should return
    # a match object because if we start at position 1
    # instead of the default of 0 then the first line
    # does start with and end with ello
    print(m.match(lines, pos=1))

    # This will not return a match even though the
    # first line ends with ello. You need to use
    # the multiline modifier in order to get a match
    print(re.search('ello$', lines))

    # This will return a match because the first
    # line ends with ello and we are using the
    # multiline modifier
    print(re.search('ello$', lines, re.MULTILINE))

    # Recall that m was compiled with multiline
    # This is the same as the previous example
    # so it will return a match
    print(m.search(lines))

    # This will not return a match. The second argument
    # of a compiled pattern's search method is the start
    # position (pos), not flags. re.MULTILINE has the
    # integer value 8, so the search starts at index 8,
    # after the only place where ello$ matches
    print(m.search(lines, re.MULTILINE))
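The last call in the example above returns None because the methods of a compiled pattern take a start position, not flags, as their second argument. The flags are fixed when the pattern is compiled. The short sketch below shows that passing re.MULTILINE there is the same as passing the integer 8.

```python
import re

lines = "Hello\nHelloWorld"
m = re.compile(r'ello$', re.MULTILINE)

# The flag is an integer constant with the value 8.
flag_as_int = int(re.MULTILINE)
print(flag_as_int)                     # 8

# Both calls start scanning at index 8 ("lloWorld"), so neither
# can find "ello" at the end of the first line.
print(m.search(lines, re.MULTILINE))   # None
print(m.search(lines, 8))              # None

# Starting from the default position finds the match on the first line.
print(m.search(lines).span())          # (1, 5)
```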
re.fullmatch
The re.fullmatch function was added in Python 3.4 to match the entire string. If the pattern matches the whole input string, a match object is returned; otherwise, None is returned. We could implement re.fullmatch in terms of re.match, but Python provides this function for convenience and to make the goal of the match explicit. It is useful for validating user input. Let us take an example…
    # Regular expression module
    import re

    # Example strings
    text1 = "Python is awesome"
    text2 = "Python is awesome like crazy"

    # Full matching
    mat = re.fullmatch("Python is awesome", text1)
    print(mat)
    mat = re.fullmatch("Python is awesome", text2)
    print(mat)

    # Regular matching
    mat = re.search("^Python is awesome$", text1)
    print(mat)
    mat = re.search("^Python is awesome$", text2)
    print(mat)
If you run the code snippet above on Python 3.7 or later, the output should look like the following (older Python 3 versions print _sre.SRE_Match instead of re.Match)…
    <re.Match object; span=(0, 17), match='Python is awesome'>
    None
    <re.Match object; span=(0, 17), match='Python is awesome'>
    None
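The remark that re.fullmatch can be expressed in terms of re.match can be sketched roughly as follows. The helper name fullmatch_sketch is hypothetical and not part of the re module, and the approach assumes a plain pattern (it is an approximation, not how Python implements re.fullmatch). The second half shows one way the anchored re.search differs from re.fullmatch: $ also matches just before a trailing newline.

```python
import re

def fullmatch_sketch(pattern, string, flags=0):
    """Approximate re.fullmatch using re.match (hypothetical helper)."""
    # (?:...) groups the pattern without capturing it and
    # \Z only matches at the very end of the string.
    return re.match(r'(?:' + pattern + r')\Z', string, flags)

print(fullmatch_sketch("Python is awesome", "Python is awesome") is not None)  # True
print(fullmatch_sketch("Python is awesome", "Python is awesome like crazy"))   # None

# $ also matches just before a trailing newline, so the anchored search
# succeeds here while re.fullmatch does not.
with_newline = "Python is awesome\n"
print(re.search("^Python is awesome$", with_newline) is not None)  # True
print(re.fullmatch("Python is awesome", with_newline))             # None
```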
Match objects
If a match is found when using re.match or re.search, we can use some useful methods provided by the match object. Here is a short list of such methods. You may check the reference section for more details…
- group() returns the part of the string matched by the entire regular expression
- group(1) returns the text matched by the first capturing group
- start() and end() return the indices of the start and end of the matched substring (the whole match by default, or a given capturing group)
Here is an example…
    # Import regular expressions module
    import re

    # Example string
    string = "Python is a cool language"

    # In the pattern below, we are trying to match two text
    # groups. One to the left of "is a" and the second to
    # the right. \w+ means one or more word characters. re.M|re.I
    # means multiline ignoring case. In this example, the
    # first group match to the left is going to be: Python
    # and the second group match to the right is going to be:
    # cool. Note that the word: language is not going to be
    # included in the second group because a space separates
    # it from the word: cool. Recall we are using \w
    # which matches only word characters
    mat = re.match(r'(\w+) is a (\w+)', string, re.M|re.I)

    # Print if there is a match
    if mat:
        # This will print: Python is a cool
        print("matchObj.group() : ", mat.group())
        # This will print: Python
        print("matchObj.group(1) : ", mat.group(1))
        # This will print: cool
        print("matchObj.group(2) : ", mat.group(2))
    else:
        print("No match was found")
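The example above exercises group() only. As a small complement, the sketch below reuses the same pattern and string to show start(), end() and two related match object methods, groups() and span().

```python
import re

mat = re.match(r'(\w+) is a (\w+)', "Python is a cool language")

# groups() returns all captured groups as a tuple.
print(mat.groups())             # ('Python', 'cool')

# start() and end() without arguments describe the whole match.
print(mat.start(), mat.end())   # 0 16

# span(n) returns (start, end) for capturing group n.
print(mat.span(2))              # (12, 16)
```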
re.findall
re.findall returns a list of all non-overlapping matches in a string. The syntax is as follows…
    matches = re.findall(pat, str, flags)
where
- pat: regular expression pattern to search for
- str: string in which to search for the pattern
- flags: one or more modifiers, for example re.M|re.I
Here is an example…
    # Import regular expression module
    import re

    # Example multiline string
    text = """This is
    short multi line
    string
    """

    # Finds all words in the string
    # ['This', 'is', 'short', 'multi', 'line', 'string']
    print(re.findall(r'(\w+)', text, re.MULTILINE))

    # Finds the words at the beginning of each line
    # ['This', 'short', 'string']
    print(re.findall(r'^(\w+)', text, re.MULTILINE))

    # Finds the words at the end of each line
    # ['is', 'line', 'string']
    print(re.findall(r'(\w+)$', text, re.MULTILINE))

    # Finds single word lines
    # ['string']
    print(re.findall(r'^(\w+)$', text, re.MULTILINE))
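One detail worth knowing about re.findall is that the shape of the result depends on the capturing groups in the pattern. The sketch below uses an invented key=value string to show the difference.

```python
import re

settings = "width=80 height=24"

# With no capturing groups, findall returns the full matches.
whole = re.findall(r'\w+=\d+', settings)
print(whole)   # ['width=80', 'height=24']

# With two or more capturing groups, findall returns one tuple per match.
pairs = re.findall(r'(\w+)=(\d+)', settings)
print(pairs)   # [('width', '80'), ('height', '24')]
```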
re.finditer
Instead of returning a complete list of all matches (as re.findall does), re.finditer returns an iterator that allows us to go through the match objects one by one. The string is scanned left to right, and matches are returned in the order found. Returning a complete list versus an iterator is a separate topic in Python.
re.finditer has the following syntax…
    matches = re.finditer(pat, str, flags)
where
- pat: regular expression pattern to search for
- str: string in which to search for the pattern
- flags: one or more modifiers, for example re.M|re.I
Here is an example…
    # Import RE module
    import re

    # A phone number
    string = "408-219-3107"

    # This will print: ['408', '219', '3107']
    res = re.findall(r"\d+", string)
    print(res)

    # This will print:
    # 408
    # 219
    # 3107
    for mat in re.finditer(r"\d+", string):
        print(mat.group())
If you run the code snippet above, the output should look like…
    ['408', '219', '3107']
    408
    219
    3107
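Because re.finditer yields match objects rather than plain strings, the position of each match is available as well. Here is a minimal sketch on the same phone number.

```python
import re

string = "408-219-3107"

# Collect the matched text together with its (start, end) position.
positions = [(mat.group(), mat.span()) for mat in re.finditer(r'\d+', string)]
print(positions)
# [('408', (0, 3)), ('219', (4, 7)), ('3107', (8, 12))]
```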
re.split
We can use re.split to split a string into tokens based on a pattern. The syntax looks like…
    tokens = re.split(pattern, string)
Where
- pattern to search for. Used as a delimiter
- string to split
- tokens is the split output list
Here is an example…
    # Import regular expressions module
    import re

    # Split using space as delimiter
    string = 'Welcome to the Python programming language'
    print(re.split(r'\s', string))

    # The code snippet above should print
    # ['Welcome', 'to', 'the', 'Python', 'programming', 'language']
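The choice of delimiter pattern matters when the input contains runs of whitespace. The sketch below uses a made-up string with a double space and a tab, and also shows the optional maxsplit argument of re.split.

```python
import re

string = 'Welcome  to the\tPython language'

# \s splits on every single whitespace character, so two spaces
# in a row produce an empty token.
single = re.split(r'\s', string)
print(single)
# ['Welcome', '', 'to', 'the', 'Python', 'language']

# \s+ treats a run of whitespace as one delimiter.
runs = re.split(r'\s+', string)
print(runs)
# ['Welcome', 'to', 'the', 'Python', 'language']

# maxsplit limits the number of splits that are performed.
limited = re.split(r'\s+', string, maxsplit=1)
print(limited)
# ['Welcome', 'to the\tPython language']
```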
re.sub
We can use re.sub to search and replace in a string. The syntax looks like…
    re.sub(pattern, repl, string, count=0)
Where
- pattern to search for
- replace with repl
- in string
- count is the maximum number of replacements; the default of 0 replaces all occurrences
Here is an example…
    # Import regular expressions module
    import re

    # Social security number string
    ssn = "323-16-8740 # Social security number"

    # Remove the comment from the SSN
    # The pattern '#.*$' means find
    # a hash character followed by zero
    # or more characters anchored at the
    # end of the string. r means raw string
    # meaning we do not need to double the
    # backslashes in the pattern
    SSN = re.sub(r'#.*$', "", ssn)
    print("SSN : {}".format(SSN))

    # Remove non digit characters
    SSN = re.sub(r'\D', "", SSN)
    print("SSN : {}".format(SSN))

    # The code snippet above should print
    # the following
    # SSN : 323-16-8740
    # SSN : 323168740
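Two further features of re.sub are worth a quick sketch: the count argument, and group references such as \1 in the replacement string. The masking format below is only an illustration. re.subn is the closely related function that also reports the number of replacements.

```python
import re

ssn = "323-16-8740"

# count=1 replaces only the first match.
first_only = re.sub(r'-', '', ssn, count=1)
print(first_only)   # 32316-8740

# \1 in the replacement refers to the first captured group.
masked = re.sub(r'\d{3}-\d{2}-(\d{4})', r'***-**-\1', ssn)
print(masked)       # ***-**-8740

# re.subn returns the new string and the number of replacements made.
digits_only = re.subn(r'\D', '', ssn)
print(digits_only)  # ('323168740', 2)
```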
Summary
- A regular expression is a text string that describes a search pattern
- Raw strings in Python do not apply any special treatment to backslashes
- Python RE module offers regular expression primitive operations for text matching and searching
- re.match matches an expression at the beginning of a string
- re.search attempts to find the first occurrence of the pattern anywhere in the input string
- Regular expression compilation produces a Python object that can be used to do all sorts of regular expression operations
- The re.fullmatch function was added in Python 3.4 to match the entire string
- re.findall returns a list of non-overlapping matches in a string
- re.finditer returns an iterator object that allows us to go through all match object instances
- re.split is used to split a string into tokens based on a pattern
- re.sub is used to search and replace in a string
References
Thanks for reading. Please use the comments section for feedback.