Base64 encoding in Python
Introduction
Today, we are going to talk about base64 encoding in Python. Understanding how characters are represented is important for making sense of base64. For beginners, I highly recommend first reading an article that explains how Python handles Unicode and string data types.
Let us get started…
What is base64 encoding?
Base64 encoding starts with a stream of bytes that could represent any data. The input can be a string of characters (e.g. ASCII, UTF-8), an image, an email attachment, etc. The goal is to convert the stream of 8-bit bytes into a stream of 6-bit values, each written as one printable character. With 6 bits we can represent exactly 64 different values, hence the name base64. Base64 maps each 6 bits of the input stream to an ASCII character from the set A-Z, a-z, 0-9, + and /. A 65th character, =, is used only for padding (we will see what padding means later). Note that every 3 bytes of input produce 4 characters of base64 output, so the encoded data is about one third larger than the original.
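To make the 3 bytes to 4 characters ratio concrete, here is a small sketch using the standard library base64 module. The names ALPHABET and lengths are only for illustration.

```python
import base64
import string

# The 64 symbols of the standard alphabet, in index order (0..63).
# '=' is not among them; it is only used for padding.
ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"

# Encode inputs of 1 to 6 bytes and record the length of each output
lengths = {n: len(base64.b64encode(b"x" * n)) for n in range(1, 7)}

print(len(ALPHABET))  # 64
print(lengths)        # {1: 4, 2: 4, 3: 4, 4: 8, 5: 8, 6: 8}
```

The output length always jumps in steps of 4 characters, one step for every started group of 3 input bytes.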
Why is base64 needed?
Converting bytes to plain text has many benefits. For example, some systems, such as the SMTP protocol, only work with text data. We can use this trick to convert email attachments to text and send the email as if it were entirely text. If we are building a web service that communicates using JSON, which is text based, we can attach binary data encoded in base64. Not only can we send the binary data disguised as text, but we can also view it easily in text editing software.
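As a rough sketch of the JSON case, the snippet below puts a few arbitrary bytes into a JSON message and gets them back on the other side. The payload bytes and the field names "name" and "data" are made up for this example and do not come from any particular web service.

```python
import base64
import json

# Arbitrary binary data (hypothetical payload, not valid UTF-8 text)
payload = bytes([0x89, 0x50, 0x4E, 0x47, 0x00, 0xFF])

# JSON cannot hold raw bytes, so we store the base64 text instead
message = json.dumps({
    "name": "image.png",
    "data": base64.b64encode(payload).decode("ascii"),
})

# The receiver parses the JSON and decodes the field back to bytes
received = json.loads(message)
restored = base64.b64decode(received["data"])

print(message)
print(restored == payload)  # True
```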
Byte data in Python
Strings are handled differently depending on which Python version you are using (2.x vs 3.x). Python 2.x defines the following types…
- str: byte data (also called bytes data or ASCII data). This is immutable (cannot be modified)
- bytearray: like str but mutable (changeable)
- unicode: Unicode data, which is a sequence of code points (check the references for more information about Unicode)
On the other hand, Python 3 defines the following data types…
- bytes: immutable bytes data
- bytearray: mutable bytes data
- str: Unicode data
Long story short, when dealing with base64, make sure your input data is a stream of bytes. If you are using Unicode strings, you need to call the encode function to convert them to a byte stream (for example UTF-8) before doing any base64 encoding. Also keep in mind that if the input is already plain ASCII text, base64 encoding gains little, since the data is already text. The intention here is not to confuse the reader by mentioning UTF encoding and base64 encoding in a row; the point is that you need to start with a byte stream, regardless of what it represents, in order to encode it in base64. Let us see how base64 works, then we will provide example code…
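Here is a minimal sketch of the bytes requirement, assuming Python 3. The sample text and variable names are chosen only for illustration.

```python
import base64

text = "héllo"  # a Unicode str in Python 3

# Passing a str directly fails in Python 3: b64encode expects bytes
try:
    base64.b64encode(text)
    raised = False
except TypeError:
    raised = True

# Convert the text to bytes first, then base64 encode the bytes
raw = text.encode("utf-8")
encoded = base64.b64encode(raw)

# Reverse both steps to get the original text back
round_trip = base64.b64decode(encoded).decode("utf-8")

print(raised)      # True
print(encoded)
print(round_trip)  # héllo
```

Note that there are two separate steps in each direction: text to bytes (UTF-8), then bytes to base64, and the reverse when decoding.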
How does base64 work?
We are going to demonstrate base64 using a simple example. Follow the steps below…
- Given the following 2 bytes as input (0xFB, 0xFF)
- In binary F=1111 B=1011 F=1111 F=1111, or (11111011, 11111111)
- Regroup the 16 bits into 6-bit chunks (111110, 111111, 111100)
- Note that we added 00 to the last chunk, which only had 4 bits, to make 6 bits
- Converting these numbers to decimal gives (62, 63, 60)
- Doing a simple base64 lookup (62 is +, 63 is /, 60 is 8)
- The encoded string so far is +/8
- One = is added for padding, so the final output is +/8=
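The steps above can be sketched in a few lines of Python. This is only a teaching sketch: manual_b64encode is a hypothetical helper written for this article, not part of the base64 module, and in real code you would call base64.b64encode instead.

```python
import base64
import string

# Index 0..63 of the standard base64 alphabet
ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"

def manual_b64encode(data):
    """Encode bytes to a base64 str by following the steps by hand."""
    # 1. Write every byte as 8 binary digits
    bits = "".join("{:08b}".format(byte) for byte in data)
    # 2. Add zeros on the right so the length is a multiple of 6
    bits += "0" * (-len(bits) % 6)
    # 3. Split into 6-bit chunks and look each one up in the alphabet
    chars = [ALPHABET[int(bits[i:i + 6], 2)] for i in range(0, len(bits), 6)]
    # 4. Add '=' until the output length is a multiple of 4
    chars.append("=" * (-len(chars) % 4))
    return "".join(chars)

result = manual_b64encode(b"\xfb\xff")
print(result)  # +/8=

# Cross-check against the standard library
print(result == base64.b64encode(b"\xfb\xff").decode("ascii"))  # True
```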
Let us now see how to perform base64 encoding in Python…
Base64 encoding and decoding example
Below is a simple encoding and decoding example…
```python
# Import base64 module
import base64

# Input string, converted to bytes because
# b64encode requires bytes in Python 3
data = 'Hello world'.encode('utf-8')

# Base64 encode
encoded = base64.b64encode(data)

# Base64 decode
decoded = base64.b64decode(encoded)

# Print original data
print("Original data : {}".format(data.decode('utf-8')))

# Print encoded data (base64 output is ASCII only)
print("Encoded data : {}".format(encoded.decode('ascii')))

# Print decoded data
print("Decoded data : {}".format(decoded.decode('utf-8')))
```
If you run the code snippet above, you should get the following output…
```
Original data : Hello world
Encoded data : SGVsbG8gd29ybGQ=
Decoded data : Hello world
```
Base64 padding
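As we indicated earlier, the output stream is a sequence of 6-bit segments. Since the input stream consists of 8-bit bytes, the last segment can be left with only 2, 4 or all 6 bits of real data. If it holds 4 bits we add one = to the output, if it holds 2 bits we add ==, and otherwise no padding is added. In other words, the output is padded until its length is a multiple of 4 characters. Strictly speaking, padding is not necessary to recover the original data, because the decoder can work out the number of missing bits from the length of the encoded text. Note, however, that Python's base64.b64decode raises an error when the padding is missing. Padding matters mainly when encoded strings are concatenated: without it, there is no marker left to help tell where one string ends and the next begins.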
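The short sketch below shows the three padding cases and what happens in Python when the padding has been stripped. The function add_padding is a hypothetical helper written for this example, not part of the base64 module.

```python
import base64
import binascii

# Input lengths of 1, 2 and 3 bytes give '==', '=' and no padding
for raw in (b"a", b"ab", b"abc"):
    print(raw, base64.b64encode(raw))

# Python's decoder expects the padding to be present
try:
    base64.b64decode(b"+/8")
    strict_ok = True
except binascii.Error:
    strict_ok = False

def add_padding(encoded):
    """Hypothetical helper: restore stripped '=' characters."""
    return encoded + b"=" * (-len(encoded) % 4)

restored = base64.b64decode(add_padding(b"+/8"))

print(strict_ok)  # False
print(restored)   # b'\xfb\xff'
```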
URL safe base64
The standard base64 alphabet includes + and /, which have special meanings in URLs. Using them there may cause side effects, so an alternate encoding solves the problem. In URL safe base64 the + is replaced with -, and / is replaced with underscore (_). Otherwise, the alphabet is the same. Here is an example…
```python
# Import base64 module
import base64

# Input binary data
# \x is used to denote a byte
# In this example we have 2 bytes
data = b'\xfb\xff'

# Note that we are printing the representation
# of the data because the data is not printable
print("original data bytes : {}".format(repr(data)))
print("original data in binary : {}".format(bin(0xfbff)))
print("6 bit numbers : 111110 111111 1111")
print("Add 2 zeros to 1111 : 111100")
print("In decimal : 62 63 60")
print("Lookup these numbers : + / 8")

# Encode data using standard base64
standard_encoded_data = base64.standard_b64encode(data)
print("standard encoded data : {}".format(repr(standard_encoded_data)))

# Decode data using standard base64
standard_decoded_data = base64.standard_b64decode(standard_encoded_data)
print("standard decoded data : {}".format(repr(standard_decoded_data)))

# Encode data using url safe base64
urlsafe_encoded_data = base64.urlsafe_b64encode(data)
print("url safe encoded data : {}".format(repr(urlsafe_encoded_data)))

# Decode data using url safe base64
urlsafe_decoded_data = base64.urlsafe_b64decode(urlsafe_encoded_data)
print("url safe decoded data : {}".format(repr(urlsafe_decoded_data)))

# Encode data using base64 with custom characters
# (altchars must be 2 bytes that replace + and /)
custom_encoded_data = base64.b64encode(data, altchars=b'*~')
print("custom encoded data : {}".format(repr(custom_encoded_data)))

# Decode data using base64 with custom characters
custom_decoded_data = base64.b64decode(custom_encoded_data, altchars=b'*~')
print("custom decoded data : {}".format(repr(custom_decoded_data)))
```
If you run the code snippet above with Python 3, you should get the following output (the b prefix marks bytes values)…
```
original data bytes : b'\xfb\xff'
original data in binary : 0b1111101111111111
6 bit numbers : 111110 111111 1111
Add 2 zeros to 1111 : 111100
In decimal : 62 63 60
Lookup these numbers : + / 8
standard encoded data : b'+/8='
standard decoded data : b'\xfb\xff'
url safe encoded data : b'-_8='
url safe decoded data : b'\xfb\xff'
custom encoded data : b'*~8='
custom decoded data : b'\xfb\xff'
```
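To see one of the side effects in practice, the sketch below percent-encodes both variants with urllib.parse.quote from the standard library, as would happen when the value is placed in a URL query string. Passing safe="" is a choice made for this example so that / is escaped too.

```python
import base64
from urllib.parse import quote

data = b"\xfb\xff"
standard = base64.b64encode(data).decode("ascii")         # '+/8='
urlsafe = base64.urlsafe_b64encode(data).decode("ascii")  # '-_8='

# Percent-encode every reserved character (safe="" also escapes '/')
print(quote(standard, safe=""))  # %2B%2F8%3D
print(quote(urlsafe, safe=""))   # -_8%3D
```

With the URL safe alphabet only the padding character needs escaping, while the standard alphabet has + and / rewritten as well.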
Summary
- Base64 encoding converts data in binary format into text
- Exchanging data in text format has many benefits. For example, sending an image using a JSON based web service
- Base64 output is longer than the input (4 characters for every 3 bytes) because the bytes are split into 6-bit segments
- Dealing with Base64 encoding in Python is as easy as importing the base64 module then calling the appropriate function
- The standard base64 encoding may contain + and / in the output stream. These characters have special meanings in web URLs, which may cause problems. To fix this issue, + and / are replaced with - and _. Python supports URL safe base64 encoding; you just need to call the right function
References