Base64 encoding in Python

Introduction

Today, we are going to talk about base64 encoding in Python. Understanding how characters are represented is very important to make sense of base64. For beginners, I highly recommend that you check the following article. It explains how Python handles Unicode and string data types.

Let us get started…

What is base64 encoding?

Base64 encoding starts with a stream of bytes that could represent any data: a string of characters (e.g. ASCII or UTF-8), an image, an email attachment, etc. The goal is to regroup the stream of 8-bit bytes into 6-bit chunks. With 6 bits we can represent 64 distinct values, hence the name base64. Base64 maps each 6-bit chunk of the input stream to an ASCII character from the set A-Z, a-z, 0-9, + and /. A 65th character, =, is used only for padding (we will see what padding means later). Note that every 3 bytes of input become 4 characters of base64 output, so the encoded data is roughly one third larger than the original.
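To make the alphabet and the 3-to-4 size ratio concrete, here is a minimal sketch. The name ALPHABET and the sample input are only illustrative; the sketch assumes Python 3 and the standard base64 module.

```python
import base64
import string

# The 64 symbols of the standard alphabet, in index order (0..63).
ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"

data = bytes(range(6))            # 6 arbitrary input bytes
encoded = base64.b64encode(data)  # result is a bytes object of ASCII characters

print(len(ALPHABET))              # 64
print(len(data), len(encoded))    # 6 8
```

Six input bytes are a multiple of 3, so no padding appears and the output is exactly 8 characters long.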

Why is base64 needed?

Converting bytes to plain text has many benefits. For example, some systems, such as the SMTP protocol, only work with text data. We can use this trick to convert email attachments to text and send the email as if it were entirely text. If we are building a web service that communicates using JSON, which is a text-based format, we can attach binary data by encoding it in base64. Not only can we send binary data disguised as text, but we can also view it easily in any text editor.
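As a rough illustration of the JSON case, the sketch below embeds a few binary bytes in a JSON message and recovers them on the other side. The payload, the field names "filename" and "data", and the file name are made up for this example.

```python
import base64
import json

# A made-up binary payload; these bytes are not valid text on their own.
raw = bytes([0x89, 0x50, 0x4E, 0x47, 0x00, 0xFF])

# b64encode returns bytes, so decode to str before handing it to json.dumps.
message = json.dumps({
    "filename": "logo.png",                          # hypothetical field
    "data": base64.b64encode(raw).decode("ascii"),   # hypothetical field
})

# Receiver side: parse the JSON text and turn the base64 string back into bytes.
received = json.loads(message)
restored = base64.b64decode(received["data"])

print(restored == raw)  # True
```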

Byte data in Python

Depending on which Python version you are using (2.x vs 3.x) strings are handled differently. Python 2.x defines the following types…

  • str: bytes data (often holding ASCII text). This is immutable (cannot be modified)
  • bytearray: like str, but mutable (changeable)
  • unicode: Unicode data, which is a sequence of code points (check the references for more information about Unicode)

On the other hand, Python 3 defines the following data types…

  • bytes: immutable bytes data
  • bytearray: mutable bytes data
  • str: Unicode data

Long story short, when dealing with base64, make sure your input data is a stream of bytes. If the input is already plain ASCII text, base64 encoding usually gains you nothing, because the data can already travel as text. If you are using Unicode strings, you need to call the encode method to convert them to a byte stream before doing any base64 encoding. The intention here is not to confuse the reader by mentioning UTF encoding and base64 encoding in a row; the point is that you need to start with a byte stream, regardless of what it represents, in order to encode it in base64. Let us see how base64 works, and then we will look at example code…
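Before moving on, here is a minimal sketch of that conversion in Python 3. The sample text and the choice of UTF-8 are assumptions made for illustration; any encoding works as long as both sides agree on it.

```python
import base64

text = "héllo"                    # str: a sequence of Unicode code points

# Passing a str directly is rejected in Python 3, because b64encode wants bytes.
try:
    base64.b64encode(text)
except TypeError as exc:
    print("str is rejected:", exc)

as_bytes = text.encode("utf-8")          # step 1: text -> bytes
encoded = base64.b64encode(as_bytes)     # step 2: bytes -> base64 bytes

# Reverse the two steps in the opposite order to get the text back.
round_trip = base64.b64decode(encoded).decode("utf-8")
print(round_trip == text)  # True
```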

How does base64 work?

We are going to demonstrate base64 using a simple example. Follow the steps below…

  • Given the following 2 bytes as input (0xFB, 0xFF)
  • In binary F=1111 B=1011 F=1111 F=1111 or (11111011, 11111111)
  • Regrouping the bits into chunks of 6 bits (111110, 111111, 111100)
  • Note that we added 00 to the last chunk to make 6 bits
  • Converting these numbers to decimal (62, 63, 60)
  • Doing simple base64 lookup (62 is +, 63 is /, 60 is 8)
  • The encoded characters are +/8
  • One = is added for padding so that the output length is a multiple of 4, giving the final string +/8=
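The steps above can be sketched directly in Python. The function name encode_by_hand is hypothetical and the code is written for clarity rather than speed; in practice you would call the base64 module instead.

```python
import base64
import string

ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"

def encode_by_hand(data: bytes) -> str:
    """Illustrative base64 encoder that mirrors the manual steps."""
    # Write every byte as 8 binary digits and join them into one bit string.
    bits = "".join(f"{byte:08b}" for byte in data)
    # Append zero bits so the last chunk is a full 6 bits.
    bits += "0" * (-len(bits) % 6)
    # Look up each 6-bit chunk in the alphabet.
    chars = [ALPHABET[int(bits[i:i + 6], 2)] for i in range(0, len(bits), 6)]
    # Pad with '=' so the output length is a multiple of 4.
    chars += "=" * (-len(chars) % 4)
    return "".join(chars)

print(encode_by_hand(b"\xfb\xff"))                 # +/8=
print(base64.b64encode(b"\xfb\xff").decode())      # +/8=
```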

Let us now see how to perform base64 encoding in Python…

Base64 encoding and decoding example

Below is a simple encoding and decoding example…
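A minimal round trip, assuming Python 3 and reusing the two bytes from the walkthrough above, might look like this. The variable names are illustrative, and the printed values are shown as comments.

```python
import base64

original = b"\xfb\xff"                  # the two input bytes 0xFB, 0xFF

encoded = base64.b64encode(original)    # bytes -> base64 (also a bytes object)
decoded = base64.b64decode(encoded)     # base64 -> original bytes

print(encoded)   # b'+/8='
print(decoded)   # b'\xfb\xff'
```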

If you run the code snippet above, you should get the following output…

Base64 padding

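As we indicated earlier, the output stream is a sequence of 6-bit segments. Since the input stream consists of 8-bit bytes, the last segment in the output stream can hold 2, 4 or 6 bits of real data. If it holds 4 bits we add one =, if it holds 2 bits we add ==, and otherwise no padding is added, so the encoded length is always a multiple of 4. Padding carries no data, so in principle it is not necessary for decoding the data back to its original form; note, however, that Python's base64.b64decode expects correctly padded input and raises an error otherwise. Padding really matters when encoded strings are concatenated: without it, it is not possible to tell where one string ends and the next begins.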
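The short sketch below shows the three padding cases and one way to restore padding that has been stripped. The helper add_padding is a hypothetical name, not part of the base64 module.

```python
import base64
import binascii

def add_padding(encoded: str) -> str:
    """Append '=' characters until the length is a multiple of 4."""
    return encoded + "=" * (-len(encoded) % 4)

# 1, 2 and 3 input bytes give two, one and zero padding characters.
for n in range(1, 4):
    print(n, base64.b64encode(b"x" * n).decode("ascii"))
# 1 eA==
# 2 eHg=
# 3 eHh4

stripped = "+/8"                        # "+/8=" with the padding removed
try:
    base64.b64decode(stripped)
except binascii.Error as exc:
    print("decode failed:", exc)

print(base64.b64decode(add_padding(stripped)))   # b'\xfb\xff'
```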

URL safe base64

The default base64 alphabet uses + and /, both of which have a special meaning in URLs. This may cause side effects, so using an alternate alphabet can solve the problem: + is replaced with -, and / is replaced with underscore (_). Otherwise, the alphabet is the same. Here is an example…
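A minimal comparison of the two alphabets, assuming Python 3, could look like the following. The input bytes are chosen so that the standard encoding contains both + and /.

```python
import base64

data = b"\xfb\xff"

standard = base64.b64encode(data)            # standard alphabet
url_safe = base64.urlsafe_b64encode(data)    # '-' and '_' instead of '+' and '/'

print(standard)   # b'+/8='
print(url_safe)   # b'-_8='

# Decode with the matching function to recover the original bytes.
print(base64.urlsafe_b64decode(url_safe))    # b'\xfb\xff'
```

Note that the = padding character is still present in the URL-safe output.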

If you run the code snippet above, you should get the following output…

Summary

  • Base64 encoding converts data in binary format into text
  • Exchanging data in text format has many benefits. For example, sending an image using a JSON-based web service
  • The base64 output stream is about one third longer than the input stream because every 3 bytes are split into four 6-bit segments
  • Dealing with base64 encoding in Python is as easy as importing the base64 module and then calling the appropriate function
  • The standard base64 encoding can contain + and / in the output stream. These characters have a special meaning in web URLs, which may cause problems. To fix this issue, the + and / characters are replaced with - and _. Python supports URL-safe base64 encoding. You just need to call the right function

References
