LZW Algorithm: Introduction
It is beyond our imagination to specify a ceiling for the amount of information that an individual or a group of people potentially need to store and/or exchange. Using computer systems, almost all kinds of information can be digitized and therefore saved to storage media or sent across channels all over the globe. Despite the fact that storage has become relatively cheap and has the ability to handle huge amounts of data, similarly communication networks are now much faster and have wider bandwidths, still there is a dire need to reduce data storage size as information is continuously increasing. In this tutorial we will develop a piece of software called LZW Compactor which is capable of compressing/decompressing text files based on LZW algorithm one of most well know methods for data compression such as that in TIFF image format. As mentioned earlier only text files are supported however extending the software to compress binary files should be straight forward. The educational nature of this tutorial requires us to focus on the demonstration side of the algorithm while addressing some performance aspects. In few words, the software is intended to compress/decompress text files then prints out a detailed tabular log of the internal workings of the algorithm. We will be using Visual C# programming language on the MS Windows operating system.
The basic idea behind LZW compression is to replace strings of characters with single code symbols. The algorithm does not do any analysis of the incoming text. It adds every new string of characters it recognizes to a table of strings. Compression is performed when a single code is transmitted instead of a string of characters. It is not necessary to preserve the encoder's dictionary in order to decode the compressed data stream. LZW Compactor compresses text according to the following flow chart:
The decompression algorithm builds its own dictionary directly from input symbol stream. There is no need to transmit any translation table along with the encoded data. Similar to the compression algorithm, it adds a new string to its own dictionary when reading a new code. The decoding algorithm then translates each incoming code into a string of characters and sends it to the output. Note that the first 256 codes are already reserved to translate single character strings. LZW Compactor extracts compressed text as specified in the following flow chart:
Dictionary Size and Index Size
There are two methods to choose a bit size for encoded output symbols. The first one is dynamic LZW (used in the GIF image format). This method writes as many bits as the current dictionary size requires. For example, if the current dictionary size is 300 elements, the index value is written as a 9 bit integer. When the element with index 512 is added to the dictionary, the output bit size is increased by one. The second method which is used in this tutorial is to have a fixed length of 2 bytes for output symbols. It is easier to implement on the expense of space and low compression ratios for small size files. This tends to be of less importance as the file size grows. Choosing a specific output symbol index size automatically limits the maximum size of the dictionary. In our case, a 2 byte index yields a maximum dictionary size of 65536 bytes. This means that whenever the dictionary is filled up we need to reset it again and start building a brand new one. Flushing the dictionary solves our problem on the expense of compression ratio drop but eventually the dictionary size grows back again and compensate the loss in compression.
Compressed File Format
Whenever we add a new text file to the created compaction we append four chunks of data. The first chunk is the file name which consists of 32 characters. Obviously, this is not enough for longer file names but at least it satisfies the minimum requirements we set for this tutorial. The next two chunks are the original file size before compression and new file size after compression. The first size value is used when opening a saved compaction while the second size value tells the software how many bytes to seek before entering the next file in the buffer. The last chuck is nothing but the compressed file buffer which basically contains the encoded data. The schematic diagram below shows the format that we just described:
If we take a quick look at the compression flow chart listed earlier, it is clear that the most critical component of the encoding algorithm performnace wise is the dictionary look-up. Lookup implementation determines whether the compression of a few megabytes of input text will take tens of minutes or just a few seconds. There are several ways to implement the encoder's dictionary but our concern here is how fast we can look up a search string. Putting that in mind, we will be using a search efficient data structure such as a hash table in both the encoder and the decoder.
Compression and decompression is a time consuming job which can cause the main application crawl. With that in mind, we decided to put the compression and decompression jobs in separate execution threads. This gives us the flexibility to update operations progress and prevent the main application from freezing.
Application Main Interface
LZW Compactor incorporates a main toolbar from which all supported operations can be initiated. Almost all functions are self explanatory, for example one can create a new compaction using the New button, Open an existing compaction using the Open button, add files to an existing compaction using the Add button, remove selected files from compaction using the Remove button, Extract selected files to an output folder using the Extract button and display the compression and decompression algorithm log using the Algorithm button. In case of extraction, if there are no files selected then the default behavior is to extract all files. In addition to the main toolbar, we have a main menu and a context menu for more flexibility. The screenshot below shows the main interface in action:
LZW Compactor provides a special interface to display the internal workings of the compression and extraction algorithms. This becomes handy in debugging or if one likes to learn how compression and extraction takes place. This feature displays the first 150 characters from compression and decompression logs as it is not practical to display thousands or millions of characters in case of large files. The diagram below demonstrates encoding for the following example input text: TOBEORNOTTOBEORTOBEORNOT
and here is the decoding log for the example mentioned earlier
Future Work and Enhancements
As we stated before, the purpose behind LZW Compactor was to study and learn LZW compression technique. It was not intended to be the next WinZip competitor but this does not prevent us from improving its functionality. Here is a list of enhancements that can be added to the project:
- Support binary files by implementing strings of bytes instead of strings of characters.
- Speed up compression by integer based search rather than string based
- Speed up even more by writing chunks of data to the output stream instead of single bytes
- Implement data encryption.
In this tutorial we developed a fairly fast and easy to use compression/decompression piece of software based on the LZW algorithm. The software has also the ability to display the internal workings of the compression and decompression operations. In addition to speed and ease of use, we did our best to make it more convenient to end users by supporting multi threaded operations. We put some effort as well to come out with an nice looking and easy to use graphical user interface. Thanks for visiting.
The full source code can be downloaded by clicking the link below