Pasi 'Albert' Ojala, a1bert@iki.fi, http://www.iki.fi/a1bert/

Compression Basics

Introduction

Because real-world files are usually quite redundant, compression can often reduce file sizes considerably. This in turn reduces the storage space and transfer channel capacity needed. Especially in systems where memory is at a premium, compression can make the difference between impossible and implementable. The Commodore 64 and its relatives are good examples of this kind of system.

The most used 5.25-inch disk drive for Commodore 64 only holds 170 kB of data, which is only about 2.5 times the total random access memory of the machine. With compression, many more programs can fit on a disk. This is especially true for programs containing flashy graphics or sampled sound. Compression also reduces the loading times from the notoriously slow 1541 drive, whether you use the original slow serial bus routines or some kind of a disk turbo loader routine.

Dozens of compression programs are available for the Commodore 64. I leave the work of chronicling the history of the C64 compression programs to others and concentrate on a general overview of the different compression algorithms. Later we'll take a closer look at how the compression algorithms actually work, and in the next article I will introduce my own creation: pucrunch.

Pucrunch is a compression program written in ANSI-C which generates files that automatically decompress and execute themselves when run on a C64 (or C128 in C64-mode, VIC20, C16/+4). A cross-compressor, if you will, allowing you to do the real work on any machine, like a cross-assembler.

Our target environment (Commodore 64 and VIC20) restricts us somewhat when designing the 'ideal' compression system. We would like it to be able to decompress as big a program as possible. Therefore the decompression code must be located in low memory and be as short as possible and must use very small amounts of extra memory.

Another requirement is that the decompression should be relatively fast, which means that the arithmetic used should be mostly 8- or 9-bit which is much faster than e.g. 16-bit arithmetic. Processor- and memory-intensive algorithms are pretty much out of the question. A part of the decompressor efficiency depends on the format of the compressed data. Byte-aligned codes can be accessed very quickly; non-byte-aligned codes are much slower to handle, but provide better compression.

This is not meant to be the end-all document for data compression. My intention is to only scratch the surface and give you a crude overview. Also, I'm mainly talking about lossless compression here, although some lossy compression ideas are briefly mentioned. A lot of compression talk is available on the World Wide Web, although it may not be possible to understand everything on the first reading. To build the knowledge, you have to read many documents and understand something from each one so that when you return to a document, you can understand more than the previous time. It's a lot like watching Babylon 5. :-)

Some words of warning: I try to give something interesting to read to both advanced and not so advanced readers. It is perfectly all right for you to skip all uninteresting details. I start with a Huffman and LZ77 example so you can get the basic idea before flooding you with equations, complications, and trivia.

Huffman and LZ77 Example

Let's say I had some simple language like "Chippish" containing only the letters CDISV. How would a string like

	SIDVICIIISIDIDVI

compress using a) Huffman encoding, and b) LZ77? How do compression concepts such as information entropy enter into this?

A direct binary code would map the different symbols to consecutive bit patterns, such as:

	Symbol  Code
	'C'	000
	'D'	001
	'I'	010
	'S'	011
	'V'	100

Because there are five symbols, we need 3 bits to represent all of the possibilities, but we also don't use all the possibilities. Only 5 values are used out of the maximum 8 that can be represented in 3 bits. With this code the original message takes 48 bits:

	SIDVICIIISIDIDVI ==
	011 010 001 100 010 000 010 010 010 011 010 001 010 001 100 010

For Huffman and for entropy calculation (entropy is explained in the next chapter) we first need to calculate the symbol frequencies from the message. The probability for each symbol is the frequency divided by the message length. When we reduce the number of bits needed to represent the probable symbols (their code lengths) we can also reduce the average code length and thus the number of bits we need to send.

	'C'	1/16	0.0625
	'D'	3/16	0.1875
	'I'	8/16	0.5
	'S'	2/16	0.125
	'V'	2/16	0.125

The entropy gives the lower limit for the average code length of a statistical compression method. Using the equation from the next section, we can calculate it as 1.953. This means that however cleverly you select a code to represent the symbols, on average you need at least 1.953 bits per symbol. For this 16-symbol message that is 31.25 bits, so you can't do better than 32 bits.
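
The figures above can be checked with a few lines of Python. This is only a sketch: the function name entropy_bits_per_symbol is an illustrative assumption, and the formula is the entropy equation H = sum{ -p*log2(p) } from the Message Entropy section.

```python
import math
from collections import Counter

def entropy_bits_per_symbol(message):
    """Zero-order entropy of a message, in bits per symbol."""
    counts = Counter(message)
    total = len(message)
    # H = sum over the symbols of -p * log2(p), with p = frequency / length.
    return sum(-(n / total) * math.log2(n / total) for n in counts.values())

message = "SIDVICIIISIDIDVI"
h = entropy_bits_per_symbol(message)
# A whole message cannot use a fraction of a bit, so round up.
min_bits = math.ceil(h * len(message))
print(f"{h:.3f} bits/symbol, at least {min_bits} bits in total")
```

For the example message this reports about 1.953 bits per symbol and a floor of 32 bits.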

Next we create the Huffman tree. We first rank the symbols in decreasing probability order and then at each step combine two lowest-probability symbols into a single composite symbol (C1, C2, ..). The probability of this new symbol is therefore the sum of the two original probabilities. The process is then repeated until a single composite symbol remains:

	Step 1         Step 2	     Step 3         Step 4
	'I' 0.5	       'I' 0.5       'I' 0.5        C3  0.5\C4
	'D' 0.1875     C1  0.1875    C2  0.3125\C3  'I' 0.5/
	'S' 0.125      'D' 0.1875\C2 C1  0.1875/
	'V' 0.125 \C1  'S' 0.125 /
	'C' 0.0625/

Note that the composite symbols are inserted as high as possible, to get the shortest maximum code length (compare C1 and 'D' at Step 2).

At each step two lowest-probability nodes are combined until we have only one symbol left. Without knowing it we have already created a Huffman tree. Start at the final symbol (C4 in this case), break up the composite symbol assigning 0 to the first symbol and 1 to the second one. The following tree just discards the probabilities as we don't need them anymore.

		 C4
	      0 / \ 1
	       /  'I'
	      C3
	   0 / \ 1
	    /   \
	   C2   C1
	0 / \1 0/ \ 1
	'D''S' 'V''C'

	Symbol  Code  Code Length
	'C'	011   3
	'D'	000   3
	'I'	1     1
	'S'	001   3
	'V'	010   3

When we follow the tree from the top to the symbol we want to encode and remember each decision (which branch to follow), we get the code: {'C', 'D', 'I', 'S', 'V'} = {011, 000, 1, 001, 010}. For example when we see the symbol 'C' in the input, we output 011. If we see 'I' in the input, we output a single 1. The code for 'I' is very short because it occurs very often in the input.

Now we have the code lengths and can calculate the average code length: 0.0625*3+0.1875*3+0.5*1+0.125*3+0.125*3 = 2. We did not quite reach the lower limit that entropy gave us. Well, actually it is not so surprising because we know that a Huffman code reaches the entropy limit only if all the probabilities are negative powers of two.

Encoded, the message becomes:

	SIDVICIIISIDIDVI == 001 1 000 010 1 011 1 1 1 001 1 000 1 000 010 1

The spaces are only to make the reading easier. So, the compressed output takes 32 bits and we need at least 10 bits to transfer the Huffman tree by sending the code lengths (more on this later). The message originally took 48 bits, now it takes at least 42 bits.
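
The tree-building procedure can be sketched in Python with a priority queue. This is a minimal illustration, not pucrunch code: the function name huffman_code_lengths and the tie-breaking order are assumptions, and only the code lengths are tracked, not the bit patterns. A different tie-breaking rule may produce different individual lengths, but the total number of bits for the message stays the same.

```python
import heapq
from collections import Counter

def huffman_code_lengths(message):
    """Return {symbol: code length} for a Huffman code built from the message."""
    counts = Counter(message)
    # Heap entries: (weight, unique tie-breaker, {symbol: depth so far}).
    heap = [(n, i, {sym: 0}) for i, (sym, n) in enumerate(sorted(counts.items()))]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        # Combine the two lowest-weight nodes into one composite node.
        w1, _, a = heapq.heappop(heap)
        w2, _, b = heapq.heappop(heap)
        # Every symbol below the new node ends up one level deeper.
        merged = {sym: depth + 1 for sym, depth in {**a, **b}.items()}
        heapq.heappush(heap, (w1 + w2, next_id, merged))
        next_id += 1
    return heap[0][2]

message = "SIDVICIIISIDIDVI"
lengths = huffman_code_lengths(message)
total_bits = sum(lengths[sym] for sym in message)
print(lengths, total_bits)
```

For the example message the sketch gives 'I' a one-bit code and a total of 32 bits, in line with the hand-built tree.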

Huffman coding is an example of a "variable length code" with a "defined word" input. Inputs of fixed size -- a single, three-bit letter above -- are replaced by a variable number of bits. At the other end of the scale are routines which break the input up into variably sized chunks, and replace those chunks with an often fixed-length output. The most popular schemes of this type are Lempel-Ziv, or LZ, codes.

Of these, LZ77 is probably the most straightforward. It tries to replace recurring patterns in the data with a short code. The code tells the decompressor how many symbols to copy and from where in the output to copy them. To compress the data, LZ77 maintains a history buffer which contains the data that has been processed and tries to match the next part of the message to it. If there is no match, the next symbol is output as-is. Otherwise an (offset,length) pair is output.

	Output	         History Lookahead
			         SIDVICIIISIDIDVI
	S		       S IDVICIIISIDIDVI
	I		      SI DVICIIISIDIDVI
	D		     SID VICIIISIDIDVI
	V		    SIDV ICIIISIDIDVI
	I		   SIDVI CIIISIDIDVI
	C		  SIDVIC IIISIDIDVI
	I		 SIDVICI IISIDIDVI
	I		SIDVICII ISIDIDVI
	I	       SIDVICIII SIDIDVI
		       ---       ---	match length: 3
		       |----9---|	match offset: 9
	(9, 3)	    SIDVICIIISID IDVI
			      -- --	match length: 2
			      |2|	match offset: 2
	(2, 2)	  SIDVICIIISIDID VI
		     --          --	match length: 2
		     |----11----|	match offset: 11
	(11, 2)	SIDVICIIISIDIDVI

At each stage the string in the lookahead buffer is searched for in the history buffer. The longest match is used and the distance between the match and the current position is output, with the match length. The processed data is then moved to the history buffer. Note that the history buffer contains data that has already been output. On the decompression side it corresponds to the data that has already been decompressed. The message becomes:

	S I D V I C I I I (9,3) (2,2) (11,2)

The following describes what the decompressor does with this data.

	History			Input
				S
	S			I
	SI			D
	SID			V
	SIDV			I
	SIDVI			C
	SIDVIC			I
	SIDVICI			I
	SIDVICII		I
	SIDVICIII		(9,3)	-> SID
	|----9---|
	SIDVICIIISID		(2,2)	-> ID
		  |2|
	SIDVICIIISIDID		(11,2)	-> VI
	   |----11----|
	SIDVICIIISIDIDVI

In the decompressor the history buffer contains the data that has already been decompressed. If we get a literal symbol code, it is added as-is. If we get an (offset,length) pair, the offset tells us from where to copy and the length tells us how many symbols to copy to the current output position. For example (9,3) tells us to go back 9 locations and copy 3 symbols to the current output position. The great thing is that we don't need to transfer or maintain any other data structure than the data itself.
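
The decompression walk-through can be sketched in Python as follows. The token representation is an assumption made for illustration: a literal is a one-character string and a match is an (offset, length) tuple. How the tokens are packed into bits is left out entirely.

```python
def lz77_decode(tokens):
    """Rebuild the message from literals and (offset, length) pairs."""
    history = []
    for token in tokens:
        if isinstance(token, tuple):
            offset, length = token
            start = len(history) - offset
            # Copy one symbol at a time from the already decoded data.
            for i in range(length):
                history.append(history[start + i])
        else:
            history.append(token)  # a literal symbol is added as-is
    return "".join(history)

tokens = ["S", "I", "D", "V", "I", "C", "I", "I", "I", (9, 3), (2, 2), (11, 2)]
decoded = lz77_decode(tokens)
print(decoded)  # SIDVICIIISIDIDVI
```

Note that the decoder needs no table of its own: the output produced so far is the dictionary.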

Compare this to the BASIC interpreter, where all tokens have the high bit set and all normal characters don't (PETSCII codes 0-127). So when the LIST routine sees a normal character it just prints it as-is, but when it hits a special character (PETSCII >= 128) it looks up the corresponding keyword in a table. LZ77 is similar, but an LZ77 LIST would look up the keyword in the data already LISTed to the screen! LZ78 uses a separate table which is expanded as the data is processed.

The number of bits needed to encode the message (~52 bits) is somewhat bigger than the Huffman code used (42 bits). This is mainly because the message is too short for LZ77. It takes quite a long time to build up a good enough dictionary (the history buffer).

Introduction to Information Theory

Symbol Sources

Information theory traditionally deals with symbol sources that have certain properties. One important property is that they give out symbols that belong to a finite, predefined alphabet A. An alphabet can consist of for example all upper-case characters (A = {'A','B','C',..'Z',..}), all byte values (A = {0,1,..255}) or both binary digits (A = {0,1}).

As we are dealing with file compression, the symbol source is a file and the symbols (characters) are byte values from 0 to 255. A string or a phrase is a concatenation of symbols, for example 011101, "AAACB". Quite intuitive, right?

When reading symbols from a symbol source, there is some probability for each of the symbols to appear. For totally random sources each symbol is equally likely, but random sources are also incompressible, and we are not interested in them here. Equal probabilities or not, probabilities give us a means of defining the concept of symbol self-information, i.e. the amount of information a symbol carries.

Simply, the more probable an event is, the fewer bits of information it contains. If we denote the probability of a symbol A[i] occurring as p(A[i]), the expression -log2(p(A[i])) (base-2 logarithm) gives the amount of information in bits that the source symbol A[i] carries. You can calculate base-2 logarithms using base-10 or natural logarithms if you remember that log2(n) = log(n)/log(2).

A real-world example would be a comparison between two statements:

it is raining
the moon of earth has exploded.

The first case happens every once in a while (assuming we are not living in a desert area). Its probability may change around the world, but may be something like 0.3 during bleak autumn days. You would not be very surprised to hear that it is raining outside. It is not so for the second case. The second case would be big news, as it has never before happened, as far as we know. Although it seems very unlikely, we could assign a very small probability to it, like 1E-30. The equation gives the self-information for the first case as 1.74 bits, and 99.7 bits for the second case.
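
The two figures can be reproduced directly from the expression -log2(p). The helper name self_information is an illustrative assumption.

```python
import math

def self_information(p):
    """Information content in bits of an event with probability p (0 < p <= 1)."""
    return -math.log2(p)

rain = self_information(0.3)     # "it is raining"
moon = self_information(1e-30)   # "the moon of earth has exploded"
print(round(rain, 2), round(moon, 1))  # 1.74 99.7
```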

Message Entropy

So, the more probable a symbol is, the less information it carries. What about the whole message, i.e. the symbols read from the input stream? What is the information content a specific message carries? This brings us to another concept: the entropy of a source. The measure of entropy gives us the amount of information in a message and is calculated like this: H = sum{ -p(A[i])*log2(p(A[i])) }. For completeness we note that 0*log2(0) gives the result 0 although log2(0) is not defined in itself. In essence, we multiply the information a symbol carries by the probability of the symbol and then sum all multiplication results for all symbols together.

The entropy of a message is a convenient measure of information, because it sets the lower limit for the average codeword length for a block-to-variable code, for example Huffman code. You cannot get better compression with a statistical compression method which only considers single-symbol probabilities. The average codeword length is calculated in an analogous way to the entropy. Average code length is L = sum{ p(A[i])*l(i) }, where l(i) is the codeword length for the ith symbol in the alphabet. The difference between L and H gives an indication about the efficiency of a code. Smaller difference means more efficient code.
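
As a small worked comparison, the sketch below evaluates H and L for the 'Chippish' example, weighting each codeword length by the probability of its symbol exactly as in the 0.0625*3+0.1875*3+0.5*1+0.125*3+0.125*3 calculation earlier. The function names are illustrative assumptions.

```python
import math

def entropy(probs):
    """H = sum of -p * log2(p); zero probabilities contribute nothing."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

def average_code_length(probs, lengths):
    """L = sum of p * l: each codeword length weighted by its probability."""
    return sum(p * l for p, l in zip(probs, lengths))

probs = [1/16, 3/16, 8/16, 2/16, 2/16]   # C, D, I, S, V
fixed_lengths = [3, 3, 3, 3, 3]          # direct 3-bit binary code
huffman_lengths = [3, 3, 1, 3, 3]        # Huffman code from the example

h = entropy(probs)
for name, lengths in (("fixed", fixed_lengths), ("Huffman", huffman_lengths)):
    l = average_code_length(probs, lengths)
    print(f"{name}: L = {l:.3f}, L - H = {l - h:.3f}")
```

The fixed code wastes about one bit per symbol, while the Huffman code comes within about 0.05 bits of the entropy.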

It is no coincidence that the entropy and average code length are calculated using very similar equations. If the symbol probabilities are not equal, we can get a shorter overall message, i.e. shorter average codeword length (i.e. compression), if we assign shorter codes for symbols that are more likely to occur. Note that entropy is only the lower limit for statistical compression systems. Other methods may perform better, although not for all files.

Codes

A code is any mapping from an input alphabet to an output alphabet. A code can be e.g. {a, b, c} = {0, 1, 00}, but this code is obviously not uniquely decodable. If the decoder gets a code message of two zeros, there is no way it can know whether the original message had two a's or a c.

A code is instantaneous if each codeword (a code symbol as opposed to source symbol) in a message can be decoded as soon as it is received. The binary code {a, b} = {0, 01} is uniquely decodable, but it isn't instantaneous. You need to peek into the future to see if the next bit is 1. If it is, b is decoded, if not, a is decoded. The binary code {a, b, c} = {0, 10, 11} on the other hand is an instantaneous code.

A code is a prefix code if and only if no codeword is a prefix of another codeword. A code is instantaneous if and only if it is a prefix code, so a prefix code is always a uniquely decodable instantaneous code. We only deal with prefix codes from now on. It can be proven that all uniquely decodable codes can be changed into prefix codes of equal code lengths.
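
Whether a set of codewords forms a prefix code is easy to test mechanically. The following sketch (the function name is an assumption) checks the three example codes used above.

```python
def is_prefix_code(codewords):
    """True if no codeword is a prefix of another codeword."""
    words = sorted(codewords)
    # After sorting, a codeword that is a prefix of some other codeword
    # is immediately followed by a codeword that starts with it.
    return all(not b.startswith(a) for a, b in zip(words, words[1:]))

print(is_prefix_code(["0", "1", "00"]))   # False: not even uniquely decodable
print(is_prefix_code(["0", "01"]))        # False: decodable, but not instantaneous
print(is_prefix_code(["0", "10", "11"]))  # True: an instantaneous code
```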

'Classic' Code Classification

Compression algorithms can be crudely divided into four groups:

Block-to-block codes
Block-to-variable codes
Variable-to-block codes
Variable-to-variable codes

Block-to-block codes

These codes take a specific number of bits at a time from the input and emit a specific number of bits as a result. If all of the symbols in the input alphabet (in the case of bytes, all values from 0 to 255) are used, the output alphabet must be the same size as the input alphabet, i.e. uses the same number of bits. Otherwise it could not represent all arbitrary messages.

Obviously, this kind of code does not give any compression, but it allows a transformation to be performed on the data, which may make the data more easily compressible, or which separates the 'essential' information for lossy compression. For example the discrete cosine transform (DCT) belongs to this group. It doesn't really compress anything, as it takes in a matrix of values and produces a matrix of equal size as output, but the resulting values hold the information in a more compact form.

In lossless audio compression the transform could be something along the lines of delta encoding, i.e. the difference between successive samples (there is usually high correlation between successive samples in audio data), or something more advanced like Nth order prediction. Only the prediction error is transmitted. In lossy compression the prediction error may be transmitted in reduced precision. The reproduction in the decompression won't then be exact, but the number of bits needed to transmit the prediction error may be much smaller.
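
Delta encoding is simple enough to sketch in a few lines. The example assumes unsigned 8-bit samples with wraparound arithmetic and a starting predictor value of 0; both are illustrative choices, as are the function names.

```python
def delta_encode(samples):
    """Replace each 8-bit sample by its difference from the previous one."""
    prev = 0
    out = []
    for s in samples:
        out.append((s - prev) & 0xFF)  # difference modulo 256
        prev = s
    return out

def delta_decode(deltas):
    """Undo delta_encode by accumulating the differences modulo 256."""
    prev = 0
    out = []
    for d in deltas:
        prev = (prev + d) & 0xFF
        out.append(prev)
    return out

samples = [128, 130, 133, 135, 134, 131, 129, 128]
deltas = delta_encode(samples)
print(deltas)  # [128, 2, 3, 2, 255, 253, 254, 255]
```

The output is exactly as long as the input, but after the first value the differences cluster around 0 (255 is -1 modulo 256), which suits a statistical coder better than the raw samples do.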

One block-to-block code relevant to Commodore 64, VIC 20 and their relatives is nybble packing that is performed by some C64 compression programs. As nybbles by definition only occupy 4 bits of a byte, we can fit two nybbles into each byte without throwing any data away, thus getting 50% compression from the original which used a whole byte for every nybble. Although this compression ratio may seem very good, in reality very little is gained globally. First, only very small parts of actual files contain nybble-width data. Secondly, better methods exist that also take advantage of the patterns in the data.
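
Nybble packing can be sketched as follows. The layout (high nybble first, a zero nybble as padding for odd counts) and the function names are assumptions for illustration; actual C64 packers may arrange the data differently.

```python
def pack_nybbles(values):
    """Pack 4-bit values two per byte, high nybble first."""
    if any(not 0 <= v <= 15 for v in values):
        raise ValueError("values must fit in 4 bits")
    padded = list(values) + [0] * (len(values) % 2)  # pad odd counts
    return bytes((hi << 4) | lo for hi, lo in zip(padded[0::2], padded[1::2]))

def unpack_nybbles(packed, count):
    """Recover `count` 4-bit values from packed bytes."""
    out = []
    for byte in packed:
        out += [byte >> 4, byte & 0x0F]
    return out[:count]

values = [1, 2, 3, 4, 15]
packed = pack_nybbles(values)
print(packed.hex())  # 1234f0
```

The original count has to be stored somewhere, because the padding nybble is otherwise indistinguishable from data.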

Block-to-variable codes

Block-to-variable codes use a variable number of output bits for each input symbol. All statistical data compression systems, such as symbol ranking, Huffman coding, Shannon-Fano coding, and arithmetic coding belong to this group (these are explained in more detail later). The idea is to assign shorter codes for symbols that occur often, and longer codes for symbols that occur rarely. This provides a reduction in the average code length, and thus compression.

There are three types of statistical codes: fixed, static, and adaptive. Static codes need two passes over the input message. During the first pass they gather statistics of the message so that they know the probabilities of the source symbols. During the second pass they perform the actual encoding. Adaptive codes do not need the first pass. They update the statistics while encoding the data. The same updating of statistics is done in the decoder so that they keep in sync, making the code uniquely decodable. Fixed codes are 'static' static codes. They use a preset statistical model, and the statistics of the actual message have no effect on the encoding. You just have to hope (or make certain) that the message statistics are close to those the code assumes.

However, 0-order statistical compression (and entropy) don't take advantage of inter-symbol relations. They assume symbols are disconnected variables, but in reality there is considerable relation between successive symbols. If I dropped every third character from this text, you would probably be able to decipher it quite well. First order statistical compression uses the previous character to predict the next one. Second order compression uses two previous characters, and so on. The more characters are used to predict the next character, the better the estimate of the probability distribution for the next character. But more is not only better, there are also prices to pay.
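
The gain from using the previous character can be illustrated by comparing the 0-order entropy with a first-order (conditional) entropy estimate. This is only a sketch under simple assumptions: the statistics are gathered from the text itself, the first symbol (which has no predecessor) is ignored in the first-order figure, and the function names are made up for the example.

```python
import math
from collections import Counter, defaultdict

def order0_entropy(text):
    """Bits per symbol when symbols are treated as independent."""
    counts = Counter(text)
    n = len(text)
    return sum(-(c / n) * math.log2(c / n) for c in counts.values())

def order1_entropy(text):
    """Bits per symbol when each symbol is predicted from the previous one."""
    contexts = defaultdict(Counter)
    for prev, cur in zip(text, text[1:]):
        contexts[prev][cur] += 1      # one frequency table per previous symbol
    pairs = len(text) - 1
    total = 0.0
    for followers in contexts.values():
        seen = sum(followers.values())
        for c in followers.values():
            # p(prev, cur) * -log2(p(cur | prev))
            total += -(c / pairs) * math.log2(c / seen)
    return total

text = "abababab"
print(order0_entropy(text), order1_entropy(text))
```

For "abababab" the 0-order figure is 1 bit per symbol, while the first-order figure is 0: once the previous symbol is known, the next one is certain. The price is visible in the code as well: one frequency table per context instead of a single table.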

The first drawback is the amount of memory needed to store the probability tables. The frequencies for each character encountered must be accounted for. And you need one table for each 'previous character' value. If we are using an adaptive code, the second drawback is the time needed to update the tables and then update the encoding accordingly. In the case of Huffman encoding the Huffman tree needs to be recreated. And the encoding and decoding itself certainly takes time also.

We can keep the memory usage and processing demands tolerable by using a 0-order static Huffman code. Even so, the Huffman tree takes up precious memory, and decoding Huffman code on a 1-MHz 8-bit processor is slow and does not offer very good compression either. Statistical compression can nevertheless offer savings as a part of a hybrid compression system. For example:

	'A'	1/2	0
	'B'	1/4	10
	'C'	1/8	110
	'D'	1/8	111

	"BACADBAABAADABCA"				total: 32 bits
	10 0 110 0 111 10 0 0 10 0 0 111 0 10 110 0	total: 28 bits

This is an example of a simple statistical compression. The original symbols each take two bits to represent (4 possibilities), thus the whole string takes 32 bits. The variable-length code assigns the shortest code to the most probable symbol (A) and it takes 28 bits to represent the same string. The spaces between symbols are only there for clarity. The decoder still knows where each symbol ends because the code is a prefix code. On the other hand, I am simplifying things a bit here, because I'm omitting one vital piece of information: the length of the message. The file system normally stores the information about the end of file by storing the length of the file. The decoder also needs this information. We have two basic methods: reserve one symbol to represent the end of file condition or send the length of the original file. Both have their virtues.
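
The example code can be tried out with the sketch below. It uses the second of the two methods mentioned, passing the symbol count to the decoder; bits are kept as a string of '0' and '1' characters purely for readability, and the function names are assumptions.

```python
CODE = {"A": "0", "B": "10", "C": "110", "D": "111"}

def encode(message):
    """Concatenate the codewords of all symbols."""
    return "".join(CODE[sym] for sym in message)

def decode(bits, count):
    """Decode `count` symbols; the count stands in for the transmitted length."""
    inverse = {code: sym for sym, code in CODE.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:        # prefix code: the first hit is the symbol
            out.append(inverse[current])
            current = ""
            if len(out) == count:     # ignore any padding after the message
                break
    return "".join(out)

message = "BACADBAABAADABCA"
bits = encode(message)
print(len(bits))                          # 28
print(decode(bits + "0000", len(message)))  # padded to 32 bits, still decodes
```

Without the count, the four padding bits in the last line would decode as four extra A's, which is why the message length (or an end-of-file symbol) has to be transmitted.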

The best compressors available today take into account intersymbol probabilities. Dynamic Markov Coding (DMC) starts with a zero-order Markov model and gradually extends this initial model as compression progresses. Prediction by Partial Matching (PPM), although it really is a variable-to-block code, looks for a match of the text to be compressed in an order-n context and if there is no match drops back to an order n-1 context until it reaches order 0.

Variable-to-block codes

The previous compression methods handled a specific number of bits at a time. A group of bits was read from the input stream and some bits were written to the output. Variable-to-block codes behave just the opposite. They use a fixed-length output code to represent a variable-length part of the input. Variable-to-block codes are also called free-parse methods, because there is no pre-defined way to divide the input message into encodable parts (i.e. strings that will be replaced by shorter codes). Substitutional compressors belong to this group.

Substitutional compressors work by trying to replace strings in the input data with shorter codes. Lempel-Ziv methods (named after the inventors) contain two main groups: LZ77 and LZ78.

Lempel-Ziv 1977

In 1977 Ziv and Lempel proposed a lossless compression method which replaces phrases in the data stream by a reference to a previous occurrence of the phrase. As long as it takes fewer bits to represent the reference and the phrase length than the phrase itself, we get compression. Kind of like the way BASIC substitutes tokens for keywords.

LZ77-type compressors use a history buffer, which contains a fixed number of symbols output/seen so far. The compressor reads symbols from the input to a lookahead buffer and tries to find the longest possible match in the history buffer. The length of the string match and the location in the buffer (offset from the current position) are written to the output. If there is no suitable match, the next input symbol is sent as a literal symbol.

Of course there must be a way to identify literal bytes and compressed data in the output. There are a lot of different ways to accomplish this, but a single bit to select between a literal and compressed data is the easiest.

The basic scheme is a variable-to-block code. A variable-length piece of the message is represented by a constant number of bits: the match length and the match offset. Because the data in the history buffer is known to both the compressor and decompressor, it can be used in the compression. The decompressor simply copies part of the already decompressed data or a literal byte to the current output position.
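
A greedy LZ77 compressor can be sketched with a brute-force search. Everything here is an assumption made for illustration rather than a description of any particular program: the 255-symbol window, the minimum match length of 2, the restriction that a match must lie completely inside the history buffer, and the token format (a literal symbol or an (offset, length) tuple).

```python
def lz77_compress(data, window=255, min_len=2):
    """Greedy LZ77: emit literals and (offset, length) pairs."""
    tokens = []
    pos = 0
    while pos < len(data):
        best_len, best_off = 0, 0
        # Try every offset that still lies inside the history window.
        for off in range(1, min(pos, window) + 1):
            start = pos - off
            length = 0
            # Compare the lookahead only against symbols already in the history.
            while (length < off and pos + length < len(data)
                   and data[start + length] == data[pos + length]):
                length += 1
            if length > best_len:
                best_len, best_off = length, off
        if best_len >= min_len:
            tokens.append((best_off, best_len))
            pos += best_len
        else:
            tokens.append(data[pos])  # no useful match: send a literal
            pos += 1
    return tokens

tokens = lz77_compress("SIDVICIIISIDIDVI")
print(tokens)
```

On the example message from the beginning of the article, this greedy search ends with a single (11, 4) match for the final IDVI, where the hand-worked table splits the same symbols into (2, 2) and (11, 2). Both token streams decode to the same message.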

Variants of LZ77 apply additional compression to the output of the compressor. These include a simple variable-length code (LZB), dynamic Huffman coding (LZH), and Shannon-Fano coding (ZIP 1.x), all of which result in a certain degree of improvement over the basic scheme. This is because the output values from the first stage are not evenly distributed, i.e. their probabilities are not equal and statistical compression can do its part.

Lempel-Ziv 1978

One large problem with the LZ77 method is that it does not use the coding space efficiently, i.e. there are length and offset values that never get used. If the history buffer contains multiple copies of a string, only the latest occurrence is needed, but they all take space in the offset value space. Each duplicate string wastes one offset value.

To get higher efficiency, we have to create a real dictionary. Strings are added to the codebook only once. There are no duplicates that waste bits just because they exist. Also, each entry in the codebook will have a specific length, thus only an index to the codebook is needed to specify a string (phrase). In LZ77 the length and offset values were handled more or less as disconnected variables although there is correlation. Because they are now handled as one entity, we can expect to do a little better in that regard also.

LZ78-type compressors use this kind of a dictionary. The next part of the message (the lookahead buffer contents) is searched for in the dictionary and the maximum-length match is returned. The output code is an index to the dictionary. If there is no suitable entry in the dictionary, the next input symbol is sent as a literal symbol. The dictionary is updated after each symbol is encoded, so that it is possible to build an identical dictionary in the decompression code without sending additional data.

Essentially, strings that we have seen in the data are added to the dictionary. To be able to constantly adapt to the message statistics, the dictionary must be trimmed down by discarding the oldest entries. This also prevents the dictionary from becoming full, which would decrease the compression ratio. This is handled automatically in LZ77 by its use of a history buffer (a sliding window). For LZ78 it must be implemented separately. Because the decompression code updates its dictionary in synchronization with the compressor, the code remains uniquely decodable.
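
The dictionary idea can be sketched with the classic textbook form of LZ78, which is slightly different from the index-or-literal output described above: every output item is a pair of a dictionary index (0 meaning the empty phrase) and the symbol that follows the matched phrase. Dictionary trimming is left out, and the function names are assumptions.

```python
def lz78_compress(data):
    """Classic LZ78: emit (dictionary index, next symbol) pairs."""
    dictionary = {"": 0}
    out = []
    phrase = ""
    for sym in data:
        if phrase + sym in dictionary:
            phrase += sym                      # keep extending the match
        else:
            out.append((dictionary[phrase], sym))
            dictionary[phrase + sym] = len(dictionary)  # new phrase, added once
            phrase = ""
    if phrase:
        # Input ended inside a known phrase: send its prefix plus last symbol.
        out.append((dictionary[phrase[:-1]], phrase[-1]))
    return out

def lz78_decompress(pairs):
    """Rebuild the data while growing the same dictionary as the compressor."""
    phrases = [""]
    out = []
    for index, sym in pairs:
        phrase = phrases[index] + sym
        out.append(phrase)
        phrases.append(phrase)
    return "".join(out)

pairs = lz78_compress("SIDVICIIISIDIDVI")
print(pairs)
print(lz78_decompress(pairs))
```

The decompressor never receives the dictionary. It reconstructs it from the pairs in the same order the compressor built it, which is what keeps the two sides in step.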

Run-Length Encoding

Run-length encoding also belongs to this group. If there are consecutive equal valued symbols in the input, the compressor outputs how many of them there are, and their value. Again, we must be able to identify literal bytes and compressed data. One of the RLE compressors I have seen outputs two equal symbols to identify a run of symbols. The next byte(s) then tell how many more of these to output. If the value is 0, there are only two consecutive equal symbols in the original stream. Depending on how many bits are used to represent the value, this is the only case when the output is expanded.
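
The 'two equal symbols' scheme can be sketched as follows. The details are assumptions chosen for the example: the repeat count is a single byte, so one run token covers at most 257 symbols, and longer runs are split into several tokens.

```python
def rle_encode(data):
    """Two equal bytes announce a run; the next byte says how many more follow."""
    out = bytearray()
    i = 0
    while i < len(data):
        run = 1
        while i + run < len(data) and data[i + run] == data[i] and run < 257:
            run += 1
        if run == 1:
            out.append(data[i])                 # lone byte: literal
        else:
            out += bytes([data[i], data[i], run - 2])
        i += run
    return bytes(out)

def rle_decode(data):
    out = bytearray()
    prev = None
    i = 0
    while i < len(data):
        b = data[i]
        i += 1
        out.append(b)
        if b == prev:
            count = data[i]                     # repeat count after a pair
            i += 1
            out.extend([b] * count)
            prev = None                         # a new run needs a new pair
        else:
            prev = b
    return bytes(out)

packed = rle_encode(b"AAAAABCC")
print(list(packed))  # [65, 65, 3, 66, 67, 67, 0]
```

The run of five A's shrinks from five bytes to three, while the pair CC grows from two bytes to three, which is the expansion case mentioned above.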

Run-length encoding has been used since day one in C64 compression programs because it is very fast and very simple. Part of this is because it deals with byte-aligned data and is essentially just copying bytes from one place to another. The drawback is that RLE can only compress identical bytes into a shorter representation. On the C64 only graphics and music data contain large runs of identical bytes. Program code rarely contains more than a couple of successive identical bytes. We need something better.

That "something better" seems to be LZ77, which has been used in C64 compression programs for a long time. LZ77 can take advantage of repeating code/graphic/music data fragments and thus achieves better compression. The drawback is that practical LZ77 implementations tend to become variable-to-variable codes (more on that later) and need to handle data bit by bit, which is quite a lot slower than handling bytes.

LZ78 is not practical for C64, because the decompressor needs to create and update the dictionary. A big enough dictionary would take too much memory and updating the dictionary would need its share of processor cycles.

Variable-to-variable codes

The compression algorithms in this category are mostly hybrids or concatenations of the previously described compressors. For example a variable-to-block code such as LZ77 followed by a statistical compressor like Huffman encoding falls into this category and is used in Zip, LHa, Gzip and many more. They use fixed, static, and adaptive statistical compression, depending on the program and the compression level selected.

Randomly concatenating algorithms rarely produces good results, so you have to know what you are doing and what kind of files you are compressing. Whenever a novice asks the usual question: 'What compression program should I use?', they get the appropriate response: 'What kind of data are you compressing?'

Borrowed from Tom Lane's article in comp.compression:
It's hardly ever worthwhile to take the compressed output of one compression method and shove it through another compression method. Especially not if the second method is a general-purpose compressor that doesn't have specific knowledge of the first compression step. Compression is effective in direct proportion to the extent that it eliminates obvious patterns in the data. So if the first compression step is any good, it will leave little traction for the second step. Combining multiple compression methods is only helpful when the methods are specifically chosen to be complementary.

A small sidetrack I want to take:
This also brings us conveniently to another truth in lossless compression. There isn't a single compressor which would be able to losslessly compress all possible files (you can see the comp.compression FAQ for information about the counting proof). It is our luck that we are not interested in compressing all files. We are only interested in compressing a very small subset of all files. The more accurately we can describe the files we would encounter, the better. This is called modelling, and it is what all compression programs do and must do to be successful.

Audio and graphics compression algorithms may assume a continuous signal, and a text compressor may assume that there are repeated strings in the data. If the data does not match the assumptions (the model), the algorithm usually expands the data instead of compressing it.

Representing Integers

Many compression algorithms use integer values for something or other. Pucrunch is no exception as it needs to represent RLE repeat counts and LZ77 string match lengths and offsets. Any algorithm that needs to represent integer values can benefit very much if we manage to reduce the number of bits needed to do that. This is why efficient coding of these integers is very important. What encoding method to select depends on the distribution and the range of the values.

Fixed, Linear

If the values are evenly distributed throughout the whole range, a direct binary representation is the optimal choice. The number of bits needed of course depends on the range. If the range is not a power of two, some tweaking can be done to the code to get nearer the theoretical optimum log2(range) bits per value.

	Value	Binary	Adjusted 1&2
	---------------------------
	0	000	00	000	H = 2.585
	1	001	01	001	L = 2.667
	2	010	100	010	(for flat distribution)
	3	011	101	011
	4	100	110	10
	5	101	111	11

The previous table shows two different versions of how the adjustment could be done for a code that has to represent 6 different values with the minimum average number of bits. As can be seen, they are still both prefix codes, i.e. it's possible to (easily) decode them.
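The first of the two adjusted columns can be produced mechanically. The following is only a sketch: the function name adjusted_binary and the idea of writing the bits as '0'/'1' characters into a string are assumptions made for illustration. With k = floor(log2(range)), the first 2^(k+1) - range values get k bits and the remaining values get k+1 bits.

```c
#include <assert.h>

/* Write the adjusted binary code for v (0 <= v < range) into out as a
 * string of '0'/'1' characters and return the number of bits.
 * Hypothetical helper, not taken from any existing packer. */
int adjusted_binary(unsigned v, unsigned range, char *out)
{
    unsigned k = 0, shortCodes, code;
    int bits, i;

    assert(range >= 1 && range <= 0x4000u && v < range);
    while ((1u << (k + 1)) <= range)
        k++;                               /* k = floor(log2(range)) */
    shortCodes = (1u << (k + 1)) - range;  /* values that need only k bits */
    if (v < shortCodes) {
        code = v;                          /* short code: plain binary */
        bits = (int)k;
    } else {
        code = v + shortCodes;             /* long code: skip the used prefixes */
        bits = (int)k + 1;
    }
    for (i = 0; i < bits; i++)
        out[i] = ((code >> (bits - 1 - i)) & 1) ? '1' : '0';
    out[bits] = '\0';
    return bits;
}
```

For a range of 6 this gives 00 and 01 for the values 0 and 1, and 100 to 111 for the values 2 to 5, the same as the first adjusted column of the table.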

If there is no definite upper limit to the integer value, direct binary code can't be used and one of the following codes must be selected.

Elias Gamma Code

The Elias gamma code assumes that smaller integer values are more probable. In fact it assumes (or benefits from) a proportionally decreasing distribution. Values that use n bits should be twice as probable as values that use n+1 bits.

In this code the number of zero-bits before the first one-bit (a unary code) defines how many more bits to get. The code may be considered a special fixed Huffman tree. You can generate a Huffman tree from the assumed value distribution and you'll get a very similar code. The code is also directly decodable without any tables or difficult operations, because once the first one-bit is found, the length of the code word is instantly known. The bits following the zero bits (if any) are directly the encoded value.

	Gamma Code   Integer  Bits
	--------------------------
	1                  1     1
	01x              2-3     3
	001xx            4-7     5
	0001xxx         8-15     7
	00001xxxx      16-31     9
	000001xxxxx    32-63    11
	0000001xxxxxx 64-127    13
	...
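As a rough illustration of how little work the gamma code needs, here is a sketch of an encoder and a decoder. The names gamma_encode and gamma_decode are made up for this example, and the bits are kept as '0'/'1' characters in a string to keep the code short; a real decompressor would of course read bits from the packed data.

```c
#include <assert.h>

/* Write the Elias gamma code of n (n >= 1) into out as '0'/'1'
 * characters and return the number of bits written. */
int gamma_encode(unsigned n, char *out)
{
    unsigned t;
    int k = 0, i, pos = 0;

    assert(n >= 1);
    for (t = n; t > 1; t >>= 1)
        k++;                       /* k = floor(log2(n)) */
    for (i = 0; i < k; i++)
        out[pos++] = '0';          /* unary part: k zero-bits */
    for (i = k; i >= 0; i--)       /* the one-bit, then k value bits */
        out[pos++] = ((n >> i) & 1) ? '1' : '0';
    out[pos] = '\0';
    return pos;
}

/* Decode one gamma code starting at *bits and advance the pointer
 * past it. */
unsigned gamma_decode(const char **bits)
{
    const char *p = *bits;
    unsigned n = 1;                /* the implicit leading one-bit */
    int k = 0;

    while (*p == '0') {            /* count the zero-bits */
        k++;
        p++;
    }
    assert(*p == '1' && k < (int)(sizeof(unsigned) * 8));
    p++;
    while (k-- > 0) {              /* read k more bits of the value */
        assert(*p == '0' || *p == '1');
        n = (n << 1) | (unsigned)(*p++ - '0');
    }
    *bits = p;
    return n;
}
```

The value 5 is encoded as 00101, and the 7-bit string 0001010 decodes to 10, in agreement with the table.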

Elias Delta Code

The Elias Delta Code is an extension of the gamma code. This code assumes a little more 'traditional' value distribution. The first part of the code is a gamma code, which tells how many more bits to get (one less than the gamma code value).

	Delta Code   Integer  Bits
	--------------------------
	1                  1     1
	010x             2-3     4
	011xx            4-7     5
	00100xxx        8-15     8
	00101xxxx      16-31     9
	00110xxxxx     32-63    10
	00111xxxxxx   64-127    11
	...

The delta code is better than the gamma code for big values, as it is asymptotically optimal (the expected codeword length approaches constant times entropy when entropy approaches infinity), which the gamma code is not. What this means is that the extra bits needed to indicate where the code ends become a smaller and smaller proportion of the total bits as we encode bigger and bigger numbers. The gamma code is better for greatly skewed value distributions (a lot of small values).
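The construction of the delta code can be sketched as follows. The helper names bit_length, put_bits and delta_encode are assumptions of this example, and the output is again a string of '0'/'1' characters. The encoder first writes the gamma code of the bit length of n and then n itself without its leading one-bit.

```c
#include <assert.h>

/* Number of significant bits in n (0 for n == 0). */
static int bit_length(unsigned n)
{
    int len = 0;

    while (n != 0) {
        n >>= 1;
        len++;
    }
    return len;
}

/* Write the low 'count' bits of v, most significant bit first. */
static int put_bits(unsigned v, int count, char *out)
{
    int i;

    for (i = 0; i < count; i++)
        out[i] = ((v >> (count - 1 - i)) & 1) ? '1' : '0';
    return count;
}

/* Write the Elias delta code of n (n >= 1) and return its length. */
int delta_encode(unsigned n, char *out)
{
    int len, lenBits, pos = 0, i;

    assert(n >= 1);
    len = bit_length(n);               /* how many bits n occupies */
    lenBits = bit_length((unsigned)len);
    for (i = 0; i < lenBits - 1; i++)  /* gamma code of len: zero-bits */
        out[pos++] = '0';
    pos += put_bits((unsigned)len, lenBits, out + pos); /* ...and value */
    pos += put_bits(n, len - 1, out + pos); /* n minus its leading one */
    out[pos] = '\0';
    return pos;
}
```

For example, 10 (binary 1010) occupies 4 bits, so the code is the gamma code of 4 (00100) followed by 010, giving the 8-bit code 00100010.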

Fibonacci Code

The Fibonacci code is another variable-length code where smaller integers get shorter codes. The code ends with two one-bits, and the value is the sum of the corresponding Fibonacci values for the bits that are set (except the last one-bit, which ends the code).

	1  2  3  5  8 13 21 34 55 89
	----------------------------
	1 (1)                         =  1
	0  1 (1)                      =  2
	0  0  1 (1)                   =  3
	1  0  1 (1)                   =  4
	0  0  0  1 (1)                =  5
	1  0  0  1 (1)                =  6
	0  1  0  1 (1)                =  7
	0  0  0  0  1 (1)             =  8
	:  :  :  :  :  :                 :
	1  0  1  0  1 (1)             = 12
	0  0  0  0  0  1 (1)          = 13
	:  :  :  :  :  :  :              :
	0  1  0  1  0  1 (1)          = 20
	0  0  0  0  0  0  1 (1)       = 21
	:  :  :  :  :  :  :  :           :
	1  0  0  1  0  0  1 (1)       = 27

Note that because the code does not have two successive one-bits until the end mark, the code density may seem quite poor compared to the other codes, and it is, if most of the values are small (1-3). On the other hand, it also makes the code very robust by localizing and containing possible errors. Although, if the Fibonacci code is used as a part of a larger system, this robustness may not help much, because we lose the synchronization in the upper level anyway. Most adaptive methods can't recover from any errors, whether they are detected or not. Even in LZ77 the errors can be inherited infinitely far into the future.
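A Fibonacci encoder can be sketched with a simple greedy search: take the largest Fibonacci value that fits, subtract it, and repeat. The function name fib_encode and the upper limit of one million are assumptions of this example, chosen only to keep the table of Fibonacci values small.

```c
#include <assert.h>

/* Write the Fibonacci code of n (1 <= n <= 1000000) into out as
 * '0'/'1' characters, smallest Fibonacci value first, and return the
 * number of bits including the terminating one-bit. */
int fib_encode(unsigned long n, char *out)
{
    unsigned long fib[40];
    int count = 2, top, i;

    assert(n >= 1 && n <= 1000000UL);
    fib[0] = 1;
    fib[1] = 2;
    while (fib[count - 1] <= n) {      /* build 1 2 3 5 8 13 ... */
        fib[count] = fib[count - 1] + fib[count - 2];
        count++;
    }
    top = count - 1;
    while (fib[top] > n)
        top--;                         /* largest value that fits */
    for (i = top; i >= 0; i--) {       /* greedy: never sets two   */
        if (fib[i] <= n) {             /* neighbouring bits        */
            out[i] = '1';
            n -= fib[i];
        } else {
            out[i] = '0';
        }
    }
    out[top + 1] = '1';                /* end mark */
    out[top + 2] = '\0';
    return top + 2;
}
```

The value 4 becomes 1011 (1 + 3 and the end mark) and 27 becomes 10010011, matching the corresponding rows of the table.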

Comparison between delta, gamma and Fibonacci code lengths.

	       Gamma Delta Fibonacci
	     1     1     1       2.0
	   2-3     3     4       3.5
	   4-7     5     5       4.8
	  8-15     7     8       6.4
	 16-31     9     9       7.9
	 32-63    11    10       9.2
	64-127    13    11      10.6

The comparison shows that if even half of the values are in the range 1..7 (and other values relatively near this range), the Elias gamma code wins by a handsome margin.

Golomb and Rice Codes

Golomb (and Rice) codes are prefix codes that are suboptimal (compared to Huffman), but very easy to implement. Golomb codes are distinguished from each other by a single parameter m. This makes it very easy to adjust the code dynamically to adapt to changes in the values to encode.

	Golomb	m=1	m=2	m=3	m=4	m=5	m=6
	Rice  	k=0	k=1		k=2
	---------------------------------------------------
	n = 0	0	00	00	000	000	000
	    1	10	01	010	001	001	001
	    2	110	100	011	010	010	0100
	    3	1110	101	100	011	0110	0101
	    4	11110	1100	1010	1000	0111	0110
	    5	111110	1101	1011	1001	1000	0111
	    6	1111110	11100	1100	1010	1001	1000
	    7	:	11101	11010	1011	1010	1001
	    8	:	111100	11011	11000	10110	10100

To encode an integer n (starting from 0 this time, not from 1 as for Elias codes and Fibonacci code) using the Golomb code with parameter m, we first compute floor( n/m ) and output this using a unary code. Then we compute the remainder n mod m and output that value using a binary code which is adjusted so that we sometimes use floor( log2(m) ) bits and sometimes ceil( log2(m) ) bits.

Rice coding is the same as Golomb coding except that only a subset of parameters can be used, namely the powers of 2. In other words, a Rice code with the parameter k is equal to Golomb code with parameter m = 2^k. Because of this the Rice codes are much more efficient to implement on a computer. Division becomes a shift operation and modulo becomes a bit mask operation.
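The Golomb encoding steps can be sketched like this. The function name golomb_encode and the string output are assumptions of the example. The unary part is written as one-bits ended by a zero-bit, as in the table, and the remainder uses the adjusted binary code.

```c
#include <assert.h>

/* Write the Golomb code of n (n >= 0) with parameter m into out as
 * '0'/'1' characters and return the number of bits. The caller must
 * provide room for n/m + 18 characters. */
int golomb_encode(unsigned n, unsigned m, char *out)
{
    unsigned q, r, b = 0, cutoff;
    int pos = 0, bits, i;

    assert(m >= 1 && m <= 0x8000u);
    q = n / m;                        /* unary part */
    r = n % m;                        /* adjusted binary part */
    for (i = 0; i < (int)q; i++)
        out[pos++] = '1';
    out[pos++] = '0';
    while ((1u << b) < m)
        b++;                          /* b = ceil(log2(m)) */
    cutoff = (1u << b) - m;           /* remainders that get b-1 bits */
    if (r < cutoff) {
        bits = (int)b - 1;
    } else {
        r += cutoff;
        bits = (int)b;
    }
    for (i = bits - 1; i >= 0; i--)
        out[pos++] = ((r >> i) & 1) ? '1' : '0';
    out[pos] = '\0';
    return pos;
}
```

When m is a power of two the cutoff is zero, every remainder takes exactly log2(m) bits, and the result is the Rice code. For m=5 and n=8 the sketch produces 10110, as in the table.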

Hybrid/Mixed Codes

Sometimes it may be advantageous to use a code that combines two or more of these codes. In a way the Elias codes are already hybrid codes. The gamma code has a fixed Huffman tree (a unary code) and a binary code part, the delta code has a gamma code and a binary code part. The same applies to Golomb and Rice codes because they consist of a unary code part and a linear code (adjusted) part.

So now we have several alternatives to choose from. Now we simply have to do a little real-life research to determine how the values we want to encode are distributed so that we can select the optimum code to represent them. Of course we still have to keep in mind that we intend to decode the thing with a 1-MHz 8-bit processor. As always, compromises loom on the horizon. Pucrunch uses Elias Gamma Code, because it is the best alternative for that task and is very close to static Huffman code. The best part is that the Gamma Code is much simpler to decode and doesn't need additional memory.

Closer Look

Because the decompression routines are usually much easier to understand than the corresponding compression routines, I will primarily describe only them here. This also ensures that there really is a decompressor for a compression algorithm. Many are those people who have developed a great new compression algorithm that outperforms all existing versions, only to later discover that their algorithm doesn't save enough information to be able to recover the original file from the compressed data.

Also, the added bonus is that once we have a decompressor, we can improve the compressor without changing the file format. At least until we have some statistics to develop a better system. Many lossy video and audio compression systems only document and standardize the decompressor and the file or stream format, making it possible to improve the encoding part of the process when faster and better hardware (or algorithms) become available.

RLE

	void DecompressRLE() {
	    int oldChar = -1;
	    int newChar;

	    while(1) {
		newChar = GetByte();
		if(newChar == EOF)
		    return;
		PutByte(newChar);
		if(newChar == oldChar) {
		    int len = GetLength();

		    while(len > 0) {
			PutByte(newChar);
			len = len - 1;
		    }
		}
		oldChar = newChar;
	    }
	}

This RLE algorithm uses two successive equal characters to mark a run of bytes. I have on purpose left open the question of how the length is encoded (1 or 2 bytes or a variable-length code). The decompressor also allows chaining/extension of RLE runs, for example 'a', 'a', 255, 'a', 255 would output 513 'a'-characters.

In this case the compression algorithm is almost as simple.

	void CompressRLE() {
	    int oldChar = -1;
	    int newChar;

	    while(1) {
		newChar = GetByte();
		if(newChar==oldChar) {
		    int length = 0;

		    if(newChar == EOF)
			return;
		    PutByte(newChar); /* RLE indicator */

		    /* Get all equal characters */
		    while((newChar = GetByte()) == oldChar) {
			length++;
		    }
		    PutLength(length);
		}
		if(newChar == EOF)
		    return;
		PutByte(newChar);
		oldChar = newChar;
	    }
	}

If there are two equal bytes, the compression algorithm reads more bytes until it gets a different byte. If there were only two equal bytes, the length value will be zero and the compression algorithm expands the data. A C64-related example would be the compression of the BASIC ROM with this RLE algorithm. Or actually expansion, as the new file size is 8200 bytes instead of the original 8192 bytes. Those equal byte runs that the algorithm needs just aren't there. For comparison, pucrunch manages to compress the BASIC ROM into 7288 bytes, the decompression code included. Even Huffman coding manages to compress it into 7684 bytes.

	"BAAAAAADBBABBBBBAAADABCD"		total: 24*8=192 bits
	"BAA",4,"DBB",0,"ABB",3,"AA",1,"DABCD"	total: 16*8+4*8=160 bits

This is an example of how the presented RLE encoder would work on a string. The total length calculations assume that we are handling 8-bit data, although only values from 'A' to 'D' are present in the string. After seeing two equal characters the decoder gets a repeat count and then adds that many more of them. Notice that the repeat count is zero if there are only two equal characters.
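To try the example on a real machine, the decoder can be written against memory buffers. This is only a sketch under a few assumptions: the length is stored as a single byte, the function name rle_decode and its parameters are invented for the example, and the routine simply stops when the output buffer is full.

```c
#include <assert.h>
#include <stddef.h>

/* Decode srcLen bytes of RLE data from src into dst (at most dstCap
 * bytes) and return the number of bytes written. Two equal bytes in
 * a row are followed by a one-byte repeat count. */
size_t rle_decode(const unsigned char *src, size_t srcLen,
                  unsigned char *dst, size_t dstCap)
{
    size_t in = 0, out = 0;
    int oldChar = -1;                 /* no previous byte yet */

    assert(src != NULL && dst != NULL);
    while (in < srcLen && out < dstCap) {
        int newChar = src[in++];

        dst[out++] = (unsigned char)newChar;
        if (newChar == oldChar && in < srcLen) {
            unsigned len = src[in++]; /* how many more to output */

            while (len > 0 && out < dstCap) {
                dst[out++] = (unsigned char)newChar;
                len--;
            }
        }
        oldChar = newChar;
    }
    return out;
}
```

Feeding it the 20 bytes 'B','A','A',4,'D','B','B',0,'A','B','B',3,'A','A',1,'D','A','B','C','D' gives back the original 24-character string.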

Huffman Code

	int GetHuffman() {
	    int index = 0;	/* start at the root node */

	    while(1) {
		/* follow the 0-branch (left) or the 1-branch (right) */
		if(GetBit() == 0) {
		    index = LeftNode(index);
		} else {
		    index = RightNode(index);
		}
		if(LeafNode(index)) {
		    return LeafValue(index);
		}
	    }
	}

My pseudo code of the Huffman decode function is a very simplified one, so I should probably first describe how the Huffman code and the corresponding binary tree are constructed.

First we need the statistics for all the symbols occurring in the message, i.e. the file we are compressing. Then we rank them in decreasing probability order. Then we combine the smallest two probabilities and assign 0 and 1 to the binary tree branches, i.e. the original symbols. We do this until there is only one composite symbol left.

Depending on where we insert the composite symbols we get different Huffman trees. The average code length is the same in every case (and so is the compression ratio), but the length of the longest code changes. The implementation of the decoder is usually more efficient if we keep the longest code as short as possible. This is achieved by inserting the composite symbols (new nodes) before all symbols/nodes that have equal probability.

	"BAAAAAADBBABBBBBAAADABCD"
	A (11)  B (9)  D (3)  C (1)

	Step 1		    Step 2		Step 3
	'A' 0.458	    'A' 0.458		C2  0.542 0\ C3
	'B' 0.375	    'B' 0.375 0\ C2	'A' 0.458 1/
	'D' 0.125 0\ C1	    C1  0.167 1/
	'C' 0.042 1/

	        C3
	     0 /  \ 1
	      /   'A'
	     C2
	  0 /  \ 1
	  'B'   \
	        C1
	     0 /  \ 1
	     'D'  'C'

So, in each step we combine two lowest-probability nodes or leaves into a new node. When we are done, we have a Huffman tree containing all the original symbols. The Huffman codes for the symbols can now be gotten by starting at the root of the tree and collecting the 0/1-bits on the way to the desired leaf (symbol). We get:

	'A' = 1    'B' = 00    'C' = 011    'D' = 010
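The merging procedure can be sketched in C. Everything named here (huffman_lengths, MAX_SYMS, the simple search for the two smallest weights) is an assumption made for illustration. The sketch works with the symbol counts instead of probabilities, returns only the code lengths, and makes no attempt to keep the longest code short when weights are equal.

```c
#include <assert.h>

#define MAX_SYMS 16

/* Compute Huffman code lengths for n symbols (2 <= n <= MAX_SYMS)
 * from their occurrence counts. */
void huffman_lengths(const unsigned *count, int n, int *length)
{
    unsigned weight[2 * MAX_SYMS];
    int parent[2 * MAX_SYMS];
    int alive[2 * MAX_SYMS];
    int nodes = n, remaining = n, i;

    assert(n >= 2 && n <= MAX_SYMS);
    for (i = 0; i < n; i++) {
        weight[i] = count[i];
        parent[i] = -1;
        alive[i] = 1;
    }
    while (remaining > 1) {
        int a = -1, b = -1;     /* smallest and second smallest */

        for (i = 0; i < nodes; i++) {
            if (!alive[i])
                continue;
            if (a < 0 || weight[i] < weight[a]) {
                b = a;
                a = i;
            } else if (b < 0 || weight[i] < weight[b]) {
                b = i;
            }
        }
        weight[nodes] = weight[a] + weight[b];  /* composite node */
        parent[nodes] = -1;
        alive[nodes] = 1;
        parent[a] = parent[b] = nodes;
        alive[a] = alive[b] = 0;
        nodes++;
        remaining--;
    }
    for (i = 0; i < n; i++) {   /* code length = depth in the tree */
        int p = i, len = 0;

        while (parent[p] >= 0) {
            p = parent[p];
            len++;
        }
        length[i] = len;
    }
}
```

With the counts 11, 9, 3 and 1 for 'A', 'B', 'D' and 'C' the lengths come out as 1, 2, 3 and 3 bits, which agrees with the codes above.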

These codes (or the binary tree) are used when encoding the file, but the decoder also needs this information. Sending the binary tree or the codes would take a lot of bytes, thus taking away all or most of the compression. The amount of data needed to transfer the tree can be greatly reduced by sending just the symbols and their code lengths. If the tree is traversed in a canonical (predefined) order, this is all that is needed to recreate the tree and the Huffman codes. By doing a 0-branch-first traverse we get:

	Symbol	Code	Code Length
	'B'	00	2
	'D'	010	3
	'C'	011	3
	'A'	1	1

So we can just send 'B', 2, 'D', 3, 'C', 3, 'A', 1 and the decoder has enough information (when it also knows how we went through the tree) to recreate the Huffman codes and the tree. Actually you can even drop the symbol values if you handle things a bit differently (see the Deflate specification in RFC1951), but my arrangement makes the algorithm much simpler and doesn't need to transfer data for symbols that are not present in the message.

Basically we start with a code value of all zeros and the appropriate length for the first symbol. For each of the other symbols we first add 1 to the code value and then shift the value left or right to get it to be the right size. In the example we first assign 00 to 'B', then add one to get 01 and shift left to get a 3-bit codeword for 'D', making it 010 like it should be. For 'C' we add 1 and get 011; no shift is needed because the codeword is the right size already. And for 'A' we add one to get 100, shift 2 places to the right and get 1.
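The add-and-shift rule is short enough to write out. The function name rebuild_codes is made up for this sketch, and it assumes the code lengths arrive in the 0-branch-first traversal order described above.

```c
#include <assert.h>

/* Rebuild the code values from the code lengths, which must be given
 * in 0-branch-first traversal order. code[i] holds the codeword of
 * symbol i in its lowest length[i] bits. */
void rebuild_codes(const int *length, int n, unsigned *code)
{
    unsigned value = 0;            /* first code is all zeros */
    int i;

    for (i = 0; i < n; i++) {
        assert(length[i] >= 1);
        if (i > 0) {
            value += 1;            /* next code in sequence */
            if (length[i] > length[i - 1])
                value <<= (length[i] - length[i - 1]);
            else
                value >>= (length[i - 1] - length[i]);
        }
        code[i] = value;
    }
}
```

For the lengths 2, 3, 3, 1 ('B', 'D', 'C', 'A') the values are 0, 2, 3 and 1, i.e. the codewords 00, 010, 011 and 1.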

The Deflate algorithm in essence attaches a counting sort algorithm to this algorithm, feeding in the symbols in increasing code length order. Oh, don't worry if you don't understand what the counting sort has to do with this. I just wanted to give you some idea about it if you some day read the deflate specification or the gzip source code.

Actually, the decoder doesn't necessarily need to know the Huffman codes at all, as long as it has created the proper internal representation of the Huffman tree. I developed a special table format which I used in the C64 Huffman decode function and may present it in a separate article someday. The decoding works by just going through the tree by following the instructions given by the input bits as shown in the example Huffman decode code. Each bit in the input makes us go to either the 0-branch or the 1-branch. If the branch is a leaf node, we have decoded a symbol and just output it, return to the root node and repeat the procedure.

A technique related to Huffman coding is Shannon-Fano coding. It works by first dividing the symbols into two equal-probability groups (or as close to as possible). These groups are then further divided until there is only one symbol in each group left. The algorithm used to create the Huffman codes is bottom-up, while the Shannon-Fano codes are created top-down. Huffman coding always generates an optimal prefix code for the given symbol probabilities, while Shannon-Fano sometimes uses a few more bits.

There are also ways of modifying the statistical compression methods so that we get nearer to the entropy. In the case of 'A' having the probability 0.75 and 'B' 0.25 we can decide to group several symbols together, producing a variable-to-variable code.

	"AA"	0.5625		0
	"B"	0.25		10
	"AB"	0.1875		11

If we separately transmit the length of the file, we get the above probabilities. If a file has only one 'A', it can be encoded as length=1 and either "AA" or "AB". The entropy of the source is H = 0.8113, and the average code length (per source symbol) is approximately L = 0.8214 (on average 1.4375 code bits per codeword divided by 1.75 source symbols per codeword), which is much better than L = 1.0, which we would get if we used a code {'A','B'} = {0,1}. Unfortunately this method also expands the number of symbols we have to handle, because each possible source symbol combination is handled as a separate symbol.

Arithmetic Coding

Huffman and Shannon-Fano codes are only optimal if the probabilities of the symbols are negative powers of two. This is because all prefix codes work at the bit level. Decisions between tree branches always take one bit, whether the probabilities for the branches are 0.5/0.5 or 0.9/0.1. In the latter case it would theoretically take only 0.152 bits (-log2(0.9)) to select the first branch and 3.32 bits (-log2(0.1)) to select the second branch, making the average code length 0.469 bits (0.9*0.152 + 0.1*3.32). The Huffman code still needs one bit for each decision.

Arithmetic coding does not have this restriction. It works by representing the file by an interval of real numbers between 0 and 1. When the file size increases, the interval needed to represent it becomes smaller, and the number of bits needed to specify that interval increases. Successive symbols in the message reduce this interval in accordance with the probability of that symbol. The more likely symbols reduce the range by less, and thus add fewer bits to the message.

	 1                                             Codewords
	+-----------+-----------+-----------+
	|           |8/9 YY     |  Detail   |<- 31/32    .11111
	|           +-----------+-----------+<- 15/16    .1111
	|    Y      |           | too small |<- 14/16    .1110
	|2/3        |    YX     | for text  |<- 6/8      .110
	+-----------+-----------+-----------+
	|           |           |16/27 XYY  |<- 10/16    .1010
	|           |           +-----------+
	|           |    XY     |           |
	|           |           |   XYX     |<- 4/8      .100
	|           |4/9        |           |
	|           +-----------+-----------+
	|           |           |           |
	|    X      |           |   XXY     |<- 3/8      .011
	|           |           |8/27       |
	|           |           +-----------+
	|           |    XX     |           |
	|           |           |           |<- 1/4      .01
	|           |           |   XXX     |
	|           |           |           |
	|0          |           |           |
	+-----------+-----------+-----------+

As an example of arithmetic coding, let's consider the example of two symbols X and Y, of probabilities 2/3 and 1/3. To encode a message, we examine the first symbol: If it is an X, we choose the lower partition; if it is a Y, we choose the upper partition. Continuing in this manner for three symbols, we get the codewords shown to the right of the diagram above. They can be found by simply taking an appropriate location in the interval for that particular set of symbols and turning it into a binary fraction. In practice, it is also necessary to add a special end-of-data symbol, which is not represented in this simple example.
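The interval narrowing itself can be followed with exact integer fractions. This is a toy sketch, not a usable arithmetic coder: the function name narrow_interval is invented, the probabilities 2/3 and 1/3 are hard-wired, and the interval is returned as low/den and high/den with den = 3^(message length), which limits the message to about 19 symbols.

```c
#include <assert.h>

/* Narrow the interval [0,1) for a message made of 'X' (probability
 * 2/3, lower part) and 'Y' (probability 1/3, upper part). The result
 * is the interval [low/den, high/den). */
void narrow_interval(const char *msg, unsigned long *low,
                     unsigned long *high, unsigned long *den)
{
    unsigned long lo = 0, hi = 1, d = 1, split;

    for (; *msg != '\0'; msg++) {
        assert(*msg == 'X' || *msg == 'Y');
        assert(d <= 1000000000UL);       /* keep 3*d inside 32 bits */
        lo *= 3;                         /* rescale to thirds of the */
        hi *= 3;                         /* current interval width   */
        d *= 3;
        split = lo + (hi - lo) / 3 * 2;  /* lower two thirds are X */
        if (*msg == 'X')
            hi = split;
        else
            lo = split;
    }
    *low = lo;
    *high = hi;
    *den = d;
}
```

For the message XYY the result is [16/27, 18/27), and the codeword .1010 (10/16 = 0.625) from the diagram indeed falls inside that interval.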

This explanation may not be enough to help you understand arithmetic coding. There are a lot of good articles about arithmetic compression on the net, for example by Mark Nelson.

Arithmetic coding is not practical for the C64 for many reasons. The biggest reason is speed, especially for adaptive arithmetic coding. A close second is of course memory.

Symbol Ranking

Symbol ranking is comparable to Huffman coding with a fixed Huffman tree. The compression ratio is not very impressive (it reaches the Huffman code only in some cases), but the decoding algorithm is very simple, does not need as much memory as Huffman and is also faster.

	int GetByte() {
	    int index = GetUnaryCode();

	    return mappingTable[index];
	}

The main idea is to have a table containing the symbols in descending probability order (rank order). The message is then represented by the table indices. The index values are in turn represented by a variable-length integer representation (see the section Representing Integers). Because more probable symbols (smaller indices) take fewer bits than less probable symbols, on average we save bits. Note that we have to send the rank order, i.e. the symbol table too.

	"BAAAAAADBBABBBBBAAADABCD"		total: 24*8=192 bits
	Rank Order: A (11)  B (9)  D (3)  C (1)		4*8=32 bits
	Unary Code: 0       10     110    1110
	"100000001101010010101010100001100101110110"   42 bits
						total: 74 bits

The statistics rank the symbols in the order ABDC (most probable first), which takes approximately 32 bits to transmit (we assume that any 8-bit value is possible). The indices are represented as a code {0, 1, 2, 3} = {0, 10, 110, 1110}. This is a simple unary code where the number of 1-bits before the first 0-bit directly gives the integer value. The first 0-bit also ends a symbol. When this code and the rank order table are combined in the decoder, we get the reverse code {0, 10, 110, 1110} = {A, B, D, C}. Note that in this case the code is very similar to the Huffman code we created in a previous example.
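The example can be decoded with a few lines of C. As before this is just a sketch: the function name rank_decode is an assumption, and the bit stream is given as a string of '0'/'1' characters instead of packed bytes.

```c
#include <assert.h>
#include <stddef.h>

/* Decode a unary-coded index stream ('0'/'1' characters) using the
 * rank order table and return the number of symbols written to out.
 * Each index is the number of 1-bits before the ending 0-bit. */
size_t rank_decode(const char *bits, const char *rankOrder,
                   size_t tableSize, char *out)
{
    size_t n = 0, index = 0;

    for (; *bits != '\0'; bits++) {
        if (*bits == '1') {
            index++;                   /* still counting 1-bits */
        } else {
            assert(index < tableSize);
            out[n++] = rankOrder[index];
            index = 0;                 /* 0-bit ends the symbol */
        }
    }
    out[n] = '\0';
    return n;
}
```

Called with the 42-bit string above and the table "ABDC", it returns 24 symbols that spell out the original message.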

LZ78

LZ78-based schemes work by entering phrases into a dictionary and then, when a repeat occurrence of that particular phrase is found, outputting the dictionary index instead of the phrase. For example, LZW (Lempel-Ziv-Welch) uses a dictionary with 4096 entries. In the beginning the entries 0-255 refer to individual bytes, and the rest 256-4095 refer to longer strings. Each time a new code is generated it means a new string has been selected from the input stream. New strings that are added to the dictionary are created by appending the current character K to the end of an existing string w. The algorithm for LZW compression is as follows:

	set w = NIL
	loop
	    read a character K
	    if wK exists in the dictionary
		w = wK
	    else
		output the code for w
		add wK to the string table
		w = K
	endloop
	output the code for w

	Input string: /WED/WE/WEE/WEB

	Input	Output	New code and string
	/W	/	256 = /W
	E	W	257 = WE
	D	E	258 = ED
	/	D	259 = D/
	WE	256	260 = /WE
	/	E	261 = E/
	WEE	260	262 = /WEE
	/W	261	263 = E/W
	EB	257	264 = WEB
	<END>	B

A sample run of LZW over a (highly redundant) input string can be seen in the diagram above. The strings are built up character-by-character starting with a code value of 256. LZW decompression takes the stream of codes and uses it to exactly recreate the original input data. Just like the compression algorithm, the decompressor adds a new string to the dictionary each time it reads in a new code. All it needs to do in addition is to translate each incoming code into a string and send it to the output. A sample run of the LZW decompressor is shown below.

	Input code: /WED<256>E<260><261><257>B

	Input	Output	New code and string
	/	/	
	W	W	256 = /W
	E	E	257 = WE
	D	D	258 = ED
	256	/W	259 = D/
	E	E	260 = /WE
	260	/WE	261 = E/
	261	E/	262 = /WEE
	257	WE	263 = E/W
	B	B	264 = WEB

The most remarkable feature of this type of compression is that the entire dictionary has been transmitted to the decoder without actually explicitly transmitting the dictionary. The decoder builds the dictionary as part of the decoding process.
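A compact LZW decoder along these lines might look like the following. It is a sketch under several assumptions: the codes are already unpacked into an array of integers, the names lzw_decode, expand, prefixOf and lastChar are invented here, and each dictionary entry is stored as the code of its prefix string plus one final character. It also handles the special case where a code refers to the entry that is just about to be created.

```c
#include <assert.h>
#include <stddef.h>

#define DICT_SIZE 4096

static int prefixOf[DICT_SIZE];           /* string minus last char */
static unsigned char lastChar[DICT_SIZE]; /* last char of the string */

/* Expand a code into buf and return the length of the string. */
static size_t expand(int code, unsigned char *buf)
{
    unsigned char tmp[DICT_SIZE];
    size_t len = 0, i;

    while (code >= 256) {                 /* collect backwards */
        tmp[len++] = lastChar[code];
        code = prefixOf[code];
    }
    tmp[len++] = (unsigned char)code;     /* codes 0-255 are bytes */
    for (i = 0; i < len; i++)
        buf[i] = tmp[len - 1 - i];
    return len;
}

/* Decode 'count' codes into out and return the number of bytes. */
size_t lzw_decode(const int *codes, size_t count, unsigned char *out)
{
    unsigned char str[DICT_SIZE + 1];
    size_t outLen = 0, len, i, k;
    int next = 256, prev = -1;

    for (i = 0; i < count; i++) {
        int code = codes[i];

        if (code < next) {
            len = expand(code, str);
        } else {          /* entry not made yet: prev + its 1st char */
            assert(code == next && prev >= 0);
            len = expand(prev, str);
            str[len] = str[0];
            len++;
        }
        for (k = 0; k < len; k++)
            out[outLen++] = str[k];
        if (prev >= 0 && next < DICT_SIZE) {
            prefixOf[next] = prev;        /* new entry: previous    */
            lastChar[next] = str[0];      /* string + first char of */
            next++;                       /* the current string     */
        }
        prev = code;
    }
    return outLen;
}
```

With the codes '/', 'W', 'E', 'D', 256, 'E', 260, 261, 257, 'B' the decoder reproduces the 15-character string /WED/WE/WEE/WEB and builds the same dictionary entries 256 to 264 as the table above.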

See also the article "LZW Compression" by Bill Lucier in C=Hacking issue 6 and "LZW Data Compression" by Mark Nelson mentioned in the references section.

Conclusions

That's more than enough for one article. What did we get out of it? Statistical compression works with uneven symbol probabilities to reduce the average code length. Substitutional compressors replace strings with shorter representations. Most popular general-purpose compression programs use LZ77 or LZ78 followed by some sort of statistical compression. And you can't just mix and match different algorithms and expect good results.

There are no shortcuts in understanding data compression. Some things you only understand when trying them out yourself. However, I hope that this article has given you at least a vague grasp of how different compression methods really work.

I would like to send special thanks to Stephen Judd for his comments. Without him this article would've been much more unreadable than it is now. On the other hand, that's what the magazine editor is for :-)

The second part of the story is a detailed talk about pucrunch. I also go through the corresponding C64 decompression code in detail.

References

To the homepage of a1bert@iki.fi