The most used 5.25-inch disk drive for Commodore 64 only holds 170 kB of data, which is only about 2.5 times the total random access memory of the machine. With compression, many more programs can fit on a disk. This is especially true for programs containing flashy graphics or sampled sound. Compression also reduces the loading times from the notoriously slow 1541 drive, whether you use the original slow serial bus routines or some kind of a disk turbo loader routine.
Dozens of compression programs are available for Commodore 64. I leave the work of chronicling the history of the C64 compression programs to others and concentrate on a general overview of the different compression algorithms. Later we'll take a closer look at how the compression algorithms actually work, and in the next article I will introduce my own creation: pucrunch.
Pucrunch is a compression program written in ANSI-C which generates files that automatically decompress and execute themselves when run on a C64 (or C128 in C64-mode, VIC20, C16/+4). A cross-compressor, if you will, allowing you to do the real work on any machine, like a cross-assembler.
Our target environment (Commodore 64 and VIC20) restricts us somewhat when designing the 'ideal' compression system. We would like it to be able to decompress as big a program as possible. Therefore the decompression code must be located in low memory and be as short as possible and must use very small amounts of extra memory.
Another requirement is that the decompression should be relatively fast, which means that the arithmetic used should be mostly 8- or 9-bit which is much faster than e.g. 16-bit arithmetic. Processor- and memory-intensive algorithms are pretty much out of the question. A part of the decompressor efficiency depends on the format of the compressed data. Byte-aligned codes can be accessed very quickly; non-byte-aligned codes are much slower to handle, but provide better compression.
This is not meant to be the end-all document for data compression. My intention is to only scratch the surface and give you a crude overview. Also, I'm mainly talking about lossless compression here, although some lossy compression ideas are briefly mentioned. A lot of compression talk is available on the World Wide Web, although it may not be possible to understand everything on the first reading. To build the knowledge, you have to read many documents and understand something from each one so that when you return to a document, you can understand more than the previous time. It's a lot like watching Babylon 5. :-)
Some words of warning: I try to give something interesting to read to both advanced and not so advanced readers. It is perfectly all right for you to skip all uninteresting details. I start with a Huffman and LZ77 example so you can get the basic idea before flooding you with equations, complications, and trivia.
How would the message SIDVICIIISIDIDVI compress using a) Huffman encoding, and b) LZ77? How do compression concepts such as information entropy enter into this?
A direct binary code would map the different symbols to consecutive bit patterns, such as:

    Symbol  Code
    'C'     000
    'D'     001
    'I'     010
    'S'     011
    'V'     100

Because there are five symbols, we need 3 bits to represent all of the possibilities, but we don't use all of the available bit patterns. Only 5 values are used out of the maximum 8 that can be represented in 3 bits. With this code the original message takes 48 bits:
SIDVICIIISIDIDVI == 011 010 001 100 010 000 010 010 010 011 010 001 010 001 100 010
For Huffman and for entropy calculation (entropy is explained in the next chapter) we first need to calculate the symbol frequencies from the message. The probability for each symbol is the frequency divided by the message length. When we reduce the number of bits needed to represent the probable symbols (their code lengths) we can also reduce the average code length and thus the number of bits we need to send.
    'C'  1/16  0.0625
    'D'  3/16  0.1875
    'I'  8/16  0.5
    'S'  2/16  0.125
    'V'  2/16  0.125

The entropy gives the lower limit for the average code length of a statistical compression method. Using the equation from the next section, we can calculate it as 1.953. This means that however cleverly you select a code to represent the symbols, on average you need at least 1.953 bits per symbol. For our 16-symbol message that is 31.25 bits, so in this case you can't do better than 32 bits.
Next we create the Huffman tree. We first rank the symbols in decreasing probability order and then at each step combine two lowest-probability symbols into a single composite symbol (C1, C2, ..). The probability of this new symbol is therefore the sum of the two original probabilities. The process is then repeated until a single composite symbol remains:
    Step 1            Step 2             Step 3              Step 4
    'I' 0.5           'I' 0.5            'I' 0.5             C3  0.5\C4
    'D' 0.1875        C1  0.1875         C2  0.3125\C3       'I' 0.5/
    'S' 0.125         'D' 0.1875\C2      C1  0.1875/
    'V' 0.125 \C1     'S' 0.125 /
    'C' 0.0625/

Note that the composite symbols are inserted as high as possible, to get the shortest maximum code length (compare C1 and 'D' at Step 2).
At each step two lowest-probability nodes are combined until we have only one symbol left. Without knowing it we have already created a Huffman tree. Start at the final symbol (C4 in this case), break up the composite symbol assigning 0 to the first symbol and 1 to the second one. The following tree just discards the probabilities as we don't need them anymore.
              C4
           0 / \ 1
            /   'I'
           C3
        0 / \ 1
         /   \
        C2    C1
     0 / \1 0/ \ 1
     'D''S' 'V''C'

    Symbol  Code  Code Length
    'C'     011   3
    'D'     000   3
    'I'     1     1
    'S'     001   3
    'V'     010   3

When we follow the tree from the top to the symbol we want to encode and remember each decision (which branch to follow), we get the code: {'C', 'D', 'I', 'S', 'V'} = {011, 000, 1, 001, 010}. For example when we see the symbol 'C' in the input, we output 011. If we see 'I' in the input, we output a single 1. The code for 'I' is very short because it occurs very often in the input.
Now we have the code lengths and can calculate the average code length: 0.0625*3+0.1875*3+0.5*1+0.125*3+0.125*3 = 2. We did not quite reach the lower limit that entropy gave us. This is not so surprising, because a Huffman code reaches the entropy limit only if all the probabilities are negative powers of two.
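The two figures above can be checked with a small sketch. It assumes the usual zero-order entropy formula H = sum{ -p(A[i])*log2(p(A[i])) } and weights each code length by its symbol probability. The function names and the libm-free `log2_approx` helper are made up for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Base-2 logarithm for x > 0 without libm (illustrative helper):
   normalise x into [1,2), then extract fraction bits by repeated squaring. */
static double log2_approx(double x)
{
    double result = 0.0, bit = 0.5;
    int i;

    assert(x > 0.0);
    while (x >= 2.0) { x /= 2.0; result += 1.0; }
    while (x < 1.0)  { x *= 2.0; result -= 1.0; }
    for (i = 0; i < 40; i++) {
        x *= x;
        if (x >= 2.0) { x /= 2.0; result += bit; }
        bit /= 2.0;
    }
    return result;
}

/* Zero-order entropy in bits per symbol, computed from symbol counts. */
double entropy_bits(const unsigned *count, size_t nsym)
{
    unsigned long total = 0;
    double h = 0.0;
    size_t i;

    for (i = 0; i < nsym; i++) total += count[i];
    assert(total > 0);
    for (i = 0; i < nsym; i++) {
        if (count[i] > 0) {
            double p = (double)count[i] / (double)total;
            h -= p * log2_approx(p);
        }
    }
    return h;
}

/* Average code length in bits per symbol: sum of p(i) * len(i). */
double average_length(const unsigned *count, const unsigned *len, size_t nsym)
{
    unsigned long total = 0, bits = 0;
    size_t i;

    for (i = 0; i < nsym; i++) {
        total += count[i];
        bits += (unsigned long)count[i] * len[i];
    }
    assert(total > 0);
    return (double)bits / (double)total;
}
```

Called with the counts {1, 3, 8, 2, 2} for 'C', 'D', 'I', 'S', 'V' and the code lengths {3, 3, 1, 3, 3}, the first function should give roughly 1.953 and the second exactly 2.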
Encoded, the message becomes:
    SIDVICIIISIDIDVI == 001 1 000 010 1 011 1 1 1 001 1 000 1 000 010 1

The spaces are only to make the reading easier. So, the compressed output takes 32 bits and we need at least 10 bits to transfer the Huffman tree by sending the code lengths (more on this later). The message originally took 48 bits, now it takes at least 42 bits.
Huffman coding is an example of a "variable length code" with a "defined word" input. Inputs of fixed size -- a single, three-bit letter above -- are replaced by a variable number of bits. At the other end of the scale are routines which break the input up into variably sized chunks, and replace those chunks with an often fixed-length output. The most popular schemes of this type are Lempel-Ziv, or LZ, codes.
Of these, LZ77 is probably the most straightforward. It tries to replace recurring patterns in the data with a short code. The code tells the decompressor how many symbols to copy and from where in the output to copy them. To compress the data, LZ77 maintains a history buffer which contains the data that has been processed and tries to match the next part of the message to it. If there is no match, the next symbol is output as-is. Otherwise an (offset,length) -pair is output.
    Output             History Lookahead
                               SIDVICIIISIDIDVI
    S                        S IDVICIIISIDIDVI
    I                       SI DVICIIISIDIDVI
    D                      SID VICIIISIDIDVI
    V                     SIDV ICIIISIDIDVI
    I                    SIDVI CIIISIDIDVI
    C                   SIDVIC IIISIDIDVI
    I                  SIDVICI IISIDIDVI
    I                 SIDVICII ISIDIDVI
    I                SIDVICIII SIDIDVI
                     ---       ---    match length: 3
                     |----9---|       match offset: 9
    (9, 3)        SIDVICIIISID IDVI
                            -- --     match length: 2
                            |2|       match offset: 2
    (2, 2)      SIDVICIIISIDID VI
                   --          --     match length: 2
                   |----11----|       match offset: 11
    (11, 2)   SIDVICIIISIDIDVI

At each stage the string in the lookahead buffer is searched for in the history buffer. The longest match is used and the distance between the match and the current position is output, with the match length. The processed data is then moved to the history buffer. Note that the history buffer contains data that has already been output. On the decompression side it corresponds to the data that has already been decompressed. The message becomes:
S I D V I C I I I (9,3) (2,2) (11,2)
The following describes what the decompressor does with this data.
    History             Input
                        S
    S                   I
    SI                  D
    SID                 V
    SIDV                I
    SIDVI               C
    SIDVIC              I
    SIDVICI             I
    SIDVICII            I
    SIDVICIII           (9,3)   -> SID
    |----9---|
    SIDVICIIISID        (2,2)   -> ID
              |2|
    SIDVICIIISIDID      (11,2)  -> VI
       |----11----|
    SIDVICIIISIDIDVI
In the decompressor the history buffer contains the data that has already been decompressed. If we get a literal symbol code, it is added as-is. If we get an (offset,length) pair, the offset tells us from where to copy and the length tells us how many symbols to copy to the current output position. For example (9,3) tells us to go back 9 locations and copy 3 symbols to the current output position. The great thing is that we don't need to transfer or maintain any other data structure than the data itself.
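The copy loop described above is short enough to sketch in C. The token structure is a made-up in-memory representation for illustration (a length of zero marks a literal). It says nothing about how the codes are packed into bits.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical token: length == 0 means "literal", otherwise copy
   `length` symbols starting `offset` positions back in the output. */
struct lz_token {
    unsigned offset;
    unsigned length;
    char literal;
};

/* Decode ntok tokens into out[] (capacity cap); returns symbols written. */
size_t lz77_decode(const struct lz_token *tok, size_t ntok,
                   char *out, size_t cap)
{
    size_t pos = 0, i;

    for (i = 0; i < ntok; i++) {
        if (tok[i].length == 0) {
            assert(pos < cap);
            out[pos++] = tok[i].literal;        /* literal: add as-is */
        } else {
            unsigned k;

            /* The offset must point into data that is already decoded */
            assert(tok[i].offset >= 1 && tok[i].offset <= pos);
            assert(pos + tok[i].length <= cap);
            for (k = 0; k < tok[i].length; k++) {
                out[pos] = out[pos - tok[i].offset];
                pos++;
            }
        }
    }
    return pos;
}
```

Feeding it the nine literals S I D V I C I I I followed by (9,3), (2,2) and (11,2) should reproduce the 16-symbol message. The output buffer itself is the only history the decoder needs.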
Compare this to the BASIC interpreter, where all tokens have the high bit set and all normal characters don't (PETSCII codes 0-127). So when the LIST routine sees a normal character it just prints it as-is, but when it hits a special character (PETSCII >= 128) it looks up the corresponding keyword in a table. LZ77 is similar, but an LZ77 LIST would look up the keyword in the data already LISTed to the screen! LZ78 uses a separate table which is expanded as the data is processed.
The number of bits needed to encode the message (~52 bits) is somewhat bigger than what the Huffman code needed (42 bits). This is mainly because the message is too short for LZ77. It takes quite a long time to build up a good enough dictionary (the history buffer).
As we are dealing with file compression, the symbol source is a file and the symbols (characters) are byte values from 0 to 255. A string or a phrase is a concatenation of symbols, for example 011101, "AAACB". Quite intuitive, right?
When reading symbols from a symbol source, there is some probability for each of the symbols to appear. For totally random sources each symbol is equally likely, but random sources are also incompressible, and we are not interested in them here. Equal probabilities or not, probabilities give us a means of defining the concept of symbol self-information, i.e. the amount of information a symbol carries.
Simply, the more probable an event is, the fewer bits of information it contains. If we denote the probability of a symbol A[i] occurring as p(A[i]), the expression -log2(p(A[i])) (base-2 logarithm) gives the amount of information in bits that the source symbol A[i] carries. You can calculate base-2 logarithms using base-10 or natural logarithms if you remember that log2(n) = log(n)/log(2).
A real-world example would be a comparison between two statements:
The entropy of a message is a convenient measure of information, because it sets the lower limit for the average codeword length of a block-variable code, for example a Huffman code. You cannot get better compression with a statistical compression method which only considers single-symbol probabilities. The average codeword length is calculated in an analogous way to the entropy: each codeword length is weighted by the probability of its symbol. Average code length is L = sum{ p(A[i])*l(i) }, where l(i) is the codeword length for the ith symbol in the alphabet. The difference between L and H gives an indication of the efficiency of a code. A smaller difference means a more efficient code.
It is no coincidence that the entropy and average code length are calculated using very similar equations. If the symbol probabilities are not equal, we can get a shorter overall message, i.e. shorter average codeword length (i.e. compression), if we assign shorter codes for symbols that are more likely to occur. Note that entropy is only the lower limit for statistical compression systems. Other methods may perform better, although not for all files.
A code is instantaneous if each codeword (a code symbol as opposed to source symbol) in a message can be decoded as soon as it is received. The binary code {a, b} = {0, 01} is uniquely decodable, but it isn't instantaneous. You need to peek into the future to see if the next bit is 1. If it is, b is decoded, if not, a is decoded. The binary code {a, b, c} = {0, 10, 11} on the other hand is an instantaneous code.
A code is a prefix code if and only if no codeword is a prefix of another codeword. A code is instantaneous if and only if it is a prefix code, so a prefix code is always a uniquely decodable instantaneous code. We only deal with prefix codes from now on. It can be proven that all uniquely decodable codes can be changed into prefix codes of equal code lengths.
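The prefix property is easy to test mechanically. The following sketch assumes the codewords are given as strings of '0' and '1' characters; the function name is invented for this example.

```c
#include <assert.h>
#include <string.h>

/* Returns 1 if no codeword is a prefix of another codeword, else 0.
   Codewords are NUL-terminated strings of '0' and '1' characters. */
int is_prefix_code(const char *const *codes, size_t n)
{
    size_t i, j;

    assert(codes != NULL);
    for (i = 0; i < n; i++) {
        size_t len = strlen(codes[i]);

        for (j = 0; j < n; j++) {
            /* codes[i] is a prefix of codes[j] if the first len
               characters of both are identical */
            if (i != j && strncmp(codes[i], codes[j], len) == 0)
                return 0;
        }
    }
    return 1;
}
```

With the two codes from the text, {0, 01} should be rejected (0 is a prefix of 01) and {0, 10, 11} should be accepted.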
Obviously, this kind of code does not give any compression, but it allows a transformation to be performed on the data, which may make the data more easily compressible, or which separates the 'essential' information for lossy compression. For example the discrete cosine transform (DCT) belongs to this group. It doesn't really compress anything, as it takes in a matrix of values and produces a matrix of equal size as output, but the resulting values hold the information in a more compact form.
In lossless audio compression the transform could be something along the lines of delta encoding, i.e. the difference between successive samples (there is usually high correlation between successive samples in audio data), or something more advanced like Nth order prediction. Only the prediction error is transmitted. In lossy compression the prediction error may be transmitted in reduced precision. The reproduction in the decompression won't then be exact, but the number of bits needed to transmit the prediction error may be much smaller.
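As a minimal sketch of the simplest such transform, here is delta encoding for unsigned 8-bit samples, done in place with wrap-around (modulo 256) arithmetic. The sample format and the function names are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Replace each sample with its difference from the previous sample.
   The implied sample before the first one is 0. */
void delta_encode(unsigned char *buf, size_t n)
{
    unsigned char prev = 0;
    size_t i;

    assert(buf != NULL || n == 0);
    for (i = 0; i < n; i++) {
        unsigned char cur = buf[i];

        buf[i] = (unsigned char)(cur - prev);   /* wraps modulo 256 */
        prev = cur;
    }
}

/* Inverse transform: a running sum restores the original samples. */
void delta_decode(unsigned char *buf, size_t n)
{
    unsigned char prev = 0;
    size_t i;

    assert(buf != NULL || n == 0);
    for (i = 0; i < n; i++) {
        prev = (unsigned char)(prev + buf[i]);
        buf[i] = prev;
    }
}
```

The transform itself is still a block-to-block code: the output is exactly as long as the input. The gain only appears when the small differences are fed to a later compression stage.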
One block-to-block code relevant to Commodore 64, VIC 20 and their relatives is nybble packing that is performed by some C64 compression programs. As nybbles by definition only occupy 4 bits of a byte, we can fit two nybbles into each byte without throwing any data away, thus getting 50% compression from the original which used a whole byte for every nybble. Although this compression ratio may seem very good, in reality very little is gained globally. First, only very small parts of actual files contain nybble-width data. Secondly, better methods exist that also take advantage of the patterns in the data.
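Nybble packing can be sketched in a few lines of C. The byte order (high nybble first) and the zero padding of an odd tail are arbitrary choices for this example, not something a particular C64 packer is known to use.

```c
#include <assert.h>
#include <stddef.h>

/* Pack n values in the range 0..15 into (n + 1) / 2 bytes,
   high nybble first. Returns the number of bytes written. */
size_t pack_nybbles(const unsigned char *in, size_t n, unsigned char *out)
{
    size_t i;

    for (i = 0; i < n; i += 2) {
        unsigned char hi = in[i];
        unsigned char lo = (i + 1 < n) ? in[i + 1] : 0; /* pad odd tail */

        assert(hi < 16 && lo < 16);
        out[i / 2] = (unsigned char)((hi << 4) | lo);
    }
    return (n + 1) / 2;
}

/* Inverse: expand n nybbles back to one value per byte. */
void unpack_nybbles(const unsigned char *in, size_t n, unsigned char *out)
{
    size_t i;

    for (i = 0; i < n; i++) {
        unsigned char b = in[i / 2];

        out[i] = (i % 2 == 0) ? (unsigned char)(b >> 4)
                              : (unsigned char)(b & 0x0F);
    }
}
```

The decoder has to know the original count n, because a packed odd-length block is indistinguishable from an even-length block ending in zero.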
There are three types of statistical codes: fixed, static, and adaptive. Static codes need two passes over the input message. During the first pass they gather statistics of the message so that they know the probabilities of the source symbols. During the second pass they perform the actual encoding. Adaptive codes do not need the first pass. They update the statistics while encoding the data. The same updating of statistics is done in the decoder so that they keep in sync, making the code uniquely decodable. Fixed codes are 'static' static codes. They use a preset statistical model, and the statistics of the actual message have no effect on the encoding. You just have to hope (or make certain) that the message statistics are close to the ones the code assumes.
However, 0-order statistical compression (and entropy) don't take advantage of inter-symbol relations. They assume symbols are disconnected variables, but in reality there is considerable relation between successive symbols. If I dropped every third character from this text, you would probably be able to decipher it quite well. First order statistical compression uses the previous character to predict the next one. Second order compression uses two previous characters, and so on. The more characters are used to predict the next character, the better the estimate of the probability distribution for the next character. But more is not only better, there are also prices to pay.
The first drawback is the amount of memory needed to store the probability tables. The frequencies for each character encountered must be accounted for. And you need one table for each 'previous character' value. If we are using an adaptive code, the second drawback is the time needed to update the tables and then update the encoding accordingly. In the case of Huffman encoding the Huffman tree needs to be recreated. And the encoding and decoding itself certainly takes time also.
We can keep the memory usage and processing demands tolerable by using a 0-order static Huffman code. Even so, the Huffman tree takes up precious memory, and decoding Huffman code on a 1-MHz 8-bit processor is slow and does not offer very good compression either. Statistical compression can nevertheless offer savings as a part of a hybrid compression system. For example:
    'A'  1/2  0
    'B'  1/4  10
    'C'  1/8  110
    'D'  1/8  111

    "BACADBAABAADABCA"                            total: 32 bits
    10 0 110 0 111 10 0 0 10 0 0 111 0 10 110 0   total: 28 bits

This is an example of a simple statistical compression. The original symbols each take two bits to represent (4 possibilities), thus the whole string takes 32 bits. The variable-length code assigns the shortest code to the most probable symbol (A) and it takes 28 bits to represent the same string. The spaces between symbols are only there for clarity. The decoder still knows where each symbol ends because the code is a prefix code. On the other hand, I am simplifying things a bit here, because I'm omitting one vital piece of information: the length of the message. The file system normally stores the information about the end of file by storing the length of the file. The decoder also needs this information. We have two basic methods: reserve one symbol to represent the end of file condition or send the length of the original file. Both have their virtues.
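To make the role of the message length concrete, here is a sketch of an encoder and decoder for exactly this four-symbol code. Bits are kept as '0' and '1' characters for readability, and the decoder is told the symbol count separately (the "send the length" method). All names are invented for the example.

```c
#include <assert.h>
#include <string.h>

/* The fixed code from the example: A=0, B=10, C=110, D=111. */
static const char *const code_for[4] = { "0", "10", "110", "111" };

/* Encode a string of 'A'..'D' into '0'/'1' characters; returns bit count. */
size_t encode_abcd(const char *msg, char *bits, size_t cap)
{
    size_t used = 0;

    for (; *msg != '\0'; msg++) {
        const char *cw;
        size_t len;

        assert(*msg >= 'A' && *msg <= 'D');
        cw = code_for[*msg - 'A'];
        len = strlen(cw);
        assert(used + len < cap);
        memcpy(bits + used, cw, len);
        used += len;
    }
    bits[used] = '\0';
    return used;
}

/* Decode exactly nsym symbols. The count comes from outside the bit
   stream, because the code itself has no end-of-message marker. */
void decode_abcd(const char *bits, size_t nsym, char *msg)
{
    size_t i;

    for (i = 0; i < nsym; i++) {
        int ones = 0;

        /* Count leading one-bits, two at most */
        while (ones < 2 && *bits == '1') { ones++; bits++; }
        if (ones < 2) {
            assert(*bits == '0');
            bits++;                         /* terminating zero-bit */
            msg[i] = (char)('A' + ones);    /* 0 -> A, 10 -> B */
        } else {
            assert(*bits == '0' || *bits == '1');
            msg[i] = (*bits == '0') ? 'C' : 'D';   /* 110 / 111 */
            bits++;
        }
    }
    msg[nsym] = '\0';
}
```

Encoding "BACADBAABAADABCA" should yield 28 bits, and decoding those bits with a symbol count of 16 should return the original string.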
The best compressors available today take into account intersymbol probabilities. Dynamic Markov Coding (DMC) starts with a zero-order Markov model and gradually extends this initial model as compression progresses. Prediction by Partial Matching (PPM), although it really is a variable-to-block code, looks for a match of the text to be compressed in an order-n context and if there is no match drops back to an order n-1 context until it reaches order 0.
Substitutional compressors work by trying to replace strings in the input data with shorter codes. Lempel-Ziv methods (named after the inventors) contain two main groups: LZ77 and LZ78.
LZ77-type compressors use a history buffer, which contains a fixed number of symbols output/seen so far. The compressor reads symbols from the input to a lookahead buffer and tries to find the longest possible match in the history buffer. The length of the string match and the location in the buffer (offset from the current position) are written to the output. If there is no suitable match, the next input symbol is sent as a literal symbol.
Of course there must be a way to identify literal bytes and compressed data in the output. There are a lot of different ways to accomplish this, but a single bit to select between a literal and compressed data is the easiest.
The basic scheme is a variable-to-block code. A variable-length piece of the message is represented by a constant amount of bits: the match length and the match offset. Because the data in the history buffer is known to both the compressor and decompressor, it can be used in the compression. The decompressor simply copies part of the already decompressed data or a literal byte to the current output position.
Variants of LZ77 apply additional compression to the output of the compressor. These include a simple variable-length code (LZB), dynamic Huffman coding (LZH), and Shannon-Fano coding (ZIP 1.x), all of which result in a certain degree of improvement over the basic scheme. This is because the output values from the first stage are not evenly distributed, i.e. their probabilities are not equal and statistical compression can do its part.
To get higher efficiency, we have to create a real dictionary. Strings are added to the codebook only once. There are no duplicates that waste bits just because they exist. Also, each entry in the codebook will have a specific length, thus only an index to the codebook is needed to specify a string (phrase). In LZ77 the length and offset values were handled more or less as disconnected variables although there is correlation. Because they are now handled as one entity, we can expect to do a little better in that regard also.
LZ78-type compressors use this kind of a dictionary. The next part of the message (the lookahead buffer contents) is searched from the dictionary and the maximum-length match is returned. The output code is an index to the dictionary. If there is no suitable entry in the dictionary, the next input symbol is sent as a literal symbol. The dictionary is updated after each symbol is encoded, so that it is possible to build an identical dictionary in the decompression code without sending additional data.
Essentially, strings that we have seen in the data are added to the dictionary. To be able to constantly adapt to the message statistics, the dictionary must be trimmed down by discarding the oldest entries. This also prevents the dictionary from becoming full, which would decrease the compression ratio. This is handled automatically in LZ77 by its use of a history buffer (a sliding window). For LZ78 it must be implemented separately. Because the decompression code updates its dictionary in synchronization with the compressor, the code remains uniquely decodable.
Run length encoding also belongs to this group. If there are consecutive equal valued symbols in the input, the compressor outputs how many of them there are, and their value. Again, we must be able to identify literal bytes and compressed data. One of the RLE compressors I have seen outputs two equal symbols to identify a run of symbols. The next byte(s) then tell how many more of these to output. If the value is 0, there are only two consecutive equal symbols in the original stream. Depending on how many bits are used to represent the value, this is the only case when the output is expanded.
Run-length encoding has been used since day one in C64 compression programs because it is very fast and very simple. Part of this is because it deals with byte-aligned data and is essentially just copying bytes from one place to another. The drawback is that RLE can only compress identical bytes into a shorter representation. On the C64 only graphics and music data contain large runs of identical bytes. Program code rarely contains more than a couple of successive identical bytes. We need something better.
That "something better" seems to be LZ77, which has been used in C64 compression programs for a long time. LZ77 can take advantage of repeating code/graphic/music data fragments and thus achieves better compression. The drawback is that practical LZ77 implementations tend to become variable-to-variable codes (more on that later) and need to handle data bit by bit, which is quite a lot slower than handling bytes.
LZ78 is not practical for C64, because the decompressor needs to create and update the dictionary. A big enough dictionary would take too much memory and updating the dictionary would need its share of processor cycles.
Randomly concatenating algorithms rarely produces good results, so you have to know what you are doing and what kind of files you are compressing. Whenever a novice asks the usual question: 'What compression program should I use?', they get the appropriate response: 'What kind of data are you compressing?'
Borrowed from Tom Lane's article in comp.compression:
It's hardly ever worthwhile to take the compressed output of one
compression method and shove it through another compression method.
Especially not if the second method is a general-purpose compressor
that doesn't have specific knowledge of the first compression step.
Compression is effective in direct proportion to the extent that
it eliminates obvious patterns in the data. So if the first
compression step is any good, it will leave little traction for the
second step. Combining multiple compression methods is only helpful
when the methods are specifically chosen to be complementary.
A small sidetrack I want to take:
This also brings us conveniently to another truth in lossless compression.
There isn't a single compressor which would be able to losslessly compress
all possible files (you can see the comp.compression FAQ for information
about the counting proof). It is our luck that we are not interested in
compressing all files. We are only interested in compressing a very small
subset of all files. The more accurately we can describe the files we
would encounter, the better. This is called modelling, and it is what all
compression programs do and must do to be successful.
Audio and graphics compression algorithms may assume a continuous signal, and a text compressor may assume that there are repeated strings in the data. If the data does not match the assumptions (the model), the algorithm usually expands the data instead of compressing it.
    Value  Binary  Adjusted 1&2
    ---------------------------
    0      000     00    000      H = 2.585
    1      001     01    001      L = 2.666
    2      010     100   010      (for flat distribution)
    3      011     101   011
    4      100     110   10
    5      101     111   11
The previous table shows two different versions of how the adjustment could be done for a code that has to represent 6 different values with the minimum average number of bits. As can be seen, they are still both prefix codes, i.e. it's possible to (easily) decode them.
If there is no definite upper limit to the integer value, direct binary code can't be used and one of the following codes must be selected.
In this code the number of zero-bits before the first one-bit (a unary code) defines how many more bits to get. The code may be considered a special fixed Huffman tree. You can generate a Huffman tree from the assumed value distribution and you'll get a very similar code. The code is also directly decodable without any tables or difficult operations, because once the first one-bit is found, the length of the code word is instantly known. The bits following the zero bits (if any) are directly the encoded value.
    Gamma Code     Integer  Bits
    ----------------------------
    1              1        1
    01x            2-3      3
    001xx          4-7      5
    0001xxx        8-15     7
    00001xxxx      16-31    9
    000001xxxxx    32-63    11
    0000001xxxxxx  64-127   13
    ...
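The gamma code in the table can be produced and read back with very little code. This is only a sketch: bits are represented as '0' and '1' characters so the result can be compared with the table by eye, and the function names are made up.

```c
#include <assert.h>
#include <stddef.h>

/* Write the Elias gamma code of n (n >= 1) as '0'/'1' characters.
   Returns the code length in bits. */
size_t gamma_encode(unsigned long n, char *bits)
{
    int nbits = 0, i;
    size_t pos = 0;
    unsigned long t;

    assert(n >= 1);
    for (t = n; t > 1; t >>= 1) nbits++;            /* floor(log2(n)) */
    for (i = 0; i < nbits; i++) bits[pos++] = '0';  /* unary prefix */
    for (i = nbits; i >= 0; i--)                    /* value, leading 1 first */
        bits[pos++] = ((n >> i) & 1) ? '1' : '0';
    bits[pos] = '\0';
    return pos;
}

/* Decode one gamma code and advance *bits past it. */
unsigned long gamma_decode(const char **bits)
{
    const char *p = *bits;
    int zeros = 0;
    unsigned long n = 1;                /* the first one-bit is the MSB */

    while (*p == '0') { zeros++; p++; }
    assert(*p == '1');
    p++;
    while (zeros-- > 0) {
        assert(*p == '0' || *p == '1');
        n = (n << 1) | (unsigned long)(*p++ - '0');
    }
    *bits = p;
    return n;
}
```

For example, 5 should encode as 00101: two zero-bits announce two more bits after the first one-bit.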
    Delta Code   Integer  Bits
    --------------------------
    1            1        1
    010x         2-3      4
    011xx        4-7      5
    00100xxx     8-15     8
    00101xxxx    16-31    9
    00110xxxxx   32-63    10
    00111xxxxxx  64-127   11
    ...
The delta code is better than the gamma code for big values, as it is asymptotically optimal (the expected codeword length approaches constant times entropy when entropy approaches infinity), which the gamma code is not. What this means is that the extra bits needed to indicate where the code ends become a smaller and smaller proportion of the total bits as we encode bigger and bigger numbers. The gamma code is better for greatly skewed value distributions (a lot of small values).
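Judging from the table, a delta code is the gamma code of the value's bit count followed by the value's bits without the leading one-bit. A sketch under that reading, again with bits as '0'/'1' characters and invented function names:

```c
#include <assert.h>
#include <stddef.h>

/* Number of significant bits in n (0 for n == 0). */
static int bit_length(unsigned long n)
{
    int bits = 0;

    while (n > 0) { bits++; n >>= 1; }
    return bits;
}

/* Append the low `count` bits of v, most significant bit first. */
static size_t put_bits(char *bits, size_t pos, unsigned long v, int count)
{
    int i;

    for (i = count - 1; i >= 0; i--)
        bits[pos++] = ((v >> i) & 1) ? '1' : '0';
    return pos;
}

/* Write the Elias delta code of n (n >= 1); returns the length in bits. */
size_t elias_delta_encode(unsigned long n, char *bits)
{
    int nbits, lbits, i;
    size_t pos = 0;

    assert(n >= 1);
    nbits = bit_length(n);                          /* bits in the value */
    lbits = bit_length((unsigned long)nbits);       /* bits in that count */
    for (i = 1; i < lbits; i++) bits[pos++] = '0';  /* gamma prefix zeros */
    pos = put_bits(bits, pos, (unsigned long)nbits, lbits); /* gamma(nbits) */
    pos = put_bits(bits, pos, n, nbits - 1);        /* value w/o leading 1 */
    bits[pos] = '\0';
    return pos;
}
```

The lengths it produces for 1, 3, 7, 15, 31, 63 and 127 should match the Bits column of the delta table: 1, 4, 5, 8, 9, 10 and 11.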
       1   2   3   5   8  13  21  34  55  89
    ----------------------------------------
       1 (1)                          =  1
       0   1 (1)                      =  2
       0   0   1 (1)                  =  3
       1   0   1 (1)                  =  4
       0   0   0   1 (1)              =  5
       1   0   0   1 (1)              =  6
       0   1   0   1 (1)              =  7
       0   0   0   0   1 (1)          =  8
       :   :   :   :   :   :             :
       1   0   1   0   1 (1)          = 12
       0   0   0   0   0   1 (1)      = 13
       :   :   :   :   :   :   :         :
       0   1   0   1   0   1 (1)      = 20
       0   0   0   0   0   0   1 (1)  = 21
       :   :   :   :   :   :   :   :     :
       1   0   0   1   0   0   1 (1)  = 27
Note that because the code does not have two successive one-bits until the end mark, the code density may seem quite poor compared to the other codes, and it is, if most of the values are small (1-3). On the other hand, it also makes the code very robust by localizing and containing possible errors. Although, if the Fibonacci code is used as a part of a larger system, this robustness may not help much, because we lose the synchronization in the upper level anyway. Most adaptive methods can't recover from any errors, whether they are detected or not. Even in LZ77 the errors can be inherited infinitely far into the future.
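The table rows can be generated with a greedy decomposition into Fibonacci numbers. In this sketch the bit order follows the table (the bit for 1 comes first), the output is a string of '0'/'1' characters, and the upper limit on n is an arbitrary bound chosen to keep the lookup array small.

```c
#include <assert.h>
#include <stddef.h>

/* Write the Fibonacci code of n (n >= 1) as '0'/'1' characters.
   Returns the code length including the terminating one-bit. */
size_t fib_encode(unsigned long n, char *bits)
{
    unsigned long fib[48];
    int count = 2, top, i;
    size_t pos;

    assert(n >= 1 && n <= 1000000000UL);    /* illustrative limit */
    fib[0] = 1;
    fib[1] = 2;
    while (fib[count - 1] <= n) {           /* 1, 2, 3, 5, 8, 13, ... */
        fib[count] = fib[count - 1] + fib[count - 2];
        count++;
    }
    top = count - 1;
    while (fib[top] > n) top--;             /* largest number that fits */

    for (i = 0; i <= top; i++) bits[i] = '0';
    for (i = top; i >= 0; i--) {            /* greedy, largest first */
        if (fib[i] <= n) {
            bits[i] = '1';
            n -= fib[i];
        }
    }
    pos = (size_t)top + 1;
    bits[pos++] = '1';                      /* end mark: second 1 in a row */
    bits[pos] = '\0';
    return pos;
}
```

Taking the largest fitting Fibonacci number first guarantees that no two adjacent numbers are used, which is why the end mark is the only place where two one-bits appear in a row.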
Comparison between delta, gamma and Fibonacci code lengths.
              Gamma  Delta  Fibonacci
    1           1      1      2.0
    2-3         3      4      3.5
    4-7         5      5      4.8
    8-15        7      8      6.4
    16-31       9      9      7.9
    32-63      11     10      9.2
    64-127     13     11     10.6

The comparison shows that if even half of the values are in the range 1..7 (and other values relatively near this range), the Elias gamma code wins by a handsome margin.
    Golomb   m=1      m=2     m=3    m=4    m=5    m=6
    Rice     k=0      k=1            k=2
    ---------------------------------------------------
    n = 0    0        00      00     000    000    000
        1    10       01      010    001    001    001
        2    110      100     011    010    010    0100
        3    1110     101     100    011    0110   0101
        4    11110    1100    1010   1000   0111   0110
        5    111110   1101    1011   1001   1000   0111
        6    1111110  11100   1100   1010   1001   1000
        7    :        11101   11010  1011   1010   1001
        8    :        111100  11011  11000  10110  10100
To encode an integer n (starting from 0 this time, not from 1 as for Elias codes and Fibonacci code) using the Golomb code with parameter m, we first compute floor( n/m ) and output this using a unary code. Then we compute the remainder n mod m and output that value using a binary code which is adjusted so that we sometimes use floor( log2(m) ) bits and sometimes ceil( log2(m) ) bits.
Rice coding is the same as Golomb coding except that only a subset of parameters can be used, namely the powers of 2. In other words, a Rice code with the parameter k is equal to Golomb code with parameter m = 2^k. Because of this the Rice codes are much more efficient to implement on a computer. Division becomes a shift operation and modulo becomes a bit mask operation.
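The two-step recipe can be sketched as follows. The unary part is written as a run of one-bits ended by a zero-bit, which is the convention the m=1 column of the table suggests, and the remainder uses the adjusted binary code. Bits are '0'/'1' characters and the function name is invented.

```c
#include <assert.h>
#include <stddef.h>

/* Write the Golomb code of n (n >= 0) with parameter m (m >= 1).
   Returns the code length in bits. */
size_t golomb_encode(unsigned long n, unsigned long m, char *bits)
{
    unsigned long q, r, cutoff, value;
    int b = 0, width, i;
    size_t pos = 0;

    assert(m >= 1);
    q = n / m;
    r = n % m;

    while (q-- > 0) bits[pos++] = '1';  /* quotient in unary */
    bits[pos++] = '0';

    while ((1UL << b) < m) b++;         /* b = ceil(log2(m)) */
    cutoff = (1UL << b) - m;            /* how many short codes exist */
    if (r < cutoff) {
        value = r;                      /* short code: b - 1 bits */
        width = b - 1;
    } else {
        value = r + cutoff;             /* long code: b bits */
        width = b;
    }
    for (i = width - 1; i >= 0; i--)
        bits[pos++] = ((value >> i) & 1) ? '1' : '0';
    bits[pos] = '\0';
    return pos;
}
```

When m is a power of two the cutoff is zero and every remainder takes exactly log2(m) bits, which is the Rice code case; a real implementation would then replace the division and modulo with a shift and a mask.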
So now we have several alternatives to choose from. Now we simply have to do a little real-life research to determine how the values we want to encode are distributed so that we can select the optimum code to represent them. Of course we still have to keep in mind that we intend to decode the thing with a 1-MHz 8-bit processor. As always, compromises loom on the horizon. Pucrunch uses Elias Gamma Code, because it is the best alternative for that task and is very close to static Huffman code. The best part is that the Gamma Code is much simpler to decode and doesn't need additional memory.
Also, the added bonus is that once we have a decompressor, we can improve the compressor without changing the file format. At least until we have some statistics to develop a better system. Many lossy video and audio compression systems only document and standardize the decompressor and the file or stream format, making it possible to improve the encoding part of the process when faster and better hardware (or algorithms) become available.
    void DecompressRLE() {
        int oldChar = -1;
        int newChar;

        while(1) {
            newChar = GetByte();
            if(newChar == EOF)
                return;
            PutByte(newChar);
            if(newChar == oldChar) {
                /* Two equal bytes in a row mark a run: a count of
                   additional copies follows */
                int len = GetLength();

                while(len > 0) {
                    PutByte(newChar);
                    len = len - 1;
                }
            }
            oldChar = newChar;
        }
    }

This RLE algorithm uses two successive equal characters to mark a run of bytes. I have on purpose left open the question of how the length is encoded (1 or 2 bytes or variable-length code). The decompressor also allows chaining/extension of RLE runs, for example 'a', 'a', 255, 'a', 255 would output 513 'a'-characters.
In this case the compression algorithm is almost as simple.
    void CompressRLE() {
        int oldChar = -1;
        int newChar;

        while(1) {
            newChar = GetByte();
            if(newChar == oldChar) {
                int length = 0;

                if(newChar == EOF)
                    return;
                PutByte(newChar);   /* RLE indicator */

                /* Get all equal characters */
                while((newChar = GetByte()) == oldChar) {
                    length++;
                }
                PutLength(length);
            }
            /* newChar now differs from oldChar (or is EOF) */
            if(newChar == EOF)
                return;
            PutByte(newChar);
            oldChar = newChar;
        }
    }

If there are two equal bytes, the compression algorithm reads more bytes until it gets a different byte. If there were only two equal bytes, the length value will be zero and the compression algorithm expands the data. A C64-related example would be the compression of the BASIC ROM with this RLE algorithm. Or actually expansion, as the new file size is 8200 bytes instead of the original 8192 bytes. Those equal byte runs that the algorithm needs just aren't there. For comparison, pucrunch manages to compress the BASIC ROM into 7288 bytes, the decompression code included. Even Huffman coding manages to compress it into 7684 bytes.
"BAAAAAADBBABBBBBAAADABCD" total: 24*8=192 bits "BAA",4,"DBB",0,"ABB",3,"AA",1,"DABCD" total: 16*8+4*8=160 bitsThis is an example of how the presented RLE encoder would work on a string. The total length calculations assume that we are handling 8-bit data, although only values from 'A' to 'D' are present in the string. After seeing two equal characters the decoder gets a repeat count and then adds that many more of them. Notice that the repeat count is zero if there are only two equal characters.
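The encoder can be sketched as runnable code in the same spirit. This is only an illustration under the assumption of a one-byte length; the name rle_compress and the buffer interface are hypothetical. Runs that need a count above 255 are split and continued with a new pair of equal bytes, which relies on the decoder treating the next equal byte as the start of another run.

```c
#include <stddef.h>

/* Encode n bytes from in[] into out[]; returns the number of bytes written.
   Assumption: the run length is stored in one byte (0..255).
   In the worst case (only pairs of equal bytes) the output needs
   n + n/2 bytes, so the caller must size out[] accordingly. */
size_t rle_compress(const unsigned char *in, size_t n, unsigned char *out)
{
    size_t i = 0, o = 0;
    int old = -1;                       /* no previous byte yet */

    while (i < n) {
        unsigned char c = in[i++];

        out[o++] = c;
        if (c == old) {                 /* second equal byte: RLE indicator */
            unsigned char len = 0;

            /* Count the equal bytes that follow, at most 255 of them */
            while (i < n && in[i] == c && len < 255) {
                i++;
                len++;
            }
            out[o++] = len;
        }
        old = c;
    }
    return o;
}
```

For the 24-byte example string above this writes 20 bytes, with the counts 4, 0, 3 and 1 following the four pairs of equal characters.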
    int GetHuffman()
    {
        int index = 0;      /* start at the root of the tree */

        while(1) {
            if(GetBit() == 1) {
                index = LeftNode(index);
            } else {
                index = RightNode(index);
            }
            if(LeafNode(index)) {
                return LeafValue(index);
            }
        }
    }

My pseudo code of the Huffman decode function is a very simplified one, so I should probably first describe how the Huffman code and the corresponding binary tree are constructed.
First we need the statistics for all the symbols occurring in the message, i.e. the file we are compressing. Then we rank them in decreasing probability order. Then we combine the smallest two probabilities and assign 0 and 1 to the binary tree branches, i.e. the original symbols. We do this until there is only one composite symbol left.
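The merging procedure can be sketched in a few lines of C. This is a deliberately simple version that only computes the code length of each symbol (its depth in the tree); the function name, the MAX_SYMS limit and the array layout are assumptions made for this example. Ties between equal weights are broken arbitrarily here, so for other inputs the individual lengths may differ from a hand-built tree even though the average code length is the same.

```c
#define MAX_SYMS 256    /* assumed upper limit for the number of symbols */

/* Compute Huffman code lengths for n symbols (1 <= n <= MAX_SYMS) from
   their occurrence counts. length[i] receives the depth of symbol i in the
   tree. A message with a single symbol gets length 0. */
void huffman_lengths(const long *count, int n, int *length)
{
    long weight[2 * MAX_SYMS];
    int parent[2 * MAX_SYMS];
    int nodes = n, alive = n, i;

    for (i = 0; i < n; i++) {
        weight[i] = count[i];
        parent[i] = -1;                 /* -1 = not combined yet */
    }
    while (alive > 1) {
        int a = -1, b = -1;             /* the two smallest free nodes */

        for (i = 0; i < nodes; i++) {
            if (parent[i] != -1)
                continue;
            if (a < 0 || weight[i] < weight[a]) {
                b = a;
                a = i;
            } else if (b < 0 || weight[i] < weight[b]) {
                b = i;
            }
        }
        /* Combine them into a new composite node */
        weight[nodes] = weight[a] + weight[b];
        parent[nodes] = -1;
        parent[a] = parent[b] = nodes;
        nodes++;
        alive--;
    }
    for (i = 0; i < n; i++) {
        int depth = 0, p;

        for (p = parent[i]; p != -1; p = parent[p])
            depth++;
        length[i] = depth;
    }
}
```

With the counts 11, 9, 3 and 1 the resulting lengths are 1, 2, 3 and 3 bits.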
Depending on where we insert the composite symbols we get different Huffman trees. The average code length is equal in both cases (and so is the compression ratio), but the length of the longest code changes. The implementation of the decoder is usually more efficient if we keep the longest code as short as possible. This is achieved by inserting the composite symbols (new nodes) before all symbols/nodes that have equal probability.

    "BAAAAAADBBABBBBBAAADABCD"      A (11)  B (9)  D (3)  C (1)

    Step 1              Step 2              Step 3

    'A' 0.458           'A' 0.458           C2  0.542 0\ C3
    'B' 0.375           'B' 0.375 0\ C2     'A' 0.458 1/
    'D' 0.125 0\ C1     C1  0.167 1/
    'C' 0.042 1/

              C3
           0 /  \ 1
            /   'A'
          C2
       0 /  \ 1
      'B'    \
              C1
           0 /  \ 1
          'D'   'C'

So, in each step we combine the two lowest-probability nodes or leaves into a new node. When we are done, we have a Huffman tree containing all the original symbols. The Huffman codes for the symbols can now be obtained by starting at the root of the tree and collecting the 0/1-bits on the way to the desired leaf (symbol). We get:

    'A' = 1
    'B' = 00
    'C' = 011
    'D' = 010
These codes (or the binary tree) are used when encoding the file, but the decoder also needs this information. Sending the binary tree or the codes would take a lot of bytes, thus taking away all or most of the compression. The amount of data needed to transfer the tree can be greatly reduced by sending just the symbols and their code lengths. If the tree is traversed in a canonical (predefined) order, this is all that is needed to recreate the tree and the Huffman codes. By doing a 0-branch-first traverse we get:
    Symbol  Code  Code Length
    'B'     00    2
    'D'     010   3
    'C'     011   3
    'A'     1     1

So we can just send 'B', 2, 'D', 3, 'C', 3, 'A', 1 and the decoder has enough information (when it also knows how we went through the tree) to recreate the Huffman codes and the tree. Actually you can even drop the symbol values if you handle things a bit differently (see the Deflate specification in RFC 1951), but my arrangement makes the algorithm much simpler and doesn't need to transfer data for symbols that are not present in the message.
Basically we start with a code value of all zeros and the appropriate length for the first symbol. For each following symbol we first add 1 to the code value and then shift the value left or right until it has the right length. In the example we first assign 00 to 'B', then add one to get 01 and shift left to get a 3-bit codeword for 'D', making it 010 as it should be. For 'C' we add 1 and get 011; no shift is needed because the codeword already has the right length. For 'A' we add one to get 100, shift 2 places to the right and get 1.
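The add-one-and-shift rule translates almost directly into C. The following is a small sketch, with the function name and the array interface invented for this example; it assumes the lengths arrive in the same 0-branch-first traversal order in which they were sent.

```c
/* Recreate the codes from the code lengths, given in 0-branch-first
   traversal order. codes[i] receives the code of the i-th symbol,
   right-aligned in len[i] bits. */
void assign_codes(const int *len, int count, unsigned *codes)
{
    unsigned code = 0;      /* the first symbol gets all zeros */
    int i;

    for (i = 0; i < count; i++) {
        if (i > 0) {
            code += 1;
            if (len[i] > len[i - 1])
                code <<= len[i] - len[i - 1];   /* code gets longer */
            else
                code >>= len[i - 1] - len[i];   /* code gets shorter */
        }
        codes[i] = code;
    }
}
```

For the lengths 2, 3, 3, 1 (symbols 'B', 'D', 'C', 'A') this yields the values 0, 2, 3 and 1, i.e. the bit patterns 00, 010, 011 and 1.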
The Deflate algorithm in essence attaches a counting sort algorithm to this algorithm, feeding in the symbols in increasing code length order. Oh, don't worry if you don't understand what the counting sort has to do with this. I just wanted to give you some idea about it if you some day read the deflate specification or the gzip source code.
Actually, the decoder doesn't necessarily need to know the Huffman codes at all, as long as it has created the proper internal representation of the Huffman tree. I developed a special table format which I used in the C64 Huffman decode function and may present it in a separate article someday. The decoding works by just going through the tree by following the instructions given by the input bits as shown in the example Huffman decode code. Each bit in the input makes us go to either the 0-branch or the 1-branch. If the branch is a leaf node, we have decoded a symbol and just output it, return to the root node and repeat the procedure.
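As an illustration of such a tree walk, here is a minimal sketch using a made-up node table; it is not the special table format mentioned above. Each node has one entry per branch, where a non-negative value is the index of another inner node and a negative value is a leaf holding the symbol -(value), so symbol 0 cannot be represented in this toy layout. The tree is the one from the example ('A' = 1, 'B' = 00, 'D' = 010, 'C' = 011) and the input bits are given as a string of '0' and '1' characters to keep the example short.

```c
/* Hypothetical node layout: child[0] is the 0-branch, child[1] the
   1-branch. Values >= 0 are inner node indices, values < 0 are leaves. */
struct Node {
    int child[2];
};

static const struct Node tree[] = {
    { { 1, -'A' } },        /* node 0 (root, C3): 0 -> C2,  1 -> 'A' */
    { { -'B', 2 } },        /* node 1 (C2):       0 -> 'B', 1 -> C1  */
    { { -'D', -'C' } }      /* node 2 (C1):       0 -> 'D', 1 -> 'C' */
};

/* Decode a string of '0'/'1' characters into out (NUL-terminated);
   returns the number of decoded symbols. */
int huffman_decode(const char *bits, char *out)
{
    int count = 0;
    int node = 0;                       /* start at the root */

    while (*bits) {
        int next = tree[node].child[*bits++ == '1'];

        if (next < 0) {                 /* leaf: output, back to the root */
            out[count++] = (char)-next;
            node = 0;
        } else {
            node = next;
        }
    }
    out[count] = '\0';
    return count;
}
```

The bit string 100011010 decodes to "ABCD".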
A technique related to Huffman coding is Shannon-Fano coding. It works by first dividing the symbols into two groups of equal probability (or as close to equal as possible). These groups are then further divided until there is only one symbol left in each group. The algorithm used to create the Huffman codes is bottom-up, while the Shannon-Fano codes are created top-down. Huffman coding always generates an optimal code among the codes that assign a whole number of bits to each symbol; Shannon-Fano sometimes uses a few more bits.
There are also ways of modifying the statistical compression methods so that we get nearer to the entropy. In the case of 'A' having the probability 0.75 and 'B' 0.25 we can decide to group several symbols together, producing a variable-to-variable code.

    "AA"  0.5625  0
    "B"   0.25    10
    "AB"  0.1875  11

If we separately transmit the length of the file, we get the above probabilities. If a file has only one 'A', it can be encoded as length=1 and either "AA" or "AB". The entropy of the source is H = 0.8113, and the average code length per source symbol is approximately L = 0.8214 (on average 1.4375 code bits for every 1.75 source symbols), which is much better than L = 1.0, which we would get if we used a code {'A','B'} = {0,1}. Unfortunately this method also expands the number of symbols we have to handle, because each possible source symbol combination is handled as a separate symbol.
Arithmetic coding does not have this restriction. It works by representing the file by an interval of real numbers between 0 and 1. When the file size increases, the interval needed to represent it becomes smaller, and the number of bits needed to specify that interval increases. Successive symbols in the message reduce this interval in accordance with the probability of that symbol. The more likely symbols reduce the range by less, and thus add fewer bits to the message.
    1                                       Codewords
    +-----------+-----------+-----------+
    |           |8/9     YY |  Detail   |<- 31/32   .11111
    |           +-----------+-----------+<- 15/16   .1111
    |     Y     |           | too small |<- 14/16   .1110
    |2/3        |     YX    | for text  |<- 6/8     .110
    +-----------+-----------+-----------+
    |           |           |16/27 XYY  |<- 10/16   .1010
    |           |           +-----------+
    |           |     XY    |           |
    |           |           |    XYX    |<- 4/8     .100
    |           |4/9        |           |
    |           +-----------+-----------+
    |           |           |           |
    |     X     |           |    XXY    |<- 3/8     .011
    |           |           |8/27       |
    |           |           +-----------+
    |           |     XX    |           |
    |           |           |           |<- 1/4     .01
    |           |           |    XXX    |
    |           |           |           |
    |0          |           |           |
    +-----------+-----------+-----------+

As an example of arithmetic coding, let's consider the example of two symbols X and Y, with probabilities 2/3 and 1/3. To encode a message, we examine the first symbol: if it is an X, we choose the lower partition; if it is a Y, we choose the upper partition. Continuing in this manner for three symbols, we get the codewords shown to the right of the diagram above. They can be found by simply taking an appropriate location in the interval for that particular set of symbols and turning it into a binary fraction. In practice, it is also necessary to add a special end-of-data symbol, which is not represented in this simple example.
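The interval narrowing in the diagram can be reproduced with exact integer arithmetic. The sketch below is only meant to illustrate how the interval shrinks; it is not a practical arithmetic coder, because the numbers overflow after a few dozen symbols, and the function name and interface are invented for this example. The interval is kept as [low/den, high/den) where den is a power of three.

```c
/* Compute the interval [low/den, high/den) for a message consisting of
   'X' (probability 2/3) and 'Y' (probability 1/3) symbols. */
void arith_interval(const char *msg, long *low, long *high, long *den)
{
    long lo = 0, hi = 1, d = 1;

    for (; *msg; msg++) {
        long split;

        /* Rescale to thirds so that the 2/3 split point is an integer */
        lo *= 3;
        hi *= 3;
        d *= 3;
        split = lo + 2 * (hi - lo) / 3;
        if (*msg == 'X')
            hi = split;                 /* lower partition */
        else
            lo = split;                 /* upper partition */
    }
    *low = lo;
    *high = hi;
    *den = d;
}
```

For the message XYX this gives [12/27, 16/27), which is the cell between 4/9 and 16/27 in the diagram and contains the codeword .100 = 1/2.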
This explanation may not be enough to help you understand arithmetic coding. There are a lot of good articles about arithmetic compression on the net, for example by Mark Nelson.
Arithmetic coding is not practical on the C64 for many reasons. The biggest one is speed, especially for adaptive arithmetic coding. A close second is of course memory.
    int GetByte()
    {
        int index = GetUnaryCode();

        return mappingTable[index];
    }

The main idea is to have a table containing the symbols in descending probability order (rank order). The message is then represented by the table indices. The index values are in turn represented by a variable-length integer representation (these are studied in the next article). Because more probable symbols (smaller indices) take fewer bits than less probable symbols, on average we save bits. Note that we have to send the rank order, i.e. the symbol table, too.

    "BAAAAAADBBABBBBBAAADABCD"                      total: 24*8=192 bits

    Rank Order: A (11)  B (9)  D (3)  C (1)         4*8=32 bits
    Unary Code: 0       10     110    1110

    "100000001101010010101010100001100101110110"    42 bits
                                                    total: 74 bits

The statistics rank the symbols in the order ABDC (most probable first), which takes approximately 32 bits to transmit (we assume that any 8-bit value is possible). The indices are represented as a code {0, 1, 2, 3} = {0, 10, 110, 1110}. This is a simple unary code where the number of 1-bits before the first 0-bit directly gives the integer value. The first 0-bit also ends a symbol. When this code and the rank order table are combined in the decoder, we get the reverse code {0, 10, 110, 1110} = {A, B, D, C}. Note that in this case the code is very similar to the Huffman code we created in a previous example.
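The encoding side of this scheme fits in a few lines. The following sketch is an illustration only: the function name is made up, the rank order is passed in as a string instead of being computed from statistics, and the bits are written as '0' and '1' characters so that the result is easy to inspect.

```c
#include <string.h>

/* Encode msg as unary-coded rank indices. rank lists the symbols in
   descending probability order. The bits are written into bits[] as a
   NUL-terminated string of '0'/'1' characters. Returns the number of
   bits, or -1 if msg contains a symbol that is not in rank. */
int rank_encode(const char *msg, const char *rank, char *bits)
{
    int n = 0;

    for (; *msg; msg++) {
        const char *p = strchr(rank, *msg);
        int index;

        if (p == NULL)
            return -1;
        /* index 1-bits followed by a terminating 0-bit */
        for (index = (int)(p - rank); index > 0; index--)
            bits[n++] = '1';
        bits[n++] = '0';
    }
    bits[n] = '\0';
    return n;
}
```

With the rank order "ABDC" the example message encodes into the 42 bits shown above.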
    set w = NIL
    loop
        read a character K
        if wK exists in the dictionary
            w = wK
        else
            output the code for w
            add wK to the string table
            w = K
    endloop
    Input string: /WED/WE/WEE/WEB

    Input   Output  New code and string
    /W      /       256 = /W
    E       W       257 = WE
    D       E       258 = ED
    /       D       259 = D/
    WE      256     260 = /WE
    /       E       261 = E/
    WEE     260     262 = /WEE
    /W      261     263 = E/W
    EB      257     264 = WEB
    <END>   B

A sample run of LZW over a (highly redundant) input string can be seen in the diagram above. The strings are built up character-by-character starting with a code value of 256. LZW decompression takes the stream of codes and uses it to exactly recreate the original input data. Just like the compression algorithm, the decompressor adds a new string to the dictionary each time it reads in a new code. All it needs to do in addition is to translate each incoming code into a string and send it to the output. A sample run of the LZW decompressor is shown below.
    Input code: /WED<256>E<260><261><257>B

    Input   Output  New code and string
    /       /
    W       W       256 = /W
    E       E       257 = WE
    D       D       258 = ED
    256     /W      259 = D/
    E       E       260 = /WE
    260     /WE     261 = E/
    261     E/      262 = /WEE
    257     WE      263 = E/W
    B       B       264 = WEB

The most remarkable feature of this type of compression is that the entire dictionary reaches the decoder without ever being transmitted explicitly. The decoder builds the dictionary as part of the decoding process.
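Both directions can be sketched in C on memory buffers. This is a minimal illustration, not a tuned implementation: the names, the 4096-entry dictionary limit and the slow linear dictionary search are all assumptions made for the example, and the codes are kept in an int array instead of being packed into bits. Each dictionary entry stores the code of the string w and the character K that extends it. The decoder also handles the well-known special case of a code that is not in its dictionary yet, which does not occur in the sample runs above.

```c
#define MAX_CODES 4096      /* assumed dictionary limit */

static int prefix[MAX_CODES];           /* code of the string w */
static unsigned char suffix[MAX_CODES]; /* character K, entry = wK */

/* Compress n bytes of in[] into codes[]; returns the number of codes. */
int lzw_compress(const unsigned char *in, int n, int *codes)
{
    int next = 256, count = 0, w, i, j;

    if (n == 0)
        return 0;
    w = in[0];
    for (i = 1; i < n; i++) {
        int k = in[i];

        for (j = 256; j < next; j++)    /* does wK exist? */
            if (prefix[j] == w && suffix[j] == k)
                break;
        if (j < next) {
            w = j;                      /* yes: w = wK */
        } else {
            codes[count++] = w;         /* no: output the code for w */
            if (next < MAX_CODES) {     /* add wK to the string table */
                prefix[next] = w;
                suffix[next] = (unsigned char)k;
                next++;
            }
            w = k;
        }
    }
    codes[count++] = w;                 /* flush the last string */
    return count;
}

/* Write the string for code into out; returns its length. */
static int expand(int code, unsigned char *out)
{
    unsigned char stack[MAX_CODES];
    int sp = 0, len;

    while (code >= 256) {               /* collect characters backwards */
        stack[sp++] = suffix[code];
        code = prefix[code];
    }
    stack[sp++] = (unsigned char)code;
    len = sp;
    while (sp > 0)
        *out++ = stack[--sp];
    return len;
}

/* Decompress count codes into out[]; returns the number of bytes. */
int lzw_decompress(const int *codes, int count, unsigned char *out)
{
    int next = 256, o = 0, old, i;

    if (count == 0)
        return 0;
    old = codes[0];                     /* the first code is a plain byte */
    out[o++] = (unsigned char)old;
    for (i = 1; i < count; i++) {
        int code = codes[i], len;

        if (code < next) {
            len = expand(code, out + o);
        } else {                        /* not defined yet: old + its 1st char */
            len = expand(old, out + o);
            out[o + len] = out[o];
            len++;
        }
        if (next < MAX_CODES) {         /* add old string + first new char */
            prefix[next] = old;
            suffix[next] = out[o];
            next++;
        }
        o += len;
        old = code;
    }
    return o;
}
```

Compressing the 15-byte string /WED/WE/WEE/WEB produces the ten codes of the sample run, and decompressing them recreates the original string.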
See also the article "LZW Compression" by Bill Lucier in C=Hacking issue 6 and "LZW Data Compression" by Mark Nelson mentioned in the references section.
There are no shortcuts in understanding data compression. Some things you only understand when you try them out yourself. However, I hope that this article has given you at least a vague grasp of how different compression methods really work.
I would like to send special thanks to Stephen Judd for his comments. Without him this article would've been much more unreadable than it is now. On the other hand, that's what the magazine editor is for :-)
The second part of the story is a detailed talk about pucrunch. I also go through the corresponding C64 decompression code in detail.