The Burrows-Wheeler Block Sorting Algorithm

A high-quality implementation of the Burrows-Wheeler block-sorting-based lossless compression algorithm is available at http://www.cs.man.ac.uk/arch/people/j-seward/bzip-0.21.tar.gz

Mark Nelson wrote an excellent article "Data Compression with the Burrows-Wheeler Transform" for Dr. Dobb's Journal, September 1996. A copy of the article is at http://www.dogma.net/markn/articles/bwt/bwt.htm

What follows is another introduction, written by Sampo Syreeni (tmaaedu@nexus.edu.lahti.fi).

The Burrows-Wheeler block sorting compression algorithm is described in "A Block-sorting Lossless Data Compression Algorithm" by M. Burrows and D.J. Wheeler, dated May 10, 1994. A PostScript copy of this paper has been made available by Digital on the Systems Research Center (SRC) FTP site at ftp://ftp.digital.com/pub/DEC/SRC/research-reports/SRC-124.ps.Z

The method was originally discovered by one of the authors (Wheeler) back in 1983, but had not been published before the 1994 report. As such, the method is fairly new and has not yet gained wide popularity.

The method described in the original paper is really a composite of three different algorithms: the block-sorting main engine (a lossless, very slightly expansive preprocessor), the move-to-front coder (a simple, fast, locally adaptive, noncompressive byte-for-byte coder) and a simple statistical compressor (first-order Huffman is mentioned as a candidate) that eventually does the actual compression. Of these three methods only the first two are discussed here, as they constitute the heart of the algorithm. Combined, these two algorithms form a completely reversible (lossless) transformation that - with typical input - skews the first-order symbol distribution to make the data more compressible with simple methods. Intuitively speaking, the method transforms slack in the higher-order probabilities of the input block (thus making them more even, whitening them) into slack in the lower-order statistics. This effect is what is seen in the histogram of the resulting symbol data.

The block-sorting preprocessor operates purely on a block basis. One way to understand the idea is to think of the input block as a circular array where, for every symbol, the succeeding symbols are used as a predictor. This predictor is used to group symbols with similar right-hand neighbors together, and it is realized (conceptually) as a two-phase process. The first phase forms all cyclic shifts of the input block, whose size is usually a power of two. Note that the original string is always present intact on some row of the resulting matrix. If the block length is n, then there exist n unique rotations of the original string (to the left), and these rotations are viewed as the rows of an n x n matrix of symbols. The second phase sorts this conceptual matrix, bringing the rows into order based on their first few symbols. If some string is commonly repeated in the input block (the original paper gives "the" as an example), the sorting phase brings all the rotations that begin with part of this string very close to each other. The symbol preceding each occurrence of the common string is then found in the last column of the sorted matrix. In this way common strings cause short bursts of just a few distinct characters to form in the last column of the matrix. That last column is what the second phase outputs. One further piece of information is derived from the input data: an integer with enough bits to index any row of the matrix (that is, log_2(n) bits), recording the row into which the original input block was sorted. This integer always expands the data, but it is necessary to be able to successfully decompress the string. The absolute amount of overhead grows only as the logarithm of the input block size, so its percentage of the output becomes negligible at useful block sizes anyway.
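
To make the two phases concrete, here is a minimal Python sketch of the conceptual transform (the function name and representation are mine, not the paper's); it builds all n rotations, sorts them, and returns the last column together with the row index of the original string:

    def bwt_forward(block):
        # Naive O(n^2 log n) illustration of the conceptual matrix;
        # real implementations never materialize the rotations.
        n = len(block)
        # Phase 1: form all n cyclic left shifts of the input block.
        rotations = [block[i:] + block[:i] for i in range(n)]
        # Phase 2: sort the conceptual n x n matrix row-wise.
        rotations.sort()
        # Output: the last column, plus the row where the original landed.
        last_column = "".join(row[-1] for row in rotations)
        return last_column, rotations.index(block)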

The characteristics of the transformation make the sorted output ideal for certain kinds of further manipulation. The extreme local fluctuations in the first-order statistics of the output string call for a transformation that exploits and flattens them. The best example (and, of course, the one given in the original paper) is move-to-front coding. This coder codes a symbol as the number of distinct symbols seen since the symbol's last occurrence. In practice this means that the coder outputs the index of each input symbol in a dynamic LIFO stack and then updates the stack by moving that symbol to the top. This is easy and efficient to implement and results in fast local adaptation. As just a few common symbols will (locally) dominate the input to the coder, these symbols are kept near the top of the stack and the output consists mainly of small numbers. This makes it highly amenable to first-order statistical compression methods, which are, in any case, easy and efficient to implement.
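
A move-to-front coder over the byte alphabet can be sketched in a few lines of Python (the stack is conventionally initialized to 0..255 in order; the names here are mine):

    def mtf_encode(data):
        stack = list(range(256))        # recency stack, 0..255 in order
        out = []
        for byte in data:               # data is a bytes object
            index = stack.index(byte)   # distinct symbols since last occurrence
            out.append(index)
            stack.pop(index)            # move the symbol to the top...
            stack.insert(0, byte)       # ...so recent symbols get small codes
        return out

    def mtf_decode(codes):
        stack = list(range(256))
        out = bytearray()
        for index in codes:             # the decoder mirrors the encoder's
            byte = stack.pop(index)     # stack updates exactly, so the
            out.append(byte)            # transformation is lossless
            stack.insert(0, byte)
        return bytes(out)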

The transform matrix described above would require an enormous amount of storage and would not, as such, yield a usable algorithm. The method can, however, be realized very efficiently by suffix-sorting and quicksort methods, as sketched below. Thus the whole transformation, together with the eventual simple compression engine, is extremely fast but still achieves impressive compression on typical input data. When implemented well, the speeds achieved can be on the order of pure LZ coders while the compression ratios still approach those of state-of-the-art Markov-modeling coders. The engine also responds well to increasing block sizes - the longer the input block, the more room there is for patterns to form and the more similar input strings it will contain. This results in almost monotonically increasing compression ratios even as the block length goes well into the megabyte range.
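
As one sketch of the suffix-sorting idea: if a unique sentinel smaller than every other symbol is appended to the block, the suffix order of the extended block coincides with its rotation order, and the row index is not even needed (the sentinel marks the end). This assumes chr(0) never occurs in the input; a real implementation would use a dedicated suffix-array construction rather than Python's comparison sort with slice keys:

    def bwt_by_suffix_sort(block):
        s = block + "\x00"              # unique, smallest terminator
        # Sorting suffix start positions orders the rotations of s.
        suffixes = sorted(range(len(s)), key=lambda i: s[i:])
        # The last-column symbol of each row precedes its suffix;
        # for i == 0 this wraps around to the sentinel itself.
        return "".join(s[i - 1] for i in suffixes)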

The decompression cascade is basically just the compression cascade run backwards. More logic is needed to reverse the main sorting stage, however. This logic involves reasoning about the relationship between the first and the last columns of the conceptual coding matrix. The reader is referred to the original paper for an in-depth treatment of the subject; it also contains a more thorough discussion of why the method works and how to implement it.
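
The key observation is that a stable sort of the last column reproduces the first column, and the sort permutation links each row to the row holding the next rotation of the original string. A minimal Python sketch of the reversal (inverting only the block sort; the names are mine):

    def bwt_inverse(last_column, index):
        n = len(last_column)
        # Stably sorting positions of the last column yields the first
        # column; T[row] is the link that walks the rotation chain.
        T = sorted(range(n), key=lambda i: last_column[i])
        out = []
        row = index
        for _ in range(n):
            row = T[row]
            out.append(last_column[row])
        return "".join(out)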

And now a little demonstration. The original block to be compressed is chosen to be the (rather pathological) string "good, jolly good". It was taken as an example because it is highly redundant and exactly 16 bytes long. The first picture shows the cyclic shifts (rotations) of the input string. The second shows the matrix after sorting. Note that the last column now has many doubled characters in it, and that the original string has landed in row 5 (the sixth row). The third picture shows the output for this input block. The index integer has been packed into a full byte, although 4 bits would suffice in this case (log_2(16) = 4). The fourth and fifth pictures show the transformed string after move-to-front coding, in decimal and in hexadecimal. The sixth picture shows the statistical distribution of the characters in the output string. Notice the disproportionately large number of zeros and ones, even with a string as short as this. This is the output that is then routed through the simple statistical encoder; it should compress very well, as the distribution of the characters in the input block is now very uneven.

0 1 2 3 4 5 6 7 8 9 A B C D E F        0 1 2 3 4 5 6 7 8 9 A B C D E F
-------------------------------        -------------------------------
0 | g o o d ,   j o l l y   g o o d    0 |   g o o d g o o d ,   j o l l y
1 | o o d ,   j o l l y   g o o d g    1 |   j o l l y   g o o d g o o d ,
2 | o d ,   j o l l y   g o o d g o    2 | ,   j o l l y   g o o d g o o d
3 | d ,   j o l l y   g o o d g o o    3 | d ,   j o l l y   g o o d g o o
4 | ,   j o l l y   g o o d g o o d    4 | d g o o d ,   j o l l y   g o o
5 |   j o l l y   g o o d g o o d ,    5 | g o o d ,   j o l l y   g o o d
6 | j o l l y   g o o d g o o d ,      6 | g o o d g o o d ,   j o l l y
7 | o l l y   g o o d g o o d ,   j    7 | j o l l y   g o o d g o o d ,
8 | l l y   g o o d g o o d ,   j o    8 | l l y   g o o d g o o d ,   j o
9 | l y   g o o d g o o d ,   j o l    9 | l y   g o o d g o o d ,   j o l
A | y   g o o d g o o d ,   j o l l    A | o d ,   j o l l y   g o o d g o
B |   g o o d g o o d ,   j o l l y    B | o d g o o d ,   j o l l y   g o
C | g o o d g o o d ,   j o l l y      C | o l l y   g o o d g o o d ,   j
D | o o d g o o d ,   j o l l y   g    D | o o d ,   j o l l y   g o o d g
E | o d g o o d ,   j o l l y   g o    E | o o d g o o d ,   j o l l y   g
F | d g o o d ,   j o l l y   g o o    F | y   g o o d g o o d ,   j o l l

     1. The shifts                    2. In lexicographic order


                                        121,45,101,112,0,1,36,0,
 "y,dood  oloojggl",5                   2,110,1,0,109,107,0,3,5

3. The output from block sort          4. After move-to-front coding


                                     00: 4;  01: 2;  02: 1;  03: 1;
  79,2D,65,70,0,1,24,0,              05: 1;  24: 1;  2D: 1;  65: 1;
  2,6E,1,0,6D,6B,0,3,5               6B: 1;  6D: 1;  6E: 1;  70: 1;
                                     79: 1

   5. In hexadecimal                      6. The statistics
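
Given the sketches above, the whole demonstration can be reproduced, and losslessness checked, with a few lines; the exact MTF numbers follow from the stack being initialized to 0..255 in order:

    block = "good, jolly good"
    last, index = bwt_forward(block)
    print(last, index)              # y,dood  oloojggl 5      (picture 3)
    codes = mtf_encode(last.encode("ascii"))
    print(codes)                    # 121, 45, 101, 112, ...  (picture 4,
                                    #  without the appended index byte)
    print([hex(c) for c in codes])  # 0x79, 0x2d, 0x65, ...   (picture 5)
    assert bwt_inverse(last, index) == block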
