Note also that compressing all of the data together as a tar file makes compression worse. As mentioned, BWT is poorly suited for mixed data types.
A straightforward sort using a fast comparison-based algorithm such as quicksort can be very slow on highly redundant or compressible data. Quicksort requires on average O(n log n) string comparisons, but a worst-case string comparison is O(n), as in the case of a block of identical bytes. This would result in O(n² log n) run time. There are several ways to avoid the problem of long common prefixes.
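To make the cost concrete, here is a minimal sketch (not from the text) of the naive approach: computing the BWT by sorting every rotation of the block with ordinary string comparison. On a block of identical bytes, every comparison scans nearly the whole string, which is exactly the slowdown described above.

```python
# Naive BWT: sort all rotations, output the byte preceding each sorted
# rotation. Correct but slow when rotations share long common prefixes.
def bwt_naive(s: bytes) -> bytes:
    n = len(s)
    # Sort rotation start indices by the full rotated string.
    order = sorted(range(n), key=lambda i: s[i:] + s[:i])
    # The BWT output is the last byte of each sorted rotation.
    return bytes(s[(i - 1) % n] for i in order)

print(bwt_naive(b"banana"))  # b'nnbaaa'
```

Specialized suffix-sorting algorithms avoid rebuilding and recomparing these long prefixes.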
precomp is a program by Christian Schnaader that searches its input for segments of data compressed in deflate (zip) format and uncompresses them. This can improve compression if the (now larger) data is compressed with a better algorithm. Many applications and formats use deflate compression internally, including PDF, PNG, JAR (Java archive), ODT (OpenOffice), and SWF (Shockwave Flash).
Eventually this fails because any noise (which is not compressible) in the prediction is added to noise in the sample with each pass. (Noise has a uniformly distributed spectrum, so its high frequency components are amplified by delta coding.) Noise can come either from the original data or from quantization (rounding) errors during sampling. These are opposing sources: decreasing the number of quantization levels removes noise from the original data but adds quantization noise.
A predictive filter is a transform which can be used to compress numeric data such as audio, images, or video. The idea is to predict the next sample, and then encode the difference (the error) with an order 0 model. The decompresser makes the same sequence of predictions and adds them to the decoded prediction errors. Better predictions lead to smaller errors, which generally compress better.
An algorithm description consists of a network of components (each making a prediction), a program that computes the bytewise contexts for each component, and an optional program that post-processes the output. Both programs are written in a language called ZPAQL which is compiled or interpreted by the decompresser. If the post-processing program is present, then it is appended to the front of the first uncompressed segment and compressed along with its input data. If not, then the data is compressed directly with a one byte header to indicate its absence.
A benchmark for the Calgary corpus is given below for versions of PAQ from 2000 to Jan. 2010 showing major code changes. About 150 intermediate versions with minor improvements have been omitted. Older programs marked with * were benchmarked on slower machines such as a 750 MHz Duron and have been adjusted to show projected times on a 2.0 GHz T3200, assumed to be 5.21 times faster. Sizes marked with a D use an external English dictionary that must be present during decompression. The size shown does not include the dictionary, so it is artificially low. However, including it would give a size artificially high because the dictionary is not extracted from the test data. All versions of PAQ are archivers that compress in solid mode, exploiting similarities between files. Decompression time is about the same as compression time.
The high compression ratio (and slow speed) of PAQ comes from using many different context models for different types of data. These are described in historical order.
zip and gzip take an option -1 through -9 to select compression level at the expense of speed. All options produce compressed data in deflate format which decompresses at the same speed (much faster than compression) with the same algorithm. The difference is that with the higher options, the compressor spends more time looking for encodings that compress better. A typical implementation will keep a list of 3 byte matches (the shortest allowed) in a hash table and test the following data to find the longest match. With a higher option, the compressor will spend more time searching. It is also sometimes possible to improve compression by encoding a literal even if a match is found, if doing so results in a longer match starting at the next byte. Such testing also increases compression time. Some deflate optimizers perform an extreme level of optimization like this. Compressed sizes and compression times on a 2.0 GHz T3200 are shown below for the 14 file Calgary corpus.
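The same level trade-off is exposed by zlib, which implements deflate. A short sketch showing that the levels change only the encoder's effort, not the format:

```python
# zlib's deflate levels trade compression time for size; every level
# produces a stream the same inflate algorithm decodes.
import zlib

data = b"abracadabra " * 5000
fast = zlib.compress(data, 1)   # level 1: quick, greedy matching
best = zlib.compress(data, 9)   # level 9: more searching, lazy matching

# Both streams decode identically with the one decompression routine.
assert zlib.decompress(fast) == zlib.decompress(best) == data
print(len(fast), len(best))
```

On redundant input like this, level 9 produces output at most as large as level 1, at the cost of more encoding time.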
Match models are used in PAQ and ZPAQ. They consist of a rotating history buffer and a hash table mapping contexts to pointers into the buffer. In ZPAQ, a pointer to the match is maintained until a mismatching bit is found. The model will then look for a new match at the start of the next byte. On each byte boundary, the buffer is updated with the modeled byte and the hash table is updated so that the current context hash points to the end of the buffer. ZPAQ allows both the hash table size and buffer size to be user specified (as powers of 2). For best compression, the history buffer should be as large as the input (if this is practical) and the hash table size is typically 1/4 of this. Because each pointer is 4 bytes, both data structures use the same amount of memory.
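A bytewise sketch of the bookkeeping (the context order and hashing here are assumptions for illustration, not ZPAQ's actual layout): a table maps each context hash to the buffer position that followed that context, and the model predicts that history will repeat.

```python
# Toy match model: map each ORDER-byte context to the position of the
# byte that followed it, then predict the byte after the most recent
# occurrence of the current context.
ORDER = 4  # context length in bytes (assumed for this sketch)

def predict_next(history: bytes) -> int:
    table = {}  # context hash -> position of the byte that followed it
    for pos in range(ORDER, len(history)):
        table[hash(history[pos - ORDER:pos])] = pos  # latest wins
    pos = table.get(hash(history[-ORDER:]), -1)
    return history[pos] if pos >= 0 else -1          # -1: no match

# "abcd" was followed by "e" earlier, so in context "abcd" predict "e".
assert predict_next(b"abcdefg_abcd") == ord("e")
```

In the real model the predicted bit is fed to the arithmetic coder with a confidence that grows with the current match length, and the pointer is kept until a bit actually mismatches.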
ZPAQ fixes DELTA at 1/2 but LIMIT is configurable to 4, 8, 12, ..., 1020. The following table shows the effect of varying LIMIT for an order 0 model on 10⁶ digits of π (stationary) and orders 0 through 2 on the 14 file Calgary corpus concatenated into a single data stream (nonstationary). Using a higher order model can improve compression at the cost of memory. However, direct lookup tables are not practical for orders higher than about 2. The order 2 model in ZPAQ uses 134 MB memory. The higher orders have no effect on π because the digits are independent (short of actually computing π).
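A hedged sketch of the kind of count-limited update that LIMIT tunes (the exact ZPAQ update rule is not shown in this excerpt; this is the standard form): each bit moves the probability toward the observation by error/count, and capping the count at LIMIT keeps the step size from shrinking to zero.

```python
# Count-limited adaptive bit model: small LIMIT adapts fast (good for
# nonstationary data) but gives a noisy estimate; large LIMIT converges
# to the long-run frequency (good for stationary data).
def train(bits, limit):
    p, count = 0.5, 1
    for bit in bits:
        count = min(count + 1, limit)
        p += (bit - p) / count   # move prediction toward the bit
    return p

stationary = [1, 0] * 500            # true probability of a 1 is 0.5
p_small = train(stationary, 4)       # oscillates around 0.5
p_large = train(stationary, 1024)    # settles very close to 0.5
print(p_small, p_large)
```

This matches the table's pattern: on the stationary π digits a large LIMIT helps, while on the nonstationary concatenated corpus a smaller LIMIT tracks the changing statistics better.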
The idea of using shorter codes for recent matches can be extended. One such compressor builds a dictionary (a hash table) of pointers into the history buffer as usual to find matching strings, but instead of coding the offset, it codes the index into the table. The decompresser builds an identical hash table from the data it has already decompressed, then uses the index to find the match. The length is coded as usual.
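The key invariant is that both sides derive the same table from the same already-processed data, so a small table index is enough to locate a match. A toy sketch (table size, hash, and minimum match length are assumptions for illustration):

```python
# Both compressor and decompresser hash every MIN_MATCH-byte string of
# the processed data into the same table of positions, so transmitting
# a slot index lets the decompresser resolve the match position itself.
TABLE_SIZE = 4096   # assumed table size
MIN_MATCH = 3       # assumed minimum match length

def slot_of(s: bytes) -> int:
    # Deterministic toy hash so both sides agree.
    return int.from_bytes(s, "big") % TABLE_SIZE

def build_table(history: bytes):
    table = [0] * TABLE_SIZE
    for pos in range(len(history) - MIN_MATCH + 1):
        table[slot_of(history[pos:pos + MIN_MATCH])] = pos  # latest wins
    return table

history = b"hello world, hello"
# Compressor transmits the slot; decompresser, having built the same
# table from the same history, resolves it to the match position.
assert build_table(history)[slot_of(b"hel")] == 13
```

The table index is typically much smaller than a raw offset into a large history buffer, which is where the coding gain comes from.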