N/2 will produce output, and the interval will be expanded.) In every state [k, N) we include the maximally unbalanced subdivision (at k -t- 1), which corresponds to values of Prob{MPS) between (N- 2)/N and (N- 1)/N. We include a nearly balanced subdivision so that we will not lose efficiency when Prob{MPS) . In addition, we locate other subdivision points such that the subinterval expansion that follows each input symbol leaves the coder in a state of the form [k, N), and we choose one or more of them to correspond to intermediate values of Prob{M PS). For simplicity we denote state [k,N) by k. We always allow the interval [k, N) to be divided at k -t- 1; if the LPS occurs we output the log N bits of k and move to state 0, while if the M PS occurs we simply move to state k-t-l, then if the new state is N/2 we output a 1 and move to state 0. The other permitted subdivisions are given in the following table; it may not be necessary to include all of them in the coder. Range of states k ' 8 IN N Subdivision LPS MPS LPS input Output Next State Output M PS input Next State 1 0 0 2k 00 4k Of 4k - - ff 4k fff 8k 11 0 For example, the fourth line indicates that for all states k for which 3N/8 _ k < N/2, we may subdivide the interval at 5N/8. If the kPS occurs, we perform the follow- on procedure twice, which leaves us with the interval [4k - 3N/2, N); otherwise we output a 1 and expand the interval to IN/4, N). A coder constructed using this procedure will have a small number of states, but in every state it will allow us to use estimates of Prob{MPS) near 1, near 1/ , and in between. Thus we can choose a large N so that highly probable events require negligible code length, while keeping the number of states small enough to allow table lookups rather than arithmetic. In practice, however, use of this design procedure is > 1/2. cample: We show the e-partition for e: 0.05 bit per binary input symbol. > 0 and all k _> 0. to the probabilities can cause global changes to the optimal tree [15,19,36,73,74]. Al- though Huffman codes are optimal among prefix codes, the expected code length of a Huffman code is the same as the entropy only if all the probabilities are powers of 1/2. 2.3.2 Golomb and Rice codes A simpler procedure, due to Golomb [22], gives prefix codes that are suboptimal but very easy to implement. Golomb codes are distinguished from each other by a sin- gle parameter, so dynamic updating is accomplished by estimating the value of the parameter. In Section 2.3.3 we show how this can be done effectively. Golomb codes are used to encode symbols from a countable alphabet. The symbols are arranged in descending probability order, and non-negative integers are assigned to the symbols, beginning with 0 for the most probable event. To encode integer n using the Golomb code with parameter m, we first compute [n/mJ and output this integer using a unary code. Then we compute n rood m and output this value using a binary code, adjusted as described above so that we sometimes use [1%2 mJ bits and sometimes [1%2 m] bits. Rice coding, developed independently by Rice [53,54,55], is the same as Golomb coding except that only a subset of the parameter values may be used, namely the powers of 2. The Rice code with parameter k is exactly the same as the Golomb code with parameter m = 2 . Hence to encode integer n using the Rice code with parameter k, we first compute [n/2 J and output this integer using a unary code. > '2 is inactive, P3's processing is shifted to P2. Events TT and 12 are finished in the last time step. the completion register. If any codes are complete, the corresponding processors no longer produce output; output from the remaining processors is shifted accordingly by a prefix operation. This code can be decoded in parallel: each decoding processor can determine when the code for its event is complete, and the same concurrent write and prefix operation can determine where the remaining processors are to obtain their next bits. When all processors have completed their output, a new event is assigned to each processor. If there is one processor for each event, this method makes good use of available processors. The difficulty lies in the time-consuming prefix operation that must be performed after every time step in which a processor finishes. Reallocation coding If the number of processors is much less than the number of events, we can reduce the number of prefix operations by allowing each processor to begin working on another event as soon as it has completed an event. First we distribute the r, events equally among the p processors. For analysis we assume that the allocation is random. During processing, we follow the bit-transpose protocol of Section 2.4.1: at each time step each processor writes a single bit to a location that will be known to both encoder and decoder. When a processor finishes outputting the code for one event, it goes on to the next event allocated to it, even though other processors are still working on earlier events. Code lookup is assumed to be fast: the codes are small enough that the full code can be stored in each processor's memory, or perhaps distributed among small groups of nearby processors. When one processor finishes all its allocated events, it indicates this fact by writing 1 to the completion register. When the completion register is 1, the output process is interrupted. At this time the events allocated > H, we use one bit to encode OUT-OF-RANGE, and one bit to encode ABOVE-RANGE. Then we use a Rice code to encode the non-negative integer P - H - 1. The coding parameter k is chosen according to the method of Section 2.3.3. (d) We update the statistics used in selecting the Rice coding parameter. The decoding algorithm involves simply reversing step 3, decoding the in-range/out- of-range and above-range/below-range decisions, branching accordingly, and adjusting the decoded numbers to obtain the value of P. Implementation and refinements The basic algorithm of Section 3.5.1 is very easy to implement. As described it encodes and decodes very quickly and gives good compression. The implementation requires very little memory, only enough to store one scan line of input data and a few counts for each of the few hundred possible contexts. In this section we describe several enhancements that can be made to improve speed and compression. One possibility is to freeze the parameter choice for a context after a number of symbols have been encoded. This saves the time needed to maintain the cumulative counts, and does not hurt compression much since in practice the parameter selection algorithm converges quickly to the best value. A second enhancement is to adjust the range [L, H] when L -- H. In this special case, the in-range probability tends to be somewhat smaller than 1/2, and it makes sense to use the range [L- 1, H-F 1]. We can do this either unconditionally or based on the number of times that the value to be encoded is equal to L (and H) when A _- 0. We choose the second possibility, adjusting the interval when the values encoded have fallen out of the one-value "range" more than 2/3 of the time. We might also consider balance coding, adjusting the range for each context to adaptively balance the in-range and out-of-range probabilities. We can use the parameter estimation algorithm of Section 2.3.3 to choose the amount of adjustment. One finn] useful refinement is to assign a small initial penalty to the cumulative code lengths for small values of the parameter k in each context, to prevent their accidental use in contexts where the probability distribution is fiat; such use can greatly increase the code length for a single pixel. We have considered periodic count scaling to exploit locality of reference within an image. We tried applying it to the cumulative code lengths in each context. Halving all code lengths when the smallest one reaches 1024 can improve compression by up to 0.3 percent, but encoding time increases by about 10 percent, too much of a time penalty to pay for such a small increase in compression. Finally, we note that Golomb coding gives only a marginal improvement over Rice coding despite the wider range of model parameters available. The extra compression is typically less than one percent, and the time required almost doubles because of the need to maintain cumulative bit counts for more possible parameter values. Therefore > k! for all k _> 2, we have L}s > L}z > for k _> 2. If we have a file with t symbols in it and we add another occurrence of a symbol that already occurs c times, we increase Lss (((t + )/t/)- ) + ((t + )), log2((t + 1)/(c + 1)). Since t > c and log2(((x + 1)/x) x) is an increasing function for > 0, we have log2(((t q- 1)It) t) > log2(((c q- 1)/c)C), so L$$ increases by more than Finally, if k: 1 (the file consists of repeated occurrences of a single letter), Lss: LSD = O. [] Static semi-adaptive codes have been widely used in conjunction with Huffman coding. They are appropriate in this case: in Huffman coding we wish to avoid changing weights since changing weights often requires changing the structure of the coding tree. Our theorem does not contradict Shannon's theorem [67]; he discusses only the best static code. Encoding the model If we assume that all symbol distributions are equally likely for a given file length t, the cost of transmitting the exact model statistics is = log2(number of possible distributions) n-1 Example: For our short sample file, t =8 and n = 3, so LM = 1% 2 (12 ) = 1% 2 45 5.492. [] A similar result appears in [4] and [8]. For a typical file LM is only about 2560 bits, or 320 bytes. The assumption of equally-likely distributions is not very good for text files; in practice we can reduce the cost of encoding the model by 50 percent or more by encoding each of the counts using a suitable encoding of the integers, such as Fibonacci coding [17]. Strictly speaking we must also encode the file length t before encoding the model; the cost is insignificant, between 1%2 t and 2 1%2 t bits using an appropriate encoding of integers [13,69,75]. 4.1.2 Adaptive codes Adaptive codes use a continually changing model, in which we use the frequency of each symbol up to a given point in the file as an estimate of its probability at that point. We can encode a symbol using arithmetic coding only if its probability is nonzero, but in an adaptive code we have no way of estimating the probability of a symbol before it has occurred for the first time. This is the zero-frequency problem, discussed in detail in [4] and [76]. For large files with small alphabets and simple models, all solutions to this problem give roughly the same compression. In this section we adopt > q -- o if p < q- l/A; 5)- + ?x) if p < q; o ifp_>q. We need a simple lemma from integral calculus. Lemma 3 If g(x) is monotone increasing, then c-1 ) < j=0 If g(x) is monotone decreasing, then c-1 ) < fo j=O - 9( ) + 9(0). Proof: The increasing case is obvious. In the decreasing case, the integral for any unit interval is greater than the value at the right end of the intervah jf+l g(j:) dJ: > g(j - 1). We obtain the 1emma by adding g(j)- g(j + 1) to both sides and summing over j. [] Lemma 4 The code length 1 for all occurrences of a single symbol in a block in ED order is less than I + Az + Au. Proof: We show that 1 < Su _< Iu + Au = IL + A z + Au. Since Su represents the code length for the symbol if all occurrences of the symbol come as late as possible in an ED order, we have 1 < Su. If p < q- i/A, then -log2pE(x ) is monotone increasing, so by Lemma 3 we have Su < Iu. Ifp > q- i/A, then -log2pE(x ) is monotone decreasing, so by Lemma 3 we have Su < Iu + 1%2 p (c) - 1%2 p (0): Iu + Au. If p = q- i/A, then Su = Iu. In all three cases, Su _< Iu + Au. From the definition of Az, Iu = Iz, + Az. [] We now relate I to the entropy of the beginning and ending probability distribu- tions. We write I (i) to differentiate the values of I evaluated for different symbols a Lemma 5 Let LH = B (i_-}-]H(R) i fH(Q)). Then LH = (i). Proof: By appropriate substitutions, we have i=1 i=1 1- (-ri log= ( f (qi - ri)) .B (1_ f) i=1 The last sum is 0 because Q nd R re probability d bufions so i=l qi 1. The letorim follows from the definition of H(.). > q-1/A. The sum i=l(/ki q-/ku) is maximized when as many symbols as possible have large c and small s; the sum's largest possible value cannot exceed its value when s = ra and ymbo whi& ,,) + 0((, ). W obtain the result by setting m = 1 and summing over the symbols. [] The proof of the upper bound in Theorem 10 follows from Lemma 6 by summing over all blocks. (We are neglecting any special effects of a longer first block or shorter last block.) There is much cancellation because H(R) of one block is H(Q) of the next. Proof of the lower bound In the following proof of the lower bound of Theorem 10 we append a prime to the label of each definition and lemma to show the correspondence with the definition and lemmas used in the proof of the upper bound. Definition 2' A block of length B containing el, C2,... , C k occurrences of symbols al, a2,..., ak, respectively, has the reverse ED property if for each symbol ai and for all m, 1 _< ra _< c , the symbol occurs for the ruth time not before position (m- 1)B/c Lemma 2' Every distribution of symbol counts has an order with the reverse ED prop- erty. Proof: By Lemma 2 there is always an order with the ED property. Such an order, when reversed, has the reverse ED property. [] Lemma 3' Ifg(x) is monotone decreasing, then c-1 s(J) > fo j=O If g(x) is monotone increasing, then c-1 s(J)> fo j=O > > I . Since I represents the code length for the symbol if all occurrences of the symbol come as early as possible in a reverse ED order, we have 1 > L. If p > q, then --log2pB(x ) is monotone decreasing, so by Lemma 3' we have > I . If p < q, then -1og2pB(x) is monotone increasing, so by Lemma 3' we > + - = - rf. = q, th , = three cases, we have _> I - A . [] Lemma 6' Let Lj be the compressed length of block j. Then i H(R) f H(Q))-k Lj_> B 1-f 1-f ' Proof: From Lemmas 4' and 5, L 5 _> L H -- i=1 ZikL( )' Asymptotics when f = 1/2 give = 1 - O(c/s), which has maximum value 1, so the sum is at most k. [] The proof of the lower bound in Theorem 10 follows from Lemma 6' by summing over all blocks, noting that b = t/B and m -- 1. Non-scaling corollary By letting B = t, f = n/(t + n), and m = 1 in Lemmas 6 and 6', we obtain the code length without scaling: Corollary 1 When we do not scale at all, the code length L vs satisfies: tHfinal q- zl(Hfinal- H0) - k <( LNS < tHfinal q- zl(Hfinal- H0) q- k ]og2(t/k ). We can get important insights by contrasting upper bounds in this corollary and Theorem 10. Scaling will bring about a shorter encoding by tracking the block-by- block entropies rather than matching a single entropy for the entire file, but when we forgo scaling the overhead is less, proportional to log= t instead of to t. Scaling will do worst on a homogeneously distributed file, but even then the overhead will increase the code length by only about (k/B) log=(B/k) bits per input symbol, less than 0.1 bit per symbol for a typical file. We conclude that the benefits of scaling usually outweigh the minor inefficiencies it sometimes introduces. >
Back to Index The Design and Analysis of Efficient Lossles Data Compression Systems
by Paul Glor Howard
B.S., Massachusetts Institute of Technology, 1977
Digital Library Home