#!/usr/bin/env python
"""
The Walrus Binary Entropy Coder

Arithmetic coders do their job so well that they're nearly synonymous
with what they accomplish: optimal coding efficiency and limited
latency, with an API that neatly separates anticipating the
probabilities of symbols from encoding them as bits.

The Walrus coder does the same job as a binary arithmetic coder--
"binary" meaning having an input alphabet of just two symbols.  The
Walrus has very close to optimal coding efficiency, the same latency,
and follows the same API as arithmetic coders.  But the Walrus maps
symbols to bits differently: at each step it designates one of the two
symbols "walrus" and the other "eggman."  The walrus is assigned an
available bit string, which is then either output (if walrus is sent)
or made unavailable (if eggman is sent).

This file is "literate executable pseudocode" in Python.  Familiarity
with arithmetic and Huffman coding will be helpful.

Contents

    Abstract (above)
    Contents
    Introduction
    The Use Pattern in Python
        sender and receiver
        Encoder and Decoder
        Latency
    The Coders
        Arithmetic_coder
        Walrus_coder
    Prefix-Subset Codes Include Arithmetic and Walrus Coders
        Fractional Bits
        Possible Uses of Prefix-Subset Codes
    Appendix: Correctness Test
    References

A separate Jupyter notebook [N] does some analysis of the Walrus's
efficiency both empirically and theoretically.

Introduction

The Walrus coder operates within the "use pattern" or API of
arithmetic coders, laid out here.  In practice the conceptual parts
discussed are sometimes combined, and parts shared between the encoder
and decoder are duplicated.  But after this introduction, the
conceptually separate parts are implemented as separate Python
procedures or classes, and each shared part is implemented once and
actually shared.
The pattern is broken into levels, and at each level the encoding and
decoding sides are written to be as parallel to each other as
possible, in order to make clear two important things at each level:
the constraints placed on each part (just what has to be accomplished
just when, with only certain information), and how the decoding side
can stay in sync with the encoding side and finally reconstruct the
original information (especially given that symbols don't correspond
simply to substrings of coded bits).

Scare quotes introduce terms used here.  A "sender" program wants to
get some data to a matching "receiver".  The sender feeds the
"message," one "symbol" or "actual symbol" at a time, to an "encoder".
With so-called "binary" coders the symbols are out of an alphabet of
two, {A, B}, but not called bits (since generally they aren't equally
likely).  The encoder transforms the message into a stream of "bits"
or "encoded bits" which is conveyed to the "decoder," which performs
the reverse transformation, yielding the original message as symbols
for the receiver.

    ======= sender =======        ======= receiver ======

            model        =           model        coder-neutral
              ^                        ^
   message in |                        | "actual"
              v      encoded stream in v
    symbols --> encoder --> bits --> decoder --> symbols
    {A, B}        ^        {0, 1}      ^
                  |                    |
                  v                    v
            coder        =           coder        model-neutral

Sender and receiver contain copies of the "model", a stateful object
that gives a new expectation "p_A", a probability that the next symbol
will be A rather than B, at each point in the message.  The copies of
the model are "synchronized": even though there is varying time lag
between the two, if we lay the sequence of symbol-encoding and
symbol-decoding steps side by side, the two copies of the model give
the same new p_A to the encoder and decoder at the beginning of each
step, and are updated with the same "actual" (as opposed to expected
or anticipated) symbol at the end of the step.
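To make the model's half of this pattern concrete, here is a minimal
adaptive model of the kind described above.  It is a hypothetical
illustration, not part of the Walrus code, and for simplicity it
returns p_A as a float rather than the fixed-point integer the coders
below actually use:

```python
class AdaptiveModel(object):
    """Toy model: a Laplace (add-one) estimate of p_A, plus a fixed
    message length so that end_of_message() is well defined."""

    def __init__(self, message_length):
        self.counts = {"A": 1, "B": 1}  # Add-one smoothing.
        self.remaining = message_length

    def end_of_message(self):
        return self.remaining == 0

    def get_p_A(self):
        # Estimated probability that the next symbol is "A".
        return float(self.counts["A"]) / (self.counts["A"] + self.counts["B"])

    def update(self, symbol):
        assert symbol in {"A", "B"}
        self.counts[symbol] += 1
        self.remaining -= 1
```

Because both sides construct the model with the same arguments and
feed it the same actual symbols, the sequence of p_A values matches
step for step, which is exactly the synchronization property described
above.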
As the sender and receiver share the model, so the encoder and decoder
share identical copies of a stateful "coder," initialized and
manipulated in identical sequences on the two sides.  The state of an
arithmetic coder is like a slice of pie; the Walrus's state is like a
decimated pan of brownies.

In this scheme, the model knows nothing about the particular coder,
and the coder's knowledge of the model is limited to the current p_A
value.

The pattern so far (up to the coders) is illustrated with Python code
in the next section.  There's a short discussion of latency, then
plug-compatible arithmetic and walrus coders in the section after
that.

The Use Pattern in Python

Our sender will read a file of (ASCII) letters, A's and B's, and write
a file of (ASCII) 1's and 0's.  The receiver does the reverse.  Each
is given already open input and output file handles (actually Python
file-like objects), and initialized model and coder objects.  Writing
and reading the encoded file is done inside the encoder and decoder
objects, respectively.

The model has the responsibility to know when the message ends.
(Suffice it to say that other ways are problematic.)
"""

def sender(message_file, encoded_file, model, coder):
    encoder = Encoder(encoded_file, coder)
    while not model.end_of_message():
        symbol = message_file.read(1)  # Read *one* symbol.
        assert symbol in {"A", "B"}    # Unexpected end-of-file would be "".
        encoder.encode(symbol, model.get_p_A())
        model.update(symbol)
    message_file.close()
    encoder.close()

def receiver(encoded_file, message_file, model, coder):
    decoder = Decoder(encoded_file, coder)
    while not model.end_of_message():
        symbol = decoder.decode(model.get_p_A())
        message_file.write(symbol)
        model.update(symbol)
    decoder.close()
    message_file.close()

"""
Part of the beauty of the arithmetic coder (emulated by the Walrus) is
how ideas about coding are cleanly separated from knowledge about
regularities in the data, which is handled by the application and
conveyed to the coder as explicit probabilities.

The model is initialized the same way on both sides and updated as the
message progresses.  Clearly, in order to lay odds, the model must
know about the sort of information the next symbol represents-- the
sender and model are in sync or even duplicating effort this way.  In
actual compression cases (e.g. not just sending a preexisting text
file) the message may actually be an encoding of a data structure (for
instance a picture) in memory in the sender, to be reconstructed on
the receiver.  It's easy to imagine a "model" working with the data
structure directly.  But in any case the encoder's model must make the
same predictions as the decoder's-- as if it were being built up by
the same code from the same message-so-far on both sides, as we do
literally here.

Even readers familiar with arithmetic coding may find it strange to be
limited to the minimal {A, B} alphabet.  The reason here is
simplicity, although such coders are actually used (e.g. for bitmaps).
Of course any alphabet or set of choices can be represented by a
binary tree, which doesn't have to be balanced, as long as the
probabilities for the two sides of the current branch can be
calculated.
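For example (a sketch invented for this discussion, not taken from the
coders below), a four-symbol alphabet can be fed through a binary
coder by walking a two-level tree, computing each branch probability
from the leaf probabilities:

```python
def quaternary_to_binary_choices(symbol, probs):
    """Reduce one choice among {'a', 'b', 'c', 'd'} to two binary
    (A/B) choices on a balanced tree: {'a', 'b'} on the left branch,
    {'c', 'd'} on the right.  Returns [(sub_symbol, p_A), ...] pairs
    that could be fed to a binary encoder one at a time."""
    p_left = probs["a"] + probs["b"]
    if symbol in ("a", "b"):
        first = ("A", p_left)            # "A" here means "go left".
        p_second = probs["a"] / p_left   # P(a | left branch)
        second = ("A" if symbol == "a" else "B", p_second)
    else:
        first = ("B", p_left)            # p_A is still P(left branch).
        p_second = probs["c"] / (1.0 - p_left)
        second = ("A" if symbol == "c" else "B", p_second)
    return [first, second]
```

The same idea extends to unbalanced trees; skewed leaf probabilities
just show up as skewed branch probabilities, which is exactly what the
binary coder is built to exploit.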
The parallel structure of the sender and receiver sets up the proof
that the sequence of p_A values is the same on both sides: Note that
the model on both sides gives its p_A before the encoder or decoder is
called, and is updated with the current symbol afterwards.  The first
value of p_A fed to the encoder and decoder is the same.  Assuming the
model, encoder, and decoder work, if the sequence of p_A's has been
the same up to symbol n, then the model will be updated with symbol n
on both sides, the n+1'th p_A will be the same on both sides, and
Bob's your uncle by induction.

Encoder and Decoder are Python classes; see [P] for a quick summary of
Python's syntax for class and method definitions, object
initialization, and method calls.
"""

class Encoder(object):
    def __init__(self, encoded_file, coder):
        self.encoded_file = encoded_file
        self.coder = coder  # Initialized by the caller.

    def encode(self, symbol, p_A):
        self.coder.set_p_A(p_A)
        self.coder.update(symbol)
        prefix = self.coder.pop_common_prefix()
        self.encoded_file.write(prefix)

    def close(self):
        self.encoded_file.write(self.coder.pop_shortest_prefix())
        self.encoded_file.close()

class Decoder(object):
    def __init__(self, encoded_file, coder):
        self.encoded_file = encoded_file
        self.coder = coder  # Initialized by the caller.
        self.buffer = ""

    def decode(self, p_A):
        self.coder.set_p_A(p_A)
        while True:
            symbol = self.coder.recognize_prefix(self.buffer)
            if symbol != None:
                break
            bit = self.encoded_file.read(1)  # Read one bit.
            assert bit in {"0", "1"}
            self.buffer += bit               # Append bit to the buffer.
        self.coder.update(symbol)
        prefix = self.coder.pop_common_prefix()  # Pop prefix from coder.
        assert self.buffer.startswith(prefix), (prefix, self.buffer)
        self.buffer = self.buffer[len(prefix):]  # Pop prefix from buffer.
        return symbol

    def close(self):
        remains = self.coder.pop_shortest_prefix()
        assert self.buffer == remains, \
            "buffer = %r, remains = %r" % (self.buffer, remains)
        self.buffer = ""
        self.encoded_file.close()

"""
The sender and receiver step through the message calling encode() or
decode(), respectively, once per symbol.  The encode() method outputs
zero, one, or more bits after knowing the symbol, whereas decode() may
have bits already in its buffer, and may read more bits into its
buffer before recognizing the symbol.

The encode() of a symbol actually only ensures that unambiguous
evidence of that symbol eventually comes out.  The encoder makes a
decision about future bitstream possibilities that it abides by in
later steps.  The encoder may encode thousands of symbols, narrowing
and narrowing its future choices without emitting a single bit.  In
the diagram below, the encoder processes s[i] and s[i+1] before
emitting bits bbb.

    encode:  ...  s[i]  s[i+1]  s[i+2]  s[i+3]
    bits:    ...        bbb     ...     ...
    decode:  ...                s[i]    s[i+1]  s[i+2]

The encoder and decoder see the symbol stream and bit stream, but with
varying and opposite time offsets.  The encoder knows a symbol before
it outputs the corresponding bits, but the decoder must often get bits
to decode symbol i that the encoder hadn't yet output after processing
symbol i.  Focusing on the sequence of symbols (and encode/decode
calls) it's possible to think of the decoder as "reading ahead" in the
bits, but since the encoder must output bits before the decoder can
read them, we'll focus on the bits and say that the decoder follows
behind the encoder like a detective.  At a given step, decode() may
already have bits, and may need to read more bits that were sent
during the encoder's later steps, from which the decoder reconstructs
what encode() decided at this step.  To emphasize: for a given symbol,
decode() may deal with bits that *were in the future* for encode() of
the same symbol.
In the diagram above bits bbb are output during the encoding of
s[i+1], but the decoder must read them before decoding s[i].  On
reading a bit, the decoder may suddenly be able to decode thousands of
symbols before reading another bit.

Latency

The coder is given a limit on how many bits ahead it can plan.  The
decoder's buffer can hold up to that many bits before it's guaranteed
to decode a symbol, but it must decode the current symbol before it
gets any further p_A values.  So the coders on the two sides must make
the same definite distinction given only the p_A's up to this point,
and actual symbols up to the previous point, in the message.  This is
necessary for the use pattern to work at all, separate from any desire
to limit latency.

If the coder is restricted to never making decisions more than n
output bits ahead, then the latency is limited to n *bits*, but during
the time to encode and output those bits, a potentially long string of
symbols (up to 2^n) could be encoded.  On receiving the last of the
bits, the decoder would then have up to 2^n symbols to deliver.

The Coders

With the exception of recognize_prefix(), the encoder and decoder make
the identical sequence of method calls to the coder, with the
following effects:

set_p_A(p_A):
    Based on p_A from the model, divide future possible bitstreams
    into two disjoint sets, one for A and one for B.

recognize_prefix(bit_string):  (called in the decoder only)
    If the given bit string can only be the beginning of a bitstream
    in one of the two sets, identify which set, otherwise return None.
    Don't change state.

update(symbol):
    Eliminate the set of future bitstream possibilities that don't
    correspond to symbol.

pop_common_prefix():
    If there is a sequence of bits that the bitstream must now start
    with, return it, and consider it no longer in the future
    bitstream.

pop_shortest_prefix():
    Find the shortest bit string available.  This lets the sender
    flush the last symbols of a message.
    At the beginning of a message, or after any update() with a
    walrus symbol, this is the empty string.

Normally the model informs the sender when the message is over, the
sender tells the encoder to close, and the encoder outputs this
string, which is enough for the decoder to recognize any
not-yet-recognized symbols, at which point the receiver's model will
realize this is the end of the message, the receiver will tell the
decoder to close, and the decoder checks its buffer against this
shortest prefix.

More unexpected endings would typically leave some "sent" symbols
undecodable even if the receiver detected end-of-file.  If the sender
were to send the shortest prefix in an attempted abort, or even just
zeroes out to the end of a byte, the receiver might recognize that as
a symbol the sender didn't intend, or even as the last symbol(s) of an
unintended complete message.  One way to allow for more graceful
aborts would be to send occasional abort-or-not symbols with very low
p_A in the original message.

What follows is working code for an arithmetic coder and a walrus
coder.  In both cases it's no longer short, sweet, and
easy-to-understand "pseudocode," but long and gnarly.  As mentioned
earlier, coders are usually integrated into encoders and decoders, or
even into senders and receivers, rather than separated out this way.
The coders have testing and debugging features woven into them.  Both
have been tested empirically for both correct behavior and coding
efficiency.  Graphs comparing their coding efficiency are in an
accompanying Jupyter notebook [N].
"""

# The p_A value (probability that the next symbol is A) is represented by
# an integer to sidestep issues of how to write repeatable floating-point
# code.  The coders here both have a configurable word length to allow
# testing its effect on efficiency.

P_A_NBITS = 30
P_A_STOP = 1 << P_A_NBITS
P_A_HALF = P_A_STOP // 2  # 50% probability (integer division, exact).
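# A model working in floating point can quantize its probability into
# this fixed-point form.  The helper below is a small illustrative
# sketch (not part of the coders; the test appendix at the end does
# the same conversion inline):

```python
P_A_NBITS = 30            # Repeated here so the sketch stands alone.
P_A_STOP = 1 << P_A_NBITS

def float_to_int_p_A(p):
    """Round a float probability to the coders' integer representation,
    refusing values the coders can't encode: the result must land
    strictly between 0 and P_A_STOP."""
    int_p_A = int(p * P_A_STOP + 0.5)
    assert 0 < int_p_A < P_A_STOP, "p_A %r out of encodable range." % p
    return int_p_A
```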
CODER_NBITS_DEFAULT = 30

# Both coders deal with bit strings expressed as ASCII strings.

def int_to_ascii_bits(i, n_bits):
    """ Return a zero-padded string of n_bits '0' and '1' characters. """
    return "{:0{}b}".format(i, n_bits)

def ascii_bits_to_int(bits):
    if len(bits) == 0:
        return 0
    else:
        return int(bits, base=2)

def xor(a, b):  # Exclusive-OR of ints, just to make the code clearer.
    return a ^ b

"""
Arithmetic_coder

Often binary arithmetic coders have adaptive coding built in; neither
of the two coders here is adaptive.

For Arithmetic_coder, the two possible-output sets are adjacent "lo"
and "hi" pie slices delineated by three integer variables:

    self.range_start < self.start_hi < self.range_stop

(following the Python convention where a "start" is the first value of
a range of integers and a "stop" is one past the last value).

Arithmetic_coder is written with scant attention to speed or
tradition, as a parallel implementation to Walrus_coder's of the
pedantic API used here, and to have coding efficiency representative
of good arithmetic coders.

The "avoid_straddle" feature handles a coding efficiency issue that
(to this author's knowledge) all arithmetic coders face: avoiding or
escaping the situation where the pie slice is a narrow range around
the division between 011111... and 100000...
"""

class Arithmetic_coder(object):
    def __init__(self, coder_nbits=CODER_NBITS_DEFAULT, avoid_straddle=True):
        self.nbits = coder_nbits
        self.stop = 1 << self.nbits
        self.middle = self.stop // 2  # Center of the overall state space.
        self.p_A_nbits = P_A_NBITS
        self.range_start = 0          # The state is always a range;
        self.range_stop = self.stop   # initially it's the whole pie.
        self.start_hi = None          # None when split point hasn't been set.
        self.avoid_straddle = avoid_straddle

    def set_p_A(self, p_A):
        assert p_A == int(p_A)
        assert 0 < p_A < P_A_STOP
        assert self.range_start < self.middle < self.range_stop
        assert self.start_hi == None
        width = self.range_stop - self.range_start
        # The avoid_straddle heuristic:
        # "Always put the narrower slice on the narrow side of self.middle."
        if self.avoid_straddle and xor(
                p_A < P_A_HALF,
                self.middle - self.range_start
                < self.range_stop - self.middle):
            p_A = P_A_STOP - p_A
            self.lo_sym, self.hi_sym = "B", "A"
        else:
            self.lo_sym, self.hi_sym = "A", "B"
        # 61-bit intermediate results here.  Integer division keeps the
        # arithmetic exact and repeatable.
        width_lo = (width * p_A * 2 + 1) // (P_A_STOP * 2)
        width_lo = max(1, min(width - 1, width_lo))
        self.start_hi = self.range_start + width_lo
        assert self.range_start < self.start_hi < self.range_stop

    def recognize_prefix(self, ascii_bits):
        assert self.start_hi != None
        shift = self.nbits - len(ascii_bits)
        bits = ascii_bits_to_int(ascii_bits)
        bits_start = bits << shift
        bits_stop = (bits + 1) << shift
        assert self.range_start <= bits_start and bits_stop <= self.range_stop
        if bits_stop <= self.start_hi:
            return self.lo_sym
        elif bits_start >= self.start_hi:
            return self.hi_sym
        else:
            return None

    def update(self, symbol):
        assert symbol in {"A", "B"}
        assert self.start_hi != None
        if symbol == self.lo_sym:
            self.range_stop = self.start_hi
        else:
            self.range_start = self.start_hi
        self.start_hi = None

    def pop_common_prefix(self):
        assert self.start_hi == None
        last = self.range_stop - 1  # Last valid value of the range.
        # How many identical initial bits?
        if self.range_start == last:
            n_common = self.nbits
        else:
            different_bits = xor(self.range_start, last)
            n_common = int_to_ascii_bits(different_bits, self.nbits).find('1')
        if n_common == 0:
            common_prefix = ""
        else:
            common_prefix = int_to_ascii_bits(last, self.nbits)[:n_common]
            mask = (1 << (self.nbits - n_common)) - 1  # The uncommon bits.
            self.range_start = (self.range_start & mask) << n_common
            self.range_stop = ((last & mask) + 1) << n_common
        assert self.range_start < self.middle < self.range_stop
        return common_prefix

    def pop_shortest_prefix(self):
        """ Find as short as possible a prefix that puts us in the
        range we're supposed to be in.  I.e. such that
        self.range_start <= prefix + "0000...", and
        prefix + "1111..." < self.range_stop.  Reset the coder state. """
        assert self.range_start < self.middle < self.range_stop
        if self.range_start == 0 and self.range_stop == self.stop:
            return ""
        n = 1
        while True:
            d = self.stop >> n
            prefix = self.middle - d      # n = 1 => 00000000...
                                          # n = 2 => 01000000...
                                          # n = 3 => 01100000...
            if prefix >= self.range_start:
                break
            prefix = self.middle + d - 1  # n = 1 => 11111111...
                                          # n = 2 => 10111111...
                                          # n = 3 => 10011111...
            if prefix < self.range_stop:
                break
            n += 1
        self.range_start = 0
        self.range_stop = self.stop
        return int_to_ascii_bits(prefix, self.nbits)[:n]

"""
Walrus_coder

The Walrus is just being himself, but the eggman carries a bucket of
eggs.  You might say that in the coding process, the Walrus climbs
down the infinite binary tree of possible bit strings.  Once for each
symbol, whether a bit string is output or not, either the "walrus"
subtree or everything else is trimmed off.

The coder keeps a limited-depth picture of what remains of the tree in
a list called self.prefixes.  The pop_common_prefix() method checks
whether there is only one choice left at the top level, and as long as
that's true, climbs the coder's viewpoint downward.

Walruses are chosen so that the coder has at most one prefix to
remember for each available length (corresponding to a remaining
untouched subtree at that depth), and if a prefix of the needed length
isn't listed, a higher-level subtree can be broken up to make one,
while still leaving at most one (other) prefix for each length.
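That breaking-up step can be illustrated on its own (a sketch written
for this note, mirroring the split logic inside set_p_A() below rather
than code taken from it):

```python
def split_ancestor(ancestor, walrus_len):
    """Split the subtree at `ancestor` until a prefix of length
    walrus_len exists.  Returns (walrus_prefix, new_table_entries);
    each intermediate length gains exactly one sibling prefix, so the
    one-prefix-per-length invariant is preserved."""
    prefix = ancestor
    new_entries = {}
    for i in range(len(ancestor) + 1, walrus_len + 1):
        new_entries[i] = prefix + "1"  # The sibling stays available...
        prefix += "0"                  # ...and we descend on the '0' side.
    return prefix, new_entries
```

For example, split_ancestor("", 3) yields the walrus prefix "000"
while leaving "1", "01", and "001" available, one per length.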
""" # The choose_walrus_len() function is used both by the walrus coder and # to analyze it, so that the code tested is the same code that is analyzed. def choose_walrus_len(p_lps, lps, mps, coder_nbits, width, shortest_len): """ Given model information p_lps -- probability of the least-probable symbol as an int, lps -- least probable symbol ("A" or "B") mps -- most probable symbol and information about the coder and its state coder_nbits -- maximum length of prefixes width -- summarizes current state of the coder, as if you took sum(2 ** (coder_nbits - len(p)) for p in available_prefixes) shortest_len -- length of shortest prefix currently available choose the prefix length (but not the actual prefix), and symbol to become the walrus. Return walrus_prefix_len, walrus_symbol, eggman_symbol """ max_width = 1 << coder_nbits assert width >= (max_width >> shortest_len), (width, shortest_len) assert (max_width >> shortest_len) > width / 2, (width, shortest_len) # Convert p_lps to lps_width_x2 and clip to within this range: # 2 <= lps_width_x2 <= (width - 1) * 2 # 61-bit intermediate results here. lps_width_x2 = int((p_lps * width * 2) / P_A_STOP) lps_width_x2 = max(2, min(lps_width_x2, (width - 1) * 2)) # Find prefix len and corresp. width nearest to .5 * lps_width_x2: divider = 3 for nearest_len in range(coder_nbits, 0, -1): if lps_width_x2 < divider: break divider <<= 1 nearest_width = max_width >> nearest_len # The shortest prefix that can actually be used is one bit. # If the shortest is "", then it always must be split into "0" and "1". shortest_usable_len = max(shortest_len, 1) shortest_usable_width = max_width >> shortest_usable_len # Consider "all but shortest usable" as if it were another prefix: all_but_width = width - shortest_usable_width # Choose the walrus symbol and the length of its prefix. 
if abs(lps_width_x2 - all_but_width * 2) <= \ abs(lps_width_x2 - nearest_width * 2): # If all_but_width fits the LPS width, then # shortest_usable_width fits the MPS width, so MPS becomes walrus: walrus_symbol = mps eggman_symbol = lps walrus_prefix_len = shortest_usable_len else: walrus_symbol = lps eggman_symbol = mps walrus_prefix_len = nearest_len return walrus_prefix_len, walrus_symbol, eggman_symbol class Walrus_coder(object): def __init__(self, coder_nbits=CODER_NBITS_DEFAULT): self.nbits = coder_nbits self.max_width = 1 << self.nbits self.p_A_nbits = P_A_NBITS # Initialize the table of available prefixes. # self.prefixes[i] == None or len(self.prefixes[i]) == i. # As a binary fraction, self.prefixes[i] has weight 2**-i. # self.prefixes[i]'s "width" is 2 ** (self.nbits - i). # By shifting, that's self.max_width >> i. self.prefixes = [""] + [None] * self.nbits self.shortest_len = 0 # shortest explicit prefix's index. self.width = self.max_width self.walrus_symbol = None def set_p_A(self, p_A): assert p_A == int(p_A) assert 0 < p_A < P_A_STOP assert 1 < self.width <= self.max_width assert self.walrus_symbol == None if p_A <= P_A_HALF: lps, mps = "A", "B" # Least Probable Symbol, Most Probable Symbol. p_lps = p_A else: lps, mps = "B", "A" p_lps = P_A_STOP - p_A # Choose symbol to become walrus, and its prefix length (walrus_len). walrus_len, self.walrus_symbol, self.eggman_symbol = \ choose_walrus_len(p_lps, lps, mps, self.nbits, self.width, self.shortest_len) # Find or make a prefix of the selected length. Separate it from # self.prefixes, because if it's the same length as one of the # remaining prefixes, there's no place for it in the table. # self.width will not be changed yet. self.shortest_len may be changed. if self.prefixes[walrus_len] != None: # An explicit prefix of the right length exists. prefix = self.prefixes[walrus_len] self.prefixes[walrus_len] = None # Remove prefix from the table. 
if walrus_len == self.shortest_len: # Removed the shortest prefix, find the next longer one. for i in range(walrus_len + 1, self.nbits + 1): if self.prefixes[i] != None: self.shortest_len = i break assert self.prefixes[self.shortest_len] != None else: # There is no explicit prefix of the required length. One or more # new prefixes must be made explicit by adding bits to the # next-shorter (== wider) prefix ("ancestor") available. for ancestor_len in range(walrus_len - 1, -1, -1): # Look backward. if self.prefixes[ancestor_len] != None: break prefix = self.prefixes[ancestor_len] self.prefixes[ancestor_len] = None # Remove ancestor from the table. if ancestor_len == self.shortest_len: self.shortest_len += 1 for i in range(ancestor_len + 1, walrus_len + 1): self.prefixes[i] = prefix + "1" # Add longer prefixes... prefix += "0" # ...but the final new prefix is not added to the table. assert self.prefixes[self.shortest_len] != None self.walrus_prefix = prefix def recognize_prefix(self, bits): assert self.walrus_symbol != None if bits.startswith(self.walrus_prefix): return self.walrus_symbol elif self.walrus_prefix.startswith(bits): return None else: return self.eggman_symbol def update(self, symbol): assert self.walrus_symbol != None walrus_len = len(self.walrus_prefix) walrus_width = self.max_width >> walrus_len if symbol == self.walrus_symbol: # Reset table to contain only the walrus prefix. self.prefixes = [None] * (self.nbits + 1) self.prefixes[walrus_len] = self.walrus_prefix self.shortest_len = walrus_len assert self.prefixes[self.shortest_len] != None self.width = walrus_width else: # The table already contains everything but the walrus prefix. 
self.width -= walrus_width self.walrus_symbol = None def pop_common_prefix(self): assert self.walrus_symbol == None common = self.prefixes[self.shortest_len] assert common != None, (self.shortest_len, self.prefixes) if common == "": return common for prefix in self.prefixes[self.shortest_len + 1:]: if prefix == None: continue while not prefix.startswith(common): common = common[:-1] # Remove the last bit ('0' or '1' char). if common == "": return common # Shift everything toward the front of the table: plen = len(common) prefixes = self.prefixes[plen:] # Shorten all the prefixes: self.prefixes = [None if p == None else p[plen:] for p in prefixes] # Add None's at the end: self.prefixes += [None] * plen self.shortest_len -= plen assert self.prefixes[self.shortest_len] != None self.width <<= plen return common def pop_shortest_prefix(self): assert self.walrus_symbol == None shortest = self.prefixes[self.shortest_len] self.__init__(coder_nbits=self.nbits) return shortest """ Prefix-Subset Codes Include Arithmetic and Walrus Coders The basic principle of arithmetic coders is easier to understand than that of walrus coders. They're called "arithmetic" because binary numbers and binary arithmetic make for a nice way to split future bitstrings into classes. The walrus coder uses one of a more arbitrary family of ways to split future bitstrings: it has a particular (or peculiar) way of pruning the future-tree. The larger family might be called "prefix-subset codes." The name contrasts them with codes like Huffman codes, where each symbol is identified with a single prefix. Both arithmetic codes and the Walrus code divide the tree in such a way that symbols generally don't map to single prefixes or subtrees. The Walrus is lopsided since it always *does* give one symbol it's own prefix. The Walrus was designed while trying to answer whether there could be codes that had similar use-pattern and performance characteristics to arithmetic codes but weren't arithmetic codes. 
The following rough outline seems to me to describe what a coder has
to be to fit the spec of an arithmetic coder:

A prefix-subset code starts with a (usually infinite) tree of possible
encoded messages and makes choices about which section(s) of the tree
represent each possible present symbol.  Once the symbol is chosen (on
the sender) or deduced (on the receiver), the coder proceeds to work
on the subset of the tree representing that symbol.  In the example of
the walrus coder, one subtree represents the walrus, and the generally
fragmented remainder of the tree represents the eggman.

More generally, a prefix-subset coder's tree doesn't have to be
binary.  For instance, the branches could represent symbols in another
encoder's alphabet, or lengths of on-or-off periods as in the dits,
dahs, pauses between them, pauses between letters, and pauses between
words of Morse code.  The tree doesn't have to be complete.  That is,
there may be parts that are either used for other purposes (as in the
frame boundaries in mp3 files) or physically impossible (as in
modulation patterns on a magnetic or optical disk track).  Different
branches might have more complicated cost functions than the length of
a binary string.

But prefix-subset codes aren't all possible codes; matching the
use-pattern and performance characteristics of arithmetic codes
creates limits.  In particular the requirement of limited latency
means the code must make distinctions only a certain distance into the
future.  That means that its working memory must describe a
limited-depth set of subtrees.  (There are finite data structures
making unlimited-depth commitments, such as, "In the final chapter the
suspect is acquitted."  Or "The output interpreted as binary is 352
mod 1009.")  Also, the model's p_A prediction must alternate with the
encoding or decoding of a symbol, in order to match the arithmetic
coder's use pattern.
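To make the family concrete, here is a degenerate member, invented for
this discussion rather than taken from the coders above: a coder that
ignores p_A and always splits the future tree at its very first bit,
giving A the "0..." subtree and B the "1..." subtree.  It follows the
same five-method API as the coders above, at the cost of spending
exactly one whole bit per symbol:

```python
class One_bit_coder(object):
    """Degenerate prefix-subset coder: one whole bit per symbol."""

    def __init__(self):
        self.bit = None  # The bit committed by update(), if any.

    def set_p_A(self, p_A):
        pass  # The split is the same no matter what p_A says.

    def recognize_prefix(self, bits):
        if bits == "":
            return None  # Could begin either set.
        return "A" if bits[0] == "0" else "B"

    def update(self, symbol):
        self.bit = "0" if symbol == "A" else "1"

    def pop_common_prefix(self):
        bit, self.bit = self.bit, None
        return bit

    def pop_shortest_prefix(self):
        return ""  # There is never a pending commitment between symbols.
```

Run through the Encoder/Decoder loops above, it encodes "ABBA" as
"0110" and back again; the arithmetic and walrus coders differ only in
choosing cleverer, p_A-dependent splits of the same future tree.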
Fractional Bits

The Walrus coder, and the prefix-subset picture, give a way to look at
how a coder can use "fractional bits": it means that the coder can
make future commitments deciding for or against options that had
probabilities other than powers of 1/2.

Possible Uses of Prefix-Subset Codes

This view of codes could be used to adapt to a medium, format, device,
or interface that imposes limits or requirements on the encoded
stream, but where general and efficient use of the medium is still
wanted.  The mp3 frame boundaries, Morse dits and dahs, and disk track
waveforms are examples.  Another is steganography, where a specialized
data format is employed to encode a more general message, and besides
encoded length, plausibility is a cost function.

For each of these applications, a model of the medium is needed; this
seems to require a more intimate integration between the medium and
coder than the "model" does on the message side.  But maybe "choose a
minimum-cost commitment that has xxx probability" and "pop the common
initial behavior string" are a suitably general API.
"""

###################### APPENDIX: CORRECTNESS TEST ######################

import pickle
import random

class TestMessageMismatch(Exception):
    pass

class TestMessageUniform(object):
    """ A combo mock-model and pseudorandom mock-message-file.

    It has model methods end_of_message(), get_p_A() and update(), and
    file methods read(), write(), and close(), as well as others.
    write(symbol) compares to the next pseudorandom symbol and raises
    an exception if there's a mismatch.

    This "Uniform" class generates sequences where p_A is constant; it
    can be subclassed for more interesting behavior.  There are hints
    for that in the comments.
""" def __init__(self, p_A, seed=None, entropy_target=None, rnd_invert=True): """ Use this to create a combined model and "open message file", for either reading (& sending) or (receiving &) writing: model = message_file = TestMessageUniform(p_A) p_A -- The probability that a symbol will be "A" rather than "B". p_A is FLOAT, contrary to p_A in the coder. seed -- The prng is already seeded by p_A, the class name, and entropy_target, but you can add more info to get multiple sequences with those parameters identical. entropy_target -- The entropy of a message is how many bits it should take to encode it assuming the model is correct. The object counts the entropy and signals end_of_message once it reaches entropy_target. """ entropy_target = int(entropy_target or 1000) self.entropy_target = entropy_target seed = (type(self).__name__, seed, p_A, self.entropy_target) self.prng = random.Random(int(pickle.dumps(seed).encode("hex"), 16)) int_p_A = int(p_A * 2 ** P_A_NBITS + .5) assert 0 < int_p_A < (1 << P_A_NBITS), \ "p_A " + str(p_A) + " out of encodable range." self.message_p_A = p_A self.rnd_invert = rnd_invert self.message_len = 0 self.entropy = 0 self.__p_A = None self.current_p_A = None self.current_symbol = None self.is_sending = False self.is_receiving = False self.__got_eom = False # Being the file and the model at the same time, this mock can # test that the sender and receiver are using the API correctly. # # Sender tests end_of_message(), then calls # both read() and get_p_A() but in either order, # then update(). # Receiver calls end_of_message(), then calls # get_p_A(), then write(), then update(), # and nobody does both read() and write(). 
        #
        # So here are the interlocks:
        #
        # method        | requires               | sets
        # --------------+------------------------+-----------------------------
        # end_of_       | entropy                | __got_eom
        #   message()   |                        |
        # --------------+------------------------+-----------------------------
        # __generate()  | current_symbol not set | current_symbol,
        #               | not sending & receiving|
        #               | not end_of_message()   |
        # --------------+------------------------+-----------------------------
        # get_p_A()     | not end_of_message()   | current_p_A, __got_eom.
        #               | calls __get__p_A(),    | Call as often as you like.
        #               | otherwise nothing.     |
        # --------------+------------------------+-----------------------------
        # __get__p_A()  | nothing                | __p_A if not set.
        # --------------+------------------------+-----------------------------
        # read()        | calls __get__p_A(),    | (indirectly) current_symbol
        #               | calls __generate()     | is_sending
        # --------------+------------------------+-----------------------------
        # write()       | p_A                    | (indirectly) current_symbol
        #               | calls __generate()     | is_receiving
        # --------------+------------------------+-----------------------------
        # update()      | current_p_A,           | entropy (affects
        #               | current_symbol         |   end_of_message())
        #               |                        | clears __p_A, current_p_A,
        #               |                        |   __got_eom, current_symbol.
        # --------------+------------------------+-----------------------------
        # close()       | current_p_A not set    | nothing
        #               | current_symbol not set |
        # --------------+------------------------+-----------------------------

    def end_of_message(self):
        self.__got_eom = True
        return (self.entropy >= self.entropy_target)

    def get_message_len(self):
        return self.message_len

    def get_entropy(self):
        """
        Return the entropy of the message so far, according to the
        model's probabilities, in bits.
        """
        return self.entropy

    def __get__p_A(self):
        """
        Override this method for non-fixed p_A.
        """
        # Keep this if-block-- it prevents re-randomizing mid-symbol.
        if self.__p_A:
            return self.__p_A

        # Replace this line when you override this method:
        self.__p_A = self.message_p_A

        # Keep this section to keep the "rnd_invert" feature.
        # Randomly flip self.__p_A to its complement.  Because
        # __generate() uses self.__p_A, the odds of A vs. B are still
        # correct.  Also, because we use self.prng, the symbol encoded
        # will match the symbol decoded later.
        #
        # Generate this random number regardless of whether it will be
        # used, to keep the prng sequence the same.
        do_invert = (self.prng.random() >= .5)
        if self.rnd_invert and do_invert:
            self.__p_A = 1.0 - self.__p_A
        return self.__p_A

    def __generate(self):
        assert not (self.is_sending and self.is_receiving), \
            "Both read() and write() called on the same object."
        assert self.current_symbol == None, \
            "read() or write() called twice without update()."
        assert not self.end_of_message(), \
            "read() or write() after end_of_message() => True."
        if self.prng.random() < self.__p_A:
            self.current_symbol = "A"
        else:
            self.current_symbol = "B"
        self.message_len += 1
        return self.current_symbol

    def get_p_A(self):
        assert self.__got_eom, \
            "end_of_message() not checked before get_p_A()."
        self.current_p_A = p_A = self.__get__p_A()
        int_p_A = int(p_A * 2 ** P_A_NBITS + .5)
        assert 0 < int_p_A < (1 << P_A_NBITS), \
            "p_A " + str(p_A) + " out of encodable range."
        return int_p_A

    def read(self, n_symbols):
        assert n_symbols == 1, \
            "Expect to read 1, not %d symbols." % n_symbols
        self.__get__p_A()
        self.is_sending = True
        return self.__generate()

    def write(self, symbol):
        assert len(symbol) == 1, \
            "receiver called write() with more than one symbol: %r" % symbol
        assert self.current_p_A != None, \
            "Did not call model.get_p_A() before message_file.write()."
        self.is_receiving = True
        correct = self.__generate()
        if symbol != correct:
            err = "%s instead of %s at position %d" % \
                (symbol, correct, self.message_len - 1)
            raise TestMessageMismatch(err)

    def update(self, symbol):
        assert self.current_p_A != None, \
            "Did not call model.get_p_A() before model.update()."
        assert self.current_symbol != None, \
            "Did not call message_file.read() or write() before model.update()."
        assert symbol == self.current_symbol, \
            "update() expected %r, got %r." % (self.current_symbol, symbol)
        if self.current_symbol == "A":
            self.entropy += -log(self.__p_A, 2)
        else:
            self.entropy += -log(1 - self.__p_A, 2)
        self.__p_A = self.current_p_A = self.current_symbol = None
        self.__got_eom = False

    def close(self):
        assert self.end_of_message() and self.__p_A == None, \
            "message_file.close() called unexpectedly."


class Old_Arith_coder(Arithmetic_coder):
    def __init__(self, *args, **kwargs):
        kwargs.setdefault("avoid_straddle", False)
        super(Old_Arith_coder, self).__init__(*args, **kwargs)


def test1(Coder_class, coder_nbits=CODER_NBITS_DEFAULT,
          lg_min_p_A=-16.0,
          encoded_name="/tmp/walrus_piece.out",
          entropy_target=None):
    print "test1(%s)" % Coder_class.__name__
    print "    P_A_NBITS =", P_A_NBITS
    print "    coder_nbits =", coder_nbits
    print
    print "%7s %9s %18s %7s" \
        % (" ", "# of ", "# of bits ", " ")
    print "%7s %9s %8s %8s %7s" \
        % ("lg(p_A)", "symbols", "ideal", "actual", "match?")
    # for lg_p_A in [(-i / 2.) for i in range(2, int(-lg_min_p_A * 2) + 1)]:
    lg_p_As = []
    i = -10
    while i / 10. >= lg_min_p_A:
        lg_p_A = i / 10.
        lg_p_As.append(lg_p_A)
        if lg_p_A > -5.0:
            i -= 1
        else:
            i -= 5
    for lg_p_A in lg_p_As:
        p_A = 2 ** lg_p_A

        # Encode symbols from the pseudorandom message_file
        # into encoded_file:
        model = message_file = \
            TestMessageUniform(p_A, entropy_target=entropy_target)
        entropy_target = model.entropy_target  # The class sets the default.
        encoded_file = open(encoded_name, "w")
        sender(message_file, encoded_file, model,
               Coder_class(coder_nbits=coder_nbits))
        message_len = message_file.get_message_len()
        entropy = message_file.get_entropy()
        enc_size = os.path.getsize(encoded_name)

        # Decode bits from encoded_file into the mock message_file
        # object, which throws an exception if the output doesn't match
        # the original pseudorandom sequence.
        model = message_file = \
            TestMessageUniform(p_A, entropy_target=entropy_target)
        try:
            encoded_file = open(encoded_name, "r")
            receiver(encoded_file, message_file, model,
                     Coder_class(coder_nbits=coder_nbits))
            missing = message_len - message_file.get_message_len()
            matches = "Yes" if missing == 0 else "%d missing" % missing
        except TestMessageMismatch, e:
            matches = str(e)
        print "%7.1f %9d %8.1f %8d %s" \
            % (log(p_A, 2), message_len, entropy, enc_size, matches)
        sys.stdout.flush()
    print


if __name__ == "__main__":
    import random
    import pickle
    from math import log
    import os
    import sys

    if len(sys.argv) < 2 or sys.argv[1] != "graphs":
        # Quick test.
        encoded_name = "/tmp/walrus_piece.out"
        lg_min_p_A = -8.0
        entropy_target = 1000
    else:
        # Generate data for walrus.ipynb graphs.
        encoded_name = "walrus_graph_data.out"
        lg_min_p_A = -16.0
        entropy_target = 10000

    for coder in Arithmetic_coder, Old_Arith_coder, Walrus_coder:
        test1(coder, coder_nbits=12, lg_min_p_A=lg_min_p_A,
              encoded_name=encoded_name, entropy_target=entropy_target)
    print "Wrote", encoded_name


"""
========================= References ===========================

This piece's home page is
    http://www.mac-guyver.com/switham/2020/11/Walrus/

[P] http://docs.python.org/2/tutorial/classes.html#class-objects

[N] walrus_effish_graphs.ipynb -- Under the home page
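As a footnote to the "Fractional Bits" section, here is a minimal
standalone sketch (not part of the Walrus code; symbol_cost() is a new
name invented for illustration) of the per-symbol entropy bookkeeping
that TestMessageUniform.update() performs: each symbol's ideal cost is
-log2 of its probability, usually a fractional number of bits, and the
costs sum to the entropy that the mock compares against entropy_target.

```python
from math import log


def symbol_cost(p_A, symbol):
    """Ideal cost in bits of one symbol under the model's p_A."""
    p = p_A if symbol == "A" else 1.0 - p_A
    return -log(p, 2)


# Eight A's at about 0.152 bits each plus two B's at about 3.32 bits
# each: this ten-symbol message costs about 7.86 bits, not 10.
p_A = 0.9
entropy = sum(symbol_cost(p_A, s) for s in "AAABAAAAAB")
```

A coder restricted to whole-bit commitments (like a Huffman code over
{A, B}) would have to spend one full bit per symbol here; using the
fractional costs is where arithmetic and Walrus coders win.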