o) We should really be looking up hashes both in our cache and in our cache
   of hashes from the peer. If a hash is only in the peer's cache, we can
   add it to ours and send it as if it were a new piece of data. This could
   either be a big speedup or a big slowdown for the encoding process; it
   depends on how often data we see from the peer turns out to be data we
   need to send back to the peer.

XCodec 0.9.0 goals:

o) Stop using hashes like names and use actual names. This abstraction will
   allow us to minimize the cost of collisions, speed lookup, etc. It also
   means that different systems will be able to use different encode/hash
   algorithms for lookup based on their requirements.
o) Expand the protocol to have different ops for referencing hashes in our
   namespace vs. those of our peer...
o) Exchange not just our UUIDs but a list of all of the UUIDs of other
   systems we're talking to, allowing us to also reference hashes in other
   namespaces that we share access to.
o) Also exchange other parameters, like the size of the backref window,
   using the minimum between the two peers.

Past ideas:

o) Create a new XCodecTag that incorporates a hash and a counter and perhaps
   other things...
o) The counter will increment for each collision (or perhaps just be a
   random number after the first collision, bailing out if there's a
   collision on the random number), and we'll add new variants of extract,
   etc., that give a counter to append to the hash to get the Tag.
o) Allow either powers of two above 128 or multiples of 128 to be usable
   chunk sizes instead of just 128. Any time we define a tag, also include a
   bitmap of the 128-byte blocks (or blocks of each size down to 128) within
   the chunk that are to be learned, so that we still deal well with
   changes. Eventually allow defining new e.g. 4K blocks based on old 4K
   blocks with a single 128-byte block difference?
o) Add a pass number to the Tag so we can do recursive encoding. Use a
   limited number of bits and put this above the opcode so that we can have
   separate back-reference windows, etc., for each pass and so that we can
   avoid escaping for subsequent passes, perhaps?
o) Deflate after recursive encoding.
o) A new encoder that can exploit all of those features, possibly keeping
   the old encoder around for applications that need low latency and high
   throughput.

To-do:

o) Add a 'count' field to the hash and allow incrementing it to handle
   collision overflow. For this it'd be nice to have an interface that
   returns a range of matches in the dictionary; putting the count at the
   end of the key makes this possible. The encoding logic would need to use
   a different OP for these that takes, say, a count or even just the full
   hash/identifier. (A sketch of such an interface follows these lists.)
o) Only have N bytes outstanding at any given time (say 128k?) and add some
   type of ACK, perhaps? This is necessary to:
   o) Write a garbage-collector for the dictionary. LRU?

Possibly-bad future ideas:

o) Incorporate run-length encoding.
o) Incorporate occasional (figure out the frequency) CRCs or such of the
   next N bytes of decoded data to make it possible to detect any hash
   mismatches, using a different hash function from any that go into the
   dictionary hash.
o) If the encoded version of a stream is larger than the source would be
   escaped, it'd be nice to just transmit it escaped and to have some way to
   tell the remote side how to pick out chunks to be taken as known to both
   parties in the future. One approach would be to send a list of offsets at
   which hashes were declared.
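
A minimal sketch of the range-of-matches dictionary interface mentioned in
the to-do item above. The names (Dictionary, Segment, hash_t) are stand-ins
rather than the real XCodec types, and std::multimap stands in for whatever
keying scheme puts the count at the end of the key so that all entries for
one hash sort adjacently:

    /*
     * Collision-overflow sketch: the dictionary may hold several segments
     * per hash, and the 'count' selects among them; walking the range of
     * matching entries is the range-of-matches interface described above.
     */
    #include <cstddef>
    #include <stdint.h>
    #include <map>
    #include <utility>

    typedef uint64_t hash_t;
    struct Segment;                 /* stand-in for the real segment type */

    class Dictionary {
        std::multimap<hash_t, Segment *> map_;
    public:
        /* Insert a segment under a hash; returns the collision index used. */
        unsigned enter(hash_t hash, Segment *seg)
        {
            unsigned count = (unsigned)map_.count(hash);
            map_.insert(std::make_pair(hash, seg));
            return count;
        }

        /* Look up a (hash, count) pair by walking the range of matches. */
        Segment *lookup(hash_t hash, unsigned count) const
        {
            std::multimap<hash_t, Segment *>::const_iterator it;
            for (it = map_.lower_bound(hash); it != map_.upper_bound(hash); ++it) {
                if (count == 0)
                    return it->second;
                count--;
            }
            return NULL;
        }
    };

The count returned by enter() is what the new OP would carry alongside the
hash; a count of zero degenerates to the existing single-match behaviour.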
%%%

Hash-set deduplication: For a given number of hashes (say 64), put an
unordered list (hash?) of each 64 hashes that are encountered into a
database. When data is encoded, check whether its 64 hashes have appeared
previously. If they have, then use a compact encoding to list the order in
which they appear and the offsets within the list at which escaped or new
data is to be inserted. Eventually extend this with one of the Computational
Biology algorithms for finding sequences that are missing an element or have
one element changed, so that we can work with deltas and offset
sequences/sets more reliably.
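
A rough sketch of the index this would need, with assumed names throughout
(HashSetIndex, set_key, and SET_SIZE are illustrative, not existing code):
every group of 64 chunk hashes is reduced to an order-insensitive key, and
on encode we ask whether the same 64 hashes have been seen before,
regardless of their order:

    /*
     * Hash-set index sketch: each group of SET_SIZE chunk hashes is keyed
     * by an order-insensitive digest and remembered along with the order
     * in which it was first seen.
     */
    #include <cstddef>
    #include <stdint.h>
    #include <algorithm>
    #include <map>
    #include <utility>
    #include <vector>

    typedef uint64_t hash_t;
    static const size_t SET_SIZE = 64;

    /* Order-insensitive key: fold the sorted hash list. */
    static hash_t set_key(std::vector<hash_t> hashes)
    {
        std::sort(hashes.begin(), hashes.end());
        hash_t key = 0;
        for (size_t i = 0; i < hashes.size(); i++)
            key = key * 1099511628211ULL + hashes[i]; /* polynomial fold (FNV prime) */
        return key;
    }

    class HashSetIndex {
        /*
         * Key -> the hash list in first-seen order; the compact encoding
         * describes the current ordering relative to this stored list.
         */
        std::map<hash_t, std::vector<hash_t> > sets_;
    public:
        void remember(const std::vector<hash_t>& hashes)
        {
            if (hashes.size() == SET_SIZE)
                sets_.insert(std::make_pair(set_key(hashes), hashes));
        }

        /* Returns the previously-seen ordering, or NULL if the set is new. */
        const std::vector<hash_t> *lookup(const std::vector<hash_t>& hashes) const
        {
            std::map<hash_t, std::vector<hash_t> >::const_iterator it;
            it = sets_.find(set_key(hashes));
            if (it == sets_.end())
                return NULL;
            return &it->second;
        }
    };

On a hit, the encoder would emit a reference to the stored set plus the
permutation mapping the stored order onto the order actually seen, along
with the offsets at which escaped or new data has to be interleaved, as
described above.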