The chemfp FPB fingerprint format is a chunk-based format. An FPB file starts with an 8 byte signature followed by a sequence of chunks. Each chunk starts with a four character chunk code followed by an 8 byte length, and then that many bytes for the data itself.
The general idea for this type of format dates back to at least the IFF format from 1985. I first came across it when I read the PNG specification. I liked it so much that I decided to model FPB along those lines.
My first attempt was in 2008(!). I've iterated a few times since then. The result is more IFF-like than PNG-like. This essay describes the reasons for those changes.
4 bytes vs. 8 bytes.
IFF and the PNG specification use an 4 byte chunk length. If the image
data doesn't fit into a single chunk then it can be split across
several chunks. Actually, quoting from the PNG spec,
chunks are allowed so that encoders can work in a fixed amount of
memory; typically the chunk size will correspond to the encoder's
buffer size. Chunks are also useful because PNG viewer can update
the image display after each chunk, even though the file hasn't been
4 bytes is only enough for 4 GB. My design goal is 100 million fingerprints. Even 1024 bit fingerprints still take up 12 GB of data. There's very little I can do with only a chunk of the entire data set, and for search performance and ease of programming I want all of the fingerprints in a single contiguous region.
That's why I used an 8 byte chunk length. (I could have used a 6 byte length instead of 8, which sets an upper limit of 256 TB. That's enough space for 275 billion 1024-bit fingerprints. While that's more than the 166 billion from GDB-17, I figured I would be decadent and use two more bytes.)
A curious thing about the PNG format, compared to most other IFF-like formats, is the 4-byte CRC, which occurs as the fourth component of the chunk, after the chunk data. Here's how the specification describes it:
A 4-byte CRC (Cyclic Redundancy Check) calculated on the preceding bytes in the chunk, including the chunk type code and chunk data fields, but not including the length field. The CRC is always present, even for chunks containing no data. See CRC algorithm.
Part of me likes this extra integrity check. The question I had to ask my self was, should I carry that idea over to the FPB format?
Why chunk CRCs instead of a file CRC?
The CRC is designed to detect certain types of transmission errors. ("Transmission" includes data storage, which is data transmission to the future.) It can, for example, detect if a burst of up to 32 zeros or 32 ones corrupted the transmission. As a trivial reduction, it can also tell if there was a single corrupted bit.
But why does each chunk have a CRC? Why not the file itself? For that I consulted the PNG mail archive. The short answer: Info-ZIP experience and Usenet.
First off, why does it have a CRC at all? Based on my reading, Greg "Cave Newt" Roelogs proposed a CRC in the first place. Roelogs became the UnZip maintainer with release 4.0, in December 1990, and became involved with PNG development in January 1995. PNG's reference compression code comes from Zip and UnZip. Zip is an archive format, and the CRC helps verify long-term integrity of the zip file.
Roelogs proposed a per-file CRC, but there was a lot of discussion
about where to put it. Ed
Why not just put a CRC as the last field of
every chunk?, and Lee Daniel
Crocker listed various advantages to a per-chunk CRC:
1. Streamability. Streaming applications that read/write chunks do not have to maintain state information across chunks. ... 2. Simpler file changes. ... 3. You can make changes to comments, gamma, and other chunks with a binary editor, and programs won't think the whole file is bad. ... 4. More data integrity. While a CRC-32 is likely to catch changes in even a very large file, it is mathematically guranteed to catch 1- and 2-bit errors in an 8k chunk (I don't the book with me right now, but I think those are the right numbers). 5. You'll never hear me argue about where to put the CRC again, and the nesting problem with future extensions is gone, and Newt gets to keep his "chunk concept". Hell, if a 0.0005 increase in file size can just cut the message traffic on this list down, it's worth it. 6. You can cut the file size issue in half by using 16-bit chunk lengths (a separate argument I lost earlier but now have more weight behind), and eliminate it entirely by doing that and using CRC-16s. I personally favor the former but not the latter, and don't really care that much either way. That will even end the continual "this is just IFF" noise on comp.graphics too.There was a bit of back-and-forth about using CRC-16 vs CRC-32 vs. something stronger, like MD5. It was Alexander Lehmann who described the appropriate engineering tradeoffs for PNG:
IFF doesn't contain any checksumming, validity checks can be done with length checking (unexpected EOF), but that is not sufficient for network (especially multipart USENET) transmission.
If an uuencoded file contains truncated lines, this goes unnoticed by length checks, unless the missing chars are contained in a header or length chunk. (unless the image data is decompressed, but I would prefer a file checker that doesn't burn excess cpu cycles just to check if the uncompressed data has the correct size)
So there you go: chunk CRCs were added to help detect transmission errors over multi-part Usenet transmissions without needing to wait until the end to detect failures.
Multi-part Usenet transmissions?
If you don't recall the glory days of Usenet, it is a distributed newsgroup system designed around text. Post sizes were limited. If you wanted to send a large image, or application binary, or directory archive, you needed to split it up into multiple parts, encode it (usually uuencoded) and post each part separately.
For example, the first public release of Python was sent to alt.sources in 20 files. The first was titled "Python 0.9.1 part 01/21", followed by 02, etc.
These could be decoded and reassembled either manually or with various automated tools. Some newsreaders could even figure this out automatically. But sometimes things went missing or were incomplete.
Given this sort of noisy environment, I think you can understand better why a chunk CRC might be useful.
Is CRC good enough?
I wondered if CRC was good enough? Should I use something stronger than CRC? How many PNG failures occur?
We know that bit errors exist. Artem Dinaburg, in Bitsquatting: DNS
Hijacking without exploitation discusses one of the effects of
single-bit errors in DNS resolution, with links to previous work
including a Verisign
paper estimating the number of network errors in DNS queries (a
bit error rate of 10-9
is about what we'd expect for
typical data communication circuits), and how to
use bit errors to break out of a virtual machine.
But those are typically taken care of by the network layer. On the other hand, Bram Cohen, in his Stanford EE380/2005 talk "Under the hood of BitTorrent" (video no longer available, or at least, I can't find it) mentioned how TCP's checksum isn't good enough once you start swapping terabytes of data over the Internet, so BitTorrent had to add additional validation.
I asked on the PNG mailing list, but no one recalled seeing an error in over 10 years.
Going back to the original design work, Roelogs commented that
logical though a per-chunk CRC might be, it's overkill – and
that's coming from the guy who proposed (demanded) CRCs in the first
place. I agree. There's no need to have as many CRC validations as
My data files are only 10 GB at best, and more like 0.5 GB. I concluded that was nowhere near the point where I had to worry about sophisticated or advanced error detection.
Single bit failures are possible in the PNG format
By the way, there's a subtle framing issue with a chunk CRC. The length field determines where to find the CRC, but is not part of the CRC. If one of the length bits changes then there's a chance - a very small chance - that the bytes at the new position have a valid CRC for the interveaning bytes.
There's an even smaller chance that the new chunk will contain valid data for that chunk type, so this isn't something to worry about for real-world data.
On the other hand, if you think about it for a bit you'll realize that it's possible to construct a valid PNG such that a single bit change will still generate a valid PNG. Sketch: add a private ancillary chunk whose payload is the alternate PNG, plus a partial final chunk designed to skip the rest of the original PNG. It will need to be a little clever to include the bytes needed to resynchronize with another CRC, but that isn't very hard.
Sadly, I couldn't figure out a way to use this for evil.
When to validate the CRC? What to do with it fails?
What happens when the CRC fails?
Most PNG readers exist to display the image. They need to read all of the image data, which means reading and decoding nearly all of the chunks in a data file. The overhead for CRC validation is trivial. The main question for the reader is simply, what to do when it fails?
If you use libpng then you call png_set_crc_action() to tell it what to do on validation failure. The actions are "warn, and use the data", "warn and discard the data", "use the data without warning", "warn and discard the data" (for non-critical chunks), and "exit with an error." All reasonable actions for different circumstances.
My fingerprint data is a bit different than image data. For example, a 1.3 million fingerprints data file, with 1024 bit fingerprints, takes 217 MB. wc -l on that file takes 0.2 seconds, which sets the minimum time needed to compute the CRC.
Not only is my data set larger than a typical image, but I don't always need to use all of the data. Simsearch uses the sublinear search algorithm described by Swamidass and Baldi, which greatly reduces the search space for high threshold searches. As a result, typical search time is around 1/100th of a second, well less than the time needed to verify any CRC.
This tells me that FPB-based applications will rarely want to validate the fingerprint print chunk, much less the entire file. Experience tells me that if validation fields are rarely used, then they are rarely tested, and more likely to be error-prone. In fact, that's likely why there's a "warn, and use the data" option for libpng.
No chunk CRC for FPB
I concluded that it's not worthwhile to follow the PNG lead and store a CRC with each chunk. Generating the chunk CRC is a bit of hassle, with very little gain.
Instead, I'll wait to see real-world failure cases before deciding if I need to have any sort of validation, or possibly even error correction. Should this happen, I'll define a new chunk type which can store the validation data about the preceeding chunks, and place it just before the end of the FPB file.
Andrew Dalke is an independent consultant focusing on software development for computational chemistry and biology. Need contract programming, help, or training? Contact me
Copyright © 2001-2013 Andrew Dalke Scientific AB