
Fingerprint File Format

(This has nothing at all to do with human fingerprints. I'm talking about a technique used in chemistry to represent characteristics of a small molecule as a bit string, so that fast bit operations can be used instead of slow graph operations. It's often rather like a Bloom filter where the hash function is based on molecular substructure.)

SD files and SMILES files are the two most common small molecule structure formats. PDB is the most common macromolecular structure format. FASTA and GenBank are similarly common in sequence databases. But molecular fingerprints don't have a common format.

I propose a new common format for these, called the "FingerPrint Format", or "FPF file", with the standard extension ".fpf". I'm looking for feedback and suggestions, and hopefully also uptake by others.

One problem, of course, is that there are many types of fingerprints and many ways to represent them. I'm limiting myself to those which are easily represented as a fixed-length bit string of between 8 and about 8K bits, and which are dense enough to store the bits directly as bits rather than use run-length or other encodings.

My goal is to make fingerprint data sets more portable, so that tools and algorithms developed by one group can more easily be shared and tested by another.

Use cases

Yes, these are all oriented around me. If it weren't something I wanted then I would be getting paid for this. (Got funding?)

1) One of my clients wanted a 3-nearest-neighbors search (within some cutoff) of a set of Daylight fingerprints, to be used in part of their dataflow system. The files were static, and only updated once every 6 months or so. There was no need for a database, so I wrote a one-off search system for them.

2) I want to write a high-speed Tanimoto similarity search algorithm which works with a pre-built data structure and makes good use of modern processors. Most formats are not designed for high performance and require a preprocessing phase to load the data, and in a command-line tool the data load time will be the limiting factor, not the search.

3) I've found it a bit trying to determine the format and bit order of each fingerprint tool I use, since they are almost all different.

4) I want to develop some screening data sets for substructure queries and evaluate how effective they are against different inputs.

5) Oh, here's one which others want! Do the N**2 nearest-neighbor searches of a data set across a set of machines without spending much time building a fingerprint generation and parsing infrastructure.

Design goals

Here are my design goals:

These are not design goals:

I haven't decided how important these are:

Also, while not a design goal, I would like to have a set of reference implementations for different types of Tanimoto searches, substructure filtering, and other operations. I also want them to be fast enough to be used in benchmarking new implementations, since I've seen too many benchmarks where the reference baseline was a naive (meaning "not fast") implementation.

FPF is based on the PNG block structure

I decided to base the format around the PNG specification. The first 8 bytes are a unique signature, and the rest of the file is a sequence of blocks. Each block contains the length of the block data as a 4 byte integer, followed by a 4 byte identifier tag, followed by 'length' many bytes for the block data, followed by 4 bytes for the CRC checksum.

All integers in this proposal are unsigned 32 bit values and stored in network byte order, taking up four bytes. That leads to a limitation in the fingerprint block size which I'll get to later.

FPF uses the same chunk layout, but with a different signature and with tags which are more appropriate for fingerprints.

The FPF signature as hex bytes is 0x89 0x46 0x50 0x46 0x0D 0x0A 0x1A 0x0A. The subsequence 0x46 0x50 0x46 is "FPF", where PNG uses "PNG".

I keep PNG's use of a CRC checksum as a way to check for incomplete or corrupted files. I don't know how useful that is in practice.
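
To make the block structure concrete, here's a minimal block lister in C. It's only a sketch under the layout just described (length, tag, data, CRC), it skips CRC validation, and none of the names in it are part of the proposal:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Decode a big-endian (network byte order) 32 bit unsigned integer. */
static uint32_t be32(const unsigned char *b) {
    return ((uint32_t)b[0] << 24) | ((uint32_t)b[1] << 16) |
           ((uint32_t)b[2] << 8) | (uint32_t)b[3];
}

int main(int argc, char **argv) {
    static const unsigned char signature[8] =
        {0x89, 0x46, 0x50, 0x46, 0x0D, 0x0A, 0x1A, 0x0A};
    unsigned char buf[8];
    FILE *f;

    if (argc != 2) {
        fprintf(stderr, "usage: fpfls file.fpf\n");
        return 1;
    }
    f = fopen(argv[1], "rb");
    if (f == NULL || fread(buf, 1, 8, f) != 8
                  || memcmp(buf, signature, 8) != 0) {
        fprintf(stderr, "not an FPF file\n");
        return 1;
    }
    /* Each block: 4 byte length, 4 byte tag, 'length' data bytes, 4 byte CRC. */
    while (fread(buf, 1, 8, f) == 8) {
        uint32_t length = be32(buf);
        printf("block %.4s: %u bytes\n", (const char *)(buf + 4), (unsigned)length);
        fseek(f, (long)length + 4, SEEK_CUR);  /* skip the data and the CRC */
    }
    fclose(f);
    return 0;
}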

FPF block types

FHDR: header block

I haven't figured out what goes in here. Perhaps something which indicates if the fingerprint type is known to be good/bad for substructure filtering or comparison?

tEXt (and potentially iTXt, zTXt): text block(s)

These are exactly the same as in PNG: key/value pairs, which include the controlled vocabulary of:

   Title            Short (one line) title or caption for image
   Author           Name of image's creator
   Description      Description of image (possibly long)
   Copyright        Copyright notice
   Creation Time    Time of original image creation
   Software         Software used to create the image
   Disclaimer       Legal disclaimer
   Warning          Warning of nature of content
   Source           Device used to create the image
   Comment          Miscellaneous comment; conversion from
                    GIF comment

Needless to say, but I'll do so anyway: some of these need to be tweaked for fingerprints.

I think FPF only needs tEXt, which is a simple key/value record using Latin-1. If those should be UTF-8 then look at iTXt or come up with a new type of text block.

Question: should there be some field which encodes the generation parameters ("OpenBabel FP2, folded to 32 bits")? That lets someone pass in structures without having to know the details of the given data set. Yes, these options are very implementation specific and not portable.

FDNS: dense fingerprint block

The block format is:

[ integer number of bits per fingerprint ]
[ integer number of bytes per fingerprint (including alignment) ]
[ integer number of bytes used for initial alignment = "initial alignment"]
[ "initial alignment" number of bytes of value 0 ]
[ fingerprint 1, which is "number of bytes per fingerprint" long ]
[ fingerprint 2 ]
[ fingerprint 3 ]
  ...
[ fingerprint N ]

The fingerprints are stored in binary as a set of bytes. The bytes are written in little-endian order (bit 0 is in the first byte) and the bits of each byte are written in big-endian order, so bit 0 is the least significant bit of the first byte and appears last in that byte's hex form. To get bit i from a fingerprint:

/* byte_offset selects the byte; bit_offset counts from the
   least significant bit of that byte */
byte_offset = i / 8;
bit_offset = i % 8;
bit = (fingerprint[byte_offset] >> bit_offset) & 1;

(CACTVS uses little-endian for the bytes and the bits, OpenBabel uses big-endian for both. I decided this mixture was easier to code.)
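
As a worked example (mine, not from any spec): setting bits 0 and 9 of a 16 bit fingerprint and hex-dumping it gives "0102".

#include <stdio.h>

int main(void) {
    unsigned char fp[2] = {0, 0};
    int bits[2] = {0, 9};

    /* Set each bit: byte i/8, counting from the least significant bit. */
    for (int i = 0; i < 2; i++)
        fp[bits[i] / 8] |= 1 << (bits[i] % 8);

    /* Bytes print first-to-last; each byte prints high nibble first. */
    for (int i = 0; i < 2; i++)
        printf("%02X", fp[i]);
    printf("\n");   /* prints "0102" */
    return 0;
}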

The minimum number of bytes per fingerprint is (bits_per_fingerprint+7)/8, using integer division. Extra bits used to fill in the remainder of the last byte must be set to 0. The fingerprint may be padded with extra 0 bytes in order to make its length a multiple of the word size (typically 32, 64, or 128 bits).

The "initial alignment" part is there to align the fingerprints in the case of memory-mapped files. It can be judiciously chosen so that the first byte of the first fingerprint is word aligned or even page aligned, if that makes a difference.

The number of fingerprints is not stored. It can be calculated based on the block size, the header size, and the number of bytes per fingerprint.
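
In code, assuming block_length is the value from the FDNS block's length field:

/* The FDNS block data is three 4 byte integers, then "initial alignment"
   zero bytes, then the fingerprints themselves. */
num_fingerprints = (block_length - 12 - initial_alignment) / bytes_per_fingerprint;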

By the way, I expect memory alignment to be a downstream issue, where one program can generate a simple fingerprint file using no extra alignment bytes, and another program can tweak the alignment to better suit a given OS and implementation.

POPC: ordered population count offsets block

[ integer offset in FDNS to first fingerprint with popcount=0 ]
[ integer offset  "  "   to first fingerprint with popcount=1 ]
   ...
[ integer offset  "  "   to first fingerprint with popcount=N ]
[ integer offset in FDNS to the byte after the last fingerprint with popcount=N ]

One way to speed up similarity queries and substructure filters is to exclude fingerprints which trivially cannot match, based on the popcounts of the target and query fingerprints. "Popcount" is short for "population count" and is the number of set bits in the fingerprint. For example, in a substructure filter test the target cannot have a smaller popcount than the query, and Baldi explained how to reduce the Tanimoto search space.

FPF stores the popcount information implicitly, by ordering the fingerprints in the FDNS block so that all fingerprints with popcount=0 are listed first, then those with popcount=1, then popcount=2 and so on.

The "POPC" block contains byte offsets to the start and end of each set of fingerprints with the same popcount. If there are N bits per fingerprint then there will be N+2 offsets, and the fingerprints with popcount 0<=i<=N are between offsets POPC[i] and POPC[i+1].

The last offset is redundant and must be identical to the length of the DENS block. I include it here because my C implementations of the algorithms were easier if they have this information, and storing it in the data file means one less malloc or special case to consider.

If the POPC block is present then the DENS fingerprints must be ordered by population count.
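
To show how the POPC offsets and the popcount ordering work together, here's a sketch of a threshold Tanimoto search. It relies on the min(q,t)/max(q,t) bound between fingerprints with popcounts q and t; tanimoto() is a hypothetical helper, not something defined by this proposal.

#include <stdint.h>
#include <stdio.h>

/* Assumed helper: Tanimoto score between two packed fingerprints. */
double tanimoto(const unsigned char *a, const unsigned char *b, int nbytes);

/* fdns: start of the FDNS block data (the POPC offsets are relative to it).
   popc: the N+2 POPC offsets.  q: popcount of the query fingerprint. */
void threshold_search(const unsigned char *fdns, const uint32_t *popc,
                      int nbits, int bytes_per_fp,
                      const unsigned char *query, int q, double threshold) {
    for (int t = 0; t <= nbits; t++) {
        /* The best possible Tanimoto between fingerprints with popcounts
           q and t is min(q,t)/max(q,t), so skip bins below the threshold. */
        int lo = q < t ? q : t;
        int hi = q < t ? t : q;
        if (hi == 0 || (double)lo / hi < threshold)
            continue;
        for (uint32_t ofs = popc[t]; ofs < popc[t+1]; ofs += bytes_per_fp) {
            double score = tanimoto(query, fdns + ofs, bytes_per_fp);
            if (score >= threshold)
                printf("hit at offset %u, score %.3f\n", (unsigned)ofs, score);
        }
    }
}

A substructure filter would use the same loop structure, but skip every bin with a popcount smaller than the query's.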

NAME: list of fingerprint identifiers

[NUL byte]
[ name 1 ]
[NUL byte]
[ name 2 ]
 ....
[ name N ]
[NUL byte]

If present, each fingerprint is associated with a single name and vice versa. A name must not contain the NUL character, should not contain any other control characters, and should be a unique identifier. (Are the names Latin-1 or UTF-8 encoded? If UTF-8, does the string also undergo a canonicalization step? Or are only ASCII names allowed?)

The names are stored in the same order as the fingerprints in the FDNS block, so that the first name corresponds to the first fingerprint, the second to the second, and so on. Each name must be NUL terminated. (However, sorted names would make searching possible in log time instead of linear. On the other hand, most people will load the names into a local dictionary and take the linear hit once.)

The first byte in the NAME block must be 0 (the NUL character). This is present even if there are no fingerprints and hence no names.

The purpose of the leading NUL character is to simplify text searches. To find the identifier "ID", construct a search for NUL+"ID"+NUL and do a linear search. (It also simplifies binary searches; pick a point p then backtrack to the previous NUL before doing the test.)
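
Here's a sketch of that linear search. It assumes plain byte-wise comparison is enough (true for ASCII names), and find_name() is my own illustrative name for it:

#include <string.h>

/* Linear search for 'id' in a NAME block of 'size' bytes.  The block
   starts with a NUL and each name is NUL terminated, so every name is
   bracketed by NULs and we can search for NUL + id + NUL.  Returns the
   byte offset of the name, or -1 if not found. */
long find_name(const char *name_block, long size, const char *id) {
    long idlen = (long)strlen(id);
    for (long i = 0; i + idlen + 2 <= size; i++) {
        if (name_block[i] == '\0' &&
            memcmp(name_block + i + 1, id, (size_t)idlen) == 0 &&
            name_block[i + 1 + idlen] == '\0')
            return i + 1;   /* offset of the first character of the name */
    }
    return -1;
}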

NOFF: offsets into the NAME block

[integer offset to name for fingerprint 1]
[integer offset to name for fingerprint 2]
[integer offset to name for fingerprint 3]
  ...
[integer offset to name for the last fingerprint]

This block contains offsets into the NAME block. To get the name for fingerprint i, go to the ith entry in the table, then use that value as the byte offset of the name in the NAME block.

To go the other way requires a binary search. Given the byte offset of a name in the NAME block, use a binary search over the NOFF entries (they increase monotonically, since the names are stored in fingerprint order) to find the matching entry. The position of that entry in the NOFF block is the index of the fingerprint in the FDNS block.

Using sorted names means the mapping from identifier to fingerprint can be done in log time. Using unsorted names requires linear time.

However, I expect most people to load the NAME data into a local dictionary/hash table, which will take linear time but only once. In that case, keeping the names in the same sort order as the fingerprints is much easier to deal with.
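
Here's a sketch of both directions, assuming the NOFF entries have already been byte-swapped into a uint32_t array; both function names are mine, for illustration only.

#include <stdint.h>

/* Name for fingerprint i: noff[i] is a byte offset into the NAME block. */
const char *name_for_fingerprint(const char *name_block,
                                 const uint32_t *noff, long i) {
    return name_block + noff[i];
}

/* Fingerprint index for a name offset: binary search the NOFF entries,
   which increase monotonically because names are stored in fingerprint
   order.  Returns -1 if the offset is not in the table. */
long fingerprint_for_offset(const uint32_t *noff, long n, uint32_t offset) {
    long lo = 0, hi = n - 1;
    while (lo <= hi) {
        long mid = lo + (hi - lo) / 2;
        if (noff[mid] == offset) return mid;
        if (noff[mid] < offset) lo = mid + 1;
        else hi = mid - 1;
    }
    return -1;
}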

Unresolved issues

Sorted or unsorted names?

I'm repeating it here because it's tricky. How well supported should fingerprint lookup by name be? If the names are in sorted order then I can find a name in log(n) time, then find the fingerprint index in log(n) time, but I have to do that for every lookup. Meanwhile, building a table which maps from name to index requires disentangling the lookup table, which is a linear operation.

Preserving the fingerprint sort order means each lookup by name requires O(n) time, but it makes building the mapping from name to fingerprint very easy.

Scaling

This description used 32-bit unsigned integers as lengths and offsets. That means a single block can contain at most 2**32 bytes. Consider someone using the 881 CACTVS substructure keys aligned to 128 bits. That uses 896 bits or 112 bytes per fingerprint, which means the block can hold a bit over 38 million fingerprints.

Of course, if someone does an 8K fingerprint then that drops to just over 4 million fingerprints.

Is this a realistic concern for the next 10 years? I think most people don't use fingerprints over 1024 bits, but on the other hand the databases are currently in the 50 million compound range.

There are three ways to address it. One is to use 64 bit values instead of 32 bit ones. Another is to use multiple blocks, with a special block to indicate the start of a new set of blocks. A third is to simply use multiple FPF files. One thing to bear in mind is that multiple sections or multiple files are likely easier for people using map/reduce strategies.

Storing multiple diverse fingerprint sets in the same file

Substructure fingerprints are likely not good similarity fingerprints and vice versa. Is there a need to store different fingerprint sets in the same file?

I don't think so. I think using multiple files works well for that.

Preserving order

A substructure screen may go through several stages. For example, if the input structure contains a hormone substructure then the results of a general query filter might be passed through one optimized to distinguish between different types of hormones, with the hitlist passed from the first stage to the next.

Using names as the hitlist identifiers does not work that well because of the expensive lookup cost. It's better to pass around a list of offsets. But if the FDNS block is reordered by popcount then the original input order is no longer available. Also, there needs to be a way to report a match using its input order.

There are two solutions to that: don't sort by popcount, or add a table which maps from input order to DENS order. (Or possibly two tables, although the inverse mapping can be constructed in O(N) time with only one malloc because it's a bucket sort.)
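
That inverse mapping is only a few lines of C; 'order' here is my hypothetical table mapping input position to FDNS position.

#include <stdint.h>
#include <stdlib.h>

/* Invert a permutation: order[i] is the FDNS position of input record i.
   The result maps an FDNS position back to its input position.  One
   malloc, one linear pass. */
uint32_t *invert_order(const uint32_t *order, uint32_t n) {
    uint32_t *inverse = malloc(n * sizeof(uint32_t));
    if (inverse == NULL) return NULL;
    for (uint32_t i = 0; i < n; i++)
        inverse[order[i]] = i;
    return inverse;
}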

Usefulness of Baldi

The Baldi algorithm suggests ordering the search of the popcount bins by each bin's best possible score: for a query with popcount q, a bin with popcount t can contain nothing better than Tanimoto min(q,t)/max(q,t). (The paper actually suggests allocating an array and sorting, but as the bound decreases monotonically on either side of the query's popcount, it can also be done with an iterator doing the equivalent of a merge sort of the two sides.)
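
Here's a sketch of that iterator, using the same min/max bound as before; all of the names are illustrative.

/* Iterate popcount bins in order of decreasing best-possible Tanimoto
   for a query with popcount q.  The bound min(q,t)/max(q,t) falls off
   monotonically on each side of t=q, so advance whichever side
   currently has the larger bound, like one step of a merge sort.
   next_bin() returns the next bin, or -1 when both sides are done. */
typedef struct { int lo, hi, nbits, q; } bin_iter;

static double bound(int q, int t) {
    int lo = q < t ? q : t, hi = q < t ? t : q;
    return hi == 0 ? 0.0 : (double)lo / hi;
}

void bin_iter_init(bin_iter *it, int q, int nbits) {
    it->q = q; it->nbits = nbits;
    it->lo = q;        /* walks down from q */
    it->hi = q + 1;    /* walks up from q+1 */
}

int next_bin(bin_iter *it) {
    double down = it->lo >= 0 ? bound(it->q, it->lo) : -1.0;
    double up = it->hi <= it->nbits ? bound(it->q, it->hi) : -1.0;
    if (down < 0 && up < 0) return -1;
    if (down >= up) return it->lo--;
    return it->hi++;
}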

I implemented it but, excepting high similarity searches (perhaps 80% or so? I did this over a year ago), I didn't notice any real improvement in Tanimoto searches. I conjectured that non-linear disk seeks were causing the problem. The memory and especially the disk subsystems are really optimized for forward reads, but the Baldi ordering jumps back-and-forth and breaks all the lovely cache prefetching going on behind the scenes.

While I'm convinced that the Baldi limits are useful, the reordering to make it fast ends up making the system more complicated.

BTW, SSDs (Solid State Drives) fix part of the problem by eliminating seek time, though not the loss of cache prefetching. Another solution might be to store two sets of fingerprints, with the second sorted in reverse popcount order and on another disk. Then with two threads going forward... Hmmm...

In any case, Swamidass and Baldi reported good speedups for top-K searches. I haven't seen their code, but I have seen other search code which uses a very slow scoring function, so I would like a platform where it's easier for people to compare against known fast implementations of different approaches.

Fingerprint density

I said that I'm assuming dense fingerprints, where the bits are random. Most chemical fingerprints are sparse, with under 20% density for similarity fingerprints and even less for feature key fingerprints used for substructure screening.

Perhaps using run-length encoding of the bits in a fingerprint, or run-length encoding of the inverted index, might be better. However, I have a lot less experience with that, and storing those data structures on disk is more complex than I want to deal with.

CPUs are very fast. A test I did a few years ago suggested that disk and memory access are the limiting factor, not the algorithm. If that's the case then it's worthwhile to make the CPU do extra work in exchange for waiting on less data. On the other hand, another test showed my code was still rather slower than computing the md5 checksum of the same file. These tests were suggestive, not rigorous.

Streaming output

Each chunk starts off with a length. This leads to several complications. It's hard to stream everything to a pipe unless the output size is known in advance, which likely means storing all of the fingerprints in memory before generating the output.

If the output is a file, not a stream, then it's possible to write all the fingerprints to the file, seek back to fill in the length field, and then seek forward to the end again. They wouldn't be sorted by popcount, but a post-processor could sort a file which is too large to keep in memory. (Which, given that my laptop has 3GB of memory, doesn't seem a real concern.)
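
A sketch of that trick, assuming the block's 4 byte length field was written as a placeholder at file position length_pos:

#include <stdint.h>
#include <stdio.h>

/* Patch a block's length field once the data size is known, then
   return to the end of the file for the next block. */
void patch_length(FILE *f, long length_pos, uint32_t length) {
    unsigned char b[4] = {
        (unsigned char)(length >> 24), (unsigned char)(length >> 16),
        (unsigned char)(length >> 8),  (unsigned char)length
    };
    long end = ftell(f);      /* remember where the data ended */
    fseek(f, length_pos, SEEK_SET);
    fwrite(b, 1, 4, f);       /* overwrite the placeholder length */
    fseek(f, end, SEEK_SET);  /* back to the end */
}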

That would also mean keeping the names in memory until the fingerprints are written, since the above trick only works for a single block.

If memory is critical (which shouldn't be the case for names, as they are usually only 10 bytes or so), then an implementation might write the blocks to the filesystem and merge them later on. Just like the old days of dealing with tape drives. But really, this seems more a theoretical concern than something to worry about, since the point of an FPF file is to use it as a data source for a pipeline, not as part of the pipeline stream.

Development and future

This is something I've been thinking about for a while but haven't gotten around to because I'm not working with fingerprints right now. I am a consultant so if you're interested in funding me, email me. I also do have other, paying work so this doesn't have much priority.

I like the general approach of using PNG blocks. I've implemented parts of the system, and the resulting code has been nice and clean. The format I sketched out fits in well with the high-performance C code for Tanimoto search and substructure filtering I wrote a couple of years ago. I also like that the format is relatively future proof: if someone needs some new data chunk it isn't hard to add one.

I do need feedback from others. If you have ideas, or comments, or perhaps know about existing fingerprint formats I should use instead, then leave a message.

If there's enough interest in this, I'll set up some sort of project repository and mailing list for it, so it's up to you to speak up!


Andrew Dalke is an independent consultant focusing on software development for computational chemistry and biology. Need contract programming, help, or training? Contact me


