chemfp's FPB format

/home/writings/diary/archive/2014/07/09/fpb_format

Chemfp 1.2 supports a new fingerprint file format, called the "FPB" format. It's designed so the fingerprints can be memory-mapped directly to chemfp's internal data structures. This makes it very fast, but also internally complicated. Unlike the FPS format, which is designed to exchange fingerprints between diverse programs, the FPB format is a binary application format. Internally it's a chunk-based container file format similar to PNG, Interchange File Format, and other type-length-value formats. I'll talk more about the details in a future essay.
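
To make "chunk-based container" a bit more concrete, here's a minimal Python sketch of a generic type-length-value layout. It is not the FPB layout, which this essay doesn't describe. The 4-byte tags, the 8-byte little-endian length field, and the chunk names are all made up for the example.

```python
import io
import struct

# Hypothetical layout: 4-byte ASCII tag, 8-byte little-endian length, payload.
# This is NOT the real FPB layout, only an illustration of a TLV container.
HEADER = struct.Struct("<4sQ")

def write_chunk(f, tag, payload):
    """Write one chunk: fixed-size header followed by the payload bytes."""
    f.write(HEADER.pack(tag, len(payload)))
    f.write(payload)

def iter_chunks(f):
    """Yield (tag, payload) pairs until the end of the file."""
    while True:
        header = f.read(HEADER.size)
        if not header:
            return
        if len(header) != HEADER.size:
            raise ValueError("truncated chunk header")
        tag, length = HEADER.unpack(header)
        payload = f.read(length)
        if len(payload) != length:
            raise ValueError("truncated chunk payload")
        yield tag, payload

# Round-trip two made-up chunks through an in-memory file.
buf = io.BytesIO()
write_chunk(buf, b"META", b"num_bits=1021")
write_chunk(buf, b"DATA", bytes(128))
buf.seek(0)
chunks = list(iter_chunks(buf))
```

Because every chunk starts with its length, a reader can skip over chunk types it doesn't understand, which is what makes this style of format easy to extend.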

chemfp business model

Chemfp is a package for cheminformatics fingerprint generation and high-speed Tanimoto search. Version 1.1 is available for free, under the MIT license. Version 1.2 is the first release with my new business model. It's "free software for a fee." It's still under the MIT license, but you need to pay to get a copy of it.

Previously the commercial and no cost versions were the same version, but who wants to pay for something that's available for nothing? Many free software projects suffer from a resource problem because it's hard to get funding when you don't charge anything for the software. But people will pay to get access to useful features, and that money goes into support and additional development. If all goes well, I'll release older commercial versions as no cost versions after a few years.

FPS files take a couple seconds to load

Perhaps the most widely useful new feature in chemfp-1.2 is the FPB file format, which complements the FPS file format. The FPS format is a human-readable text format which is easy to generate and parse, but it's not designed to be fast to read and write. I'll show you what I mean using ChEMBL 18 as my target database, and Open Babel's 1021-bit FP2 fingerprints.

I'll create the fingerprints in FPS format:

% ob2fps chembl_18.sdf.gz --id-tag chembl_id -o chembl_18_FP2.fps
% head -7 chembl_18_FP2.fps
#FPS1
#num_bits=1021
#type=OpenBabel-FP2/1
#software=OpenBabel/2.3.90
#source=chembl_18.sdf.gz
#date=2014-07-08T22:07:54
20000120200a0010402006040c00000064000000000220c80080104c03104c01000041
0000488021808002180a000000000020001800200084348082000802010c000c000320
020409000000080000041000017004cb10009340000000010000888012000001004010
20000420020029100f8010000900800008010002000300      CHEMBL153534

then do a similarity search, asking it to report the search times to stderr:

% simsearch --query 'Cn1cnc2c1c(=O)n(c(=O)n2C)C' chembl_18_FP2.fps --time
#Simsearch/1
#num_bits=1021
#type=Tanimoto k=3 threshold=0.7
#software=chemfp/1.2b2
#targets=chembl_18_FP2.fps
#query_sources=chembl_18.sdf.gz
#target_sources=chembl_18.sdf.gz
3	Query1	CHEMBL113	1.00000	CHEMBL1767	0.97222	CHEMBL74063	0.97183
open 0.00 search 2.15 total 2.15

The "--query" command-line parameter is new in chemfp-1.2. It takes a SMILES string by default. Simsearch looks at the fingerprint type line of the header to get the appropriate toolkit and generate the corresponding fingerprint for that structure query record.
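
The scores in the simsearch output are Tanimoto similarities: the number of bits set in both fingerprints divided by the number of bits set in either one. Here's a minimal pure-Python 3 sketch that works on hex-encoded fingerprints like the ones in the FPS file. The function name is made up for the example, and this is nothing like the optimized code chemfp uses for its searches.

```python
def tanimoto_hex(hex_a, hex_b):
    """Tanimoto similarity of two equal-length hex-encoded fingerprints."""
    if len(hex_a) != len(hex_b):
        raise ValueError("fingerprints must have the same length")
    a = int(hex_a, 16)
    b = int(hex_b, 16)
    union = bin(a | b).count("1")
    if union == 0:
        # Convention chosen for this sketch: two empty fingerprints score 0.
        return 0.0
    return bin(a & b).count("1") / union

same = tanimoto_hex("0f", "0f")      # identical fingerprints
half = tanimoto_hex("0f", "03")      # 2 bits in common, 4 bits in the union
empty = tanimoto_hex("00", "00")     # no bits set in either
```

Bit order doesn't matter here because only the counts of set bits are used.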

Why aren't FPS searches faster?

Similarity search in chemfp is supposed to be fast. Why does it take over 2 seconds to search 1,352,681 records? Answer: nearly all of the time is spent reading the data and parsing the FPS file. Just doing a "wc -l" on the file takes 0.5 seconds, so that sets the upper bound on performance, unless I switch to an SSD.

This is why Noel O'Boyle's head-to-head timing comparison against Open Babel's own "fastsearch" finds that they have similar search times for single query searches; both simsearch and fastsearch are mostly I/O and parser bound.
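
To see where the time goes, consider what a reader has to do for every one of those records: split the line, decode the hex, and store the id. Here's a minimal Python sketch of an FPS reader, based only on the layout visible in the output above: header lines start with "#", and each record is one line containing a hex fingerprint, a tab, and an id (the record shown earlier is wrapped for display). It's not chemfp's parser, and the sample data is made up.

```python
def parse_fps(lines):
    """Return (metadata, records) from an iterable of FPS text lines."""
    metadata = {}
    records = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("#"):
            # Header lines look like "#key=value"; "#FPS1" has no "=".
            if "=" in line:
                key, _, value = line[1:].partition("=")
                metadata[key] = value
            continue
        # Record: hex fingerprint, tab, id (ignore any further fields).
        hex_fp, _, rest = line.partition("\t")
        fp_id = rest.split("\t")[0]
        records.append((fp_id, bytes.fromhex(hex_fp)))
    return metadata, records

# Made-up sample data with 16-bit fingerprints.
sample = [
    "#FPS1\n",
    "#num_bits=16\n",
    "#type=Example/1\n",
    "00ff\tID1\n",
    "a001\tID2\n",
]
metadata, records = parse_fps(sample)
```

Even a tight version of this loop has to touch every byte of the file before the first comparison can be made, which is the cost the FPB format avoids.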

Use 'fpcat' to convert from FPS to FPB format

I'll do the same test, but with the FPB format. I could ask ob2fps to write an FPB file instead of an FPS file, by simply changing the extension for the output file, like this:

% ob2fps chembl_18.sdf.gz --id-tag chembl_id -o chembl_18_FP2.fpb

However, this will re-parse the structure and recompute the fingerprints, which takes a long time.

Since I already have the fingerprints in the FPS file, I'll instead use the new "fpcat" program to convert from FPS format to FPB format.

% fpcat chembl_18_FP2.fps -o chembl_18_FP2.fpb

The conversion took about 12 seconds to run. The FPB format is pre-sorted and indexed by population count, to enable sublinear similarity search directly on the file, and the fingerprints are word aligned for optimal popcount calculations.
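
Why does popcount ordering help? The Tanimoto score between a query with q bits set and a target with p bits set can't be larger than min(p, q)/max(p, q), so for a threshold T only targets with popcount between q*T and q/T can possibly match. With the targets sorted by popcount, that range is one contiguous slice. Here's a simplified Python sketch of the idea. It's not the code chemfp uses, and the function name and popcount list are made up for the example.

```python
import bisect
import math

def popcount_window(sorted_popcounts, query_popcount, threshold):
    """Return the [start, end) slice that can contain hits >= threshold."""
    if threshold <= 0.0:
        return 0, len(sorted_popcounts)
    # Widen the bounds slightly so floating-point rounding never drops a hit.
    lo = math.ceil(query_popcount * threshold - 1e-9)
    hi = math.floor(query_popcount / threshold + 1e-9)
    start = bisect.bisect_left(sorted_popcounts, lo)
    end = bisect.bisect_right(sorted_popcounts, hi)
    return start, end

# Made-up popcounts, already sorted the way the targets would be stored.
popcounts = [0, 0, 5, 12, 19, 20, 31, 40, 44, 58, 80, 81, 200]
start, end = popcount_window(popcounts, 40, 0.5)
candidates = popcounts[start:end]
```

Only the fingerprints inside the window need a full Tanimoto calculation; everything outside it is skipped without being read.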

You can get a sense of the popcount ordering by using "fpcat" to view the contents of the FPB file as an FPS file:

% fpcat chembl_18_FP2.fpb | head -6
#FPS1
#num_bits=1021
#type=OpenBabel-FP2/1
#software=OpenBabel/2.3.90
0000000000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000	CHEMBL17564
0000000000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000	CHEMBL1098659
% fpcat chembl_18_FP2.fpb | tail -1
373e2327965282fe3795e7443613480cb59dd5d47164f8cc11ac2bbe91be9f873c09f1
dfaff7b0ffb09eb7fb21993243e3dbea4038a0e011a60be22f229e2634ced97e0a0e7c
9b832ffbb502d6a8139e08bccffaf5bb3ba8f36edf23814fe2953ff77738e10615f32a
e09040f7d42cbf510f15df765b3a6279f802471a86cb04	CHEMBL2368798

FPB searches are much faster

Simsearch accepts FPS and FPB files. By default it figures out the file type by looking at the file extension. I'll pass in the .fpb version:

% time simsearch --query 'Cn1cnc2c1c(=O)n(c(=O)n2C)C' chembl_18_FP2.fpb --time
#Simsearch/1
#num_bits=1021
#type=Tanimoto k=3 threshold=0.7
#software=chemfp/1.2b2
#targets=chembl_18_FP2.fpb
3	Query1	CHEMBL113	1.00000	CHEMBL1767	0.97222	CHEMBL74063	0.97183
open 0.00 search 0.00 total 0.00
0.305u 0.070s 0:00.41 90.2%	0+0k 0+0io 0pf+0w

Yes, FPB search is less than 1/100th of a second. I wrapped everything in the "time" command to show you that the whole search takes 0.4 seconds. Much of that extra time (about 0.25 seconds) is waiting for Open Babel and my hard disk to load the available Open Babel file formats, but there's also overhead for starting Python and importing chemfp's own files.

The slowest part is loading Python and Open Babel

In fact, I'll break it down so you can get a sense of how long each part takes:

# Python startup
% time python -c "pass"
0.011u 0.007s 0:00.01 100.0%	0+0k 0+8io 0pf+0w

# Open Babel extension overhead
% time python -c "import openbabel"
0.027u 0.013s 0:00.04 75.0%	0+0k 0+3io 0pf+0w

# Overhead for Open Babel to load the available formats
% time python -c "import openbabel; openbabel.OBConversion()"
0.233u 0.021s 0:00.25 100.0%	0+0k 0+0io 0pf+0w

# Chemfp import overhead to use Open Babel
% time python -c "from chemfp import openbabel_toolkit"
0.281u 0.032s 0:00.31 100.0%	0+0k 0+26io 0pf+0w

In other words, about 0.3 seconds of the 0.4 seconds is used to get Python and Open Babel to the point where chemfp can start working.
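
If your shell's "time" output looks different from mine, here's a rough Python 3 sketch of the same kind of measurement. It only reports wall-clock time, not the user/system split, and the function name is made up for the example. To repeat the Open Babel measurements, pass "import openbabel" as the snippet on a machine where Open Babel is installed.

```python
import subprocess
import sys
import time

def time_python_snippet(snippet):
    """Return wall-clock seconds to start a fresh Python and run `snippet`."""
    start = time.perf_counter()
    subprocess.run([sys.executable, "-c", snippet], check=True)
    return time.perf_counter() - start

# Baseline: the cost of starting the interpreter and doing nothing.
elapsed = time_python_snippet("pass")
print("python startup: %.3f s" % elapsed)
```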

When is the FPB format useful?

If you already have a fingerprint (so no toolkit overhead), or have an SSD, then the total search time on the command-line is less than 0.1 seconds.

For a command-line user, this is great because you can easily integrate similarity searches into scripts and other command-line tools.
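
For example, a script that runs simsearch needs to pick the hits out of its output. Here's a small Python sketch which parses one result line, assuming the tab-separated layout shown in the output earlier: the hit count, the query id, then alternating target ids and scores. The function name is made up for the example, and the "#" header lines would need to be skipped before calling it.

```python
def parse_simsearch_line(line):
    """Split one result line into (query_id, [(target_id, score), ...])."""
    fields = line.rstrip("\n").split("\t")
    count = int(fields[0])
    query_id = fields[1]
    pairs = fields[2:]
    if len(pairs) != 2 * count:
        raise ValueError("expected %d hits, got %d fields" % (count, len(pairs)))
    hits = [(pairs[i], float(pairs[i + 1])) for i in range(0, len(pairs), 2)]
    return query_id, hits

# The result line from the search shown earlier.
line = "3\tQuery1\tCHEMBL113\t1.00000\tCHEMBL1767\t0.97222\tCHEMBL74063\t0.97183\n"
query_id, hits = parse_simsearch_line(line)
```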

It's also useful for web development. Of course, a web server in production rarely restarts, so the load time isn't critical. But as you develop the server you end up restarting it often. Many web application frameworks, including Django, will auto-reload the application server every time a file changes. It's annoying to wait even two seconds for 1.3 million records; imagine how much more annoying it is to handle a few 5-million-record fingerprint sets.

Switch to the FPB format, and reloads become fast again. Even better, because FPB files are memory-mapped, the operating system can share the same memory between multiple processes. This means you can run multiple servers on the same machine without using extra memory.
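
Here's a small Python 3 sketch of the memory-mapping idea. It's not the FPB layout; the fixed 128-byte records and the file name are made up for the example. The point is that opening the mapping is nearly free, and only the parts of the file you actually touch get read.

```python
import mmap
import os
import tempfile

RECORD_SIZE = 128  # bytes per fingerprint; an assumption for this example

# Build a small file of fixed-width records to stand in for fingerprint data.
path = os.path.join(tempfile.mkdtemp(), "fingerprints.bin")
with open(path, "wb") as f:
    for i in range(1000):
        f.write(bytes([i % 256]) * RECORD_SIZE)

def get_record(mm, index):
    """Return a copy of record `index` from the memory-mapped file."""
    start = index * RECORD_SIZE
    return mm[start:start + RECORD_SIZE]

# Map the file read-only; nothing is parsed or copied up front.
with open(path, "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        record = get_record(mm, 42)
        num_records = len(mm) // RECORD_SIZE
```

Because the mapping is read-only and backed by the file, every process that maps the same file can share the same pages in the operating system's cache.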

Combine these together and you'll see that even CGI scripts can now include similarity search functionality with good performance. (Yes, there are still good reasons for using CGI scripts. The last one I developed was only two years ago, and it was supposed to be a drop-in replacement for a system developed 10 years previously.)

More about chemfp

If you're interested in chemfp, see the product page and download version 1.1 to evaluate the performance. If you're interested in paying for access to version 1.2, email me.

Andrew Dalke is an independent consultant focusing on software development for computational chemistry and biology. Need contract programming, help, or training? Contact me