Dalke Scientific Software: More science. Less time.
/home/writings/diary/archive/2011/05/30/chemfp-in-alpha


Any long-time reader knows that I'm interested in chemical fingerprints. I've written a seven-part series on how to generate fingerprints, how to compute the Tanimoto score between two fingerprints, and how to do that calculation quickly.
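To make the terms concrete: a fingerprint is a fixed-length bit vector, and the Tanimoto score of fingerprints A and B is the number of bits set in both divided by the number of bits set in either. A deliberately simple Python sketch of the definition (my illustration, not chemfp's implementation, which is far faster):

```python
def tanimoto(fp1: bytes, fp2: bytes) -> float:
    """Tanimoto similarity of two equal-length binary fingerprints."""
    assert len(fp1) == len(fp2)
    both = sum(bin(a & b).count("1") for a, b in zip(fp1, fp2))    # |A AND B|
    either = sum(bin(a | b).count("1") for a, b in zip(fp1, fp2))  # |A OR B|
    return both / either if either else 0.0

# 0b1100 vs 0b1010: one bit in common, three bits in the union
print(tanimoto(b"\x0c", b"\x0a"))  # 0.3333...
```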

If you already know what I'm doing then I'll jump to the punchline: my chemfp project has just released chemfp-1.0a1.tgz. It includes:

  * the FPS text format for fingerprint exchange
  * sdf2fps, which extracts fingerprints already stored in SD files (such as the PubChem/CACTVS keys)
  * ob2fps, oe2fps, and rdkit2fps, which generate fingerprints using OpenBabel, OpenEye's OEChem, and RDKit
  * simsearch, for k-nearest and threshold Tanimoto similarity searches

It's an alpha release because the test suite isn't complete, and in writing the documentation over the last day I've found a number of corners where I just haven't tested the code paths. It's also an alpha because I want feedback about its usefulness, and about what you think I should do to make the format and the project more useful.


Fingerprints are one of the core concepts in my field. It's hard to read even a few papers without coming across something to do with fingerprint generation or scoring. Yet for all the sophisticated understanding of the mathematical concept, the software ecosystem around fingerprints is rather weak.

The RDKit, OpenBabel, OpenEye's OEChem, the CDK, Schrödinger's Canvas, Xemistry's CACTVS, Accelrys's Pipeline Pilot, CCG's MOE, and many, many more tools generate fingerprints. I'm limiting myself to those which produce dense binary fingerprints of fewer than about 10,000 bits, and usually only a few hundred. The best known are the MACCS keys, with 166 bits. The CACTVS keys in PubChem are 881 bits, and hash fingerprints are usually 1024-4096 bits.

So here's a question: how well does RDKit's MACCS key implementation compare to OpenEye's? Which provides a better similarity score: the CACTVS substructure keys or the slightly longer OpenBabel path fingerprints?

Try answering those questions and you'll see that there is no exchange format for fingerprints. OpenBabel, Canvas, and OEChem all have black-box storage files for fast fingerprint search, but they are not meant for external use. Instead, you'll likely end up making your own format, along with tools that work with it, none of which are portable outside of your own research group. (In one case a researcher used a format with space-separated "0"s and "1"s, meaning 2 bytes to store each bit!)

I first ran into this problem years ago when a client had a fixed data set and wanted a web application to return the nearest 3 structures in that set. They had the Daylight toolkit, which supports the basic fingerprint functions but doesn't have an external file format. I had to roll my own, which doesn't take much time, but it wasn't worth my client's time to add all the optimizations that I know are available.

One of my interests is fast Tanimoto/similarity searching. There are a number of papers on fancy ways to organize fingerprints, often with hierarchical trees. I've tried implementing these, only to find that they were still slower than my fast linear search with the Swamidass and Baldi cutoff optimization. One of the papers benchmarked their improved algorithm against linear+Baldi code and showed that theirs was faster; I looked at their code and realized that they had a horribly slow Tanimoto calculation. I think (without strong evidence!) that these algorithms can't beat a fast linear search, so I want to have a reference system for them to show me that I'm wrong.
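For reference, the idea behind the Swamidass and Baldi cutoff is simple: Tanimoto(A, B) can never exceed min(a, b)/max(a, b), where a and b are the popcounts of the two fingerprints, so a threshold search at T only needs to score targets whose popcount lies between T*a and a/T. A rough Python sketch of the pruned linear scan (my own illustration, not chemfp's code):

```python
from math import ceil, floor

def popcount(fp: bytes) -> int:
    return sum(bin(byte).count("1") for byte in fp)

def tanimoto(fp1: bytes, fp2: bytes) -> float:
    both = sum(bin(a & b).count("1") for a, b in zip(fp1, fp2))
    either = sum(bin(a | b).count("1") for a, b in zip(fp1, fp2))
    return both / either if either else 0.0

def threshold_search(query: bytes, targets, T: float):
    """Linear scan over (id, fingerprint) pairs with the popcount bound."""
    a = popcount(query)
    lo = ceil(T * a)
    hi = floor(a / T) if T > 0 else 8 * len(query)
    hits = []
    for target_id, fp in targets:
        b = popcount(fp)
        if lo <= b <= hi:  # only in this band can Tanimoto reach T
            score = tanimoto(query, fp)
            if score >= T:
                hits.append((target_id, score))
    return hits
```

Sorting the targets by popcount lets you skip whole bands without even computing their popcounts, which is how the optimization really pays off.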

Early in 2010 I proposed an exchange format. Two formats, actually, one in text for easy generation and exchange and the other a binary format for fast analysis. The binary format is on hold while I work on the text format, in part because the text format is much more useful.

A format by itself might look pretty, but it isn't useful. My experience is that functionality, not prettiness, is the main reason to use a given format. I wanted to provide a good set of initial tools to encourage people to use the chemfp format. So, after about 40 days of work spread over the last 1.5 years, I present to you:



I'll start with a PubChem compound data set and use sdf2fps to extract the PubChem fingerprints:

% sdf2fps --pubchem  ~/databases/pubchem/Compound_013150001_013175000.sdf.gz | head -15
#type=CACTVS-E_SCREEN/1.0 extended=2
100609001313e00801037013004002004800000900200100240000000000000000 13150007
042503881095e111130f710700c0000018000003006000000c0000000000000000 13150008
100601001111e00801037001000000000800000900200100240000000000000000 13150009
100609001313e00801037013004002004800000900200100240000000000000000 13150010
000401001491e11011017100000000002000000000000000100000000000000000 13150011
0004010014d1e91011017140000000002000002000000000100000000000000000 13150013
000601001493e11811017110000000002000000800000000100000000000000000 13150014
000401001411e00001017000000000002000000000000000100000000000000000 13150023
000401001411600000000000000000000000000000000000000000000000000000 13150026
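Reading those lines back in is simple, which is the point of a text exchange format: each data line is a hex-encoded fingerprint followed by an identifier, and "#" lines are header lines. A rough reader sketch (mine, not chemfp's; I split on any whitespace, though the real files use a tab between the two fields):

```python
def read_fps(lines):
    """Yield (id, fingerprint-as-bytes) pairs from FPS-style lines."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip header and blank lines
        hex_fp, fp_id = line.split(None, 1)
        yield fp_id, bytes.fromhex(hex_fp)
```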

Now I want to do a similarity search of that data set. I'll make the target data file:

% sdf2fps --pubchem $PUBCHEM/Compound_013150001_013175000.sdf.gz -o targets.fps.gz
generate some queries,
% sdf2fps --pubchem $PUBCHEM/Compound_005000001_005025000.sdf.gz | head
#type=CACTVS-E_SCREEN/1.0 extended=2
1116e1401793e61811037101000000000000000000000000000000000000000000 5000001
112669001b93e31999407100000000004000000000000000200000000000000000 5000002
100601001393e10801007010000000000000000800000000000000000000000000 5000003
11662b801f97e21913077101000000000000000100800000100000000000000000 5000005
(Sadly, I also get the message:
close failed in file object destructor:
Error in sys.excepthook:

Original exception was:
sigh. Well, there's a reason this is listed as "alpha".)

Now that I have the queries and targets, I'll use simsearch and do the default similarity search, which finds the nearest k=3 targets to the query according to the Tanimoto similarity.

% sdf2fps --pubchem $PUBCHEM/Compound_005000001_005025000.sdf.gz | head > queries.fps
% simsearch -q queries.fps targets.fps.gz
#type=Tanimoto k=3 threshold=0.0
3 5000001 0.7403 13163368 0.7308 13163366 0.6474 13174749
3 5000002 0.7233 13153311 0.7192 13152891 0.6250 13174818
3 5000003 0.7734 13163171 0.7638 13163170 0.7197 13174864
3 5000005 0.6833 13150812 0.6531 13162365 0.5507 13174872
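Under the hood, a k-nearest search like this is just a scan that keeps the k best scores. A sketch of the idea (again my own illustration, not the simsearch implementation):

```python
import heapq

def tanimoto(fp1: bytes, fp2: bytes) -> float:
    both = sum(bin(a & b).count("1") for a, b in zip(fp1, fp2))
    either = sum(bin(a | b).count("1") for a, b in zip(fp1, fp2))
    return both / either if either else 0.0

def knearest(query: bytes, targets, k: int = 3):
    """Return the k best (score, target_id) pairs, highest score first."""
    scored = ((tanimoto(query, fp), target_id) for target_id, fp in targets)
    return heapq.nlargest(k, scored)
```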

This output format is still somewhat experimental and I'm looking for feedback. You can see it's in the same "family" as the fps format: a line containing the format and version, followed by key/value header lines, followed by data. I'm still not sure what metadata is needed here (do I really need the source and target filenames? Should I also have the date?), so feedback is much appreciated.
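To show how the data rows decompose, here's a sketch of a parser for one of the lines above (my illustration; the real files may use tab separators, so this splits on any whitespace): the first field is the hit count, the second the query id, followed by score/target-id pairs.

```python
def parse_simsearch_line(line: str):
    """Split a simsearch data row into (query_id, [(score, target_id), ...])."""
    fields = line.split()
    count, query_id = int(fields[0]), fields[1]
    pairs = fields[2:]
    hits = [(float(pairs[i]), pairs[i + 1]) for i in range(0, 2 * count, 2)]
    return query_id, hits
```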

You can of course specify different values of -k and set a --threshold. You can also ask for just the --counts, e.g. if you want to find how many targets are within 0.4 of each query but don't actually care about the identifiers.

There's a lot more in the package. See rdkit2fps, ob2fps, and oe2fps for examples of how to generate new fingerprints from the three supported toolkits.

Now, I admit, some of it's buggy but all the examples in the wiki documentation do work -- I skipped the ones that didn't!

I don't usually release "alpha" tools but I'm really looking for feedback. Kick the tires and let me know what you think! (And if you want to fund me - all the better. I am a consultant you know. :) )

Andrew Dalke is an independent consultant focusing on software development for computational chemistry and biology. Need contract programming, help, or training? Contact me

Copyright © 2001-2013 Andrew Dalke Scientific AB