Any long-time reader knows that I'm interested in chemical fingerprints. That link points to a 7-part series on how to generate fingerprints, how to compute the Tanimoto score between two fingerprints, and how to do that calculation quickly.
- fingerprint extraction from SD files including PubChem's substructure fingerprints
- native fingerprint generation in RDKit, OpenBabel, and OpenEye's OEChem
- cross-platform implementations of the RDKit MACCS keys and a variation of the PubChem Substructure keys
- Tanimoto similarity search
It's an alpha release because the test suite isn't complete, and in doing the documentation in the last day I've found a number of corners where I just haven't tested the code paths. It's also in alpha because I want feedback about its usefulness, and about what you think I should do to make the format and the project more useful.
Fingerprints are one of the core concepts in my field. It's hard to read a few papers in my field without coming across something to do with fingerprint generation or scoring. Yet for all of the sophisticated understanding of the mathematical concept, the software ecosystem around fingerprints is rather weak.
The RDKit, OpenBabel, OpenEye's OEChem, the CDK, Schrodinger's Canvas, Xemistry's CACTVS, Accelrys's PipelinePilot, CCG's MOE, and many, many more tools generate fingerprints. I'm limiting myself to those which produce dense binary fingerprints of size less than about 10,000 bits, and usually only a few hundred. The best known are the MACCS keys with 166 bits. The CACTVS keys in PubChem are 881 bits, and hash fingerprints are usually around 1024-4096 bits.
So here's a question: how well does RDKit's MACCS key implementation compare to OpenEye's? Which provides a better similarity score: the CACTVS substructure keys or the slightly longer OpenBabel path fingerprints?
Try answering that question and you'll see that there is no exchange format for fingerprints. OpenBabel, Canvas, and OEChem all have black-box storage files for fast fingerprint search, but they are not meant for external use. Instead, you'll likely end up making your own format, and tools which work with that format, none of which are portable outside of your own research group. (In one case a researcher used a format with space-separated "0"s and "1"s, meaning 2 bytes to store each bit!)
I first ran into this problem years ago when a client had a fixed data set and they wanted a web application to return the nearest 3 structures in the data set. They had the Daylight toolkit, which supports the basic fingerprint functions but which doesn't have an external file format. I had to roll one, which doesn't take much time, but it wasn't worth my client's time for all the optimizations that I know are available.
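To put numbers on that storage cost, here is a small sketch comparing three encodings of one MACCS-sized fingerprint. The bit values are random and the bit order inside each byte is an arbitrary choice made for this illustration, not something any toolkit prescribes.

```python
import random

NUM_BITS = 166  # MACCS-sized fingerprint

# Random bits, only so there is something to encode.
rng = random.Random(0)
bits = [rng.randint(0, 1) for _ in range(NUM_BITS)]

# 1. Space-separated "0" and "1" characters: roughly 2 bytes per bit.
spaced = " ".join(str(b) for b in bits)

# 2. Packed binary: 8 bits per byte, rounded up to a whole byte.
packed = bytearray((NUM_BITS + 7) // 8)
for i, b in enumerate(bits):
    if b:
        packed[i // 8] |= 1 << (i % 8)  # bit order is an assumption

# 3. Hex text of the packed bytes: 2 characters per byte.
hex_text = packed.hex()

print(len(spaced), len(packed), len(hex_text))
```

The hex form costs twice the packed size but stays readable and easy to generate from any language, which is the usual trade-off for a text exchange format.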
One of my interests is fast Tanimoto/similarity searching. There are a number of papers on fancy ways to organize fingerprints, often with hierarchical trees. I've tried implementing these only to find that they were still slower than my fast linear search with the Swamidass and Baldi cutoff optimization. One of the papers showed benchmarks of their improved algorithm vs. the linear+Baldi optimization code, and showed that theirs was still faster. I looked at their code and realized that they had a horribly slow Tanimoto calculation. I think (without strong evidence!) that these algorithms can't beat a fast linear search, so I want to have a reference system for them to show me that I'm wrong.
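For readers who haven't seen the cutoff optimization, it rests on a popcount bound: two fingerprints with a and b bits set can't have a Tanimoto above min(a, b)/max(a, b), so many targets can be rejected without comparing any bits. The following is a minimal pure-Python sketch of that idea, not the chemfp implementation; the function names are made up for this example, and a serious version would pre-sort targets by popcount and use a fast native popcount.

```python
def popcount(fp: bytes) -> int:
    """Number of bits set in a fingerprint stored as bytes."""
    return bin(int.from_bytes(fp, "big")).count("1")


def tanimoto(fp1: bytes, fp2: bytes) -> float:
    """Tanimoto = |A & B| / |A | B|; defined here as 0.0 when both are empty."""
    a = int.from_bytes(fp1, "big")
    b = int.from_bytes(fp2, "big")
    union = bin(a | b).count("1")
    if union == 0:
        return 0.0
    return bin(a & b).count("1") / union


def threshold_search(query, targets, threshold):
    """Linear scan over (id, fingerprint) pairs using the popcount bound."""
    q = popcount(query)
    hits = []
    for target_id, fp in targets:
        t = popcount(fp)
        largest = max(q, t)
        # Upper bound on the score is min(q, t) / max(q, t).
        # If even the bound is below the threshold, skip the full calculation.
        if largest > 0 and min(q, t) < threshold * largest:
            continue
        score = tanimoto(query, fp)
        if score >= threshold:
            hits.append((target_id, score))
    return hits


# Tiny made-up 8-bit fingerprints, for illustration only.
query = bytes([0b00001111])
targets = [
    ("a", bytes([0b00001111])),
    ("b", bytes([0b00000011])),
    ("c", bytes([0b11111111])),
    ("d", bytes([0b00000000])),
]
hits = threshold_search(query, targets, 0.6)
print(hits)
```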
Early in 2010 I proposed an exchange format. Two formats, actually, one in text for easy generation and exchange and the other a binary format for fast analysis. The binary format is on hold while I work on the text format, in part because the text format is much more useful.
A format by itself might look pretty, but it isn't useful. My experience is that functionality is the major reason to use a given format, not prettiness. I wanted to provide a good set of initial tools to encourage people to use the chemfp format. So, after about 40 days of work over the last 1.5 years, I present to you chemfp 1.0a1.
I'll start with a PubChem compound data set and use sdf2fps to extract the PubChem fingerprints:

    % sdf2fps --pubchem ~/databases/pubchem/Compound_013150001_013175000.sdf.gz | head -15
    #FPS1
    #num_bits=881
    #software=CACTVS/unknown
    #type=CACTVS-E_SCREEN/1.0 extended=2
    #source=/Users/dalke/databases/pubchem/Compound_013150001_013175000.sdf.gz
    #date=2011-05-29T22:34:14
    075e00000000000000000000000000000000000000000c0683c10000000000832a000038000800
    000030108118000c03430300000140244202004100008440001011000026111004460389892104
    100609001313e00801037013004002004800000900200100240000000000000000 13150007
    071e04000000000000000000000000000080040000000c0683c10000000012833f000058000000
    000030200119000c60030020021140054a000040100024040010118020101330644c21ac58419c
    042503881095e111130f710700c0000018000003006000000c0000000000000000 13150008
    075e00000000000000000000000068000000000000000c060300a0010000008302000038000000
    000030148318204c00c10000000140044200004100000400001011001020111004440189882104
    100601001111e00801037001000000000800000900200100240000000000000000 13150009
    075e00000000000000000000000000000000000000000c0603000000000000832a000038000800
    000030108318204c03430300000140244202004100008440001011011026111004440389892104
    100609001313e00801037013004002004800000900200100240000000000000000 13150010
    031e0c000200000000000000000000000000000000002c06010000000000008902000058200000
    00003020211b000d80010000501140054a000e42000024100810119800001310044c05a8080184
    000401001491e11011017100000000002000000000000000100000000000000000 13150011
    031e0c000208000000000000000000000000000000002c06010000000000008902000058200200
    00803520211b000d80010000501140054a000e42000024102810119800001710044c05a8080184
    0004010014d1e91011017140000000002000002000000000100000000000000000 13150013
    035e1c000200000000000000000000000000000000002c06010000000000008902000078202000
    00003030a11b000d83010802509140254ac20e43000024500814119800265310044c05a9890184
    000601001493e11811017110000000002000000800000000100000000000000000 13150014
    03ce00000600000000000000000000000080040000000c0000000000000012800f000038200000
    00003000811a004d8003000010296004420000c400010410081211180000111005440588080104
    000401001411e00001017000000000002000000000000000100000000000000000 13150023
    01ce00000600000000000000000000000080040000000000000000000000008001000038200000
    00003000a10a004d8001000010296004420000c400010410081211180000111005440588080100
    000401001411600000000000000000000000000000000000000000000000000000 13150026
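Reading this kind of output back takes very little code. Here is a hedged sketch of a reader, assuming that header lines start with "#", that all header lines after the first are key=value pairs, and that each fingerprint record is a single line of hex digits followed by whitespace and an identifier. The read_fps name and the short 16-bit example records are made up for illustration; this is not the chemfp API.

```python
import io


def read_fps(lines):
    """Return (metadata dict, list of (id, fingerprint bytes)) from FPS-like text."""
    metadata = {}
    records = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("#"):
            if "=" in line:
                # Split on the first "=" only; values may contain "=" themselves.
                key, value = line[1:].split("=", 1)
                metadata[key] = value
            else:
                metadata["format"] = line[1:]  # for example "FPS1"
            continue
        # Assumed record layout: hex fingerprint, whitespace, identifier.
        hex_fp, fp_id = line.split(None, 1)
        records.append((fp_id, bytes.fromhex(hex_fp)))
    return metadata, records


# Hypothetical 16-bit records, far shorter than real 881-bit ones.
example = io.StringIO(
    "#FPS1\n"
    "#num_bits=16\n"
    "#type=Example/1 extended=2\n"
    "0f1e\tid1\n"
    "ffff\tid2\n"
)
metadata, records = read_fps(example)
print(metadata)
print(records)
```

Storing the fingerprint as bytes rather than hex text halves the memory use and makes bitwise operations straightforward.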
Now I want to do a similarity search of that data set. I'll make the target data file:
    % sdf2fps --pubchem $PUBCHEM/Compound_013150001_013175000.sdf.gz -o targets.fps.gz

generate some queries,

    % sdf2fps --pubchem $PUBCHEM/Compound_005000001_005025000.sdf.gz | head
    #FPS1
    #num_bits=881
    #software=CACTVS/unknown
    #type=CACTVS-E_SCREEN/1.0 extended=2
    #source=/Users/dalke/databases/pubchem/Compound_005000001_005025000.sdf.gz
    #date=2011-05-29T22:53:46
    07de04000600000000000000000000000080040000003c060100000000001a8003000078200800
    00003014a31b208d81c10300103140844a0a00c10001a6109810118810221311045c07ab892184
    1116e1401793e61811037101000000000000000000000000000000000000000000 5000001
    07de0c000200000000000000000000000080460200000c02000000000000008103000078200800
    00003038a77b60bd03c993291015c0adee2e00410984ee400c909b851d261b50045f07bb8de184
    112669001b93e31999407100000000004000000000000000200000000000000000 5000002
    07de04000600000000000000000000000080060000000c0603000000000000800a000078200800
    00003010811b000c03c10300103140a44a0a004100008640981011000026131004460389892104
    100601001393e10801007010000000000000000800000000000000000000000000 5000003
    075e1c000200000000000000000018000080040000000c00000000000000128001000078200800
    0000b000851b404091410320103140800b1a40c10001a6109800118802321350645c072db9e188
    11662b801f97e21913077101000000000000000100800000100000000000000000 5000005

(Sadly, I also get the message:

    close failed in file object destructor:
    Error in sys.excepthook:

    Original exception was:

sigh. Well, there's a reason this is listed as "alpha".)
Now that I have the queries and targets, I'll use simsearch and do the default similarity search, which finds the nearest k=3 targets to the query according to the Tanimoto similarity.
    % sdf2fps --pubchem $PUBCHEM/Compound_005000001_005025000.sdf.gz | head > queries.fps
    % simsearch -q queries.fps targets.fps.gz
    #Simsearch/1
    #num_bits=881
    #software=chemfp/1.0a1
    #type=Tanimoto k=3 threshold=0.0
    #query_source=/Users/dalke/databases/pubchem/Compound_005000001_005025000.sdf.gz
    #target_source=/Users/dalke/databases/pubchem/Compound_013150001_013175000.sdf.gz
    3 5000001 0.7403 13163368 0.7308 13163366 0.6474 13174749
    3 5000002 0.7233 13153311 0.7192 13152891 0.6250 13174818
    3 5000003 0.7734 13163171 0.7638 13163170 0.7197 13174864
    3 5000005 0.6833 13150812 0.6531 13162365 0.5507 13174872
This output format is still somewhat experimental. I'm looking for feedback. You can see it's in the same "family" as the fps format, with a line containing the format and version, followed by key/value header lines, followed by data. I'm still not sure about what metadata is needed here (do I really need the source and target filenames? Should I also have the date?) so feedback is much appreciated.
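If you want to experiment with the output, a parser is only a few lines. This sketch assumes the layout seen in the sample output: a hit count, the query identifier, then alternating score and target identifier fields separated by whitespace. Since the format is experimental that layout may well change, and parse_simsearch is a name invented for this example.

```python
def parse_simsearch(lines):
    """Return (header dict, {query_id: [(target_id, score), ...]})."""
    header = {}
    results = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith("#"):
            key, sep, value = line[1:].partition("=")
            if sep:
                header[key] = value
            else:
                header["format"] = key  # for example "Simsearch/1"
            continue
        fields = line.split()
        count, query_id, rest = int(fields[0]), fields[1], fields[2:]
        if len(rest) != 2 * count:
            raise ValueError("expected %d score/id pairs: %r" % (count, line))
        # Assumed field order within each pair: score first, then target id.
        results[query_id] = [
            (rest[i + 1], float(rest[i])) for i in range(0, len(rest), 2)
        ]
    return header, results


sample = """\
#Simsearch/1
#num_bits=881
#type=Tanimoto k=3 threshold=0.0
3 5000001 0.7403 13163368 0.7308 13163366 0.6474 13174749
3 5000002 0.7233 13153311 0.7192 13152891 0.6250 13174818
"""
header, results = parse_simsearch(sample.splitlines())
print(header["type"])
print(results["5000001"])
```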
You can of course specify different values of -k and set a --threshold. You can also search for just --counts, e.g., if you want to find how many targets score at least 0.4 against the query but you don't actually care about the identifiers.
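To make the three search modes concrete, here is a rough pure-Python sketch of what a k-nearest search with a threshold, and a count-only search, compute. It treats the threshold as a minimum Tanimoto score, which is my reading of the options above; the function names and the 8-bit fingerprints are invented for illustration, and this is not how simsearch is implemented.

```python
import heapq


def tanimoto(fp1: bytes, fp2: bytes) -> float:
    """Tanimoto = |A & B| / |A | B|; defined here as 0.0 when both are empty."""
    a = int.from_bytes(fp1, "big")
    b = int.from_bytes(fp2, "big")
    union = bin(a | b).count("1")
    if union == 0:
        return 0.0
    return bin(a & b).count("1") / union


def knearest(query, targets, k=3, threshold=0.0):
    """The k best (score, id) pairs at or above the threshold, best first."""
    scored = ((tanimoto(query, fp), target_id) for target_id, fp in targets)
    kept = (pair for pair in scored if pair[0] >= threshold)
    return heapq.nlargest(k, kept)


def count_hits(query, targets, threshold):
    """How many targets score at or above the threshold; identifiers are ignored."""
    return sum(1 for _, fp in targets if tanimoto(query, fp) >= threshold)


# Made-up 8-bit fingerprints.
query = bytes.fromhex("0f")
targets = [
    ("t1", bytes.fromhex("0f")),
    ("t2", bytes.fromhex("07")),
    ("t3", bytes.fromhex("03")),
    ("t4", bytes.fromhex("f0")),
    ("t5", bytes.fromhex("1f")),
]
nearest = knearest(query, targets, k=3)
num_hits = count_hits(query, targets, 0.5)
print(nearest, num_hits)
```

Using heapq.nlargest keeps only k candidates in memory at a time, so the scan stays linear in the number of targets regardless of k.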
Now, I admit, some of it's buggy but all the examples in the wiki documentation do work -- I skipped the ones that didn't!
I don't usually release "alpha" tools but I'm really looking for feedback. Kick the tires and let me know what you think! (And if you want to fund me - all the better. I am a consultant you know. :) )
Andrew Dalke is an independent consultant focusing on software development for computational chemistry and biology. Need contract programming, help, or training? Contact me
Copyright © 2001-2010 Dalke Scientific Software, LLC.