(See Optimizing substructure keys for the implementation of this approach.)
I have to work on other tasks for the next couple of months (paying work calls, and I need to prepare for my Python and Django programming courses for cheminformatics; February 2011 in Leipzig, spaces still available!) so I don't have time to write up what I did to optimize the subgraph enumeration algorithm so it's about 3 times faster.
I do want to give you an idea of one of the directions I'm going. I want to see if there are better substructure fingerprint filters than the published ones I know about (MACCS and CACTVS), and I want to use more automated target-based techniques to find those keys. (Actually, I want to use query-based techniques but getting access to good query data sets is hard.)
I can now process about 150 structures per second, so I analyzed the first 352,710 PubChem structures (those with ids between 1 and 400000) with k=6 to get 128,284 unique SMARTS patterns with up to 6 atoms. I removed those patterns which occured in less than 0.1% of the structures leaving 7,999 SMARTS patterns.
I want to make a substructure filter based only on those SMARTS patterns. This is different from the MACCS and CACTVS keys which have fields for "more than 2 carbons" or "halogen". Can I pick, say, 64 of those SMARTS in order to have good selectivity?
What does "good" mean? I don't have the time to work out the ideas, but my first thought was to minimize the number of structures which have the same fingerprint. (I think this minimizes the NxN search time, but don't quote me on that.)
This is a high-dimensional search in a non-smooth space. After 20 years I finally have a chance to use a genetic algorithm. I downloaded Pyevolve and spent just enough time playing with it to get preliminary results. Result? Worse than a naive algorithm I'll talk about next, and without some nice characteristics like consistency of keys when the input set changes. That's often the problem with GAs. Well, there went two days of CPU time.
My naive approach was to start with the set of all compounds. For each SMARTS pattern I can break the set up into two parts: those with the pattern and those without. With a target-based optimization the best I can do is find the SMARTS which most closely splits the set in half. Now I have two sets. Repeat the algorithm but this time see how the SMARTS splits every set and use the sum of the squares of the size differences as my metric. (I have no idea if that's really the right metric.)
I let it run for about 6-7 hours and got the following set of SMARTS, from most divisive to least:
Cc(c)cc CCO CN C=O CCCC cN n C=C Cl CC(C)C cO S COC cc(c)c CC(C)O CC CN(C)C O C1CCCCC1 CCC=O F CCCCCC Br C(CO)O C CccC CNC C1CCCC1 NO P C(=O)O cc(c)nc C(CO)CO CCC N cccc CCC(C)C=O C(CN)N cCc cc(c)c(c)c CC(C)(C)C [I] C(O)O CcccC C=N CC(=O)C o CCNCC ccncn C(CCO)CO [Si] c(cO)O CccccC CCOCC s CCC=C(C)C [As] CC(C)N C#C [Se] NN COCCOC CC(C)C(C)C c(cCl)cClAt least they look reasonable.
At that point there were 53,430 fingerprints with at least 2 members, and at least one fingerprint with 773 structures. I think that means that an NxN search will need 773*N subgraph isomorphism tests. (It does not mean that a query needs only 773 tests since a search for "CC" will hit a lot of structures.)
I wonder how useful they are in practice. If you test it out, let me know.
These are meant mostly as a teaser. They are not a replacement for existing substructure keys. It's tuned for a single target data set and it's made up of the SMARTS which are a subset of SMILES. A real set of filters should never have a small query which requires subgraph isomorphism against a large portion of the database. For example, if the query is [Dy] then these filters will do nothing at all to help.
BTW, what in the world is CID444218? Its molecular formula is C10H39Ac10N7No7NpO6PS-3. Seven atoms of nobelium? That's not bad for something with a half-life of an hour and where "no [chemical] compounds exist for nobelium." Nor is that the only PubChem records with nobelium. Anyone want to place an order for MolPort-006-699-029? With overnight shipping you'll only need 4kg in order to get 1g the next day.
Andrew Dalke is an independent consultant focusing on software development for computational chemistry and biology. Need contract programming, help, or training? Contact me
Copyright © 2001-2013 Andrew Dalke Scientific AB