MACCS key 44
The MACCS 166 keys are one of the mainstay fingerprints of cheminformatics, especially regarding molecular similarity. It's rather odd, really, since they were developed for substructure screening and not similarity. I suppose that Jaccard would agree that any relatively diverse feature vector can likely be used to measure similarity, whether it be Alpine biomes or chemical structures.
Here's a bit of dirty laundry that you'll not read in the literature. There are a lot of MACCS implementations, and they don't agree fully with each other. The differences are likely small, but as far as I can tell, no one has really investigated if it's a problem, or noted that it might be a problem.
I'll structure the explanation around key #44. What is the definition for key 44?
To start, there is no publication describing the MACCS 166 public keys. All of the citations for it either say a variation of "MDL did it" or cite the 2002 paper which reoptimized the keys for similarity. Thing is, just about everyone uses the "unoptimized" definitions, so this is, technically, the wrong citation. (Why do people use it? Tradition, perhaps, or because it feels better to have a real citation rather than a nebulous one.)
Instead, the definitions appear to have come from ISIS/Base, and have been passed around from person to person through informal means. I haven't used the MDL software and can't verify the source myself. There's a relatively recent whitepaper from Accelrys titled "The Keys to Understanding MDL Keyset Technology" which says they are defined in the file "eksfil.dat". A Google search finds 8 results for "eksfil.dat". All are tied to that white paper. The PDF has creation and modification dates of 31 August 2011, and Archive.org first saw that URL on 11 October 2011.
It's easy to see that the reoptimization fingerprint is not the same as the 166 keys that everyone uses. You'll find that many places say that key 44 is defined as "OTHER". Table 5 of the reoptimization paper has an entry for '"other" atom type', but there's nothing which assigns it to key 44. You can't even try to infer some sort of implicit ordering because the previous entry in Table 5 is "isotope", which is key 1 in the MACCS 166 keys, and two entries later is "halogen", which is key 134.
If you cite Durant, Leland, Henry, and Nourse (2002) as your reference to the MACCS 166 bit public keys then you are doing your readers a disservice. That paper defines different fingerprints than the ones you used. Just go ahead and cite "MACCS keys. MDL Information Systems" and if the reviewer complains that it's a bad citation, point them to this essay and ask them for the correct one. Then tell me what they said. If Accelrys complains then they need to suggest the correct citation and put it in their white paper. Even better would be a formal publication and a validation suite. (I can dream, can't I?)
In practice, many people use the MACCS keys as interpreted by the implementers of some piece of software. I used "interpreted by" because "implemented by" is too strong. There are ambiguities in the definition, mistakes in the implementations, and differences in chemical interpretation, compounded by a lack of any sort of comprehensive validation suite.
Let's take key 44, "OTHER". Remember how the definition comes from an internal MDL data file? What does "OTHER" mean? RDKit defines it as '?' in MACCSkeys.py to indicate that it has no definition for that key. That line has a commit date of 2006-05-06. RDKit's lack of a definition is notable because Open Babel, CDK, a user-contributed implementation for ChemAxon, and many others reuse the RDKit SMARTS definitions. All of them omit key 44.
Others have implemented key 44. TJ O'Donnell, in "Design and Use of Relational Databases in Chemistry" (2009), defines it as the SMARTS [!#6!#7!#8!#15!#16!#9!#17!#35]. MayaChemTools defines it in code as an atom whose element number is not in "1|6|7|8|9|14|15|16|17|35|53". (See _IsOtherAtom.)
These are the ones where I have access to the source and could investigate without much effort.
Both the whitepaper and the reoptimization paper define what "other" means, and the whitepaper does so specifically in the context of the MACCS 166 keys. It says:
"Other" atoms include any atoms other than H, C, N, O, Si, P, S, F, Cl, Br, and I, and is abbreviated "Z".This appears definite and final. Going back to the three different implementation geneologies, RDKit and its many spinoffs don't have a definition so by definition isn't correct. O'Donnell's is close, but the SMARTS pattern omits hydrogen, silicon, and iodine. And MayaChemTools gets it exactly correct.
Good job, Manish Sud!
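The comparison above can be written out as sets of atomic numbers. This is only a sketch in plain Python: the names are mine and come from no toolkit, the whitepaper set is transcribed from the quote, and the O'Donnell set is read directly off the SMARTS pattern.

```python
# Atomic numbers the Accelrys whitepaper says are NOT "other":
# H, C, N, O, F, Si, P, S, Cl, Br, I
WHITEPAPER_NOT_OTHER = {1, 6, 7, 8, 9, 14, 15, 16, 17, 35, 53}

# Atomic numbers excluded by the SMARTS [!#6!#7!#8!#15!#16!#9!#17!#35]
ODONNELL_NOT_OTHER = {6, 7, 8, 15, 16, 9, 17, 35}

def is_other_whitepaper(atomic_number):
    """True if the whitepaper definition calls this element "other"."""
    return atomic_number not in WHITEPAPER_NOT_OTHER

def is_other_odonnell(atomic_number):
    """True if the O'Donnell SMARTS would match this element."""
    return atomic_number not in ODONNELL_NOT_OTHER

# Elements on which the two definitions disagree
disagreements = sorted(WHITEPAPER_NOT_OTHER ^ ODONNELL_NOT_OTHER)
print(disagreements)  # [1, 14, 53], that is, H, Si and I
```

Both definitions agree that gold (79) is "other". They differ only for the three elements listed.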
Are these MACCS variations really a problem?
No. Not really. Well, maybe. It depends on who you are.
When used for similarity, a dead bit just makes things more similar because there are fewer ways to distinguish between molecules. In this case too, key 44 is rare. Only a handful of molecules contain "other" atoms (like the gold in auranofin) so when characterizing a database it's likely fine.
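Here is a small sketch of what a dead bit does to a Tanimoto (Jaccard) score. The fingerprints are made-up sets of on-bit key numbers, not real MACCS output, and the tanimoto helper is my own.

```python
def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two sets of on-bits."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

# Made-up key sets; only the first contains an "other" atom (key 44)
fp_with_gold = {44, 57, 120, 164}
fp_without = {57, 120, 164}

# Score when key 44 is implemented
full = tanimoto(fp_with_gold, fp_without)

# Score from an implementation which never sets key 44
dead = tanimoto(fp_with_gold - {44}, fp_without - {44})

print(full, dead)  # 0.75 1.0
```

This covers the case where exactly one of the two molecules has the bit set, which is when the dead bit hides a real difference.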
You don't need to trust my own gut feeling. You can read the RDKit documentation and see "The MACCS keys were critically evaluated and compared to other MACCS implementations in Q3 2008. In cases where the public keys are fully defined, things looked pretty good."
Okay, so you're hesitant about the keys which aren't "fully defined"? No need to despair. Roger Sayle ported the RDKit patterns (and without key 44) over to ChemAxon, and reported:
This work is heavily based upon the previous implementation by Miklos Vargyas, and the SMARTS definitions developed and refined by Greg Landrum and Andrew Dalke. This implementation achieves ~65% on the standard Briem and Lessel benchmark, i.e. almost identical to the expected value for MACCS keys reported in the literature by MDL and others.
(NB: All I did was proofread the RDKit SMARTS and find a few places that needed fixing.)
The MACCS 166 keys are a blunt tool, designed for substructure search and repurposed for similarity more because they were already present and easy to generate. 2D similarity search is another blunt tool. That's not to say they are horrible or worthless! A rock is a blunt tool for making an ax, but we used stone axes quite effectively throughout the Neolithic.
Just don't treat the MACCS 166 keys as a good luck charm, or as some sort of arcane relic passed down by the ancients. There are limitations in the definition and limitations in the implementation. Different tools will give different answers, and if you don't understand your tools they may turn on you.
And when you write a paper, be honest to your readers. If you are using the RDKit implementation of the MACCS keys or a derived version in another toolkit (and assuming they haven't been changed since I wrote this essay), point out that you are only using 164 of those 166 bits.
For a warmup exercise, what is the other unimplemented bit in the RDKit MACCS definition?
For your homework assignment, use two different programs to compute the MACCS keys for a large data set and see 1) how many bits are different? (e.g., sum of the Manhattan distance between the fingerprints for each record, or come up with a better measure), 2) how many times does the nearest neighbor change?, and 3) (bonus points) characterize how often those differences are because of differences in how to interpret a key and how often it's because of different toolkit aromaticity/chemistry perception methods.
I expect a paper in a journal by the end of next year. :)
(Then again, for all I know this is one of those negative results papers that's so hard to publish. "9 different MACCS key implementations produce identical MACCS keys!" doesn't sound exciting, does it?)
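As a starting point for the first two homework questions, here is a sketch in plain Python. It assumes each program's output has already been parsed into a list of sets of on-bits, with records in the same order. The function names and the toy data are hypothetical, and ties in the nearest-neighbor search go to the lowest index, which is an arbitrary choice.

```python
def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two sets of on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def total_bit_differences(fps1, fps2):
    """Sum over records of the Manhattan (Hamming) distance."""
    return sum(len(a ^ b) for a, b in zip(fps1, fps2))

def nearest_neighbor(fps, i):
    """Index of the record most similar to record i, excluding i."""
    candidates = [j for j in range(len(fps)) if j != i]
    return max(candidates, key=lambda j: tanimoto(fps[i], fps[j]))

def count_neighbor_changes(fps1, fps2):
    """Number of records whose nearest neighbor differs between programs."""
    return sum(nearest_neighbor(fps1, i) != nearest_neighbor(fps2, i)
               for i in range(len(fps1)))

# Toy data standing in for the output of two programs
program_a = [{44, 57}, {57}, {44, 120}, {120, 130, 131}]
program_b = [{57}, {57}, {120}, {120, 130, 131}]  # never sets key 44

print(total_bit_differences(program_a, program_b))   # 2
print(count_neighbor_changes(program_a, program_b))  # 1
```

The neighbor search here is quadratic in the number of records, so for a large data set you would want a real similarity search tool in its place.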
Andrew Dalke is an independent consultant focusing on software development for computational chemistry and biology.
Copyright © 2001-2013 Andrew Dalke Scientific AB