Dalke Scientific Software: More science. Less time. Products
[ previous | newer ]     /home/writings/diary/archive/2008/06/26/fingeprint_background

Molecular fingerprints, background

A molecular "fingerprint" is a widely used concept in chemical informatics. Comparing molecules is hard. Comparing bitstrings is easy. Many people turn a molecule into a bitstring under the assumption that comparing the bitstring gives insight to how to compare molecules. To make life more complicated, chemical informatics uses the term "fingerprint" but feel free to think of it in your head as "bitstring."

There are two main major types of fingerprints, and lots of lesser used ones. "Structural fingerprints" are based on substructure features. The most widely known and used are the MACCS keys, which are available in two flavors: 166 bit and 320 bit. There's also the CACTVS substructure keys, which PubChem uses.

Each bit in a structural fingerprint corresponds to some chemical property, usually some sort of substructure presence. These are designed to be chemically relevant, which usually means it's good for the sorts of chemistry that the designer was interested in. It was easier to find documentation about the PUBCHEM_CACTVS_SUBSKEYS in PubChem so I'll share a bit of that definition.

Section 1: Hierarchic Element Counts - These bits test for the 
presence or count of individual chemical atoms represented 
by their atomic symbol.

Bit Position	Bit Substructure
0	>= 4 H
1 	>= 8 H
2	>= 16 H
3	>= 32 H
4	>= 1 Li
5	>= 2 Li
6	>= 1 B
7	>= 2 B
8	>= 4 B
Section 2: Rings in a canonic Extended Smallest Set of Smallest Rings (ESSSR) ring set
115	>= 1 any ring size 3
116	>= 1 saturated or aromatic carbon-only ring size 3
117	>= 1 saturated or aromatic nitrogen-containing ring size 3
118	>= 1 saturated or aromatic heteroatom-containing ring size 3
119	>= 1 unsaturated non-aromatic carbon-only ring size 3
Section 7: Complex SMARTS patterns
713	Cc1ccc(C)cc1
714	Cc1ccc(O)cc1
Substructure keys are often used in substructure searches as a way to filter out obvious mismatches before doing a computationally expensive subgraph isomorphism test. They are sometimes used for similarity searches but in practice I think hash fingerprints are better because some bits are more likely to be set than others (how many compounds do you deal with that have 4 or more borons in them?).

The other major fingerprint type is called "hash fingerprints." The most well known of these are the Daylight fingerprints. Enumerate all linear substructures of length N in the molecule. Daylight sets the bounds on N as 3 <= N <= 7, but this is tunable. There are N atoms and N-1 bonds in this length. Each atom and bond has a numeric value, based on the element type, bond type, and other properties deemed interesting. Use some sort of hash function to turn this 2*N-1 valued term into an integer. Usually the last step of the process is to take that integer modulo the number of bits in the fingerprint so that it fits in the space available. Mark that bit with a 1. Many different paths might cause the same bit to be set to 1.

Hash fingerprints are always a multiple of 8 bits long and usually a power of 2 between 128 bits and 4096 bits. (Even OpenBabel, which uses a 1021 bit hash, maps that to a 1024 bit fingerprint, with 0 for the remaining 3 bits.) They are very similar to Bloom filters. Hash fingerprints are used most often in similarity searching, with the hope that two similar compounds create similar fingerprints, and that two similiar fingerprints means the compounds are similar.

Supposing that's true, how do you compare two fingerprints FP1 and FP2? There are many ways. One is to count the number of shared bits, that it, the number of bits which are on in both bitstrings. This is a good start, but consider two large quite dissimilar molecules which both have a section in common but otherwise are quite dissimilar. Intuitively you want the dissimilar structure to reduce the overall similarity score.

There's a huge number of ways to compare hash fingerprints, but each depends on only 4 different values:

When implementing the comparison routines it's easier to use slightly different definitions: I'm using the Daylight fingerprint notation here. It's a bit annoying because in my code I'll really want to use lower case "a" and "b" for what they use as "A" and "B". When I get to the code I'm likely going to switch.

The common term by those in the know for the "total number of 'on' bits" is "popcount", which is short for "population count." The collective wisdom of Wikipedia prefers Hamming weight. Which reminds me, take a break and enjoy Hamming's seminar "You and Your Research".

Because the bitstrings are bits, the way to compute "c" (the number of bits in common) is to compute the boolean "and" (fp1 & fp2) and find the popcount of the result.

The most widely used and accepted finger similarity function in chemical informatics is called the Tanimoto coefficient or Tanimoto score, or simply "the Tanimoto." It's identical to the Jaccard index as applied to binary bitstrings, and is defined as:

which is the same as
You can see that identical fingerprints have a score of 1.0, and when there are no bits in common the score is 0.0. Also note how if there are more bits in only one or the other fingerprint then the score descreases. If there are no bits in either fingerprint then this generates a "0/0" case. It means that the fingerprint generation code couldn't handle the chemistry involved. For hash fingerprints this should never happen, except perhaps with the vacuum case. For some sorts of substructure keys it might happen. It's like trying to compare a shark and Mt. Everest, based on how similar their wheels are.

By the way, there's another way to define the Tanimoto coefficient of two fingerprints. It's the number of on bits in the intersection of the two fingerprints, divided by the number of on bits in their union. I learned that one reading the OpenBabel source code.

Copyright © 2001-2008 Dalke Scientific Software, LLC.