ECFP-like fragments in PubChem

Previously I posted about unique fragments in PubChem. That analysis used my molecular subgraph enumeration algorithm. In this essay I'll report some results from looking at the unique bit counts from RDKit's MorganFingerprint algorithm, which is an ECFP variant.

My first graph in the previous essay shows that there are about 2 million unique fragments of size up to 7 in PubChem, and that the second half of the data files contained few fragments which weren't already in the first half. This suggests that the number of size-7 substructures actually found is small compared to the number of possible structures of that size, which is quite curious.

Rather, I expect that the number of unique fragments should level off given enough molecules. In the simplest case, there are 112 elements, 5 of which can also be aromatic, for a total of 117 possible unique atom types. I found 110 of them in my PubChem subset.

Similarly, I found only 1103 unique fragments with two atoms. The breakdown as a function of fragment length is:

- Length 1: `110` unique fragments
- Length 2: `1103` unique fragments
- Length 3: `4209` unique fragments
- Length 4: `19398` unique fragments
- Length 5: `86336` unique fragments
- Length 6: `364342` unique fragments
- Length 7: `1447488` unique fragments

How many possible substructures are there of size 7? There are 1+1+2+6+21+112+853 = 996 connected graphs of up to 7 nodes, and I'll guess that 800 of them are chemically accessible. Assuming only 10 atom types and ignoring cycles gives up to 10^7, or about 10 million, ways to assign atom types to a 7-atom graph. I'll hazard another factor of 2**6 for the different bond types, which multiplies out to about 500 billion possible substructures. That's over five orders of magnitude larger than what I see.
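The back-of-the-envelope arithmetic above is easy to check. This sketch just multiplies out the assumptions stated in the text (800 accessible graphs, 10 atom types, a 2**6 bond-type factor); none of the inputs are anything more than the guesses given:

```python
import math

graph_counts = [1, 1, 2, 6, 21, 112, 853]  # connected graphs on 1..7 nodes
total_graphs = sum(graph_counts)
print(total_graphs)  # 996

accessible = 800          # guessed chemically accessible subset
atom_assignments = 10**7  # 10 atom types over 7 atoms
bond_variations = 2**6    # hazarded bond-type factor

estimate = accessible * atom_assignments * bond_variations
print(estimate)  # 512000000000, i.e. ~500 billion

# Compare against the ~2 million unique fragments actually observed:
observed = 2_000_000
print(round(math.log10(estimate / observed), 1))  # 5.4 orders of magnitude
```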

It's curious because it means that any substructure-based fingerprint using up to 7 atoms has at most a small multiple of ~2,000,000 different values. Sure, a hash algorithm might potentially generate values in the range 0-2^{32}, but only a few million of them will actually be generated.

(Some algorithms will generate multiple bits per feature; e.g., features with 1 or 2 atoms generate a single bit while features with 3 or more atoms generate two bits. This acts as a simple weighting scheme. There's a perfect correlation between those two bits, so I'm not counting the second one as meaningful.)

Statistical models of fingerprints often assume a roughly Bernoulli process, where the number of unique features is unbounded but new ones become increasingly rare. This observation suggests that there is actually an upper limit to the number of features, which changes the distribution slightly.

Is this observable in other fingerprints? I used RDKit's MorganFingerprint algorithm, which is a variation of the ECFP "extended connectivity fingerprint." I used radius values of 1, 2, 3, 4, and 5, with the other parameters at their defaults. Each increase in radius incorporates more distant atom information, so larger radii should produce more diverse bit values.
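The bookkeeping is straightforward: for each radius, keep a running set of every bit id seen so far. Here's a minimal sketch using RDKit's unfolded (sparse) Morgan fingerprint; the three SMILES strings are stand-ins for the PubChem stream, not the actual input:

```python
from rdkit import Chem
from rdkit.Chem import AllChem

# Stand-in records; the real run iterates over PubChem structures.
smiles_stream = ["CCO", "c1ccccc1", "CC(=O)Oc1ccccc1C(=O)O"]
radii = (1, 2, 3, 4, 5)
seen = {r: set() for r in radii}

for smi in smiles_stream:
    mol = Chem.MolFromSmiles(smi)
    if mol is None:  # RDKit rejects some records (e.g. bad valences)
        continue
    for r in radii:
        fp = AllChem.GetMorganFingerprint(mol, r)  # unfolded, count-based
        seen[r].update(fp.GetNonzeroElements())    # keys are the raw bit ids

for r in radii:
    print(r, len(seen[r]))
```

Using the unfolded fingerprint matters here: folding to a fixed length would alias distinct environments onto the same bit and undercount the true diversity.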

The following shows the number of unique bits found as a function of the number of molecules processed. The molecules are ordered by PubChem id, though the set doesn't include every PubChem structure, since RDKit couldn't process some of them. (It complains about a number of bad valences.) A more complete analysis would randomize the structures to remove local coherence effects.

That graph is almost impossible to understand because the dynamic
range is so large. Log and log-log scales don't help either. The
best solution is to normalize by the maximum number for each
graph. That gives:

This sort of curve is a species discovery curve. It appears to show that the MorganFingerprint(radius=1) saturates at around 100,000 different bit values, and radius=2 might saturate at around 3.5 million unique values. This makes sense, as the radius=2 fingerprint corresponds to about 6-7 atoms, and I found 2 million unique values. (An average branching factor of 2.5 gives 2.5**2 or 6.25 atoms. However, a local branching factor of 3 gives 9 atoms, which adds more unique values to the fingerprint.)

I'll guess that there's under 40 million unique bits for radius=3 but it becomes harder to estimate. As the radius increases, the trend in the diversity of new values clearly gets closer to linear, which means there's less and less saturation. I can't predict the total number of unique values for radius=5 because it's still too flat.

The species discovery curve is often fit as A(1-exp(-bx)) or A·log(bx+1). The first has a fixed limit; the latter implies there is no upper bound. This case is somewhere in the middle: for good chemical reasons, there's a large but finite number of possible ways to arrange a fixed number of atoms. For the smallest fragment size (1 atom), we are at that limit. For larger sizes, we are nowhere near the chemical limit, and I think a log fit works best.

Obviously, I would need to randomize the input order to get a smoother curve before I could make that prediction. But the end of the holidays was a couple of days ago and I need to get back to paying work.

Andrew Dalke is an independent consultant focusing on software development for computational chemistry and biology. Need contract programming, help, or training? Contact me

Copyright © 2001-2013 Andrew Dalke Scientific AB