Unique fragments in PubChem

/home/writings/diary/archive/2011/12/25/unique_fragments_in_pubchem

For reasons I'll get into later, I wanted to get an idea of the subgraph distribution of PubChem. That is, using my method for molecular subgraph enumeration, I generate all subgraphs of up to 7 atoms and see how common each one is. More specifically, atom uniqueness depends only on the atomic element and aromaticity, as assigned by OEChem, and the unique bond categories are "single-or-aromatic", double, and triple.
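To make "all subgraphs of up to size N" concrete, here is a minimal sketch on a toy molecule. It is not the enumeration code used for this analysis. It assumes a molecule is given as a plain adjacency dictionary (a hypothetical representation, not an OEChem object), and it treats a subgraph as a connected set of atoms, which is a simplification. The function name `connected_atom_subsets` is made up for this illustration.

```python
def connected_atom_subsets(adjacency, max_atoms):
    """Return all connected atom subsets with at most max_atoms atoms.

    adjacency maps each atom index to the indices of its bonded neighbors.
    This is a brute-force sketch meant for small molecules only.
    """
    found = set()
    # Start from every single atom, then grow by one neighboring atom at a time.
    frontier = {frozenset([atom]) for atom in adjacency}
    size = 1
    while frontier and size <= max_atoms:
        found |= frontier
        grown = set()
        for subset in frontier:
            for atom in subset:
                for neighbor in adjacency[atom]:
                    if neighbor not in subset:
                        # frozenset | set gives a frozenset, so duplicates collapse.
                        grown.add(subset | {neighbor})
        frontier = grown
        size += 1
    return found


# Toy example: the heavy atoms of isobutane, with atom 0 bonded to atoms 1, 2, and 3.
isobutane = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}
subsets = connected_atom_subsets(isobutane, max_atoms=3)
print(len(subsets))
```

A real implementation would also turn each subset into a canonical string keyed on the atom and bond categories described above, so that the same fragment found in different molecules is counted once.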

Last month I downloaded 2,138 sdf.gz files from PubChem and did structure perception with OpenEye's OEChem. Starting a couple of weeks ago, I used my subgraph enumeration algorithm to process 1,724 of them. For some reason, it stopped at that point. Since it took 7.5 days to process those files, and the data set is already a bit ungainly, I decided to leave the full analysis for another time and not to figure out what happened with the processing.

The 1,724 files contain 21,570,907 PubChem records, and my enumeration found 1,925,185 unique substructures in them.

I kept track of the number of unique fragments per input file and the running total number of unique fragments over all of the files, plotted here:

You can see that 50% of the unique fragments are in the first 25% of the data files and essentially all are found in the first 50% of the files. (The number does increase after the 1000th file, but only very slowly.) It's also interesting to see the internal structural diversity in the different files. I suspect there are some large regions made from contributed combinatorial libraries.
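The bookkeeping behind that kind of plot can be sketched as follows. This is a minimal illustration, assuming each input file has already been reduced to a list of canonical fragment strings. The function name `unique_fragment_growth` and the toy data are hypothetical.

```python
def unique_fragment_growth(fragments_per_file):
    """Yield (unique_in_file, new_in_file, running_total) for each file, in order."""
    seen = set()
    for fragments in fragments_per_file:
        in_file = set(fragments)   # unique fragments within this one file
        new = in_file - seen       # fragments not seen in any earlier file
        seen |= new
        yield len(in_file), len(new), len(seen)


# Three tiny made-up "files" of fragment strings.
files = [["C", "CC", "C=O"], ["C", "cc", "CC"], ["C", "CC"]]
for row in unique_fragment_growth(files):
    print(row)
```

The running total levels off once later files stop contributing fragments that were not already seen, which is the saturation visible in the plot.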

The unique fragments which occur in the largest number of records are:

21387437 C
20195255 O
19959057 c
19892743 cc
19755355 ccc
19457485 cccc
19270867 CC
19015890 ccccc
18599872 cccccc
18488545 c1ccccc1
18386628 N
17672171 Cc
17324074 Ccc
17109361 CN
16985355 Cccc
16533358 C=O
16522121 Ccccc
15993406 Cc(c)c
15759069 Cc(c)cc
15508521 Cccccc

You shouldn't be surprised to see that non-aromatic carbon (C) is found in 21,387,437 of the 21,570,907 structures, or about 99.1%.
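Counts like these are per record, not per occurrence: a fragment that appears several times in one structure still counts once. A minimal sketch of that counting, assuming each record is available as a list of fragment strings (the function name `record_counts` and the data are made up for illustration):

```python
from collections import Counter


def record_counts(records):
    """Count, for each fragment, the number of records that contain it."""
    counts = Counter()
    for fragments in records:
        # set() ensures a fragment is counted at most once per record.
        counts.update(set(fragments))
    return counts


# Three made-up records; "C" appears twice in the first but counts once.
records = [["C", "CC", "C", "O"], ["C", "c", "cc"], ["O", "C"]]
counts = record_counts(records)
print(counts.most_common(2))
```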

I made a distribution plot of the fragments, where the horizontal axis is rank order (C, then O, c, cc, and so on). I show it at a few different scales in order to get a better understanding of the distribution. It's quite obviously *not* a Zipf distribution.

The vertical axis is the count in millions. You can see that the 10,000th most common substructure is in a very small percentage of the structures; it's actually 0.5%.
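One quick way to see the departure from Zipf is to compare the top 20 counts listed earlier with what a classical Zipf law would predict. The sketch below assumes the simplest form, count proportional to 1/rank with an exponent of 1, anchored at the rank-1 count; that choice of exponent and the helper name `zipf_prediction` are assumptions for illustration.

```python
# Record counts for the 20 most common fragments, in rank order (from the list above).
top_counts = [
    21387437, 20195255, 19959057, 19892743, 19755355,
    19457485, 19270867, 19015890, 18599872, 18488545,
    18386628, 17672171, 17324074, 17109361, 16985355,
    16533358, 16522121, 15993406, 15759069, 15508521,
]


def zipf_prediction(first_count, rank, exponent=1.0):
    """Count expected at a given rank if count were proportional to 1/rank**exponent."""
    return first_count / rank ** exponent


# Ratio of observed to Zipf-predicted count at each rank.
ratios = [
    observed / zipf_prediction(top_counts[0], rank)
    for rank, observed in enumerate(top_counts, start=1)
]

for rank, (observed, ratio) in enumerate(zip(top_counts, ratios), start=1):
    print(rank, observed, round(ratio, 2))
```

Under this form of Zipf's law the second-ranked fragment would have about half the count of the first. Here it has nearly the same count, and the ratio keeps growing with rank, so the head of the distribution is much flatter than Zipf predicts.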

At the other end of the list, 478,278 fragments (24.8%) exist only once (like C#NF), 251,372 fragments (13.1%) exist twice (like B#[Cr]), and 132,574 fragments (6.89%) exist thrice. Here are the first 20 values as a table,

1 478278  # In other words, 478,278 substructures exist only once in the data set
2 251372
3 132574
4 100665
5 67536
6 57500
7 42959
8 37983
9 31750
10 28684
11 24016
12 23169
13 18695
14 17659
15 15501
16 14717
17 13452
18 12500
19 11394
20 11276

and in graphical form.
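The percentages quoted above follow directly from the table and the total of 1,925,185 unique substructures. Here is a small sketch that recomputes them and adds a running share; the variable and function names are just for illustration.

```python
TOTAL_UNIQUE = 1925185  # unique substructures found in the 1,724 files

# occurrence count -> number of fragments with that count (first 20 rows of the table)
occurrence_histogram = {
    1: 478278, 2: 251372, 3: 132574, 4: 100665, 5: 67536,
    6: 57500, 7: 42959, 8: 37983, 9: 31750, 10: 28684,
    11: 24016, 12: 23169, 13: 18695, 14: 17659, 15: 15501,
    16: 14717, 17: 13452, 18: 12500, 19: 11394, 20: 11276,
}


def percent_of_unique(n_fragments):
    """Express a number of fragments as a percentage of all unique fragments."""
    return 100.0 * n_fragments / TOTAL_UNIQUE


cumulative = 0
for occurrences in sorted(occurrence_histogram):
    n_fragments = occurrence_histogram[occurrences]
    cumulative += n_fragments
    print(
        occurrences,
        n_fragments,
        round(percent_of_unique(n_fragments), 2),
        round(percent_of_unique(cumulative), 2),
    )
```

The last column is the share of unique fragments that occur at most that many times.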

Andrew Dalke is an independent consultant focusing on software development for computational chemistry and biology. Need contract programming, help, or training? Contact me