Dalke Scientific Software: More science. Less time. Products
[ previous | newer ]     /home/writings/diary/archive/2012/06/11/optimizing_substructure_keys

Optimizing substructure keys

Advertisement: This essay describes a research project I've slowly been working on during the last 18 months. I believe I can do very fast substructure searches of large compound data sets. I would like to make it more real, but that requires funding. Contact me if you want to collaborate and can pay for it.

There are two related goals in my quest for an inverted index library, which is unfortunate because what I want for one is not necessarily the best for the other. One goal is to see if I can do PubChem-scale substructure searches with "interactive" performance (under 0.1 seconds). I think I can do that with a lot of precomputation based on substructure enumeration and optimized key development. Based on the feedback from various people, I think this is best handled within an existing database, and most probably PostgreSQL.

The first goal requires developing new database substructure search methods. My other goal is to see if I can optimize substructure keys within the framework used in most existing substructure search programs.

The existing approaches use either a explicit substructure-based fingerprint like the MACCS or CACTVS/PubChem keys, or an exhaustive hash-based fingerprint like the Daylight fingerprints. The substructure-based ones are hand-optimized and small (under 1000 bits), but don't do well with unusual chemistry. The exhaustive ones can handle diverse chemistries, but are larger (typically 1024-4096 bits).

I want to see if I can use my subgraph enumeration approach to do target- and query-based key optimization. That is, given a set of structures in the database and a characteristic set of queries, I would like to develop short, dense fingerprints which can optimize those queries.

Problem: I lack a query set

I'm not there yet. The biggest problem is that I lack any sort of representative sample of the types of queries people do. People at pharmas don't want to contribute their data sets because it's hard to remove all of the proprietary queries from the dataset. By law, PubChem is not allowed to release this data.

Do you have a good (1,000+) query set that I can use for a benchmark? If yes, email me!

So until I get a good query set, I'll have to assume that the targets themselves are representative queries. This of course is incorrect. By definition, a substructure search will likely be smaller than what's in the database. Plus, if someone uses a sketch program to make the query, then they are going to build it up an atom or ring at a time, so the first dozen queries or so will likely be small and have lots of matches.

Optimizing the query keys

At its heart, a substructure query is a subgraph isomorphism test. This is slow. A substructure filter is a quick test which tries to reduce the number of required isomorphism tests. For example, if an "N" (nitrogen) exists in the query but does not exist in the target then there's no need to do the full isomorpism test.

However, if "N" does not exist in the query then it might still match a target molecule which has an "N" in it. For example, the query "O" should still match the target "N#[N+][O-]" (nitrous oxide).

So if the query feature is "1" then the target feature must be a "1", otherwise the isomorphism test is required. This is a simple decision tree; all compounds with featurei take one branch, and all which don't take the other. Since I'm assuming that the queries are sampled from the targets, then what I want is to end up with the set of features which minimize the size of the number of nodes at the end of the decision tree.

Last year I tried a genetic algorithm to find this, but it didn't converge well, and the results weren't pleasing. For example, if I decide that I want to add 32 bits to a fingerprint, then I have to redo the search from scratch and I end up with a completely different set of fingerprints.

A greedy optimization algorithm

I came up with a greedy optimization instead. The base of the decision tree has all of the records in a single node. I find the "most divisive" feature, defined as the feature which best separates the tree into two branches, with about half of the records in one branch and half in the other. Each iteration finds the features which most evenly tries to split all of the nodes. I use the scoring function:

def score(current_leaves, feature):
    score = 0
    for node in current_leaves:
        num_on = len(node & feature)
        num_off = len(node) - num_on
        score += (num_on-num_off)**2
    return score
This does a lot of set intersection tests; about 30 million per iteration in my benchmark. Actually, since all I care about is the number of elements in the intersection, and not the set itself, I wish that Python had a "set.intersection_size()" method.

I use the scoring function to find the most divisive feature. This is the one which generates the smallest score. I use that feature to split all of the current leaves into "haves" and "have nots." As an optimization, I've decided that if there are fewer than 10 elements in a node then I don't care about further splitting that node. That's small enough that I can easily do a subgraph isomorphism test.

There are a few other optimizations. I ignore duplicate fingerprints, and I ignore features with fewer than 10 records in them. Set intersections are expensive, so I also do a short-circuit evaluation while computing the scoring function and abort the calculation if the current score is ever worse than the current best score. I also sort the nodes so the ones with the largest size go first. This increases chance of early short-circuiting.

I do some other optimizations, but you'll need to see the code for all the details. Yesterday I found that pypy 1.9's set intersection was about twice as fast as the one in cpython, so that's what I used to run the code.

Preliminary results

I used the same benchmark set (37 MB 7zip compressed) from yesterday's essay, which is a small subset of my full data set. From previous analysis, I know it has high local fingerprint diversity, so I figure it's a good enough start. Here's the subset of output containing the feature keys and the amount of time it took to find each one:

0 Cccc 5.5 seconds
1 CCO 3.6 seconds
2 CN 3.8 seconds
3 C=O 3.8 seconds
4 CCCC 5.1 seconds
5 cN 4.7 seconds
6 Cl 9.0 seconds
7 CC(C)C 10.2 seconds
8 n 8.2 seconds
9 C=C 10.5 seconds
10 cO 10.1 seconds
11 S 11.3 seconds
12 COC 11.7 seconds
13 cc(c)c 24.3 seconds
14 CC(C)O 24.1 seconds
15 O 24.5 seconds
16 CN(C)C 27.4 seconds
17 C1CCCCC1 41.4 seconds
18 CC 36.8 seconds
19 CCC(=O)O 44.1 seconds
20 F 55.2 seconds
21 CCCCCC 52.1 seconds
22 Br 48.2 seconds
23 CCC 52.3 seconds
24 C 51.1 seconds
25 C(CO)O 53.5 seconds
26 CccC 64.2 seconds
27 CCNCC 66.1 seconds
28 P 60.2 seconds
29 CCC(C)C=O 59.6 seconds
30 c1ccccc1 68.4 seconds
31 N 74.9 seconds
32 C1CCCC1 80.9 seconds
33 cCO 85.1 seconds
34 NO 96.5 seconds
35 C(CCO)CO 97.8 seconds
36 C(CN)N 96.4 seconds
37 cc(c)cc(c)c 99.7 seconds
38 CC(C)(C)C 99.2 seconds
39 I 102.6 seconds
40 CcccC 208.8 seconds
41 CCCCCCC 100.7 seconds
42 cc(c)nc 98.0 seconds
43 [Si] 93.7 seconds
44 CNC 95.3 seconds
45 CC(=O)C 94.5 seconds
46 c(cO)O 91.7 seconds
47 CCCCC 89.6 seconds
48 cCc 90.5 seconds
49 CC(C)(C)O 82.6 seconds
50 [Na] 82.6 seconds
51 o 84.3 seconds
52 [As] 85.6 seconds
53 NN 84.6 seconds
54 C(O)O 81.6 seconds
55 c(n)n 83.3 seconds
56 [Se] 80.4 seconds
57 CC=CC 80.1 seconds
58 COCCCOC 79.6 seconds
59 c(cCl)cCl 78.1 seconds
60 C=N 105.0 seconds
61 C#C 59.6 seconds
62 [Cr] 58.5 seconds
63 c(ccN)cN 58.1 seconds
64 CccccC 58.0 seconds
65 CC(C)C(C)C 54.9 seconds
66 ccc 52.7 seconds
67 B 53.3 seconds
68 CCCOCCC 55.1 seconds
69 CccO 51.0 seconds
70 [K] 48.7 seconds
71 CCN(CC)CC 48.6 seconds
72 CCCC=C(C)C 47.4 seconds
73 [Te] 47.8 seconds
74 cc=O 47.4 seconds
75 [Ti] 47.4 seconds
76 O=S(=O)O 47.7 seconds
77 CCC(CC)CO 46.7 seconds
78 [Fe] 46.8 seconds
79 c1ccncc1 46.7 seconds
80 CCOCCOC 52.5 seconds
81 c(ccN)ccN 35.2 seconds
82 [Al] 34.5 seconds
83 CC(C)N 34.2 seconds
84 [Sn] 32.4 seconds
85 c(cCl)Cl 32.4 seconds
86 CSC 31.9 seconds
87 C=CC=C 30.9 seconds
88 c(ccO)ccO 30.3 seconds
89 [Mo] 30.0 seconds
90 s 30.0 seconds
91 C(Cl)Cl 29.3 seconds
92 [Ca] 28.7 seconds
93 CccN 28.3 seconds
94 C(CO)CO 27.3 seconds
95 [W] 26.7 seconds
96 O(P)P 26.8 seconds
97 [Cu] 27.0 seconds
98 CCC(=O)CC 27.5 seconds
99 CCCNCCC 26.8 seconds
100 CO 28.0 seconds
101 [Pb] 19.6 seconds
102 CCOC(=O)C 19.6 seconds
103 [Zn] 18.8 seconds
104 N=N 18.6 seconds
105 [Mg] 18.2 seconds
106 CC(C)CC(C)C 18.0 seconds
107 [Hg] 17.7 seconds
108 C=CCCO 17.2 seconds
109 [V] 17.1 seconds
110 [H] 17.0 seconds
111 cc(c)c(c)c 17.1 seconds
112 c(ccO)cO 16.8 seconds
113 [Zr] 16.5 seconds
114 CC(=O)N 16.5 seconds
115 c(ccCl)cCl 15.9 seconds
116 cc(c)cN 15.8 seconds
117 [Mn] 15.6 seconds
118 COP(=O)OC 15.5 seconds
119 Ccnc 15.4 seconds
120 [Ni] 15.7 seconds
121 CCCCCC=O 12.4 seconds
122 C1CC1 12.1 seconds
123 cOc 11.9 seconds
124 [Co] 11.7 seconds
125 c1cccc1 11.7 seconds
126 [Sb] 11.5 seconds
127 [Pt] 11.5 seconds
128 C(N)N 11.4 seconds
129 O=S 11.0 seconds
130 [Nb] 10.8 seconds
131 C(CCO)CCO 10.8 seconds
132 [U] 10.6 seconds
133 c(cBr)cBr 10.7 seconds
134 CCc 10.6 seconds
135 [Ba] 10.2 seconds
136 CC(C)OC(C)C 10.0 seconds
137 C[Si](C)C 10.1 seconds
138 ccNcc 9.8 seconds
139 [Ru] 9.5 seconds
140 C=CCC=C 9.7 seconds
141 CN(C)(C)C 7.5 seconds
142 SS 7.3 seconds
143 C(F)(F)F 7.2 seconds
144 [Re] 7.0 seconds
145 ccnccn 6.9 seconds
146 CC(C)(C)CCO 6.8 seconds
147 [Cd] 6.7 seconds
148 CCC(C)CC 6.7 seconds
149 CCSCC 6.5 seconds
150 [Ge] 6.5 seconds
151 Cc(c)c(c)c 6.4 seconds
152 [Sr] 6.3 seconds
153 C(COCCO)O 6.2 seconds
154 Ccc(C)cC 6.0 seconds
155 c(c(cCl)Cl)Cl 6.0 seconds
156 OO 6.0 seconds
157 ccnc=O 5.9 seconds
158 C(CCl)Cl 5.8 seconds
159 [Pd] 5.7 seconds
160 c(cO)N 5.8 seconds
161 [Ag] 4.7 seconds
162 C1CCCCCC1 4.7 seconds
163 [Ta] 4.6 seconds
164 [Ce] 4.6 seconds
165 C(CC=O)C=O 4.5 seconds
166 cccCl 4.4 seconds
167 CCC(CC)O 4.3 seconds
168 [Gd] 4.2 seconds
169 [Li] 4.2 seconds
170 cc(c)c(c)O 4.2 seconds
171 C=CCO 4.1 seconds
172 CP 4.1 seconds
173 C#N 4.0 seconds
174 COcc(c)OC 3.9 seconds
175 [Bi] 3.8 seconds
176 c(cN)cS 3.8 seconds
177 [Y] 3.8 seconds
178 CC(C)(C)C=O 3.7 seconds
179 c(ccBr)cBr 3.7 seconds
180 [Yb] 3.8 seconds
181 [In] 3.1 seconds
182 CC(C)CO 3.1 seconds
183 Cc(c)cccO 3.0 seconds
184 COCC(O)O 3.0 seconds
185 [La] 2.9 seconds
186 c(cN)N 2.9 seconds
187 [Eu] 2.9 seconds
188 CCC(CC)CC 2.8 seconds
189 [Ga] 2.8 seconds
190 cCCc 2.8 seconds
191 COC(CO)CO 2.7 seconds
192 [Th] 2.7 seconds
193 C=C(C)C 2.7 seconds
194 c(cF)cF 2.6 seconds
195 C[Si]C 2.6 seconds
196 CncccnC 2.6 seconds
197 cc(c)nc(c)c 2.6 seconds
198 C(CNCCN)N 2.5 seconds
199 [Tl] 2.5 seconds
200 C=C(C)C(C)(C)C 2.5 seconds
201 [Ho] 2.2 seconds
202 [Au] 2.2 seconds
203 CC(C)S 2.2 seconds
204 [Nd] 2.1 seconds
205 cc(cCl)c(c)Cl 2.1 seconds
206 CCC(CC)OC 2.1 seconds
207 c(ccS)ccS 2.1 seconds
208 C(=O)OC=O 2.0 seconds
209 [Be] 2.0 seconds
210 CC(C)c 2.0 seconds
211 [Tb] 2.0 seconds
212 COc 1.9 seconds
213 CCCCO 1.9 seconds
214 c(cBr)Br 1.9 seconds
215 CcccccC 1.8 seconds
216 C(COP)COP 1.8 seconds
217 C(CBr)Br 1.8 seconds
218 [Rh] 1.7 seconds
219 CCCCNC=O 1.7 seconds
220 C(CO)N 1.7 seconds
221 O[Si]O[Si]O 1.5 seconds
222 COcccOC 1.5 seconds
223 CC(C)CC=O 1.4 seconds
224 Cn(c)c(c)c 1.4 seconds
225 [Sm] 1.4 seconds
226 CCC(CO)CO 1.4 seconds
227 C(Cl)(Cl)Cl 1.4 seconds
228 CS 1.3 seconds
229 [Cs] 1.3 seconds
230 cc(c)c(cCl)Cl 1.3 seconds
231 CC1CCC(C)C1 1.3 seconds
232 cc(c)c(c)F 1.2 seconds
233 C(S)S 1.2 seconds
234 cnn 1.2 seconds
235 O=P(O)O 1.1 seconds
236 CCOCOCC 1.1 seconds
237 O=[As]O 1.1 seconds
238 CccccN 1.1 seconds
239 [Er] 1.0 seconds
240 C1CCNCC1 1.0 seconds
241 C(=O)CO 0.7 seconds
242 CC(=C(C)C)C 0.6 seconds
243 [Tm] 0.6 seconds
244 cc1ccncc1 0.6 seconds
245 N=O 0.6 seconds
246 C[Si](C)(C)[Si] 0.6 seconds
247 O=POPOP=O 0.6 seconds
248 CCCOC(C)C 0.6 seconds
249 Ccc(c(C)c)O 0.6 seconds
250 C(CC=O)CC=O 0.6 seconds
251 c(cN=O)cN=O 0.6 seconds
252 c(c(cBr)Br)Br 0.6 seconds
253 C(=CCO)CO 0.6 seconds
254 CCCCCCN 0.5 seconds
255 [Dy] 0.5 seconds
256 c(cCl)OccCl 0.5 seconds
257 CC(C)(F)F 0.5 seconds
258 c(cI)cI 0.5 seconds
259 Cc(c)ccCO 0.5 seconds
260 CCCCC(=O)O 0.5 seconds
261 [Ir] 0.4 seconds
262 c(ccCl)cN 0.4 seconds
263 cc(c)cccs 0.4 seconds
264 cc(c)oc(c)c 0.4 seconds
265 cc(c)c(c)Cl 0.4 seconds
266 COCCC(O)O 0.4 seconds
267 [Pu] 0.4 seconds
268 CCCCOC=O 0.4 seconds
269 cc(c)c(c)S 0.3 seconds
270 c(cn)cncN 0.3 seconds
271 C=CCCC=C 0.3 seconds
272 [Rb] 0.3 seconds
273 CNCCCNC 0.3 seconds
274 C1CC2CCC1C2 0.3 seconds
275 C(=O)ccC(=O)O 0.3 seconds
276 CCCSCCC 0.3 seconds
277 C(C=O)C=O 0.3 seconds
278 CCcccCC 0.3 seconds
279 p 0.3 seconds
280 COc(c)cc=O 0.3 seconds
281 cc1cc(c)sc1 0.2 seconds
282 [Pr] 0.2 seconds
283 CNCO 0.2 seconds
284 CCCC(C)C 0.2 seconds
285 CC[Sn](CC)CC 0.2 seconds
286 C(=O)NC=O 0.2 seconds
287 CC(cc(c)c)O 0.2 seconds
288 c(ccSN)cN 0.2 seconds
289 Ccccc(c)c 0.2 seconds
290 C(C=O)CO 0.2 seconds
291 CCP(CC)CC 0.2 seconds
292 [Hf] 0.2 seconds
293 CN(C)CccO 0.1 seconds
294 O([Si])[Si]O[Si]O[Si] 0.1 seconds
295 CCCCCOC 0.1 seconds
296 CCcccOC 0.1 seconds
297 C1CCC1 0.1 seconds
298 CC(C(C)(C)C)O 0.1 seconds
299 cc1cncc1c 0.1 seconds
300 c1cccccc1 0.1 seconds
301 CCCNcc=O 0.1 seconds
302 CSccc(c)c 0.1 seconds
303 [Os] 0.1 seconds
304 CC(C[Se]C)N 0.1 seconds
305 CCC(CC)Cl 0.1 seconds
306 CCC=CCC 0.1 seconds
307 C(=O)C=O 0.1 seconds
308 CCccCC 0.1 seconds
309 c(cc=O)c=O 0.1 seconds
310 CC(c)c 0.1 seconds
311 CCCOC1CC1 0.1 seconds
312 O=Cl(=O)O 0.1 seconds
313 c(cN(=O)O)cO 0.1 seconds
314 C(CCl)O 0.1 seconds
315 C(C(CO)(O)O)O 0.1 seconds
316 CCC(CC)Br 0.1 seconds
317 cc(cBr)cBr 0.1 seconds
318 C(ccccO)O 0.1 seconds
319 CCCCCCO 0.1 seconds
320 CCCN(C)P 0.1 seconds
321 COCcc(c)c 0.0 seconds
322 C[Si](C)([Si])[Si] 0.0 seconds
323 Cc(c)cc(c)S 0.0 seconds
324 C=C(C)CC(C)c 0.0 seconds
325 CBr 0.0 seconds
326 C(CC(Cl)Cl)CCl 0.0 seconds
327 C(S(=O)=O)S(=O)=O 0.0 seconds
328 C1=CCCCC1 0.0 seconds
329 C1CNCCN1 0.0 seconds
330 C=Cc(c)c(c)c 0.0 seconds
331 CC(c)N(C)Cc 0.0 seconds
332 CCC(=O)OCC 0.0 seconds
333 CcncccF 0.0 seconds
334 O(P)POPOP 0.0 seconds
335 c(c(cF)F)N 0.0 seconds
336 cc(c)cccI 0.0 seconds
337 C(C=O)COC=O 0.0 seconds
338 CCC(CC)N 0.0 seconds
339 CCCC(=O)S 0.0 seconds
340 CC[Pb](CC)CC 0.0 seconds
341 COP(OC)OC 0.0 seconds
342 C[Si](C)(C)C[Si] 0.0 seconds
343 Ccc(C=O)cC 0.0 seconds
344 cc1ccocc1 0.0 seconds
345 ccscc(c)c 0.0 seconds
346 CC(C)(C)COC 0.0 seconds
347 cc(c)cccN 0.0 seconds
348 cc(c)cccO 0.0 seconds
349 C(CO)COC=O 0.0 seconds
350 COc(c)cOc 0.0 seconds
351 [as] 0.0 seconds
352 c(ccCl)ccCl 0.0 seconds
This says that 353 bits are all that's needed in order to put 202,632 unique fingerprints into bins with at most about 10 unique fingerprints in a given bin.

I think that's quite amazing given that most people use fingerprints which are at least twice and usually at least four times that length.

Also, it took 1.75 hours to generate this, and most of the time is spent doing len(set1 & set).

Is it useful?

I really wish I had a good query set to use for tuning - and another dataset to use for validation. I don't, but Greg Landrum made a synthetic one for testing the RDKit/PostgreSQL cartridge. Last December we worked on optimizing the fingerprint filters. I used an earlier version of this program to generate 54 query bits, which he fused with his hash-based fingerprint. Greg reported:

Tacking your most recent 54 bits, interpreted as SMILES, onto the end
of a shorter version of my new fingerprint gives:
[06:10:40] INFO: FINISHED 50001 (41150823 total, 1325699 searched, 1118055 found) in 80.30
[06:10:40] INFO:   screenout: 0.03, accuracy: 0.84

The original performance for the shorter fingerprint was:
[05:04:23] INFO: FINISHED 50001 (41150823 total, 1693315 searched, 1118055 found) in 90.90
[05:04:23] INFO:   screenout: 0.04, accuracy: 0.66

so those bits definitely help with the screenout.

Putting the bits at the beginning of the fingerprint gets a bit more speed:
[06:14:48] INFO: FINISHED 50001 (41150823 total, 1325699 searched, 1118055 found) in 71.09
[06:14:48] INFO:   screenout: 0.03, accuracy: 0.84
I think a 20% speedup is nothing to sneeze at.

I would still like an inverted index library

The greedy algorithm does a lot of set intersections, partitions sets into subsets, and removes old sets. I feel like this is not what databases are designed for, That's why I'm looking for an inverted index library which lets me manage boolean set operations myself. Then again I'm not a database person and I don't know if my belief is true.

I ran my greedy feature selection algorithm until it ran out of data, at only 353 bits. I want to have 1024 substructure bits. This means I need to feed my algorithm a lot more data. That's fine - my test set is only 1/200th of the data I have. However, my implementation running the test set takes up 2 GB of memory on my 12 GB desktop machine, and took 6300 seconds (1h45m) to run.

My concern is that a distributed memory implementation of my algorithm is more complex to write than a shared memory one. The Amazon high-memory instances have 68 GB of RAM. I estimate I need 400 GB with Python sets, and under 40 GB with a more compact bitset implemented in C. This seems eminently doable, which is why I'm looking to see if someone has already done the work.

Based on my searching, and the lack of pointers by others, the answer is "no."


Andrew Dalke is an independent consultant focusing on software development for computational chemistry and biology. Need contract programming, help, or training? Contact me



Copyright © 2001-2013 Andrew Dalke Scientific AB