Dalke Scientific Software: More science. Less time. Products
[ previous | newer ]     /home/writings/diary/archive/2012/08/21/structure_query_collection

The Structure Query Collection

Finally! Long time readers might recall my long-term goal of improving small molecule search performance. There's my chemfp project for high-performance Tanimoto searches, and I have ideas on how to optimize substructure keys. But they suffer from a lack of a good performance benchmark.

Well, I still don't have a benchmark, since that requires targets. What I do have are several sets of structure queries, in a form that you can easily use to make your own benchmarks. I call it the Structure Query Collection, or SQC for short.

So you want to know if search is faster

There are dozens, if not hundreds of papers on ways to improve search performance. Let's suppose you have a faster way to do a substructure search. (By this I mean that that someone sketches a compound, which your search program interprets as a substructure query.) To show improvement, you need a baseline. The easiest is to select a random set of compounds from the database itself, to use a queries. A problem is that all of the queries will have at least one match - the query compound itself. Real-world queries will sometimes not find matches. Perhaps you can partition your data so that some of the compounds aren't in the and can be used as queries. What's a good ratio for that?

There's even a more fundamental problem: I think people who do a substructure search tend to use smaller structures than what's in the database. If the benchmark queries and targets come from the same set of compounds, then the queries are probably not characteristic of how the dataset will be used. This is only a suspicion: am I right?

The search for real-world data

The only way to find out is to look at real-world query sets. I know people at pharmas who of course have logs from internal searches, but it's considered confidential information since it may contain proprietary compound structure data. Even if the dataset was filtered so that only the public structures are released, there still may be enough for a competitor to get an idea of the types of structures the company is interested in. So that's a no.

PubChem by law can't reveal any information about the user-submited queries. I contacted ChemSpider, but at the time they didn't have enough queries to be worthwhile. In general, companies which provide search don't want to reveal their queries since they get paid by people who want that information to be secret. And so on.

(BTW, if you have a chemical query data set that you can make public, email me and let me know so that I can add it to the collection.)

BindingDB queries

But all is not lost. Last year I talked with Michael Gilson of BindingDB. They have no promises or expectation of privacy of user-submitted data, and he had no problem with getting me that information. More specifically, he asked Tiqing Liu to do the data extraction for me.

That meant some munging of the log files, and a bit of back and forth while I figured out what Tiqing did and he could figure out what information I wanted. The result was a set of SMILES, sketched in Marvin, for each of three classes of search: 3.3 million for exact search, 550,000 for similarity search, and 14,000 for substructure search.

(While I first asked for this data last year, this topic is very much a background task and I didn't have the time to follow up on it until a couple of weeks ago.)

SMARTS queries

That covers searches where the input is a sketched structure. What about SMARTS searches? Rather fewer people sketch or write SMARTS queries directly. Instead, SMARTS are more often part of the automated workflow for feature counts and structure alerts. While I would love to get my hand on real-world user-created SMARTS query sets, it's also useful to have workflow query sets too. Luckily, I know of a few.

A few weeks ago, Ehrlich and Rarey published Systematic benchmark of substructure search in molecular graphs - From Ullmann to VF2. This has a set of SMARTS patterns extracted from various sources. They used it to benchmark two different subgraph isomorphism implementations. The SMARTS patterns are available in the supplementary information, although it took me a while to figure out how to get the patterns.

Greg Landrum had developed a similar benchmark for RDKit, but it was somewhat hidden away in its source distribution and not many people knew about it.

For a classical touch, RDKit defines 162 of the 166 MACCS keys as SMARTS patterns. The MACCS keys are widely used as a cheminformatics fingerprint, so this could be used to benchmark fingerprint generation. SQC actually includes the tweaked version of these SMARTS which I put into chemfp. I changed a few of the SMARTS to be more cross-toolkit friendly.

The "Structure Query Collection"

I've collected all of these together into one spot, called the Structure Query Collection. Not a very creative name, I know, but it is descriptive. There are currently the three BindingDB structure sets and the three SMARTS query sets.

Each set contains the raw data or relevant subset of the data as I got it from the respective sources, as well as source details. The raw data needs processing - at the very least to put the SMILES or SMARTS into a easily used form - so I also included the details, including software, for how I extracted and cleaned up the data so it can be used in a benchmark.

First exploration: substructure query sizes

Earlier I conjectured that structures used as a substructure query will tend to be smaller than the molecules in the target set. Now I have the data to test that hypothesis.

I'll use my OpenSMILES-ragel project, which has a command-line program to count the number of atoms and bonds in a SMILES string. It does very little SMILES processing, and mostly assumes that the SMILES are correct beyond the basic syntax level.

Here I get the number of atoms in each substructure query:

% smiles_counts < sqc/BindingDB_substructure/BindingDB_substructure.dat  | head
atoms: 20 bonds: 22
atoms: 21 bonds: 23
atoms: 24 bonds: 27
atoms: 17 bonds: 19
atoms: 21 bonds: 24
atoms: 23 bonds: 26
atoms: 23 bonds: 26
atoms: 20 bonds: 23
atoms: 10 bonds: 11
atoms: 16 bonds: 18
Or to get the average and standard deviation:
% smiles_counts < sqc/BindingDB_substructure/BindingDB_substructure.dat | 
   awk '{sum+=$2; sumsq+=$2*$2} END {print sum/NR, "+/-", sqrt(sumsq/NR - (sum/NR)**2)}'
17.8607 +/- 11.1145

For representative targets, I'll use the same ZINC subset from Ehrlich and Rarey's supplemental file "supp3.zip", in datasets/zinc_lead-like_2011-02-12_first100k.smi. This contains some SMILES which the OpenSMILES grammar doesn't accept (it uses an odd SMILES dialect I've not seen before), but I can filter out the failures by removing lines where the number of atoms is "-1".

% SMILESFILE=~/databases/Ehrlich_and_Rarey_VF2_Ullmann/datasets/zinc_lead-like_2011-02-12_first100k.smi
% smiles_counts < $SMILESFILE | head
atoms: 20 bonds: 21
atoms: 17 bonds: 17
atoms: 19 bonds: 20
atoms: -1 bonds: -1
atoms: 22 bonds: 23
atoms: -1 bonds: -1
atoms: 17 bonds: 17
atoms: 14 bonds: 16
atoms: 21 bonds: 23
atoms: 22 bonds: 24
% smiles_counts < $SMILESFILE | fgrep -v -- -1 |
   awk '{sum+=$2; sumsq+=$2*$2} END {print sum/NR, "+/-", sqrt(sumsq/NR - (sum/NR)**2)}'
20.238 +/- 2.92336
Since I wasn't sure if there was a bias in the '-1' reports, I double-checked this by changing the failure cases to something which was no longer a valid SMILES but where smiles_counts would still report the same number of atoms. First, I check that the transformation works, and then I redo the count:
% perl -pe 's/\)(\d|%\d\d)+/)/g' $SMILESFILE | fgrep -- -1
% perl -pe 's/\)(\d|%\d\d)+/)/g' $SMILESFILE | smiles_counts |
   awk '{sum+=$2; sumsq+=$2*$2} END {print sum/NR, "+/-", sqrt(sumsq/NR - (sum/NR)**2)}'
20.4499 +/- 2.90956

So, BindingDB substructure queries average 18 atoms while the ZINC targets average about 20 atoms. This agrees with my hypothesis, but I would need to get a copy of BindingDB for better assurance.

Andrew Dalke is an independent consultant focusing on software development for computational chemistry and biology. Need contract programming, help, or training? Contact me

Copyright © 2001-2013 Andrew Dalke Scientific AB