Dalke Scientific Software: More science. Less time.

Writings from the software side of bioinformatics and chemical informatics, with a heaping of Python thrown in for good measure. Code to taste. Best served at room temperature.

Varkony Reconsidered #

This is an expanded version of the talk "Varkony Reconsidered: Subgraph enumeration and the Multiple MCS problem", which I gave at the 6th Joint Sheffield Conference on Chemoinformatics on 23 July 2013. (In the programme I called it "Varkony Revisited", but I later decided that "Reconsidered" was a more appropriate term.)

The major goal of the presentation was to show how the 1979 paper of Varkony, which is part of the MCS literature, is better seen as an early paper in frequent subgraph mining. My own MCS algorithm, fmcs, is a variation of that approach, and is best seen as an intellectual descendant of an algorithm by Takahashi. Both the Varkony and Takahashi papers are relatively uncited in the MCS literature, so I spend some time explaining how they fit into the larger context.

The slide number references refer to the accompanying PDF slides. See the slides for full citation references and images.

Early MCS history

[slide 1]
Today I'm going to talk about the maximum common substructure problem.

[slide 2]
Here's an example of the MCS problem in its simplest form. You have two molecules and you want to find the largest connected substructure which is common to both structures. I won't be talking about disconnected substructures.

[slide 3]
Talking about the MCS problem here is a bit like bringing coal to Newcastle. It seems that half of the significant papers in the field came from this university. One of the earliest ones is this 1967 paper by Armitage et al.

[slide 4]
Armitage proposed using the MCS as a way to identify a reaction mapping, that is, to identify the conserved subgraph between the product and the reactant.

[slide 5]
If you read the paper you'll see that it also talks about using the MCS as a measure of chemical similarity, that is, a larger MCS should indicate that the two molecules are more similar than a smaller one would.

[slide 6]
This early effort wasn't successful. It only handled acyclic molecules, in part because of the very limited computer hardware they had. They report that their Ferranti Mercury computer had 480 words. I researched the computer, and it looks like it was built around 1960, and had 10-bit words. It was the first Ferranti to use core memory instead of Williams tube storage.

Multiple structure MCS and Varkony

[slide 7]
I'm actually not as much interested in the pairwise MCS as I am in the multiple structure MCS, where there are two or more structures.

[slide 8]
The 1970s and 1980s saw a lot of work on the MCS problem, and the problem space was pretty well developed. You can see that in this 1979 paper by Varkony et al.

[slide 9]
In it they motivate the reason for working on the multiple structure MCS: "molecules displaying similar activities do so because they share some similar feature or features." Given a set of compounds, the frequent presumption is that the MCS contributes to the activity.

[slide 10]
By the late 1970s we knew the main ways to compare the atoms and bonds in the molecular graphs. For example, you may want carbons to match only carbons, but have fluorines match any other halogen. In the general case, the chemist can define atom types so that atoms match only other atoms of the same type.

Bond comparisons have similar flexibility. You might require bonds to match other bonds with the same type, or to any bond type. You can also look at ring membership, so that ring bonds only match ring bonds and not chain bonds.

The Varkony paper mentioned all of those types of topology comparisons, and they are still present in nearly all modern MCS tools, though of course most add additional refinements.

[slide 11]
This talk is titled "Varkony Reconsidered" because I think it sets the early groundwork for an approach that I and a few others have taken in working on the MCS problem. In order to set the stage, I think it's important to describe the Varkony algorithm.

It starts by assigning atom and bond labels to all of the structures and ordering the structures in a way that should improve the overall performance.

For each structure, get the list of subgraphs with just two atoms and one bond. These are called the nuclei, to "convey the sense of growth ... in subsequent steps." Canonicalize the nuclei, which is easy in this case because there are only two ways to represent a subgraph with two atoms.

Now that you have the SMILES string ... sorry, I'm a modern cheminformatics researcher so I think of these as canonical SMILES strings for the subgraphs even though SMILES wasn't published for almost another decade ... now that you have the canonical subgraphs, check to see which ones are present in every structure. If a bond isn't in all of the structures then mark it as "do not participate", so it won't be used for future nuclei growth.

For the nuclei which are present, go back to the original graph and grow the original subgraphs by one atom, in every possible way. This will give you all subgraphs of size 3 atoms, where each of those subgraphs contains a subgraph of size 2 atoms that is present in all of the structures.

Canonicalize those subgraphs of size 3 and only use those subgraphs which are common across all of the molecules. Keep repeating this process until no more common subgraphs can be found.

This finds all of the common subgraphs, so the MCS is (or are) the final common subgraphs in the process.
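
To make the flow concrete, here is a small sketch of that level-wise search in Python, using RDKit's canonical fragment SMILES in place of Varkony's own canonicalizer. It's an illustration of the idea, not their code, and it leaves out the "do not participate" bookkeeping and all performance tricks.

    from rdkit import Chem

    def subgraph_atoms(mol, bond_ids):
        atoms = set()
        for b in bond_ids:
            bond = mol.GetBondWithIdx(b)
            atoms.update((bond.GetBeginAtomIdx(), bond.GetEndAtomIdx()))
        return atoms

    def frag_key(mol, bond_ids):
        # Canonical SMILES of the subgraph induced by the given bonds.
        return Chem.MolFragmentToSmiles(
            mol, atomsToUse=sorted(subgraph_atoms(mol, bond_ids)),
            bondsToUse=sorted(bond_ids), canonical=True)

    def grow_by_one_bond(mol, bond_ids):
        # All ways to extend the subgraph by a single adjacent bond.
        children = set()
        for a in subgraph_atoms(mol, bond_ids):
            for bond in mol.GetAtomWithIdx(a).GetBonds():
                if bond.GetIdx() not in bond_ids:
                    children.add(bond_ids | {bond.GetIdx()})
        return children

    def common_subgraphs(mols):
        # Level 1: every single bond is a "nucleus".
        per_mol = [{frozenset([b.GetIdx()]) for b in mol.GetBonds()} for mol in mols]
        while True:
            keys = [{frag_key(mol, bonds) for bonds in subgraphs}
                    for mol, subgraphs in zip(mols, per_mol)]
            common = set.intersection(*keys)
            if not common:
                return
            yield common  # canonical SMILES common to every structure at this size
            # Keep only the common subgraphs, then grow each of them by one bond.
            per_mol = [{child for bonds in subgraphs if frag_key(mol, bonds) in common
                        for child in grow_by_one_bond(mol, bonds)}
                       for mol, subgraphs in zip(mols, per_mol)]

The last set this generator yields contains the MCS (or MCSes).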

[slide 12]
It's pretty easy to see that this method will work, but that it might not be so fast. Still, using 1970s hardware they were able to find the MCS of three marine sterols in between 45 seconds and several minutes, depending on how it was parameterized.

As a side note, it's interesting to read these papers to see how the technology has changed over time, and to see how the implementation language affects the algorithm and timings. In this case they used two different programming languages. Canonicalization was done in Algol while the rest of the code was in the much slower INTERLISP. I interpret this as the authors saying that their algorithm is good, and asking us to not take the timings at face value because the implementation gets in the way.

The computer was a 1972-era PDP-10 hosted by the SUMEX facility at Stanford. The SUMEX (Stanford University Medical Experimental Computer) project promoted research in applying artificial intelligence to bio-medical problems, and made their hardware available to others over the network.

Some other MCS approaches

McGregor

[slide 13]
Just a few years later McGregor published a paper on how to apply backtracking to the pairwise MCS problem. Backtracking was not a new programming concept, but in my reading of the MCS literature the McGregor paper was the one which specifically promoted it as a technique for the MCS problem, and it influenced a lot of the subsequent work.

Basically, start by mapping one edge from the first structure to the second. Extend the edge in the first structure and look for a corresponding extension in the second. There can be multiple possibilities, so pick one, but keep track of the other possibilities. Keep repeating the process until it's not possible to grow any further.

This gives a maximal common substructure. "Maximal" here means that there's no larger common subgraph which includes that maximal common substructure, but there might be another maximal substructure which is even larger. A maximum common substructure is the largest maximal substructure.

[Note: The "maximal" terminology dates to at least Cone et al. 1977.]

At this point, the backtracking algorithm kicks in. It backs up and tries an alternative possibility. It keeps going forward and backtracking until it's done.

Backtracking can take a lot of time because the algorithm will find all of the ways to map any subgraph from the first structure to the second.

[slide 14]
However, if the current best maximal common substructure is of, say, size 10 atoms, then you know that any larger MCS must have at least 11 atoms. If at any point in the backtracking you can prove that there's no way to match 11 atoms then there's no reason to continue searching that part of the search tree.

Pruning makes backtracking effective.

[Note: pruning was not first introduced in the McGregor paper. Cone et al. in 1977 mention pruning as part of DENDRAL, but most papers reference McGregor as the source for the general backtrack and pruning for the MCS search in cheminformatics.]

[slide 15]
I was able to find some information about how the McGregor algorithm was first implemented. It was written in ALGOL68 for the University of Sheffield's ICL computer, which was about 5 years old by the time McGregor developed the software. It had 24 bit words, and it looks like subgraphs were represented in the algorithm using a word as a bitset for atoms and bonds, which meant the implementation was limited to at most 24 atoms and 24 bonds.

It took about 9 seconds on average to find the pairwise MCS between "a simple file of 140 reactions."

In my literature research, I found a 1986 paper which still used the ICL 1906S computer. This shows that the hardware was in use at Sheffield for over 15 years. Humorously, you can see how it comments that ALGOL 68 was "not a particularly efficient language" for the 1986 algorithm, reflecting similar language from the Varkony paper.

Clique detection

[slide 16]
There's another technique to find the MCS by reducing it to the maximum clique problem.

I'm not going to go into the details of the clique approach. I find that it's easier for me to think about backtracking than the equivalent approach of finding cliques.

[Note: The usual citation for this is to Cone et al. in 1977, though Cone cites G. Levi's 1973 paper in Calcolo titled "A note on the derivation of maximal common subgraphs of two directed or undirected graphs." As that paper costs US$40 to download and is in an Italian applied math journal that at least my local library doesn't have, I've only read the first two pages, which are available online for free. That's enough to see that it's primarily a math paper and not a cheminformatics paper, which is why people usually cite Cone. Interestingly, the word "clique" does not exist in that paper. Who first said "clique" in a cheminformatics paper?]

[slide 17]
Let's jump forward a couple of decades. In 2002, Raymond and Willett published a very informative review paper on the various MCS solutions. This is after Raymond had implemented the RASCAL MCS program, which uses the clique approach.

[slide 18]
In the paper they recommend using the clique approach, in large part because mathematicians and computer scientists have spent a lot of time figuring out how to make clique detection be fast.

Papers can be viewed as a very long, slow conversation about a topic. Cao et al. point out that certain types of non-transitive comparison are hard to do with a clique algorithm. "For example, allowing a bromine atom to be matched to a chlorine atom when they are attached to an aromatic ring, but not elsewhere, is much harder to do with association graphs." Cao also points out that there have been improvements to the graph backtracking methods, such as the VF algorithm. In other words, just because Raymond and Willett might have been right in 2002, that doesn't mean that a clique solution is the best one for now.

[Note: libmcs from ChemAxon uses the clique approach plus heuristics to implement an approximate pairwise MCS. In the poster titled "Making the Most of Approximate Maximum Common Substructure Search", which was next to mine at the Sheffield conference, Péter Kovács discussed the effort it took to handle non-transitive comparisons.]

Multiple structure MCS through pairwise reduction

[slide 19]
The Raymond and Willett review paper focuses on pairwise MCS algorithms and not the multiple structure MCS. They point out that most methods can be extended to handle a collection of graphs.

This is true. Any pairwise method which finds maximal substructures can be made to support multiple structures through a reduction method: the maximal common substructures of the first pair are used as reference structures to compare against the third, and so on. This can be combined with pruning methods to make the search practicable.
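
Schematically, the reduction looks like the following sketch. It is written around a hypothetical pairwise_maximal_substructures(query, target) function that returns all maximal common substructures of its two arguments in a form that can be fed back in as the next query; the pruning against the best size seen so far is omitted.

    def multiple_mcs(structures, pairwise_maximal_substructures, size):
        # Compare the first structure against the second, keep the maximal
        # common substructures, compare those against the third, and so on.
        references = [structures[0]]
        for target in structures[1:]:
            next_references = []
            for reference in references:
                next_references.extend(pairwise_maximal_substructures(reference, target))
            if not next_references:
                return []   # nothing in common at all
            references = next_references
        best = max(size(ref) for ref in references)
        return [ref for ref in references if size(ref) == best]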

[Note: I developed this pairwise reduction technique for my own MCS algorithm in the late 1990s. Researching it now, it looks like Denis M. Bayada and A. Peter Johnson published this approach in "Detection of the Largest Patterns Common to a Set of Molecules, Using 2D Graphs and 3D Skeletons", which is a chapter in "Molecular Similarity and Reactivity: From Quantum Chemical to Phenomenological Approaches, 243-265." As a side-side note, it also looks like they didn't quite get the key differences between Varkony and Takahashi, but I haven't yet talked about that topic.]

[slide 20]
Pairwise reduction is an indirect solution. There's a straightforward direct extension of the clique program for multiple structures but it takes memory which grows exponentially with the number of molecules. This problem was resolved in the MultiMCS paper of Hariharan et al.

My two MCS implementations

[slide 21]
My first attempt at the MCS problem was in the late 1990s. I worked for a company called Bioreason. We developed a method to combine the MCS with clustering to build a dendrogram "where each node contains a set of molecules and a chemical fragment common to the molecules." I wrote that MCS code using a pairwise backtracking algorithm combined with the reduction technique to handle multiple structures.

[Note: Actually, this was the second MCS implementation at Bioreason. Brian Kelley wrote the first, as an approximate method based on a genetic algorithm. We used this method during testing, to make sure that the GA never found a larger common substructure than the exact method.]

The experience with that project led to ClassPharmer, which in turn became the basis for MedChemStudio, available from Simulations Plus. Even though there's probably no remaining code of mine in the project, I'm still pleased that there's a direct lineage from code I wrote in the 1990s to a tool which people are still using.

fmcs

[slide 22]
In mid-2011, Roche asked if I would implement an MCS method for RDKit. I no longer had access to my original MCS work, but that's okay because I had been working on a new idea based on subgraph enumeration. The result of this is 'fmcs', which is now distributed as part of RDKit.

It has a pretty standard set of MCS options. It defines a common substructure through a SMARTS pattern definition, which gives pretty good control over how atoms and bonds should be matched. Arbitrary atom class assignment is possible through a bit of a hack; you can overload the isotope field with an arbitrary integer label to represent the atom class type. Sadly, there is no corresponding hack to define arbitrary bond classes, but that need hasn't come up in my discussions with scientists, nor is it mentioned in previous MCS papers.
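
For example, here is how the isotope hack looks with the rdFMCS interface that fmcs eventually became in RDKit. The parameter and enum names are from that later API as I recall it, so treat the details as assumptions rather than a transcript of the 2013 code.

    from rdkit import Chem
    from rdkit.Chem import rdFMCS

    mols = [Chem.MolFromSmiles(smi) for smi in ("c1ccccc1CCO", "c1ccccc1CCN")]

    # Overload the isotope field with an arbitrary atom class label,
    # e.g. class 1 for aromatic atoms and class 2 for everything else.
    for mol in mols:
        for atom in mol.GetAtoms():
            atom.SetIsotope(1 if atom.GetIsAromatic() else 2)

    result = rdFMCS.FindMCS(mols,
                            atomCompare=rdFMCS.AtomCompare.CompareIsotopes,
                            bondCompare=rdFMCS.BondCompare.CompareOrder,
                            timeout=30)
    print(result.numAtoms, result.smartsString)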

[slide 23]
The most novel difference in fmcs is that it can find the MCS that matches at least M of N molecules. If M=N then this is the normal multiple structure MCS. For example, the MCS between all of vanillin, zingerone, and capsaicin has 9 heavy atoms ...

[slide 24]
... while the MCS which is in at least two of those structures has 13 atoms.
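
In rdFMCS terms that M-of-N search is exposed, as far as I recall the API, as a fractional threshold parameter. A hedged sketch with those three compounds: the SMILES are my own transcriptions and the exact atom counts depend on the matching options, so take the numbers on the slides as the reference.

    from rdkit import Chem
    from rdkit.Chem import rdFMCS

    smiles = {
        "vanillin":  "O=Cc1ccc(O)c(OC)c1",
        "zingerone": "CC(=O)CCc1ccc(O)c(OC)c1",
        "capsaicin": "COc1cc(CNC(=O)CCCC/C=C/C(C)C)ccc1O",
    }
    mols = [Chem.MolFromSmiles(smi) for smi in smiles.values()]

    all_three = rdFMCS.FindMCS(mols)                     # M = N = 3
    two_of_three = rdFMCS.FindMCS(mols, threshold=2/3)   # M = 2, N = 3
    print(all_three.numAtoms, two_of_three.numAtoms)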

[slide 25]
The fmcs algorithm is similar to that of Varkony. I pick a reference molecule from the set and generate its subgraphs. For each subgraph, I convert it into a SMARTS pattern, then do a subgraph isomorphism check to see if it matches the other molecules.

The largest SMARTS pattern that matches is an MCS for all N structures. Of course, if M<N then I need to enumerate subgraphs from a few more reference structures, while on the other hand I don't have to check all of the other molecules to see if a given subgraph SMARTS matches. The major difference at this level is that Varkony uses canonicalization to determine if two subgraphs match, while I use a subgraph isomorphism search. I chose this because I could build on top of the existing VF2-based SMARTS matcher in RDKit to do the backtracking search.

Just like with the Varkony method, you can intuitively see that it will work, but it might be slow. The methods to make it fast are nearly identical to how to speed up any other MCS program.

[slide 26]
This is a new algorithm, and it's my algorithm, so I'll go into it in a bit more detail.

(1) First, I create atom and bond types. These are based on the atom and bond SMARTS. The atom type is just the atom SMARTS, while the bond type is based on the canonical SMARTS representation of the atom type at one end, the bond SMARTS, and the atom type at the other end; of the two possible orderings, use the alphabetically smallest one.
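
A tiny illustration of that bond typing, with plain strings standing in for the atom and bond SMARTS (the names here are illustrative, not the fmcs internals):

    def canonical_bondtype(atom_smarts1, bond_smarts, atom_smarts2):
        # Two ways to write the same bond; keep the alphabetically smallest.
        forward = atom_smarts1 + bond_smarts + atom_smarts2
        backward = atom_smarts2 + bond_smarts + atom_smarts1
        return min(forward, backward)

    # canonical_bondtype("[#6]", "-", "[#8]") == canonical_bondtype("[#8]", "-", "[#6]")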

[slide 27]
Once I have the canonical bond types, I can remove (or mark as "do not participate") the bonds which don't occur at least M times, because there's no way they can be part of the MCS. Bond removal may cause fragments.
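
As a sketch, the support-counting part of this step amounts to the following (a hypothetical bondtype helper; fragmenting is left out):

    from collections import Counter

    def prune_rare_bondtypes(per_structure_bonds, bondtype, min_support):
        # Count how many structures each canonical bond type occurs in.
        support = Counter()
        for bonds in per_structure_bonds:
            support.update({bondtype(b) for b in bonds})
        # Keep only the bonds whose type reaches the required support.
        return [[b for b in bonds if support[bondtype(b)] >= min_support]
                for bonds in per_structure_bonds]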

Order the molecules by the size of the largest fragment, such that the smallest largest fragment is first. This is a common technique, which should help improve the overall performance.

[slide 28]
(2) There are generally two ways to search for an MCS: either start by looking for a one bond match, then two bonds, three, and so on, or start from the smallest largest fragment and remove bonds.

I decided to be optimistic and try the smallest largest fragment first before starting by building up the subgraphs. I expect that the input to the MCS algorithm will be similar structures or structures which are already known to have a common core. In that case, the pruning from step 1 might have been enough so that the smallest largest fragment is the actual MCS.

So, I take the largest fragment, turn it into a SMARTS pattern, and do the subgraph match to see if there are at least M-1 matches. If so, the algorithm is done.

Subgraph enumeration

If not, then it's time to start enumerating the substructures.

[slide 29]
The literature shows that subgraph enumeration is not obvious. The Varkony algorithm suffered from generating duplicate structures, I haven't come across a description of the enumeration method when looking through the cheminformatics literature, and it's not typically covered in a graph theory course.

[slide 30]
The basic concept is quite simple. Start by generating all subgraphs which use the first bond, and then generate all subgraphs which don't use the first bond. Repeat the process for the second edge, then the third, and so on. This gives you all 2**n ways to make a subgraph, including the empty graph, and does so uniquely.
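
As a minimal sketch of that enumeration (my own illustration, not the fmcs code):

    def all_subgraphs(bonds):
        # Every subset of bonds, each generated exactly once.
        if not bonds:
            yield []
            return
        first, rest = bonds[0], bonds[1:]
        # All subgraphs which use the first bond ...
        for sub in all_subgraphs(rest):
            yield [first] + sub
        # ... and then all subgraphs which don't.
        for sub in all_subgraphs(rest):
            yield sub

    # list(all_subgraphs(["a", "b"])) -> [['a', 'b'], ['a'], ['b'], []]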

This simple method generates both connected and disconnected subgraphs but fmcs only uses connected subgraphs so I use a slightly different algorithm.

[slide 31]
I create an initial set of seeds and put them into a priority queue. The terminology and idea of a 'seed' is very similar to the 'nucleus' in Varkony. A seed contains three things: the set of bonds which are in the seed, the set of bonds which may not be used for future growth, and the set of bonds connected to the seed which are available for future growth.

If an initial seed contains bond i then bonds 1 to i are specifically excluded from future growth. This prevents duplication.

[slide 32]
For each iteration, I pick a seed and grow it. If there are n possible outgoing edges then there are 2**n - 1 possible ways to grow the seed into a new seed. For example, if there are two outgoing edges then I could choose both edges, choose the first, or choose the second. These new seeds go back into the priority queue. The n outgoing edges from the parent graph are marked as unavailable for future growth by the child, and new outgoing edges are found.
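
Here is a rough sketch of that seed bookkeeping, for a graph given as a mapping from bond index to its two atoms. The priority queue and all of the pruning are left out, so this only shows how the seed's sets keep the enumeration of connected subgraphs duplicate-free (my own reconstruction, not the fmcs code).

    from itertools import combinations

    def outgoing_bonds(graph, seed_bonds, excluded):
        # Bonds adjacent to the seed which are still allowed to be added.
        atoms = {a for b in seed_bonds for a in graph[b]}
        return {b for b, (a1, a2) in graph.items()
                if b not in seed_bonds and b not in excluded
                and (a1 in atoms or a2 in atoms)}

    def connected_subgraphs(graph):
        bond_ids = sorted(graph)
        # Initial seeds: one per bond; all earlier bonds are excluded.
        stack = [(frozenset([b]), frozenset(bond_ids[:i]))
                 for i, b in enumerate(bond_ids)]
        while stack:
            seed_bonds, excluded = stack.pop()
            yield seed_bonds
            available = outgoing_bonds(graph, seed_bonds, excluded)
            # Each non-empty subset of the outgoing bonds makes one child seed;
            # whatever the child did not take becomes excluded for it.
            for r in range(1, len(available) + 1):
                for chosen in combinations(sorted(available), r):
                    stack.append((seed_bonds | set(chosen),
                                  excluded | (available - set(chosen))))

    # e.g. connected_subgraphs({0: ("A", "B"), 1: ("B", "C"), 2: ("C", "D")})
    # yields each connected bond subset exactly once.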

[slide 33]
I sort the priority queue of seeds by size. As an unusual variation compared to most MCS algorithms, the fmcs algorithm can grow by multiple bonds per step, then backtrack if that fails. I wrote it this way because of the heritage of how I came up with the algorithm. The fmcs approach is optimistic, but I've not tested to see if it's better than a more conservative single bond expansion.

[slide 34]
I also implement some of the usual pruning methods in MCS programs. If I can show that fragment growth can never beat the current largest subgraph then I don't bother to enumerate that part of the graph. There are also some optimizations that I haven't yet added, like keeping track of the number of each bond type common to all of the structures and using that as a more fine-grained upper bound.

SMARTS subgraph isomorphism instead of canonicalization

[slide 35]
Then comes the key difference from Varkony. I convert the subgraph into a SMARTS pattern and check that it exists in at least M-1 of the remaining structures.

If it doesn't, then discard this seed and don't try to grow it, because no larger seed can be part of the MCS.

However, if it does, then add its children to the priority queue. If the seed is larger than the current best common substructure size, then it's an MCS candidate. fmcs supports an option that was in MultiMCS, which is that if a ring bond from a structure is in the MCS then the corresponding bond in the MCS must also be in a ring. This option is called "complete rings only." It's implemented with a special post-processing step.

If the candidate passes these requirements, then it's the new best MCS.

[slide 36]
There's an optimization I can do as a consequence of turning the subgraph query into a SMARTS. I can easily cache the subgraph isomorphism search results so if I see something like "[#6]-[#6]-[#6]" again, I can reuse the cached result rather than do the isomorphism tests again.
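
In sketch form, the cache is nothing more than a dictionary keyed by the SMARTS string, assuming a fixed target set for the run (my own illustration, not the fmcs code):

    from rdkit import Chem

    _match_cache = {}

    def count_matches(smarts, targets):
        # The SMARTS string itself is the cache key, so a repeated subgraph
        # costs one dictionary lookup instead of N isomorphism searches.
        if smarts not in _match_cache:
            pattern = Chem.MolFromSmarts(smarts)
            _match_cache[smarts] = sum(1 for mol in targets
                                       if mol.HasSubstructMatch(pattern))
        return _match_cache[smarts]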

[slide 37]
If you think about this a bit you might realize that there are multiple ways to represent the same search using SMARTS. Perhaps I can save time by canonicalizing the SMARTS strings, in order to reduce the number of duplicate subgraph isomorphism searches.

[Note: The general category of "canonical SMARTS" is NP-complete, as Roger Sayle pointed out in a 1997 Daylight EuroMUG meeting. However, the fmcs atom expressions are simple, so canonicalizing them is trivial.]

Greg Landrum added a feature to RDKit so I could test this out. I found out that canonicalization in general took too much time. Remember, canonicalization is only needed if there is backtracking where different subgraphs are isomorphic and represented differently, and importantly, only when the canonicalization time is faster than the isomorphism search time. The combination of these factors was rare in my test benchmarks. For example, in one experiment I needed to find the MCS of over 1,000 structures before canonicalization was faster. I don't expect that to be a common case.

[slide 38]
Instead, I developed what I've been calling semi-canonical SMARTS. For each atom in the subgraph, I assign it an initial atom invariant based on the atom type and the lexicographically ordered list of its subgraph bond types. This is a circular fingerprint of size 2, and corresponds to a single iteration of the Weininger et al. CANGEN algorithm. These invariants are fast to generate, and unique enough to give usually-canonical subgraphs.
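
A sketch of the invariant itself, assuming the subgraph is available as each atom's type plus the canonical types of the subgraph bonds it participates in (illustrative names, not the fmcs internals):

    def semi_canonical_invariants(atom_types, subgraph_bondtypes_per_atom):
        # atom_types: {atom_id: atom SMARTS}
        # subgraph_bondtypes_per_atom: {atom_id: [canonical bond types in the subgraph]}
        return {atom: (atom_types[atom], tuple(sorted(subgraph_bondtypes_per_atom[atom])))
                for atom in atom_types}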

[slide 39]
Finally, in my code I check for a timeout. This is, after all, an NP-hard problem, and there are times when I would rather give it a 30 second timeout for an approximate solution than spend hours waiting for a result.

MCS benchmark timings

[slide 40]
Now for the scary question: How well does this new algorithm work?

To examine that, I had to create benchmarks which reflect how I think the MCS program might be used. (Actually, I would have used existing benchmarks, but the only ones I could find were too small to be informative.)

I started with a cleaned up version of ChEMBL 13 and generated 7 different benchmarks. The first selects two different compounds at random and finds the MCS between them. The others are based on fingerprint similarity searches. I found the k=2, k=10, and k=100 nearest neighbors for fingerprints picked at random, and I found the k=100 nearest neighbors for thresholds of 0.95, 0.90, and 0.80.

The benchmarks and the code used to generate them are under the BitBucket account for fmcs.

[slide 41]
The benchmarks include cases which cause all of the MCS programs to time out after 30 seconds. When a timeout occurs, fmcs will report the largest substructure found by that point, which is usually pretty good, and it means that fmcs can also be used as an approximate MCS program with a fixed upper bound on CPU time. Not all MCS programs can report a partial size.

[slide 42]
I was able to time fmcs against three other methods: extractCommonScaffold from the Indigo C++ cheminformatics toolkit, the MultiMCS maximum clique Java program I mentioned earlier, and MoSS, a frequent subgraph mining Java program that I'll mention in a bit.

It's difficult to compare the results directly. Unfortunately, most cheminformatics toolkits like to reperceive structures on input so they follow consistent aromaticity rules. This is what the Daylight toolkit did, and many follow its lead. While everyone agrees that "C1=CC=CC=C1" should be perceived as "c1ccccc1", they all differ with more complex fused multi-ring systems. Since the MCS search methods depend on bond types, different perceptions of an input structure can sometimes lead to different MCS results.

I inspected many of the differences until I started to go cross-eyed. As far as I can tell, the different programs only differ due to a difference in aromaticity perception.

[slide 43]
It's important to note that this benchmark does not give a perfect head-to-head comparison. Indigo's extractCommonScaffold and MoSS (when configured to find substructures with 100% support) return all maximal subgraphs, which means they are doing more work than fmcs, which only finds one maximum subgraph. Also, these 4 benchmarks were done on three different computers, so there can be a factor of 2 or more variation just because of the different hardware.

Still, it's enough to point out that fmcs isn't all that bad. I've circled the fastest times in red and the second fastest in yellow. It looks like MoSS is overall the fastest algorithm, but not always.

[slide 44]
fmcs definitely isn't the fastest method, but I'm okay with that. For one, most of fmcs is written in Python. A profile shows that less than 5% of the time is spent in C++. Presumably, if more of fmcs were in C++ then there might be a 5-10x speedup... which sounds suspiciously like what the authors of the Varkony paper said about their implementation!

Also, there are some additional pruning heuristics that I haven't yet added.

MCS threshold profile

[slide 45]
I mentioned that fmcs finds the MCS of M of N structures, but I didn't give any real justification for why that might be useful. I wanted to see if I could use the MCS to find misclassifications in a cluster and give some sort of automatic quality score. It's pretty common to use similarity or some other means to group structures together, then try to find the MCS of the group. Unfortunately, the MCS is a rather fragile definition. A single misclassified structure can cause a very big difference in the result. I wondered if an MCS definition which might leave one or two structures out would provide useful information.

[slide 46]
I used ChEBI as my test case. ChEBI is a database of small molecules placed into an ontology by human curators. Some of the ontology terms are structure related. While terms like "contains two rings" cannot be defined as a SMARTS pattern, others can in principle be defined by SMARTS.

There are occasional misclassifications in the ChEBI ontology. For example, one of the nine 1,4-benzodiazepinones shown here is incorrect. If you look at the depictions for a bit you might be able to figure out which one is wrong.

[slide 47]
Try to find the MCS of all 9 and you'll see that the expected benzodiazepinone core structure isn't found.

[slide 48]
But bring it down a step and find the MCS which is in at least 8 structures and you'll see the expected benzodiazepinone core in 8 of the structures, and quickly see the one misclassified structure. In this case it's a 1,5-benzodiazepinone.

[slide 49]
What I then did was compute the MCS size (as the number of atoms) for values of M from 4 to 9. (I actually used a binary search here so I didn't have to test all possible values of M. The MCS for M=6 is the same size as the MCS for M=8, which meant there was no need to compute the size for M=7.)

The plot of M vs. MCS size gives me a threshold profile. An abrupt dip in the profile indicates that there's something I might need to look at.
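
As a sketch, using rdFMCS's threshold parameter as I recall it, and skipping the binary-search shortcut:

    from rdkit.Chem import rdFMCS

    def threshold_profile(mols, m_values, **mcs_options):
        # MCS size (in atoms) for each "must match at least M structures" value.
        profile = {}
        for m in m_values:
            result = rdFMCS.FindMCS(mols, threshold=m / len(mols), **mcs_options)
            profile[m] = result.numAtoms
        return profile

    # e.g. threshold_profile(mols, range(4, len(mols) + 1))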

I computed the profile with three different topology parameters. One profile requires that element types and bond types match, another only requires that element types match, and the last only looks at the untyped topology. This gives me different sorts of information which can help identify the reason for a mismatch.

[Note: the pure topology profile never proved useful.]

[slide 50]
To see if MCS threshold profiles can be useful, I developed a prototype tool to explore the ChEBI ontology. The ontology terms are on the left. Click on a term and it shows the corresponding profiles. The profile points are selectable. Clicking on a point shows the SMARTS pattern for that point, displayed on the right using Karen Schomburg's excellent SMARTSview. The structures belonging to the term are displayed on the bottom, sorted so the structures which don't contain the SMARTS are shown first, and structures which match have the match highlighted.

[slide 51]
The benzodiazepinone mismatch showed up directly in the profile based on atom and bond types. Sometimes it's the difference between two different MCS profile types which gives a useful signal. There appears to be a registration problem with the pyrazoles because five of them are not perceived as being aromatic by RDKit. In this case the profile based on bond types has a dip while the one which ignores bond types stays constant. This is often a signal that there is an aromaticity problem.

[slide 52]
The monocarboxylic term is interesting because it shows some of the different types of issues that come up with this approach. The SMARTS [#6]-[#6](=[#8])-[#8] matches 1424 of 1435 structures. However, this SMARTS requires a carbon in the R1 position, and there are two false negatives with structures which have a nitrogen in that position. (An obvious next step, which I haven't done, would be to use the SMARTS pattern to find additional structures which should have been classified as monocarboxylic.)

There are four true negatives, where the structure simply isn't a monocarboxylic acid. Finally, five of the structures show an implementation bug. The original ChEBI structures are in an SD file, which has a mechanism to denote an R-group. My prototype tool uses SMILES strings, which don't have an R-group notation. The conversion from SD records to SMILES converted the R-group into a "*", which is treated as its own atom type.

[slide 53]
I examined many ChEBI terms. Nearly all of them either had no problem or were grouped by a structural term that could not be described by a SMARTS. With practice, I found that the MCS profile could help me identify possible problem areas, but I ended up being unable to come up with an automatic quality score.

Frequent subgraph mining

[slide 54]
Here's where the history gets interesting.

I presented the fmcs algorithm last fall at the German Conference on Cheminformatics (GCC). Two people asked me if the fmcs solution wasn't just the same as frequent subgraph mining.

[slide 55]
I had no idea - I hadn't heard of frequent subgraph mining. So I read some of the papers from that field. They nearly all come from data mining, and use a somewhat different and perhaps more difficult language than I'm used to, so it took me a while to figure them out.

[slide 56]
The best known frequent subgraph mining programs seem to be gSpan and MoSS. Reading them I get the strong sense that these are written by people who don't really know the underlying chemical motivation. For example, I interpret the original MoSS paper to say that they don't really understand why chemists are interested in closed molecular fragments, but they have a conjecture. Still, since they know that's what chemists want, there are some optimizations they can do to handle that.

[slide 57]
These papers introduce new relevant terms for me, which I should apply to the fmcs description. Where I say "in at least M structures", they say "has a support of M." A closed molecular fragment is a fragment with support M where there is no larger fragment which contains the fragment and which also has support M. In my terms, it's a maximal common substructure to at least M of the structures.

[slide 58]
Frequent subgraph mining is often used to find discriminative fragments, that is, fragments which are in the actives but are rarely shared by the inactives. This is actually a more sophisticated approach than most of the MCS papers, which tend to focus on actives and ignore information about inactives. I was a bit chagrined when I read about this because while obvious in retrospect, I had never really thought about that issue in the MCS context.

gSpan and MoSS

[slide 59]
The gSpan and MoSS programs look at all of the possible embeddings across all of the structures. This can generously be seen as an extension of the McGregor method to N structures. They start with the empty set then work on all unique embeddings using one atom, then one bond, two bonds, and so on. At each point they keep track of the support for that embedding. If it's too low then they prune.

The fundamental difference between gSpan and MoSS is how the graph embedding tree is traversed; gSpan uses a breadth-first search (Apriori method) while MoSS uses a depth-first search (Eclat method).

[slide 60]
The problem with either approach is that the number of embedded graphs can be huge. In a toy case there might be only a couple of ways to match a simple pattern like "O-S-C" ...

[slide 61]
... but in more complex cases, like with rings, it can lead to "prohibitively high search complexity."

The original MoSS paper partially handled the complexity through a local solution, which prevented some of the explosion where a benzene ring can cause 12 different embeddings.

[slide 62]
Subsequent work solved it through clever use of subgraph canonicalization. MoSS keeps track of the canonical representation for each substructure. When there are two different embeddings for the same substructure it uses the one which is lexicographically smallest. It does this by introducing a postfix version of SMILES, which is a perfect match to the depth-first machinery. The method is quite aware that the addition of a new bond can sometimes reorder the canonical representation, so it carefully monitors for all of those possible cases.

I've been doing cheminformatics for quite a while, and have never come across this approach before. I think it's brilliant and insightful thinking.

[slide 63]
It's my hypothesis that there's another way to describe the MoSS algorithm. Instead of thinking of the canonicalization as a way to remove duplicates, an alternative is to generate all canonical subgraphs in lexicographic order, and check for valid support across all of the structures.

If so, then that's part of the same intellectual family as the Varkony paper, where MoSS overcame many of the problems that Varkony had.

[slide 64]
What's amazing about this is that MoSS is designed as a frequent subgraph mining program, and not an MCS program. It finds the MCS as an edge case by setting the support threshold to 100%, but it can do much more. Yet in the timings, its performance was at least comparable to dedicated MCS programs!

Equally amazing is that it doesn't seem that the frequent subgraph mining people realize that they are doing a superset of the MCS algorithm. That connection is not mentioned in any of the frequent subgraph mining papers that I read, and you'll see later that gSpan avoids using subgraph isomorphism because it's NP-hard, seemingly not realizing that gSpan is itself solving an NP-hard problem.

"ad hoc" MCS methods

[slide 65]
Let's go back to the Raymond and Willett review paper. They classified Varkony as an "ad hoc" method. I used to think that was a bit odd, since fmcs can be seen as a variant of the Varkony approach, and fmcs's performance is pretty good.

[slide 66]
It wasn't until I learned more about frequent subgraph mining that I realized that Raymond and Willett are right. I think that Varkony et al. is miscategorized as an MCS paper, when it actually describes a frequent subgraph mining program, one which finds all subgraphs with 100% support. As such, it does a lot more work than it needs to in order to find the MCS, in part because it lacks pruning methods.

With that in mind, I reread the Varkony paper for the fourth or so time and saw the single line that validated my interpretation: "There are three options to display results of the search for structural similarities: (a) All. The program reports all common substructures of all sizes."

That's why I say that the Varkony paper should be seen as one of the earliest frequent subgraph mining papers.

The change to find all substructures with a given support is trivial, but as "frequent subgraph mining" didn't start until about 20 years later they likely didn't realize that it could be an interesting new field.

[slide 67]
Going again back to the review paper, Raymond and Willett also mention in passing that there was a modified version of the Varkony algorithm subsequently proposed by Takahashi in 1987. I read through that paper, and realized that it replaced canonical subgraph generation with subgraph isomorphism. This is the intellectual predecessor to fmcs!

[slide 68]
Unfortunately, it's hard to figure that out from the paper. The English is difficult to follow and the subgraph isomorphism algorithm they use comes from Sussenguth 1965, which I hadn't heard of before. I'm surprised that a 1987 paper wouldn't have used the 1976 Ullmann method.

Based on the text, the Sussenguth algorithm is apparently very slow. Takahashi says that it's "the most time-consuming phase of the computing flow", and that canonicalization is "an indispensable prerequisite" for performance. However, in my own tests, subgraph isomorphism testing is fast and canonicalization is not required. I conclude this is due to RDKit's use of VF2 instead of the Sussenguth algorithm.

Thus, I think most people who read Takahashi believe that it, like Varkony, requires canonicalization, when it actually doesn't. The only reason I figured this out was that I implemented fmcs and showed that to be the case.

Putting those details aside, it was quite exciting to read that someone else had previously worked on this approach, and that I had managed to extend it in new and interesting ways.

[Note: I've researched a bit more of the data mining literature, and found that a few of them, like the book "Mining Graph Data" edited by Diane J. Cook and Lawrence B. Holder, with the chapter "Discovery of Frequent Substructures" by Xifeng Yan and Jiawei Han (the gSpan authors), reference Takahashi as a graph-related method in cheminformatics, but didn't notice that Takahashi is actually a variation on the Varkony paper from 10 years previous.]

Why so little previous work like fmcs?

[slide 69]
The obvious followup question is, why haven't any other people tried the subgraph isomorphism approach before? I've shown that fmcs has decent performance, and it's not that hard to understand.

Searching the literature gives some possible clues. Going back to the Armitage paper from 1967, they say "substructure search is useless here, unless one searches for every substructure of one compound in the structure of the second." That, combined with pruning, is the fmcs algorithm. But in reading the paper, it looks to be a throw-away line meant to imply that it's a completely unfeasible approach. Which, to be fair, it was for the hardware of that era.

But more than that, we needed to develop a better understanding of backtracking and pruning, and improved subgraph searches with first Ullmann and then VF2 before the fmcs approach could be practicable. By the late 1990s though, these were pretty well understood.

I suspect it was a combination of other factors. The clique approach is quite elegant, and Raymond and Willett showed that it was faster than the usual McGregor backtracking approach. There aren't that many people working on the MCS problem in chemistry, and even fewer working on the multiple structure MCS problem, so progress is slow.

Then there's the shaky feeling that the Takahashi/fmcs approach is too complicated to work well. fmcs goes from subgraph to SMARTS and back to subgraph, and potentially wastes a lot of time re-finding the same substructure only to add a single bond to it. And it's true - I can conceive of faster implementations. But on the other hand, a lot of work is put into making SMARTS matchers go fast, and that's work I don't need to redo.

The original gSpan paper gives another clue. The computer scientists avoided using subgraph isomorphism tests because it's "an NP-complete problem, thus pruning false positives is costly." It's always a bit scary to invoke NP-complete problems.

While they are correct, they didn't notice two things. First, in practice isomorphism matching of molecules is very fast, and second, as I showed, when an algorithm like gSpan or MoSS is used to find all substructures with 100% support, it ends up identifying the maximum common substructure ... which is itself NP-complete.

[Note: This comment in the gSpan paper is one reason why I don't think the frequent subgraph mining researchers quite realize its connection to the MCS problem. Another is that I asked Thorsten Meinl about it, and he doesn't know of anyone who has specifically pointed that out.]

Future work

[slide 70]
So, where do we go for the future?

There's any number of features to add to fmcs. For example, it's pretty straightforward to add support for a "*" as an alternate symbol for, say, a ring atom. This would help identify cases where there's a single atom substitution. It's also possible to require a certain initial subgraph to be part of the result, add additional pruning heuristics, and recode the algorithm in C++. This is an engineering task which "just" requires time and funding.

That said, I came into this problem because of thinking about canonical subgraph enumeration, only to find that isomorphism searching was more useful for this topic. However, I think canonical subgraph enumeration is a fruitful research area for the future.

The main reason is that canonical subgraphs can be precomputed, and then compared with fast string comparisons.

[slide 71]
For example, when clustering N compounds, it's pretty easy to compute, say, the canonical subgraphs with up to 7 atoms. There are on average only a few thousand per structure. To compare structure i with j, first look at the intersection of their subgraphs. If the MCS is 6 atoms or less, then you have the answer. Otherwise, you know the MCS has to be at least of size 7.
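
In sketch form, with each structure's precomputed subgraphs stored as a set of (atom count, canonical SMILES) pairs (a hypothetical representation, for illustration):

    def mcs_size_or_bound(subgraphs_i, subgraphs_j, max_atoms=7):
        # Intersect the precomputed canonical subgraphs of the two structures.
        shared = subgraphs_i & subgraphs_j
        best = max((n_atoms for n_atoms, smiles in shared), default=0)
        if best < max_atoms:
            return best, True        # exact MCS size
        return max_atoms, False      # only a lower bound; do a real MCS search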

This might speed up clustering by quite a bit.

[slide 72]
NextMove Software has done the most advanced work in this field, to my knowledge. They precompute the canonical subgraphs for all of the structures in the data set, with links between each parent and child. If a child has multiple parents then that child is a common subgraph to all of those parents.

[slide 73]
Finding the MCS between any two structures then reduces to a graph traversal of a (very large) lattice to find the highest common node.

[slide 74]
I've also been thinking about using canonical subgraphs as perfect substructure keys for substructure search. Again, generate all canonical subgraphs for the structures in the database, up to a given size, and use these as keys for the inverted index. When a structure comes in, enumerate its subgraphs, look up the corresponding inverted indices, and find the intersection. If a query substructure doesn't have an inverted index, then no superstructure for the query exists in the database.
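
A sketch with a plain Python dict standing in for the inverted index, and a hypothetical canonical_subgraphs helper; a real system would hand this off to something like Lucene.

    def build_index(structures, canonical_subgraphs):
        index = {}
        for struct_id, structure in structures.items():
            for key in canonical_subgraphs(structure):
                index.setdefault(key, set()).add(struct_id)
        return index

    def screen(query, index, canonical_subgraphs):
        # Intersect the posting lists; a missing key means no match is possible.
        hits = None
        for key in canonical_subgraphs(query):
            if key not in index:
                return set()
            hits = index[key] if hits is None else hits & index[key]
        return hits if hits is not None else set()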

What's pretty neat about this is that an off-the-shelf database like Lucene should be able to handle this sort of data without a problem.

[slide 75]
On the other hand, most people don't want to change their database environment to include a new database engine, so the third idea I've had is to improve substructure screen key selection.

There are two main types of substructure screens: hash-based like the Daylight fingerprints and feature-based like the MDL MACCS keys. The former tend to be more common because they handle arbitrary chemistry and are easy to tune to improve selectivity. The latter are usually made by hand, and the only other well-known example I can think of is the CACTVS substructure keys used by PubChem.

[slide 76]
Substructure screens are a sort of decision tree, and it's best if each bit can provide a discrimination of 50%. If that were the case then a 128 bit fingerprint would be able to provide very good screening ability. However, hash bits are correlated, which is why we need more bits, while feature key bits tend to have low density - only about 150 of the unique fragments in PubChem have a density above 10%, and many of these are highly correlated.

Unique canonical subgraphs can be used to characterize the universe of possible decisions. I've wondered if it were possible to analyze a set of representative queries and targets to improve screening selection. Importantly, a bit should be able to have multiple fragments associated with it, in order to improve its discrimination ability.

[Note: This is rather like optimizing the 20 questions problem, only all 20 questions must be selected beforehand. Quoting from Wikipedia: "Careful selection of questions can greatly improve the odds of the questioner winning the game. For example, a question such as "Does it involve technology for communications, entertainment or work?" can allow the questioner to cover a broad range of areas using a single question that can be answered with a simple "yes" or "no"."]

Unfortunately, I have not been able to come up with a good algorithm for this idea.

[slide 77]
This research involved a lot of software, and interaction with many people. I thank all of them for their help and support. Also, this presentation required digging into the early MCS literature. I thank the Chalmers University of Technology library for having most of the literature still on the shelves in paper, which meant I could do this research without hitting expensive paywalls all the time.

Comments?

Let me know if you have any comments or questions.

Is TDD better than Apps Hungarian? #

Greg Wilson and Jorge Aranda's essay Two Solitudes describes the different viewpoints and attitudes of software engineering researchers and practitioners. Here is one of the complaints:

At industry-oriented gatherings, it seems that a strong opinion, a loud voice, and a couple of pints of beer constitute "proof" of almost any claim. Take test-driven development, for example. If you ask its advocates for evidence, they'll tell you why it has to be true; if you press them, you'll be given anecdotes, and if you press harder, people will be either puzzled or hostile.

Most working programmers simply don't know that scientists have been doing empirical studies of TDD, and of software development in general, for almost forty years. It's as if family doctors didn't know that the medical research community existed, much less what they had discovered. Once again, the first step toward bridging this gulf seemed to be to get each group to tell the other what they knew.

Greg worked with Andy Oram to collect and edit the essays in Making Software: What Really Works, and Why We Believe It, as a way to present some of that research evidence to software practitioners. The Two Solitudes essay refers to that book, saying its "topics range from the impact of test-driven development on productivity (probably small, if there is one at all) to whether machine learning techniques can predict fault rates in software modules (yes, with the right training)."

Why is there a debate over TDD?

Did you catch the "probably small, if there is one at all" part? Why then is there a continued debate over the effectiveness of TDD when the conclusion from research is that there is at best only a small improvement in productivity? The Two Solitudes essay goes into some of the possible reasons. You should read it.

That got me to thinking about other well-accepted or well-rejected practices: version control software and Apps Hungarian notation. We as practitioners generally accept that version control software is a Good Thing and generally view Apps Hungarian as a negative. If we "know" that version control systems are useful, then can software engineering researchers find experimental evidence to support or reject the claim? If they have, then it might be an example of what a successful research experiment might look like. If they can't, then perhaps it says something about the weakness of the current experimental approaches compared to personal experience.

We can ask the same about other development practices. Can software engineering research show whether Apps Hungarian improves productivity over other variable naming systems?

With those experiments in place we could start asking questions like "are the gains from introducing TDD to the process more like the gains in introducing Apps Hungarian, or more like the gains in introducing Mercurial?" Or more crudely, "is TDD more effective than Apps Hungarian?"

(For "TDD" put pair programming, unit tests, coverage analysis, version control systems, or most any other practice.)

So let's ask the software engineering researchers if they can provide evidence for those beliefs of the practitioners. Perhaps they've done it already?

Do version control systems make people more productive?

As far as I can see (which isn't very far), no one has done experimental research to tell if a version control system makes people more productive. I asked Greg, and he was a bit chagrined to reply that he didn't know of any published research either. Yes, he promotes that everyone use a version control system, despite not knowing of research evidence to back that belief.

Curious, isn't it?

My challenge to researchers is, can you carry out experiments to test the effectiveness of version control software for different types of projects?

This research, like TDD, is complicated by many real-world factors: there are alternatives to using version control software, different development scenarios, different types of developers, and even different needs.

It sounds like a lot of work to test these possibilities when we already know the answer, but that's my point. If research methods can't address this question then how do we trust those methods on other topics, like TDD or pair-programming?

Perhaps the research will point out something interesting. Perhaps it will find that research projects which take less than three months and where there is only one developer are 10% more effective using an automatic versioning file system instead of a manual version control system. If that were true, perhaps we should recommend that most scientists use a versioning file system instead of version control.

Or perhaps we should develop better connections between the various file system monitoring tools (like inotify) and the version control systems, to give an automatic/no-touch interface to the version control system.

For that matter, some people start continuous integration tests based on checkins. Why not do CI based on filesystem changes? Obviously some will commit less often, but perhaps productivity will improve.

I can continue to conjecture, but of course, without evidence it's little better than blather. The world is full of blather. That's why we need research.

Does Apps Hungarian make people more productive?

Hungarian Notation is a classic topic. Quoting from Michael Feathers on the c2.com wiki page:

Hungarian notation inspires some of the most bitter religious wars among programmers. Detractors claim that it is indecipherable and that no real standards exist for its usage. Adherents claim that it becomes more likeable with use, and the instant mnemonic recognition of type that one gains with familiarity aids program comprehension immensely.

Hungarian notation seems to be in decline. Microsoft's .Net "General Naming Conventions" document says outright "Do not use Hungarian notation." Nor is Hungarian the de facto style in any of the few dozen languages that I'm familiar with, though some libraries, especially those with a Microsoft heritage, do use it.

This despite the efforts of people like Joel Spolsky to promote a distinction between Apps Hungarian and System Hungarian. The former is "extremely valuable" while the latter is from "the dark side."

The Wikipedia page lists other "notable opinions" both for and against.

Some people say it's useful in untyped languages. Others say Hungarian Notation just gets in the way. Is there any research to back up any of these opinions, or are variable naming schemes yet another unexplored field?

Does TDD make people more productive?

I'm picking on TDD, but this section can be applied to many other topics.

In any TDD debate you'll hear from people who find it useful, and those who don't. You'll hear about people who tried TDD but didn't find it useful, and you'll hear from others who want to help the first group do TDD correctly. And you'll hear from people who tried git and found it confusing ... though they often move to other version control systems instead.

Look at the old Hungarian discussions and you'll see the same patterns, including people that tried it, liked it, won't move away, and believe that others just don't understand Hungarian correctly.

It looks like the effectiveness of TDD has had many more studies than the effectiveness of either version control systems or Hungarian Notation, so let's use that as the reference. Is the difference in effectiveness between using PEP-8 style names and Apps Hungarian more or less than the difference between TDD and coverage-based testing? Is the difference in effectiveness between using version control instead of manual backups more or less than the difference between TDD and test-last?

And more importantly, are the answers to those questions opinions based on experience or are they backed by evidence? Can we use these experiments to calibrate our understanding of what software engineering researchers can identify?

Feedback

Do you know of any research which tests the effectiveness of version control systems over other approaches? What about of Apps Hungarian over other approaches? Perhaps you have a "strong opinion," a "loud voice," or have had a "couple of pints of beer" and want to leave a comment?

I sent the seed of this essay to Greg, and he posted it on "It will never work in theory." I think that's the best place to continue this discussion, so I hope to see you there.

I'll end with a quote from Peter Medawar's "Advice to a Young Scientist":

I cannot give any scientist of any age better advice than this: the intensity of a conviction that a hypothesis is true has no bearing on whether it is true or not. The importance of the strength of our conviction is only to provide a proportionately strong incentive to find out if the hypothesis will stand up to critical evaluation.

Postscript

The above essay places the entire burden on software engineering researchers to do new experiments. I don't mean for things to be so one sided. This was also meant for software practitioners.

Software practitioners claim a lot of things. Some, like version control systems, make sense. Hungarian notation, not so much. I don't think TDD makes much sense, and the research seems to back me up.

If you are a software practitioner then take that final quote from Medawar to heart. You can make claims about your method X or approach Y, but in the back of your head, remember that I'll be asking you "how do you know that your approach is any better at improving productivity than switching to Apps Hungarian?" You won't be able to answer that question ... unless X or Y is "Apps Hungarian," of course.

At best you'll point to success cases, and then I'll point to the success of Microsoft Word and Excel in the 1990s, and to shops like Fog Creek.

Of course, your proper response will be "but you use version control, right? Where's your evidence for that?" Sigh. Researchers? Help!



Interview with Igor Filippov about OSRA #

A new "Molecular Coding" podcast is out! I interviewed Igor Filippov about OSRA. We recorded it at ICCS in summer of 2011 but I didn't do the editing and transcript until the last couple of days. Good news on the personal side. After about two years of temporary housing, we have finally moved into some place more permanent, in Trollhättan, Sweden. (Yes, these sentences are related.)

Tomorrow I leave for the 2012 German Conference on Chemoinformatics, in Goslar, Germany. I present the fmcs work with Janna Hastings on Monday at 12:05pm, and I have a poster about chemfp at poster P38.

I hope to see some of you there. Let me know if you listen to the podcast!



PyRSS2Gen for Python3 #

Nine years ago I wrote and released PyRSS2Gen. It's a Python library for creating RSS2 feeds. You give it a Python data structure as input, and out pops XML on the other side.
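
To give a flavor of what that means, usage looks something like the following. This is a minimal sketch from memory rather than the package's own documentation; the URLs and text are placeholders:

import datetime
import PyRSS2Gen

rss = PyRSS2Gen.RSS2(
    title="Example feed",
    link="http://www.example.com/",
    description="A small example of building an RSS2 feed",
    lastBuildDate=datetime.datetime.utcnow(),
    items=[
        PyRSS2Gen.RSSItem(
            title="PyRSS2Gen for Python3",
            link="http://www.example.com/python3",
            description="PyRSS2Gen 1.1 works under both Python2 and Python3.",
            pubDate=datetime.datetime.utcnow()),
        ])

with open("feed.xml", "w") as f:
    rss.write_xml(f)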

It does one thing, and it does it reasonably well. People still use the package. A cursory search finds packages for it in Debian and Ubuntu, examples of using it in various projects, and 16 questions about it on StackOverflow.

Not bad for something which only took a few days to develop.

I don't get much mail about it. People really only send email when something's broken, and PyRSS2Gen is small enough that the few bugs it did have were quickly fixed for the 1.0.0 release.

Times change. PyRSS2Gen-1.0.0 wasn't Python3 compatible. The end of the world was nigh. But then I got a patch from Graham Bell with the two things which needed to change. The world was saved!

For your enjoyment, update to PyRSS2Gen-1.1. The module is a single file which is compatible with Python2.3 and later; and where "later" includes Python3.



The Structure Query Collection #

Finally! Long time readers might recall my long-term goal of improving small molecule search performance. There's my chemfp project for high-performance Tanimoto searches, and I have ideas on how to optimize substructure keys. But they suffer from a lack of a good performance benchmark.

Well, I still don't have a benchmark, since that requires targets. What I do have are several sets of structure queries, in a form that you can easily use to make your own benchmarks. I call it the Structure Query Collection, or SQC for short.

So you want to know if search is faster

There are dozens, if not hundreds, of papers on ways to improve search performance. Let's suppose you have a faster way to do a substructure search. (By this I mean that someone sketches a compound, which your search program interprets as a substructure query.) To show improvement, you need a baseline. The easiest is to select a random set of compounds from the database itself to use as queries. A problem is that all of the queries will have at least one match - the query compound itself. Real-world queries will sometimes not find matches. Perhaps you can partition your data so that some of the compounds aren't in the database and can be used as queries. What's a good ratio for that?

There's even a more fundamental problem: I think people who do a substructure search tend to use smaller structures than what's in the database. If the benchmark queries and targets come from the same set of compounds, then the queries are probably not characteristic of how the dataset will be used. This is only a suspicion: am I right?

The search for real-world data

The only way to find out is to look at real-world query sets. I know people at pharmas who of course have logs from internal searches, but it's considered confidential information since it may contain proprietary compound structure data. Even if the dataset was filtered so that only the public structures are released, there still may be enough for a competitor to get an idea of the types of structures the company is interested in. So that's a no.

PubChem by law can't reveal any information about the user-submitted queries. I contacted ChemSpider, but at the time they didn't have enough queries to be worthwhile. In general, companies which provide search don't want to reveal their queries since they get paid by people who want that information to be secret. And so on.

(BTW, if you have a chemical query data set that you can make public, email me and let me know so that I can add it to the collection.)

BindingDB queries

But all is not lost. Last year I talked with Michael Gilson of BindingDB. They have no promises or expectation of privacy of user-submitted data, and he had no problem with getting me that information. More specifically, he asked Tiqing Liu to do the data extraction for me.

That meant some munging of the log files, and a bit of back and forth while I figured out what Tiqing did and he could figure out what information I wanted. The result was a set of SMILES, sketched in Marvin, for each of three classes of search: 3.3 million for exact search, 550,000 for similarity search, and 14,000 for substructure search.

(While I first asked for this data last year, this topic is very much a background task and I didn't have the time to follow up on it until a couple of weeks ago.)

SMARTS queries

That covers searches where the input is a sketched structure. What about SMARTS searches? Rather fewer people sketch or write SMARTS queries directly. Instead, SMARTS are more often part of the automated workflow for feature counts and structure alerts. While I would love to get my hands on real-world user-created SMARTS query sets, it's useful to have workflow query sets too. Luckily, I know of a few.

A few weeks ago, Ehrlich and Rarey published Systematic benchmark of substructure search in molecular graphs - From Ullmann to VF2. This has a set of SMARTS patterns extracted from various sources. They used it to benchmark two different subgraph isomorphism implementations. The SMARTS patterns are available in the supplementary information, although it took me a while to figure out how to get the patterns.

Greg Landrum had developed a similar benchmark for RDKit, but it was somewhat hidden away in its source distribution and not many people knew about it.

For a classical touch, RDKit defines 162 of the 166 MACCS keys as SMARTS patterns. The MACCS keys are widely used as a cheminformatics fingerprint, so this could be used to benchmark fingerprint generation. SQC actually includes the tweaked version of these SMARTS which I put into chemfp. I changed a few of the SMARTS to be more cross-toolkit friendly.

The "Structure Query Collection"

I've collected all of these together into one spot, called the Structure Query Collection. Not a very creative name, I know, but it is descriptive. There are currently the three BindingDB structure sets and the three SMARTS query sets.

Each set contains the raw data or relevant subset of the data as I got it from the respective sources, as well as source details. The raw data needs processing - at the very least to put the SMILES or SMARTS into an easily used form - so I also included the details, including software, for how I extracted and cleaned up the data so it can be used in a benchmark.

First exploration: substructure query sizes

Earlier I conjectured that structures used as a substructure query will tend to be smaller than the molecules in the target set. Now I have the data to test that hypothesis.

I'll use my OpenSMILES-ragel project, which has a command-line program to count the number of atoms and bonds in a SMILES string. It does very little SMILES processing, and mostly assumes that the SMILES are correct beyond the basic syntax level.

Here I get the number of atoms in each substructure query:

% smiles_counts < sqc/BindingDB_substructure/BindingDB_substructure.dat  | head
atoms: 20 bonds: 22
atoms: 21 bonds: 23
atoms: 24 bonds: 27
atoms: 17 bonds: 19
atoms: 21 bonds: 24
atoms: 23 bonds: 26
atoms: 23 bonds: 26
atoms: 20 bonds: 23
atoms: 10 bonds: 11
atoms: 16 bonds: 18
Or to get the average and standard deviation:
% smiles_counts < sqc/BindingDB_substructure/BindingDB_substructure.dat | 
   awk '{sum+=$2; sumsq+=$2*$2} END {print sum/NR, "+/-", sqrt(sumsq/NR - (sum/NR)**2)}'
17.8607 +/- 11.1145

For representative targets, I'll use the same ZINC subset from Ehrlich and Rarey's supplemental file "supp3.zip", in datasets/zinc_lead-like_2011-02-12_first100k.smi. This contains some SMILES which the OpenSMILES grammar doesn't accept (it uses an odd SMILES dialect I've not seen before), but I can filter out the failures by removing lines where the number of atoms is "-1".

% SMILESFILE=~/databases/Ehrlich_and_Rarey_VF2_Ullmann/datasets/zinc_lead-like_2011-02-12_first100k.smi
% smiles_counts < $SMILESFILE | head
atoms: 20 bonds: 21
atoms: 17 bonds: 17
atoms: 19 bonds: 20
atoms: -1 bonds: -1
atoms: 22 bonds: 23
atoms: -1 bonds: -1
atoms: 17 bonds: 17
atoms: 14 bonds: 16
atoms: 21 bonds: 23
atoms: 22 bonds: 24
% smiles_counts < $SMILESFILE | fgrep -v -- -1 |
   awk '{sum+=$2; sumsq+=$2*$2} END {print sum/NR, "+/-", sqrt(sumsq/NR - (sum/NR)**2)}'
20.238 +/- 2.92336
Since I wasn't sure if there was a bias in the '-1' reports, I double-checked this by changing the failure cases to something which was no longer a valid SMILES but where smiles_counts would still report the same number of atoms. First, I check that the transformation works, and then I redo the count:
% perl -pe 's/\)(\d|%\d\d)+/)/g' $SMILESFILE | fgrep -- -1
% perl -pe 's/\)(\d|%\d\d)+/)/g' $SMILESFILE | smiles_counts |
   awk '{sum+=$2; sumsq+=$2*$2} END {print sum/NR, "+/-", sqrt(sumsq/NR - (sum/NR)**2)}'
20.4499 +/- 2.90956

So, BindingDB substructure queries average 18 atoms while the ZINC targets average about 20 atoms. This agrees with my hypothesis, but I would need to get a copy of BindingDB for better assurance.



Optimizing substructure keys #

Advertisement: This essay describes a research project I've slowly been working on during the last 18 months. I believe I can do very fast substructure searches of large compound data sets. I would like to make it more real, but that requires funding. Contact me if you want to collaborate and can pay for it.

There are two related goals in my quest for an inverted index library, which is unfortunate because what I want for one is not necessarily the best for the other. One goal is to see if I can do PubChem-scale substructure searches with "interactive" performance (under 0.1 seconds). I think I can do that with a lot of precomputation based on substructure enumeration and optimized key development. Based on the feedback from various people, I think this is best handled within an existing database, and most probably PostgreSQL.

The first goal requires developing new database substructure search methods. My other goal is to see if I can optimize substructure keys within the framework used in most existing substructure search programs.

The existing approaches use either an explicit substructure-based fingerprint like the MACCS or CACTVS/PubChem keys, or an exhaustive hash-based fingerprint like the Daylight fingerprints. The substructure-based ones are hand-optimized and small (under 1000 bits), but don't do well with unusual chemistry. The exhaustive ones can handle diverse chemistries, but are larger (typically 1024-4096 bits).

I want to see if I can use my subgraph enumeration approach to do target- and query-based key optimization. That is, given a set of structures in the database and a characteristic set of queries, I would like to develop short, dense fingerprints which can optimize those queries.

Problem: I lack a query set

I'm not there yet. The biggest problem is that I lack any sort of representative sample of the types of queries people do. People at pharmas don't want to contribute their data sets because it's hard to remove all of the proprietary queries from the dataset. By law, PubChem is not allowed to release this data.

Do you have a good (1,000+) query set that I can use for a benchmark? If yes, email me!

So until I get a good query set, I'll have to assume that the targets themselves are representative queries. This of course is incorrect. Almost by definition, a substructure query will be smaller than what's in the database. Plus, if someone uses a sketch program to make the query, then they are going to build it up an atom or ring at a time, so the first dozen queries or so will likely be small and have lots of matches.

Optimizing the query keys

At its heart, a substructure query is a subgraph isomorphism test. This is slow. A substructure filter is a quick test which tries to reduce the number of required isomorphism tests. For example, if an "N" (nitrogen) exists in the query but does not exist in the target then there's no need to do the full isomorphism test.

However, if "N" does not exist in the query then it might still match a target molecule which has an "N" in it. For example, the query "O" should still match the target "N#[N+][O-]" (nitrous oxide).

So if a query feature is "1" then the target feature must also be "1", or the target can be rejected without the isomorphism test; in every other case the isomorphism test is still required. This gives a simple decision tree: all compounds with feature i take one branch, and all which don't take the other. Since I'm assuming that the queries are sampled from the targets, what I want is to end up with the set of features which minimizes the sizes of the nodes at the ends of the decision tree.
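
As a sketch, assuming the precomputed features of a molecule are stored as a Python set of feature ids (the function name is mine), the screen is just a subset test:

def passes_screen(query_features, target_features):
    # Every feature present in the query must also be present in the
    # target, otherwise the target cannot contain the query and can be
    # rejected without the expensive isomorphism test. Passing the
    # screen does not prove a match; the isomorphism test still runs.
    return query_features <= target_features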

Last year I tried a genetic algorithm to find this, but it didn't converge well, and the results weren't pleasing. For example, if I decide that I want to add 32 bits to a fingerprint, then I have to redo the search from scratch and I end up with a completely different set of fingerprints.

A greedy optimization algorithm

I came up with a greedy optimization instead. The base of the decision tree has all of the records in a single node. I find the "most divisive" feature, defined as the feature which best separates the tree into two branches, with about half of the records in one branch and half in the other. Each iteration finds the feature which most evenly splits all of the current nodes. I use the scoring function:

def score(current_leaves, feature):
    # 'current_leaves' is a list of leaf nodes, each a set of record ids;
    # 'feature' is the set of record ids which contain the candidate feature.
    # A smaller score means the feature splits the leaves more evenly.
    score = 0
    for node in current_leaves:
        num_on = len(node & feature)
        num_off = len(node) - num_on
        score += (num_on-num_off)**2
    return score
This does a lot of set intersection tests; about 30 million per iteration in my benchmark. Actually, since all I care about is the number of elements in the intersection, and not the set itself, I wish that Python had a "set.intersection_size()" method.

I use the scoring function to find the most divisive feature. This is the one which generates the smallest score. I use that feature to split all of the current leaves into "haves" and "have nots." As an optimization, I've decided that if there are fewer than 10 elements in a node then I don't care about further splitting that node. That's small enough that I can easily do a subgraph isomorphism test.
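
One iteration of that splitting step looks roughly like this (a simplified sketch using the same sets-of-record-ids representation as score() above; the function name and the exact cutoff handling are mine):

def split_leaves(current_leaves, best_feature, min_size=10):
    # Partition each leaf into the records which have the most divisive
    # feature ("haves") and those which don't ("have nots"). Leaves with
    # fewer than min_size records are small enough to handle with direct
    # subgraph isomorphism tests, so they aren't split any further.
    new_leaves = []
    for node in current_leaves:
        haves = node & best_feature
        have_nots = node - haves
        for part in (haves, have_nots):
            if len(part) >= min_size:
                new_leaves.append(part)
    return new_leaves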

There are a few other optimizations. I ignore duplicate fingerprints, and I ignore features with fewer than 10 records in them. Set intersections are expensive, so I also do a short-circuit evaluation while computing the scoring function and abort the calculation if the current score is ever worse than the current best score. I also sort the nodes so the ones with the largest size go first. This increases the chance of early short-circuiting.

I do some other optimizations, but you'll need to see the code for all the details. Yesterday I found that pypy 1.9's set intersection was about twice as fast as the one in cpython, so that's what I used to run the code.

Preliminary results

I used the same benchmark set (37 MB 7zip compressed) from yesterday's essay, which is a small subset of my full data set. From previous analysis, I know it has high local fingerprint diversity, so I figure it's a good enough start. Here's the subset of output containing the feature keys and the amount of time it took to find each one:

0 Cccc 5.5 seconds
1 CCO 3.6 seconds
2 CN 3.8 seconds
3 C=O 3.8 seconds
4 CCCC 5.1 seconds
5 cN 4.7 seconds
6 Cl 9.0 seconds
7 CC(C)C 10.2 seconds
8 n 8.2 seconds
9 C=C 10.5 seconds
10 cO 10.1 seconds
11 S 11.3 seconds
12 COC 11.7 seconds
13 cc(c)c 24.3 seconds
14 CC(C)O 24.1 seconds
15 O 24.5 seconds
16 CN(C)C 27.4 seconds
17 C1CCCCC1 41.4 seconds
18 CC 36.8 seconds
19 CCC(=O)O 44.1 seconds
20 F 55.2 seconds
21 CCCCCC 52.1 seconds
22 Br 48.2 seconds
23 CCC 52.3 seconds
24 C 51.1 seconds
25 C(CO)O 53.5 seconds
26 CccC 64.2 seconds
27 CCNCC 66.1 seconds
28 P 60.2 seconds
29 CCC(C)C=O 59.6 seconds
30 c1ccccc1 68.4 seconds
31 N 74.9 seconds
32 C1CCCC1 80.9 seconds
33 cCO 85.1 seconds
34 NO 96.5 seconds
35 C(CCO)CO 97.8 seconds
36 C(CN)N 96.4 seconds
37 cc(c)cc(c)c 99.7 seconds
38 CC(C)(C)C 99.2 seconds
39 I 102.6 seconds
40 CcccC 208.8 seconds
41 CCCCCCC 100.7 seconds
42 cc(c)nc 98.0 seconds
43 [Si] 93.7 seconds
44 CNC 95.3 seconds
45 CC(=O)C 94.5 seconds
46 c(cO)O 91.7 seconds
47 CCCCC 89.6 seconds
48 cCc 90.5 seconds
49 CC(C)(C)O 82.6 seconds
50 [Na] 82.6 seconds
51 o 84.3 seconds
52 [As] 85.6 seconds
53 NN 84.6 seconds
54 C(O)O 81.6 seconds
55 c(n)n 83.3 seconds
56 [Se] 80.4 seconds
57 CC=CC 80.1 seconds
58 COCCCOC 79.6 seconds
59 c(cCl)cCl 78.1 seconds
60 C=N 105.0 seconds
61 C#C 59.6 seconds
62 [Cr] 58.5 seconds
63 c(ccN)cN 58.1 seconds
64 CccccC 58.0 seconds
65 CC(C)C(C)C 54.9 seconds
66 ccc 52.7 seconds
67 B 53.3 seconds
68 CCCOCCC 55.1 seconds
69 CccO 51.0 seconds
70 [K] 48.7 seconds
71 CCN(CC)CC 48.6 seconds
72 CCCC=C(C)C 47.4 seconds
73 [Te] 47.8 seconds
74 cc=O 47.4 seconds
75 [Ti] 47.4 seconds
76 O=S(=O)O 47.7 seconds
77 CCC(CC)CO 46.7 seconds
78 [Fe] 46.8 seconds
79 c1ccncc1 46.7 seconds
80 CCOCCOC 52.5 seconds
81 c(ccN)ccN 35.2 seconds
82 [Al] 34.5 seconds
83 CC(C)N 34.2 seconds
84 [Sn] 32.4 seconds
85 c(cCl)Cl 32.4 seconds
86 CSC 31.9 seconds
87 C=CC=C 30.9 seconds
88 c(ccO)ccO 30.3 seconds
89 [Mo] 30.0 seconds
90 s 30.0 seconds
91 C(Cl)Cl 29.3 seconds
92 [Ca] 28.7 seconds
93 CccN 28.3 seconds
94 C(CO)CO 27.3 seconds
95 [W] 26.7 seconds
96 O(P)P 26.8 seconds
97 [Cu] 27.0 seconds
98 CCC(=O)CC 27.5 seconds
99 CCCNCCC 26.8 seconds
100 CO 28.0 seconds
101 [Pb] 19.6 seconds
102 CCOC(=O)C 19.6 seconds
103 [Zn] 18.8 seconds
104 N=N 18.6 seconds
105 [Mg] 18.2 seconds
106 CC(C)CC(C)C 18.0 seconds
107 [Hg] 17.7 seconds
108 C=CCCO 17.2 seconds
109 [V] 17.1 seconds
110 [H] 17.0 seconds
111 cc(c)c(c)c 17.1 seconds
112 c(ccO)cO 16.8 seconds
113 [Zr] 16.5 seconds
114 CC(=O)N 16.5 seconds
115 c(ccCl)cCl 15.9 seconds
116 cc(c)cN 15.8 seconds
117 [Mn] 15.6 seconds
118 COP(=O)OC 15.5 seconds
119 Ccnc 15.4 seconds
120 [Ni] 15.7 seconds
121 CCCCCC=O 12.4 seconds
122 C1CC1 12.1 seconds
123 cOc 11.9 seconds
124 [Co] 11.7 seconds
125 c1cccc1 11.7 seconds
126 [Sb] 11.5 seconds
127 [Pt] 11.5 seconds
128 C(N)N 11.4 seconds
129 O=S 11.0 seconds
130 [Nb] 10.8 seconds
131 C(CCO)CCO 10.8 seconds
132 [U] 10.6 seconds
133 c(cBr)cBr 10.7 seconds
134 CCc 10.6 seconds
135 [Ba] 10.2 seconds
136 CC(C)OC(C)C 10.0 seconds
137 C[Si](C)C 10.1 seconds
138 ccNcc 9.8 seconds
139 [Ru] 9.5 seconds
140 C=CCC=C 9.7 seconds
141 CN(C)(C)C 7.5 seconds
142 SS 7.3 seconds
143 C(F)(F)F 7.2 seconds
144 [Re] 7.0 seconds
145 ccnccn 6.9 seconds
146 CC(C)(C)CCO 6.8 seconds
147 [Cd] 6.7 seconds
148 CCC(C)CC 6.7 seconds
149 CCSCC 6.5 seconds
150 [Ge] 6.5 seconds
151 Cc(c)c(c)c 6.4 seconds
152 [Sr] 6.3 seconds
153 C(COCCO)O 6.2 seconds
154 Ccc(C)cC 6.0 seconds
155 c(c(cCl)Cl)Cl 6.0 seconds
156 OO 6.0 seconds
157 ccnc=O 5.9 seconds
158 C(CCl)Cl 5.8 seconds
159 [Pd] 5.7 seconds
160 c(cO)N 5.8 seconds
161 [Ag] 4.7 seconds
162 C1CCCCCC1 4.7 seconds
163 [Ta] 4.6 seconds
164 [Ce] 4.6 seconds
165 C(CC=O)C=O 4.5 seconds
166 cccCl 4.4 seconds
167 CCC(CC)O 4.3 seconds
168 [Gd] 4.2 seconds
169 [Li] 4.2 seconds
170 cc(c)c(c)O 4.2 seconds
171 C=CCO 4.1 seconds
172 CP 4.1 seconds
173 C#N 4.0 seconds
174 COcc(c)OC 3.9 seconds
175 [Bi] 3.8 seconds
176 c(cN)cS 3.8 seconds
177 [Y] 3.8 seconds
178 CC(C)(C)C=O 3.7 seconds
179 c(ccBr)cBr 3.7 seconds
180 [Yb] 3.8 seconds
181 [In] 3.1 seconds
182 CC(C)CO 3.1 seconds
183 Cc(c)cccO 3.0 seconds
184 COCC(O)O 3.0 seconds
185 [La] 2.9 seconds
186 c(cN)N 2.9 seconds
187 [Eu] 2.9 seconds
188 CCC(CC)CC 2.8 seconds
189 [Ga] 2.8 seconds
190 cCCc 2.8 seconds
191 COC(CO)CO 2.7 seconds
192 [Th] 2.7 seconds
193 C=C(C)C 2.7 seconds
194 c(cF)cF 2.6 seconds
195 C[Si]C 2.6 seconds
196 CncccnC 2.6 seconds
197 cc(c)nc(c)c 2.6 seconds
198 C(CNCCN)N 2.5 seconds
199 [Tl] 2.5 seconds
200 C=C(C)C(C)(C)C 2.5 seconds
201 [Ho] 2.2 seconds
202 [Au] 2.2 seconds
203 CC(C)S 2.2 seconds
204 [Nd] 2.1 seconds
205 cc(cCl)c(c)Cl 2.1 seconds
206 CCC(CC)OC 2.1 seconds
207 c(ccS)ccS 2.1 seconds
208 C(=O)OC=O 2.0 seconds
209 [Be] 2.0 seconds
210 CC(C)c 2.0 seconds
211 [Tb] 2.0 seconds
212 COc 1.9 seconds
213 CCCCO 1.9 seconds
214 c(cBr)Br 1.9 seconds
215 CcccccC 1.8 seconds
216 C(COP)COP 1.8 seconds
217 C(CBr)Br 1.8 seconds
218 [Rh] 1.7 seconds
219 CCCCNC=O 1.7 seconds
220 C(CO)N 1.7 seconds
221 O[Si]O[Si]O 1.5 seconds
222 COcccOC 1.5 seconds
223 CC(C)CC=O 1.4 seconds
224 Cn(c)c(c)c 1.4 seconds
225 [Sm] 1.4 seconds
226 CCC(CO)CO 1.4 seconds
227 C(Cl)(Cl)Cl 1.4 seconds
228 CS 1.3 seconds
229 [Cs] 1.3 seconds
230 cc(c)c(cCl)Cl 1.3 seconds
231 CC1CCC(C)C1 1.3 seconds
232 cc(c)c(c)F 1.2 seconds
233 C(S)S 1.2 seconds
234 cnn 1.2 seconds
235 O=P(O)O 1.1 seconds
236 CCOCOCC 1.1 seconds
237 O=[As]O 1.1 seconds
238 CccccN 1.1 seconds
239 [Er] 1.0 seconds
240 C1CCNCC1 1.0 seconds
241 C(=O)CO 0.7 seconds
242 CC(=C(C)C)C 0.6 seconds
243 [Tm] 0.6 seconds
244 cc1ccncc1 0.6 seconds
245 N=O 0.6 seconds
246 C[Si](C)(C)[Si] 0.6 seconds
247 O=POPOP=O 0.6 seconds
248 CCCOC(C)C 0.6 seconds
249 Ccc(c(C)c)O 0.6 seconds
250 C(CC=O)CC=O 0.6 seconds
251 c(cN=O)cN=O 0.6 seconds
252 c(c(cBr)Br)Br 0.6 seconds
253 C(=CCO)CO 0.6 seconds
254 CCCCCCN 0.5 seconds
255 [Dy] 0.5 seconds
256 c(cCl)OccCl 0.5 seconds
257 CC(C)(F)F 0.5 seconds
258 c(cI)cI 0.5 seconds
259 Cc(c)ccCO 0.5 seconds
260 CCCCC(=O)O 0.5 seconds
261 [Ir] 0.4 seconds
262 c(ccCl)cN 0.4 seconds
263 cc(c)cccs 0.4 seconds
264 cc(c)oc(c)c 0.4 seconds
265 cc(c)c(c)Cl 0.4 seconds
266 COCCC(O)O 0.4 seconds
267 [Pu] 0.4 seconds
268 CCCCOC=O 0.4 seconds
269 cc(c)c(c)S 0.3 seconds
270 c(cn)cncN 0.3 seconds
271 C=CCCC=C 0.3 seconds
272 [Rb] 0.3 seconds
273 CNCCCNC 0.3 seconds
274 C1CC2CCC1C2 0.3 seconds
275 C(=O)ccC(=O)O 0.3 seconds
276 CCCSCCC 0.3 seconds
277 C(C=O)C=O 0.3 seconds
278 CCcccCC 0.3 seconds
279 p 0.3 seconds
280 COc(c)cc=O 0.3 seconds
281 cc1cc(c)sc1 0.2 seconds
282 [Pr] 0.2 seconds
283 CNCO 0.2 seconds
284 CCCC(C)C 0.2 seconds
285 CC[Sn](CC)CC 0.2 seconds
286 C(=O)NC=O 0.2 seconds
287 CC(cc(c)c)O 0.2 seconds
288 c(ccSN)cN 0.2 seconds
289 Ccccc(c)c 0.2 seconds
290 C(C=O)CO 0.2 seconds
291 CCP(CC)CC 0.2 seconds
292 [Hf] 0.2 seconds
293 CN(C)CccO 0.1 seconds
294 O([Si])[Si]O[Si]O[Si] 0.1 seconds
295 CCCCCOC 0.1 seconds
296 CCcccOC 0.1 seconds
297 C1CCC1 0.1 seconds
298 CC(C(C)(C)C)O 0.1 seconds
299 cc1cncc1c 0.1 seconds
300 c1cccccc1 0.1 seconds
301 CCCNcc=O 0.1 seconds
302 CSccc(c)c 0.1 seconds
303 [Os] 0.1 seconds
304 CC(C[Se]C)N 0.1 seconds
305 CCC(CC)Cl 0.1 seconds
306 CCC=CCC 0.1 seconds
307 C(=O)C=O 0.1 seconds
308 CCccCC 0.1 seconds
309 c(cc=O)c=O 0.1 seconds
310 CC(c)c 0.1 seconds
311 CCCOC1CC1 0.1 seconds
312 O=Cl(=O)O 0.1 seconds
313 c(cN(=O)O)cO 0.1 seconds
314 C(CCl)O 0.1 seconds
315 C(C(CO)(O)O)O 0.1 seconds
316 CCC(CC)Br 0.1 seconds
317 cc(cBr)cBr 0.1 seconds
318 C(ccccO)O 0.1 seconds
319 CCCCCCO 0.1 seconds
320 CCCN(C)P 0.1 seconds
321 COCcc(c)c 0.0 seconds
322 C[Si](C)([Si])[Si] 0.0 seconds
323 Cc(c)cc(c)S 0.0 seconds
324 C=C(C)CC(C)c 0.0 seconds
325 CBr 0.0 seconds
326 C(CC(Cl)Cl)CCl 0.0 seconds
327 C(S(=O)=O)S(=O)=O 0.0 seconds
328 C1=CCCCC1 0.0 seconds
329 C1CNCCN1 0.0 seconds
330 C=Cc(c)c(c)c 0.0 seconds
331 CC(c)N(C)Cc 0.0 seconds
332 CCC(=O)OCC 0.0 seconds
333 CcncccF 0.0 seconds
334 O(P)POPOP 0.0 seconds
335 c(c(cF)F)N 0.0 seconds
336 cc(c)cccI 0.0 seconds
337 C(C=O)COC=O 0.0 seconds
338 CCC(CC)N 0.0 seconds
339 CCCC(=O)S 0.0 seconds
340 CC[Pb](CC)CC 0.0 seconds
341 COP(OC)OC 0.0 seconds
342 C[Si](C)(C)C[Si] 0.0 seconds
343 Ccc(C=O)cC 0.0 seconds
344 cc1ccocc1 0.0 seconds
345 ccscc(c)c 0.0 seconds
346 CC(C)(C)COC 0.0 seconds
347 cc(c)cccN 0.0 seconds
348 cc(c)cccO 0.0 seconds
349 C(CO)COC=O 0.0 seconds
350 COc(c)cOc 0.0 seconds
351 [as] 0.0 seconds
352 c(ccCl)ccCl 0.0 seconds
This says that 353 bits are all that's needed in order to put 202,632 unique fingerprints into bins with at most about 10 unique fingerprints in a given bin.

I think that's quite amazing given that most people use fingerprints which are at least twice and usually at least four times that length.

Also, it took 1.75 hours to generate this, and most of the time is spent doing len(set1 & set2).

Is it useful?

I really wish I had a good query set to use for tuning - and another dataset to use for validation. I don't, but Greg Landrum made a synthetic one for testing the RDKit/PostgreSQL cartridge. Last December we worked on optimizing the fingerprint filters. I used an earlier version of this program to generate 54 query bits, which he fused with his hash-based fingerprint. Greg reported:

Tacking your most recent 54 bits, interpreted as SMILES, onto the end
of a shorter version of my new fingerprint gives:
[06:10:40] INFO: FINISHED 50001 (41150823 total, 1325699 searched, 1118055 found) in 80.30
[06:10:40] INFO:   screenout: 0.03, accuracy: 0.84

The original performance for the shorter fingerprint was:
[05:04:23] INFO: FINISHED 50001 (41150823 total, 1693315 searched, 1118055 found) in 90.90
[05:04:23] INFO:   screenout: 0.04, accuracy: 0.66

so those bits definitely help with the screenout.

Putting the bits at the beginning of the fingerprint gets a bit more speed:
[06:14:48] INFO: FINISHED 50001 (41150823 total, 1325699 searched, 1118055 found) in 71.09
[06:14:48] INFO:   screenout: 0.03, accuracy: 0.84
I think a 20% speedup is nothing to sneeze at.

I would still like an inverted index library

The greedy algorithm does a lot of set intersections, partitions sets into subsets, and removes old sets. I feel like this is not what databases are designed for. That's why I'm looking for an inverted index library which lets me manage boolean set operations myself. Then again, I'm not a database person, so I don't know if my belief is true.

I ran my greedy feature selection algorithm until it ran out of data, at only 353 bits. I want to have 1024 substructure bits. This means I need to feed my algorithm a lot more data. That's fine - my test set is only 1/200th of the data I have. However, my implementation running the test set takes up 2 GB of memory on my 12 GB desktop machine, and took 6300 seconds (1h45m) to run.

My concern is that a distributed memory implementation of my algorithm is more complex to write than a shared memory one. The Amazon high-memory instances have 68 GB of RAM. I estimate I need 400 GB with Python sets, and under 40 GB with a more compact bitset implemented in C. This seems eminently doable, which is why I'm looking to see if someone has already done the work.

Based on my searching, and the lack of pointers by others, the answer is "no."



In search of an inverted index library #

I'm looking for a Python library for managing inverted indices. I've been using Python sets, which are nice and fast, but they take up too much memory. Lists use 1/10th the space but are a lot slower and still not compact enough. The most likely solution so far is Xapian. Are there other options I should consider? If so, email me or comment on the Reddit thread I started on this topic.

Let me explain what I want and what I've done so far.

Sparse feature fingerprints

I'm working with chemical fingerprints. That's not unusual for me. What's new is that these fingerprints are not fixed length but are of indefinite length. For each molecule I find all fragments with no more than 7 atoms, and put the fragment into a canonical representation, here a canonical fragment SMILES. Each unique representation gets its own unique integer identifier. The end result is a fingerprint which looks like:

0:4,1:3,5:2,18:2,28:2,124,47298,53119 254846
The original SMILES for this is CCO.CCO.O=[V].Cl. I picked this structure because it has a very small fingerprint.

This fingerprint line says that PubChem record 254846 has 8 substructures ("features"). The features are stored along with a count of the number of times each is found. Here, features '5', '18', and '28' are found in two different places in the molecule, while features '124', '47298', and '53119' are found only once. Also, the feature field and the identifier field are separated by a tab.
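
For the curious, one of these lines can be parsed with something like this (my own helper name; it assumes the tab-separated layout just described):

def parse_fingerprint_line(line):
    # "0:4,1:3,...,53119<tab>254846" -> (254846, {0: 4, 1: 3, ..., 53119: 1})
    feature_field, record_id = line.rstrip("\n").split("\t")
    features = {}
    for term in feature_field.split(","):
        feature, _, count = term.partition(":")
        features[int(feature)] = int(count) if count else 1
    return int(record_id), features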

I kept track of the original canonical description. Here's the breakdown:

        canonical
 bit#  description
 ----  -----------
    0    C
    1    O
    5    CC
   18    CO
   28    CCO
  124    Cl
47298    O=[V]
53119    [V]
With a bit of back and forth you can verify that it's correct.

Overall I have about 25 million records and 2 million canonical values, using a subset of PubChem's 35 million compounds. The average fingerprint contains about 300-400 features, though the count ranges from 1 to about 2,000 features. These are "sparse" fingerprints because the number of features per record divided by the total number of features is very small (350/2,000,000) - small enough that it's more efficient to store the list of feature bits (0, 1, 5, etc.) than it is to store the bit pattern of 0s and 1s.

The features are very long-tail. Carbon ("C") exists in nearly every record, in both aromatic and non-aromatic forms. "O", "cc", "ccc", and "CC" are also very popular. But about 1/2 of the features only exist in one record. Here's more on the fragment distribution pattern.

Why do I want an inverted index?

Feature fingerprints are old news in cheminformatics. One application is to improve substructure search performance. Suppose you are searching for a phenol (structure "c1ccccc1O"). You could do a substructure isomorphism test with every single record, but that's slow. Instead, there are some obvious improvements.

You know that if a record matches then it must have at least 7 atoms and 7 bonds, at least 6 carbons, and at least 1 oxygen. It's easy to precompute this information, so the set labeled "at least 6 carbons" contains the set of all records with at least 6 carbons, and so on. When a query comes in, analyze the structure based on those predetermined features. This ends up with a list of sets, one for each feature. Find the intersection of those sets, and you've reduced your search space, potentially by a lot.

This is exactly when one should use an inverted index.

(There are other optimizations. If the query exists already as one of the precomputed sets then there's no need for any additional substructure search.)

Feature "folding"

Most cheminformatics systems "fold" their sparse bits into a fixed bit length, like 1024 bits. For example, if a feature is assigned the canonical id of "458793484" then some programs will apply modulo 1024 to that to get the folded bit id "524". This saves a lot of space, at the cost of extra subgraph isomorphism search time.

I've been wondering, do I really need to fold the bits? Why can't I keep all of the information around? My thoughts are informed by Managing Gigabytes, which points out that there are efficient ways to encode the inverted index. Can I apply those techniques to this problem?

Working code

The first step is to establish that there is a problem. I'll use a subset of 10 files, which represents about 0.5% of my actual data set, and write some code to handle it. Here's the final code and here's the code with the benchmark data set (37 MB, compressed with 7zip).

The simplest inverted index implementation uses Python sets.

from collections import defaultdict

class InvertedIndex(object):
    def __init__(self):
        # Map from a feature string like "ccO" to an integer feature id
        self.feature_to_id = {}
        # Map from an integer feature id to a set of record ids
        self.inverted_indices = defaultdict(set)

    def get_feature_id(self, feature):
        try:
            return self.feature_to_id[feature]
        except KeyError:
            n = len(self.feature_to_id)
            self.feature_to_id[feature] = n
            return n

    def add_record(self, id, feature_ids):
        for feature_id in feature_ids:
            self.inverted_indices[feature_id].add(id)

    def search(self, features):
        # These are *features*, not *feature_ids*.
        # If the feature wasn't seen before then there are no entries
        # and this will raise a KeyError, meaning nothing found.
        try:
            terms = [self.inverted_indices[self.feature_to_id[feature]] for feature in features]
        except KeyError:
            return set()
        terms.sort(key=len)
        return set.intersection(*terms)
Very simple, and it works pretty well. There's a bit of API trickiness to get the feature id for a given feature. This simplifies the loader and handles integer interning. (Interning is outside the scope of this essay.)

Second is the code to parse the data files and load the index. The file format looks like:

#FPC1
#bit-0=13120 C
#bit-1=12986 c
#bit-2=12978 cc
 ...
#bit-46924=1 COCNCP=O
0:6,1:21,2:21,3:22,4:23,5:2,6,...,36858,36869,37247,37429:4,37431,39942   26350001
0:6,1:21,2:21,3:22,4:23,5:2,6,...,36858,36869,37247,37429:4,37431,39942   26350002
 ...
0:5,1:12,2:12,3:12,4:12,5:3,6:3,...,25998,32031,32086:2,33182,33998,38357    26374982
There's a version line, some header lines starting with "#bit-", and the fingerprint lines. Each fingerprint line uses the sparse fingerprint I mentioned earlier, followed by a tab, followed by the record identifier. (Here it's a PubChem id, which is always an integer.)

Note: the bit definitions are not the same across different files. The header lines show that bit 0 corresponds to the feature "C" for this file (and it exists in 13120 records), and bit 1 is "c" (in 12986 records), but in another file I see that bit 0 is the feature "cc" and in a third it's "ccc(c)CCN". So the parser needs to be smart and map the per-file bit numbers to its own internal fragment identifier.
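
A sketch of that remapping, reusing the get_feature_id() method from the InvertedIndex class above (the helper name and regular expression are mine; the real parser differs):

import re

_bit_line = re.compile(r"#bit-(\d+)=\d+ (.+)")

def read_bit_map(header_lines, index):
    # Build a map from this file's local bit numbers to the index's own
    # global feature ids, so records from different files agree.
    local_to_global = {}
    for line in header_lines:
        m = _bit_line.match(line)
        if m is not None:
            local_bit, feature = int(m.group(1)), m.group(2)
            local_to_global[local_bit] = index.get_feature_id(feature)
    return local_to_global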

The details of the parser aren't important to this essay, and are left to the implementation.

Benchmark

I need a benchmark. I'll load 10 data files (links to the 37 MB benchmark file) to make the inverted indices. I also need some query data. I reused the parser to extract a subset (1%) of the records in a fingerprint file. Actually, two fingerprint files. One timing set reuses one of the data files used to build the inverted indices, which means that every query will match, while the other uses a new data file, which should have many fewer matches.

Again, see the code for full details.

When I run the benchmark I get:

== Sets ==
Load 0 Compound_000000001_000025000.counts.fpc
Load 1 Compound_000025001_000050000.counts.fpc
Load 2 Compound_000050001_000075000.counts.fpc
Load 3 Compound_000075001_000100000.counts.fpc
Load 4 Compound_000100001_000125000.counts.fpc
Load 5 Compound_000125001_000150000.counts.fpc
Load 6 Compound_000150001_000175000.counts.fpc
Load 7 Compound_000175001_000200000.counts.fpc
Load 8 Compound_000200001_000225000.counts.fpc
Load 9 Compound_000225001_000250000.counts.fpc
Loading took 63.0 seconds and 2092896256 memory
100 66575 315.0 per second
200 100909 337.5 per second
all exist: 227 searches took 0.7 seconds (counts 114990) 
100 2323 1161.5 per second
200 3865 1360.9 per second
some exist: 230 searches took 0.2 seconds (counts 4048) 
Some of this is progress information. The important parts are the load time (63.0 seconds), the total amount of memory used in the index (2.1 GB), and the search times of 0.7 and 0.2 seconds for the two test sets. (I use 'psutil' to get the memory usage; note that the exact meaning is OS dependent.) Also, the match counts of "114990" and "4048" are useful checks that each inverted index implementation is correct.

This output shows the problem. I'm using 2.1 GB of memory for 10 data files. I only have 12 GB on this desktop machine, and remember, the full data set is 200 times bigger. Do you have a 500 GB machine I can use?

Use an array instead of a set

Python objects have overhead. For example, the Python integer 300 on my desktop requires 24 bytes, while I know I have about 2**25 records so could easily get away with 4 bytes.

>>> import sys
>>> sys.getsizeof(3)
24
The other 20 bytes handle things like a reference to the Python integer class object and the reference count.

The Python set data type uses a lot of space because it's implemented as a hash, and many of the internal slots are empty in order to improve lookup performance. Quick testing shows that a set uses about 30-40 bytes for each element in the set.

It's not quite as bad as that sounds, because the set elements are just Python references, which are 8 byte pointers on my 64-bit Python. But that's still about 42 bytes of space used to store at most 4 bytes of data. I should be able to do better.

The usual way "to do better" in Python, when storing a list of C-like integers, is to use an 'array.array'. This is similar to a Python list, except that the contents are all of the same type, so there's no per-item overhead to store the type information. (Note: pje described this approach on the Reddit thread.)

The downside is that there's no equivalent to a set.intersection. For now, I'll convert an array into a set, and do the normal set intersection. Here's the code:

# Make sure I can handle 4 byte integers
import array
_test = array.array("I", (0,))
try:
    _test[0] = 2**31+100
    array_code = "I"
except OverflowError:
    array_code = "L"

def make_unsigned_array():
    return array.array(array_code, ())

class InvertedIndexArray(object):
    def __init__(self):
        # Map from a feature string like "ccO" to an integer feature id
        self.feature_to_id = {}
        # Map from an integer feature id to an array of record ids
        self.inverted_indices = defaultdict(make_unsigned_array)


    def get_feature_id(self, feature):
        try:
            return self.feature_to_id[feature]
        except KeyError:
            n = len(self.feature_to_id)
            self.feature_to_id[feature] = n
            return n

    def add_record(self, id, feature_ids):
        for feature_id in feature_ids:
            self.inverted_indices[feature_id].append(id)

    def search(self, features):
        assert features
        try:
            terms = [self.inverted_indices[self.feature_to_id[feature]] for feature in features]
        except KeyError:
            return set()
        terms.sort(key=len)
        # Set conversion is expensive; it's faster to pass an iterable to intersection_update.
        result = set(terms[0])
        for term in terms[1:]:
            result.intersection_update(term)
            # Short-circuit the evaluation when the result set is empty.
            # Gives a 3x speedup for the "some exists" benchmark, with no
            # performance hit for the "all exist" benchmark.
            if not result:
                break
        return result

How well does it work? Here's the benchmark output using the array-based inverted index implementation, trimmed somewhat to make it easier to read:

== Sets ==
Loading took 63.0 seconds and 2092896256 memory
all exist: 227 searches took 0.7 seconds (counts 114990) 
some exist: 230 searches took 0.2 seconds (counts 4048) 
== Arrays ==
Loading took 51.2 seconds and 165552128 memory
all exist: 227 searches took 41.7 seconds (counts 114990) 
some exist: 230 searches took 16.1 seconds (counts 4048) 
The new version takes less than 10% of the memory of the old, which is about what I predicted. It's nice to see the confirmation. It's also a bit faster to build the data structure. But on the other hand, search is over 60x slower. At 0.1 seconds per query, this performance is not acceptable for the types of highly interactive searches I want to do.

There are of course hybrid solutions. This version spends a lot of time converting an array to a set, which is then discarded. I convert the 'C', 'c', and a few other inverted indices a lot, so what I could do is use an LRU cache for the top 1,000 or so sets.
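
Something along these lines is what I have in mind (a rough sketch; the class name and eviction details are mine, and a real version would be wired into search()):

from collections import OrderedDict

class LRUSetCache(object):
    # Keep the set form of the most recently used inverted indices so the
    # expensive array-to-set conversion isn't repeated for common features.
    def __init__(self, maxsize=1000):
        self.maxsize = maxsize
        self._cache = OrderedDict()

    def get_set(self, feature_id, arr):
        try:
            s = self._cache.pop(feature_id)       # cache hit; move to the end
        except KeyError:
            s = set(arr)                          # the expensive conversion
            if len(self._cache) >= self.maxsize:
                self._cache.popitem(last=False)   # evict the least recently used
        self._cache[feature_id] = s
        return s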

But let's get back to my problem. My data set is 200x bigger than the test set, so simple scaling of 166MB says I'll need 33 GB of RAM in the best of cases. I really would like this to work on my desktop, instead of renting some Amazon compute node every time I want to try things out.

Stop the presses! pypy 1.9 just came out

The pypy-1.9 release notice says:

This is good because in previous benchmarking I found that pypy sets were rather slower than cpython's own sets.

I re-ran the benchmark using pypy 1.7 and 1.9. Here are the results, first for 1.7:

== Sets ==
Loading took 278.6s seconds and (psutil not installed) memory
all exist: 227 searches took 1.0 seconds (counts 114990)
some exist: 230 searches took 0.2 seconds (counts 4048)
== Arrays ==
Loading took 32.7s seconds and (psutil not installed) memory
all exist: 227 searches took 299.6 seconds (counts 114990)
some exist: 230 searches took 335.6 seconds (counts 4048)
and then for 1.9:
== Sets ==
Loading took 47.9s seconds and (psutil not installed) memory
all exist: 227 searches took 0.5 seconds (counts 114990)
some exist: 230 searches took 0.1 seconds (counts 4048)
== Arrays ==
Loading took 35.4s seconds and (psutil not installed) memory
all exist: 227 searches took 186.2 seconds (counts 114990)
some exist: 230 searches took 72.9 seconds (counts 4048)
The good news is that 1.9 set operations are faster — 5x faster to load, and 2x faster to search. In fact, the search time is faster than cpython's! The bad news is it looks like pypy's intersection_update isn't yet as well optimized. Also, manual inspection shows that pypy 1.9 uses 2.1 GB of memory, which is about the same as cpython's. pypy on its own isn't the right solution.

Is there an inverted index library for Python?

Inverted indices are used in all sort of search engines, and there are many well-known ways to improve intersection performance and decrease memory use. Some of the ways I know of storing bit sets are:

However, I haven't found good Python bindings for these. (I wrote some Python/Judy bindings years ago. Unfortunately, it's buggy, and I would have to start from scratch to get it working again.)

Another possibility is to store the 'on' bits as a sorted list. Sorted lists are nice because the intersection can be computed with a merge-style scan, in strictly linear time. Other methods are described in Experimental Analysis of a Fast Intersection Algorithm for Sorted Sequences. There was also a recent posting on how to take advantage of SSE instructions for fast sorted list intersection. Since the lists are sorted, it's also possible to compress the data by, for example, storing deltas using something like Elias gamma coding or perhaps an encoder tuned to the data set.
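
The merge itself is only a few lines. Here's a sketch of the idea, untuned, and the galloping and SSE tricks in those references do much better:

def intersect_sorted(a, b):
    # Intersect two sorted lists of record ids in O(len(a) + len(b)) time
    # by walking both lists in step.
    result = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] < b[j]:
            i += 1
        elif a[i] > b[j]:
            j += 1
        else:
            result.append(a[i])
            i += 1
            j += 1
    return result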

Here too I've not been able to find an existing library which handled this for me, much less one with Python bindings.

Search engine components

Actually, I'm wrong. These are available, as part of a larger search engine package like Lucene and Xapian. Lucene is very well known, but I would have to mix the Java runtime and the chemistry code that I'm using from C++. Doable, but it feels like too much weight.

Instead, I downloaded, compiled, and installed Xapian and its Python bindings and wrote an inverted index based on its API. (Justinsaccount was the first to suggest this in the Reddit thread.)

Xapian uses the filesystem to store the database, so this version uses a lot less local memory than the previous two inverted indices. I wonder how much of the data set is stored in memory vs. how much is read from disk. I imagine there are tuning variables, and it looks like there's an in-memory option I could also try. I am quite the newbie with this library.

Here's my implementation:

try:
    import xapian
except ImportError:
    xapian = None

class InvertedIndexXapian(object):
    def __init__(self, dbname):
        self.db = xapian.WritableDatabase(dbname, xapian.DB_CREATE_OR_OPEN)
        
    def get_feature_id(self, feature):
        return feature
    
    def add_record(self, id, feature_ids):
        try:
            doc = self.db.get_document(id)
        except xapian.DocNotFoundError:
            doc = xapian.Document()
        for feature_id in feature_ids:
            doc.add_boolean_term(feature_id)
        self.db.replace_document(id, doc)

    def search(self, features):
        enquire = xapian.Enquire(self.db)
        query = xapian.Query(xapian.Query.OP_AND, features)
        enquire.set_query(query)
        matches = enquire.get_mset(0, 200000)
        return matches
and when I run the benchmark I get this timing information:
Loading took 349.9 seconds and 46465024 memory
all exist: 227 searches took 15.7 seconds (counts 114990) 
some exist: 230 searches took 20.7 seconds (counts 4048) 
Queries are about 3-12x slower than the cpython implementation, and 5-20x slower than pypy. But on the other hand, this uses only 90 MB of RAM, although it does build a 340 MB data file. Simple scaling suggests that I'll need 65 GB of disk space for the entire data set. That's quite doable.

The Xapian load time is a lot slower, but as the data is persistent, reload time from a built database is trivially fast. The documentation says I can set XAPIAN_FLUSH_THRESHOLD to a larger value for better indexing performance. Are there other tuning variables I should consider?

Xapian might be the way to go. Value ranges might be useful when I think about how to handle fragment counts. I'll need to do more evaluation. For example, I don't like that it's an order of magnitude slower than Python sets. After all, Python sets aren't optimized for this task, so I expected faster search times.

Sadly, there isn't much documentation about Xapian - and nowhere near what there is for Lucene. (For example, there appears to be no page which describes the tuning environment variables. There's XAPIAN_MAX_CHANGESETS and XAPIAN_FLUSH_THRESHOLD ... are these or something else relevant for what I want?)

I still want a good inverted index library for Python

Xapian might work well for this project, but there are other projects I work on which use inverted indices. In one project I try to optimize the ~1024 fixed patterns used in a traditional feature-based substructure screen. Both my greedy algorithm and my genetic algorithm for this used a lot of temporary intersections to evaluate a given selection. I'm not sure how well it would work on top of a persistent database.

If you know of a good library, please email me or comment on the Reddit thread I started on the topic.



Testing hard algorithms #

I developed software to find a maximum common subgraph (MCS) given a set of molecules represented as a chemical graph. It's called fmcs. My previous three essays were about the background of the MCS problem, introducing fmcs, and an example of when MCS is used.

What I didn't describe was the mental effort it took to develop this program. This is the second time I've written code to find the multiple structure MCS, and both times it took a couple of months and put my head in a very strange place. You would think the second time is easier, but it means that I spent more time adding features and doing things that my first version couldn't begin to handle. (Did someone just mutter "second system effect"? Pshaw!)

This time too I had a better understanding of the development process. I think I know why it's so much harder than most of the software I develop.

In this essay, I reflect on some of the reasons why this is a hard problem to test and I consider how unit tests, or any other incremental-based testing approach, are not well-suited to a certain class of complex algorithm development. While unit tests can provide a basic sanity check during development, they fail to provide the test coverage which one might expect from applying them to algorithmically simpler modules. Further, development methodologies built primarily on unit tests, like test-first style test-driven development, aren't that applicable. Instead, other methods, like system testing and a deep understanding of the algorithm and possible problems, are required. This strengthens my view, explored in Problems with TDD, that "TDD is a weak method for developing those tests of confidence."

How would you implement an MCS algorithm?

Think about the problem for a moment. How would you write an algorithm to find the largest common subgraph between two graphs?

Take your time. This really is a hard problem. I dug up some of the early papers from the Journal of Chemical Documentation. People ended up simplifying the problem by, for example, not supporting rings.

Finished thinking?

You probably came up with a graph walking algorithm which tries different match pairings and uses backtracking or lazy evaluation to search all of graph space. The more mathematically inclined might have converted this into a maximum clique algorithm.

In both cases there are a lot of arbitrary decisions to make. Which pairings should you investigate first? Are there times when you can prune the search space? Did you make any assumptions about the nature of the chemical graph (e.g., that it's topologically a planar graph)?

How would you test your MCS algorithm?

Back in the late 1990s, I almost never wrote automated tests, and never wrote an extensive test suite. Nowadays I'm pretty thorough, and use coverage-guided techniques to get good, extensive tests through some semi-permanent API layer that will allow me to refactor the internals without changing the tests.

I couldn't do that here.

There are a lot of heuristics, and some of them are only triggered under unusual circumstances, after a bunch of combinatorial possibilities. Outside of a few minor components, I couldn't figure out how to write unit tests for the code.

I ended up pushing most of the testing into validation testing of the complete system, which meant I wrote some 1,000+ lines of code without strong testing. Moreover, I used a new approach to the MCS problem, so the algorithm I was working on doesn't have the track record of the standard clique-based or backtracking approaches.

So stress factors included not knowing if the algorithm would work, and not being able to develop enough test cases to provide good validation during development.

As a rule of thumb, it's easiest to fix bugs which are caught early. Unit tests and evolutionary prototyping are two of the techniques that people use to tighten the feedback loop between specification, implementation, and validation. I think another stress factor is proportional to the size of the feedback loop.

What testing did I do?

I mean, I did have tests during development. I came up with a few examples by hand, and I did a substructure search of a large data set and verified that the MCS code found that substructure, but I know that's not enough tests. I know this because after six weeks of development and over 1000 lines of code, I spent another three weeks doing a large amount of post-development testing, and found several bugs. In the process of writing this essay I also found that four days of that development work ended up making things slower, so I'll have to remove it.

I did most of my tests based on the ChEMBL data: 10,000 random pairs of structures, the k=2, k=10, and k=100 nearest neighbors, and the k<=100 neighbors with similarities of at least 0.95, 0.9, and 0.8. I also did, at the end, tests based on the ChEBI structure ontology. There were easily 20,000 different machine-generated test cases, although in most cases I didn't know the expected results beforehand.

What bugs did I find?

I think it's educational to characterize a few of them.

Typo caught by an assertion check

One of the simplest bugs was a poorly formatted string. I used "%d" when I should have used "%%%d". A bit of jargon for those in the know: I generate SMARTS strings for the intermediate subgraphs. If there are more than 9 open rings then the syntax goes from a single-digit closure number to a closure number like "%10". I forgot to include the '%' for that case, probably because the '%' was already there for the format string.
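
To make it concrete, the helper is essentially this (a simplified, hypothetical version; the real function also takes the string being built, and the name here is mine):

def closure_symbol(closure):
    # Single-digit ring closures appear directly in the SMARTS, while
    # closures of 10 or more need the "%" prefix, as in "%10". The bug
    # was using "%d" for both branches.
    if closure < 10:
        return "%d" % closure
    return "%%%d" % closure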

This wasn't triggered by my random-pairs test or by my various similarity-search based tests. Only when I ran through the ChEBI data did I get an assertion failure, when RDKit refused to accept my SMARTS string. That was the first time I had a SMARTS with 10 unclosed rings.

As it happens, this error could have been caught by unit testing, as some people practice it. It's a four line function which takes a string and a number. Testing it is trivial. I didn't test it because I feel that testing directly against internal functions inhibits refactoring. I prefer to test against higher-level, more stable APIs.

My view is that it's usually easy to come up with high-level test cases which trigger a certain function call. But not in this case. The MCS search algorithm, while deterministic, uses so many arbitrarily defined values that I couldn't on my own come up with a test case with at least 10 open ring closures. And even if I did, a new search heuristic might change things so that only, say, 7 open ring closures were needed.

I felt that my system testing and the assertion check would be enough to identify if there was a problem, and they did. A low-level unit test might have helped, especially as I still don't have a specific test for that failure.

I think the right thing to do is add that failure as a specific test case, and use monkey-patching to insert wrong code for a repeat of that test case. The first one tests that the code is correct, and the second tests that the test case is still exercising the code under test.

Cross-validation testing

The first MCS papers are about as old as I am. Many people have written implementations, although relatively few are available to me both at no cost and with no prohibitions on using them to develop a new MCS algorithm. (Some commercial companies don't like people using their software to write new software which competes with it, or even to use their software for benchmark comparisons.)

I tested pairs of structures against SMSD, and I did more extensive tests against Indigo's MCS implementation.

This is cross-validation testing. It's a relatively rare technique because the cost of producing multiple identical implementations usually isn't worth the benefits. Even here the results aren't exactly identical because of differences in how the toolkits perceive chemistry, and more specifically, aromaticity. I ended up spending a lot of investigation time staring at cases with different answers and trying to figure out if it was a chemistry problem or an MCS algorithm implementation problem.

I found that SMSD had a bug in one of its options, which I reported back to the author. The code had been fixed internally but not pushed to the outside world. Its default mode and fmcs matched quite well, except for a couple of chemistry differences. The new version is out now - I need to test it again.

The only problem I found in the Indigo code was a part of their setup code which didn't check the timeout. That's also been fixed after I reported it.

The cross-validation with Indigo found problems in my code. For example, I was often getting smaller MCSes than Indigo. After looking at them, I figured out that my code didn't correctly handle the case when a molecule was fragmented after a bond was removed because its bond type wasn't in all of the structures.

Why didn't my hand-written test cases find it? None of them had a case where there was a bond in the "middle" of a structure chosen as the reference structure, and where the MCS was not in the first fragment.

My code usually got the right answer when using highly similar structures, for obvious reasons. It was only the random pair testing where the problem really stood out.

Could I create a simple unit test for this error? Perhaps, but it's not easy. I don't know which of the two fragments will be first - it depends on so many arbitrary decisions which could change as the algorithm is improved. The only test I can think of for this was to generate a diverse set of tests, make sure some fail if the code isn't implemented correctly, record the results, and make sure that future tests never find a worse (or better?) test case.

Bad heuristic to determine the maximum possible subgraph growth

Most of the MCS implementation is heuristics. There's a branch and bound search, there's subgraph canonicalization to reduce the number of substructure tests, and so on. Each of these is supposed to help make the code faster.

One of the tests takes the current subgraph and the list of "outgoing" bonds and determines how much of the remainder of the molecule is accessible for growth from the current subgraph. The rest of the molecule might not be accessible because it's in another fragment, but it also might not be accessible due to an earlier decision to exclude certain parts of the molecule from future growth. (My algorithm tests all subgraphs which include a given bond, then all subgraphs which don't include that bond.)
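
In outline, the check is a flood fill outward from the outgoing bonds, and it must not walk back into the current subgraph nor across bonds which were already excluded from consideration. The following is my sketch of the idea, not the actual fmcs code; the molecule is a dict mapping each atom index to a list of (bond index, neighboring atom index) pairs.

def accessible_bond_count(graph, subgraph_atoms, outgoing_bonds, excluded_bonds):
    # Upper bound on how many more bonds the current subgraph could gain.
    seen_atoms = set(subgraph_atoms)
    seen_bonds = set(excluded_bonds)   # forgetting the excluded bonds was my bug
    stack = list(outgoing_bonds)       # (bond index, atom just outside the subgraph)
    count = 0
    while stack:
        bond, atom = stack.pop()
        if bond in seen_bonds:
            continue
        seen_bonds.add(bond)
        count += 1
        if atom in seen_atoms:
            continue
        seen_atoms.add(atom)
        for next_bond, next_atom in graph[atom]:
            if next_bond not in seen_bonds:
                stack.append((next_bond, next_atom))
    return count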

It took a couple of days to think of the algorithm, write the code, and get it working. I then ran the timing tests, only to find that it was 1% faster and gave the same answers.

Does that mean my code worked? I only had a suspicion that the new algorithm should be faster. Perhaps the overhead of searching for the accessible atoms was too costly?

After looking at my implementation - a lot - I finally realized that I had told the algorithm to exclude the current subgraph from consideration for growth but had forgotten to also exclude the set of previously excluded bonds. With two changed lines, the overall performance doubled for the random-pairs case. Most of the time it's faster and sometimes it's slower, so it takes an aggregate of tests to measure this correctly.

I don't think this heuristic could have been written as a unit test. No, I take that back. It could have, but it would have required some careful thought to set up. Not only was this part of the code not "built for test", but setting up the right conditions requires a mental understanding of the problem which I know I didn't have when I was doing the development.

BTW, I am not saying that unit tests can't be used to measure performance. Some algorithms have simple timing characteristics where it's easy to say that the code must complete by a given time, or that it must be at least twice as fast as a reference implementation. Occasionally these tests will hiccup due to unusual loads on the test machine, but not usually enough to be a problem.

Indeed, the fmcs code has some unit tests for the timeout code. I found a pair of structures which takes over 10 seconds to find the MCS, set the timeout to 0.1 seconds, and assert that no more than 0.5 seconds elapsed during the function call. (I don't require high precision for this code.) I do worry that future changes to the MCS search code might speed things up by a factor of 20. On the other hand, the effect of broken timeout code is very obvious when running the validation suite, so I'm not going to worry about it.
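
The test itself is only a few lines. Roughly (the structure pair and the find_mcs() entry point are placeholders here, not the literal test code):

import time
import unittest

class TestTimeout(unittest.TestCase):
    def test_timeout_is_respected(self):
        # "hard_mol1"/"hard_mol2" stand for a pair of structures known to
        # need more than 10 seconds, and "find_mcs" for the real MCS entry
        # point with its timeout option.
        start = time.time()
        find_mcs([hard_mol1, hard_mol2], timeout=0.1)
        elapsed = time.time() - start
        # Low precision on purpose: 0.1 s requested, at most 0.5 s allowed.
        self.assertTrue(elapsed < 0.5, "timeout ignored: %.2f seconds" % elapsed)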

As is widely acknowledged, this stretches the idea of a "unit test." Unit tests are supposed to be fast; preferably several hundred or more per second. The goal is that these tests run often, perhaps even after every save. Stick a few 0.1 second timeouts into the suite and everything bogs down, which discourages people from running the tests so often.

But let's go back to this new code. On average, over a large number of tests, the performance is 50% faster. What's a good test case? Can I pick one structure or a small set of structures? Does the speedup occur only when there are large molecules? Only for molecules with lots of internal symmetry? Only for those which are easily fragmented?

Only now, when writing this essay, did I find good test cases. With a 10 second timeout on the k=10 nearest neighbors tests, I get 8 timeouts in the first 32 records with the old algorithm, and only 1 timeout with the new algorithm. The time for the very first test goes from over a minute (I finally gave up) with the old algorithm to 0.08 seconds with the new one.

Obviously (in retrospect) the right solution is to pull out a couple of those from my validation suite and turn them into test cases.

I never would have come up with these test cases before implementing the new code - up until ten minutes ago I thought that the new code only gave a factor-of-two speedup, not possible factors of 1,000!

Canonicalization: an in-depth analysis

I'm not finished with testing. I haven't fully characterized all the parts of the implementation. For that matter, there are heuristics I can still add. Here I'll describe what I did to analyze subgraph SMARTS canonicalization. I conclude that while it works, it slows down the code, and that several days of development work ought to be removed.

My MCS algorithm enumerates subgraphs of one structure, converts each into a SMARTS string, and tests if the pattern exists in the other structures.

There are many alternate ways to make a SMARTS string from a subgraph. Given the linear chain C O N, there are the reasonable CON or NOC variations, the more confusing O(C)N and O(N)C, and crazy variations like C1.N1O and O%987.N7.C%98. It's easy to generate the non-insane versions by building a spanning tree. The question is where to start the search and how to handle branches.

I believe that many duplicate patterns are generated during a search. For example, a benzene ring will generate many "ccc" queries. I can minimize the number of substructure matches by caching earlier SMARTS match results. But caching doesn't work if sometimes I get a CON and sometimes I get a NOC. The solution is to generate a "canonical" SMARTS for the subgraph - a unique representation for all of the variations.
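
The cache itself is nothing more than a dictionary keyed by the SMARTS string; here's a minimal sketch using RDKit for the actual substructure tests. It also shows why the key matters: a CON and an NOC for the same subgraph land in different cache slots unless the SMARTS is canonical.

from rdkit import Chem

_match_cache = {}

def matches_all_targets(smarts, targets):
    # "targets" is the list of already-parsed RDKit molecules to test against.
    try:
        return _match_cache[smarts]
    except KeyError:
        pass
    pattern = Chem.MolFromSmarts(smarts)
    result = all(mol.HasSubstructMatch(pattern) for mol in targets)
    _match_cache[smarts] = result
    return result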

I implemented the CANGEN algorithm from "SMILES. 2. Algorithm for Generation of Unique SMILES Notation" by Weininger, Weininger, and Weininger, J. Chem. Inf. Comput. Sci., 1989, 29 (2), pp 97-101, in order to choose a canonical SMARTS. (But see "Assigning Unique Keys to Chemical Compounds for Data Integration: Some Interesting Counter Examples" by Neglur, Grossman, and Liu, 2nd International Workshop on Data Integration in the Life Sciences (DILS 2005), La Jolla.)

Canonicalization is notoriously hard to get right. It too has a lot of tricky corner cases. For example, in the late 1990s Daylight tracked down a reported bug in their canonicalization implementation. The algorithm requires a stable sort, but they had used an unstable one. Their own internal testing never found the problem - it was a customer who found that their Solaris and IRIX boxes occasionally returned different values.

There could be similar bugs in my canonicalization algorithm. I don't know. The usual solution is to generate large numbers of test cases, randomize the order of the atoms and bonds, and verify that all of them give the same result. Here too it may be that only certain rare topologies, or rare bond patterns, trigger an error. I'm not convinced that I could come up with the right set of test cases for it.
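
For what it's worth, the randomized check itself is easy to write. Here's a minimal sketch which assumes RDKit's RenumberAtoms and uses canonical_smarts() as a stand-in for my own canonicalization routine (it is not an RDKit API):

import random
from rdkit import Chem

def check_canonicalization(smiles, canonical_smarts, n_trials=50):
    # Renumber the atoms many times and verify that the canonicalization
    # routine under test always produces the same string.
    mol = Chem.MolFromSmiles(smiles)
    reference = canonical_smarts(mol)
    order = list(range(mol.GetNumAtoms()))
    for _ in range(n_trials):
        random.shuffle(order)
        shuffled = Chem.RenumberAtoms(mol, order)
        if canonical_smarts(shuffled) != reference:
            return False
    return True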

What makes it harder is that the underlying search algorithm doesn't need a canonical SMARTS, so a canonicalization failure never shows up in the output. I could have an infrequent bug and not notice it! Canonicalization is only there for performance reasons, and my implementation is in Python while the substructure matching is in C++, so it might even be that the canonicalization takes more time than it saves.

I can make that a bit more nuanced. Canonical SMARTS generation has three phases: 1) initial ranking, 2) canonicalization/tie-breaking, and 3) SMARTS generation. I can disable the canonicalization step and get a "semi-canonical" SMARTS (my initial ranking is more like a circular fingerprint of diameter 3 than the atom characteristic from Weininger et al.). I can also disable caching. Both of these are doable with only a couple of changes to the code.

This leads to three cases: canonical SMARTS with caching, semi-canonical SMARTS with caching, and semi-canonical SMARTS without caching. (There should be a fourth case, arbitrary SMARTS, but that requires more extensive changes.)

What I want to understand most is the effect of canonicalization on my code. Again, I have to resort to aggregate timings and statistics across a set of benchmarks. I used the ChEMBL-13 data set and generated various benchmarks. One is based on computing the MCS between 10,000 pairs selected at random, another contains the k=2, k=10, and k=100 nearest neighbors of randomly chosen fingerprints (timing 500 data sets each time), and the last is the total of 500 tests of the k<=100 compounds with Tanimoto score of at least 0.95 to randomly selected fingerprints. My results (times are reproducible to within a few percent) include the number of unique SMARTS (canonical or non-canonical) generated and the total number of substructure tests ("SS") which were carried out.

Random pair timings | # unique SMARTS | # SS tests
no cache       | Total: 10000/467.3s (21.4/s)  Complete: 9997/437.3s (22.9/s)  Incomplete: 3/30.0s | 921666 | 1339292
semi-canonical | Total: 10000/421.3s (23.7/s)  Complete: 9997/391.3s (25.6/s)  Incomplete: 3/30.0s | 921666 | 921666
canonical      | Total: 10000/442.1s (22.6/s)  Complete: 9997/412.0s (24.3/s)  Incomplete: 3/30.0s | 709636 | 709636

k=2 nearest neighbors | # unique SMARTS | # SS tests
no cache       | Total: 500/287.3s (1.7/s)  Complete: 484/127.1s (3.8/s)  Incomplete: 16/160.2s | 264057 | 320238
semi-canonical | Total: 500/276.7s (1.8/s)  Complete: 484/116.6s (4.2/s)  Incomplete: 16/160.2s | 264057 | 264057
canonical      | Total: 500/298.6s (1.7/s)  Complete: 483/128.4s (3.8/s)  Incomplete: 17/170.2s | 186682 | 186682

k=10 nearest neighbors | # unique SMARTS | # SS tests
no cache       | Total: 500/520.5s (1.0/s)  Complete: 471/230.0s (2.0/s)  Incomplete: 29/290.5s | 447648 | 3525563
semi-canonical | Total: 500/490.1s (1.0/s)  Complete: 472/209.5s (2.3/s)  Incomplete: 28/280.6s | 447648 | 2872040
canonical      | Total: 500/509.3s (1.0/s)  Complete: 471/218.7s (2.2/s)  Incomplete: 29/290.6s | 330010 | 2109332

k=100 nearest neighbors | # unique SMARTS | # SS tests
no cache       | Total: 500/414.4s (1.2/s)  Complete: 486/271.4s (1.8/s)  Incomplete: 14/143.0s | 128781 | 9932263
semi-canonical | Total: 500/363.5s (1.4/s)  Complete: 487/230.8s (2.1/s)  Incomplete: 13/132.7s | 128781 | 7456877
canonical      | Total: 500/361.8s (1.4/s)  Complete: 488/239.5s (2.0/s)  Incomplete: 12/122.4s | 107648 | 6196764

k<=100 at or above Tanimoto threshold of 0.95 | # unique SMARTS | # SS tests
no cache       | Total: 500/642.1s (0.8/s)  Complete: 458/220.3s (2.1/s)  Incomplete: 42/421.9s | 468661 | 4388074
semi-canonical | Total: 500/624.5s (0.8/s)  Complete: 461/232.8s (2.0/s)  Incomplete: 39/391.7s | 468661 | 3828813
canonical      | Total: 500/640.9s (0.8/s)  Complete: 460/239.2s (1.9/s)  Incomplete: 40/401.8s | 364802 | 3222366

Whew! That's a lot of data to throw at you. Please believe me when I say that it took two days to generate correctly. BTW, "complete" means that the MCS search algorithm went to completion, the "incomplete" means that it timed out - here after 10 seconds - and gave a partial solution.

What does this tell me? Obviously, caching is good. As expected, the "semi-canonical" solution is always better than the "no cache" case, and the canonicalization always reduces the number of substructures and substructure tests. This means the canonicalization code is doing something right.

Unfortunately, it seems like canonicalization has a big performance impact. In the pairwise tests I do 922,000 canonicalizations in Python to save 212,000 substructure tests, and I lose 21 seconds in doing so. For the k=100 nearest neighbor benchmark, which is the only one where the canonicalization code saves time, I do 129,000 canonicalizations to save 1,260,000 comparisons, and gain about 2 seconds. This suggests that each canonicalization takes as long as 10 substructure matches, and that RDKit can do 60,000 SMARTS matches per second. A quick check using one SMARTS gives 100,000 SMARTS matches per second, which helps validate this estimate.
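
That quick check was nothing fancier than timing one precompiled pattern against a batch of parsed structures, along these lines (any SMARTS and structure list will do; the rate depends strongly on the pattern and the structures):

import time
from rdkit import Chem

def smarts_matches_per_second(smarts, smiles_list):
    # How many HasSubstructMatch() calls per second can RDKit do for
    # one precompiled SMARTS pattern?
    pattern = Chem.MolFromSmarts(smarts)
    mols = [Chem.MolFromSmiles(s) for s in smiles_list]
    start = time.time()
    for mol in mols:
        mol.HasSubstructMatch(pattern)
    return len(mols) / (time.time() - start)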

This means I should take out the canonicalization until it can be implemented in C++. Granted, there may be a bug in the code, like there was with an earlier heuristic. But there would have to be an order of magnitude performance increase for this to be effective, and I don't think that's likely.

Unit tests aren't enough to drive development

Which leads me back to my thesis. It would have been possible to develop a few more unit tests for the canonicalization code. There would have been some extra scaffolding in order to do that, but that's a minor cost. I suspect that many of the tests would be very implementation-dependent, and tied to specific internal function calls. I don't like this because it means that a re-implementation of this internal component in C++, which submerges many of the internal helper functions and places them out of reach of Python-based unit tests, would cause the unit tests to be un-runnable. And I am certain that those unit tests would not be rigorous enough to be confident that the code was working correctly.

Think about the question "should I write this code in Python or C++?". It's a development-time question. Test Driven Development (TDD) is a development methodology which uses unit tests to help make development-time decisions. I think TDD is most helpful when used to establish the minimum requirements for a project.

I don't see how TDD is helpful here. I can't think of any unit tests which would guide this development decision. Sometimes you can write a "spike solution" (also called a "prototype"):

A spike solution is a very simple program to explore potential solutions. Build the spike to only address the problem under examination and ignore all other concerns.

This sounds good, but what development style should you use to develop the spike solution? Spikes are supposed to be throw-away code, but I can't figure out how anything other than a complete, tested implementation of the canonicalization algorithm or the remaining-growth heuristic would have been useful. After all, a two line change in the latter went from "working but 1% faster" to "working and 50% faster", and I didn't even know if it was going to give a speedup in the first place.

If not TDD, what methodology do I use?

Read Peter Seibel's Unit testing in Coders at Work. The author interviewed various famous programmers and learned more about how they tested their software. You may also want to read the related comments on Hacker News.

Seibel recounts that Joshua Bloch "wrote a monstrous 'basher'" which found "that occasionally, just occasionally, the basher would fail its consistency check." Bloch then wrote tests to pin down the failure, eventually finding the fault in a system mutex. Note that under normal unit test development you assume that your underlying components are correct, so this isn't something you would normally code for.

Seibel also reports that others assemble a large number of test cases and use them to check their code. This is of course the technique I used for the MCS problem.

It feels as trite as saying that the secret to losing weight is diet and exercise, but my method for programming is to understand the problem, implement it, and test it until I'm confident that it's good enough. The knowledge of how to do that comes from experience, practice, reflection, and discussion.

For those who scoff and call this "big design up front," I merely point you to earlier parts of this essay. When I started this code I had a rough idea that it would work. I had previously implemented substructure enumeration, the Weininger et al. canonicalization algorithm, and SMARTS generation, so I knew that the components were possible, but I didn't know how they went together. Instead, I let the code development itself guide me. I thought some, I implemented code, I "ran the code in my head", I reflected on what the tricky parts were, I thought about how to make them more clear, and I tried various ways to improve the code.

That was a lot to keep in my head, I wasn't sure if the result would be fast enough, and I had no good way to test its usefulness until most of the code was in place. No wonder it was mentally taxing!

I believe tests - including unit tests implemented during development - are important. I don't believe that unit tests are good enough to guide complex algorithm development. However, most programming (over 95%?) is not complex algorithm development; complicated? yes, but not complex.

I once jokingly said that TDD is not useful if there are two or more nested for loops. That's a bit of a simplification, but not far from my feelings. Some classes of problems, like complex algorithm development and security analysis, require a different attitude towards programming, emphasizing "what can go wrong?" and "how can I improve my confidence that the code is working correctly?" It's my view that this doubt-based philosophical attitude is missing from most discussions of software development practices.

And perhaps as fundamental, when I have to be that critical of my own work and probe it for failures and be open to the possibility of nasty gotchas lurking in the deep dark corners, then some of that doubt passes over into my personal life. I empathize with the idea of not living in that "strange place" all the time, and I can see why this isn't a common attitude.

Comments

This essay is meant to be a thoughtful reflection on the difficulties of programming, using the MCS problem to provide specific structure. If you want to leave comments about the MCS portion then use the MCS comment site. Otherwise, leave a comment on testing hard algorithms.



Topologically non-planar compounds #

John MacCuish, from Mesa Analytics, pointed out that the MCS problem takes polynomial time if the graphs are planar. He writes: "Are there graphs in your likely sets that are not planar? I have never seen a non-planar small molecule drug for example, but they may be out there. Tests for planarity are also in P. Of course it doesn't mean that solutions in P will be faster than the usual methods since N is small and the overhead for planar MCS may be large, such that N may need to be large for the planar method to beat the non-planar heuristics."

This leads to a couple of questions. One is: what is the shape of the run-time of my MCS algorithm? I don't actually know. Some tests take a very long time, but that's not enough to establish whether it's polynomial or exponential for real-world compounds. I'm not going to research this now.

Another is: are there real-world small molecules which are non-planar in the topological sense, rather than the chemistry sense, where all of the non-hydrogen atoms lie on or near a plane?

Previous work in topologically non-planar compounds

A quick literature search finds "Synthesis of the first topologically non-planar molecule", which says:

We report here the synthesis and characterization of the tris-ether 2,5,14-trioxahexacyclo-[5.5.2.1.24,10.O4,17.O10,17]-heptadecane 3; this topologically unique (graph theory) molecule is prepared via a novel intramolecular rearrangement of either of two isomeric propellane spiro-epoxides, 1 and 2.

There's also "Topological stereochemistry. 9. Synthesis and cutting 'in-half' of a molecular Möbius strip" by Walba, Homan, Richards, and Haltiwanger, New J. Chem., 1993, 17, 661-681.

Quoting from the above link to Modern Physical Organic Chemistry by Anslyn and Dougherty:

We mention briefly here another topological issue that has fascinated chemists. For the overwhelming majority of organic molecules, we can draw a two-dimensional representation with no bonds crossing each other. ... It may seem surprising, but most molecules have planar graphs.

Recent efforts have produced chemical structures that successfully realize many interesting and novel topologies. A landmark was certainly the synthesis of a trefoil knot using Sauvage's Cu+/phenanthroline templating strategy.... Vögtle and co-workers have described an "all organic" approach to amide-containing trefoil knots, and have been able to separate the two enantiomeric knots using chiral chromatography. Another seminal advance in the field was the synthesis and characterization of a 'Möbius strip' molecule..."

So it's well-established that there are topologically non-planar structures. But are they in compound databases which I can access?

Searching PubChem for topologically non-planar compounds

Take a look at "Scaffold Topologies II: Analysis of Chemical Databases" by Wester, Pollock, Coutsias, Allu, Muresan and Oprea, J. Chem. Inf. Model. 2008, 48(7), 1311-1324 (available in final edited form through PubMed Central since 15 January 2010).

Only 12 nonplanar and 2,099 spiro node topologies (all of which are planar) are present in the merged database. 9 of the nonplanar topologies are found only in PubChem and the total number of molecules represented by such topologies in the merged database is a mere 44, agreeing with Walba's assessment concerning the rarity of chemicals with nonplanar graphs.

This establishes that in 2008 there were no more than 44 topologically non-planar structures in PubChem. Can I find them?

I don't think any of the database search engines support this capability, so I need to write a program.

For that I need a method for planarity testing. Various sources say that the linear-time algorithm is "widely regarded as being quite complex", but one of the linear-time planarity algorithms is part of Sage.

Sage is a comprehensive mathematical software system, which uses Python. There's a pre-built binary distribution for my OS, which I downloaded. It includes Python, the IPython shell, and everything else it needs, so it really is a system and not a set of Python modules.

With Sage it's easy to make a graph and test for planarity.

% sage
----------------------------------------------------------------------
| Sage Version 5.0, Release Date: 2012-05-14                         |
| Type notebook() for the GUI, and license() for information.        |
----------------------------------------------------------------------
sage: d = {1: [2,3], 2: [3], 3: [4, 6], 4: [5]}
sage: g = Graph(d)
sage: g.is_planar()
True
sage: 
sage: k3_3 = {1: [2,3,4,5,6], 2: [3,4,5,6], 3: [4, 5, 6]}
sage: Graph(k3_3).is_planar()
False
sage: 

The dictionary I use as input to "Graph" contains the upper triangle of the connection table: each atom lists only the bonded atoms with a higher index. So to test if a molecule is topologically planar, I just need to convert its connectivity information into a dictionary of the right form, turn the dictionary into a Graph, and call is_planar() on the graph.

Roadblock! My Python 2.6 modules and Sage's 2.7 Python don't mix

The prebuilt binary distribution includes its own Python, and when you run "sage" it replaces PYTHONPATH with its own settings. This means that when I run sage I don't have access to the cheminformatics tools I've already set up for my system:

% sage
----------------------------------------------------------------------
| Sage Version 5.0, Release Date: 2012-05-14                         |
| Type notebook() for the GUI, and license() for information.        |
----------------------------------------------------------------------
sage: import rdkit
---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)

/Users/dalke/<ipython console> in <module>()

ImportError: No module named rdkit
sage: from openeye.oechem import *
---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)

/Users/dalke/<ipython console> in <module>()

ImportError: No module named openeye.oechem
sage: os.environ["PYTHONPATH"]
'/Users/dalke/ftps/sage/local/lib/python'

I tried to force it to include the right path, and that failed.

% printenv PYTHONPATH
/Users/dalke/ftps/openeye//wrappers/v2011.Oct.1/python/:/Users/dalke/envs/RDKit_2011_12-svn
% sage
   ...
sage: import sys
sage: sys.path.insert(0, "/Users/dalke/envs/RDKit_2011_12-svn")
sage: import os
sage: os.environ["DYLD_LIBRARY_PATH"]
'/Users/dalke/ftps/sage/local/lib:/Users/dalke/ftps/sage/local/lib/R/lib::/Users/dalke/ftps/openeye//wrappers/libs:/Users/dalke/envs/RDKit_2011_12-svn/lib:/Users/dalke/ftps/sage/local/lib/R/lib'
sage: from rdkit import Chem
Fatal Python error: Interpreter not initialized (version mismatch?)

------------------------------------------------------------------------
Unhandled SIGABRT: An abort() occurred in Sage.
This probably occurred because a *compiled* component of Sage has a bug
in it and is not properly wrapped with sig_on(), sig_off(). You might
want to run Sage under gdb with 'sage -gdb' to debug this.
Sage will now terminate.
------------------------------------------------------------------------
/Users/dalke/ftps/sage/spkg/bin/sage: line 312: 56741 Abort trap              sage-ipython "$@" -i

What happened here is that Sage ships with Python 2.7, while the locally installed cheminformatics toolkits use Python 2.6.

I decided to use another technique. I would have sage call out to another program to handle the cheminformatics. For each structure it will output a line containing the identifier, the SMILES, and the needed upper-triangle dictionary data structure. One line of output will look like:

('15', 'OC1C2(C(C3C(C4(C(CC3)CC(=O)CC4)C)CC2)CC1)C', {0: [1], 1: [2, 19],
 2: [3, 20, 17], 3: [4, 18], 4: [5, 9], 5: [6, 16], 6: [7, 15, 14],
 7: [8, 10], 8: [9], 10: [11], 11: [12, 13], 13: [14], 16: [17], 18: [19]})

All the Sage code needs to do is read those lines, extract the data, pass the graph into Sage for analysis, and print out those which are non-planar.

Even this wasn't as easy as I thought. The PYTHONPATH environment variable persists in spawned processes, so the Python subprocess I started used the wrong PYTHONPATH. I ended up setting the environment variables myself - including PYTHONHOME - before it would work correctly:

import sys
import os
env = os.environ.copy()
env["DYLD_LIBRARY_PATH"] = "/Users/dalke/ftps/openeye//wrappers/libs:/Users/dalke/envs/RDKit_2011_12-svn/lib"
env["PYTHONPATH"] = "/Users/dalke/envs/RDKit_2011_12-svn"
env["RDBASE"] = "/Users/dalke/envs/RDKit_2011_12-svn"
env["PYTHONHOME"] = "/System/Library/Frameworks/Python.framework/Versions/2.6"

import subprocess

# Import the "Graph" constructor
from sage.all import *

# Use the version of Python for which RDKit was built to use
p = subprocess.Popen(["/usr/bin/python2.6", "pubchem_connectivity.py"],
                     env = env,
                     stdout = subprocess.PIPE)
                 
for i, line in enumerate(p.stdout):
    id, smiles, d = eval(line)
    G = Graph(d)
    if not G.is_planar():
        print "======= Found one!!!"
        print id, smiles
    if i % 100 == 0:
        sys.stderr.write("%d ...\r" % (i,))
        sys.stderr.flush()

Now I just need the "pubchem_connectivity.py" program to generate the connectivity information.

Call another program to generate the connectivity information

I've been using RDKit these days, because the funding I got for the MCS project said I needed to use RDKit. I usually use OEChem, which (among other things) has much faster structure parsers than RDKit. Since I'm going to read tens of millions of structures, I wanted a way to reduce the number of structures to process.

Now, if a graph has no cycles then it's always planar. Even if it has one, two, or three independent cycles, it's still impossible for it to be non-planar, since the smallest non-planar graphs (K5 and K3,3) contain at least four. The number of independent cycles in a single-component graph is E-V+1, which is the number of edges (bonds), minus the number of vertices (atoms), plus 1. So a simple test is to exclude molecules where the number of bonds is less than the number of atoms plus three.

Mind you, this is an exclusion test: it rejects graphs which cannot possibly be non-planar. There are plenty of molecules with dozens of rings which are still topologically planar. Also, PubChem has plenty of records with multiple components, which means I may miss a few cases. But my goal is to find some non-planar structures in PubChem, not all of them.
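
To make the arithmetic concrete, here's the filter on its own, using RDKit to do the counting. Cyclohexane is excluded outright, while cubane (circuit rank 12-8+1 = 5) survives the filter even though its graph is planar - the filter only rejects guaranteed-planar cases.

from rdkit import Chem

def might_be_nonplanar(smiles):
    # Keep only molecules whose circuit rank E - V + 1 is at least 4,
    # that is, bonds >= atoms + 3. (Assumes a single connected component.)
    mol = Chem.MolFromSmiles(smiles)
    return mol.GetNumBonds() >= mol.GetNumAtoms() + 3

print might_be_nonplanar("C1CCCCC1")            # cyclohexane: one ring, so False
print might_be_nonplanar("C12C3C4C1C5C2C3C45")  # cubane: passes the filter, yet planar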

I had previously converted my local copy of PubChem to a set of compressed SMILES files. A few months ago I wrote a set of Ragel definitions for SMILES, which includes a demonstration of how to count the number of atoms and bonds in a SMILES string. I can use that unmodified as a co-process to do high-speed counting for me.

import glob
import subprocess
import gzip
import sys
from collections import defaultdict
from rdkit import Chem

# Start a co-process to count the number of atoms and bonds in a SMILES string
p = subprocess.Popen(["/Users/dalke/opensmiles-ragel/smiles_counts"],
                     stdin = subprocess.PIPE,
                     stdout = subprocess.PIPE)

filenames = glob.glob("/Users/dalke/databases/pubchem/*.smi.gz")
for i, filename in enumerate(filenames):
    msg = "Processing %d/%d\n" % (i+1, len(filenames))
    sys.stderr.write(msg)
    sys.stderr.flush()

    for line in gzip.open(filename):
        # Read a line from the data file
        smiles, id = line.split()

        # Send it to the co-process
        p.stdin.write(smiles + "\n")
        p.stdin.flush()

        # Get the counts (looks like "atoms: 34 bonds: 38")
        line = p.stdout.readline()
        _, atoms, _, bonds = line.split()
        num_atoms = int(atoms)
        num_bonds = int(bonds)

        # Skip records which cannot be non-planar
        # (Note: assumes single component structures!)
        if num_bonds < num_atoms + 3:
            continue

        # Extract the topology into upper-triangle dictionary form
        mol = Chem.MolFromSmiles(smiles)
        if mol is None:
            continue
        d = defaultdict(list)
        for bond in mol.GetBonds():
            b1 = bond.GetBeginAtomIdx()
            b2 = bond.GetEndAtomIdx()
            if b1 < b2:
                d[b1].append(b2)
            else:
                d[b2].append(b1)

        # print to stdout so the Sage program can read it
        data = (id, smiles, dict(d))
        print repr(data)

BTW, if you haven't been paying attention, I have one process which tests if a SMILES string has enough bonds in it, another to convert the structure information into a simple topology, and a third program to do the graph planarity test. "Baling wire and chewing gum!", to repeat an old phrase.

The structures!

It took many hours for my computer to chug along (while I slept). I ended up with 224 non-planar structures in the subset of 28.5 million PubChem records I have on my machine. (In other words, bear in mind that this is not a complete search!)

Here are a few structures to show you what they look like:

That first structure, silicon nitride, is a ceramic, which can't really be expressed in SMILES. (In the same way, "[C]" is as poor a SMILES representation of graphite as it is of diamond.) The others look like progressively more realistic chemical representations.

The full set of non-planar structures is a bit too verbose to show as a SMILES file. Instead, here's the full list of identifiers, which you can easily use to fetch the structures yourself (or follow the hyperlink to look at the non-planar depictions).

390566, 498002, 636755, 3084099, 4868274, 5104674, 6712449, 10019039, 10882690, 10895017, 10898366, 10994362, 11027681, 11126005, 11131973, 11187418, 11340053, 11350583, 11360285, 11384163, 11407796, 11672382, 14381365, 14381430, 14381432, 14381435, 16132679, 16132681, 16132995, 16132999, 16133150, 16133259, 16133262, 16133397, 16133412, 16133413, 16133414, 16133878, 16145580, 16146229, 16146230, 16148442, 16148609, 16148632, 16148888, 16148900, 16149007, 16149114, 16149222, 16149238, 16149361, 16149362, 16149482, 16149579, 16149602, 16149827, 16149958, 16150054, 16150193, 16150399, 16150419, 16150654, 16150658, 16150833, 16150994, 16151169, 16151337, 16151360, 16151567, 16151729, 16151960, 16152362, 16152566, 16152608, 16152699, 16152729, 16153098, 16153207, 16153649, 16154275, 16154327, 16154453, 16154584, 16154971, 16155013, 16155014, 16155015, 16155069, 16155075, 16155076, 16155130, 16155140, 16155152, 16155193, 16155194, 16155197, 16155399, 16155442, 16155443, 16155456, 16155607, 16155626, 16155630, 16155631, 16155632, 16155633, 16155652, 16155784, 16155884, 16155916, 16155917, 16155920, 16155972, 16156042, 16156080, 16156082, 16156145, 16156198, 16156260, 16156261, 16156264, 16156265, 16156312, 16156411, 16156413, 16156417, 16156432, 16156495, 16156539, 16156544, 16156545, 16156609, 16156610, 16156613, 16157557, 16214951, 17749011, 21597602, 21597607, 21597610, 21597611, 21770498, 22294696, 22835058, 22835161, 22835262, 22835624, 22835636, 22835637, 23327291, 23584643, 23726086, 23727886, 23955822, 24764125, 24770227, 24770228, 24770229, 24770290, 24770291, 24871221, 24940071, 25200029, 44239114, 44303783, 44303799, 44303821, 44382489, 44397933, 44397934, 44566282, 44566284, 44575206, 44575207, 44575208, 44592641, 44592642, 44592645, 44592646, 44592647, 44592648, 44592945, 44593582, 44606373, 44606374, 46882313, 46882314, 46882315, 46882316, 46882317, 46882318, 46882319, 46882320, 46882321, 46882322, 46882323, 46882324, 46882325, 46882326, 46882327, 46882328, 46891923, 46895835, 49799159, 49799160, 49873810, 50897242, 50900298, 50900299, 50900300, 50919058, 51004304, 51026319, 52945815, 52952313, 52952314, 52952315, 52952316, 52952317, 53468167, 53468168, 53468169, 53468170, 53468171

Still, having some SMILES on-hand is nice so here are some of the shorter SMILES strings. At the least, you can use these as test-cases for an MCS search engine, and perhaps force a linear-time planar-graph-only method to fail :)

O1C23C4C5(C16OCC7C4(CC(C2(OC(=O)C3(CC5C8C6CC=C9C8(C(=O)C=CC9)C)O)C)OC7=O)C)O 390566
O=C1N2C34N5CCC(C3CCC4(C1)C=CC5)c6c2cccc6 498002
O=C1N2C34N5CCC(C3CCC4(C1)C=CC5)c6c2cc(cc6)OC 636755
[Si]123N4[Si]56N1[Si]4(N25)N36 3084099
C123C45C16C24C6C7C(C35)CC(C(C7)C)C 4868274
O1C23C4C5(C16OCC7C4(CC(C2(OC(=O)C3(CC5C8C6CC=C9C8(C(=O)C=CC9)C)O)C)OC7=O)C)O 5104674
O1C23C4C5(C16OCC7C4(CC(C2(OC(=O)C3(CC5C8C6CC=C9C8(C(=O)C=CC9)C)O)C)OC7=O)C)O 6712449
O=C1N2C34N5CCC(C3CCC4(C1)C=CC5)c6c2cc(cc6)OC 10019039
O=C1N2C34N5CCC(C3CCC4(C1)C=CC5)c6c2cc(c(c6)OC)OC 10882690
O1C2C34C5C(CN(C3C(C5OC)c6c(cc(c(c6)OC)O)C4C1)CC)(C(C2)O)CO 10895017
O1c2c3cccc2Cc4c5c(ccc4)Cc6c7c(ccc6)Cc8c(c(ccc8)C3)OCCOCCN(CCOCCO5)CCOCCOc9c(cccc9)OCCOCCN(CCOCC1)CCOCCO7 10898366
O1C2C34C5C(C(C2)OC(=O)C)(CN(C3C(C5OC)c6c(cc(c(c6)OC)O)C4C1)CC)COC 10994362
C123C45c6c(cccc6)C17c8c(cccc8)C2(c9c(cccc9)C3(c1c4cccc1)c1c7cccc1)c1c5cccc1 11027681
OC1N2C34N5CCC(C3CCC4(C1)C=CC5)c6c2cc(c(c6)OC)OC 11131973
O1c2c3c(ccc2)OB4Oc5c6c(ccc5)OB1Oc7c(c(ccc7)O4)N63 11187418
C123C45C6C7C8C19C1C(C4C4C%10C22C%11C(C5C5C%12C33C(C6C5)C8CC5C9C(C2C(C35)CC%12%11)CC1%10)C4)C7 11340053
O1C2C34C5C(CN(C3C(C5OC)c6c(cc(c(c6)O)O)C4C1)CC)(C(C2)O)COC 11350583
C123C4C5C1C6C5(C4C26)C=CC#CC=CC78C9C1C7C2C1(C9C82)C=CC#CC=C3 11360285
C123C4C5C1C6C5(C4C26)C=CC#CC#CC=CC78C9C1C7C2C1(C9C82)C=CC#CC#CC=C3 11384163
O1C2C34C5C(CN(C3C(C5OC)c6c(cc(c(c6)OC)O)C4C1)CC)(C(C2)O)COC 11407796
O(c1cc2c(cc1OC)C34C56C27c8c(cc(c(c8)OC)OC)C5(c9c3cc(c(c9)OC)OC)c1c(cc(c(c1)OC)OC)C6(c1c7cc(c(c1)OC)OC)c1c4cc(c(c1)OC)OC)C 11672382
O(C12NC(=O)C3C4C56C1C7C8C59C15C6(C3C3C1C(C9C43)C#N)C2C7C5C8=O)C(=O)C 14381365
O1C2C3C45C67C2C8C3C9C42C6(C8C9=O)C3C4C7C(C5C4C2C3C#N)C1=O 14381430
O1C2C3C45C67C2C8C3C9C42C63C8C9OC(=O)C4C2C2C5C(C7C2C34)C1=O 14381432
BrC12C34C56C7(C8C9C5C5C3C9C1C8C(=O)OC1C2C2C4C(C6C2C71)OC5=O)Br 14381435
O1c2c3c4c5c6c7c3c(cc8c7c(cc6Oc9cc(ccc9)OCCOCCOc3cc1ccc3)C(=O)N(C8=O)C(CCCCCC)C)Oc1cc(ccc1)OCCOCCOc1cc(ccc1)Oc5cc1c4c(c2)C(=O)N(C1=O)C(CCCCCC)C 16214951
BrC12C34C56C7(C8C9C5C5C3C9C1C8C(=O)OC1(C2C2C4C(C6C2C71)OC5=O)Br)Br 21597602
BrC12C34C56C7C8C9C3C7C(=O)OC3C4C4C1C(=O)C(C5(C8C(C29)C(=O)OCC)Br)C4C63 21597607
BrC12C34C56C7C8C9C5C5C3C9C1C8C(=O)OC1C2C2C4C(C6C2C71)OC5=O 21597610
O1C2C3C45C67C2C8C3C9C42C6(C8C9O)C3C4C7C(C5C4C2C3C(=O)O)C1 21597611
O1C23OC(OC4C25C6C7(C1=O)C(C3OC(OC5C(=C)C4CC6)(C)C)C(CCC7)(C)C)(C)C 21770498
ClC1=CC2OC3C1C4OC5C2C4C3C5 22294696
O=C1N2C34N5CCC(C3CCC4(C1)C=CC5)c6c2cc(cc6)OC 23327291
O1c2cc3c4cc2OCCOCCOCCOc5c(cc6c(c5)c(c7c(c6C)cc8c(c7)OCCOCCOCCOc9cc2c(cc9OCCOCCOCCO8)C4(c4c(cccc4)C32C)C)C)OCCOCCOCC1 23584643
O1C2(OCC34C5C2(CCC3OC(=O)C67C4C(OC51)CC(C6)C(=C)C7=O)C)C 23727886
O=C1N2C34N5CCC(C3CCC4(C1)C=CC5)c6c2cc(c(c6)OC)OC 23955822
O1C2C3C45C6C1OC(C6(CCC4OC(=O)C37CC(C2)C(=C)C7=O)C)OC5 24764125
O=C1N2C34N5CCC(C3CCC4(C1)C=CC5)C6=C2CC(=C(C6)OC)OC 24871221
O1C2C34C56C(C(C2(C=C5)OC)C(O)(CCCC)C)CCN(C6Cc7c3c1c(cc7)O)CC4 44303783
O1C2C34C56C(C(C2(C=C5)OC)C(O)(CCCC)C)CN(C6Cc7c3c1c(cc7)O)CC4 44303799
O1C2C34C56C(C(C2(C=C5)OC)C(O)(CCCC)C)CCN(C6Cc7c3c1c(cc7)O)CC4 44303821
S(=O)(=O)(OCC12C3C4C(C1)CC(C3)CC4C2)N 44382489
O=C1N2C34N5CCC(C3CCC4(C1)C=CC5)c6c2cc(c(c6)OC)OC 44592945
O1C23C4C5(C16OCC7C4(CC(C2(OC(=O)C3(CC5C8C6CC=C9C8(C(=O)C=CC9)C)O)C)OC7=O)C)O 49873810
O=C1N2C34N5CCC(C3CCC4(C1)C=CC5)c6c2cccc6 50897242
O=C1N2C34N5CCC(C3CCC4(C1)C=CC5)c6c2cc(cc6)OC 50919058
O1C23C4C5(C16OCC7C4(CC(C2(OC(=O)C3(CC5C8C6CC=C9C8(C(=O)C=CC9)C)O)C)OC7=O)C)O 51004304
O=C1N2C34N5CCC(C3CCC4(C1)C=CC5)c6c2cc(cc6)OC 51026319

Feel free to leave a comment.



Copyright © 2001-2013 Andrew Dalke Scientific AB