
Varkony Reconsidered

This is an expanded version of the talk "Varkony Reconsidered: Subgraph enumeration and the Multiple MCS problem", which I gave at the 6th Joint Sheffield Conference on Chemoinformatics on 23 July 2013. (In the programme I called it "Varkony Revisited", but I later decided that "Reconsidered" was a more appropriate term.)

The major goal of the presentation was to show how the 1979 paper of Varkony, which is part of the MCS literature, is better seen as an early paper in frequent subgraph mining. My own MCS algorithm, fmcs, is a variation of that approach, and is best seen as an intellectual descendant of an algorithm by Takahashi. Both the Varkony and Takahashi papers are relatively uncited in the MCS literature, so I spend some time explaining how they fit into the larger context.

The slide number references refer to the accompanying PDF slides. See the slides for full citation references and images.

Early MCS history

[slide 1]
Today I'm going to talk about the maximum common substructure problem.

[slide 2]
Here's an example of the MCS problem in its simplest form. You have two molecules and you want to find the largest connected substructure which is common to both structures. I won't be talking about disconnected substructures.

[slide 3]
Talking about the MCS problem here is a bit like bringing coal to Newcastle. It seems that half of the significant papers in the field came from this university. One of the earliest is this 1967 paper by Armitage et al.

[slide 4]
Armitage proposed using the MCS as a way to identify a reaction mapping, that is, to identify the conserved subgraph between the product and the reactant.

[slide 5]
If you read the paper you'll see that it also talks about using the MCS as a measure of chemical similarity; that is, a larger MCS should indicate that two molecules are more similar than a smaller one would.

[slide 6]
This early effort wasn't successful. It only handled acyclic molecules, in part because of the very limited computer hardware they had. They report that their Ferranti Mercury computer had 480 words. I researched the computer, and it looks like it was built around 1960, and had 10-bit words. It was the first Ferranti with core memory instead of Williams tubes.

Multiple structure MCS and Varkony

[slide 7]
I'm actually not as interested in the pairwise MCS as I am in the multiple structure MCS, where there can be more than two structures.

[slide 8]
The 1970s and 1980s saw a lot of work on the MCS problem, and the problem space was pretty well developed. You can see that in this 1979 paper by Varkony et al.

[slide 9]
In it they motivate the reason for working on the multiple structure MCS: "molecules displaying similar activities do so because they share some similar feature or features." Given a set of compounds, the frequent presumption is that the MCS contributes to the activity.

[slide 10]
By the late 1970s we knew the main ways to compare the atoms and bonds in the molecular graphs. For example, you may want carbons to match only carbons, but have fluorines match any other halogen. In the general case, the chemist can define atom types so that atoms match only other atoms of the same type.

Bond comparisons have similar flexibility. You might require bonds to match only bonds of the same type, or let them match any bond type. You can also look at ring membership, so that ring bonds only match ring bonds and not chain bonds.

The Varkony paper mentioned all of those types of topology comparisons, and they are still present in nearly all modern MCS tools, though of course most add additional refinements.

[slide 11]
This talk is titled "Varkony Reconsidered" because I think it sets the early groundwork for an approach that I and a few others have taken in working on the MCS problem. In order to set the stage, I think it's important to describe the Varkony algorithm.

It starts by assigning atom and bond labels to all of the structures and ordering the structures in a way that should improve the overall performance.

For each structure, get the list of subgraphs with just two atoms and one bond. These are called the nuclei, to "convey the sense of growth ... in subsequent steps." Canonicalize the nuclei, which is easy in this case because there are only two ways to represent a subgraph with two atoms.

Now that you have the SMILES string ... sorry, I'm a modern cheminformatics researcher so I think of these as canonical SMILES strings for the subgraphs even though SMILES wouldn't be published for almost a decade ... now that you have the canonical subgraphs, check to see which ones are present in every structure. If a bond isn't in all of the structures then mark it as "do not participate", so it won't be used for future nuclei growth.

For the nuclei which are present in every structure, go back to the original graph and grow the corresponding subgraphs by one atom, in every possible way. This gives you all subgraphs of size 3 atoms, where each of those subgraphs contains a subgraph of size 2 atoms that is present in all of the structures.

Canonicalize those subgraphs of size 3 and only use those subgraphs which are common across all of the molecules. Keep repeating this process until no more common subgraphs can be found.

This finds all of the common subgraphs, so the MCS is (or are) the final common subgraphs in the process.
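To make the flow concrete, here's a minimal Python sketch of the level-wise grow-and-filter loop, using a toy molecule representation of my own. The key function - a sorted, labeled edge list - is a crude stand-in for true canonicalization: it cannot distinguish all non-isomorphic subgraphs, which real code must.

    def neighbors(bonds, sub):
        # Bonds adjacent to the subgraph; `sub` is a frozenset of bond indices.
        atoms = {a for i in sub for a in bonds[i][:2]}
        return [i for i, (a1, a2, _) in enumerate(bonds)
                if i not in sub and (a1 in atoms or a2 in atoms)]

    def key(elems, bonds, sub):
        # Crude stand-in for a canonical form: the sorted, labeled edge list.
        return tuple(sorted(tuple(sorted((elems[a1], elems[a2]))) + (bt,)
                            for a1, a2, bt in (bonds[i] for i in sub)))

    def common_subgraphs(mols):
        # mols: list of (elements, bonds), where bonds are (atom1, atom2, type).
        # Level 1 holds the "nuclei": every one-bond subgraph of every molecule.
        per_mol = []
        for elems, bonds in mols:
            subs = {}
            for i in range(len(bonds)):
                sub = frozenset([i])
                subs.setdefault(key(elems, bonds, sub), set()).add(sub)
            per_mol.append(subs)
        result = []
        while True:
            # Keep only the subgraph keys which occur in every structure.
            common = set.intersection(*(set(s) for s in per_mol))
            if not common:
                return result   # the last non-empty level holds the MCS
            result = sorted(common)
            # Grow each surviving subgraph by one adjacent bond.
            new_per_mol = []
            for (elems, bonds), subs in zip(mols, per_mol):
                grown = {}
                for k, subgraphs in subs.items():
                    if k not in common:
                        continue   # "do not participate" in future growth
                    for sub in subgraphs:
                        for i in neighbors(bonds, sub):
                            new = sub | {i}
                            grown.setdefault(key(elems, bonds, new), set()).add(new)
                new_per_mol.append(grown)
            per_mol = new_per_mol

    # Ethanol and 1-propanol share the two-bond subgraph C-C-O:
    mols = [("CCO", [(0, 1, "-"), (1, 2, "-")]),
            ("CCCO", [(0, 1, "-"), (1, 2, "-"), (2, 3, "-")])]
    print(common_subgraphs(mols))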

[slide 12]
It's pretty easy to see that this method will work, but also that it might not be so fast. Still, using 1970s hardware they were able to find the MCS of three marine sterols in between 45 seconds and several minutes, depending on how it was parameterized.

As a side note, it's interesting to read these papers to see how the technology has changed over time, and to see how the implementation language affects the algorithm and timings. In this case they used two different programming languages. Canonicalization was done in Algol while the rest of the code was in the much slower INTERLISP. I interpret this as the authors saying that their algorithm is good, and asking us to not take the timings at face value because the implementation gets in the way.

The computer was a 1972-era PDP-10 hosted by the SUMEX facility at Stanford. The SUMEX (Stanford University Medical Experimental Computer) project promoted the application of artificial intelligence to bio-medical problems, and made its hardware available to others over the network.

Some other MCS approaches

McGregor

[slide 13]
Just a few years later McGregor published a paper on how to apply backtracking to the pairwise MCS problem. Backtracking was not a new programming concept, but in my reading of the MCS literature the McGregor paper was the one which specifically promoted it as a technique for the MCS problem, and it influenced a lot of the subsequent work.

Basically, start by mapping one edge from the first structure to the second. Extend the edge in the first structure and look for a corresponding extension in the second. There can be multiple possibilities, so pick one, but keep track of the other possibilities. Keep repeating the process until it's not possible to grow any further.

This gives a maximal common substructure. "Maximal" here means that there's no larger common subgraph which includes the maximal common substructure, but there might be another maximal substructure which is even larger. A maximum common substructure is the largest maximal substructure.

[Note: The "maximal" terminology dates to at least Cone et al. 1977.]

At this point, the backtracking algorithm kicks in. It backs up and tries an alternative possibility. It keeps going forward and backtracking until it's done.

Backtracking can take a lot of time because the algorithm will find all of the ways to map any subgraph from the first structure to the second.

[slide 14]
However, if the current best maximal common substructure is of, say, size 10 atoms, then you know that any larger MCS must have at least 11 atoms. If at any point in the backtracking you can prove that there's no way to match 11 atoms then there's no reason to continue searching that part of the search tree.

Pruning makes backtracking effective.

[Note: pruning was not first introduced in the McGregor paper. Cone et al. in 1977 mention pruning as part of DENDRAL, but most papers reference McGregor as the source for the general backtrack and pruning for the MCS search in cheminformatics.]

[slide 15]
I was able to find some information about how the McGregor algorithm was first implemented. It was written in ALGOL68 for the University of Sheffield's ICL computer, which was about 5 years old by the time McGregor developed the software. It had 24-bit words, and it looks like subgraphs were represented in the algorithm using one word as a bitset for the atoms and another for the bonds, which meant the implementation was limited to at most 24 atoms and 24 bonds.

It took about 9 seconds on average to find the pairwise MCS between "a simple file of 140 reactions."

In my literature research, I found a 1986 paper which still used the ICL 1906S computer. This shows that the hardware was in use at Sheffield for over 15 years. Humorously, that paper comments that ALGOL 68 was "not a particularly efficient language" for its algorithm, echoing similar language from the Varkony paper.

Clique detection

[slide 16]
There's another technique which finds the MCS by reducing it to the maximum clique problem.

I'm not going to go into the details of the clique approach. I find it easier to think about backtracking than the equivalent approach of finding cliques.

[Note: The usual citation for this is to Cone et al. in 1977, though Cone cites G. Levi's 1973 paper in Calcolo titled "A note on the derivation of maximal common subgraphs of two directed or undirected graphs." As that paper costs US$40 to download and appears in an Italian applied math journal that at least my local library doesn't carry, I've only read the first two pages, which are available online for free. That's enough to see that it's primarily a math paper and not a cheminformatics paper, which is why people usually cite Cone. Interestingly, the word "clique" does not appear in that paper. Who first said "clique" in a cheminformatics paper?]

[slide 17]
Let's jump forward a couple of decades. In 2002, Raymond and Willett published a very informative review paper on the various MCS solutions. This is after Raymond had implemented the RASCAL MCS program, which uses the clique approach.

[slide 18]
In the paper they recommend using the clique approach, in large part because mathematicians and computer scientists have spent a lot of time figuring out how to make clique detection be fast.

Papers can be viewed as a very long, slow conversation about a topic. Cao et al. point out that certain types of non-transitive comparison are hard to do with a clique algorithm. "For example, allowing a bromine atom to be matched to a chlorine atom when they are attached to an aromatic ring, but not elsewhere, is much harder to do with association graphs." Cao also points out that there have been improvements to the graph backtracking methods, such as the VF algorithm. In other words, even if Raymond and Willett were right in 2002, that doesn't mean that a clique solution is the best one now.

[Note: libmcs from ChemAxon uses the clique approach plus heuristics to implement an approximate pairwise MCS. In the poster titled "Making the Most of Approximate Maximum Common Substructure Search", which was next to mine at the Sheffield conference, Péter Kovács discussed the effort it took to handle non-transitive comparisons.]

Multiple structure MCS through pairwise reduction

[slide 19]
The Raymond and Willett review paper focuses on pairwise MCS algorithms and not the multiple structure MCS. They point out that most methods can be extended to handle a collection of graphs.

This is true. Any pairwise method which finds maximal substructures can be made to support multiple structures through a reduction method: the maximal common substructures of the first pair are used as reference structures to compare against the third, and so on. This can be combined with pruning methods to make the search practicable.
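To show the shape of the reduction without dragging in a graph library, here's a sketch where strings stand in for molecules and maximal common substrings stand in for maximal common substructures; the pattern carries over even though the toy domain is far simpler.

    def common_substrings(a, b):
        # All non-empty substrings of `a` which also occur in `b`.
        subs = {a[i:j] for i in range(len(a)) for j in range(i + 1, len(a) + 1)}
        return {s for s in subs if s in b}

    def keep_maximal(subs):
        # Drop any common substring which is contained in a larger one.
        return {s for s in subs if not any(s != t and s in t for t in subs)}

    def multiple_mcs(strings):
        # Pairwise reduction: the maximal results of the first pair become
        # the reference "structures" compared against the third, and so on.
        refs = keep_maximal(common_substrings(strings[0], strings[1]))
        for s in strings[2:]:
            found = set()
            for ref in refs:
                found |= common_substrings(ref, s)
            refs = keep_maximal(found)
        return max(refs, key=len) if refs else ""

    print(multiple_mcs(["benzodiazepine", "diazepam", "azepane"]))  # 'azep'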

[Note: I developed this pairwise reduction technique for my own MCS algorithm in the late 1990s. Researching it now, it looks like Denis M. Bayada and A. Peter Johnson published this approach in "Detection of the Largest Patterns Common to a Set of Molecules, Using 2D Graphs and 3D Skeletons", a chapter in "Molecular Similarity and Reactivity: From Quantum Chemical to Phenomenological Approaches", pages 243-265. As a side-side note, it also looks like they didn't quite get the key differences between Varkony and Takahashi, but I haven't yet talked about that topic.]

[slide 20]
Pairwise reduction is an indirect solution. There's a straightforward direct extension of the clique approach to multiple structures, but its memory use grows exponentially with the number of molecules. This problem was resolved in the MultiMCS paper of Hariharan et al.

My two MCS implementations

[slide 21]
My first attempt at the MCS problem was in the late 1990s. I worked for a company called Bioreason. We developed a method to combine the MCS with clustering to build a dendrogram "where each node contains a set of molecules and a chemical fragment common to the molecules." I wrote that MCS code using a pairwise backtracking algorithm combined with the reduction technique to handle multiple structures.

[Note: Actually, this was the second MCS implementation at Bioreason. Brian Kelley wrote the first, as an approximate method based on a genetic algorithm. We used it during testing, to make sure that the GA never found a larger common substructure than the exact method did.]

The experience with that project led to ClassPharmer, which in turn became the basis for MedChemStudio, available from Simulations Plus. Even though there's probably no remaining code of mine in the project, I'm still pleased that there's a direct lineage from code I wrote in the 1990s to a tool which people are still using.

fmcs

[slide 22]
In mid-2011, Roche asked if I would implement an MCS method for RDKit. I no longer had access to my original MCS work, but that's okay because I had been working on a new idea based on subgraph enumeration. The result of this is 'fmcs', which is now distributed as part of RDKit.

It has a pretty standard set of MCS options. It defines a common substructure through a SMARTS pattern definition, which gives pretty good control over how atoms and bonds should be matched. Arbitrary atom class assignment is possible through a bit of a hack: you can overload the isotope field with an arbitrary integer label to represent the atom class type. Sadly, there is no corresponding hack to define arbitrary bond classes, but that need hasn't come up in my discussions with scientists, nor is it mentioned in previous MCS papers.

[slide 23]
The most novel feature of fmcs is that it can find the MCS which matches at least M of N molecules. If M=N then this is the normal multiple structure MCS. For example, the MCS between all of vanillin, zingerone, and capsaicin has 9 heavy atoms ...

[slide 24]
... while the MCS which is in at least two of those structures has 13 atoms.
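Here's what that looks like using RDKit's rdFMCS module, the C++ descendant of fmcs, assuming these SMILES for the three structures; the threshold option expresses M as a fraction of N.

    from rdkit import Chem
    from rdkit.Chem import rdFMCS

    mols = [Chem.MolFromSmiles(smiles) for smiles in (
        "O=Cc1ccc(O)c(OC)c1",                    # vanillin
        "CC(=O)CCc1ccc(O)c(OC)c1",               # zingerone
        "CC(C)/C=C/CCCCC(=O)NCc1ccc(O)c(OC)c1",  # capsaicin
    )]

    all_three = rdFMCS.FindMCS(mols)                     # M = N = 3
    two_of_three = rdFMCS.FindMCS(mols, threshold=2/3)   # M = 2

    print(all_three.numAtoms, all_three.smartsString)        # 9 atoms
    print(two_of_three.numAtoms, two_of_three.smartsString)  # 13 atoms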

[slide 25]
The fmcs algorithm is similar to that of Varkony. I pick a reference molecule from the set and generate its subgraphs. For each subgraph, I convert it into a SMARTS pattern, then do a subgraph isomorphism check to see if it matches the other molecules.

The largest SMARTS pattern that matches is an MCS for all N structures. Of course, if M<N then I need to enumerate subgraphs from a few more reference structures, while on the other hand I don't have to check all of the other molecules to see if a given subgraph SMARTS matches. The major difference at this level is that Varkony uses canonicalization to determine if two subgraphs match, while I use a subgraph isomorphism search. I chose this because I could build on top of the existing VF2-based SMARTS matcher in RDKit to do the backtracking search.

Just like with the Varkony method, you can intuitively see that it will work, but it might be slow. The methods to make it fast are nearly identical to how to speed up any other MCS program.

[slide 26]
This is a new algorithm, and it's mine, so I'll go into it in a bit more detail.

(1) First, I create atom and bond types, based on the atom and bond SMARTS. The atom type is just the atom SMARTS. The canonical bond type is built from the atom type of one end, the bond SMARTS, and the atom type of the other end; of the two possible orderings, use the alphabetically smallest one.
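In code, picking the canonical direction is a one-line min(); a tiny sketch using the SMARTS strings directly:

    def canonical_bondtype(atom1_smarts, bond_smarts, atom2_smarts):
        # Two possible spellings, one per direction; keep the smaller one.
        return min(atom1_smarts + bond_smarts + atom2_smarts,
                   atom2_smarts + bond_smarts + atom1_smarts)

    print(canonical_bondtype("[#8]", "-", "[#6]"))  # '[#6]-[#8]'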

[slide 27]
Once I have the canonical bond types, I can remove (or mark as "do not participate") the bonds which don't occur at least M times, because there's no way they can be part of the MCS. Bond removal may cause fragments.

Order the molecules by the size of the largest fragment, such that the smallest largest fragment is first. This is a common technique, which should help improve the overall performance.

[slide 28]
(2) There are generally two ways to search for an MCS: either build up, by looking for a one-bond match, then two bonds, three, and so on; or start from the smallest largest fragment and remove bonds.

I decided to be optimistic and try the smallest largest fragment first before starting by building up the subgraphs. I expect that the input to the MCS algorithm will be similar structures or structures which are already known to have a common core. In that case, the pruning from step 1 might have been enough so that the smallest largest fragment is the actual MCS.

So, I take the largest fragment, turn it into a SMARTS pattern, and do the subgraph match to see if there are at least M-1 matches. If so, the algorithm is done.

Subgraph enumeration

If not, then it's time to start enumerating the substructures.

[slide 29]
The literature shows that subgraph enumeration is not obvious. The Varkony algorithm suffered from generating duplicate structures; I haven't come across a description of the enumeration method in the cheminformatics literature; and it's not typically covered in a graph theory course.

[slide 30]
The basic concept is quite simple. Start by generating all subgraphs which use the first bond, and then generate all subgraphs which don't use the first bond. Repeat the process for the second edge, then the third, and so on. This gives you all 2**n ways to make a subgraph, including the empty graph, and does so uniquely.
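A recursive sketch of that include/exclude enumeration, over an arbitrary list of bond identifiers:

    def all_subgraphs(bonds):
        # Decide bond-by-bond whether to include it; each of the 2**n
        # subsets, including the empty one, is generated exactly once.
        # (Connectivity is not checked here.)
        if not bonds:
            yield frozenset()
            return
        first, rest = bonds[0], bonds[1:]
        for sub in all_subgraphs(rest):
            yield sub              # the subgraphs without the first bond
            yield sub | {first}    # the subgraphs with the first bond

    print(sorted(tuple(sorted(s)) for s in all_subgraphs([1, 2, 3])))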

This simple method generates both connected and disconnected subgraphs but fmcs only uses connected subgraphs so I use a slightly different algorithm.

[slide 31]
I create an initial set of seeds and put them into a priority queue. The terminology and idea of a 'seed' is very similar to the 'nucleus' in Varkony. A seed contains three things: the set of bonds which are in the seed, the set of bonds which may not be used for future growth, and the set of bonds connected to the seed which are available for future growth.

If an initial seed contains bond i then bonds 1 to i are specifically excluded from future growth. This prevents duplication.

[slide 32]
For each iteration, I pick a seed and grow it. If there are n possible outgoing edges then there are 2**n - 1 possible ways to grow the seed into a new seed. For example, if there are two outgoing edges then I could choose both edges, choose the first, or choose the second. These new seeds go back into the priority queue. The n outgoing edges from the parent graph are marked as unavailable for future growth by the child, and new outgoing edges are found.
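A sketch of one growth step, assuming the seed representation above. With n outgoing bonds there are 2**n - 1 non-empty choices, and every child excludes all n of the parent's outgoing bonds, so no bond set is ever generated twice.

    from itertools import combinations

    def grow(seed_bonds, excluded, outgoing):
        # All 2**n - 1 non-empty subsets of the outgoing bonds become children.
        children = []
        for r in range(1, len(outgoing) + 1):
            for chosen in combinations(sorted(outgoing), r):
                children.append((seed_bonds | set(chosen),
                                 excluded | set(outgoing)))
        # (A real implementation would next find each child's new outgoing bonds.)
        return children

    # Two outgoing edges -> three children: {first}, {second}, and {both}.
    for bonds, used in grow({0}, set(), {1, 2}):
        print(sorted(bonds), sorted(used))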

[slide 33]
I sort the priority queue of seeds by size. As an unusual variation compared to most MCS algorithms, the fmcs algorithm can grow by multiple bonds per step, then backtrack if that fails. I wrote it this way because of the heritage of how I came up with the algorithm. The fmcs approach is optimistic, but I've not tested to see if it's better than a more conservative single bond expansion.

[slide 34]
I also implement some of the usual pruning methods found in MCS programs. If I can show that fragment growth can never beat the current largest subgraph then I don't bother to enumerate that part of the graph. There are also some optimizations I haven't yet added, like keeping track of the maximum number of each bond type common to all structures and using that limit as a more fine-grained upper bound.

SMARTS subgraph isomorphism instead of canonicalization

[slide 35]
Then comes the key difference from Varkony. I convert the subgraph into a SMARTS pattern and check that it exists in at least M-1 of the remaining structures.

If it doesn't, then discard this seed and don't try to grow it, because no larger seed can be part of the MCS.

However, if it does, then add its children to the priority queue. If the seed is larger than the current best common substructure size, then it's an MCS candidate. fmcs supports an option that was in MultiMCS: if a ring bond from a structure is in the MCS then the corresponding bond in the MCS must also be in a ring. This option is called "complete rings only." It's implemented with a special post-processing step.

If the candidate passes these requirements, then it's the new best MCS.

[slide 36]
There's an optimization I can do as a consequence of turning the subgraph query into a SMARTS. I can easily cache the subgraph isomorphism search results, so if I see something like "[#6]-[#6]-[#6]" again, I can reuse the cached result rather than redo the isomorphism tests.
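A minimal sketch of the cache, assuming a fixed list of target molecules so the SMARTS string by itself can serve as the key:

    from rdkit import Chem

    class MatchCounter:
        # Counts how many targets contain a subgraph, memoized on the
        # SMARTS string (valid only while the target list stays fixed).
        def __init__(self, targets):
            self.targets = targets
            self.cache = {}

        def count(self, smarts):
            if smarts not in self.cache:
                pattern = Chem.MolFromSmarts(smarts)
                self.cache[smarts] = sum(
                    1 for mol in self.targets if mol.HasSubstructMatch(pattern))
            return self.cache[smarts]

    targets = [Chem.MolFromSmiles(s) for s in ("CCCC", "CCC=O", "Cc1ccccc1")]
    counter = MatchCounter(targets)
    print(counter.count("[#6]-[#6]-[#6]"))  # computed once ...
    print(counter.count("[#6]-[#6]-[#6]"))  # ... then served from the cache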

[slide 37]
If you think about this a bit you might realize that there are multiple ways to represent the same search using SMARTS. Perhaps I can save time by canonicalizing the SMARTS strings, in order to reduce the number of duplicate subgraph isomorphism searches.

[Note: The general category of "canonical SMARTS" is NP-complete, as Roger Sayle pointed out in a 1997 Daylight EuroMUG meeting. However, the fmcs atom expressions are simple, so canonicalizing them is trivial.]

Greg Landrum added a feature to RDKit so I could test this out. I found that canonicalization in general took too much time. Remember, canonicalization only helps when the backtracking produces subgraphs which are isomorphic but represented differently, and, importantly, only when the canonicalization is faster than the isomorphism search. The combination of these factors was rare in my test benchmarks. For example, in one experiment I needed to find the MCS of over 1,000 structures before canonicalization was faster. I don't expect that to be a common case.

[slide 38]
Instead, I developed what I've been calling semi-canonical SMARTS. For each atom in the subgraph, I assign an initial atom invariant based on the atom type and the lexicographically ordered list of its subgraph bond types. This is a circular fingerprint of size 2, and corresponds to a single iteration of the Weininger et al. CANGEN algorithm. These invariants are fast to generate, and unique enough to give usually-canonical subgraphs.
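A sketch of those initial invariants, assuming the subgraph arrives as a list of atom types plus typed bonds; sorting the atoms by invariant gives the usually-canonical atom order.

    def initial_invariants(atom_types, bonds):
        # bonds: (atom_i, atom_j, canonical_bondtype) triples in the subgraph.
        incident = {i: [] for i in range(len(atom_types))}
        for i, j, bondtype in bonds:
            incident[i].append(bondtype)
            incident[j].append(bondtype)
        # Each invariant is the atom's own type plus its sorted incident
        # bond types - the "circular fingerprint of size 2" described above.
        return [(atom_types[i], tuple(sorted(incident[i])))
                for i in range(len(atom_types))]

    inv = initial_invariants(["[#6]", "[#6]", "[#8]"],
                             [(0, 1, "[#6]-[#6]"), (1, 2, "[#6]-[#8]")])
    print(sorted(range(3), key=lambda i: inv[i]))  # semi-canonical order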

[slide 39]
Finally, in my code I check for a timeout. This is, after all, an NP-hard problem, and there are times when I would rather give it a 30 second timeout for an approximate solution than spend hours waiting for an exact result.

MCS benchmark timings

[slide 40]
Now for the scary question: How well does this new algorithm work?

To examine that, I had to create benchmarks which reflect how I think the MCS program might be used. (Actually, I would have used existing benchmarks, but the only ones I could find were too small to be informative.)

I started with a cleaned up version of ChEMBL 13 and generated 7 different benchmarks. The first selects two different compounds at random and finds the MCS between them. The others are based on fingerprint similarity searches. I found the k=2, k=10, and k=100 nearest neighbors for fingerprints picked at random, and I found the k=100 nearest neighbors for thresholds of 0.95, 0.90, and 0.80.

The benchmarks and the code used to generate them are under the BitBucket account for fmcs.

[slide 41]
The benchmarks include cases which cause all of the MCS programs to time out after 30 seconds. When a timeout occurs, fmcs will report the largest substructure found by that point, which is usually pretty good, and it means that fmcs can also be used as an approximate MCS program with a fixed upper bound on CPU time. Not all MCS programs can report a partial solution.

[slide 42]
I was able to time fmcs against three other methods: extractCommonScaffold from the Indigo C++ cheminformatics toolkit, the MultiMCS maximum clique Java program I mentioned earlier, and MoSS, a frequent subgraph mining Java program that I'll mention in a bit.

It's difficult to compare the results directly. Unfortunately, most cheminformatics toolkits like to reperceive structures on input so they follow consistent aromaticity rules. This is what the Daylight toolkit did, and many follow its lead. While everyone agrees that "C1=CC=CC=C1" should be perceived as "c1ccccc1", they all differ with more complex fused multi-ring systems. Since the MCS search methods depend on bond types, different perceptions of an input structure can sometimes lead to different MCS results.

I inspected many of the differences until I started to go cross-eyed. As far as I can tell, the different programs only differ due to a difference in aromaticity perception.

[slide 43]
It's important to note that this benchmark does not give a perfect head-to-head comparison. Indigo's extractCommonScaffold and MoSS (when configured to find substructures with 100% support) return all maximal subgraphs, which means they are doing more work than fmcs, which only finds one maximum subgraph. Also, the four programs were timed on three different computers, so there can be a factor of 2 or more variation just because of the different hardware.

Still, it's enough to point out that fmcs isn't all that bad. I've circled the fastest times in red and the second fastest in yellow. It looks like MoSS is overall the fastest algorithm, but not always.

[slide 44]
fmcs definitely isn't the fastest method, but I'm okay with that. For one, most of fmcs is written in Python. A profile shows that less than 5% of the time is spent in C++. Presumably, if more of fmcs were in C++ then there might be a 5-10x speedup... which sounds suspiciously like what the authors of the Varkony paper said about their implementation!

Also, there are some additional pruning heuristics that I haven't yet added.

MCS threshold profile

[slide 45]
I mentioned that fmcs finds the MCS of M of N structures, but I didn't give any real justification for why that might be useful. I wanted to see if I could use the MCS to find misclassifications in a cluster and give some sort of automatic quality score. It's pretty common to use similarity or some other means to group structures together, then try to find the MCS of the group. Unfortunately, the MCS is a rather fragile definition. A single misclassified structure can cause a very big difference in the result. I wondered if an MCS definition which might leave one or two structures out would provide useful information.

[slide 46]
I used ChEBI as my test case. ChEBI is a database of small molecules placed into an ontology by human curators. Some of the ontology terms are structure related. While terms like "contains two rings" cannot be defined as a SMARTS pattern, others can in principle be defined by SMARTS.

There are occasional misclassifications in the ChEBI ontology. For example, one of the nine 1,4-benzodiazepinones shown here is incorrect. If you look at the depictions for a bit you might be able to figure out which one is wrong.

[slide 47]
Try to find the MCS of all 9 and you'll see that the expected benzodiazepinone core structure isn't found.

[slide 48]
But bring it down a step and find the MCS which is in at least 8 structures and you'll see the expected benzodiazepinone core in 8 of the structures, and quickly see the one misclassified structure. In this case it's a 1,5-benzodiazepinone.

[slide 49]
What I then did was compute the MCS size (as the number of atoms) for values of M from 4 to 9. (I actually used a binary search here so I didn't have to test all possible values of M. The MCS size can only shrink as M increases, so when the MCS for M=6 turned out to be the same size as the MCS for M=8, there was no need to compute the size for M=7.)
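Here's a sketch of that fill-in-the-profile search. It exploits the monotonicity directly: since sizes never increase with M, an interval whose endpoints agree must be constant throughout. The mcs_size callback and the toy numbers are hypothetical stand-ins for the real, expensive MCS computation.

    def threshold_profile(mcs_size, lo, hi):
        # mcs_size(m): number of atoms in the MCS common to >= m structures.
        sizes = {lo: mcs_size(lo), hi: mcs_size(hi)}
        def fill(a, b):
            if b - a < 2:
                return
            if sizes[a] == sizes[b]:
                for m in range(a + 1, b):
                    sizes[m] = sizes[a]   # constant in between; no MCS calls
            else:
                mid = (a + b) // 2
                sizes[mid] = mcs_size(mid)
                fill(a, mid)
                fill(mid, b)
        fill(lo, hi)
        return sizes

    fake_sizes = {4: 21, 5: 21, 6: 21, 7: 21, 8: 21, 9: 9}
    print(threshold_profile(lambda m: fake_sizes[m], 4, 9))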

The plot of M vs. MCS size gives me a threshold profile. An abrupt dip in the profile indicates that there's something I might need to look at.

I computed the profile with three different topology parameters. One profile requires that element types and bond types match, another only requires that element types match, and the last only looks at the untyped topology. This gives me different sorts of information which can help identify the reason for a mismatch.

[Note: the pure topology profile never proved useful.]

[slide 50]
To see if MCS threshold profiles can be useful, I developed a prototype tool to explore the ChEBI ontology. The ontology terms are on the left. Click on a term and it shows the corresponding profiles. The profile points are selectable. Clicking on a point shows the SMARTS pattern for that point, displayed on the right using Karen Schomburg's excellent SMARTSview. The structure members of the term are displayed on the bottom, sorted so the structures which don't contain the SMARTS are shown first, and structures which match have the match highlighted.

[slide 51]
The benzodiazepinone mismatch showed up directly in the profile based on atom and bond types. Sometimes it's the difference between two different MCS profile types which gives a useful signal. There appears to be a registration problem with the pyrazoles because five of them are not perceived as being aromatic by RDKit. In this case the profile based on bond types has a dip while the one which ignores bond types stays constant. This is often a signal that there is an aromaticity problem.

[slide 52]
The monocarboxylic term is interesting because it shows some of the different types of issues that come up with this approach. The SMARTS [#6]-[#6](=[#8])-[#8] matches 1424 of 1435 structures. However, this SMARTS requires a carbon in the R1 position, and there are two false negatives with structures which have a nitrogen in that position. (An obvious next step, which I haven't done, would be to use the SMARTS pattern to find additional structures which should have been classified as monocarboxylic.)
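The false negative behavior is easy to demonstrate with RDKit, using carbamic acid as a hypothetical stand-in for the nitrogen-at-R1 cases:

    from rdkit import Chem

    pattern = Chem.MolFromSmarts("[#6]-[#6](=[#8])-[#8]")
    # Acetic acid has a carbon at R1 and matches; carbamic acid, with a
    # nitrogen at R1, is missed by the pattern.
    print(Chem.MolFromSmiles("CC(=O)O").HasSubstructMatch(pattern))  # True
    print(Chem.MolFromSmiles("NC(=O)O").HasSubstructMatch(pattern))  # False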

There are four true negatives, where the structure simply isn't a monocarboxylic acid. Finally, five of the structures show an implementation bug. The original ChEBI structures are in an SD file, which has a mechanism to denote an R-group. My prototype tool uses SMILES strings, which don't have an R-group notation. The conversion from SD records to SMILES converted the R-group into a "*", which is treated as its own atom type.

[slide 53]
I examined many ChEBI terms. Nearly all of them either had no problem or were grouped by a structural term that could not be described by a SMARTS. With practice, I found that the MCS profile could help me identify possible problem areas, but I ended up being unable to come up with an automatic quality score.

Frequent subgraph mining

[slide 54]
Here's where the history gets interesting.

I presented the fmcs algorithm last fall at the German Conference on Cheminformatics (GCC). Two people asked me if the fmcs solution wasn't just the same as frequent subgraph mining.

[slide 55]
I had no idea - I hadn't heard of frequent subgraph mining. So I read some of the papers from that field. They nearly all come from data mining, and use a somewhat different and perhaps more difficult language than I'm used to, so it took me a while to figure them out.

[slide 56]
The best known frequent subgraph mining programs seem to be gSpan and MoSS. Reading them I get the strong sense that these are written by people who don't really know the underlying chemical motivation. For example, I interpret the original MoSS paper to say that the authors don't really understand why chemists are interested in closed molecular fragments, but they have a conjecture. Still, since they know that's what chemists want, there are some optimizations they can do to handle that.

[slide 57]
These papers introduce new relevant terms for me, which I should apply to the fmcs description. Where I say "in at least M structures", they say "has a support of M." A closed molecular fragment is a fragment with support M where there is no larger fragment which contains the fragment and which also has support M. In my terms, it's a maximal common substructure to at least M of the structures.

[slide 58]
Frequent subgraph mining is often used to find discriminative fragments, that is, fragments which are in the actives but are rarely shared by the inactives. This is actually a more sophisticated approach than most of the MCS papers, which tend to focus on actives and ignore information about inactives. I was a bit chagrined when I read about this because, while obvious in retrospect, I had never really thought about that issue in the MCS context.

gSpan and MoSS

[slide 59]
The gSpan and MoSS programs look at all of the possible embeddings across all of the structures. This can generously be seen as an extension of the McGregor method to N structures. They start with the empty set, then work on all unique embeddings using one atom, then one bond, two bonds, and so on. At each point they keep track of the support for that embedding. If it's too low then they prune.

The fundamental difference between gSpan and MoSS is how the graph embedding tree is traversed; gSpan uses a breadth-first search (the Apriori method) while MoSS uses a depth-first search (the Eclat method).

[slide 60]
The problem with either approach is that the number of embedded graphs can be huge. In a toy case there might be only a couple of ways to match a simple pattern like "O-S-C" ...

[slide 61]
... but in more complex cases, like with rings, it can lead to "prohibitively high search complexity."

The original MoSS paper partially handled the complexity through a local solution, which prevented some of the explosion where a benzene ring can cause 12 different embeddings.

[slide 62]
Subsequent work solved it through clever use of subgraph canonicalization. MoSS keeps track of the canonical representation for each substructure. When there are two different embeddings for the same substructure it uses the one which is lexicographically smallest. It does this by introducing a postfix version of SMILES, which is a perfect match to the depth-first machinery. The method is quite aware that the addition of a new bond can sometimes reorder the canonical representation, so it carefully monitors for all of those possible cases.

I've been doing cheminformatics for quite a while, and have never come across this approach before. I think it's brilliant and insightful thinking.

[slide 63]
It's my hypothesis that there's another way to describe the MoSS algorithm. Instead of thinking of the canonicalization as a way to remove duplicates, an alternative is to generate all canonical subgraphs in lexicographic order, and check for valid support across all of the structures.

If so, then that's part of the same intellectual family as the Varkony paper, where MoSS overcame many of the problems that Varkony had.

[slide 64]
What's amazing about this is that MoSS is designed as a frequent subgraph mining program, and not an MCS program. It finds the MCS as an edge case by setting the support threshold to 100%, but it can do much more. Yet in the timings, its performance was at least comparable to dedicated MCS programs!

Equally amazing is that it doesn't seem that the frequent subgraph mining people realize that they are doing a superset of the MCS algorithm. That connection is not mentioned in any of the frequent subgraph mining papers that I read, and you'll see later that gSpan avoids using subgraph isomorphism because it's NP-hard, seemingly not realizing that gSpan is itself solving an NP-hard problem.

"ad hoc" MCS methods

[slide 65]
Let's go back to the Raymond and Willett review paper. They classified Varkony as an "ad hoc" method. I used to think that was a bit odd, since fmcs can be seen as a variant of the Varkony approach, and fmcs's performance is pretty good.

[slide 66]
It wasn't until I learned more about frequent subgraph mining that I realized that Raymond and Willett were right. I think that Varkony et al. is miscategorized as an MCS paper, when it's actually a frequent subgraph mining paper, describing a program which finds all subgraphs with 100% support. As such, it does a lot more work than it needs to in order to find the MCS, and it lacks MCS-style pruning.

With that in mind, I reread the Varkony paper for the fourth or so time and saw the single line that validated my belief in my interpretation: "There are three options to display results of the search for structural similarities: (a) All. The program reports all common substructures of all sizes."

That's why I say that the Varkony paper should be seen as one of the earliest frequent subgraph mining papers.

The change to find all substructures with a given support is trivial, but as "frequent subgraph mining" didn't start until about 20 years later they likely didn't realize that it could be an interesting new field.

[slide 67]
Going back again to the review paper, Raymond and Willett also mention in passing that a modified version of the Varkony algorithm was subsequently proposed by Takahashi in 1987. I read through that paper, and realized that it replaced canonical subgraph generation with subgraph isomorphism. This is the intellectual predecessor to fmcs!

[slide 68]
Unfortunately, it's hard to figure that out from the paper. The English is difficult to follow and the subgraph isomorphism algorithm they use comes from Sussenguth 1965, which I hadn't heard of before. I'm surprised that a 1987 paper wouldn't have used the 1976 Ullmann method.

Based on the text, the Sussenguth algorithm is apparently very slow. Takahashi says that it's "the most time-consuming phase of the computing flow", and that canonicalization is "an indispensable prerequisite" for performance. However, in my own tests, subgraph isomorphism testing is fast and canonicalization is not required. I conclude this is due to RDKit's use of VF2 instead of the Sussenguth algorithm.

Thus, I think most people who read Takahashi believe that it, like Varkony, requires canonicalization, when it actually doesn't. The only reason I figured this out was that I implemented fmcs and showed that to be the case.

Putting those details aside, it was quite exciting to read that someone else had previously worked on this approach, and that I had managed to extend it in new and interesting ways.

[Note: I've researched a bit more of the data mining literature, and found that a few sources, like the book "Mining Graph Data" edited by Diane J. Cook and Lawrence B. Holder, with the chapter "Discovery of Frequent Substructures" by Xifeng Yan and Jiawei Han (the gSpan authors), reference Takahashi as a graph-related method in cheminformatics, but didn't notice that Takahashi is actually a variation on the Varkony paper from 10 years earlier.]

Why so little previous work like fmcs?

[slide 69]
The obvious followup question is, why haven't any other people tried the subgraph isomorphism approach before? I've shown that fmcs has decent performance, and it's not that hard to understand.

Searching the literature gives some possible clues. Going back to the Armitage paper from 1967, they say "substructure search is useless here, unless one searches for every substructure of one compound in the structure of the second." That, combined with pruning, is the fmcs algorithm. But in reading the paper, it looks to be a throw-away line meant to imply that it's a completely unfeasible approach. Which, to be fair, it was for the hardware of that era.

But more than that, we needed to develop a better understanding of backtracking and pruning, and improved subgraph searches with first Ullmann and then VF2 before the fmcs approach could be practicable. By the late 1990s though, these were pretty well understood.

I suspect it was a combination of other factors. The clique approach is quite elegant, and Raymond and Willett showed that it was faster than the usual McGregor backtracking approach. There aren't that many people working on the MCS problem in chemistry, and even fewer working on the multiple structure MCS problem, so progress is slow.

Then there's the shaky feeling that the Takahashi/fmcs approach is too complicated to work well. fmcs goes from subgraph to SMARTS and back to subgraph, and potentially wastes a lot of time re-finding the same substructure only to add a single bond to it. And it's true - I can conceive of faster implementations. But on the other hand, a lot of work is put into making SMARTS matchers go fast, and that's work I don't need to redo.

The original gSpan paper gives another clue. The computer scientists avoided using subgraph isomorphism tests because it's "an NP-complete problem, thus pruning false positives is costly." It's always a bit scary to invoke NP-complete problems.

While they are correct, they didn't notice two things. First, in practice isomorphism matching of molecules is very fast, and second, as I showed, when an algorithm like gSpan or MoSS is used to find all substructures with 100% support, it ends up identifying the maximum common substructure ... which is itself NP-complete.

[Note: This comment in the gSpan paper is one reason why I don't think the frequent subgraph mining researchers quite realize its connection to the MCS problem. Another is that I asked Thorsten Meinl about it, and he doesn't know of anyone who has specifically pointed that out.]

Future work

[slide 70]
So, where do we go for the future?

There's any number of features to add to fmcs. For example, it's pretty straightforward to add support for a "*" as an alternate symbol for, say, a ring atom. This would help identify cases where there's a single atom substitution. It's also possible to require a certain initial subgraph to be part of the result, add additional pruning heuristics, and recode the algorithm in C++. This is an engineering task which "just" requires time and funding.

That said, I came into this problem because of thinking about canonical subgraph enumeration, only to find that isomorphism searching was more useful for this topic. However, I think canonical subgraph enumeration is a fruitful research area for the future.

The main reason why is that canonical subgraphs can be precomputed, and processed as fast string comparisons.

[slide 71]
For example, when clustering N compounds, it's pretty easy to compute, say, the canonical subgraphs with up to 7 atoms. There are only, on average, a few thousand per structure. To compare structure i with j, first look at the intersection of their subgraphs. If the MCS is 6 atoms or less, then you have the answer. Otherwise, you know the MCS has to be at least of size 7.
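A sketch of that comparison, assuming each structure's canonical subgraphs with up to k=7 atoms have been precomputed as (atom_count, canonical_string) pairs:

    def mcs_size_from_keys(subs_i, subs_j, k=7):
        # subs_*: sets of (num_atoms, canonical_subgraph_string) pairs for
        # all connected subgraphs with up to k atoms.
        shared = subs_i & subs_j
        best = max((n for n, s in shared), default=0)
        if best < k:
            return best, True   # this is the exact MCS size
        return k, False         # only a lower bound; do the full MCS search

    size, exact = mcs_size_from_keys({(1, "C"), (2, "CC"), (3, "CCO")},
                                     {(1, "C"), (2, "CC")})
    print(size, exact)  # 2 True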

This might speed up clustering by quite a bit.

[slide 72]
NextMove Software has done the most advanced work in this field, to my knowledge. They precompute the canonical subgraphs for all of the structures in the data set, with links between each parent and child. If a child has multiple parents then that child is a common subgraph to all of those parents.

[slide 73]
Finding the MCS between any two structures then reduces to a graph traversal of a (very large) lattice to find the highest common node.

[slide 74]
I've also been thinking about using canonical subgraphs as perfect substructure keys for substructure search. Again, generate all canonical subgraphs for the structures in the database, up to a given size, and use these as keys for the inverted index. When a structure comes in, enumerate its subgraphs, look up the corresponding inverted indices, and find the intersection. If a query substructure doesn't have an inverted index, then no superstructure for the query exists in the database.
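A sketch of that screen, with plain Python dictionaries standing in for an engine like Lucene:

    from collections import defaultdict

    def build_index(db):
        # db: {molecule_id: set of canonical subgraph strings}.
        index = defaultdict(set)
        for mol_id, subgraphs in db.items():
            for s in subgraphs:
                index[s].add(mol_id)
        return index

    def screen(query_subgraphs, index):
        # Intersect the posting lists. A query subgraph with no posting
        # list means nothing in the database can contain the query.
        hits = None
        for s in query_subgraphs:
            if s not in index:
                return set()
            hits = index[s] if hits is None else hits & index[s]
        return hits if hits is not None else set()

    index = build_index({1: {"C", "CC", "CCO"}, 2: {"C", "CC", "CCN"}})
    print(screen({"C", "CC"}, index))  # {1, 2}
    print(screen({"CCS"}, index))      # set()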

What's pretty neat about this is an off-the-shelf database like Lucene should be able to handle this sort of data without a problem.

[slide 75]
On the other hand, most people don't want to change their database environment to include a new database engine, so the third idea I've had is to improve substructure screen key selection.

There are two main types of substructure screens: hash-based, like the Daylight fingerprints, and feature-based, like the MDL MACCS keys. The former tend to be more common because they handle arbitrary chemistry and are easy to tune to improve selectivity. The latter are usually made by hand, and the only other well-known example I can think of is the CACTVS substructure keys used by PubChem.

[slide 76]
Substructure screens are a sort of decision tree, and it's best if each bit can provide a discrimination of 50%. If that were the case then a 128 bit fingerprint would be able to provide very good screening ability. However, hash bits are correlated, which is why we need more bits, while feature key bits tend to have low density - only about 150 of the unique fragments in PubChem have a density above 10%, and many of those are highly correlated.

Unique canonical subgraphs can be used to characterize the universe of possible decisions. I've wondered if it were possible to analyze a set of representative queries and targets to improve screening selection. Importantly, a bit should be able to have multiple fragments associated with it, in order to improve its discrimination ability.

[Note: This is rather like optimizing the 20 questions problem, only all 20 questions must be selected beforehand. Quoting from Wikipedia: "Careful selection of questions can greatly improve the odds of the questioner winning the game. For example, a question such as "Does it involve technology for communications, entertainment or work?" can allow the questioner to cover a broad range of areas using a single question that can be answered with a simple "yes" or "no"."]

Unfortunately, I have not been able to come up with a good algorithm for this idea.

[slide 77]
This research involved a lot of software, and interaction with many people. I thank all of them for their help and support. Also, this presentation required digging into the early MCS literature. I thank the Chalmers University of Technology library for having most of the literature still on the shelves in paper, which meant I could do this research without hitting expensive paywalls all the time.

Comments?

Let me know if you have any comments or questions.

Andrew Dalke is an independent consultant focusing on software development for computational chemistry and biology. Need contract programming, help, or training? Contact me


