[ previous | newer ]     /home/writings/diary/archive/2011/01/13/faster_subgraph_enumeration

## Faster Subgraph enumeration

This is part 2 of a series on subgraph enumeration.

1. Simple subgraph enumeration algorithm
2. Faster subgraph enumeration
Rather than reading the details of how the new, improved algorithm works you could just download dfs_subgraph_enumeration.py. Short version: it's about 3.5 times faster than the simple_subgraph_enumeration.py implementation from part 1. If you want to run the test suite then download those both implementations as well as test_subgraph_enumeration.py.

## Subgraph generation does not require checks for duplicates

The subgraph enumeration algorithm I described the other day in part 1 works and it's easy to understand, but it isn't fast. It generates duplicate subgraphs and at every pass it considers all bonds, including those which are already in the subgraph.

Here's another simple subgraph enumeration algorithm which doesn't have these problems. A molecule has N atoms and M bonds. Therefore, there are 2M-1 ways to select at least one bond from the list of bonds. Each of these is a valid and unique subgraph, although it might not be connected. Select only those which are connected and you'll have the set of all connected subgraphs containing at least 1 bond. To that add the N subgraphs containing only one atom.

The result is an algorithm which generates all connected subgraphs, and which does not need duplication testing. It does, however, need connectedness testing, which the earlier algorithm did not.

## Subgraph enumeration without checks for duplicates or connectivity

What if we're careful? Another way to think of the algorithm is to think of all the subgraphs which have bond b1 plus those which don't have b1. This is a divide-and-conquer strategy. I know that any subgraph which has b1 must be connected to that bond, so I only need to look at the other bonds which come from the terminal atoms. And I know that subgraphs which don't have b1 must either have b2 or not have b2.

I think of this in similar terms to my first algorithm with seeds and ways to grow a seed. There are a set of seeds, which are:

• subgraphs which include b1,
• subgraphs which include b2 but not b1,
• subgraphs which include b3 but not b1, or b2,
• subgraphs which include b4 but not b1...3,
...
• subgraphs which include bM but not b1...(M-1).
and the way to grow a seed is to select bonds which are connected to a seed but which haven't already been considered for inclusion in the seed. I call one of these bonds an "extension", because it extends a subgraph.

The tricky part is to figure out how to select those bonds. After several attempts (and do bear in mind that I went through a lot of attempts and iterations to figure out this algorithm) I decided on the following:

1. Start with a single edge, and the set of edges which have already been considered. (If this is the graph made from bi then the set of edges which never again need consideration is b1..i.)

2. Look at the set of all extensions which are 1 away from the initial bond. (This is the same as the bonds connected to the atoms which are at the ends of b1.) There are 2n-1 combinations containing at least one extension. Make the corresponding 2n-1 new subgraphs. (If you only want subgraphs up to size k then do the check here.) This stage uses all the possibilities of incorporating the 1-away edges into the graph, which means they do not need to be considered in subsequent stages.

3. For each of these 1-away subgraphs, find all of the extensions which are 2 away from the initial bonds. (This is the same as the bonds connected to the atoms which were newly added during the previous step, and remember that there's no need to check the bonds b1...i nor the bonds checked in step 2. All possibilities regarding them have already been regarded and there's no need to use them again.) Generate the 2n-1 combinations of extensions and use them to make all of the 2-away graphs.

4, Use the 2-away graphs to make the 3-away graphs,

... and so on.

Keep iterating until no more expansions are possible, either because there are no more bonds to consider or because an expansion would be too large. In my code I ignore subgraphs with more than k atoms by skipping expansions which would exceed that limit.

### Implementation: getting things set up

The core implementation is simple. It starts with support for subgraphs of size 0.

```def generate_subgraphs(mol, k=5):
if k < 0:
raise ValueError("k must be non-negative")

# If you want nothing, you'll get nothing.
if k == 0:
return
```
For size 1 it returns singleton subgraphs, where a Subgraph class is identical to the "Seed" from last time; it contains "atoms" and "bonds" as frozensets.
```    # Generate all the subgraphs of size 1
for atom in mol.GetAtoms():
yield Subgraph(frozenset([atom]), frozenset())

# If that's all you want then that's all you'll get
if k == 1:
return
```
For size 2 I just iterate over all of the bonds, and get the atoms at the end of each bond. I'll use this to make the seeds for future extensions.
```    # Generate the intial seeds. Seed_i starts with bond_i and knows
# that bond_0 .. bond_i will not need to be considered during any
# growth of of the seed.
# For each seed I also keep track of the possible ways to extend the seed.
seeds = []
considered = set()
for bond in mol.GetBonds():
subgraph = Subgraph(frozenset([bond.GetBgn(), bond.GetEnd()]),
frozenset([bond]))
yield subgraph
extensions = find_extensions(considered, subgraph.atoms, subgraph.atoms)
if extensions:
seeds.append( (considered.copy(), subgraph, extensions) )
```
As you can see, a seed has:
• The set of bonds which have already been considered
• The subgraph itself
• The possible ways to extend the subgraph into a new subgraph

Some of the atoms in a subgraph were added during the previous iteration. The subgraph can only grow from bonds which are connected to those new atoms and which weren't previously considered. The "find_extensions" function (below) returns a list of all possible extensions, where an extension is represented as the 2-ple (bond, to_atom) and to_atom is None if and only if both atoms of the bond are new_atoms. This can happen in C1CC1 in the expansion of CCCC to C1CCC1 since the final extension is the ring bond which closes two atoms which are in the previous subgraph.

I use the term "internal extension" when the new bond connects two atoms which are already in the subgraph. I have to be careful because internal extensions will appear twice; once for each atom. I use a set so I don't get duplicates, and at the end add those back into the list of extensions.

```def find_extensions(considered, new_atoms, all_atoms):
extensions = []
internal_extensions = set()
for atom in new_atoms:
for outgoing_bond in atom.GetBonds():
if outgoing_bond in considered:
continue
other_atom = outgoing_bond.GetNbr(atom)
if other_atom in all_atoms:
# This this is an unconsidered bond going to
# another atom in the same graph. This will
# come up twice, so prevent duplicates.
else:
extensions.append( (outgoing_bond, other_atom) )

# Add the (unique) internal extensions to the list of extensions
extensions.extend((ext, None) for ext in internal_extensions)
return extensions
```

### Implementation: growing subgraph "seeds"

For larger subgraphs I do depth-first search by getting the last element of the "seeds" stack. (If I switch to a collections.deque and pop from the front then this becomes a breadth-first search.)

```    while seeds:
considered, subgraph, extensions = seeds.pop()

# I'm going to handle all 2**n-1 ways to expand using these
# sets of bonds, so there's no need to consider them during
# any of the future expansions.
new_considered = considered.copy()
new_considered.update(ext[0] for ext in extensions)

for new_atoms, new_subgraph in all_subgraph_extensions(subgraph, extensions, k):
assert len(new_subgraph.atoms) <= k
yield new_subgraph

# all of the ways to expand from the old atoms, then
# there's no other way to expand and I'm done.
if not new_atoms:
continue

# Start from the new atoms to find bonds which can be used
# for future extensions.
new_extensions = find_extensions(new_considered, new_atoms, new_subgraph.atoms)

if new_extensions:
seeds.append( (new_considered, new_subgraph, new_extensions) )
```
That really is all there is to the main algorithm. Although the function "all_subgraph_extensions" does take some explaining.an

### Implementation: making all possible extensions from a subgraph

The all_subgraph_extensions function generates the new subgraphs, which are extensions of the input subgraph. It goes through all 2n-1 combinations, excepting those which add too many atoms to the subgraph, and merges each combination with the input.

```def all_subgraph_extensions(subgraph, extensions, k):
# Generate up to 2**(len(extensions)-1) new subgraphs which are
# the possible extensions of the old subgraph. None of the new
# subgraphs will have more than k atoms.
assert len(subgraph.atoms) <= k
assert extensions
new_atoms_limit = k - len(subgraph.atoms)
# For each possible extension which is small enough
for new_atoms, combination in all_combinations(extensions, new_atoms_limit):
# Make the new subgraph
atoms = frozenset(chain(subgraph.atoms, new_atoms))
assert len(atoms) == len(subgraph.atoms) + len(new_atoms), "duplicate atom?"
bonds = frozenset(chain(subgraph.bonds, (ext[0] for ext in combination)))

# Also yield the new atoms so they can be used in the seed
yield new_atoms, Subgraph(atoms, bonds)
```
I need to generate all the combinations. For that I use a recursive function.
```def _all_combinations(extensions, last, i):
if i == last:
yield []
yield [extensions[i]]
else:
for subcombination in _all_combinations(extensions, last, i+1):
yield subcombination
yield [extensions[i]] + subcombination
```
Here you can see it in action.
```>>> list(_all_combinations("ABC", 2, 0))
[[], ['A'], ['B'], ['A', 'B'], ['C'], ['A', 'C'], ['B', 'C'], ['A', 'B', 'C']]
>>> len(list(_all_combinations("ABC", 2, 0)))
8
>>>
```
The first item will always be the empty list, which isn't a valid extension, so I'll always throw it away using iterator.next(). I'll also throw away any extensions which add too many atoms.
```def all_combinations(extensions, limit):
# Generate all (2**len(extensions))-1 ways to combine the
# extensions such that there is at least one extension in the
# combination and no combination has more than 'limit' atoms.
# Yield combinations as (set of new atoms, list of extensions)
n = len(extensions)
assert n >= 1
it = _all_combinations(extensions, n-1, 0)
it.next()  # the first contains no extensions; ignore it
for combination in it:
atoms = set(ext[1] for ext in combination if ext[1] is not None)
if len(atoms) > limit:
continue
yield atoms, combination
```
I include the list of new atoms in the yield statement since eventually they will be included as part of the new seed.

### Implementation: the code (this is not the final version!)

The canonical SMARTS generator and the self-tests are essentially unchanged from last time so I won't describe them. You can download the entire file as slower_dfs_subgraph_enumeration.py. (Why is this "slower"? Because in a bit I'll show a somewhat faster version of the same algorithm.)

## Cross-comparison testing

While this algorithm looks simple, it took me several days to develop. I was glad to have the simple algorithm which I could use to cross test because I kept finding cases I got wrong. I wrote a test program called "cross_test.py" which generates SMARTS counts for both algorithms and if they differ it gives a nice description of where they differ.

```# Compare my new DFS-based subgraph enumeration with my
# older and slower but easier to get right version.
from collections import defaultdict
from openeye.oechem import *

import simple_subgraph_enumeration
import slower_dfs_subgraph_enumeration

def smilin(smiles):
mol = OEGraphMol()
assert OEParseSmiles(mol, smiles)
return mol

test_cases = (
"C",
"CC",
"CN",
"C=N",
"C#N",
"C=CC#N",
"C1CC1",
"C1CN1",
"C1SN1",
"C1CCC1",
"C1SCN1",
"C1SNPCC1",
"c1ccccc1c2ccccc2",
"c1cc2c3cc1.c12c3cc2c3c1.c1cc2c3cc1",
"c1cc3ccc1c1cc3ccc1",

"C.C.C",
"CC.C.CCN",
"NOO.OON",
"N.O=C.[OH-].[NH4+]",
"c1ccccc1O.c1cccnc1.C1CCSCC1",

"O(CCNC(=O)c1c(onc1C)C)c1ccc(cc1)C",
"S=C(NC(C)(C)C)NNC(=O)c1cc([N+]([O-])=O)ccc1",
"S=C(NC(C)(C)C)NNC(=O)c1ccc([N+]([O-])=O)cc1",
"Brc1cc(OCC(=O)NNC(=S)NC(C)(C)C)ccc1",
"S=C(NC(C)(C)C)NNC(=O)c1ccc(C(C)C)cc1",
"Clc1c(CC(=O)NNC(=S)NC(C)(C)C)ccc(Cl)c1",
"Clc1c(CC(=O)NNC(=S)NC(C)(C)C)c(F)ccc1",
"Clc1cc(CC(=O)NNC(=S)NC(C)(C)C)ccc1Cl",
"Clc1ccc(OCC(=O)NNC(=S)NC(C)(C)C)cc1",
r"S=C(NC(C)(C)C)NNC(=O)/C=C\c1cc(F)ccc1",
r"Clc1c(/C=C\C(=O)NNC(=S)NC(C)(C)C)cccc1",
r"Clc1cc(/C=C\C(=O)NNC(=S)NC(C)(C)C)ccc1",
r"S=C(NC(C)(C)C)NNC(=O)/C=C\c1ccc(F)cc1",
"S=C(NC(C)(C)C)NNC(=O)COc1c(cccc1)C",
"S(CCC(=O)NNC(=S)NC(C)(C)C)c1ccccc1",
"Clc1c(OCC(=O)NNC(=S)NC(C)(C)C)cccc1",
"S=C(NC(C)(C)C)NNC(=O)COc1cc(ccc1)C",
"S=C(NC(C)(C)C)NNC(=O)COc1cc(F)ccc1",
"S=C(NC(C)(C)C)NNC(=O)COc1c(F)cccc1",
"Brc1ccc(OCC(=O)NNC(=S)NC(C)(C)C)cc1",
"S=C(NC(C)(C)C)NNC(=O)c1cc2CCCc2cc1",
"S=C(NC(C)(C)C)NNC(=O)c1ccc(cc1)COC",
"S(c1c(cccc1)C)CC(=O)NNC(=S)NC(C)(C)C",
r"Brc1cc(/C=C\C(=O)NNC(=S)NC(C)(C)C)ccc1",
)

def get_counts(it):
d = defaultdict(int)
for x in it:
d[x] += 1
return dict(d)

for smiles in test_cases:
mol = smilin(smiles)
for k in range(0, 10):
slow_smarts = get_counts(
simple_subgraph_enumeration.generate_canonical_smarts(mol, k))
fast_smarts = get_counts(
slower_dfs_subgraph_enumeration.generate_canonical_smarts(mol, k))
if slow_smarts != fast_smarts:
print "Error with", smiles, k
keys = list(slow_smarts) + list(fast_smarts)
keys.sort()
fmt = "%16s %5s %5s %5s"
print fmt % ("k", "slow", "fast", "==")
for k in keys:
s = slow_smarts.get(k, "-")
f = fast_smarts.get(k, "-")
print fmt % (k, s, f, s==f)
raise AssertionError
```

## Faster through better handling of "internal" and "external" extensions?

There are two types of extensions. One connects the subgraph to itself and the other expands it to include a new atom. I call these "internal" and "external" extensions. In my code I merged them into a single "extension" 2-element tuple containing bond and optional to_atom. (to_atom is None for internal extensions.)

This worked pretty well, but it precludes certain optimizations. For example, I don't have to worry about about internal extensions making the subgraph too large because internal subgraphs will never increase the atom count. For another example, if the subgraph can only grow by one atom and I have one external extension and two internal extensions, then there's no need to include the counting overhead to ensure that the three extensions will grow too large.

If I track the internal and external extensions separately then I can be a bit more clever about generating the new subgraphs. I'll change "find_extensions" to return both objects:

```def find_extensions(considered, new_atoms):
internal_extensions = set()
external_extensions = []
for atom in new_atoms:
for outgoing_bond in atom.GetBonds():
if outgoing_bond in considered:
continue
other_atom = outgoing_bond.GetNbr(atom)
if other_atom in all_atoms:
else:
external_extensions.append( (outgoing_bond, other_atom) )

return list(internal_extensions), external_extensions
```
which means I'll need to change the two places which call it to look more like:
```        internal_extensions, external_extensions = find_extensions(
considered, subgraph.atoms, subgraph.atoms)
if internal_extensions or external_extensions:
seeds.append( (considered.copy(), subgraph,
internal_extensions, external_extensions) )
```
Each seed now contains four objects intead of three which means some minor changes in computing the new considered set:
```        new_considered = considered.copy()
new_considered.update(internal_extensions)
new_considered.update(ext[0] for ext in external_extensions)
```
but the real big changes are in all_subgraph_extensions. I need to handle three different cases: if only internal extensions are present, if only external extensions are present, or if both types are present.

### Implementation: only internal extensions

Handling only internal extensions is the easiest: enumerate all combinations, none of which will have any atoms.

```    if not external_extensions:
# Only internal extensions (test case: "C1C2CCC2C1")
it = all_combinations(internal_extensions)
it.next()
for internal_ext in it:
# Make the new subgraphs
bonds = frozenset(chain(subgraph.bonds, internal_ext))
yield set(), Subgraph(subgraph.atoms, bonds)
return
```

### Implementation: only external extensions

If there are only external extensions then it's also pretty easy:

```    if not internal_extensions:
# Only external extensions
# If we're at the limit then it's not possible to extend
if limit == 0:
return
# We can extend by at least one atom.
it = limited_external_combinations(external_extensions, limit)
it.next()
for new_atoms, external_ext in it:
# Make the new subgraphs
atoms = frozenset(chain(subgraph.atoms, new_atoms))
bonds = frozenset(chain(subgraph.bonds, (ext[0] for ext in external_ext)))
yield new_atoms, Subgraph(atoms, bonds)
return
```

### Implementation: both internal and external extensions

Finally, if there's at least one of each extension type then I need to generate the cross-product of all internal and all external extensions. That's easy with itertools.product:

```>>> import itertools
>>> list(itertools.product("123", "AB"))
[('1', 'A'), ('1', 'B'), ('2', 'A'), ('2', 'B'), ('3', 'A'), ('3', 'B')]
>>>
```
and the actual code is
```    external_it = limited_external_combinations(external_extensions, limit)
it = product(all_combinations(internal_extensions), external_it)
it.next()
for (internal_ext, external) in it:
# Make the new subgraphs
new_atoms = external[0]
atoms = frozenset(chain(subgraph.atoms, new_atoms))
bonds = frozenset(chain(subgraph.bonds, internal_ext,
(ext[0] for ext in external[1])))
yield new_atoms, Subgraph(atoms, bonds)
```

### Implementation: all external extension combinations

The all_combinations function is as before. The new function limited_external_combinations is a variation designed for external extensions. It keeps track of the set of atoms used by the given extension combination and doesn't return extension combinations which are too large.

```def limited_external_combinations(container, limit):
"Generate all 2**len(container) combinations which do not have more than 'limit' atoms"
return _limited_combinations(container, len(container)-1, 0, limit)

def _limited_combinations(container, last, i, limit):
# Keep track of the set of current atoms as well as the list of extensions.
# (An external extension doesn't always add an atom. Think of
#   C1CC1 where the first "CC" adds two edges, both to the same atom.)
if i == last:
yield set(), []
if limit >= 1:
ext = container[i]
yield set([ext[1]]), [ext]
else:
for subatoms, subcombinations in _limited_combinations(container, last, i+1, limit):
assert len(subatoms) <= limit
yield subatoms, subcombinations
new_subatoms = subatoms.copy()
ext = container[i]
if len(new_subatoms) <= limit:
yield new_subatoms, [ext] + subcombinations
```

### Implementation: a simple command-line application

I also removed the mainline self-test and replaced it with an example of how to generate all canonical SMARTS subgraphs:

```if __name__ == "__main__":
import sys
if len(sys.argv) == 2:
smiles = sys.argv[1]
k = 5
elif len(sys.argv) == 3:
smiles = sys.argv[1]
k = int(sys.argv[2])
else:
raise SystemExit("""Usage: dfs_subgraph_enumeration.py <smiles> [<k>]
List all subgraphs of the given SMILES up to size k atoms (default k=5)
""")
mol = OEGraphMol()
assert OEParseSmiles(mol, smiles), "Could not parse input SMILES"
OESuppressHydrogens(mol, False, False, False)
for smarts in generate_canonical_smarts(mol, k):
print smarts
```
which is used like:
```% python dfs_subgraph_enumeration.py 'S1O=C1'
S
O
C
CS
OS
C=O
C=OS
CSO
C(=O)S
C1=OS1
% python dfs_subgraph_enumeration.py 'c1ccccc1C#N' 4 | sort | uniq -c | sort -n
1 C
1 C#N
1 Cc
1 Cc(c)c
1 N
1 cC#N
2 Ccc
2 Cccc
2 ccC#N
6 c
6 cc
6 ccc
6 cccc
%
```

The final code is too long to display inline so just download dfs_subgraph_enumeration.py as well as the test code test_subgraph_enumeration.py. You'll need the optional simple_subgraph_enumeration.py for all of the tests to run.

## Performance

My goal was to make the subgraph enumeration algorithm faster. Have I managed that? I wrote a program to measure the performance of the different algorithms. The new algorithm is about 3.5 times faster than the first one, and paying careful attention to the extensions gave me 7% better performance.

```import time

from openeye.oechem import *

# These are fastest-of-three timings for my test set

# 16.4 records/sec
#from simple_subgraph_enumeration import generate_canonical_smarts

# 55.0 records/sec
#from slower_dfs_subgraph_enumeration import generate_canonical_smarts

# 59.0/sec
from dfs_subgraph_enumeration import generate_canonical_smarts

# OEChem can parse about 3,300 records per second on my machine
ifs = oemolistream("/Users/dalke/teaching/Compound_09425001_09450000.sdf")
t1 = time.time()

# Parse 100 records
for i, mol in enumerate(ifs.GetOEGraphMols()):
title = mol.GetTitle()
OESuppressHydrogens(mol, False, False, False)
for smarts in generate_canonical_smarts(mol, 4):
pass
if i == 99:
break

i += 1
dt = time.time() - t1
# Number of records processed per second
print i / dt
```
I think this algorithm uses the best approach, but there are many ways to further speed it up. For examples: using the integers from GetIdx() might be better than storing the atom/bond objects directly in the set; I could use an array of flags rather than a set; I could hard-code the enumeration for the first 10 extensions rather than depends on Python's slow stack functions; and I could rewrite the code to work in Pyrex or C++.

But this is fast enough. At 50 structures per second it would take my laptop about 12 days to process a 50 million compound database. More likely I would pop over to Amazon, rent time on 100 machines, and have it done in a few hours.

Granted, I could have done the same with 3.5 times more computers, but having a clever algorithm makes me feel good. Besides distributed computing improves throughput, not response time. If I want to generate the subgraphs as part of a query then it's better to have the processing take 1/60th of a second than 1/15th.