Dalke Scientific Software: More science. Less time. Products
[ previous | newer ]     /home/writings/diary/archive/2011/01/10/subgraph_enumeration

Subgraph enumeration

This is part 1 of a series on subgraph enumeration.

  1. Simple subgraph enumeration algorithm
  2. Faster subgraph enumeration
The final code is available for download. However, bear in mind that the code in part 2 is about 3.5 times faster.

Graph enumeration (not the point of this article)

How many molecules exist with N atoms? That has no easy answer. A related graph theory question is "how many simple graphs exist with N nodes" but those include chemically impossible solutions. This is a well-studied field. If you want to go down that path, some pointers are A000602 from the Online Encyclopedia of Integer Sequences ("Number of n-node unrooted quartic trees; number of n-carbon alkanes C(n)H(2n+2) ignoring stereoisomers"), Brendan McKay's with with graph enumeration, including canonicalization with nauty. (Downstream implementations include N.I.C.E. in Sage with specializations in the canonicalization algorithm in InChI and in Indigo. A paper by Hartke and Radcliffe also explains the McKay algorithm.) For something more chemistry specific, see Mick Kappler's GenSmi project "to construct all possible compounds containing carbon, nitrogen, and oxygen with a molecular weight between 150 and 250" and Orion Jankowski's recent example code of simple enumeration using Haskell.

It's a very interesting topic, but as far as I can tell it's mostly theoretical. People conjecture about how enumeration might suggest new drug-like candidates, but I've not heard of any success along those lines. If you know of anything, please let me know.

Subgraph enumeration

Instead, I'm going to look at subgraph enumeration. That is, given a structure, I'll enumerate all its unique subgraphs up to a given size, and I'll count the number of times each unique subgraph exists.

My reason for doing this comes from my work with fingerprints. The MACCS keys and the CACTVS substructure keys were made with human insight into which patterns make good filters. I'm curious if I can use clustering, a genetic algorithm, or some other means to identify a set of substructures which are more effective at filtering. I have access to the target data (eg, PubChem), and I suspect that knowledge will be helpful.

Subgraph enumeration is not a new subject either, but I haven't come across much work using it, at least not in chemistry. There are a couple of papers by Gerta Rücker and Christoph Rücker from 2001:

On Finding Nonisomorphic Connected Subgraphs and Distinct Molecular Substructures
A computer program is introduced that first generates all connected subgraphs and then uses a combination of well-discriminating graph invariants to eliminate duplicates.
and
Substructure, Subgraph, and Walk Counts as Measures of the Complexity of Graphs and Molecules
Construction of all substructures and subgraphs is computationally demanding; therefore, two alternatives are pointed out for the treatment of large sets of compounds: (i) Often it will suffice to consider counts of substructures/subgraphs up to a certain number of edges only, information which is provided by the program much more rapidly.
but I didn't actually read those papers. (Hmm, should walk up the hill to the library.)

Initial subgraph enumeration algorithm

Graph algorithms can be tricky so I'm going to start with the simplest algorithm I can think of. This is deliberately not a fast one - that will be in my next essay. I'll develop this algorithm first then use it to cross-check the faster but more complicated one.

I'll start with a collection of "seeds". Each seed is a graph, and I'll grow the seeds using a bond to make a new seed, which I can use for future growth.

seeds = an empty collection
For each atom in the molecule:
  Mke a new seed with that atom and no bonds
  Add the seed to the collection of seeds

While there are seeds:
  Select and remove a seed from the collection
  For each bond in the molecule:
    If the bond is in the seed then:
       Skip this bond
    Else if the neither atom of the bond is in the seed then:
       Skip this bond
    Else if (one of the bond atoms is not in the seed and
             the seed is at maximum size) then:
       Skip this bond
    Otherwise:
      Create a new seed which is the old seed plus this new bond
      Add the new seed to the collection
I think you can see that this will work. There will be duplicates because if I start from "C#N" with the seeds "C" and "N" then I'll grow the "C" to "C#N" and I'll grow the "N" to "N#C", but this is a good first pass.

I know the OpenEye tools the best so I'll use OEChem to implement the algorithm. There's nothing toolkit-specific about what I'm doing.

from openeye.oechem import *

class Seed(object):
    def __init__(self, atoms, bonds):
        self.atoms = atoms
        self.bonds = bonds

# Add a new bond to the seed when both end atoms are already in the seed
def new_seed_with_bond(seed, bond):
    #print "Add bond", bond.GetIdx()
    assert bond not in seed.bonds
    atoms = seed.atoms
    bonds = seed.bonds.copy()
    bonds.add(bond)
    return Seed(atoms, bonds)

# Add a new atom to the seed, and the bond which connects it to the seed
def new_seed_with_atom_and_bond(seed, atom, bond):
    #print "Add atom and bond", atom.GetIdx(), bond.GetIdx()
    assert atom not in seed.atoms
    assert bond not in seed.bonds
    atoms = seed.atoms.copy()
    bonds = seed.bonds.copy()
    atoms.add(atom)
    bonds.add(bond)
    return Seed(atoms, bonds)

def print_seed(seed):
    for atom in seed.atoms:
        print atom.GetIdx(), OEGetAtomicSymbol(atom.GetAtomicNum())
    for bond in seed.bonds:
        if bond.IsAromatic():
            bondtype = "~"
        else:
            bondtype = "X-=#$"[bond.GetOrder()]
        print bond.GetBgnIdx(), bond.GetEndIdx(), bondtype, bond.GetIdx()

seed_count = 0
def report_seed(seed):
    global seed_count
    seed_count += 1
    print "=== New seed (%d) ===" % seed_count
    print_seed(seed)
    print "==== end ==="
        

mol = OEGraphMol()
## Various test cases
#OEParseSmiles(mol, "SC=N")
#OEParseSmiles(mol, "SC=N.[Cl-]")
OEParseSmiles(mol, "S1C=N1")
#OEParseSmiles(mol, "c1ccccc1O")

# Maximum number of atoms to allow
k = 10
assert k >= 1, "need a bit more work to handle that case"

# Start with an empty collection
seeds = []

# Set up the initial set of seeds
for atom in mol.GetAtoms():
    # Make a new seed with that atom and no bonds
    seed = Seed(set([atom]), set())
    report_seed(seed)
    # Add the seed to the collection of seeds
    seeds.append(seed)

# While there are seeds
while seeds:
    # Select and remove a seed from the collection
    seed = seeds.pop()

    # For each bond in the molecule
    for bond in mol.GetBonds():
        # If the bond is in the seed then skip this bond
        if bond in seed.bonds:
            continue
        
        new_atom = None
        # If neither atom of the bond is in the seed then skip this bond
        if bond.GetBgn() not in seed.atoms:
            if bond.GetEnd() not in seed.atoms:
                continue
            new_atom = bond.GetBgn()
        else:
            if bond.GetEnd() not in seed.atoms:
                new_atom = bond.GetEnd()

        # If one of the bond atoms is not in the seed
        # and the seed is at maximumize, then skip this bond
        if new_atom is not None:
            if len(seed.atoms) == k:
                continue
            # Create the new seed with this atom and bond
            new_seed = new_seed_with_atom_and_bond(seed, new_atom, bond)
        else:
            # Create a new seed where the atoms are already in the
            # seed but the bond is new
            new_seed = new_seed_with_bond(seed, bond)

        # Add the new seed to the collection
        report_seed(new_seed)
        seeds.append(new_seed)

I did purely manual testing to see if the algorithm worked. You can see traces of my test cases in the different OEParseSmiles calls. I did have bugs in my original code, like a report_seed(seed) instead of report_seed(new_seed). My experience with these sorts of algorithms is that the manual testing I did is good enough to catch the likely errors. At this point I should turn the manual tests into real tests, but I'm going to hold off on that until I have a better way to represent the results.

The uncommented test case is "S1C=N1". I chose that because each atom has a different symbol, one of the bonds is different than the others, and because getting the cycle right is important. When I run the above code I get 33 different subgraphs:

=== New seed (1) ===
0 S
==== end ===
=== New seed (2) ===
1 C
==== end ===
=== New seed (3) ===
2 N
==== end ===
=== New seed (4) ===
2 N
0 S
0 2 - 0
==== end ===
=== New seed (5) ===
2 N
1 C
1 2 = 2
==== end ===
=== New seed (6) ===
2 N
1 C
0 S
1 2 = 2
0 2 - 0
==== end ===
=== New seed (7) ===
2 N
1 C
0 S
1 2 = 2
0 1 - 1
==== end ===
=== New seed (8) ===
2 N
1 C
0 S
1 2 = 2
0 1 - 1
0 2 - 0
==== end ===
=== New seed (9) ===
2 N
1 C
0 S
1 2 = 2
0 2 - 0
0 1 - 1
==== end ===
=== New seed (10) ===
2 N
0 S
1 C
0 2 - 0
0 1 - 1
==== end ===
=== New seed (11) ===
2 N
0 S
1 C
0 2 - 0
1 2 = 2
==== end ===
=== New seed (12) ===
2 N
0 S
1 C
0 2 - 0
1 2 = 2
0 1 - 1
==== end ===
=== New seed (13) ===
2 N
0 S
1 C
0 2 - 0
0 1 - 1
1 2 = 2
==== end ===
=== New seed (14) ===
1 C
0 S
0 1 - 1
==== end ===
=== New seed (15) ===
1 C
2 N
1 2 = 2
==== end ===
=== New seed (16) ===
1 C
2 N
0 S
1 2 = 2
0 2 - 0
==== end ===
=== New seed (17) ===
1 C
2 N
0 S
1 2 = 2
0 1 - 1
==== end ===
=== New seed (18) ===
1 C
2 N
0 S
1 2 = 2
0 1 - 1
0 2 - 0
==== end ===
=== New seed (19) ===
1 C
2 N
0 S
1 2 = 2
0 2 - 0
0 1 - 1
==== end ===
=== New seed (20) ===
1 C
0 S
2 N
0 1 - 1
0 2 - 0
==== end ===
=== New seed (21) ===
1 C
0 S
2 N
0 1 - 1
1 2 = 2
==== end ===
=== New seed (22) ===
1 C
0 S
2 N
0 1 - 1
1 2 = 2
0 2 - 0
==== end ===
=== New seed (23) ===
1 C
0 S
2 N
0 1 - 1
0 2 - 0
1 2 = 2
==== end ===
=== New seed (24) ===
0 S
2 N
0 2 - 0
==== end ===
=== New seed (25) ===
0 S
1 C
0 1 - 1
==== end ===
=== New seed (26) ===
0 S
1 C
2 N
0 1 - 1
0 2 - 0
==== end ===
=== New seed (27) ===
0 S
1 C
2 N
0 1 - 1
1 2 = 2
==== end ===
=== New seed (28) ===
0 S
1 C
2 N
0 1 - 1
1 2 = 2
0 2 - 0
==== end ===
=== New seed (29) ===
0 S
1 C
2 N
0 1 - 1
0 2 - 0
1 2 = 2
==== end ===
=== New seed (30) ===
0 S
2 N
1 C
0 2 - 0
0 1 - 1
==== end ===
=== New seed (31) ===
0 S
2 N
1 C
0 2 - 0
1 2 = 2
==== end ===
=== New seed (32) ===
0 S
2 N
1 C
0 2 - 0
1 2 = 2
0 1 - 1
==== end ===
=== New seed (33) ===
0 S
2 N
1 C
0 2 - 0
0 1 - 1
1 2 = 2
==== end ===
You can easily see that there are a lot of duplicates. Not only do I generate every subgraph, but I go through the process of generating all possible ways to build up each subgraph. I would rather see the non-redundant subgraphs.

Post-process filter to remove duplicate seeds

I can easily filter out duplicates. My first solution was to convert the algorithm into a function so I could use it like this:

for seed in generate_substructures(mol, k=5):
    report_seed(seed)
I then wrote a wrapper to only pass through a substructure if it had never been seen before:
def unique_substructures(seeds):
    previous_substructures = set()
    for seed in seeds:
        unique_key = (frozenset(atom.GetIdx() for atom in seed.atoms),
                      frozenset(bond.GetIdx() for bond in seed.bonds))
        if unique_key in previous_substructures:
            continue
        previous_substructures.add(unique_key)
        yield seed

# Here's how to use the unique_substructures filters
for seed in unique_substructures(generate_substructures(mol)):
    report_seed(seed)

This works, and requires minimal change to the algorithm. Instead of 33 subgraphs there are now 10.

=== New seed (1) ===
0 S
==== end ===
=== New seed (2) ===
1 C
==== end ===
=== New seed (3) ===
2 N
==== end ===
=== New seed (4) ===
2 N
0 S
0 2 - 0
==== end ===
=== New seed (5) ===
2 N
1 C
1 2 = 2
==== end ===
=== New seed (6) ===
2 N
1 C
0 S
1 2 = 2
0 2 - 0
==== end ===
=== New seed (7) ===
2 N
1 C
0 S
1 2 = 2
0 1 - 1
==== end ===
=== New seed (8) ===
2 N
1 C
0 S
1 2 = 2
0 1 - 1
0 2 - 0
==== end ===
=== New seed (9) ===
2 N
0 S
1 C
0 2 - 0
0 1 - 1
==== end ===
=== New seed (10) ===
1 C
0 S
0 1 - 1
==== end ===
The problem is that the underlying code still generates all 33 subgraphs, even though 23 of them will be removed. It's much better to push the uniqueness testing lower into the code. If I create a new seed which is identical to a previously seen seed then I should discard it. It adds no new information, and bogs down the system.

Making hashable seeds (__hash__, __eq__, and frozensets)

The standard way to check for uniqueness in Python is to use a set.

>>> text = "Substructure"
>>> previous_letters = set()
>>> for c in text:
...   if c in previous_letters:
...     print "Duplicate", c
...   else:
...     print "First time", c
...     previous_letters.add(c)
... 
First time S
First time u
First time b
First time s
First time t
First time r
Duplicate u
First time c
Duplicate t
Duplicate u
Duplicate r
First time e
>>> 
I would like to say
previous_seeds = set()
...
    if new_seed in previous_seeds:
        continue
            
    yield new_seed
    # Add the new seed to the collection
    seeds.append(new_seed)
    # Prevent future duplicates
    previous_seeds.add(new_seed)
but it's not so easy. How does the set know if two elements are the same? By default it doesn't:
>>> x = set()
>>> x.add(Seed(set([1]), set()))
>>> x.add(Seed(set([1]), set()))
>>> len(x)
2
>>> 
The dict, set, and frozenset datatypes are "hashed collections", meaning that they are internally based on hash tables. Hash collections contain hashable items, meaning: By default, user-defined classes have a __hash__ and __eq__ based on the object's id. (You can get the id yourself with id(x); it's basically the instance's memory location.) This means all such instances can be stored in a hashed collection and no two objects are alike.

But as you can see, that's not what I want. I need to define new __eq__ and __hash__ methods. The first is easy:

    def __eq__(self, other):
        return (self.atoms == other.atoms and
                self.bonds == other.bonds)
the latter is a bit harder. I want a hash value which depends on the atoms and bonds. The easiest is
    def __hash__(self):
        return hash(self.atoms) ^ hash(self.bonds)
which uses the seldom-seen xor operator ("^") to combine the hashes of the set of atoms and the set of bonds.

This will not work. The 'set' data type is not hashable.

>>> x = set()
>>> hash(x)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'set'
>>> 
This is because sets are mutable. There's no guarantee that hash(x) will be constant over time, and that breaks a hashable requirement.

Instead, I'll have to switch over to a frozenset, which is an immutable datatype (meaning it cannot be modified).

>>> hash(frozenset([1,2,3,4]))
5527014739486817160
>>> x = frozenset([1,2,3,4])
>>> hash(x)
5527014739486817160
>>> x.add(3)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'frozenset' object has no attribute 'add'
>>> 
That means I need to change a few other bits of code which assume a set instead of a frozenset. For example:
    bonds = seed.bonds.copy()
    bonds.add(bond)
I'll change the three cases of set.copy() followed by set.add() into:
from itertools import chain
...
    bonds = frozenset(chain(seed.bonds, [bond]))
and I need to change the initial seed creation from
        seed = Seed(set([atom]), set())
to
        seed = Seed(frozenset([atom]), frozenset())

Duplicate removal while generating seeds

Put all of these together and I get:

from itertools import chain
from openeye.oechem import *

class Seed(object):
    def __init__(self, atoms, bonds):
        # These are frozensets
        self.atoms = atoms
        self.bonds = bonds

    # I'll put a Seed in a set to test for uniqueness.
    # This means seeds must be immutable and they must
    # implement __hash__ and __eq__. If two sets are
    # equal then they must generate the same hash and
    # __eq__ returns 1. If two sets are different then
    # __eq__ must return 0.
    
    def __hash__(self):
        # Base the hash on the xor-ed hashes of the
        # set of atoms and the set of bonds.
        return hash(self.atoms) ^ hash(self.bonds)
    def __eq__(self, other):
        return (self.atoms == other.atoms and
                self.bonds == other.bonds)

def new_seed_with_bond(seed, bond):
    assert bond not in seed.bonds
    atoms = seed.atoms
    # Since seed.bonds is a frozenset, I can't copy and modify.
    # Instead, build the frozenset input list directly.
    bonds = frozenset(chain(seed.bonds, [bond]))
    return Seed(atoms, bonds)

def new_seed_with_atom_and_bond(seed, atom, bond):
    assert atom not in seed.atoms
    assert bond not in seed.bonds
    atoms = frozenset(chain(seed.atoms, [atom]))
    bonds = frozenset(chain(seed.bonds, [bond]))
    return Seed(atoms, bonds)

def print_seed(seed):
    for atom in seed.atoms:
        print atom.GetIdx(), OEGetAtomicSymbol(atom.GetAtomicNum())
    for bond in seed.bonds:
        if bond.IsAromatic():
            bondtype = "~"
        else:
            bondtype = "X-=#$"[bond.GetOrder()]
        print bond.GetBgnIdx(), bond.GetEndIdx(), bondtype, bond.GetIdx()

seed_count = 0
def report_seed(seed):
    global seed_count
    seed_count += 1
    print "=== New seed (%d) ===" % seed_count
    print_seed(seed)
    print "==== end ==="


def generate_unique_substructures(mol, k=5):
    # If you don't want any atoms then you won't get any subgraphs
    if k < 1:
        return

    # Keep track of every seed I've seen except the seeds
    # of size 1 (since there's no way they can repeat)
    previous_seeds = set()
    
    seeds = []

    for atom in mol.GetAtoms():
        seed = Seed(frozenset([atom]), frozenset())
        yield seed
        seeds.append(seed)

    while seeds:
        seed = seeds.pop()
        for bond in mol.GetBonds():
            # If the bond is in the seed then skip this bond
            if bond in seed.bonds:
                continue

            new_atom = None
            # If neither atom of the bond is in the seed then skip this bond
            if bond.GetBgn() not in seed.atoms:
                if bond.GetEnd() not in seed.atoms:
                    continue
                new_atom = bond.GetBgn()
            else:
                if bond.GetEnd() not in seed.atoms:
                    new_atom = bond.GetEnd()

            # If one of the bond atoms is not in the seed
            # and the seed is at maximumize, then skip this bond
            if new_atom is not None:
                if len(seed.atoms) == k:
                    continue
                # Create the new seed with this atom and bond
                new_seed = new_seed_with_atom_and_bond(seed, new_atom, bond)
            else:
                # Create a new seed where the atoms are already in the
                # seed but the bond is new
                new_seed = new_seed_with_bond(seed, bond)

            # Check for duplicates
            if new_seed in previous_seeds:
                continue
            
            yield new_seed
            # Add the new seed to the collection
            seeds.append(new_seed)
            # Prevent future duplicates
            previous_seeds.add(new_seed)

mol = OEGraphMol()
#OEParseSmiles(mol, "SC=N")
#OEParseSmiles(mol, "SC=N.[Cl-]")
OEParseSmiles(mol, "S1C=N1")
#OEParseSmiles(mol, "c1ccccc1O")

for seed in generate_unique_substructures(mol):
    report_seed(seed)
I tested it out and I got the same results as the previous algorithm which generated all duplicate substructures and post-filtered the duplicates.

Notice the testing theme I'm developing here? I often start with a simple algorithm and get it working, then use it to cross-check a more powerful but also more complex algorithm.

Generating canonical SMARTS for each subgraph

I'm not done yet though. What I've done is generate all the unique substructures of an input structure. What I really want is all the unique canonical substructures, and preferably in SMARTS form. For example, if the input SMILES is "c1ccccc1O" with k=2 there should be 6 "c" matches, 1 "O" match, 6 "cc" matches, and 1 "cO" (or perhaps "Oc" depending on the canonicalization).

I don't know of a toolkit which lets me generate canonical substructures or canonical SMARTS from a subgraph, but for this problem it's easy to create one from the parts I do have. I say "for this problem" because I'm only worried about the element type and aromaticity, and not chirality, stereochemistry, or even charge.

The algorithm is simple, but not obvious. I make a new molecule which is isomorphic to the substructure, create it canonical SMILES, and use that SMILES to create the correct SMARTS.

The obvious problem is that a substructure may correspond to "ccC", which is not a valid SMILES. Most toolkits don't want to deal with that. OEChem is an exception. It does a very good job of respecting aromaticity flags even when it doesn't make chemical sense. But I want a solution which works in every toolkit. (I also ran into some canonicalization problems in OEChem for these patterns. I didn't resolve to my satisifaction and it could be a misunderstanding on my side.)

What I need is some way to encode the aromaticity in the SMILES in such a way that I can still get proper canonicalization and so that I can decode it. After a few attempts at different encodings, I decided to to use the isotope field, where isotope 1 mean "not aromatic" and 2 means "aromatic." For example, "[1C]" means "C" and "[2C]" means "c". The fragment "ccC" becomes the SMILES "[2C][2C][1C]", which is easily canonicalized and decoded as "ccC".

(If you also want the charge and hydrogen counts then you can encode those directly. With something like

new_atom.SetIsotope(old_atom.GetIsotope()*2+1 + old_atom.IsAromatic())
you can also encode the old isotope information.)

So I turn the substructure into a non-aromatic canonical SMILES but with enough information that I can turn that back into the aromatic canonical SMARTS.

Note: There is a possibility that "cc" will appear in the result when it should be "c-c" (consider two aromatics in different rings, but also bonded to each other through a non-aromatic ring bond) but that's rare. For my purposes there's no problem because "cc" will match that bond and nothing will ever generate "c-c" so it won't get double counted. Instead, "c-c" and "cc" will both be counted as "cc".

Here's the code to convert the seeds to canonical SMARTS as post-processing step:

import re

# I'm going to convert a substructure into a new molecule,
# canonicalize the new molecule, and use the canonical SMILES to
# create a canonical SMARTS for the original substructure.

# I'll preserve the aromaticity information by encoding (atomic
# number, is_aromatic) as (atomic number, isotope number) where the
# isotope == 1 for not aromatic and 2 for aromatic.  The corresponding
# SMILES will be either the form "[1*]" or "[2*]" where "*" is a valid
# atomic symbol.

# Given the SMILES I can reverse that with a string transformation.
# To make that easier I'll enumerate all possible reverse transformations.

organic_subset = set([5,6,7,8,9,15,16,17,35,53])

decode_mapping = {"[1*]": "*",  # Encode these manually
                  "[2*]": "*"}
for atomno in range(1, OEElemNo_MAXELEM):
    symbol = OEGetAtomicSymbol(atomno)
    if atomno in organic_subset:
        atom_smarts = symbol
    else:
        atom_smarts = "[%s]" % symbol
    decode_mapping["[1%s]" % symbol] = atom_smarts
    decode_mapping["[2%s]" % symbol] = atom_smarts.lower()

# Helper function to transform the SMILES back to the correct SMARTS
_atom_name_pat = re.compile(r"\[[12]\w+\]")
def replace_atom_terms(m, decode_mapping = decode_mapping):
    return decode_mapping[m.group()]


def generate_canonical_smarts(mol, k=2):
    for seed in generate_unique_substructures(mol, k):
        # Make a new molecule based on the substructure
        new_mol = OEGraphMol()
        new_atoms = {}
        for atom in seed.atoms:
            new_atom = new_mol.NewAtom(atom.GetAtomicNum())
            # 1 if not aromatic, 2 if aromatic
            new_atom.SetIsotope(atom.IsAromatic()+1)
            new_atoms[atom.GetIdx()] = new_atom
            
        for bond in seed.bonds:
            new_bond = new_mol.NewBond(new_atoms[bond.GetBgnIdx()],
                                       new_atoms[bond.GetEndIdx()],
                                       bond.GetOrder())
            # GetOrder() sets the Kekule form, but I want the aromatic
            # form.  If I don't do this I end up with terms like "c=c"
            # in my results. Setting it to a single bond works because
            # the output will always have "" as the bond (as in "CC",
            # "cC" or "cc"). The only way to get an incorrect result
            # is if the structure should be "c-c", which is rare. But
            # since the "" connection means "single or aromatic", then
            # the SMARTS won't miss it, so we're okay. As long as I'm
            # consistent.
            if bond.IsAromatic():
                new_bond.SetOrder(1)

        smiles = OECreateIsoSmiString(new_mol)
        smarts = _atom_name_pat.sub(replace_atom_terms, smiles)
        #print smiles, "->", smarts
        yield smarts

if __name__ == "__main__":
    mol = OEGraphMol()
    #OEParseSmiles(mol, "SC=N")
    #OEParseSmiles(mol, "S1C=N1")
    OEParseSmiles(mol, "c1ccccc1C#N")
    #OEParseSmiles(mol, "c1ccc1")

    for smarts in generate_canonical_smarts(mol, 3):
        print smarts
(If you are working with OEChem and you want something more sophisticated which will let you include SMARTS patterns like "X3" in your result, then there might be another solution via OESMILESFlag_SuperAtoms and SetName(). Remember, I'm available for consulting!)

Counting the unique canonical SMARTS

When I run the above I get

c
c
c
c
c
c
C
N
C#N
cC#N
Cc
Ccc
Ccc
cc
cc
ccc
ccc
ccc
cc
ccc
cc
ccc
cc
ccc
cc
and it's a simple step to turn that into counts. (Python 2.7's collections.Counter would make this even easier.)
    from collections import defaultdict
    counts = defaultdict(int)
    for smarts in generate_canonical_smarts(mol, 3):
        counts[smarts] += 1
    for v, k in sorted((v,k) for (k,v) in counts.items()):
        print v, k
The output is:
1 C
1 C#N
1 Cc
1 N
1 cC#N
2 Ccc
6 c
6 cc
6 ccc
This is easy to verify manually, and easier to verify this set than the list of atoms / list of bonds style I showed earlier.

Stress testing

Is there some way I can stress the code? Yes. I have a set of structures with explicit hydrogens, like

C(C1=C(C(=O)C(=C(C1=O)[H])[H])[H])([H])([H])[H]
S(Sc1nc2c(c(c(c(c2s1)[H])[H])[H])[H])c3nc4c(c(c(c(c4s3)[H])[H])[H])[H]
N(=O)(=O)c1c(c(c(c(c1O[H])Cl)[H])[N+](=O)[O-])[H]
and when I send one of those through it's obvious that I want to remove hydrogens from any input structure. In OEChem that's
OESuppressHydrogens(mol, False, False, False)
How about testing the patterns? My first thought was the if I find six "cccccc" then a unique SMARTS match must match six times. This turns out not to be the case because unique SMARTS matches are unique only with the set of atoms, and not the set of atoms + set of bonds. Benzene yields 6 "cccccc" substructures with my algorithm, but only has 1 unique SMARTS match. (Indigo does have search option like I want, and if I was being more rigorous I would use it to help out.)

Much as I tried, that approach does not work. I tried something else. I can check that each pattern matches as least once. Also, since the patterns are canonical, that means two different patterns will never match against the same set of atoms+bonds. This I can test!

if __name__ == "__main__":
    for line in open("/Users/dalke/databases/nci_09425001_09450000.smi"):
        line = line.strip()
        print "Process", line
        mol = OEGraphMol()
        assert OEParseSmiles(mol, line), repr(line)
        OESuppressHydrogens(mol, False, False, False)
        
        # No two patterns should match the same set of atoms+bonds
        matched_sets = {}
        for smarts in set(generate_canonical_smarts(mol, 7)):
            pat = OESubSearch()
            assert pat.Init(smarts)
            at_least_one = False
            # Search for non-unique matches so I get all possible
            # sets of atom and bond matches.
            for m in pat.Match(mol):
                at_least_one = True
                atoms = frozenset(m.GetTargetAtoms())
                bonds = frozenset(m.GetTargetBonds())
                # This should be unique, unless the pattern has
                # symmetry in which case it's only been matched by
                # this pattern.
                q = (atoms, bonds)
                if q in matched_sets:
                    assert matched_sets[q] == smarts, (smarts, matched_sets[q])
                matched_sets[q] = smarts
            assert at_least_one, smarts
        #print "Good"
That took a while, but no failures! That leads me to suspect that my algorithm is working. Yay!

Download the final code

Here's the final version with everything, so you don't need to assemble it yourself. Or you can download it as simple_subgraph_enumeration.py.

# Details available at
#   http://dalkescientific.com/writings/diary/archive/2011/01/10/subgraph_enumeration.html
# For a faster subgraph enumeration algorithm see
#   http://dalkescientific.com/writings/diary/archive/2011/01/13/faster_subgraph_enumeration.html
import re
from itertools import chain

from openeye.oechem import *

# A seed is a substructure.
# It contains a frozenset of atoms and a frozenset of molecules.
class Seed(object):
    def __init__(self, atoms, bonds):
        # These are frozensets
        self.atoms = atoms
        self.bonds = bonds

    # The following lets seeds be put in a set to test for uniqueness
    def __hash__(self):
        # Base the hash on the xor-ed hashes of the
        # set of atoms and the set of bonds.
        return hash(self.atoms) ^ hash(self.bonds)
    def __eq__(self, other):
        return (self.atoms == other.atoms and
                self.bonds == other.bonds)

# Create a new seed from an old seed plus a bond.
# Both end atoms for the bond must already be in the seed.
def new_seed_with_bond(seed, bond):
    assert bond not in seed.bonds
    atoms = seed.atoms
    bonds = frozenset(chain(seed.bonds, [bond]))
    return Seed(atoms, bonds)

# Create a new seed from an old seed plus a bond linking
# an atom already in the seed to a new atom not in the seed.
def new_seed_with_atom_and_bond(seed, atom, bond):
    assert atom not in seed.atoms
    assert bond not in seed.bonds
    atoms = frozenset(chain(seed.atoms, [atom]))
    bonds = frozenset(chain(seed.bonds, [bond]))
    return Seed(atoms, bonds)

def generate_unique_substructures(mol, k=5):
    "Generate all unique substructures (as seeds) in a molecule, up to size 'k'"
    # If you don't want any atoms then you won't get any subgraphs
    if k < 1:
        return

    # Keep track of every seed I've seen except the seeds
    # of size 1 (since there's no way they can repeat)
    previous_seeds = set()
    
    seeds = []

    for atom in mol.GetAtoms():
        seed = Seed(frozenset([atom]), frozenset())
        yield seed
        seeds.append(seed)

    while seeds:
        seed = seeds.pop()
        for bond in mol.GetBonds():
            # If the bond is in the seed then skip this bond
            if bond in seed.bonds:
                continue

            new_atom = None
            # If neither atom of the bond is in the seed then skip this bond
            if bond.GetBgn() not in seed.atoms:
                if bond.GetEnd() not in seed.atoms:
                    continue
                new_atom = bond.GetBgn()
            else:
                if bond.GetEnd() not in seed.atoms:
                    new_atom = bond.GetEnd()

            # If one of the bond atoms is not in the seed
            # and the seed is at maximumize, then skip this bond
            if new_atom is not None:
                if len(seed.atoms) == k:
                    continue
                # Create the new seed with this atom and bond
                new_seed = new_seed_with_atom_and_bond(seed, new_atom, bond)
            else:
                # Create a new seed where the atoms are already in the
                # seed but the bond is new
                new_seed = new_seed_with_bond(seed, bond)

            # Check for duplicates
            if new_seed in previous_seeds:
                continue
            
            yield new_seed
            # Add the new seed to the collection
            seeds.append(new_seed)
            # Prevent future duplicates
            previous_seeds.add(new_seed)


# I'm going to convert a substructure into a new molecule,
# canonicalize the new molecule, and use the canonical SMILES to
# create a canonical SMARTS for the original substructure.

# I'll preserve the aromaticity information by encoding (atomic
# number, is_aromatic) as (atomic number, isotope number) where the
# isotope == 1 for not aromatic and 2 for aromatic.  The corresponding
# SMILES will be either the form "[1*]" or "[2*]" where "*" is a valid
# atomic symbol.

# Given the SMILES I can reverse that with a string transformation.
# To make that easier I'll enumerate all possible reverse transformations.

organic_subset = set([5,6,7,8,9,15,16,17,35,53])

decode_mapping = {"[1*]": "*",  # Encode these manually
                  "[2*]": "*"}
for atomno in range(1, OEElemNo_MAXELEM):
    symbol = OEGetAtomicSymbol(atomno)
    if atomno in organic_subset:
        atom_smarts = symbol
    else:
        atom_smarts = "[%s]" % symbol
    decode_mapping["[1%s]" % symbol] = atom_smarts
    decode_mapping["[2%s]" % symbol] = atom_smarts.lower()

# Helper function to transform the SMILES back to the correct SMARTS
_atom_name_pat = re.compile(r"\[[12]\w+\]")
def replace_atom_terms(m, decode_mapping = decode_mapping):
    return decode_mapping[m.group()]


def generate_canonical_smarts(mol, k=5):
    "generate all canonial SMARTS in a molecule up to size 'k'"
    for seed in generate_unique_substructures(mol, k):
        # Make a new molecule based on the substructure
        new_mol = OEGraphMol()
        new_atoms = {}
        for atom in seed.atoms:
            new_atom = new_mol.NewAtom(atom.GetAtomicNum())
            # 1 if not aromatic, 2 if aromatic
            new_atom.SetIsotope(atom.IsAromatic()+1)
            new_atoms[atom.GetIdx()] = new_atom
            
        for bond in seed.bonds:
            new_bond = new_mol.NewBond(new_atoms[bond.GetBgnIdx()],
                                       new_atoms[bond.GetEndIdx()],
                                       bond.GetOrder())
            # GetOrder() sets the Kekule form, but I want the aromatic
            # form.  If I don't do this I end up with terms like "c=c"
            # in my results. Setting it to a single bond works because
            # the output will always have "" as the bond (as in "CC",
            # "cC" or "cc"). The only way to get an incorrect result
            # is if the structure should be "c-c", which is rare. But
            # since the "" connection means "single or aromatic", then
            # the SMARTS won't miss it, so we're okay. As long as I'm
            # consistent.
            if bond.IsAromatic():
                new_bond.SetOrder(1)

        smiles = OECreateIsoSmiString(new_mol)
        smarts = _atom_name_pat.sub(replace_atom_terms, smiles)
        #print smiles, "->", smarts
        yield smarts

#################### Self-test code ##################

def test_seeds():
    mol = OEGraphMol()
    OEParseSmiles(mol, "S1C=N1.[Cl-]")
    expected_indexes = [ ( (0,), ()),
                         ( (1,), ()),
                         ( (2,), ()),
                         ( (3,), ()),
                         ( (0,1), (1,)),
                         ( (1,2), (2,)),
                         ( (0,2), (0,)),
                         ( (0,1,2), (0,1)),
                         ( (0,1,2), (1,2)),
                         ( (0,1,2), (0,2)),
                         ( (0,1,2), (0,1,2)),
                         ]
    expected_indexes.sort()

    found_indexes = []
    for seed in generate_unique_substructures(mol):
        atom_indexes = tuple(sorted(atom.GetIdx() for atom in seed.atoms))
        bond_indexes = tuple(sorted(bond.GetIdx() for bond in seed.bonds))
        found_indexes.append( (atom_indexes, bond_indexes) )
    found_indexes.sort()
    assert len(expected_indexes) == len(found_indexes), (len(expected_indexes),
                                                         len(found_indexes))
    for left, right in zip(expected_indexes, found_indexes):
        assert left == right, (left, right)
    

def _make_counts(it):
    from collections import defaultdict
    counts = defaultdict(int)
    for item in it:
        counts[item] += 1
    return dict(counts)

def test_k_0():
    mol = OEGraphMol()
    OEParseSmiles(mol, "c1ccccc1C#N")
    counts = _make_counts(generate_canonical_smarts(mol, 0))
    assert counts == {}, counts

def test_k_1():
    mol = OEGraphMol()
    OEParseSmiles(mol, "c1ccccc1C#N")
    counts = _make_counts(generate_canonical_smarts(mol, 1))
    assert counts == {"c": 6, "C": 1, "N": 1}, counts

def test_k_3():
    mol = OEGraphMol()
    OEParseSmiles(mol, "c1ccccc1C#N")
    counts = _make_counts(generate_canonical_smarts(mol, 3))
    assert counts == {"c": 6, "Cc": 1, "C": 1, "N": 1, "cc": 6,
                      "C#N": 1, "Ccc": 2, "cC#N": 1, "ccc": 6}, counts

def stress_test():
    # You will need to change this to a file you have on your file system
    for line in open("/Users/dalke/databases/nci_09425001_09450000.smi"):
        line = line.strip()
        print "Process", line
        mol = OEGraphMol()
        assert OEParseSmiles(mol, line), repr(line)
        OESuppressHydrogens(mol, False, False, False)
        
        # No two patterns should match the same set of atoms+bonds
        matched_sets = {}
        for smarts in set(generate_canonical_smarts(mol, 7)):
            pat = OESubSearch()
            assert pat.Init(smarts)
            at_least_one = False
            # Search for non-unique matches so I get all possible
            # sets of atom and bond matches.
            for m in pat.Match(mol):
                at_least_one = True
                atoms = frozenset(m.GetTargetAtoms())
                bonds = frozenset(m.GetTargetBonds())
                # This should be unique, unless the pattern has
                # symmetry in which case it's only been matched by
                # this pattern.
                q = (atoms, bonds)
                if q in matched_sets:
                    assert matched_sets[q] == smarts, (smarts, matched_sets[q])
                matched_sets[q] = smarts
            assert at_least_one, smarts
    

def test():
    test_seeds()
    test_k_0()
    test_k_1()
    test_k_3()
    print "Quick self-tests passed, running stress tests"
    stress_test() # Remember to change the code to use a SMILES file you have!

if __name__ == "__main__":
    test()

Comments and Feedback

If you liked this essay, found a problem with the code, or have something else to add then go ahead and leave a comment. I would especially like to hear from people who have done non-trivial work with subgraph enumerations and in fingerprint filter generation.

Advertisement: Training courses in Leipzig in Feb 2011

Next month I will be teaching programming classes for cheminformatics in Leipzig, Germany. Python for Cheminformatics is 14-15 February and Web Application Development with Django is 16-18 February. Let me know if you're interested, and remember, there's a discount if you attend both classes!


Andrew Dalke is an independent consultant focusing on software development for computational chemistry and biology. Need contract programming, help, or training? Contact me



Copyright © 2001-2013 Andrew Dalke Scientific AB