Subgraph enumeration
This is part 1 of a series on subgraph enumeration.
- Simple subgraph enumeration algorithm
- Faster subgraph enumeration
Graph enumeration (not the point of this article)
How many molecules exist with N atoms? That has no easy answer. A related graph theory question is "how many simple graphs exist with N nodes" but those include chemically impossible solutions. This is a well-studied field. If you want to go down that path, some pointers are A000602 from the Online Encyclopedia of Integer Sequences ("Number of n-node unrooted quartic trees; number of n-carbon alkanes C(n)H(2n+2) ignoring stereoisomers"), Brendan McKay's with with graph enumeration, including canonicalization with nauty. (Downstream implementations include N.I.C.E. in Sage with specializations in the canonicalization algorithm in InChI and in Indigo. A paper by Hartke and Radcliffe also explains the McKay algorithm.) For something more chemistry specific, see Mick Kappler's GenSmi project "to construct all possible compounds containing carbon, nitrogen, and oxygen with a molecular weight between 150 and 250" and Orion Jankowski's recent example code of simple enumeration using Haskell.
It's a very interesting topic, but as far as I can tell it's mostly theoretical. People conjecture about how enumeration might suggest new drug-like candidates, but I've not heard of any success along those lines. If you know of anything, please let me know.
Subgraph enumeration
Instead, I'm going to look at subgraph enumeration. That is, given a structure, I'll enumerate all its unique subgraphs up to a given size, and I'll count the number of times each unique subgraph exists.
My reason for doing this comes from my work with fingerprints. The MACCS keys and the CACTVS substructure keys were made with human insight into which patterns make good filters. I'm curious if I can use clustering, a genetic algorithm, or some other means to identify a set of substructures which are more effective at filtering. I have access to the target data (eg, PubChem), and I suspect that knowledge will be helpful.
Subgraph enumeration is not a new subject either, but I haven't come across much work using it, at least not in chemistry. There are a couple of papers by Gerta Rücker and Christoph Rücker from 2001:
On Finding Nonisomorphic Connected Subgraphs and Distinct Molecular Substructuresand
A computer program is introduced that first generates all connected subgraphs and then uses a combination of well-discriminating graph invariants to eliminate duplicates.
Substructure, Subgraph, and Walk Counts as Measures of the Complexity of Graphs and Moleculesbut I didn't actually read those papers. (Hmm, should walk up the hill to the library.)
Construction of all substructures and subgraphs is computationally demanding; therefore, two alternatives are pointed out for the treatment of large sets of compounds: (i) Often it will suffice to consider counts of substructures/subgraphs up to a certain number of edges only, information which is provided by the program much more rapidly.
Initial subgraph enumeration algorithm
Graph algorithms can be tricky so I'm going to start with the simplest algorithm I can think of. This is deliberately not a fast one - that will be in my next essay. I'll develop this algorithm first then use it to cross-check the faster but more complicated one.
I'll start with a collection of "seeds". Each seed is a graph, and I'll grow the seeds using a bond to make a new seed, which I can use for future growth.
seeds = an empty collection For each atom in the molecule: Mke a new seed with that atom and no bonds Add the seed to the collection of seeds While there are seeds: Select and remove a seed from the collection For each bond in the molecule: If the bond is in the seed then: Skip this bond Else if the neither atom of the bond is in the seed then: Skip this bond Else if (one of the bond atoms is not in the seed and the seed is at maximum size) then: Skip this bond Otherwise: Create a new seed which is the old seed plus this new bond Add the new seed to the collectionI think you can see that this will work. There will be duplicates because if I start from "C#N" with the seeds "C" and "N" then I'll grow the "C" to "C#N" and I'll grow the "N" to "N#C", but this is a good first pass.
I know the OpenEye tools the best so I'll use OEChem to implement the algorithm. There's nothing toolkit-specific about what I'm doing.
from openeye.oechem import * class Seed(object): def __init__(self, atoms, bonds): self.atoms = atoms self.bonds = bonds # Add a new bond to the seed when both end atoms are already in the seed def new_seed_with_bond(seed, bond): #print "Add bond", bond.GetIdx() assert bond not in seed.bonds atoms = seed.atoms bonds = seed.bonds.copy() bonds.add(bond) return Seed(atoms, bonds) # Add a new atom to the seed, and the bond which connects it to the seed def new_seed_with_atom_and_bond(seed, atom, bond): #print "Add atom and bond", atom.GetIdx(), bond.GetIdx() assert atom not in seed.atoms assert bond not in seed.bonds atoms = seed.atoms.copy() bonds = seed.bonds.copy() atoms.add(atom) bonds.add(bond) return Seed(atoms, bonds) def print_seed(seed): for atom in seed.atoms: print atom.GetIdx(), OEGetAtomicSymbol(atom.GetAtomicNum()) for bond in seed.bonds: if bond.IsAromatic(): bondtype = "~" else: bondtype = "X-=#$"[bond.GetOrder()] print bond.GetBgnIdx(), bond.GetEndIdx(), bondtype, bond.GetIdx() seed_count = 0 def report_seed(seed): global seed_count seed_count += 1 print "=== New seed (%d) ===" % seed_count print_seed(seed) print "==== end ===" mol = OEGraphMol() ## Various test cases #OEParseSmiles(mol, "SC=N") #OEParseSmiles(mol, "SC=N.[Cl-]") OEParseSmiles(mol, "S1C=N1") #OEParseSmiles(mol, "c1ccccc1O") # Maximum number of atoms to allow k = 10 assert k >= 1, "need a bit more work to handle that case" # Start with an empty collection seeds = [] # Set up the initial set of seeds for atom in mol.GetAtoms(): # Make a new seed with that atom and no bonds seed = Seed(set([atom]), set()) report_seed(seed) # Add the seed to the collection of seeds seeds.append(seed) # While there are seeds while seeds: # Select and remove a seed from the collection seed = seeds.pop() # For each bond in the molecule for bond in mol.GetBonds(): # If the bond is in the seed then skip this bond if bond in seed.bonds: continue new_atom = None # If neither atom of the bond is in the seed then skip this bond if bond.GetBgn() not in seed.atoms: if bond.GetEnd() not in seed.atoms: continue new_atom = bond.GetBgn() else: if bond.GetEnd() not in seed.atoms: new_atom = bond.GetEnd() # If one of the bond atoms is not in the seed # and the seed is at maximumize, then skip this bond if new_atom is not None: if len(seed.atoms) == k: continue # Create the new seed with this atom and bond new_seed = new_seed_with_atom_and_bond(seed, new_atom, bond) else: # Create a new seed where the atoms are already in the # seed but the bond is new new_seed = new_seed_with_bond(seed, bond) # Add the new seed to the collection report_seed(new_seed) seeds.append(new_seed)
I did purely manual testing to see if the algorithm worked. You can see traces of my test cases in the different OEParseSmiles calls. I did have bugs in my original code, like a report_seed(seed) instead of report_seed(new_seed). My experience with these sorts of algorithms is that the manual testing I did is good enough to catch the likely errors. At this point I should turn the manual tests into real tests, but I'm going to hold off on that until I have a better way to represent the results.
The uncommented test case is "S1C=N1". I chose that because each atom has a different symbol, one of the bonds is different than the others, and because getting the cycle right is important. When I run the above code I get 33 different subgraphs:
=== New seed (1) === 0 S ==== end === === New seed (2) === 1 C ==== end === === New seed (3) === 2 N ==== end === === New seed (4) === 2 N 0 S 0 2 - 0 ==== end === === New seed (5) === 2 N 1 C 1 2 = 2 ==== end === === New seed (6) === 2 N 1 C 0 S 1 2 = 2 0 2 - 0 ==== end === === New seed (7) === 2 N 1 C 0 S 1 2 = 2 0 1 - 1 ==== end === === New seed (8) === 2 N 1 C 0 S 1 2 = 2 0 1 - 1 0 2 - 0 ==== end === === New seed (9) === 2 N 1 C 0 S 1 2 = 2 0 2 - 0 0 1 - 1 ==== end === === New seed (10) === 2 N 0 S 1 C 0 2 - 0 0 1 - 1 ==== end === === New seed (11) === 2 N 0 S 1 C 0 2 - 0 1 2 = 2 ==== end === === New seed (12) === 2 N 0 S 1 C 0 2 - 0 1 2 = 2 0 1 - 1 ==== end === === New seed (13) === 2 N 0 S 1 C 0 2 - 0 0 1 - 1 1 2 = 2 ==== end === === New seed (14) === 1 C 0 S 0 1 - 1 ==== end === === New seed (15) === 1 C 2 N 1 2 = 2 ==== end === === New seed (16) === 1 C 2 N 0 S 1 2 = 2 0 2 - 0 ==== end === === New seed (17) === 1 C 2 N 0 S 1 2 = 2 0 1 - 1 ==== end === === New seed (18) === 1 C 2 N 0 S 1 2 = 2 0 1 - 1 0 2 - 0 ==== end === === New seed (19) === 1 C 2 N 0 S 1 2 = 2 0 2 - 0 0 1 - 1 ==== end === === New seed (20) === 1 C 0 S 2 N 0 1 - 1 0 2 - 0 ==== end === === New seed (21) === 1 C 0 S 2 N 0 1 - 1 1 2 = 2 ==== end === === New seed (22) === 1 C 0 S 2 N 0 1 - 1 1 2 = 2 0 2 - 0 ==== end === === New seed (23) === 1 C 0 S 2 N 0 1 - 1 0 2 - 0 1 2 = 2 ==== end === === New seed (24) === 0 S 2 N 0 2 - 0 ==== end === === New seed (25) === 0 S 1 C 0 1 - 1 ==== end === === New seed (26) === 0 S 1 C 2 N 0 1 - 1 0 2 - 0 ==== end === === New seed (27) === 0 S 1 C 2 N 0 1 - 1 1 2 = 2 ==== end === === New seed (28) === 0 S 1 C 2 N 0 1 - 1 1 2 = 2 0 2 - 0 ==== end === === New seed (29) === 0 S 1 C 2 N 0 1 - 1 0 2 - 0 1 2 = 2 ==== end === === New seed (30) === 0 S 2 N 1 C 0 2 - 0 0 1 - 1 ==== end === === New seed (31) === 0 S 2 N 1 C 0 2 - 0 1 2 = 2 ==== end === === New seed (32) === 0 S 2 N 1 C 0 2 - 0 1 2 = 2 0 1 - 1 ==== end === === New seed (33) === 0 S 2 N 1 C 0 2 - 0 0 1 - 1 1 2 = 2 ==== end ===You can easily see that there are a lot of duplicates. Not only do I generate every subgraph, but I go through the process of generating all possible ways to build up each subgraph. I would rather see the non-redundant subgraphs.
Post-process filter to remove duplicate seeds
I can easily filter out duplicates. My first solution was to convert the algorithm into a function so I could use it like this:
for seed in generate_substructures(mol, k=5): report_seed(seed)I then wrote a wrapper to only pass through a substructure if it had never been seen before:
def unique_substructures(seeds): previous_substructures = set() for seed in seeds: unique_key = (frozenset(atom.GetIdx() for atom in seed.atoms), frozenset(bond.GetIdx() for bond in seed.bonds)) if unique_key in previous_substructures: continue previous_substructures.add(unique_key) yield seed # Here's how to use the unique_substructures filters for seed in unique_substructures(generate_substructures(mol)): report_seed(seed)
This works, and requires minimal change to the algorithm. Instead of 33 subgraphs there are now 10.
=== New seed (1) === 0 S ==== end === === New seed (2) === 1 C ==== end === === New seed (3) === 2 N ==== end === === New seed (4) === 2 N 0 S 0 2 - 0 ==== end === === New seed (5) === 2 N 1 C 1 2 = 2 ==== end === === New seed (6) === 2 N 1 C 0 S 1 2 = 2 0 2 - 0 ==== end === === New seed (7) === 2 N 1 C 0 S 1 2 = 2 0 1 - 1 ==== end === === New seed (8) === 2 N 1 C 0 S 1 2 = 2 0 1 - 1 0 2 - 0 ==== end === === New seed (9) === 2 N 0 S 1 C 0 2 - 0 0 1 - 1 ==== end === === New seed (10) === 1 C 0 S 0 1 - 1 ==== end ===The problem is that the underlying code still generates all 33 subgraphs, even though 23 of them will be removed. It's much better to push the uniqueness testing lower into the code. If I create a new seed which is identical to a previously seen seed then I should discard it. It adds no new information, and bogs down the system.
Making hashable seeds (__hash__, __eq__, and frozensets)
The standard way to check for uniqueness in Python is to use a set.
>>> text = "Substructure" >>> previous_letters = set() >>> for c in text: ... if c in previous_letters: ... print "Duplicate", c ... else: ... print "First time", c ... previous_letters.add(c) ... First time S First time u First time b First time s First time t First time r Duplicate u First time c Duplicate t Duplicate u Duplicate r First time e >>>I would like to say
previous_seeds = set() ... if new_seed in previous_seeds: continue yield new_seed # Add the new seed to the collection seeds.append(new_seed) # Prevent future duplicates previous_seeds.add(new_seed)but it's not so easy. How does the set know if two elements are the same? By default it doesn't:
>>> x = set() >>> x.add(Seed(set([1]), set())) >>> x.add(Seed(set([1]), set())) >>> len(x) 2 >>>The dict, set, and frozenset datatypes are "hashed collections", meaning that they are internally based on hash tables. Hash collections contain hashable items, meaning:
- hash(x) must not change while x is in the collection
- If two elements x and y are equal then hash(x) == hash(y) and x == y.
- If two elements x and y are not equal then x == y must always be false. (It may be that hash(x) == hash(y).)
- For good performance, hash(x) should depend on the value of x such that it's unlikely for hash(x) == hash(y) unless x == y.
But as you can see, that's not what I want. I need to define new __eq__ and __hash__ methods. The first is easy:
def __eq__(self, other): return (self.atoms == other.atoms and self.bonds == other.bonds)the latter is a bit harder. I want a hash value which depends on the atoms and bonds. The easiest is
def __hash__(self): return hash(self.atoms) ^ hash(self.bonds)which uses the seldom-seen xor operator ("^") to combine the hashes of the set of atoms and the set of bonds.
This will not work. The 'set' data type is not hashable.
>>> x = set() >>> hash(x) Traceback (most recent call last): File "<stdin>", line 1, in <module> TypeError: unhashable type: 'set' >>>This is because sets are mutable. There's no guarantee that hash(x) will be constant over time, and that breaks a hashable requirement.
Instead, I'll have to switch over to a frozenset, which is an immutable datatype (meaning it cannot be modified).
>>> hash(frozenset([1,2,3,4])) 5527014739486817160 >>> x = frozenset([1,2,3,4]) >>> hash(x) 5527014739486817160 >>> x.add(3) Traceback (most recent call last): File "<stdin>", line 1, in <module> AttributeError: 'frozenset' object has no attribute 'add' >>>That means I need to change a few other bits of code which assume a set instead of a frozenset. For example:
bonds = seed.bonds.copy() bonds.add(bond)I'll change the three cases of set.copy() followed by set.add() into:
from itertools import chain ... bonds = frozenset(chain(seed.bonds, [bond]))and I need to change the initial seed creation from
seed = Seed(set([atom]), set())to
seed = Seed(frozenset([atom]), frozenset())
Duplicate removal while generating seeds
Put all of these together and I get:
from itertools import chain from openeye.oechem import * class Seed(object): def __init__(self, atoms, bonds): # These are frozensets self.atoms = atoms self.bonds = bonds # I'll put a Seed in a set to test for uniqueness. # This means seeds must be immutable and they must # implement __hash__ and __eq__. If two sets are # equal then they must generate the same hash and # __eq__ returns 1. If two sets are different then # __eq__ must return 0. def __hash__(self): # Base the hash on the xor-ed hashes of the # set of atoms and the set of bonds. return hash(self.atoms) ^ hash(self.bonds) def __eq__(self, other): return (self.atoms == other.atoms and self.bonds == other.bonds) def new_seed_with_bond(seed, bond): assert bond not in seed.bonds atoms = seed.atoms # Since seed.bonds is a frozenset, I can't copy and modify. # Instead, build the frozenset input list directly. bonds = frozenset(chain(seed.bonds, [bond])) return Seed(atoms, bonds) def new_seed_with_atom_and_bond(seed, atom, bond): assert atom not in seed.atoms assert bond not in seed.bonds atoms = frozenset(chain(seed.atoms, [atom])) bonds = frozenset(chain(seed.bonds, [bond])) return Seed(atoms, bonds) def print_seed(seed): for atom in seed.atoms: print atom.GetIdx(), OEGetAtomicSymbol(atom.GetAtomicNum()) for bond in seed.bonds: if bond.IsAromatic(): bondtype = "~" else: bondtype = "X-=#$"[bond.GetOrder()] print bond.GetBgnIdx(), bond.GetEndIdx(), bondtype, bond.GetIdx() seed_count = 0 def report_seed(seed): global seed_count seed_count += 1 print "=== New seed (%d) ===" % seed_count print_seed(seed) print "==== end ===" def generate_unique_substructures(mol, k=5): # If you don't want any atoms then you won't get any subgraphs if k < 1: return # Keep track of every seed I've seen except the seeds # of size 1 (since there's no way they can repeat) previous_seeds = set() seeds = [] for atom in mol.GetAtoms(): seed = Seed(frozenset([atom]), frozenset()) yield seed seeds.append(seed) while seeds: seed = seeds.pop() for bond in mol.GetBonds(): # If the bond is in the seed then skip this bond if bond in seed.bonds: continue new_atom = None # If neither atom of the bond is in the seed then skip this bond if bond.GetBgn() not in seed.atoms: if bond.GetEnd() not in seed.atoms: continue new_atom = bond.GetBgn() else: if bond.GetEnd() not in seed.atoms: new_atom = bond.GetEnd() # If one of the bond atoms is not in the seed # and the seed is at maximumize, then skip this bond if new_atom is not None: if len(seed.atoms) == k: continue # Create the new seed with this atom and bond new_seed = new_seed_with_atom_and_bond(seed, new_atom, bond) else: # Create a new seed where the atoms are already in the # seed but the bond is new new_seed = new_seed_with_bond(seed, bond) # Check for duplicates if new_seed in previous_seeds: continue yield new_seed # Add the new seed to the collection seeds.append(new_seed) # Prevent future duplicates previous_seeds.add(new_seed) mol = OEGraphMol() #OEParseSmiles(mol, "SC=N") #OEParseSmiles(mol, "SC=N.[Cl-]") OEParseSmiles(mol, "S1C=N1") #OEParseSmiles(mol, "c1ccccc1O") for seed in generate_unique_substructures(mol): report_seed(seed)I tested it out and I got the same results as the previous algorithm which generated all duplicate substructures and post-filtered the duplicates.
Notice the testing theme I'm developing here? I often start with a simple algorithm and get it working, then use it to cross-check a more powerful but also more complex algorithm.
Generating canonical SMARTS for each subgraph
I'm not done yet though. What I've done is generate all the unique substructures of an input structure. What I really want is all the unique canonical substructures, and preferably in SMARTS form. For example, if the input SMILES is "c1ccccc1O" with k=2 there should be 6 "c" matches, 1 "O" match, 6 "cc" matches, and 1 "cO" (or perhaps "Oc" depending on the canonicalization).
I don't know of a toolkit which lets me generate canonical substructures or canonical SMARTS from a subgraph, but for this problem it's easy to create one from the parts I do have. I say "for this problem" because I'm only worried about the element type and aromaticity, and not chirality, stereochemistry, or even charge.
The algorithm is simple, but not obvious. I make a new molecule which is isomorphic to the substructure, create it canonical SMILES, and use that SMILES to create the correct SMARTS.
The obvious problem is that a substructure may correspond to "ccC", which is not a valid SMILES. Most toolkits don't want to deal with that. OEChem is an exception. It does a very good job of respecting aromaticity flags even when it doesn't make chemical sense. But I want a solution which works in every toolkit. (I also ran into some canonicalization problems in OEChem for these patterns. I didn't resolve to my satisifaction and it could be a misunderstanding on my side.)
What I need is some way to encode the aromaticity in the SMILES in such a way that I can still get proper canonicalization and so that I can decode it. After a few attempts at different encodings, I decided to to use the isotope field, where isotope 1 mean "not aromatic" and 2 means "aromatic." For example, "[1C]" means "C" and "[2C]" means "c". The fragment "ccC" becomes the SMILES "[2C][2C][1C]", which is easily canonicalized and decoded as "ccC".
(If you also want the charge and hydrogen counts then you can encode those directly. With something like
new_atom.SetIsotope(old_atom.GetIsotope()*2+1 + old_atom.IsAromatic())you can also encode the old isotope information.)
So I turn the substructure into a non-aromatic canonical SMILES but with enough information that I can turn that back into the aromatic canonical SMARTS.
Note: There is a possibility that "cc" will appear in the result when it should be "c-c" (consider two aromatics in different rings, but also bonded to each other through a non-aromatic ring bond) but that's rare. For my purposes there's no problem because "cc" will match that bond and nothing will ever generate "c-c" so it won't get double counted. Instead, "c-c" and "cc" will both be counted as "cc".
Here's the code to convert the seeds to canonical SMARTS as post-processing step:
import re # I'm going to convert a substructure into a new molecule, # canonicalize the new molecule, and use the canonical SMILES to # create a canonical SMARTS for the original substructure. # I'll preserve the aromaticity information by encoding (atomic # number, is_aromatic) as (atomic number, isotope number) where the # isotope == 1 for not aromatic and 2 for aromatic. The corresponding # SMILES will be either the form "[1*]" or "[2*]" where "*" is a valid # atomic symbol. # Given the SMILES I can reverse that with a string transformation. # To make that easier I'll enumerate all possible reverse transformations. organic_subset = set([5,6,7,8,9,15,16,17,35,53]) decode_mapping = {"[1*]": "*", # Encode these manually "[2*]": "*"} for atomno in range(1, OEElemNo_MAXELEM): symbol = OEGetAtomicSymbol(atomno) if atomno in organic_subset: atom_smarts = symbol else: atom_smarts = "[%s]" % symbol decode_mapping["[1%s]" % symbol] = atom_smarts decode_mapping["[2%s]" % symbol] = atom_smarts.lower() # Helper function to transform the SMILES back to the correct SMARTS _atom_name_pat = re.compile(r"\[[12]\w+\]") def replace_atom_terms(m, decode_mapping = decode_mapping): return decode_mapping[m.group()] def generate_canonical_smarts(mol, k=2): for seed in generate_unique_substructures(mol, k): # Make a new molecule based on the substructure new_mol = OEGraphMol() new_atoms = {} for atom in seed.atoms: new_atom = new_mol.NewAtom(atom.GetAtomicNum()) # 1 if not aromatic, 2 if aromatic new_atom.SetIsotope(atom.IsAromatic()+1) new_atoms[atom.GetIdx()] = new_atom for bond in seed.bonds: new_bond = new_mol.NewBond(new_atoms[bond.GetBgnIdx()], new_atoms[bond.GetEndIdx()], bond.GetOrder()) # GetOrder() sets the Kekule form, but I want the aromatic # form. If I don't do this I end up with terms like "c=c" # in my results. Setting it to a single bond works because # the output will always have "" as the bond (as in "CC", # "cC" or "cc"). The only way to get an incorrect result # is if the structure should be "c-c", which is rare. But # since the "" connection means "single or aromatic", then # the SMARTS won't miss it, so we're okay. As long as I'm # consistent. if bond.IsAromatic(): new_bond.SetOrder(1) smiles = OECreateIsoSmiString(new_mol) smarts = _atom_name_pat.sub(replace_atom_terms, smiles) #print smiles, "->", smarts yield smarts if __name__ == "__main__": mol = OEGraphMol() #OEParseSmiles(mol, "SC=N") #OEParseSmiles(mol, "S1C=N1") OEParseSmiles(mol, "c1ccccc1C#N") #OEParseSmiles(mol, "c1ccc1") for smarts in generate_canonical_smarts(mol, 3): print smarts(If you are working with OEChem and you want something more sophisticated which will let you include SMARTS patterns like "X3" in your result, then there might be another solution via OESMILESFlag_SuperAtoms and SetName(). Remember, I'm available for consulting!)
Counting the unique canonical SMARTS
When I run the above I get
c c c c c c C N C#N cC#N Cc Ccc Ccc cc cc ccc ccc ccc cc ccc cc ccc cc ccc ccand it's a simple step to turn that into counts. (Python 2.7's collections.Counter would make this even easier.)
from collections import defaultdict counts = defaultdict(int) for smarts in generate_canonical_smarts(mol, 3): counts[smarts] += 1 for v, k in sorted((v,k) for (k,v) in counts.items()): print v, kThe output is:
1 C 1 C#N 1 Cc 1 N 1 cC#N 2 Ccc 6 c 6 cc 6 cccThis is easy to verify manually, and easier to verify this set than the list of atoms / list of bonds style I showed earlier.
Stress testing
Is there some way I can stress the code? Yes. I have a set of structures with explicit hydrogens, like
C(C1=C(C(=O)C(=C(C1=O)[H])[H])[H])([H])([H])[H] S(Sc1nc2c(c(c(c(c2s1)[H])[H])[H])[H])c3nc4c(c(c(c(c4s3)[H])[H])[H])[H] N(=O)(=O)c1c(c(c(c(c1O[H])Cl)[H])[N+](=O)[O-])[H]and when I send one of those through it's obvious that I want to remove hydrogens from any input structure. In OEChem that's
OESuppressHydrogens(mol, False, False, False)How about testing the patterns? My first thought was the if I find six "cccccc" then a unique SMARTS match must match six times. This turns out not to be the case because unique SMARTS matches are unique only with the set of atoms, and not the set of atoms + set of bonds. Benzene yields 6 "cccccc" substructures with my algorithm, but only has 1 unique SMARTS match. (Indigo does have search option like I want, and if I was being more rigorous I would use it to help out.)
Much as I tried, that approach does not work. I tried something else. I can check that each pattern matches as least once. Also, since the patterns are canonical, that means two different patterns will never match against the same set of atoms+bonds. This I can test!
if __name__ == "__main__": for line in open("/Users/dalke/databases/nci_09425001_09450000.smi"): line = line.strip() print "Process", line mol = OEGraphMol() assert OEParseSmiles(mol, line), repr(line) OESuppressHydrogens(mol, False, False, False) # No two patterns should match the same set of atoms+bonds matched_sets = {} for smarts in set(generate_canonical_smarts(mol, 7)): pat = OESubSearch() assert pat.Init(smarts) at_least_one = False # Search for non-unique matches so I get all possible # sets of atom and bond matches. for m in pat.Match(mol): at_least_one = True atoms = frozenset(m.GetTargetAtoms()) bonds = frozenset(m.GetTargetBonds()) # This should be unique, unless the pattern has # symmetry in which case it's only been matched by # this pattern. q = (atoms, bonds) if q in matched_sets: assert matched_sets[q] == smarts, (smarts, matched_sets[q]) matched_sets[q] = smarts assert at_least_one, smarts #print "Good"That took a while, but no failures! That leads me to suspect that my algorithm is working. Yay!
Download the final code
Here's the final version with everything, so you don't need to assemble it yourself. Or you can download it as simple_subgraph_enumeration.py.
# Details available at # http://dalkescientific.com/writings/diary/archive/2011/01/10/subgraph_enumeration.html # For a faster subgraph enumeration algorithm see # http://dalkescientific.com/writings/diary/archive/2011/01/13/faster_subgraph_enumeration.html import re from itertools import chain from openeye.oechem import * # A seed is a substructure. # It contains a frozenset of atoms and a frozenset of molecules. class Seed(object): def __init__(self, atoms, bonds): # These are frozensets self.atoms = atoms self.bonds = bonds # The following lets seeds be put in a set to test for uniqueness def __hash__(self): # Base the hash on the xor-ed hashes of the # set of atoms and the set of bonds. return hash(self.atoms) ^ hash(self.bonds) def __eq__(self, other): return (self.atoms == other.atoms and self.bonds == other.bonds) # Create a new seed from an old seed plus a bond. # Both end atoms for the bond must already be in the seed. def new_seed_with_bond(seed, bond): assert bond not in seed.bonds atoms = seed.atoms bonds = frozenset(chain(seed.bonds, [bond])) return Seed(atoms, bonds) # Create a new seed from an old seed plus a bond linking # an atom already in the seed to a new atom not in the seed. def new_seed_with_atom_and_bond(seed, atom, bond): assert atom not in seed.atoms assert bond not in seed.bonds atoms = frozenset(chain(seed.atoms, [atom])) bonds = frozenset(chain(seed.bonds, [bond])) return Seed(atoms, bonds) def generate_unique_substructures(mol, k=5): "Generate all unique substructures (as seeds) in a molecule, up to size 'k'" # If you don't want any atoms then you won't get any subgraphs if k < 1: return # Keep track of every seed I've seen except the seeds # of size 1 (since there's no way they can repeat) previous_seeds = set() seeds = [] for atom in mol.GetAtoms(): seed = Seed(frozenset([atom]), frozenset()) yield seed seeds.append(seed) while seeds: seed = seeds.pop() for bond in mol.GetBonds(): # If the bond is in the seed then skip this bond if bond in seed.bonds: continue new_atom = None # If neither atom of the bond is in the seed then skip this bond if bond.GetBgn() not in seed.atoms: if bond.GetEnd() not in seed.atoms: continue new_atom = bond.GetBgn() else: if bond.GetEnd() not in seed.atoms: new_atom = bond.GetEnd() # If one of the bond atoms is not in the seed # and the seed is at maximumize, then skip this bond if new_atom is not None: if len(seed.atoms) == k: continue # Create the new seed with this atom and bond new_seed = new_seed_with_atom_and_bond(seed, new_atom, bond) else: # Create a new seed where the atoms are already in the # seed but the bond is new new_seed = new_seed_with_bond(seed, bond) # Check for duplicates if new_seed in previous_seeds: continue yield new_seed # Add the new seed to the collection seeds.append(new_seed) # Prevent future duplicates previous_seeds.add(new_seed) # I'm going to convert a substructure into a new molecule, # canonicalize the new molecule, and use the canonical SMILES to # create a canonical SMARTS for the original substructure. # I'll preserve the aromaticity information by encoding (atomic # number, is_aromatic) as (atomic number, isotope number) where the # isotope == 1 for not aromatic and 2 for aromatic. The corresponding # SMILES will be either the form "[1*]" or "[2*]" where "*" is a valid # atomic symbol. # Given the SMILES I can reverse that with a string transformation. # To make that easier I'll enumerate all possible reverse transformations. organic_subset = set([5,6,7,8,9,15,16,17,35,53]) decode_mapping = {"[1*]": "*", # Encode these manually "[2*]": "*"} for atomno in range(1, OEElemNo_MAXELEM): symbol = OEGetAtomicSymbol(atomno) if atomno in organic_subset: atom_smarts = symbol else: atom_smarts = "[%s]" % symbol decode_mapping["[1%s]" % symbol] = atom_smarts decode_mapping["[2%s]" % symbol] = atom_smarts.lower() # Helper function to transform the SMILES back to the correct SMARTS _atom_name_pat = re.compile(r"\[[12]\w+\]") def replace_atom_terms(m, decode_mapping = decode_mapping): return decode_mapping[m.group()] def generate_canonical_smarts(mol, k=5): "generate all canonial SMARTS in a molecule up to size 'k'" for seed in generate_unique_substructures(mol, k): # Make a new molecule based on the substructure new_mol = OEGraphMol() new_atoms = {} for atom in seed.atoms: new_atom = new_mol.NewAtom(atom.GetAtomicNum()) # 1 if not aromatic, 2 if aromatic new_atom.SetIsotope(atom.IsAromatic()+1) new_atoms[atom.GetIdx()] = new_atom for bond in seed.bonds: new_bond = new_mol.NewBond(new_atoms[bond.GetBgnIdx()], new_atoms[bond.GetEndIdx()], bond.GetOrder()) # GetOrder() sets the Kekule form, but I want the aromatic # form. If I don't do this I end up with terms like "c=c" # in my results. Setting it to a single bond works because # the output will always have "" as the bond (as in "CC", # "cC" or "cc"). The only way to get an incorrect result # is if the structure should be "c-c", which is rare. But # since the "" connection means "single or aromatic", then # the SMARTS won't miss it, so we're okay. As long as I'm # consistent. if bond.IsAromatic(): new_bond.SetOrder(1) smiles = OECreateIsoSmiString(new_mol) smarts = _atom_name_pat.sub(replace_atom_terms, smiles) #print smiles, "->", smarts yield smarts #################### Self-test code ################## def test_seeds(): mol = OEGraphMol() OEParseSmiles(mol, "S1C=N1.[Cl-]") expected_indexes = [ ( (0,), ()), ( (1,), ()), ( (2,), ()), ( (3,), ()), ( (0,1), (1,)), ( (1,2), (2,)), ( (0,2), (0,)), ( (0,1,2), (0,1)), ( (0,1,2), (1,2)), ( (0,1,2), (0,2)), ( (0,1,2), (0,1,2)), ] expected_indexes.sort() found_indexes = [] for seed in generate_unique_substructures(mol): atom_indexes = tuple(sorted(atom.GetIdx() for atom in seed.atoms)) bond_indexes = tuple(sorted(bond.GetIdx() for bond in seed.bonds)) found_indexes.append( (atom_indexes, bond_indexes) ) found_indexes.sort() assert len(expected_indexes) == len(found_indexes), (len(expected_indexes), len(found_indexes)) for left, right in zip(expected_indexes, found_indexes): assert left == right, (left, right) def _make_counts(it): from collections import defaultdict counts = defaultdict(int) for item in it: counts[item] += 1 return dict(counts) def test_k_0(): mol = OEGraphMol() OEParseSmiles(mol, "c1ccccc1C#N") counts = _make_counts(generate_canonical_smarts(mol, 0)) assert counts == {}, counts def test_k_1(): mol = OEGraphMol() OEParseSmiles(mol, "c1ccccc1C#N") counts = _make_counts(generate_canonical_smarts(mol, 1)) assert counts == {"c": 6, "C": 1, "N": 1}, counts def test_k_3(): mol = OEGraphMol() OEParseSmiles(mol, "c1ccccc1C#N") counts = _make_counts(generate_canonical_smarts(mol, 3)) assert counts == {"c": 6, "Cc": 1, "C": 1, "N": 1, "cc": 6, "C#N": 1, "Ccc": 2, "cC#N": 1, "ccc": 6}, counts def stress_test(): # You will need to change this to a file you have on your file system for line in open("/Users/dalke/databases/nci_09425001_09450000.smi"): line = line.strip() print "Process", line mol = OEGraphMol() assert OEParseSmiles(mol, line), repr(line) OESuppressHydrogens(mol, False, False, False) # No two patterns should match the same set of atoms+bonds matched_sets = {} for smarts in set(generate_canonical_smarts(mol, 7)): pat = OESubSearch() assert pat.Init(smarts) at_least_one = False # Search for non-unique matches so I get all possible # sets of atom and bond matches. for m in pat.Match(mol): at_least_one = True atoms = frozenset(m.GetTargetAtoms()) bonds = frozenset(m.GetTargetBonds()) # This should be unique, unless the pattern has # symmetry in which case it's only been matched by # this pattern. q = (atoms, bonds) if q in matched_sets: assert matched_sets[q] == smarts, (smarts, matched_sets[q]) matched_sets[q] = smarts assert at_least_one, smarts def test(): test_seeds() test_k_0() test_k_1() test_k_3() print "Quick self-tests passed, running stress tests" stress_test() # Remember to change the code to use a SMILES file you have! if __name__ == "__main__": test()
Comments and Feedback
If you liked this essay, found a problem with the code, or have something else to add then go ahead and leave a comment. I would especially like to hear from people who have done non-trivial work with subgraph enumerations and in fingerprint filter generation.
Advertisement: Training courses in Leipzig in Feb 2011
Next month I will be teaching programming classes for cheminformatics in Leipzig, Germany. Python for Cheminformatics is 14-15 February and Web Application Development with Django is 16-18 February. Let me know if you're interested, and remember, there's a discount if you attend both classes!
Andrew Dalke is an independent consultant focusing on software development for computational chemistry and biology. Need contract programming, help, or training? Contact me
Copyright © 2001-2020 Andrew Dalke Scientific AB