Dalke Scientific Software: More science. Less time. Products
[ previous | newer ]     /home/writings/diary/archive/2016/08/09/fragment_achiral_molecules

Fragment achiral molecules in RDKit using low-level functions

This essay shows a simple way to fragment a non-chiral molecule (okay, a "molecular graph") using low-level RDKit calls, and then a complex way to re-connect the SMILES fragment by manipulating the SMILES string at the syntax level then recanonicalizing. (The correct way to fragment a bond in RDKit is to use FragmentOnBonds(), and not the code I cover here, which is meant as the basis for understanding how something like FragmentOnBonds() works.)

It is part of a series of essays on how to fragment a molecular graph.

I'll start by fragmenting the methoxy group of vanillin. (I wrote an essay on the vanillin functional groups.) I want to turn the SMILES of "c1(C=O)cc(OC)c(O)cc1" into two parts, '[*]OC' (the methoxy group) and 'O=Cc1cc[*]c(O)cc1' (the rest of the structure). The "[*]" terms are a way to represent an R-group in SMILES.

The method I'll use is to identify the bond and its two end atoms, delete the bond, and for each of the end atoms add a new "*" atom, which is an atom with atomic number 0.

I'll identify the bond manually, by counting the atom indices and identifying the substring which contains the methoxy:

                              1   ] atom indices, from 0 to 10.
            0  1 2 34 56 7 8 90   ]   (The top line is the tens.)
 vanillin = c1(C=O)cc(OC)c(O)cc1
                      ^^-- methoxy group
		    ^-^ break this bond between atom[4] and atom[5]
I want to break the bond between atoms at 4 and 5. I'll use RDKit for this. The default RDKit molecule does not support topological modifications, so I'll have to turn the molecule into a "RWMol" (Read/Write Molecule) to edit it, then convert the result back to a normal molecule:
>>> mol = Chem.MolFromSmiles("c1(C=O)cc(OC)c(O)cc1")
>>> rwmol = Chem.RWMol(mol)
>>> rwmol.RemoveBond(4, 5)
>>> new_mol = rwmol.GetMol()
>>> Chem.MolToSmiles(new_mol)
Visual inspection shows that this broke the correct bond, but it's missing the new "[*]" atoms. I need to go back to the RWMol and add two atoms with atomic number 0 (which Daylight SMILES terms a 'wildcard' atom):
>>> rwmol.AddAtom(Chem.Atom(0))
>>> rwmol.AddAtom(Chem.Atom(0))
>>> new_mol = rwmol.GetMol()
>>> Chem.MolToSmiles(new_mol)
That's getting closer! Now I need to connect each new atom to the correct atom which was as the end of the deleted bond. Remember, the old atom ids were 5 and 6, and you can see from the above code that th new atom ids are 11 and 12. I'll connect atom 4 to 11 and atom 5 to 12, both using a single bond:
>>> rwmol.AddBond(4, 11, Chem.BondType.SINGLE)
>>> rwmol.AddBond(5, 12, Chem.BondType.SINGLE)
>>> new_mol = rwmol.GetMol()
>>> Chem.MolToSmiles(new_mol)
That's the output I'm looking for. Now to wrap it up in a function:
from rdkit import Chem

def fragment_simple(mol, atom1, atom2):
    rwmol = Chem.RWMol(mol)
    rwmol.RemoveBond(atom1, atom2)
    wildcard1 = rwmol.AddAtom(Chem.Atom(0))
    wildcard2 = rwmol.AddAtom(Chem.Atom(0))
    rwmol.AddBond(atom1, wildcard1, Chem.BondType.SINGLE) 
    rwmol.AddBond(atom2, wildcard2, Chem.BondType.SINGLE) 
    return rwmol.GetMol()
and test it out on the vanillin structure:
>>> vanillin = Chem.MolFromSmiles("c1(C=O)cc(OC)c(O)cc1")
>>> fragmented_vanillin = fragment_simple(vanillin, 4, 5)
>>> Chem.MolToSmiles(fragmented_vanillin)

Reassemble two fragments

The above code looks right, but the real test is to reassemble the fragments and see if it gives the same result. Rather than work with the molecule at the molecular graph level, I'll manipulate the SMILES string directly, using a ring closure method I described in 2004.

The term "ring closure" sounds like it's only used for rings, but it's actually simply an alternate way to specify a bond. As Daylight's SMILES theory documentation points out, "C1.C1 specifies the same molecule as CC(ethane)".

I'll replace the "[*]" symbols with the ring closure "%90", which is high enough that won't occur in the structures I deal with. I can't do a simple string substitution since that would turn '[*]OC' into '%90OC'. If the '[*]' is the first atom, or is after a '.', then I need to place the %90 after the second atom symbol, giving "O%90C.c%901cc(C=O)ccc1O". I'll canonicalize that string, and compare it to the canonicalized vanillin structure:

>>> test_smiles = "O%90C.c%901cc(C=O)ccc1O"
>>> Chem.CanonSmiles(test_smiles)
>>> Chem.MolToSmiles(vanillin)
As hoped for, they are the same.

Convert wildcards to closures

I apologize. I'm about to show you a big block of code with very little commentary besides the comments. Feel free to skip to the next section.

In the previous section, I manually tranformed a SMILES string containing wildcards to one with ring closures so I could test if the fragmentation code was correct. I can automated that step, but it's a bit complicated and I won't go into the details. The key observation is that the wildcard atom is only connected by a single bond to the rest of the fragment. If the wildcard is the first atom term in the SMILES then the ring closure goes on the second atom term. Otherwise, the ring closure goes on the atom immediately before the wildcard atom. (The last is RDKit-specific becuse RDKit's canonical SMILES ends up placing branches with wildcard atoms before any other branches.)

The following code automates that process. I call the file "smiles_syntax.py" and it defines the new function "convert_wildcards_to_closures(smiles, offsets=None)". Here's an example:

>>> import smiles_syntax
>>> fragmented_smiles = "[*]OC.[*]c1cc(C=O)ccc1O"
>>> smiles_syntax.convert_wildcards_to_closures(fragmented_smiles)
>>> smiles_syntax.convert_wildcards_to_closures(fragmented_smiles, (0, 0))
>>> closure_smiles = smiles_syntax.convert_wildcards_to_closures(fragmented_smiles, (0, 0))
>>> from rdkit import Chem
>>> Chem.CanonSmiles(closure_smiles)
By default each wildcard atom gets a new closure number, starting from 90, which is why you see the "%90" and "%91". The offsets parameter specifies which offsets from 90 to use. With "(0, 0)", both wildcards get the '%90' ring closure.

This will be used later on work with multiple cuts of the same structure.

Here's the content of "smiles_syntax.py".

# I call this "smiles_syntax.py"
from __future__ import print_function

import re

# Match a '*' in the different forms that might occur,
# including with directional single bonds inside of ()s.
_wildcard_regex = " |\n".join(re.escape(regex) for regex in
  ("*", "[*]", "(*)", "([*])", "(/*)", "(/[*])", "(\\*)", "(\\[*])"))
_wildcard_pattern = re.compile(_wildcard_regex, re.X)

# Match the SMILES for an atom, followed by its closures
_atom_pattern = re.compile(r"""
 Cl? |             # Cl and Br are part of the organic subset
 Br? |
 [NOSPFIbcnosp] |  # as are these single-letter elements
 \[[^]]*\]         # everything else must be in []s
""", re.X)

def convert_wildcards_to_closures(smiles, offsets=None):
    # This is designed for RDKit's canonical SMILES output. It does
    # not handle all possible SMILES inputs.
    if offsets is None:
        # Use 0, 1, 2, ... up to the number of '*'s
        offsets = range(smiles.count("*"))
    closure_terms = []
    for offset in offsets:
        if not (0 <= offset <= 9):
            raise ValueError("offset %d out of range (must be from 0 to 9)"
                             % (offset,))
        closure_terms.append("%%%02d" % (90 + offset))
    new_smiles = smiles
    while 1:
        # Find the first '*'. If none are left, stop.
        wildcard_match = _wildcard_pattern.search(new_smiles)
        if wildcard_match is None:

        closure_term = closure_terms.pop(0)
        wildcard_start = wildcard_match.start()
        if wildcard_start == 0 or new_smiles[wildcard_start-1] == ".":
            # At the start of the molecule or after a ".". Need to
            # put the closure after the second atom. Find the second
            # atom. Since we only ever break on single non-ring bonds,
            # and since the first atom is a terminal atom, the second
            # atom must either be immediately after the first atom, or
            # there is a directional bond between them.
            wildcard_end = wildcard_match.end()
            second_atom_match = _atom_pattern.match(new_smiles, wildcard_end)
            if second_atom_match is None:
                # There was no atom. Is it a "/" or "\"? If so,
                # we'll need to swap the direction when we move
                # to a closure after the second atom.
                bond_dir = new_smiles[wildcard_end:wildcard_end+1]
                if bond_dir == "/":
                    direction = "\\"
                elif bond_dir == "\\":
                    direction = "/"
                    raise AssertionError(new_smiles)
                # Look for the second atom, which must exist
                second_atom_match = _atom_pattern.match(new_smiles, wildcard_end+1)
                if second_atom_match is None:
                    raise AssertionError((new_smiles, new_smiles[match_end:]))
                direction = ""

            second_atom_term = second_atom_match.group(1)
            # I changed the bond configuration, so I may need to
            # invert chirality of implicit chiral hydrogens.
            if "@@H" in second_atom_term:
                second_atom_term = second_atom_term.replace("@@H", "@H")
            elif "@H" in second_atom_term:
                second_atom_term = second_atom_term.replace("@H", "@@H")

            # Reassemble the string with the wildcard term deleted and
            # the new closure inserted directly after the second atom
            # (and before any of its closures).
            new_smiles = (new_smiles[:wildcard_start] + second_atom_term
                          + direction + closure_term
                          + new_smiles[second_atom_match.end():])
            # The match is somewhere inside of a molecule, so we attach
            # assign the closure to the atom it's bonded to on the left
            c = new_smiles[wildcard_start-1]
            if c == "(" or c == ")":
                # In principle, this could be something like "CCC(F)(Cl)[*]", 
                # where I would need to count the number of groups back to
                # the main atom, and flip chirality accordingly. Thankfully,
                # RDKit always puts the "[*]" terms immediately after the
                # preceeding atom, so I don't need to worry.
                raise NotImplementedError("intermediate groups not supported",
                                          new_smiles, new_smiles[wildcard_start-1:])
            elif c in "CNcnOS]Pos0123456789ABDEFGHIJKLMQRTUVWXYZabdefghijklmpqrtuvwxyz":
                # Double-check the the previous character looks like part of an atom.
                wildcard_term = wildcard_match.group()
                # Preserve the direction, if present
                if "/" in wildcard_term:
                    direction = "/"
                elif "\\" in wildcard_term:
                    direction = "\\"
                    direction = ""
                new_smiles = (new_smiles[:wildcard_start] + direction + closure_term
                              + new_smiles[wildcard_match.end():])

                raise AssertionError((new_smiles, c, new_smiles[wildcard_start-1:]))

    return new_smiles

if __name__ == "__main__":
    for smiles in ("*C", "*/CO.*CN", "C*.C(*)N"):
        print(smiles, convert_wildcards_to_closures(smiles, (0,)*smiles.count("*")))

Test driver for fragment_simple()

While the above code looks right, it got that way only as the result of a lot of testing. There are a lot of corner cases, like deuterium and directional bonds, which caused the first versions of this function to fail.

I've found it best to pass a lot of real-world SMILES through code like this, to see if there are any problem. I usually draw from PubChem SMILES or ChEMBL. I'll show you what that looks like.

NOTE: the file I test against, "pubchem.smi", excludes structures with chiral terms. This is deliberate, for reasons I'll get to soon.

The "verify()" function, below, takes a molecule and a list of a pairs. For each pair of atom describing a bond, it fragments the molecule on that bond then reassembles it and tests that the result matches the input.

from __future__ import print_function  
from rdkit import Chem
from smiles_syntax import convert_wildcards_to_closures

def fragment_simple(mol, atom1, atom2):
    rwmol = Chem.RWMol(mol)
    rwmol.RemoveBond(atom1, atom2)
    wildcard1 = rwmol.AddAtom(Chem.Atom(0))
    wildcard2 = rwmol.AddAtom(Chem.Atom(0))
    rwmol.AddBond(atom1, wildcard1, Chem.BondType.SINGLE) 
    rwmol.AddBond(atom2, wildcard2, Chem.BondType.SINGLE) 
    return rwmol.GetMol()

def verify(mol, bond_atoms):
    # Even though I'm not checking for stereochemistry, some of
    # my input files contain isotopes. If I don't use isomeric
    # SMILES then the input '[2H]' gets represented as '[H]',
    # which during recanonicalization is placed on the main atom.
    # (RDKit 2016 added an an option to disable that.)
    expected_smiles = Chem.MolToSmiles(mol, isomericSmiles=True)
    for begin_atom, end_atom in bond_atoms:
        fragmented_mol = fragment_simple(mol, begin_atom, end_atom)
        fragmented_smiles = Chem.MolToSmiles(fragmented_mol, isomericSmiles=True)
        assert "." in fragmented_smiles, fragmented_smiles # safety check
        closure_smiles = convert_wildcards_to_closures(fragmented_smiles, (0, 0))
        assert "%90" in closure_smiles, closure_smiles # safety check
        closure_mol = Chem.MolFromSmiles(closure_smiles)
        final_smiles = Chem.MolToSmiles(closure_mol, isomericSmiles=True)
        if final_smiles != expected_smiles:
            raise ValueError( (expected_smiles, begin_atom, end_atom, fragmented_smiles,
                               closure_smiles, final_smiles) )
I can run it manually, and also check what it does when I break one of the ring bonds:
>>> vanillin = Chem.MolFromSmiles("c1(C=O)cc(OC)c(O)cc1")
>>> verify(vanillin, [(4, 5)])
>>> verify(vanillin, [(3, 4)])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 12, in verify
AssertionError: [*]COc1([*])cc(C=O)ccc1O

I can have it break all single bond which aren't in a ring and which aren't connected to a wildcard atom (#0) or a hydrogen (#1):

>>> single_bond_pat = Chem.MolFromSmarts("[!#0;!#1]-!@[!#0;!#1]")
>>> matches = vanillin.GetSubstructMatches(single_bond_pat)
>>> matches
((0, 1), (4, 5), (5, 6), (7, 8))
>>> verify(vanillin, matches)

Once I'm happy that the verification function is correct, I can put it in a test driver:

def main():
    # Match a single bond not in a ring
    single_bond_pat = Chem.MolFromSmarts("[!#0;!#1]-!@[!#0;!#1]")

    # NOTE! This file contains no chiral structures (nothing with a '@')
    with open("pubchem.smi") as infile:
        ## You might try processing every 100th record
        # import itertools
        # infile = itertools.islice(infile, None, None, 100)

        ## Or process the lines in random order
        # import random
        # lines = infile.readlines()
        # random.shuffle(lines)
        # infile = lines
        recno = -1
        num_tests = 0
        for recno, line in enumerate(infile):
            if recno % 100 == 0:
                print("Processing record:", recno, "#tests:", num_tests)
            smiles, id = line.split()
            if "*" in smiles:
                print("Skipping record with a '*':", repr(line))
            if "@" in smiles:
                print("Skipping record with a chiral '@':", repr(line))
            mol = Chem.MolFromSmiles(smiles)
            if mol is None:

            matches = mol.GetSubstructMatches(single_bond_pat)
            verify(mol, matches)
            num_tests += len(matches)
        recno += 1
        print("Processed", recno, "records and", num_tests, "tests")
if __name__ == "__main__":
The output from this looks like:
% python fragment2.py
Processing record: 0 #tests: 0
Processing record: 100 #tests: 1102
Processing record: 200 #tests: 2078
Processing record: 300 #tests: 3111
Processing record: 6800 #tests: 72322
[00:24:27] Explicit valence for atom # 2 Cl, 3, is greater than permitted
Processing record: 6900 #tests: 73275

fragment_simple() does not preserve chirality

That seems like a lot of test code just to show that the function, "fragment_simple()", is correct. It's only seven lines long, without any branches. It's the sort of funciton that should be easy to test by eye or with a few test cases.

Which is what I did the first ten or so times I wrote that function. But then I found out that the function doesn't preserve chirality. Here's an example of the verify() function failing on what should be a simple test case:

>>> mol = Chem.MolFromSmiles("C[C@](N)(C=O)c1ccccc1")
>>> verify(mol, [(1, 3)])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 21, in verify
ValueError: ('C[C@](N)(C=O)c1ccccc1', 1, 3, '[*]C=O.[*][C@@](C)(N)c1ccccc1',
'C%90=O.[C@@]%90(C)(N)c1ccccc1', 'C[C@@](N)(C=O)c1ccccc1')

You can see the problem in the fragmentation step, where the chirality somehow inverted:

>>> mol = Chem.MolFromSmiles("C[C@](N)(C=O)c1ccccc1")
>>> new_mol = fragment_simple(mol, 1, 3)
>>> Chem.MolToSmiles(new_mol, isomericSmiles=True)
The second fragment should have "[C@]" instead of "[C@@]". Why did it flip?

The problem comes back to the "RemoveBond()/AddBond()" in fragment_simple(). RDKit stores the chirality as an atom property, but that property is actually based on the permutation order of the bonds. If the permutation order changes, then the chirality changes even if the chirality property hasn't changed.

To demonstrate, I need a helper function to show the connection table:

_bond_symbol = {
    Chem.BondType.SINGLE: "-",
    Chem.BondType.DOUBLE: "=",
    Chem.BondType.TRIPLE: "#",
    Chem.BondType.AROMATIC: ":",
def get_bond_info(from_atom_idx, bond):
    to_atom_idx = bond.GetBeginAtomIdx()
    if from_atom_idx == to_atom_idx:
        to_atom_idx = bond.GetEndAtomIdx()
    return _bond_symbol[bond.GetBondType()] + str(to_atom_idx)

# Example output line:
#   5 C -1 :6 :10 
# This says that atom[5] is a "C"arbon with 3 bond.
# The first bond is a single bond to atom[1].
# The second bond is an aromatic bond to atom[6]. 
# The third bond is an aromatic bond to atom[10].

def show_connection_table(mol):
    for atom in mol.GetAtoms():
        bonds = " ".join(get_bond_info(atom.GetIdx(), bond) for bond in atom.GetBonds())
        print(atom.GetIdx(), atom.GetSymbol(), bonds)
then use it to show the connection table for the two molecules.
>>> show_connection_table(mol)
0 C -1
1 C -0 -2 -3 -5
2 N -1
3 C -1 =4
4 O =3
5 C -1 :6 :10  
6 C :5 :7
7 C :6 :8
8 C :7 :9
9 C :8 :10
10 C :9 :5
>>> show_connection_table(new_mol)
0 C -1
1 C -0 -2 -5 -11
2 N -1
3 C =4 -12
4 O =3
5 C -1 :6 :10
6 C :5 :7
7 C :6 :8
8 C :7 :9
9 C :8 :10
10 C :9 :5
11 * -1
12 * -3
I deleted the bond from atom[1] to atom[3]. It's atom[1] which has the chiral property so look at the second line of each of the outputs.

The original bond configuration order was (0, 2, 3, 5), while now it's (0, 2, 5, 11), where the third bond which went to the old atom[5] was replaced with the new fourth bond to the new wildcard atom[11]. The permutation order changed. The third and fourth bonds must be swapped to get back to the original permutation order. Each flip in the permutation order corresponds to a chirality inversion, which is why the output chirality is inverted.

Stay tuned to this channel for the continuing story of how to fragment a molecule and preserve chirality after reassembly. (Remember, RDKit has a built-in function for this, FragmentOnBonds(). Use that if you want to fragment.)

Andrew Dalke is an independent consultant focusing on software development for computational chemistry and biology. Need contract programming, help, or training? Contact me

Copyright © 2001-2013 Andrew Dalke Scientific AB