Dalke Scientific Software: More science. Less time. Products
[ previous | newer ]     /home/writings/diary/archive/2010/12/28/reordering_smiles

Reordering SMILES atom order

Over in Blue Obelisk's Shapado site Noel O'Boyle asked "How to reorder a SMILES string to have specific first and final atoms?" I'll describe a way to get any ordering you want for the SMILES atoms.

Reordering the atoms in a SMILES string is not something I need often but being able to build a SMILES string from a list of atoms and bonds (for example, from a molecular substructure) is very useful, and the technique I develop here can be applied directly to that problem.

The time: 1999. I was an employee at Bioreason, which developed data mining techniques for HTS data. Some of our code worked with SD records and identified atoms based on their position in the records. Others worked with SMILES. We wanted to keep the SMILES atom ordering the same so we could use the position information directly and not have to map from one ordering system to the other.

At that time we used the Daylight toolkit. It generates a SMILES by walking the spanning tree of the molecular graph. The choice of start atom and the direction of which branch or cycle to take are determined by the canonicalization algorithm. We could change how those decisions were made, but it would still be a spanning tree traversal. It isn't enough to guarantee a specific atom order.

As a simple example, the SMILES for cyanic acid is N#CO. That's a nitrogen triple-bonded to a carbon which in turn is single bonded to an oxygen. In this SMILES form, hydrogens are used to fill the valences. Here the N and C valances are filled so they have no hydrogens. The oxygen is not filled, so there's a hydrogen on the oxygen.

If I want the "C" to be the first atom in the SMILES then the spanning tree algorithm would generate either C(#N)O or C(O)#N. But what if I wanted the "O" first, the "N" second, and the "C" last? It's not possible to specify an ordering in the Daylight tools, nor do any of the modern toolkits have this functionality built-in.

Dot disconnects and closures

That's okay: if you don't mind ignoring stereochemistry and chirality then it's easy to create your own equivalent SMILES. How? Dot disconnections and closures. These are often called "ring closures" but there's no need to use them only to close rings so I'll just call them "closures."

I'll start with the simplest case where I don't care about the bonds or the hydrogen count. Then it's easy to list the atoms in the right order. For cyanic acid, that's "O.N.C". The "." between the atoms is a "dot disconnect." It means that there's no bond between the two adjacent atoms.

I'll use closures to link the atoms together. You can think of a closure as a half-bond. A closure is mentioned twice: once for each bond it connects. For example, I can connect the "O" and "C" atoms by doing "O1.N.C1". This says closure 1 makes a single bond between the two marked atoms. I can also connect the "N" and "C" atoms as well with "O1.N#2.C1#2". In this case I need to specify that it's a triple bond and not the default single-or-aromatic bond.

So the oxygen has one bond, and it's a single bond to the carbon. The nitrogen has one bond, and it's a triple bond to the carbon. The carbon has two bonds, which is are the single bond to the oxygen and the triple bond to the nitrogen.

Tagging bonds with a unique closure number

Implementing this as an algorithm is pretty simple. The hardest part is actually writing the atoms in the correct form, since SMILES has a short-hand notation which means "O" is the preferred form instead of "[OH2]" and "CS(=O)[O-]" is preferred over "[CH3][SH](=[O])[O-]". See how I wrote "preferred"? There's no requirement to use the short-hand notation, so I'm always going to write the atoms in the "[bracketed]" definition, along with the number of hydrogens and charge. Notice how I listed the "#" twice? I don't need to, and most programs will only indicate the bond type on the first closure, but including it makes the code easier to write.

I'm not going to handle the isotope number, I'm not going to deal with chirality, and I'm not going to support stereochemistry.

SMILES representations for an atom, a bond, or a closure

The first bit shows the code for generating the SMILES for given atoms, bonds, and closures.

# Given a molecule and the list of atoms in the molecule,
# generate an equivalent SMILES string where the atoms are in
# the same order as the list.
# Caution: this version is limited to 100 bonds.

from rdkit import Chem
from rdkit.Chem.rdchem import BondType

def format_atom(atom):
    # A general purpose SMILES writer would check to see if the atom
    # can be written without the []s and if so, only show the symbol.
    # This version is easier, though incomplete (no isotope support)
    # The goal is to get the point across and not make the best SMILES
    smiles = '['
    if atom.GetIsAromatic():
        smiles += atom.GetSymbol().lower()
    else:
        smiles += atom.GetSymbol()

    hcount = atom.GetNumExplicitHs() + atom.GetNumImplicitHs()
    if hcount != 0:
        smiles += "H%s" % hcount
            
    charge = atom.GetFormalCharge()
    if charge != 0:
        # Represent "[Cl-]" and "[NH4+]" as "[Cl-1]" and "[NH4+1]"
        # That's a bit more cumbersome, but it's easy to code
        smiles += "%+d" % charge

    return smiles + "]"

_bond_symbols = {
    BondType.SINGLE: "",
    BondType.AROMATIC: "",  # No need to use ":" to force aromatic perception since
                            # an implicit bond between two aromatics is aromatic
    BondType.DOUBLE: "=",
    BondType.TRIPLE: "#"
    }

def format_bond(bond):
    bondtype = bond.GetBondType()
    # Handle the special case where the end atoms are aromatic but the
    # bond itself is not aromatic. c1ccccc1-c1ccccc1 . This preserves
    # aromatic information for some ring systems when parsed by tools
    # like OEChem which don't atomatically perceive aromaticity on input.
    if (bondtype is BondType.SINGLE and
        bond.GetBeginAtom().GetIsAromatic() and
        bond.GetEndAtom().GetIsAromatic()):
        return "-"
    return _bond_symbols[bondtype]
    

# the closure symbols in the order:
#   1, 2, ..., 9, %10, %11, ... %99, 0
# (I put 0 last since most people expect to start from 1)
_closure_symbols = list("123456789") + ["%"+"%d"%i for i in range(10, 100)] + ["0"]
def format_closure(i):
    return _closure_symbols[i]

Assign unique closure numbers to each bond

With that support structure out of the way (why don't chemistry toolkits have built-in functions equivalent to format_atom() and format_bond()?), I can go ahead and implement the simple bond numbering algorithm:

# Caution: this version is limited to 100 bonds.
def reordered_smiles_100(mol, atoms):
    # Make a mapping from bond.GetIdx() to a unique closure string
    closure_map = {}
    for bond_i, bond in enumerate(mol.GetBonds()):
        closure_map[bond.GetIdx()] = format_closure(bond_i)

    smiles = ""
    for atom_i, atom in enumerate(atoms):
        if atom_i != 0:
            # Each atom is dot disconnected
            # (Don't put a "." before the first atom)
            smiles += "."
        # Save the atom, in the correct order
        smiles += format_atom(atom)

        # Make the bond connection through the closure
        for bond in atom.GetBonds():
            smiles += format_bond(bond)
            smiles += closure_map[bond.GetIdx()]

    return smiles

Testing the unique closure number bond assignment algorithm

It's short and it looks right, but how do I test it? The easiest is to construct some test cases, run them manually, verify they are correct, and put those in as automated test. Actually, I ran things in the interactive shell or from the command line until the results looked correct, then made the automated tests, then evaluated my tests to make sure they were sufficiently diverse and complete.

def test_symbols_100():
    # Make sure the charges and hydrogen counts stay correct
    mol = Chem.MolFromSmiles("[NH4+].[Cl-].[F].[He-2]")
    s = reordered_smiles_100(mol, mol.GetAtoms())
    assert s == "[NH4+1].[Cl-1].[F].[He-2]", s

    # Make sure I can specify the order
    s = reordered_smiles_100(mol, reversed(mol.GetAtoms()))
    assert s == "[He-2].[F].[Cl-1].[NH4+1]", s

    # Make sure the bond types are carried over
    mol = Chem.MolFromSmiles("CCO.N=O.C#N")
    s = reordered_smiles_100(mol, mol.GetAtoms())
    assert s == "[CH3]1.[CH2]12.[OH1]2.[NH1]=3.[O]=3.[CH1]#4.[N]#4"

    # Make sure the aromatic carbons stay aromatic
    mol = Chem.MolFromSmiles("c1ccccc1")
    s = reordered_smiles_100(mol, mol.GetAtoms())
    assert s == "[cH1]16.[cH1]12.[cH1]23.[cH1]34.[cH1]45.[cH1]56", s

    # Make sure the connection between the two rings has an explicit single bond
    mol = Chem.MolFromSmiles("c1ccccc1c2ccccc2")
    s = reordered_smiles_100(mol, mol.GetAtoms())
    assert "-" in s, s

Random shuffle testing of NCI structures

This is nice, but the structures aren't very complex and don't look much like real SMILES. I have a bunch of SMILES strings from the old NCI data set. I can use those to write a stress test based on canonical SMILES. Two molecular graphs are the same if and only if they have the same canonical SMILES. So if I have a molecule and I shuffle its atoms around in any arbitrary order to get a new SMILES, then both the original SMILES and the reordered SMILES must have the same canonical SMILES. And I can test that. In fact, I'll do 100 different orderings for each input SMILES.

There is one problem. I don't support chirality or stereochemistry but the NCI test set includes those. I need to strip this data out before I use it. But that's easy! If I assume tetragonal chiral classes represented only as "@" or "@@" then I can get the non-chiral versions by removing the "@"s. Similarly, I can replace "/" and "\" with "-" and get the forms without stereochemistry. This removal can be done directly on the SMILES string, like this:

# This removes standard chiral and stereochemistry from SMILES strings.
# It's very simple and will not work if stereo classes are specified.
# (If that happens then you'll get a SMILES parse error.)
def remove_stereochemistry(smiles):
    return smiles.replace("@", "").replace("/", "-").replace("\\", "-")

With that in place, here's the code to shuffle atoms around and compare the reordered canonical SMILES with the original:

import random

input_smiles = [
    "Clc1c(/C=C\\C(=O)NNC(=O)Cn2nc(cc2C)C)c(F)ccc1",
    "o1nc(nc1CCC(=O)NNC(=O)Cn1nc(cc1C)C)c1ccccc1",
    "o1nc(c(c1C)C(=O)NNC(=O)Cn1nc(cc1C)C)CC",
    "O=C(CCC(=O)NNC(=O)Cn1nc(cc1C)C)c1ccc(cc1)c1ccccc1",
    "S(c1c(C(=O)NNC(=O)Cn2nc(cc2C)C)cccc1)CC(=O)N(C)C",
    "Clc1c2OCCCOc2cc(c1)/C=C\\C(=O)NNC(=O)Cn1nc(cc1C)C",
    "S1C[CH]([NH2+]C1)(C(=O)N1[C@@H](CN(CC1)C(=O)NC(C)C)(C(=O)N[C@H]1(CCCNC1=O)))",
    "S1C[CH](NC1)(C(=O)N1[C@@H](CN(CC1)C(=O)NC(C)C)(C(=O)N[C@@H]1(CCCNC1=O)))",
    "O=C(NNC(=O)Cc1ccc(cc1)C)Cn1nc(cc1C)C",
    "s1cc(nc1C)COc1ccc(C(=O)NNC(=O)Cn2nc(cc2C)C)cc1",
    "O=C(NNC(=O)/C=C\\c1ccc(CC)cc1)Cn1nc(cc1C)C",
    "Clc1c(N2CCCC2=O)cc(cc1)C(=O)NNC(=O)Cn1nc(cc1C)C",
    "O=C(NNC(=O)Cn1nc(cc1C)C)Cc1c(n2ncnc2nc1C)C",
    "Clc1cc(ncc1)C(=O)NCC(=O)NNC(=O)Cn1nc(cc1C)C",
    "O=C1N(CCC(=O)NNC(=O)Cn2nc(cc2C)C)C(=O)c2c(C1)cccc2",
    "O1[C@H](CC(=O)NNC(=O)Cn2nc(cc2C)C)(C(=O)Nc2c1cccc2)",
    "O1[C@@H](CC(=O)NNC(=O)Cn2nc(cc2C)C)(C(=O)Nc2c1cccc2)",
    "s1c(nc2c1cccc2)CCCCC(=O)NNC(=O)Cn1nc(cc1C)C",
    "Brc1cc(C(=O)NNC(=O)Cn2nc(cc2C)C)ccc1OC",
    "Clc1cc2c(oc(cc2=O)C(=O)NNC(=O)Cn2nc(cc2C)C)cc1",
    "O=C1N(C(=O)c2c(C1)cccc2)CC(=O)NNC(=O)Cn1nc(cc1C)C",
]
    

def test_shuffle_100():
    for smiles in input_smiles:
        # Reinterpret the SMILES, without chirality or stereo information
        smiles = remove_stereochemistry(smiles)
        canonical_smiles = Chem.CanonSmiles(smiles)
        print "I have", repr(smiles)
        assert "@" not in canonical_smiles, canonical_smiles
        assert "/" not in canonical_smiles, canonical_smiles
        assert "\\" not in canonical_smiles, canonical_smiles

        mol = Chem.MolFromSmiles(canonical_smiles)
        atoms = list(mol.GetAtoms())

        # Generate some random permutations and verify matching SMILES
        for i in range(100):
            random.shuffle(atoms)
            new_smiles = reordered_smiles_100(mol, atoms)
            new_canonical_smiles = Chem.CanonSmiles(new_smiles)
            assert canonical_smiles == new_canonical_smiles
I'll also need a bit of code to make the tests work as a "mainline" program:
def test():
    test_symbols_100()
    test_shuffle_100()

if __name__ == "__main__":
    test()
If you run this you'll get 21 lines of output, that is, one for each input structure. It looks like the code works.

All the code at once

It's easier to get the code in one spot rather than splicing together all the small code blocks I posted above, so here's the code and tests in a single chunk:

# Given a molecule and the list of atoms in the molecule,
# generate an equivalent SMILES string where the atoms are in
# the same order as the list.
# Caution: this version is limited to 100 bonds.

from rdkit import Chem
from rdkit.Chem.rdchem import BondType

def format_atom(atom):
    # A general purpose SMILES writer would check to see if the atom
    # can be written without the []s and if so, only show the symbol.
    # This version is easier, though incomplete (no isotope support)
    # The goal is to get the point across and not make the best SMILES
    smiles = '['
    if atom.GetIsAromatic():
        smiles += atom.GetSymbol().lower()
    else:
        smiles += atom.GetSymbol()

    hcount = atom.GetNumExplicitHs() + atom.GetNumImplicitHs()
    if hcount != 0:
        smiles += "H%s" % hcount
            
    charge = atom.GetFormalCharge()
    if charge != 0:
        # Represent "[Cl-]" and "[NH4+]" as "[Cl-1]" and "[NH4+1]"
        # That's a bit more cumbersome, but it's easy to code
        smiles += "%+d" % charge

    return smiles + "]"

_bond_symbols = {
    BondType.SINGLE: "",
    BondType.AROMATIC: "",  # No need to use ":" to force aromatic perception since
                            # an implicit bond between two aromatics is aromatic
    BondType.DOUBLE: "=",
    BondType.TRIPLE: "#"
    }

def format_bond(bond):
    bondtype = bond.GetBondType()
    # Handle the special case where the end atoms are aromatic but the
    # bond itself is not aromatic. c1ccccc1-c1ccccc1 . This preserves
    # aromatic information for some ring systems when parsed by tools
    # like OEChem which don't atomatically perceive aromaticity on input.
    if (bondtype is BondType.SINGLE and
        bond.GetBeginAtom().GetIsAromatic() and
        bond.GetEndAtom().GetIsAromatic()):
        return "-"
    return _bond_symbols[bondtype]
    

# the closure symbols in the order:
#   1, 2, ..., 9, %10, %11, ... %99, 0
# (I put 0 last since most people expect to start from 1)
_closure_symbols = list("123456789") + ["%"+"%d"%i for i in range(10, 100)] + ["0"]
def format_closure(i):
    return _closure_symbols[i]


# Caution: this version is limited to 100 bonds.
def reordered_smiles_100(mol, atoms):
    # Make a mapping from bond.GetIdx() to a unique closure string
    closure_map = {}
    for bond_i, bond in enumerate(mol.GetBonds()):
        closure_map[bond.GetIdx()] = format_closure(bond_i)

    smiles = ""
    for atom_i, atom in enumerate(atoms):
        if atom_i != 0:
            # Each atom is dot disconnected
            # (Don't put a "." before the first atom)
            smiles += "."
        # Save the atom, in the correct order
        smiles += format_atom(atom)

        # Make the bond connection through the closure
        for bond in atom.GetBonds():
            smiles += format_bond(bond)
            smiles += closure_map[bond.GetIdx()]

    return smiles

def test_symbols_100():
    # Make sure the charges and hydrogen counts stay correct
    mol = Chem.MolFromSmiles("[NH4+].[Cl-].[F].[He-2]")
    s = reordered_smiles_100(mol, mol.GetAtoms())
    assert s == "[NH4+1].[Cl-1].[F].[He-2]", s

    # Make sure I can specify the order
    s = reordered_smiles_100(mol, reversed(mol.GetAtoms()))
    assert s == "[He-2].[F].[Cl-1].[NH4+1]", s

    # Make sure the bond types are carried over
    mol = Chem.MolFromSmiles("CCO.N=O.C#N")
    s = reordered_smiles_100(mol, mol.GetAtoms())
    assert s == "[CH3]1.[CH2]12.[OH1]2.[NH1]=3.[O]=3.[CH1]#4.[N]#4"

    # Make sure the aromatic carbons stay aromatic
    mol = Chem.MolFromSmiles("c1ccccc1")
    s = reordered_smiles_100(mol, mol.GetAtoms())
    assert s == "[cH1]16.[cH1]12.[cH1]23.[cH1]34.[cH1]45.[cH1]56", s

    # Make sure the connection between the two rings has an explicit single bond
    mol = Chem.MolFromSmiles("c1ccccc1c2ccccc2")
    s = reordered_smiles_100(mol, mol.GetAtoms())
    assert "-" in s, s

# This removes standard chiral and stereochemistry from SMILES strings.
# It's very simple and will not work if stereo classes are specified.
# (If that happens then you'll get a SMILES parse error.)
def remove_stereochemistry(smiles):
    return smiles.replace("@", "").replace("/", "-").replace("\\", "-")

import random

input_smiles = [
    "Clc1c(/C=C\\C(=O)NNC(=O)Cn2nc(cc2C)C)c(F)ccc1",
    "o1nc(nc1CCC(=O)NNC(=O)Cn1nc(cc1C)C)c1ccccc1",
    "o1nc(c(c1C)C(=O)NNC(=O)Cn1nc(cc1C)C)CC",
    "O=C(CCC(=O)NNC(=O)Cn1nc(cc1C)C)c1ccc(cc1)c1ccccc1",
    "S(c1c(C(=O)NNC(=O)Cn2nc(cc2C)C)cccc1)CC(=O)N(C)C",
    "Clc1c2OCCCOc2cc(c1)/C=C\\C(=O)NNC(=O)Cn1nc(cc1C)C",
    "S1C[CH]([NH2+]C1)(C(=O)N1[C@@H](CN(CC1)C(=O)NC(C)C)(C(=O)N[C@H]1(CCCNC1=O)))",
    "S1C[CH](NC1)(C(=O)N1[C@@H](CN(CC1)C(=O)NC(C)C)(C(=O)N[C@@H]1(CCCNC1=O)))",
    "O=C(NNC(=O)Cc1ccc(cc1)C)Cn1nc(cc1C)C",
    "s1cc(nc1C)COc1ccc(C(=O)NNC(=O)Cn2nc(cc2C)C)cc1",
    "O=C(NNC(=O)/C=C\\c1ccc(CC)cc1)Cn1nc(cc1C)C",
    "Clc1c(N2CCCC2=O)cc(cc1)C(=O)NNC(=O)Cn1nc(cc1C)C",
    "O=C(NNC(=O)Cn1nc(cc1C)C)Cc1c(n2ncnc2nc1C)C",
    "Clc1cc(ncc1)C(=O)NCC(=O)NNC(=O)Cn1nc(cc1C)C",
    "O=C1N(CCC(=O)NNC(=O)Cn2nc(cc2C)C)C(=O)c2c(C1)cccc2",
    "O1[C@H](CC(=O)NNC(=O)Cn2nc(cc2C)C)(C(=O)Nc2c1cccc2)",
    "O1[C@@H](CC(=O)NNC(=O)Cn2nc(cc2C)C)(C(=O)Nc2c1cccc2)",
    "s1c(nc2c1cccc2)CCCCC(=O)NNC(=O)Cn1nc(cc1C)C",
    "Brc1cc(C(=O)NNC(=O)Cn2nc(cc2C)C)ccc1OC",
    "Clc1cc2c(oc(cc2=O)C(=O)NNC(=O)Cn2nc(cc2C)C)cc1",
    "O=C1N(C(=O)c2c(C1)cccc2)CC(=O)NNC(=O)Cn1nc(cc1C)C",
]
    

def test_shuffle_100():
    for smiles in input_smiles:
        # Reinterpret the SMILES, without chirality or stereo information
        smiles = remove_stereochemistry(smiles)
        canonical_smiles = Chem.CanonSmiles(smiles)
        print "I have", repr(smiles)
        canonical_smiles = Chem.CanonSmiles(smiles)
        assert "@" not in canonical_smiles, canonical_smiles
        assert "/" not in canonical_smiles, canonical_smiles
        assert "\\" not in canonical_smiles, canonical_smiles

        mol = Chem.MolFromSmiles(canonical_smiles)
        atoms = list(mol.GetAtoms())

        # Generate some random permutations and verify matching SMILES
        for i in range(100):
            random.shuffle(atoms)
            new_smiles = reordered_smiles_100(mol, atoms)
            new_canonical_smiles = Chem.CanonSmiles(new_smiles)
            assert canonical_smiles == new_canonical_smiles

def test():
    test_symbols_100()
    test_shuffle_100()

if __name__ == "__main__":
    test()

What happens when I run out of closure numbers?

Perhaps the above code is a bit too simple. Each bond gets its own closure number, and there are only 100 possible closures. What if you have more than 100 bonds? That's easy to test:

>>> from rdkit import Chem
>>> from reordered_smiles import reordered_smiles_100
>>> mol = Chem.MolFromSmiles("C"*101)
>>> reordered_smiles_100(mol, mol.GetAtoms())
'[CH3]1.[CH2]12.[CH2]23.[CH2]34. ... .[CH2]%990.[CH3]0'
>>> mol = Chem.MolFromSmiles("C"*102)
>>> reordered_smiles_100(mol, mol.GetAtoms())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "reordered_smiles.py", line 65, in reordered_smiles_100
    closure_map[bond.GetIdx()] = format_closure(bond_i)
  File "reordered_smiles.py", line 58, in format_closure
    return _closure_symbols[i]
IndexError: list index out of range
>>> 
Nope, can't have a structure with that many bonds.

The solution of course is to know that SMILES processes closures from left to right and that duplicates numbers are allowed once a closure has been closed. For example, I could write ethanol ("CCO") as "C1.C12.O2" or reuse the closure numbers and write it as "C1.C11.O1". It looks strange but it's perfectly valid. Well, except that RDKit has a bug where it doesn't allow duplicate identifiers on the same atom after a dot disconnect. I've reported the bug and it was quickly fixed in rev1586, but I haven't installed the bugfix version.

If you are using the Q4 2010 release or newer then you won't have a problem.

To reuse closure numbers I need to keep track of the available closure numbers. I could use any closure number but I'm going to make the results look like normal SMILES by using the smallest available number as the next closure.

It's only a couple of extra lines do to this. For example, I could use a set of available closures for this, and get the minimum value in the set, but that's an O(n) operation which doesn't appeal to me, even through n will never be larger than 100 to there's no scaling problem here.

A heap of closures

I'll use a heap instead. This is a way to keep track of ordered values such that the smallest value can be found in O(1) time and insertions and deletes can be done in O(log(n)) time. Python does *not* have an abstract heap data type. Instead, it's a "concrete" data type built on top of the list data structure. Python's heapq module contains functions which manipulate the list and maintain heap behavior. Here's some simple examples:

>>> import heapq
>>> available_closures = range(100)
>>> heapq.heapify(available_closures)
>>> heapq.heappop(available_closures)
0
>>> heapq.heappop(available_closures)
1
>>> heapq.heappush(available_closures, 0)
>>> heapq.heappop(available_closures)
0
>>> heapq.heappop(available_closures)
2
>>> 

To use it I turn the list of 100 possible closure numbers into a heap with "heapify". I ask for the smallest item in the heap with heappop. When the closure number is available again, I put it back into the heap with heappush.

The code to set up the heap is simple. I have 100 possible closure values so I make a list of them and turn it into a heap.

import heapq

# Convert the closure numbers into a heap so I can find the smallest unused number.
# I'll still use format_closure to map that to the correct closure respresentation
_closures = range(100)
heapq.heapify(_closures)
I'm going to be a bit clever and initialize the heap once. When reordered_smiles() starts I'll make a copy of the heap and work with the copy. But this is a minor detail which doesn't really affect things.

Algorithm to reuse closure numbers

In the first version I included the bond type information when I both opened and closed a closure. I only needed to include it once, but it made the code simpler since I didn't have to check which case I was in. But here, since I have to do different things for each case, I might as well put the code in which includes the bond type information only when I open the closure.

With all that explanation in place, I think you can understand the algorithm just from the code:

def reordered_smiles(mol, atoms):
    # Clone the heap so I can always get the smallest closure number
    available_closures = _closures[:]
    # Map from bond.GetIdx() to the unique closure number
    closure_map = {}

    smiles = ""
    for i, atom in enumerate(atoms):
        if i != 0:
            # Each atom is dot disconnected
            # (Don't put a "." before the first atom)
            smiles += "."

        # Save the atom, in the correct order
        smiles += format_atom(atom)

        # Work around an RDKit bug where the closures can't be reused
        # for the same atom after there's been a dot disconnect.
        # Save the list of finished closures until the end of atom
        # processing, which is whey they can safely be returned to the
        # heap of available closures.
        # Note: this bug is fixed in the Q4 2010 release and newer
        # (Without the workaround this line wouldn't be needed.)
        to_restore = []
        
        for bond in atom.GetBonds():
            bond_idx = bond.GetIdx()
            if bond_idx in closure_map:
                # Already seen, so just print the symbol and return it to the pool
                closure = closure_map.pop(bond_idx)
                # The bond type was included when the closure was opened so
                # theres' no need to specify the type again.
                smiles += format_closure(closure)

                # Workaround the RDKit bug
                to_restore.append(closure)
                # Otherwise I would just do
                # heapq.heappush(available_closures, closure)
                
            else:
                # Need a new closure
                closure = heapq.heappop(available_closures)
                # This opens the closure so include the bond type information
                smiles += format_bond(bond)
                smiles += format_closure(closure)
                # Keep track of the closure number for this bond
                closure_map[bond_idx] = closure

        # Implement the cleanup step of the RDKit workaround,
        # (Otherwise these two lines wouldn't be needed.)
        for closure in to_restore:
            heapq.heappush(available_closures, closure)

    return smiles

Can't forget the tests!

I'll have to modify the tests a bit since the closures are different, but structually this is identical to the previous version. I did add a test case with 200 carbons since that would have failed under the previous implementation.


def test_symbols():
    mol = Chem.MolFromSmiles("[NH4+].[Cl-].[F].[He-2]")
    s = reordered_smiles(mol, mol.GetAtoms())
    assert s == "[NH4+1].[Cl-1].[F].[He-2]", s

    # Make sure I can specify the order
    s = reordered_smiles(mol, reversed(mol.GetAtoms()))
    assert s == "[He-2].[F].[Cl-1].[NH4+1]", s

    # Make sure the bond types are carried over.
    # They should only be on the start of a closure, and not both sides.
    # This also checks that I'm re-using bond closure numbers correctly.
    mol = Chem.MolFromSmiles("CCO.N=O.C#N")
    s = reordered_smiles(mol, mol.GetAtoms())
    assert s == "[CH3]1.[CH2]12.[OH1]2.[NH1]=1.[O]1.[CH1]#1.[N]1"

    # Make sure the aromatic carbons stay aromatic
    mol = Chem.MolFromSmiles("c1ccccc1")
    s = reordered_smiles(mol, mol.GetAtoms())
    assert s == "[cH1]12.[cH1]13.[cH1]31.[cH1]13.[cH1]31.[cH1]12", s

    # Make sure the connection between the two rings has an explicit single bond
    mol = Chem.MolFromSmiles("c1ccccc1c2ccccc2")
    s = reordered_smiles(mol, mol.GetAtoms())
    assert "-" in s, s

def test_large():
    mol = Chem.MolFromSmiles("C" * 200)
    s = reordered_smiles(mol, mol.GetAtoms())
    assert s == "[CH3]1" + ".[CH2]12.[CH2]21" * 99 + ".[CH3]1", s

import random

input_smiles = [
    "Clc1c(/C=C\\C(=O)NNC(=O)Cn2nc(cc2C)C)c(F)ccc1",
    "o1nc(nc1CCC(=O)NNC(=O)Cn1nc(cc1C)C)c1ccccc1",
    "o1nc(c(c1C)C(=O)NNC(=O)Cn1nc(cc1C)C)CC",
    "O=C(CCC(=O)NNC(=O)Cn1nc(cc1C)C)c1ccc(cc1)c1ccccc1",
    "S(c1c(C(=O)NNC(=O)Cn2nc(cc2C)C)cccc1)CC(=O)N(C)C",
    "Clc1c2OCCCOc2cc(c1)/C=C\\C(=O)NNC(=O)Cn1nc(cc1C)C",
    "S1C[CH]([NH2+]C1)(C(=O)N1[C@@H](CN(CC1)C(=O)NC(C)C)(C(=O)N[C@H]1(CCCNC1=O)))",
    "S1C[CH](NC1)(C(=O)N1[C@@H](CN(CC1)C(=O)NC(C)C)(C(=O)N[C@@H]1(CCCNC1=O)))",
    "O=C(NNC(=O)Cc1ccc(cc1)C)Cn1nc(cc1C)C",
    "s1cc(nc1C)COc1ccc(C(=O)NNC(=O)Cn2nc(cc2C)C)cc1",
    "O=C(NNC(=O)/C=C\\c1ccc(CC)cc1)Cn1nc(cc1C)C",
    "Clc1c(N2CCCC2=O)cc(cc1)C(=O)NNC(=O)Cn1nc(cc1C)C",
    "O=C(NNC(=O)Cn1nc(cc1C)C)Cc1c(n2ncnc2nc1C)C",
    "Clc1cc(ncc1)C(=O)NCC(=O)NNC(=O)Cn1nc(cc1C)C",
    "O=C1N(CCC(=O)NNC(=O)Cn2nc(cc2C)C)C(=O)c2c(C1)cccc2",
    "O1[C@H](CC(=O)NNC(=O)Cn2nc(cc2C)C)(C(=O)Nc2c1cccc2)",
    "O1[C@@H](CC(=O)NNC(=O)Cn2nc(cc2C)C)(C(=O)Nc2c1cccc2)",
    "s1c(nc2c1cccc2)CCCCC(=O)NNC(=O)Cn1nc(cc1C)C",
    "Brc1cc(C(=O)NNC(=O)Cn2nc(cc2C)C)ccc1OC",
    "Clc1cc2c(oc(cc2=O)C(=O)NNC(=O)Cn2nc(cc2C)C)cc1",
    "O=C1N(C(=O)c2c(C1)cccc2)CC(=O)NNC(=O)Cn1nc(cc1C)C",
]

# This removes standard chiral and stereochemistry from SMILES strings.
# It's very simple and will not work if stereo classes are specified.
# (If that happens then you'll get a SMILES parse error.)
def remove_stereochemistry(smiles):
    return smiles.replace("@", "").replace("/", "-").replace("\\", "-")


def test_shuffle():
    for smiles in input_smiles:
        # Reinterpret the SMILES, without chirality or stereo information
        smiles = remove_stereochemistry(smiles)
        canonical_smiles = Chem.CanonSmiles(smiles)
        print "I have", repr(smiles)
        assert "@" not in canonical_smiles, canonical_smiles
        assert "/" not in canonical_smiles, canonical_smiles
        assert "\\" not in canonical_smiles, canonical_smiles

        mol = Chem.MolFromSmiles(canonical_smiles)
        atoms = list(mol.GetAtoms())

        # Generate some random permutations and verify matching SMILES
        for i in range(100):
            random.shuffle(atoms)
            new_smiles = reordered_smiles(mol, atoms)
            new_canonical_smiles = Chem.CanonSmiles(new_smiles)
            assert canonical_smiles == new_canonical_smiles

def test():
    test_symbols()
    test_shuffle()

if __name__ == "__main__":
    test()

Sequential SMILES atoms don't need closures

I tear the molecule apart with dot disconnects and join them with closures. That works and it's simple to understand. I can definitely think of cases where having the explicit list of all bond closures for a given atom would be useful.

But if you are a SMILES fan you might also think this representation is clumsy, or at least not elegantly aware of how SMILES works. Suppose two atoms are bonded to each other and are sequential in the SMILES string. In that case there's no reason to dot disconnect them then reconnect them with a closure number since they can be connected by just specifying the bond type.

In other words, why have "[CH2]=1.[CH2]1" when you can have "[CH2]=[CH2]"?

The change to the algorithm is simple. When going through the list of bonds for an atom, check to see if one of them is connected to the sequentially next atom. If so, it's not a closure. Place the bond type after the rest of the atom's closures instead of the "." which would otherwise seperate the atom from the next atom. And if there's a bond which connects the atom to the sequentially previous atom, ignore it since it's already been taken care of.

You know, this is easier to read in code, so I'll show you that.

Final Algorithm: reordered SMILES with neighbor joining using OpenBabel

I implemented the algorithm which takes advantage of neighbor joining in SMILES but RDKit couldn't handle the SMILES. It gave an error message like this:

>>> from rdkit import Chem
>>> Chem.MolFromSmiles("[CH1]1=2.[n]34.[F]5.[cH1]67.[NH1]89.[CH2]%10%11.[c]%12%13%14.
[cH1]%15%16.[C]9=9%10.[CH3]%10.[c]%17%18%19.[cH1]%166.[cH1]%12[c]4%10.[c]%195%15.[CH3]%14.
[NH1]48.[n]%11%133.[C]=314.[c]7%18[Cl].[O]3.[O]9.[CH1]2%17")
[16:44:49] 

****
Pre-condition Violation
attempt to add self-bond
Violation occurred on line 349 in file /Users/dalke/ftps/RDKit_Q22010_1/Code/GraphMol/ROMol.cpp
Failed Expression: bond_pin->getBeginAtomIdx()!=bond_pin->getEndAtomIdx()
****
In the earlier code I was able to work around the bug, but not this time. (Remember, if you are using the Q4 2010 release or newer then you won't have a problem.) Instead, I ported everything over to OpenBabel. I wanted to use pybel for this but couldn't figure out how to get the details for format_atom, so I ended up using the Python API which matches the OpenBabel C++ API. Since I'm not using RDKit now, I got rid of the work-around I used to delay returning the closure numbers to the heap.

If you've made it this far then you should be able to figure out the code by reading it, so here you go:

# Given a molecule and the list of atoms in the molecule,
# generate an equivalent SMILES string where the atoms are in
# the same order as the list.

import openbabel as ob
import heapq

def format_atom(atom):
    # A general purpose SMILES writer would check to see if the atom
    # can be written without the []s and if so, only show the symbol.
    # This version is easier, though incomplete (no isotope support)
    # The goal is to get the point across and not make the best SMILES
    smiles = '['
    if atom.IsAromatic():
        smiles += ob.etab.GetSymbol(atom.GetAtomicNum()).lower()
    else:
        smiles += ob.etab.GetSymbol(atom.GetAtomicNum())

    hcount = atom.ImplicitHydrogenCount()
    if hcount != 0:
        smiles += "H%s" % hcount
            
    charge = atom.GetFormalCharge()
    if charge != 0:
        # Represent "[Cl-]" and "[NH4+]" as "[Cl-1]" and "[NH4+1]"
        # That's a bit more cumbersome, but it's easy to code
        smiles += "%+d" % charge

    return smiles + "]"

_bond_symbols = { 1: "", 2: "=", 3: "#" }

def format_bond(bond):
    if bond.IsAromatic():
        return ""
    bond_order = bond.GetBondOrder()
    if bond_order == 1:
        # Handle the special case where the end atoms are aromatic but the
        # bond itself is not aromatic. c1ccccc1-c1ccccc1 . This preserves
        # aromatic information for some ring systems when parsed by tools
        # like OEChem which don't atomatically perceive aromaticity on input.
        if (bond.GetBeginAtom().IsAromatic() and
            bond.GetEndAtom().IsAromatic()):
            return "-"
    return _bond_symbols[bond_order]
    

# the closure symbols in the order:
#   1, 2, ..., 9, %10, %11, ... %99, 0
# (I put 0 last since most people expect to start from 1)
_closure_symbols = list("123456789") + ["%"+"%d"%i for i in range(10, 100)] + ["0"]
def format_closure(i):
    return _closure_symbols[i]

# Convert the closure numbers into a heap so I can find the smallest unused number.
# I'll still use format_closure to map that to the correct closure respresentation
_closures = range(100)
heapq.heapify(_closures)


def reordered_smiles(mol, atoms):
    # Clone the heap so I can always get the smallest closure number
    available_closures = _closures[:]
    # Map from bond.GetIdx() to the unique closure number
    closure_map = {}

    smiles = ""
    atoms = list(atoms)
    # For each atom, also get the sequentially previous and next atoms
    for prev_atom, atom, next_atom in zip([None]+atoms[:-1],
                                          atoms, 
                                          atoms[1:] + [None]):
        # Save the atom, in the correct order
        smiles += format_atom(atom)

        # Two sequential atoms are either disconnected, in which case
        # this will get a ".", or they are bonded, in which case this
        # gets the bond type.
        neighbor_bond = "."
        
        for bond in ob.OBAtomBondIter(atom):
            bond_idx = bond.GetIdx()
            other_atom = bond.GetNbrAtom(atom)
            # Handle the cases the bond connects this atom to a sequential neighbor
            if next_atom is not None and other_atom.GetIdx() == next_atom.GetIdx():
                # These don't need a dot disconnect but they do need a bond type
                # at the end of this atom in order to connect them based on order
                assert neighbor_bond == "."
                neighbor_bond = format_bond(bond)
                continue

            if prev_atom is not None and other_atom.GetIdx() == prev_atom.GetIdx():
                # These are already connected so ignore
                continue
                
            if bond_idx in closure_map:
                # Already seen, so just print the symbol and return it to the pool
                closure = closure_map.pop(bond_idx)
                # The bond type was included when the closure was opened so
                # theres' no need to specify the type again.
                smiles += format_closure(closure)

                heapq.heappush(available_closures, closure)
                
            else:
                # Need a new closure
                closure = heapq.heappop(available_closures)
                # This opens the closure so include the bond type information
                smiles += format_bond(bond)
                smiles += format_closure(closure)
                # Keep track of the closure number for this bond
                closure_map[bond_idx] = closure

        if next_atom is not None:
            smiles += neighbor_bond
    return smiles

####  Test code

import random

conv = ob.OBConversion()
conv.SetInAndOutFormats("smi", "can")
def smilin(smiles):
    mol = ob.OBMol()
    conv.ReadString(mol, smiles)
    mol.DeleteHydrogens()
    return mol


def test_symbols():
    mol = smilin("[NH4+].[Cl-].[F].[He-2]")
    s = reordered_smiles(mol, ob.OBMolAtomIter(mol))
    assert s == "[NH4+1].[Cl-1].[F].[He-2]", s

    # Make sure I can specify the order
    s = reordered_smiles(mol, list(ob.OBMolAtomIter(mol))[::-1])
    assert s == "[He-2].[F].[Cl-1].[NH4+1]", s

    # Make sure the bond types are carried over.
    # They should only be on the start of a closure, and not both sides.
    # This also checks that I'm re-using bond closure numbers correctly.
    mol = smilin("CCO.N=O.C#N")
    s = reordered_smiles(mol, ob.OBMolAtomIter(mol))
    assert s == "[CH3][CH2][OH1].[NH1]=[O].[CH1]#[N]", s

    # Mix things up a bit
    atoms = list(ob.OBMolAtomIter(mol))
    s = reordered_smiles(mol, atoms[::2] + atoms[-1::-2])
    assert s == "[CH3]1.[OH1]2.[O]=3.[N]#4.[N]4.[O]3.[OH1]2.[CH3]1"

    # Make sure the aromatic carbons stay aromatic
    mol = smilin("c1ccccc1")
    s = reordered_smiles(mol, ob.OBMolAtomIter(mol))
    assert s == "[cH1]1[cH1][cH1][cH1][cH1][cH1]1", s

    # Make sure the connection between the two rings has an explicit single bond
    mol = smilin("c1ccccc1c2ccccc2")
    s = reordered_smiles(mol, ob.OBMolAtomIter(mol))
    assert "-" in s, s

def test_large():
    mol = smilin("C" * 200)
    s = reordered_smiles(mol, ob.OBMolAtomIter(mol))
    assert s == "[CH3]" + "[CH2][CH2]" * 99 + "[CH3]", s


input_smiles = [
    "Clc1c(/C=C\\C(=O)NNC(=O)Cn2nc(cc2C)C)c(F)ccc1",
    "o1nc(nc1CCC(=O)NNC(=O)Cn1nc(cc1C)C)c1ccccc1",
    "o1nc(c(c1C)C(=O)NNC(=O)Cn1nc(cc1C)C)CC",
    "O=C(CCC(=O)NNC(=O)Cn1nc(cc1C)C)c1ccc(cc1)c1ccccc1",
    "S(c1c(C(=O)NNC(=O)Cn2nc(cc2C)C)cccc1)CC(=O)N(C)C",
    "Clc1c2OCCCOc2cc(c1)/C=C\\C(=O)NNC(=O)Cn1nc(cc1C)C",
    "S1C[CH]([NH2+]C1)(C(=O)N1[C@@H](CN(CC1)C(=O)NC(C)C)(C(=O)N[C@H]1(CCCNC1=O)))",
    "S1C[CH](NC1)(C(=O)N1[C@@H](CN(CC1)C(=O)NC(C)C)(C(=O)N[C@@H]1(CCCNC1=O)))",
    "O=C(NNC(=O)Cc1ccc(cc1)C)Cn1nc(cc1C)C",
    "s1cc(nc1C)COc1ccc(C(=O)NNC(=O)Cn2nc(cc2C)C)cc1",
    "O=C(NNC(=O)/C=C\\c1ccc(CC)cc1)Cn1nc(cc1C)C",
    "Clc1c(N2CCCC2=O)cc(cc1)C(=O)NNC(=O)Cn1nc(cc1C)C",
    "O=C(NNC(=O)Cn1nc(cc1C)C)Cc1c(n2ncnc2nc1C)C",
    "Clc1cc(ncc1)C(=O)NCC(=O)NNC(=O)Cn1nc(cc1C)C",
    "O=C1N(CCC(=O)NNC(=O)Cn2nc(cc2C)C)C(=O)c2c(C1)cccc2",
    "O1[C@H](CC(=O)NNC(=O)Cn2nc(cc2C)C)(C(=O)Nc2c1cccc2)",
    "O1[C@@H](CC(=O)NNC(=O)Cn2nc(cc2C)C)(C(=O)Nc2c1cccc2)",
    "s1c(nc2c1cccc2)CCCCC(=O)NNC(=O)Cn1nc(cc1C)C",
    "Brc1cc(C(=O)NNC(=O)Cn2nc(cc2C)C)ccc1OC",
    "Clc1cc2c(oc(cc2=O)C(=O)NNC(=O)Cn2nc(cc2C)C)cc1",
    "O=C1N(C(=O)c2c(C1)cccc2)CC(=O)NNC(=O)Cn1nc(cc1C)C",
]

# This removes standard chiral and stereochemistry from SMILES strings.
# It's very simple and will not work if stereo classes are specified.
# (If that happens then you'll get a SMILES parse error.)
def remove_stereochemistry(smiles):
    return smiles.replace("@", "").replace("/", "-").replace("\\", "-")


def test_shuffle():
    for smiles in input_smiles:
        # Reinterpret the SMILES, without chirality or stereo information
        smiles = remove_stereochemistry(smiles)
        canonical_smiles = conv.WriteString(smilin(smiles))
        print "I have", repr(smiles)
        assert "@" not in canonical_smiles, canonical_smiles
        assert "/" not in canonical_smiles, canonical_smiles
        assert "\\" not in canonical_smiles, canonical_smiles

        mol = smilin(canonical_smiles)
        atoms = list(ob.OBMolAtomIter(mol))

        # Generate some random permutations and verify matching SMILES
        for i in range(100):
            random.shuffle(atoms)
            new_smiles = reordered_smiles(mol, atoms)
            #print "new smiles:", new_smiles
            new_canonical_smiles = conv.WriteString(smilin(new_smiles))
            assert canonical_smiles == new_canonical_smiles



def test():
    test_symbols()
    test_large()
    test_shuffle()

if __name__ == "__main__":
    test()

Possibility for better SMILES awareness

There are some things you could do to make the reordered SMILES look even nicer. You could change format_atom() so that it returns the short-hand notation when possible instead of the detailed one I do. You could add support for isotopes, and chirality, and stereochemistry. You could even try taking advantage of (branches), but I'm not sure there's any advantage to that.

I'll leave those as exercises for you.

Comments or questions?

I hope this was informative and intersting. I know I've talked to several people who have used SMILES a lot who hadn't considered this trick before. I hope you have fun with it!

Feel free to ask a question or leave a comment.


Advertising: In less than two months I will be teaching programming classes for cheminformatics in Leipzig, Germany. Python for Cheminformatics is 14-15 February and Web Application Development with Django is 16-18 February. Let me know if you're interested, and remember, there's a discount if you attend both classes!


Andrew Dalke is an independent consultant focusing on software development for computational chemistry and biology. Need contract programming, help, or training? Contact me



Copyright © 2001-2013 Andrew Dalke Scientific AB