
Fragment by copy and trim #

This is part of a series of essays on how to fragment a molecular graph using RDKit. These are meant to describe the low-level steps that go into fragmentation, and the ways to think about testing. To fragment a molecule in RDKit, use FragmentOnBonds(). The essays in this series are:

Why another fragmentation method?

How do you tell if an algorithm is correct? Sometimes you can inspect the code. More often you have test cases where you know the correct answer. For complex algorithms, especially when using real-world data, testing is even harder. It's not always clear where the edge cases are, and the real world is often more complicated than you think.

Another validation solution is to write two different algorithms which accomplish the same thing, and compare them with each other. I did that in my essay about different ways to evaluate the parity of a permutation order. That cross-validation testing was good enough, because it's easy to compute all possible input orders.

In the previous essay in this series, I cross-validated that fragment_chiral() and fragment_on_bond() give the same results. They did. That isn't surprising because they implement essentially the same algorithm. Not only that, but I looked at the FragmentOnBonds() implementation when I found out my first fragment_chiral() didn't give the same answers. (I didn't know I needed to ClearComputedProps() after editing the structure.)

Cross-validation works better when the two algorithms are less similar. The way you think about a problem and implement a solution are connected. Two similar solutions may be the result of thinking about things the same way, and come with the same blind spots. (This could also be a design goal, if you want the new implementation to be "bug compatible" with the old.)

I've made enough hints that you likely know what I'm leading up to. RDKit's FragmentOnBonds() code currently (in mid-2016) converts directional bonds (the "/" and "\" bonds used for E-Z stereochemistry around double bonds) into non-directional single bonds.

>>> from rdkit import Chem
>>> mol = Chem.MolFromSmiles("F/C=C/F")
>>> fragmented_mol = Chem.FragmentOnBonds(mol, [0], dummyLabels=[(0,0)])
>>> Chem.MolToSmiles(fragmented_mol, isomericSmiles=True)
'[*]C=CF.[*]F'
when I expected the last line to be
'[*]/C=C/F.[*]F'

This may or may not be what you want. Just in case, I've submitted a patch to discuss changing it in RDKit, so your experience in the future might be different than mine in the past.

I bring it up now to show how cross-validation of similar algorithms isn't always enough to figure out if the algorithms do as you expect.

Validate through re-assembly

As a reminder, in an earlier essay I did a much stronger validation test. I took the fragments, reassembled them via SMILES syntax manipulation, and compared the canonicalized result to the canonicalized input structure. These should always match, to within limits of the underlying chemistry support.

They didn't, because my original code didn't support directional bonds. Instead, in that essay I pre-processed the input SMILES string to replace the "/" and "\" characters with a "-". That gives the equivalent chemistry without directional bonds.
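In that essay the re-assembly was done by manipulating the SMILES strings. As a rough illustration of the same idea at the molecule level, here's a sketch which finds the two wildcard atoms in a fragmented molecule, bonds their neighbors together, and deletes the wildcards. The function name reassemble() is mine, it assumes a single cut that produced a single bond, and it ignores the chirality bookkeeping that the rest of this series is about, so don't treat it as a full validator.

from rdkit import Chem

def reassemble(fragmented_mol):
    # Join the two "*" attachment points back into a single bond.
    rwmol = Chem.RWMol(fragmented_mol)
    wildcard_ids = [atom.GetIdx() for atom in rwmol.GetAtoms()
                        if atom.GetAtomicNum() == 0]
    # Each wildcard is bonded to exactly one atom; bond those two atoms together.
    neighbor_ids = [rwmol.GetAtomWithIdx(atom_id).GetNeighbors()[0].GetIdx()
                        for atom_id in wildcard_ids]
    rwmol.AddBond(neighbor_ids[0], neighbor_ids[1], Chem.BondType.SINGLE)
    # Delete the wildcards, from highest to lowest atom index.
    for atom_id in sorted(wildcard_ids, reverse=True):
        rwmol.RemoveAtom(atom_id)
    new_mol = rwmol.GetMol()
    new_mol.ClearComputedProps()
    return new_mol

The re-assembled molecule can then be canonicalized with Chem.MolToSmiles() and compared to the canonicalized input, keeping in mind the chirality caveat above.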

I could use that validation technique again, but I want to explore more of what it means to cross-validate by using a different fragmentation implementation.

Fragment by 'copy and trim'

Dave Cosgrove, on the RDKit mailing list, described the solution he uses to cut a bond and preserve chirality in toolkits which don't provide a high-level equivalent to FragmentOnBonds(). This method only works on non-ring bonds, which isn't a problem for me as I want to fragment R-groups. (As I discovered while doing the final testing, my implementation of the method also assumes there is only one molecule in the structure.)

To make a fragment from a molecule, trim away all the atoms which aren't part of the fragment, except for the atom connected to the fragment. Convert that atom into the wildcard atom "*" by setting its atomic number, charge, and number of hydrogens to 0, and setting the chirality and aromaticity flags correctly. The image below shows the steps applied to cutting the ester off of aspirin:

steps to trim a fragment of aspirin

This method never touches the bonds connected to a chiral atom of a fragment, so chirality or other bond properties (like bond direction!) aren't accidentally removed by breaking the bond and making a new one in its place.

A downside is that I need to copy and trim twice in order to get both fragments after cutting a chain bond. I can predict now that the performance won't be as fast as FragmentOnBonds().

Let's try it out.

Trim using the interactive shell

I'll use aspirin as my reference structure and cut the bond to the ester (the R-OCOCH3), which is the bond between atoms 2 and 3.

aspirin with atoms labeled by index
I want the result to be the same as using FragmentOnBonds() to cut that bond:
>>> from rdkit import Chem
>>> aspirin = Chem.MolFromSmiles("O=C(Oc1ccccc1C(=O)O)C")
>>> bond_idx = aspirin.GetBondBetweenAtoms(2, 3).GetIdx()
>>> bond_idx
2
>>> fragmented_mol = Chem.FragmentOnBonds(aspirin, [bond_idx], dummyLabels=[(0,0)])
>>> Chem.MolToSmiles(fragmented_mol, isomericSmiles=True)
'[*]OC(C)=O.[*]c1ccccc1C(=O)O'

I'll need a way to find all of the atoms which are on the ester side of the bond. I don't think there's a toolkit function to get all the atoms on one side of a bond, so I'll write one myself.

Atom 2 is the connection point on the ester to the rest of the aspirin structure. I'll start with that atom, show that it's an oxygen, and get information about its bonds:

>>> atom = aspirin.GetAtomWithIdx(2)
>>> atom.GetSymbol()
'O'
>>> [bond.GetBondType() for bond in atom.GetBonds()]
[rdkit.Chem.rdchem.BondType.SINGLE, rdkit.Chem.rdchem.BondType.SINGLE]
I'll use the bond object to get to the atom on the other side of the bond from atom 2, and report more detailed information about the atom at the other end of each bond:
>>> for bond in atom.GetBonds():
...   other_atom = bond.GetOtherAtom(atom)
...   print(other_atom.GetIdx(), bond.GetBondType(), other_atom.GetSymbol(), other_atom.GetIsAromatic())
... 
1 SINGLE C False
3 SINGLE C True
This says that atom 1 is an aliphatic carbon, and atom 3 is an aromatic carbon. This matches the structure diagram I used earlier.

I know that atom 3 is the other end of the bond which was cut, so I can stop there. What about atom 1? What is it connected to?

>>> next_atom = aspirin.GetAtomWithIdx(1)
>>> [b.GetOtherAtom(next_atom).GetIdx() for b in next_atom.GetBonds()]
[0, 2, 12]
I've already processed atom 2, so only atoms 0 and 12 are new.

What are those atoms connected to?

>>> next_atom = aspirin.GetAtomWithIdx(0)
>>> [b.GetOtherAtom(next_atom).GetIdx() for b in next_atom.GetBonds()]
[1]
>>> next_atom = aspirin.GetAtomWithIdx(12)
>>> [b.GetOtherAtom(next_atom).GetIdx() for b in next_atom.GetBonds()]
[1]
Only atom 1, which I've already seen, so I've found all of the atoms on the ester side of the bond.

Semi-automated depth-first search

There are more atoms on the other side of the bond, so I'll have to automate it somewhat. This sort of graph search is often implemented as a depth-first search (DFS) or a breadth-first search (BFS). Python lists easily work as stacks, which makes DFS slightly more natural to implement.
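For comparison, a breadth-first version differs only in using a queue instead of a stack. Here's a minimal sketch using collections.deque; the name bfs_side_atom_ids is mine (it isn't used later in this essay), and it should find the same atoms as the depth-first code developed below, just in a different order.

from collections import deque

def bfs_side_atom_ids(mol, start_atom_id, ignore_atom_id):
    # Breadth-first search for the atom ids reachable from 'start_atom_id'
    # without crossing over to 'ignore_atom_id'.
    seen_ids = {start_atom_id, ignore_atom_id}
    atom_ids = []
    queue = deque([mol.GetAtomWithIdx(start_atom_id)])
    while queue:
        atom = queue.popleft()  # popleft() makes this breadth-first
        for bond in atom.GetBonds():
            other_atom = bond.GetOtherAtom(atom)
            other_id = other_atom.GetIdx()
            if other_id not in seen_ids:
                seen_ids.add(other_id)
                atom_ids.append(other_id)
                queue.append(other_atom)
    return atom_ids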

To give you an idea of what I'm about to explain, here's an animation of the aspirin depth-first search:

depth-first search of an aspirin fragment

If there is the chance of a ring (called a "cycle" in graph theory), then there are two ways to get to the same atom. To prevent processing the same atom multiple times, I'll set up a set named "seen_ids", which contains the atom indices I've seen before. Since it's the start, I've only seen the two atoms which are at the ends of the bond to cut.

>>> seen_ids = set([2, 3])
I also store a list of the atoms which are on this side of the bond, which at this point is only atom 3, and a stack (a Python list) of the atoms I need to process further:
>>> atom_ids = [3]
>>> stack = [aspirin.GetAtomWithIdx(3)]
I'll start by getting the top of the stack (the last element of the list) and looking at its neighbors:
>>> atom = stack.pop()
>>> neighbor_ids = [b.GetOtherAtom(atom).GetIdx() for b in atom.GetBonds()]
>>> neighbor_ids
[2, 4, 8]
I'll need to filter out the neighbors I've already seen. I'll write a helper function for that. I'll need both the atom objects and the atom ids, so this will return two lists, one for each:
def get_atoms_to_visit(atom, seen_ids):
    neighbor_ids = []
    neighbor_atoms = []
    for bond in atom.GetBonds():
        neighbor_atom = bond.GetOtherAtom(atom)
        neighbor_id = neighbor_atom.GetIdx()
        if neighbor_id not in seen_ids:
            neighbor_ids.append(neighbor_id) 
            neighbor_atoms.append(neighbor_atom)
    
    return neighbor_ids, neighbor_atoms
and use it:
>>> atom_ids_to_visit, atoms_to_visit = get_atoms_to_visit(atom, seen_ids)
>>> atom_ids_to_visit
[4, 8]
>>> [a.GetSymbol() for a in atoms_to_visit]
['C', 'C']
These atom ids are connected to the original atom, so add them all to atom_ids.
>>> atom_ids.extend(atom_ids_to_visit) 
>>> atom_ids
[3, 4, 8]
These haven't been seen before, so add the atoms (not the atom ids) to the stack of items to process:
>>> stack.extend(atoms_to_visit)
>>> stack
[<rdkit.Chem.rdchem.Atom object at 0x111bf37b0>, <rdkit.Chem.rdchem.Atom object at 0x111bf3760>]
>>> [a.GetIdx() for a in stack]
[4, 8]
Now that they've been seen, I don't need to process them again, so add the new ids to the set of seen ids:
>>> seen_ids.update(atom_ids_to_visit)
>>> seen_ids
{8, 2, 3, 4}

That's the basic loop. The stack isn't empty, so pop the top of the stack to get a new atom object, and repeat the earlier steps:

>>> atom = stack.pop()
>>> atom.GetIdx()
8
>>> atom_ids_to_visit, atoms_to_visit = get_atoms_to_visit(atom, seen_ids)
>>> atom_ids_to_visit
[7, 9]
>>> atom_ids.extend(atom_ids_to_visit) 
>>> atom_ids 
[3, 4, 8, 7, 9]
>>> stack.extend(atoms_to_visit)
>>> [a.GetIdx() for a in stack]
[4, 7, 9]
>>> seen_ids.update(atom_ids_to_visit)
>>> seen_ids
{2, 3, 4, 7, 8, 9}
Then another loop. I'll stick it in a while loop to process the stack until it's empty, and I'll only have it print the new atom ids added to the list:
>>> while stack:
...     atom = stack.pop()
...     atom_ids_to_visit, atoms_to_visit = get_atoms_to_visit(atom, seen_ids)
...     atom_ids.extend(atom_ids_to_visit)
...     stack.extend(atoms_to_visit)
...     seen_ids.update(atom_ids_to_visit)
...     print("added atoms", atom_ids_to_visit)
... 
added atoms [10, 11]
added atoms []
added atoms []
added atoms [6]
added atoms [5]
added atoms []
added atoms []
The final set of atoms is:
>>> atom_ids
[3, 4, 8, 7, 9, 10, 11, 6, 5]

Fully automated - fragment_trim()

In this section I'll put all the pieces together to make fragment_trim(), a function which implements the trim algorithm.

Find atoms to delete

The trim algorithm needs to know which atoms to delete. That will be all of the atoms on one side of the bond, except for the atom which is at the end of the bond. (I'll turn that atom into a "*" atom.) I'll use the above code to get the atom ids to delete:

# Look for neighbor atoms of 'atom' which haven't been seen before  
def get_atoms_to_visit(atom, seen_ids):
    neighbor_ids = []
    neighbor_atoms = []
    for bond in atom.GetBonds():
        neighbor_atom = bond.GetOtherAtom(atom)
        neighbor_id = neighbor_atom.GetIdx()
        if neighbor_id not in seen_ids:
            neighbor_ids.append(neighbor_id) 
            neighbor_atoms.append(neighbor_atom)
    
    return neighbor_ids, neighbor_atoms

# Find all of the atoms connected to 'start_atom_id', except for 'ignore_atom_id'
def find_atom_ids_to_delete(mol, start_atom_id, ignore_atom_id):
    seen_ids = {start_atom_id, ignore_atom_id}
    atom_ids = [] # Will only delete the newly found atoms, not the start atom
    stack = [mol.GetAtomWithIdx(start_atom_id)]
    
    # Use a depth-first search to find the connected atoms
    while stack:
        atom = stack.pop()
        atom_ids_to_visit, atoms_to_visit = get_atoms_to_visit(atom, seen_ids)
        atom_ids.extend(atom_ids_to_visit)
        stack.extend(atoms_to_visit)
        seen_ids.update(atom_ids_to_visit)
    
    return atom_ids
and try it out:
>>> from rdkit import Chem
>>> aspirin = Chem.MolFromSmiles("O=C(Oc1ccccc1C(=O)O)C")
>>> find_atom_ids_to_delete(aspirin, 3, 2)
[4, 8, 7, 9, 10, 11, 6, 5]
>>> find_atom_ids_to_delete(aspirin, 2, 3)
[1, 0, 12]
That looks right. (I left out the other testing I did to get this right.)

Trim atoms

The trim function has two parts. The first is to turn the attachment point into the wildcard atom "*", which I'll do by removing any charges, hydrogens, chirality, or other atom properties. This atom will not be deleted. The second is to remove the other atoms on that side of the bond.

Removing atoms from a molecule is slightly tricky. The atom indices are reset after each atom is removed. (While I often use a variable name like "atom_id", the atom id is not a permanent id, but simply the current atom index.) When atom "i" is deleted, all of the atoms with a larger id "j" get the new id "j-1". That is, the larger ids are all shifted down one to fill in the gap.

This becomes a problem because I have a list of ids to delete, but as I delete them the real ids change. For example, if I delete atoms 0 and 1 from a two atom molecule, in that order, I will get an exception:

>>> rwmol = Chem.RWMol(Chem.MolFromSmiles("C#N"))
>>> rwmol.RemoveAtom(0)
>>> rwmol.RemoveAtom(1)
[13:54:02] 

****
Range Error
idx
Violation occurred on line 155 in file /Users/dalke/ftps/rdkit-Release_2016_03_1/Code/GraphMol/ROMol.cpp
Failed Expression: 1 <= 0
****

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: Range Error
	idx
	Violation occurred on line 155 in file Code/GraphMol/ROMol.cpp
	Failed Expression: 1 <= 0
	RDKIT: 2016.03.1
	BOOST: 1_60
This is because after RemoveAtom(0) the old atom with id 1 gets the id 0. As another example, if I remove atoms 0 and 1 from a three atom molecule, I'll end up with what was originally the second atom:
>>> rwmol = Chem.RWMol(Chem.MolFromSmiles("CON"))
>>> rwmol.RemoveAtom(0)
>>> rwmol.RemoveAtom(1)
>>> Chem.MolToSmiles(rwmol)
'O'
Originally the ids were [C=0, O=1, N=2]. The RemoveAtom(0) removed the C and renumbered the remaining atoms, giving [O=0, N=1]. The RemoveAtom(1) then removed the N, leaving [O=0] as the remaining atom.

While you could pay attention to the renumbering and adjust the index of the atom to delete, the far easier solution is to sort the atom ids, and delete starting from the largest id.

Here is the function to turn the attachment atom into a wildcard and to delete the other atoms:

def trim_atoms(mol, wildcard_atom_id, atom_ids_to_delete):
    rwmol = Chem.RWMol(mol)
    # Change the attachment atom into a wildcard ("*") atom
    atom = rwmol.GetAtomWithIdx(wildcard_atom_id)
    atom.SetAtomicNum(0)
    atom.SetIsotope(0)
    atom.SetFormalCharge(0)
    atom.SetIsAromatic(False)
    atom.SetNumExplicitHs(0)
    
    # Remove the other atoms.
    # RDKit will renumber after every RemoveAtom() so
    # remove from highest to lowest atom index.
    # If not sorted, may get a "RuntimeError: Range Error"
    for atom_id in sorted(atom_ids_to_delete, reverse=True):
        rwmol.RemoveAtom(atom_id)
    
    # Don't need to ClearComputedProps() because that will
    # be done by CombineMols()
    return rwmol.GetMol()
(If this were a more general-purpose function I would need to call ClearComputedProps(), which is needed after any molecular editing in RDKit. I don't need to do this here because it will be done during CombineMols(), which is coming up.)

How did I figure out which properties needed to be changed to make a wildcard atom? I tracked them down during cross-validation tests of an earlier version of this algorithm.

I'll try out the new code:

>>> fragment1 = trim_atoms(aspirin, 2, [1, 0, 12])
>>> Chem.MolToSmiles(fragment1, isomericSmiles=True)
'[*]c1ccccc1C(=O)O'
>>> fragment2 = trim_atoms(aspirin, 3, [4, 8, 7, 9, 10, 11, 6, 5])
>>> Chem.MolToSmiles(fragment2, isomericSmiles=True)
'[*]OC(C)=O'

fragment_trim()

The high-level fragment code is straight-forward, and I don't think it requires explanation:

def fragment_trim(mol, atom1, atom2):
    # Find the fragments on either side of a bond
    delete_atom_ids1 = find_atom_ids_to_delete(mol, atom1, atom2)
    fragment1 = trim_atoms(mol, atom1, delete_atom_ids1)
    
    delete_atom_ids2 = find_atom_ids_to_delete(mol, atom2, atom1)
    fragment2 = trim_atoms(mol, atom2, delete_atom_ids2)
    
    # Merge the two fragments.
    # (CombineMols() does the required ClearComputedProps())
    return Chem.CombineMols(fragment1, fragment2)

I'll check that it does what I think it should do:

>>> new_mol = fragment_trim(aspirin, 2, 3)
>>> Chem.MolToSmiles(new_mol)
'[*]OC(C)=O.[*]c1ccccc1C(=O)O'
This matches the FragmentOnBonds() output at the start of this essay.

Testing

I used the same cross-validation method I did in my previous essay. I parsed structures from ChEMBL and checked that fragment_trim() produces the same results as FragmentOnBonds(). It doesn't. The first failure mode surprised me:

fragment_trim() only works with a single connected structure

Here's one of the failure reports:

Mismatch: record: 61 id: CHEMBL1203109 smiles: Cl.CNC(=O)c1cc(C(O)CNC(C)CCc2ccc3c(c2)OCO3)ccc1O cut: 1 2
   smiles_trim: Cl.Cl.[*]C.[*]NC(=O)c1cc(C(O)CNC(C)CCc2ccc3c(c2)OCO3)ccc1O
smiles_on_bond: Cl.[*]C.[*]NC(=O)c1cc(C(O)CNC(C)CCc2ccc3c(c2)OCO3)ccc1O
Somehow the "Cl" appears twice. After thinking about it I realized that my implementation of the copy and trim algorithm assumes that the input structure is connected, that is, that it contains only a single molecule. Each copy contains a chlorine atom, so when I CombineMols() I end up with two chlorine atoms.

I could fix my implementation to handle this, but as I'm only interested in cross-validation, the easier solution is to never process a structure with multiple molecules. I decided that if the input SMILES contains multiple molecules (that is, if the SMILES contains the dot disconnection symbol ".") then I would choose the component which has the most characters, using the following:

if "." in smiles:
    # The fragment_trim() method only works on a connected molecule.
    # Arbitrarily pick the component with the longest SMILES string.
    smiles = max(smiles.split("."), key=len)
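That snippet works at the SMILES string level. If I instead wanted to pick the largest component at the molecule level, one rough sketch (not what I used for the testing here) is to let RDKit split the structure into components with Chem.GetMolFrags() and keep the one with the most atoms:

def largest_component(mol):
    # Split into connected components, each returned as its own molecule,
    # and keep the component with the most atoms.
    components = Chem.GetMolFrags(mol, asMols=True)
    return max(components, key=lambda component: component.GetNumAtoms())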

Directional bonds

After 1000 records and 31450 tests there were 362 mismatches. All of them appeared to be related to directional bonds, like this mismatch report:

Mismatch: record: 124 id: CHEMBL445177 smiles: CCCCCC/C=C\CCCCCCCc1cccc(O)c1C(=O)O cut: 7 8
   smiles_trim: [*]/C=C\CCCCCC.[*]CCCCCCCc1cccc(O)c1C(=O)O
smiles_on_bond: [*]C=CCCCCCC.[*]CCCCCCCc1cccc(O)c1C(=O)O
I knew this would happen coming into this essay, but it's still nice to see. It demonstrates that fragment_trim() is a better way to cross-validate FragmentOnBonds() than my earlier fragment_chiral() algorithm.

It's possible that other mismatches are hidden in the hundreds of reports, so I removed directional bonds from the SMILES and processed again:

if "/" in smiles:
    smiles = smiles.replace("/", "")
if "\\" in smiles:
    smiles = smiles.replace("\\", "")
That testing showed I forgot to include "atom.SetIsotope(0)" when I converted the attachment atom into a wildcard atom. Without it, the algorithm turned a "[18F]" into a "[18*]". It surprised me because I copied the conversion from working code I wrote for an earlier version of the algorithm. Looking into it, I realized I left that line out during the copy&paste. This is why regression tests using diverse real-world data are important.
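Here's a small sketch of that failure mode, using a made-up [18F] example rather than the ChEMBL record where I found it. Without the SetIsotope(0), the isotope stays behind when the atomic number is set to 0, so the output SMILES contains an "[18*]" term instead of "[*]":

from rdkit import Chem

rwmol = Chem.RWMol(Chem.MolFromSmiles("[18F]CO"))
atom = rwmol.GetAtomWithIdx(0)
atom.SetAtomicNum(0)   # turn the [18F] into a wildcard atom ...
#atom.SetIsotope(0)    # ... but forget to clear the isotope
new_mol = rwmol.GetMol()
new_mol.ClearComputedProps()
print(Chem.MolToSmiles(new_mol, isomericSmiles=True))  # contains "[18*]", not "[*]"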

I stopped the testing after 100,000 records. Here is the final status line:
Processed 100000 records, 100000 molecules, 1120618 tests, 0 mismatches.  T_trim: 2846.79 T_on_bond 258.18
While the trim code is slower than using the built-in function, the point of this exercise isn't to figure out the fastest implementation but to have a better way to cross-validate the original code, which it has done.

The final code

Here is the final code:

from __future__ import print_function

import datetime
import time

from rdkit import Chem

# Look for neighbor atoms of 'atom' which haven't been seen before  
def get_atoms_to_visit(atom, seen_ids):
    neighbor_ids = []
    neighbor_atoms = []
    for bond in atom.GetBonds():
        neighbor_atom = bond.GetOtherAtom(atom)
        neighbor_id = neighbor_atom.GetIdx()
        if neighbor_id not in seen_ids:
            neighbor_ids.append(neighbor_id) 
            neighbor_atoms.append(neighbor_atom)
    
    return neighbor_ids, neighbor_atoms

# Find all of the atoms connected to 'start_atom_id', except for 'ignore_atom_id'
def find_atom_ids_to_delete(mol, start_atom_id, ignore_atom_id):
    seen_ids = {ignore_atom_id, start_atom_id}
    atom_ids = [] # Will only delete the newly found atoms, not the start atom
    stack = [mol.GetAtomWithIdx(start_atom_id)]

    # Use a depth-first search to find the connected atoms
    while stack:
        atom = stack.pop()
        atom_ids_to_visit, atoms_to_visit = get_atoms_to_visit(atom, seen_ids)
        atom_ids.extend(atom_ids_to_visit)
        stack.extend(atoms_to_visit)
        seen_ids.update(atom_ids_to_visit)
    
    return atom_ids

def trim_atoms(mol, wildcard_atom_id, atom_ids_to_delete):
    rwmol = Chem.RWMol(mol)
    # Change the attachment atom into a wildcard ("*") atom
    atom = rwmol.GetAtomWithIdx(wildcard_atom_id)
    atom.SetAtomicNum(0)
    atom.SetIsotope(0)
    atom.SetFormalCharge(0)
    atom.SetIsAromatic(False)
    atom.SetNumExplicitHs(0)
    
    # Remove the other atoms.
    # RDKit will renumber after every RemoveAtom() so
    # remove from highest to lowest atom index.
    # If not sorted, may get a "RuntimeError: Range Error"
    for atom_id in sorted(atom_ids_to_delete, reverse=True):
        rwmol.RemoveAtom(atom_id)
    
    # Don't need to ClearComputedProps() because that will
    # be done by CombineMols()
    return rwmol.GetMol()
    
def fragment_trim(mol, atom1, atom2):
    # Find the fragments on either side of a bond
    delete_atom_ids1 = find_atom_ids_to_delete(mol, atom1, atom2)
    fragment1 = trim_atoms(mol, atom1, delete_atom_ids1)
    
    delete_atom_ids2 = find_atom_ids_to_delete(mol, atom2, atom1)
    fragment2 = trim_atoms(mol, atom2, delete_atom_ids2)

    # Merge the two fragments.
    # (CombineMols() does the required ClearComputedProps())
    return Chem.CombineMols(fragment1, fragment2)

####### Cross-validation code

def fragment_on_bond(mol, atom1, atom2):
    bond = mol.GetBondBetweenAtoms(atom1, atom2)
    return Chem.FragmentOnBonds(mol, [bond.GetIdx()], dummyLabels=[(0, 0)])


def read_records(filename):
    with open(filename) as infile:
        for recno, line in enumerate(infile):
            smiles, id = line.split()
            if "*" in smiles:
                continue
            yield recno, id, smiles
            
def cross_validate():
    # Match a single bond not in a ring
    BOND_SMARTS = "[!#0;!#1]-!@[!#0;!#1]"
    single_bond_pat = Chem.MolFromSmarts(BOND_SMARTS)
    
    num_mols = num_tests = num_mismatches = 0
    time_trim = time_on_bond = 0.0

    # Helper function to print status information. Since
    # the function is defined inside of another function,
    # it has access to variables in the outer function.
    def print_status():
        print("Processed {} records, {} molecules, {} tests, {} mismatches."
              "  T_trim: {:.2f} T_on_bond {:.2f}"
              .format(recno, num_mols, num_tests, num_mismatches,
                      time_trim, time_on_bond))
    
    filename = "/Users/dalke/databases/chembl_20_rdkit.smi"
    start_time = datetime.datetime.now()
    for recno, id, smiles in read_records(filename):
        if recno % 100 == 0:
            print_status()

        if "." in smiles:
            # The fragment_trim() method only works on a connected molecule.
            # Arbitrarily pick the component with the longest SMILES string.
            smiles = max(smiles.split("."), key=len)
            
        ## Remove directional bonds to see if there are mismatches
        ## which are not caused by directional bonds.
        #if "/" in smiles:
        #    smiles = smiles.replace("/", "")
        #if "\\" in smiles:
        #    smiles = smiles.replace("\\", "")
        
        mol = Chem.MolFromSmiles(smiles)
        if mol is None:
            continue
        num_mols += 1

        # Find all of the single bonds
        matches = mol.GetSubstructMatches(single_bond_pat)

        for a1, a2 in matches:
            # For each pair of atom indices, split on that bond
            # and determine the canonicalized fragmented molecule.
            # Do it for both fragment_trim() ...
            t1 = time.time()
            mol_trim = fragment_trim(mol, a1, a2)
            t2 = time.time()
            time_trim += t2-t1
            smiles_trim = Chem.MolToSmiles(mol_trim, isomericSmiles=True)

            # ... and for fragment_on_bond()
            t1 = time.time()
            mol_on_bond = fragment_on_bond(mol, a1, a2)
            t2 = time.time()
            time_on_bond += t2-t1
            smiles_on_bond = Chem.MolToSmiles(mol_on_bond, isomericSmiles=True)

            # Report any mismatches and keep on going.
            num_tests += 1
            if smiles_trim != smiles_on_bond:
                print("Mismatch: record: {} id: {} smiles: {} cut: {} {}".format(
                      recno, id, smiles, a1, a2))
                print("   smiles_trim:", smiles_trim)
                print("smiles_on_bond:", smiles_on_bond)
                num_mismatches += 1
    
    end_time = datetime.datetime.now()
    elapsed_time = str(end_time - start_time)
    elapsed_time = elapsed_time.partition(".")[0]  # don't display subseconds
    
    print("Done.")
    print_status()
    print("elapsed time:", elapsed_time)

if __name__ == "__main__":
    cross_validate()


Software mentioned in ChEMBL SD records #

When I work with real-world data sets, I usually start with PubChem before going on to ChEMBL. Why? The PubChem data is all generated by one chemistry toolkit - I'm pretty sure it's OEChem - while ChEMBL data comes from many sources.

To get a sense of the diversity, I processed the ChEMBL 21 release to get the second line of each record. The ctfile format specification says that if the line is non-blank then the 8 characters starting in the third position should contain the program name. The documentation includes a description of the fields:

IIPPPPPPPPMMDDYYHHmmddSS…
where "II" is the user's initials, "PPPPPPPP" is the program name, and "MMDDYYHHmm" is the date and time. Here are a few examples:
  SciTegic12231509382D
  Mrv0541 05211314572D          
          08180812482D 1   1.00000     0.00000     0
  Symyx   11300911332D 1   1.00000     0.00000     0
The third program wishes to remain anonymous.

Extract program names from the ChEMBL 21 SDF

I'll write a program to extract those program names and count how many times each one occurs. I don't need a general-purpose SD file reader because ChEMBL uses a subset of the SD format. For example, there are only two lines in each record which start with "CHEMBL", the title line (the first line of the record) and the data line after the "chembl_id" tag.

My code reads line by line through the file. The first "CHEMBL" line it sees is the title line, so the following line (the second line of the record) contains the data. Then when it sees "> <chembl_id>" it knows to skip the following line, which will have a CHEMBL on it.

There are two oddities here. First, the gzip reader returns byte strings. I decided to do the pattern matching on the byte strings to ignore the overhead of converting everything to Unicode when I only need a few characters from the file. Second, a Python file object is its own iterator, so I can use "infile" in both the for-loop and in the body itself.

import sys, gzip
from collections import Counter
counter = Counter()
with gzip.open("/Users/dalke/databases/chembl_21.sdf.gz", "rb") as infile:
  for line in infile:
      # Get the line after the title line (which starts with 'CHEMBL')
      if line[:6] == b"CHEMBL":
          next_line = next(infile)
          # Print unique program names
          program = next_line[2:10]
          if program not in counter:
              print("New:", repr(str(program, "ascii")))
          counter[program] += 1
      
      if line[:13] == b"> <chembl_id>":
          ignore = next(infile) # skip the CHEMBL
  
  print("Done. Here are the counts for each program seen.")
  for name, count in counter.most_common():
    print(repr(str(name, "ascii")), count)

Program counts

Here is the table of counts:
Program name    count
'SciTegic'      1105617
' '             326445
'CDK 9'         69145
'Mrv0541 '      30962
''              24531
'CDK 6'         10805
'CDK 1'         8642
'CDK 5'         4771
'CDK 8'         2209
'-ISIS- '       281
'Symyx '        171
'CDK 7'         144
'-OEChem-'      96
'Marvin '       61
'Mrv0540 '      13
' RDKit'        3
'CDK 3'         1
The number of different CDK lines is not because of a version number but because CDK doesn't format the line correctly. The specification states that the first few fields of a non-blank line are supposed to be:

IIPPPPPPPPMMDDYYHHmmddSS…
while the CDK lines use:
  CDK    7/28/10,10:58
  CDK    8/10/10,12:22
  CDK    9/16/09,9:40
  CDK    10/7/09,10:42
  CDK    11/9/10,11:20
where the date starts one byte too early. The date also isn't in the specified format.
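Slicing one of those CDK lines the way my counting program does shows where the extra table entries come from - the first digit of the date lands inside the 8-character program name field:

>>> line = "  CDK    7/28/10,10:58"
>>> line[2:10]
'CDK    7'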

Further examination shows these were only generated in 2009 and 2010. The current CDK implementation is correct, and a 'git annotate' suggests it was fixed in 2010.

I don't think anyone uses that line for anything, and I don't see the point of changing anything, so I don't think it's worthwhile to even ask ChEMBL to change these old records.



Use FragmentOnBonds to fragment a molecule in RDKit #

This is part of a continuing series on how to fragment a molecular graph. So far I have covered:

Those were mostly pedagogical. They describe the low-level details of how to cut a bond to fragment a molecule into two parts, and still maintain correct chirality.

If you are writing production code, you should almost certainly use the built-in RDKit function FragmentOnBonds() to fragment a molecule, and not use any of the code from those essays.

FragmentOnBonds

FragmentOnBonds() will fragment a molecule given a list of bond identifiers. In default use it attaches a new "dummy" atom (what I've been calling a wildcard atom) to the atoms at the end of each broken bond. It will also set the isotope label of each newly created dummy atom to the index of the atom it replaces, that is, the atom at the other end of the broken bond.

>>> from rdkit import Chem
>>> vanillin = Chem.MolFromSmiles("c1(C=O)cc(OC)c(O)cc1")
>>> bond = vanillin.GetBondBetweenAtoms(4, 5)
>>> bond
<rdkit.Chem.rdchem.Bond object at 0x102bec8a0>
>>> bond.GetIdx()
4
>>> new_mol = Chem.FragmentOnBonds(vanillin, [bond.GetIdx()])
>>> Chem.MolToSmiles(new_mol, isomericSmiles=True)
'[4*]OC.[5*]c1cc(C=O)ccc1O'
I don't want a label, so I'll specify 0 for the atom labels. (An unlabeled atom has an isotope of zero.)
>>> new_mol = Chem.FragmentOnBonds(vanillin, [bond.GetIdx()], dummyLabels=[(0, 0)])
>>> Chem.MolToSmiles(new_mol, isomericSmiles=True)
'[*]OC.[*]c1cc(C=O)ccc1O'

I'll turn this into a function that I can use for testing:

def fragment_on_bond(mol, a1, a2):
    bond = mol.GetBondBetweenAtoms(a1, a2)
    return Chem.FragmentOnBonds(mol, [bond.GetIdx()], addDummies=True, dummyLabels=[(0, 0)])
That's certainly much easier than fragment_chiral()!

FragmentOnBonds implementation

The RDKit source code is available, so we can look at how this built-in function is implemented. A search for "FragmentOnBonds":

Code/GraphMol/Wrap/MolOps.cpp
1631:      python::def("FragmentOnBonds", fragmentOnBondsHelper,
plus some investigation shows that the actual C++ code has the name "fragmentOnBonds". The RDKit convention is to use an initial uppercase letter for Python functions, and an initial lowercase letter for C++ functions.

A search this time for "fragmentOnBonds" has a few more hits, including this very suggestive one:

Code/GraphMol/ChemTransforms/MolFragmenter.cpp
320:        ROMol *nm=fragmentOnBonds(mol,fragmentHere,addDummies,dummyLabelsHere,bondTypesHere,lCutsPerAtom);
329:    ROMol *fragmentOnBonds(const ROMol &mol,const std::vector<unsigned int> &bondIndices,
That second result is the start of the fragmentOnBonds() function definition. Here's the key part, with commentary.

The function can fragment multiple bonds. It iterates through the bonds in order. For each one it gets the bond type and the begin and end atoms, and if requested it stores information about the number of cuts per atom.

  for (unsigned int i = 0; i < bondsToRemove.size(); ++i) {
    const Bond *bond = bondsToRemove[i];
    unsigned int bidx = bond->getBeginAtomIdx();
    unsigned int eidx = bond->getEndAtomIdx();
    Bond::BondType bT = bond->getBondType();
    res->removeBond(bidx, eidx);
    if (nCutsPerAtom) {
      (*nCutsPerAtom)[bidx] += 1;
      (*nCutsPerAtom)[eidx] += 1;
    }
FragmentOnBonds() by default will add "dummy" atoms (atoms with atomic number 0), but this can be disabled. If enabled, it will create the two atoms with atomic number 0. By default it will set the isotope of each dummy to be the index of the atom it replaces, but the labels can also be specified with the dummyLabels parameter:
    if (addDummies) {
      Atom *at1, *at2;
      at1 = new Atom(0);
      at2 = new Atom(0);
      if (dummyLabels) {
        at1->setIsotope((*dummyLabels)[i].first);
        at2->setIsotope((*dummyLabels)[i].second);
      } else {
        at1->setIsotope(bidx);
        at2->setIsotope(eidx);
      }
Next, make the bond from the old terminal atoms to the new wildcard atoms. By default the bond type for the new bonds is the same as the bond which was broken (determined earlier). Otherwise, there's an option to specify an alternate bond type.
      unsigned int idx1 = res->addAtom(at1, false, true);
      if (bondTypes) bT = (*bondTypes)[i];
      res->addBond(eidx, at1->getIdx(), bT);
      unsigned int idx2 = res->addAtom(at2, false, true);
      res->addBond(bidx, at2->getIdx(), bT);

The last part does what the comment says it does; if the atom has a tetrahedral chirality then check if the permutation order has changed and invert the chirality if needed:
      // figure out if we need to change the stereo tags on the atoms:
      if (mol.getAtomWithIdx(bidx)->getChiralTag() ==
              Atom::CHI_TETRAHEDRAL_CCW ||
          mol.getAtomWithIdx(bidx)->getChiralTag() ==
              Atom::CHI_TETRAHEDRAL_CW) {
        checkChiralityPostMove(mol, mol.getAtomWithIdx(bidx),
                               res->getAtomWithIdx(bidx),
                               mol.getBondBetweenAtoms(bidx, eidx));
      }
      if (mol.getAtomWithIdx(eidx)->getChiralTag() ==
              Atom::CHI_TETRAHEDRAL_CCW ||
          mol.getAtomWithIdx(eidx)->getChiralTag() ==
              Atom::CHI_TETRAHEDRAL_CW) {
        checkChiralityPostMove(mol, mol.getAtomWithIdx(eidx),
                               res->getAtomWithIdx(eidx),
                               mol.getBondBetweenAtoms(bidx, eidx));
}
The code to check if the chirality has changed iterates through all of the bonds of the old atom ("oAt") to make a list ("newOrder") containing all of the atom identifiers on the other side of the bond. (The "OEDGE_ITER" is part of the adapter to work with Boost Graph library. It's an "out_edge_iterator".)
void checkChiralityPostMove(const ROMol &mol, const Atom *oAt, Atom *nAt,
                            const Bond *bond) {
  INT_LIST newOrder;
  ROMol::OEDGE_ITER beg, end;
  boost::tie(beg, end) = mol.getAtomBonds(oAt);
  while (beg != end) {
    const BOND_SPTR obond = mol[*beg];
    ++beg;
The code knows that the RemoveBond()/AddBond() caused the new dummy atom to be placed at the end of bond list, so when it sees the old bond it simply ignores it. Once it's gone through the old atoms, it adds the new wildcard atom id to the end of the list. Interestingly, this knowledge isn't something I can depend on, because it's an implementation detail which might change in the future. That's why I needed to substitute the new bond id at this point in my code.
    if (obond.get() == bond) {
      continue;
    }
    newOrder.push_back(obond->getIdx());
  }
  newOrder.push_back(bond->getIdx());
The last bit is to compute the permutation order, or as it says, "perturbation order". I believe that is a typo. Dig down a bit and it uses a selection sort to count the number of swaps needed to make the "oAt" atom list and the "newOrder" list match. The result, modulo two, gives the parity. If the parity is odd, invert the chirality:
  unsigned int nSwaps = oAt->getPerturbationOrder(newOrder);
  if (nSwaps % 2) nAt->invertChirality();
}
This was a bit different than my approach, which compared the parity before and after, but it gives the same results.

Jumping back to the fragmentOnBonds() function, the last bit of the function is:

  res->clearComputedProps();
  return static_cast<ROMol *>(res);
This is needed because some of the computed properties may depend on bond patterns which are no longer present. Note that it does not use SanitizeMol(), which is something I investigated in the previous essay.

While the details are a bit different between my fragment_chiral() and FragmentOnBonds(), the general approach is the same.

A possible optimization

I mentioned that checkChiralityPostMove() uses insider knowledge. It knows that the deleted bond is removed from the list and the new bond added at the end. It uses this information to build the "newOrder" list with atom indices in the correct order, before determining the difference in parity between that list and the list of atom indices around the new bond.

There's an easier way. Count the number of bonds between where the deleted bond was and the end, and take the result modulo 2. For example, if the transformation were:

initial neighbors = [1, 6, 5, 4]
   - remove bond to atom 5
neighbors after delete = [1, 6, 4]
   - add bond to new atom 9
final neighbors = [1, 6, 4, 9]
then the bond to atom 5 was in the third position, with one bond (to atom 4) between it and the end of the list. The parity is therefore odd. I added this suggestion to the RDKit issue tracker as #1033.

The proof is simple. Assume there are n elements after the bond to delete. Then there are n swaps needed to bring it to the end, where it can be replaced with the connection to the new atom. Hence the parity is n%2.
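The same count is easy to express in Python. This is only an illustration of the suggestion, with a made-up helper name; it isn't part of RDKit:

def parity_change_on_replacement(neighbor_ids, removed_id):
    # Parity of moving 'removed_id' to the end of the neighbor list,
    # which is where the bond to the new wildcard atom is added.
    num_after = len(neighbor_ids) - 1 - neighbor_ids.index(removed_id)
    return num_after % 2

# The worked example above: the bond to atom 5 has one neighbor after it,
# so the parity is odd.
assert parity_change_on_replacement([1, 6, 5, 4], 5) == 1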

Not using SanitizeMol for this investigation

In the earlier essay in this series, "Fragment chiral molecules in RDKit", I investigated some failure cases where the chirality wasn't preserved. Greg Landrum pointed out: "after you break one or more bonds, you really, really should re-sanitize the molecule (or at least call ClearComputedProps())". I found that if I added SanitizeMol() then some of the error cases I got when using only ClearComputedProps() were no longer error cases. This may be because of a bug in FastFindRings. Greg, in the second link, writes: "There is a workaround until this is fixed; explicitly sanitize your molecules or just call GetSymmSSSR() before you generate SMILES."

So that's what I did. However, without SanitizeMol() there were only 232 failures out of 1.4M+ input structures. Adding SanitizeMol() didn't fix all of the remaining errors. On the other hand, the performance overhead of SanitizeMol() is (as expected) larger than changing a few bonds around. In one timing test I made, the time to process 10,000 structures using fragment_chiral() without SanitizeMol() was 107 seconds, while the time with SanitizeMol() was 255 seconds. The overall processing takes about twice as long. (There is an additional constant overhead to parse the SMILES and canonicalize the fragmented molecule, which I did not include.)

For this pedagogical investigation, I don't think the performance impact is worth the slightly lower error rate, so I will not include SanitizeMol() in this essay. That means I can use my new fragment_on_bond() function as-is, and I'll modify fragment_chiral() so it calls ClearComputedProps() instead of SanitizeMol().

In any case, the explicit call to SanitizeMol() is a workaround. I expect ClearComputedProps() will suffice in the RDKit of the future world that you, the reader, inhabit.

Cross-validation

In the earlier essay I tested the fragment_chiral() function by attempting to re-connect the fragments. If the canonicalized result doesn't match the canonicalized input structure then there's a problem. (And there were a small number of problems, at the edge of RDKit's ability to handle chemistry.)

That test ignored what SMILES calls "directional bonds", that is the "/" and "\" in "F/C=C\F", which is how SMILES denotes the stereochemistry of double bonds. This is how SMILES deals with "E-Z notation", which would be more familiar to a chemist. The tests ignored those bonds for the simple reason that FragmentOnBonds() doesn't preserve that information if it cuts the bond. In a future essay I'll show how to implement that ability.

Instead of fully testing fragment_on_bonds(), I'll do a weaker test where I compare its results to fragment_chiral(). I say it's weaker because what it shows is that they are likely bug-compatible, which is different than being correct. It's also weaker because the two implementation methods are very similar. A stronger test would use a different way to fragment.

I think of this as a cross-validation test. As Wikipedia helpfully points out, that term is overloaded. Statistics uses a different definition than chemistry; the latter usage is very similar to what's used in verification and validation, so I think I'm right to use the term.

The test code is not interesting, so I'll put it at the end and not discuss it in detail. The results weren't interesting either. There were no mismatches. I just had to wait for a few hours while I let it process the 1.4M+ structures from ChEMBL 20.

Here are the last few lines of output:

Processed 1455762 records, 1455763 molecules, 16017700 tests, 0 mismatches.  T_chiral: 5912.26 T_on_bond 3554.22
elapsed time: 8:30:15
The two implementations gave identical results, and the time needed for fragment_chiral() is about 1.7x that needed for fragment_on_bond(). The fragment_on_bond() code is shorter, faster, and easier to understand, so you should always use it instead of fragment_chiral().

The code

The fragment_chiral() function here is a copy of the one from the previous essay, except I replaced the SanitizeMol(new_mol) call with new_mol.ClearComputedProps() for 2x performance, at the expense of a few incorrect structures. (The SanitizeMol() is a work-around to a current bug report. It shouldn't be needed in long-term use.) I put the code here so the test code is self-contained, except of course you'll need your own dataset.

from __future__ import print_function
import time
import datetime
from rdkit import Chem

# Two different methods to cut a bond and fragment an RDKit molecule.
#   - fragment_on_bond() uses RDKit's FragmentOnBonds()
#   - fragment_chiral() uses lower-level API calls

# This also includes a cross-validation function which checks that the
# two methods produce the same fragments as output.
# The final two lines when I evaluated ChEMBL20 were:
#
# Processed 1455762 records, 1455763 molecules, 16017700 tests, 0 mismatches.  T_chiral: 5912.26 T_on_bond 3554.22
# elapsed time: 8:30:15
#
# The timings show that fragment_on_bond() is about 1.7x the performance of fragment_chiral().


#### Fragment using a high-level RDKit API function

def fragment_on_bond(mol, atom1, atom2):
    bond = mol.GetBondBetweenAtoms(atom1, atom2)
    new_mol = Chem.FragmentOnBonds(mol, [bond.GetIdx()], dummyLabels=[(0, 0)])
    # FragmentOnBonds() calls ClearComputedProps() at the end.  There
    # is a current bug report where, as a downstream effect, that may
    # cause some chiralities to change, most notably on some
    # bridgeheads.. A workaround for now is to call SanitizeMol(),
    # though that ends up tripling the time. I'll stay compatible
    # with FragmentOnBonds() and not call it.
    #Chem.SanitizeMol(new_mol)

    return new_mol


#### Fragment using low-level RDKit API functions

# See http://www.dalkescientific.com/writings/diary/archive/2016/08/14/fragment_chiral_molecules.html
# for implementation discussion

CHI_TETRAHEDRAL_CW = Chem.ChiralType.CHI_TETRAHEDRAL_CW
CHI_TETRAHEDRAL_CCW = Chem.ChiralType.CHI_TETRAHEDRAL_CCW

def parity_shell(values):
    # Simple Shell sort; while O(N^2), we only deal with at most 4 values
    # See http://www.dalkescientific.com/writings/diary/archive/2016/08/15/fragment_parity_calculation.html
    # for faster versions for fixed-size lists.
    values = list(values)
    N = len(values)
    num_swaps = 0
    for i in range(N-1):
        for j in range(i+1, N):
            if values[i] > values[j]:
                values[i], values[j] = values[j], values[i]
                num_swaps += 1
    return num_swaps % 2


def get_bond_parity(mol, atom_id):
    """Compute the parity of the atom's bond permutation

    Return None if it does not have tetrahedral chirality,
    0 for even parity, or 1 for odd parity.
    """
    atom_obj = mol.GetAtomWithIdx(atom_id)
    
    # Return None unless it has tetrahedral chirality
    chiral_tag = atom_obj.GetChiralTag()
    if chiral_tag not in (CHI_TETRAHEDRAL_CW, CHI_TETRAHEDRAL_CCW):
        return None
    
    # Get the list of atom ids for each atom it's bonded to.
    other_atom_ids = [bond.GetOtherAtomIdx(atom_id) for bond in atom_obj.GetBonds()]
    
    # Use those ids to determine the parity
    return parity_shell(other_atom_ids)


def set_bond_parity(mol, atom_id, old_parity, old_other_atom_id, new_other_atom_id):
    """Compute the new bond parity and flip chirality if needed to match the old parity"""
    
    atom_obj = mol.GetAtomWithIdx(atom_id)
    # Get the list of atom ids for each atom it's bonded to.
    other_atom_ids = [bond.GetOtherAtomIdx(atom_id) for bond in atom_obj.GetBonds()]
    
    # Replace id from the new wildcard atom with the id of the original atom
    i = other_atom_ids.index(new_other_atom_id)
    other_atom_ids[i] = old_other_atom_id
    
    # Use those ids to determine the parity
    new_parity = parity_shell(other_atom_ids)
    if old_parity != new_parity:
        # If the parity has changed, invert the chirality
        atom_obj.InvertChirality()

def fragment_chiral(mol, atom1, atom2):
    """Cut the bond between atom1 and atom2 and replace with connections to wildcard atoms

    Return the fragmented structure as a new molecule.
    """
    rwmol = Chem.RWMol(mol)
    
    atom1_parity = get_bond_parity(mol, atom1)
    atom2_parity = get_bond_parity(mol, atom2)
    
    rwmol.RemoveBond(atom1, atom2)
    wildcard1 = rwmol.AddAtom(Chem.Atom(0))
    wildcard2 = rwmol.AddAtom(Chem.Atom(0))
    new_bond1 = rwmol.AddBond(atom1, wildcard1, Chem.BondType.SINGLE)
    new_bond2 = rwmol.AddBond(atom2, wildcard2, Chem.BondType.SINGLE)
    
    if atom1_parity is not None:
        set_bond_parity(rwmol, atom1, atom1_parity, atom2, wildcard1)
    if atom2_parity is not None:
        set_bond_parity(rwmol, atom2, atom2_parity, atom1, wildcard2)
    
    # After breaking bonds, should re-sanitize See
    # https://github.com/rdkit/rdkit/issues/1022#issuecomment-239355482
    # or at least call ClearComputedProps().  I found that
    # SanitizeMol() improves chiral carbon bridgeheads handling,
    # though using it doubles the execution time. I'll stick with
    # ClearComputedProps(), which matches what FragmentOnBonds() does
    new_mol = rwmol.GetMol()
    # I must ClearComputedProps() after editing the structure.
    new_mol.ClearComputedProps()
    #Chem.SanitizeMol(new_mol)
    return new_mol

####### Cross-validation code

def read_records(filename):
    with open(filename) as infile:
        for recno, line in enumerate(infile):
            smiles, id = line.split()
            if "*" in smiles:
                continue
            yield recno, id, smiles

def cross_validate():
    # Match a single bond not in a ring
    BOND_SMARTS = "[!#0;!#1]-!@[!#0;!#1]"
    single_bond_pat = Chem.MolFromSmarts(BOND_SMARTS)
    
    num_mols = num_tests = num_mismatches = 0
    time_chiral = time_on_bond = 0.0

    # Helper function to print status information. Since
    # the function is defined inside of another function,
    # it has access to variables in the outer function.
    def print_status():
        print("Processed {} records, {} molecules, {} tests, {} mismatches."
              "  T_chiral: {:.2f} T_on_bond {:.2f}"
              .format(recno, num_mols, num_tests, num_mismatches,
                      time_chiral, time_on_bond))
    
    filename = "/Users/dalke/databases/chembl_20_rdkit.smi"
    start_time = datetime.datetime.now()
    for recno, id, smiles in read_records(filename):
        if recno % 1000 == 0:
            print_status()

        mol = Chem.MolFromSmiles(smiles)
        if mol is None:
            continue
        num_mols += 1

        # Find all of the single bonds
        matches = mol.GetSubstructMatches(single_bond_pat)

        for a1, a2 in matches:
            # For each pair of atom indices, split on that bond
            # and determine the canonicalized fragmented molecule.
            # Do it for both fragment_chiral() ...
            t1 = time.time()
            mol_chiral = fragment_chiral(mol, a1, a2)
            t2 = time.time()
            time_chiral += t2-t1
            smiles_chiral = Chem.MolToSmiles(mol_chiral, isomericSmiles=True)

            # ... and for fragment_on_bond()
            t1 = time.time()
            mol_on_bond = fragment_on_bond(mol, a1, a2)
            t2 = time.time()
            time_on_bond += t2-t1
            smiles_on_bond = Chem.MolToSmiles(mol_on_bond, isomericSmiles=True)

            # Report any mismatches and keep on going.
            num_tests += 1
            if smiles_chiral != smiles_on_bond:
                print("Mismatch: record: {} id: {} smiles: {} cut: {} {}".format(
                      recno, id, smiles, a1, a2))
                print(" smiles_chiral:", smiles_chiral)
                print("smiles_on_bond:", smiles_on_bond)
                num_mismatches += 1
    
    end_time = datetime.datetime.now()
    elapsed_time = str(end_time - start_time)
    elapsed_time = elapsed_time.partition(".")[0]  # don't display subseconds
    
    print("Done.")
    print_status()
    print("elapsed time:", elapsed_time)
                
if __name__ == "__main__":
    cross_validate()



Faster parity calculation #

In the previous essay I needed to determine the parity of a permutation. I used a Shell sort and counted the number of swaps needed to order the list. The parity is even (or "0") if the number of swaps is even, otherwise it's odd (or "1"). The final code was:

def parity_shell(values):
    # Simple Shell sort; while O(N^2), we only deal with at most 4 values 
    values = list(values)
    N = len(values)
    num_swaps = 0
    for i in range(N-1):
        for j in range(i+1, N):
            if values[i] > values[j]:
                values[i], values[j] = values[j], values[i]
                num_swaps += 1
    return num_swaps % 2

I chose this implementation because it's easy to understand, and any failure case is easily found. However, it's not fast.

It's tempting to use a better sort method. The Shell sort takes quadratic time in the number of elements, while others take O(N*ln(N)) time in the asymptotic case.

However, an asymptotic analysis is pointless for this case. The code will only ever receive 3 terms (if there is a chiral hydrogen) or 4 terms because the code will only ever be called for tetrahedral chirality.

Sorting networks

The first time I worked on this problem, I used a sorting network. A sorting network works on a fixed number of elements. It uses a pre-determined set of pairwise comparisons, each followed by a swap if needed. These are often used where code branches are expensive, like in hardware or on a GPU. A sorting network takes constant time, so can help minimize timing side-channel attacks, where the time to sort may give some insight into what is being sorted.

A general algorithm to find a perfect sorting network for a given number of elements N isn't known, though there are non-optimal algorithms like Bose-Nelson and Batcher's odd–even mergesort, and optimal solutions are known for up to N=10.

John M. Gamble has a CGI script which will generate a sorting network for a given number of elements and choice of algorithm. For N=4 it generates:

N=4 elements: SWAP(0, 1); SWAP(2, 3); SWAP(0, 2); SWAP(1, 3); SWAP(1, 2);
where the SWAP would modify the elements of an array in-place. Here's one way to turn those instructions into a 4-element sort for Python:
def sort4(data):
  if data[1] < data[0]:  # SWAP(0, 1)
    data[0], data[1] = data[1], data[0]
  
  if data[3] < data[2]:  # SWAP(2, 3)
    data[2], data[3] = data[3], data[2]
  
  if data[2] < data[0]:  # SWAP(0, 2)
    data[0], data[2] = data[2], data[0]
  
  if data[3] < data[1]:  # SWAP(1, 3)
    data[1], data[3] = data[3], data[1]
  
  if data[2] < data[1]:  # SWAP(1, 2)
    data[1], data[2] = data[2], data[1]

As a test, I'll sort every permutation of four values and make sure the result is sorted. I could write the test cases out manually, but it's easier to use the "permutations()" function from Python's itertools module, as in this example with 3 values:

>>> list(itertools.permutations([1, 2, 3]))
[(1, 2, 3), (1, 3, 2), (2, 1, 3), (2, 3, 1), (3, 1, 2), (3, 2, 1)]
Here's the test, which confirms that the function sorts correctly:
>>> for permutation in itertools.permutations([0, 1, 2, 3]):
...   permutation = list(permutation) # Convert the tuple to a list so sort4() can swap elements
...   sort4(permutation)
...   if permutation != [0, 1, 2, 3]:
...     print("ERROR:", permutation)
... 
>>>

I think it's obvious how to turn this into a parity function by adding a swap counter. If the input array cannot be modified then the parity function needs to make a copy of the array first. That's what parity_shell() does. (NOTE: the day after writing this I realized I'm thinking like a C programmer. In Python the better solution is to copy the items into local variables.)
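For example, here's a sketch of the N=4 network above turned directly into a parity function which copies the values into local variables instead of modifying the input. The name parity4() is mine; it should agree with parity_shell() on all 24 permutations:

def parity4(values):
    # The same 5-comparison network as sort4(), applied to local copies
    # so the input is left alone, counting swaps to get the parity.
    x0, x1, x2, x3 = values
    num_swaps = 0
    if x1 < x0:
        x0, x1, num_swaps = x1, x0, num_swaps + 1
    if x3 < x2:
        x2, x3, num_swaps = x3, x2, num_swaps + 1
    if x2 < x0:
        x0, x2, num_swaps = x2, x0, num_swaps + 1
    if x3 < x1:
        x1, x3, num_swaps = x3, x1, num_swaps + 1
    if x2 < x1:
        x1, x2, num_swaps = x2, x1, num_swaps + 1
    return num_swaps % 2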

No need to sort

A sorting network will always do D comparisons, but not all of those comparisons are always needed. The reason is simple - if you think of the network as a decision tree, where each comparison is a branch, then D comparisons will always have 2^D leaves. This must be at least as large as N!, where N is the number of elements in the list. But N! for N>2 is not a perfect power of 2, so there will be some unused leaves.

I would like to minimize the number of comparisons. I would also like to not modify the array in-place by actually sorting it.

The key realization is that there's no need to sort in order to determine the parity. For example, if there are only two elements in the list, then the parity is as simple as testing

def two_element_parity(x):
  assert len(x) == 2
  return int(x[0] > x[1])

The three element parity is a bit harder to do by hand:

def three_element_parity(x):
  assert len(x) == 3
  if x[0] < x[1]:
    if x[1] < x[2]:
      return 0      # 1, 2, 3
    elif x[0] < x[2]:
      return 1      # 1, 3, 2
    else:
      return 0      # 2, 3, 1
  elif x[0] < x[2]:
    return 1        # 2, 1, 3
  elif x[1] < x[2]:
    return 0        # 3, 1, 2
  else:
    return 1        # 3, 2, 1
It's complicated enough that it took several attempts before it was correct. I had to fix it using the following test code, which uses parity_shell() as a reference because I'm confident that it gives the correct values. (A useful development technique is to write something that you know works, even if it's slow, so you can use it to test more complicated code which better fits your needs.)

The test code is:

def test_three_element_parity():
  for x in itertools.permutations([1,2,3]):
    p1 = parity_shell(x)
    p2 = three_element_parity(x)
    if p1 != p2:
      print("MISMATCH", x, p1, p2)
    else:
      print("Match", x, p1, p2)
which gives the output:
>>> test_three_element_parity()
Match (1, 2, 3) 0 0
Match (1, 3, 2) 1 1
Match (2, 1, 3) 1 1
Match (2, 3, 1) 0 0
Match (3, 1, 2) 0 0
Match (3, 2, 1) 1 1

A debugging technique

As I said, it took a couple of iterations to get correct code. Sometimes I wasn't sure which branch was used to get a 0 or 1. During development I added a second field to each return value, to serve as a tag. The code looked like:

def three_element_parity(x):
  assert len(x) == 3
  if x[0] < x[1]:
    if x[1] < x[2]:
      return 0,1      # 1, 2, 3
    elif x[0] < x[2]:
      return 1,2      # 1, 3, 2
    else:
      return 0,3      # 2, 3, 1
  …
which meant I could see which path was in error. Here's one of the debug outputs using that technique:
>>> test_three_element_parity()
MISMATCH (1, 2, 3) 0 (0, 0)
MISMATCH (1, 3, 2) 1 (1, 1)
MISMATCH (2, 1, 3) 1 (1, 3)
MISMATCH (2, 3, 1) 0 (0, 2)
MISMATCH (3, 1, 2) 0 (0, 4)
MISMATCH (3, 2, 1) 1 (1, 5)
Of course the "MISMATCH" now is misleading and I need to compare things by eye, but with this few number of elements that's fine. For more complicated code I would modify the test code as well.

Brute force solution

The last time I worked on this problem I turned the sorting network for N=4 into a decision tree. With 5 swaps there are 2^5 = 32 terminal nodes, but only N! = 4! = 24 of them will be used. I pruned them by hand, which is possible with 32 leaves.

I thought this time I would come up with some clever way to handle this, and pulled out Knuth's "The Art of Computer Programming" for a pointer; it has a lot about optimal sorts and sorting networks. Oddly, "parity" wasn't in the index.

There's probably some interesting pattern I could use to figure out which code paths to use, but N is small, so I decided to brute force it.

I'll start with all of the possible permutations:

>>> permutations = list(itertools.permutations(range(3)))
>>> permutations
[(0, 1, 2), (0, 2, 1), (1, 0, 2), (1, 2, 0), (2, 0, 1), (2, 1, 0)]
I want to build a decision tree where each leaf contains only one permutation. Each decision will be made by choosing two indices to use for the comparison test. I'll go through the permutations. If a permutation's values at those indices are sorted then I'll put it into the "lt_permutations" list ("lt" is short for "less than"), otherwise it goes into the "gt_permutations" list.

For now, I'll assume the first pair of indices to swap is (0, 1):

>>> lt_permutations = []
>>> gt_permutations = []
>>> for permutation in permutations:
...   if permutation[0] < permutation[1]:
...     lt_permutations.append(permutation)
...   else:
...     gt_permutations.append(permutation)
... 
>>> lt_permutations
[(0, 1, 2), (0, 2, 1), (1, 2, 0)]
>>> gt_permutations
[(1, 0, 2), (2, 0, 1), (2, 1, 0)]

I'll turn the above into a utility function:

def partition_permutations(i, j, permutations):
    lt_permutations = []
    gt_permutations = []
    for permutation in permutations:
        if permutation[i] < permutation[j]:
            lt_permutations.append(permutation)
        else:
            gt_permutations.append(permutation)
    return lt_permutations, gt_permutations
and use it to further partition the 'lt_permutations' on the pair (1, 2):
>>> lt_permutations2, gt_permutations2 = partition_permutations(1, 2, lt_permutations)
>>> lt_permutations2
[(0, 1, 2)]
>>> gt_permutations2
[(0, 2, 1), (1, 2, 0)]
The lt_permutations2 list contains one element, so this time I'll partition gt_permutations2 using the swap index pair (0, 2):
>>> lt_permutations3, gt_permutations3 = partition_permutations(0, 2, gt_permutations2)
>>> lt_permutations3
[(0, 2, 1)]
>>> gt_permutations3
[(1, 2, 0)]
Each partitioning corresponds to additional if-statements until there is only one element in the branch. I want to use the above information to make a decision tree which looks like:
def parity3(data):
  if data[0] < data[1]:
    if data[1] < data[2]:
      return 0 # parity of (0, 1, 2)
    else:
      if data[0] < data[2]:
        return 1 # parity of (0, 2, 1)
      else:
        return 0 # parity of (1, 2, 0)
  ...

Partition scoring

In the previous section I partitioned using the successive pairs (0, 1), (1, 2) and (0, 2). These are pretty obvious. What should I use for N=4 or higher? In truth, I could likely use the same swap pairs as the sorting network, but I decided to continue with brute force.

For N items there are N*(N-1)/2 possible swap pairs.

>>> n = 4
>>> swap_pairs = [(i, j) for i in range(n-1) for j in range(i+1, n)]
>>> swap_pairs
[(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
I decided to pick the pair which is most likely to partition the set of permutations in half. For each pair, I partition the given permutations and score it with the absolute value of the difference between the sizes of the "less than" and "greater than" subsets; a lower score means a more even split.
def get_partition_score(swap_pair, permutations):
    i, j = swap_pair
    num_lt = num_gt = 0
    for permutation in permutations:
        if permutation[i] < permutation[j]:
            num_lt += 1
        else:
            num_gt += 1
    return abs(num_lt - num_gt)
I'll create all permutations for 4 terms and score each of the pairs on the results:
>>> permutations = list(itertools.permutations(range(n)))
>>> len(permutations)
24
>>> permutations[:3]
[(0, 1, 2, 3), (0, 1, 3, 2), (0, 2, 1, 3)]
>>> for swap_pair in swap_pairs:
...   print(swap_pair, get_partition_score(swap_pair, permutations))
... 
(0, 1) 0
(0, 2) 0
(0, 3) 0
(1, 2) 0
(1, 3) 0
(2, 3) 0
Every pair has a score of 0, which means all the partitions are equally good. I'll use the first, which is (0, 1), and partition into two sets of 12 each:
>>> lt_permutations, gt_permutations = partition_permutations(0, 1, permutations)
>>> lt_permutations
[(0, 1, 2, 3), (0, 1, 3, 2), (0, 2, 1, 3), (0, 2, 3, 1), (0, 3, 1, 2),
(0, 3, 2, 1), (1, 2, 0, 3), (1, 2, 3, 0), (1, 3, 0, 2), (1, 3, 2, 0),
(2, 3, 0, 1), (2, 3, 1, 0)]
>>> len(lt_permutations), len(gt_permutations)
(12, 12)
I'll redo the scoring with the "less than" permutations:
>>> permutations = lt_permutations
>>> for swap_pair in swap_pairs:
...   print(swap_pair, get_partition_score(swap_pair, permutations))
... 
(0, 1) 12
(0, 2) 4
(0, 3) 4
(1, 2) 4
(1, 3) 4
(2, 3) 0
Obviously it does no good to use (0, 1) again because all of these permutations are already ordered at those positions. Most of the other pairs are also partially ordered, so using them leads to an imbalanced 4-8 partitioning, but (2, 3) gives a perfect split, so I'll use it for the next partitioning, and again select the "less-than" subset and re-score:
>>> lt_permutations, gt_permutations = partition_permutations(2, 3, permutations)
>>> lt_permutations
[(0, 1, 2, 3), (0, 2, 1, 3), (0, 3, 1, 2), (1, 2, 0, 3), (1, 3, 0, 2), (2, 3, 0, 1)]
>>> gt_permutations
[(0, 1, 3, 2), (0, 2, 3, 1), (0, 3, 2, 1), (1, 2, 3, 0), (1, 3, 2, 0), (2, 3, 1, 0)]
>>> permutations = lt_permutations
>>> for swap_pair in swap_pairs:
...   print(swap_pair, get_partition_score(swap_pair, permutations))
... 
(0, 1) 6
(0, 2) 0
(0, 3) 4
(1, 2) 4
(1, 3) 0
(2, 3) 6

Repeat this process until only one permutation is left, and use the parity of that permutation as the return value.

Code generation

I'll combine the above code and put it into a program which generates Python code to compute the parity of a list with N distinct items. It uses recursion. The main entry point is "generate_parity_function()", which sets up the data for the recursive function "_generate_comparison()". That function identifies the best pair of indices to use for the comparison, then calls itself to process each side.

On the other hand, if there's only one permutation in the list, then there's nothing more to do but compute the parity of that permutation and use it as the return value for that case.

import itertools

def parity_shell(values):
    # Simple Shell sort; while O(N^2), we only deal with at most 4 values 
    values = list(values)
    N = len(values)
    num_swaps = 0
    for i in range(N-1):
        for j in range(i+1, N):
            if values[i] > values[j]:
                values[i], values[j] = values[j], values[i]
                num_swaps += 1
    return num_swaps % 2

def get_partition_score(swap_pair, permutations):
    i, j = swap_pair
    num_lt = num_gt = 0
    for permutation in permutations:
        if permutation[i] < permutation[j]:
            num_lt += 1
        else:
            num_gt += 1
    return abs(num_lt - num_gt)

def partition_permutations(i, j, permutations):
    lt_permutations = []
    gt_permutations = []
    for permutation in permutations:
        if permutation[i] < permutation[j]:
            lt_permutations.append(permutation)
        else:
            gt_permutations.append(permutation)
    return lt_permutations, gt_permutations

def generate_parity_function(n):
    print("def parity{}(data):".format(n))
    permutations = list(itertools.permutations(range(n)))
    swap_pairs = [(i, j) for i in range(n-1) for j in range(i+1, n)]
    _generate_comparison(permutations, swap_pairs, "  ")

def _generate_comparison(permutations, swap_pairs, indent):
    if len(permutations) == 1:
        parity = parity_shell(permutations[0])
        print(indent + "return {} # {} ".format(parity, permutations[0]))
        return
    
    swap_pair = min(swap_pairs, key=lambda x: get_partition_score(x, permutations))
    # Delete the swap pair because it can't be used again.
    # (Not strictly needed as it will always have the worst score.)
    del swap_pairs[swap_pairs.index(swap_pair)]

    # I could have a case where the lt subset has 0 elements while the
    # gt subset has 1 element. Rather than have the 'if' block do nothing,
    # I'll swap the comparison indices and swap branches.
    i, j = swap_pair
    lt_permutations, gt_permutations = partition_permutations(i, j, permutations)
    if not lt_permutations:
        lt_permutations, gt_permutations = gt_permutations, lt_permutations
        i, j = j, i
    
    print(indent + "if data[{i}] < data[{j}]:".format(i=i, j=j))
    # Need to copy the swap_pairs because the 'else' case may reuse a pair.
    _generate_comparison(lt_permutations, swap_pairs[:], indent+"  ")
    if gt_permutations:
        print(indent + "else:")
        _generate_comparison(gt_permutations, swap_pairs, indent+"  ")
    

if __name__ == "__main__":
    import sys
    n = 4
    if sys.argv[1:]:
        n = int(sys.argv[1])
    generate_parity_function(n)

The output for n=2 elements is the expected trivial case:

def parity2(data):
  if data[0] < data[1]:
    return 0 # (0, 1) 
  else:
    return 1 # (1, 0) 
For n=3 it's a bit more complicated.
def parity3(data):
  if data[0] < data[1]:
    if data[0] < data[2]:
      if data[1] < data[2]:
        return 0 # (0, 1, 2) 
      else:
        return 1 # (0, 2, 1) 
    else:
      return 0 # (1, 2, 0) 
  else:
    if data[0] < data[2]:
      return 1 # (1, 0, 2) 
    else:
      if data[1] < data[2]:
        return 0 # (2, 0, 1) 
      else:
        return 1 # (2, 1, 0) 
and for n=4 elements, well, you can see why I wrote a program to help generate the function:
def parity4(data):
  if data[0] < data[1]:
    if data[2] < data[3]:
      if data[0] < data[2]:
        if data[1] < data[2]:
          return 0 # (0, 1, 2, 3) 
        else:
          if data[1] < data[3]:
            return 1 # (0, 2, 1, 3) 
          else:
            return 0 # (0, 3, 1, 2) 
      else:
        if data[0] < data[3]:
          if data[1] < data[3]:
            return 0 # (1, 2, 0, 3) 
          else:
            return 1 # (1, 3, 0, 2) 
        else:
          return 0 # (2, 3, 0, 1) 
    else:
      if data[0] < data[3]:
        if data[1] < data[2]:
          if data[1] < data[3]:
            return 1 # (0, 1, 3, 2) 
          else:
            return 0 # (0, 2, 3, 1) 
        else:
          return 1 # (0, 3, 2, 1) 
      else:
        if data[0] < data[2]:
          if data[1] < data[2]:
            return 1 # (1, 2, 3, 0) 
          else:
            return 0 # (1, 3, 2, 0) 
        else:
          return 1 # (2, 3, 1, 0) 
  else:
    if data[2] < data[3]:
      if data[0] < data[3]:
        if data[0] < data[2]:
          return 1 # (1, 0, 2, 3) 
        else:
          if data[1] < data[2]:
            return 0 # (2, 0, 1, 3) 
          else:
            return 1 # (2, 1, 0, 3) 
      else:
        if data[1] < data[2]:
          return 1 # (3, 0, 1, 2) 
        else:
          if data[1] < data[3]:
            return 0 # (3, 1, 0, 2) 
          else:
            return 1 # (3, 2, 0, 1) 
    else:
      if data[0] < data[2]:
        if data[0] < data[3]:
          return 0 # (1, 0, 3, 2) 
        else:
          if data[1] < data[3]:
            return 1 # (2, 0, 3, 1) 
          else:
            return 0 # (2, 1, 3, 0) 
      else:
        if data[1] < data[2]:
          if data[1] < data[3]:
            return 0 # (3, 0, 2, 1) 
          else:
            return 1 # (3, 1, 2, 0) 
        else:
          return 0 # (3, 2, 1, 0) 

The test code is essentially the same as "test_three_element_parity()", just over the permutations of four values.
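For reference, here's a sketch of the four-element version, assuming parity4() and parity_shell() are both defined:

import itertools

def test_four_element_parity():
    # Compare the generated parity4() against the trusted parity_shell()
    # for every permutation of four distinct values.
    for x in itertools.permutations([1, 2, 3, 4]):
        p1 = parity_shell(x)
        p2 = parity4(x)
        if p1 != p2:
            print("MISMATCH", x, p1, p2)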

Evaluation

I don't think it makes much sense to use this function beyond n=5 because there's so much code. Here's a table of the number of lines of code it generates for different values of n:

# elements   # lines
----------   -------
    2              5
    3             17
    4             71
    5            349
    6          2,159
    7         15,119
    8        120,959
    9      1,088,662
This appears to be roughly factorial growth, which is what it should be. For my case, n=4, so 71 lines is not a problem.
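If you want to reproduce those counts, one way (a sketch which assumes Python 3 and that generate_parity_function() from the program above is available) is to capture the generated text and count its lines:

import io
import contextlib

def count_generated_lines(n):
    # Capture everything generate_parity_function() prints and count the lines.
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        generate_parity_function(n)
    return len(buffer.getvalue().splitlines())

for n in range(2, 6):
    print(n, count_generated_lines(n))  # should reproduce the first rows of the table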

I wrote some timing code which does 100,000 random selections from the possible permutations and compares the performance of the parityN() function with parity_shell(). To put them on a more even footing, I changed the parity_shell() implementation so it mutates the input values rather than making a temporary list. The timing code for parity5() looks like:

import itertools

def parity_shell(values):
    # Simple Shell sort; while O(N^2), we only deal with at most 4 values 
    #values = list(values)
    N = len(values)
    num_swaps = 0
    for i in range(N-1):
        for j in range(i+1, N):
            if values[i] > values[j]:
                values[i], values[j] = values[j], values[i]
                num_swaps += 1
    return num_swaps % 2


if __name__ == "__main__":
    import random
    import time
    permutations = list(itertools.permutations(range(5)))
    perms = [list(random.choice(permutations)) for i in range(100000)]
    t1 = time.time()
    p1 = [parity5(perm) for perm in perms]
    t2 = time.time()
    p2 = [parity_shell(perm) for perm in perms]
    t3 = time.time()
    if p1 != p2:
      print("oops")
    print("parity5:", t2-t1, "parity_shell:", t3-t2)
    print("ratio:", (t2-t1)/(t3-t2))
The decision tree version is consistently 5-6x faster than the Shell sort version across all the sizes I tested.

A performance improvement

By the way, I was able to raise the speedup to 9x by switching to local variables rather than indexing into the array each time. Here's the start of parity4() with that change:

def parity4(data):
  data0,data1,data2,data3 = data
  if data0 < data1:
    if data2 < data3:
      if data0 < data2:
        if data1 < data2:
          return 0 # (0, 1, 2, 3) 
        else:
          if data1 < data3:
            return 1 # (0, 2, 1, 3) 
          else:
            return 0 # (0, 3, 1, 2) 
      else:
        if data0 < data3:
          if data1 < data3:
            return 0 # (1, 2, 0, 3) 
          else:
            return 1 # (1, 3, 0, 2) 
        else:
          return 0 # (2, 3, 0, 1) 
    else:
      if data0 < data3:
        if data1 < data2:
          if data1 < data3:
            return 1 # (0, 1, 3, 2) 
          else:
            return 0 # (0, 2, 3, 1) 
        else:
          return 1 # (0, 3, 2, 1) 
      else:
        if data0 < data2:
          if data1 < data2:
            return 1 # (1, 2, 3, 0) 
          else:
            return 0 # (1, 3, 2, 0) 
        else:
          return 1 # (2, 3, 1, 0) 
  else:
    if data2 < data3:
      if data0 < data3:
        if data0 < data2:
          return 1 # (1, 0, 2, 3) 
        else:
          if data1 < data2:
            return 0 # (2, 0, 1, 3) 
          else:
            return 1 # (2, 1, 0, 3) 
      else:
        if data1 < data2:
          return 1 # (3, 0, 1, 2) 
        else:
          if data1 < data3:
            return 0 # (3, 1, 0, 2) 
          else:
            return 1 # (3, 2, 0, 1) 
    else:
      if data0 < data2:
        if data0 < data3:
          return 0 # (1, 0, 3, 2) 
        else:
          if data1 < data3:
            return 1 # (2, 0, 3, 1) 
          else:
            return 0 # (2, 1, 3, 0) 
      else:
        if data1 < data2:
          if data1 < data3:
            return 0 # (3, 0, 2, 1) 
          else:
            return 1 # (3, 1, 2, 0) 
        else:
          return 0 # (3, 2, 1, 0) 
It's easy to change the code so it generates this version instead: emit an unpacking line right after the "def" line, then compare local names rather than indexed elements.
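For the record, here's a sketch of that variation. Only the two print calls differ from the generator above; it reuses parity_shell(), get_partition_score(), and partition_permutations(), and the "_locals" names are mine:

import itertools

def generate_parity_function_locals(n):
    print("def parity{}(data):".format(n))
    names = ["data{}".format(i) for i in range(n)]
    print("  {} = data".format(",".join(names)))
    permutations = list(itertools.permutations(range(n)))
    swap_pairs = [(i, j) for i in range(n-1) for j in range(i+1, n)]
    _generate_comparison_locals(permutations, swap_pairs, "  ")

def _generate_comparison_locals(permutations, swap_pairs, indent):
    if len(permutations) == 1:
        parity = parity_shell(permutations[0])
        print(indent + "return {} # {}".format(parity, permutations[0]))
        return
    swap_pair = min(swap_pairs, key=lambda x: get_partition_score(x, permutations))
    del swap_pairs[swap_pairs.index(swap_pair)]
    i, j = swap_pair
    lt_permutations, gt_permutations = partition_permutations(i, j, permutations)
    if not lt_permutations:
        lt_permutations, gt_permutations = gt_permutations, lt_permutations
        i, j = j, i
    # Compare local names like "data0 < data1" instead of "data[0] < data[1]".
    print(indent + "if data{i} < data{j}:".format(i=i, j=j))
    _generate_comparison_locals(lt_permutations, swap_pairs[:], indent+"  ")
    if gt_permutations:
        print(indent + "else:")
        _generate_comparison_locals(gt_permutations, swap_pairs, indent+"  ")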

Are the if/else:s worthwhile?

This is written the day after I wrote the previous text. In the above, I minimized the number of comparisons at the expense of a lot of generated code. But the sorting network doesn't add that much overhead, and it's clearly a lot easier to implement. The real test shouldn't be how much faster the if/else decision tree is than the Shell sort solution, but how much faster it is than the optimal sorting network.

So I did that. Here's the N=4 and N=5 code, which is clearly simpler than the 71 and 349 lines of code from the decision tree:

def parity4_network(data):
    # N=4 Bose-Nelson sorting network from http://pages.ripco.net/~jgamble/nw.html 
    num_swaps = 0
    x0, x1, x2, x3 = data
    if x1 < x0:
        num_swaps += 1
        x0, x1 = x1, x0
  
    if x3 < x2:
        num_swaps += 1
        x2, x3 = x3, x2
  
    if x2 < x0:
        num_swaps += 1
        x0, x2 = x2, x0
  
    if x3 < x1:
        num_swaps += 1
        x1, x3 = x3, x1
  
    if x2 < x1:
        num_swaps += 1
        x1, x2 = x2, x1
  
    return num_swaps % 2
    
def parity5_network(data):
    # N=5 Bose-Nelson sorting network from http://pages.ripco.net/~jgamble/nw.html 
    num_swaps = 0
    x0, x1, x2, x3, x4 = data
    if x1 < x0:
        num_swaps += 1
        x0, x1 = x1, x0
    
    if x4 < x3:
        num_swaps += 1
        x3, x4 = x4, x3
    
    if x4 < x2:
        num_swaps += 1
        x2, x4 = x4, x2
    
    if x3 < x2:
        num_swaps += 1
        x2, x3 = x3, x2
    
    if x3 < x0:
        num_swaps += 1
        x0, x3 = x3, x0
    
    if x2 < x0:
        num_swaps += 1
        x0, x2 = x2, x0
    
    if x4 < x1:
        num_swaps += 1
        x1, x4 = x4, x1
    
    if x3 < x1:
        num_swaps += 1
        x1, x3 = x3, x1
    
    if x2 < x1:
        num_swaps += 1
        x1, x2 = x2, x1
    
    return num_swaps % 2
My timing tests say that the N=4 sorting network takes 1.4x the time of the decision tree with local variables, and the N=5 sorting network takes 1.7x the time. The decision tree is clearly faster.

If you are really interested in performance then you could push this into C or switch to PyPy. But sometimes you just need the code to be fast enough, which is why it can be worthwhile to explore different solutions and know their tradeoffs.



Fragment chiral molecules in RDKit using low-level functions #

In the previous essay, I showed that the simple fragmentation function doesn't preserve chirality after making a single cut. Here's the function definition:

from rdkit import Chem

# Only works correctly for achiral molecules
def fragment_simple(mol, atom1, atom2):
    rwmol = Chem.RWMol(mol)
    rwmol.RemoveBond(atom1, atom2)
    wildcard1 = rwmol.AddAtom(Chem.Atom(0))
    wildcard2 = rwmol.AddAtom(Chem.Atom(0))
    rwmol.AddBond(atom1, wildcard1, Chem.BondType.SINGLE) 
    rwmol.AddBond(atom2, wildcard2, Chem.BondType.SINGLE) 
    return rwmol.GetMol()
The reason is that the RemoveBond()/AddBond() combination can change the permutation order of the bonds around an atom, which inverts the chirality. Here's the relevant part of the connection table from the end of that essay:
  connections from atom 1 (as bond type + other atom index)
1 C -0 -2 -3 -5   original structure
1 C -0 -2 -5 -11  modified; bond to atom 3 is now a bond to atom 11
             ^^^--- was bond to atom 3
         ^^^^^^^--- these two bonds swapped position = inverted chirality

I'll now show how to improve the code to handle chirality. (Note: this essay is pedagogical. To fragment in RDKit use FragmentOnBonds().)

Parity of a permutation

There's no way from Python to go in and change the permutation order of RDKit's bond list for an atom. Instead, I need to detect if the permutation order has changed, and if so, un-invert the atom's chirality.

While I say "un-invert", that's because we only need to deal with tetrahedral chirality, which has only two chirality types. SMILES supports more complicated chiralities, like octahedral (for example, "@OH19") which can't be written simply as "@" or "@@". However, I've never seen them in use.

With only two possibilities, this reduces to determining the "parity" of the permutation. There are only two possible parities. I'll call one "even" and the other "odd", though in code I'll use 0 for even and 1 for odd.

A list of values in increasing order, like (1, 2, 9), has an even parity. If I swap two values then it has odd parity. Both (2, 1, 9) and (9, 2, 1) have odd parity, because each needs only one swap to put it in sorted order. With another swap, such as (2, 9, 1), the permutation order is back to even parity. The parity of a permutation is the number of pairwise swaps needed to order the list, modulo 2. If the result is 0 then it has even parity, if the result is 1 then it has odd parity.

One way to compute the parity is to sort the list and count the number of swaps needed. Since there will only be a handful of bonds, I can use a simple sort like the Shell sort:

def parity_shell(values):
    # Simple Shell sort; while O(N^2), we only deal with at most 4 values 
    values = list(values)
    N = len(values)
    num_swaps = 0
    for i in range(N-1):
        for j in range(i+1, N):
            if values[i] > values[j]:
                values[i], values[j] = values[j], values[i]
                num_swaps += 1
    return num_swaps % 2
(I have another essay where I show how to write a faster parity function for a fixed number of values.)

I'll test it with a few different cases to see if it gives the expected results:

>>> parity_shell( (1, 2, 9) )
0
>>> parity_shell( (2, 1, 9) )
1
>>> parity_shell( (2, 9, 1) )
0
>>> parity_shell( (2, 1, 9) )
1
>>> parity_shell( (1, 3, 9, 8) )
1
There are faster and better ways to determine the parity. I find it best to start with the most obviously correct solution first.

Determine an atom's parity

The next step is to determine the configuration order before and after attaching the dummy atom. I'll use the fragment_simple() and parity_shell() functions I defined earlier, and define a couple of helper functions to create an isomeric canonical SMILES from a molecule or SMILES string.

from rdkit import Chem

def C(mol): # Create a canonical isomeric SMILES from a molecule
  return Chem.MolToSmiles(mol, isomericSmiles=True)

def Canon(smiles): # Create a canonical isomeric SMILES from a SMILES string
  return C(Chem.MolFromSmiles(smiles))

The permutation order is based on which atoms are bonded to a given atom. I'll parse a simple chiral structure (which is already in canonical form) and get the ids for the atoms bonded to the second atom. (The second atom has an index of 1.)

>>> mol = Chem.MolFromSmiles("O[C@](F)(Cl)Br")
>>> C(mol)
'O[C@](F)(Cl)Br'
>>> 
>>> atom_id = 1
>>> atom_obj = mol.GetAtomWithIdx(atom_id)
>>> other_atoms = [bond.GetOtherAtomIdx(atom_id) for bond in atom_obj.GetBonds()]
>>> other_atoms
[0, 2, 3, 4]
The list values are in order, so you won't be surprised it has a parity of 0 ("even"):
>>> parity_shell(other_atoms)
0

I'll use the fragment_simple() function to fragment between the oxygen and the chiral carbon:

>>> fragmented_mol = fragment_simple(mol, 0, 1)
>>> fragmented_smiles = C(fragmented_mol)
>>> fragmented_smiles
'[*]O.[*][C@@](F)(Cl)Br'
then use the convert_wildcards_to_closures() function from the previous essay to re-connect the fragments and produce a canonical SMILES from it:
>>> from smiles_syntax import convert_wildcards_to_closures
>>> 
>>> closure_smiles = convert_wildcards_to_closures(fragmented_smiles, (0, 0))
>>> closure_smiles
'O%90.[C@@]%90(F)(Cl)Br'
>>> 
>>> Canon(closure_smiles)
'O[C@@](F)(Cl)Br'

If you compare this to the canonicalized input SMILES you'll see the chirality is inverted from what it should be. I'll see if I can detect that from the list of neighbor atoms to the new atom 1 of the fragmented molecule:

>>> atom_id = 1
>>> atom_obj = fragmented_mol.GetAtomWithIdx(atom_id)
>>> other_atoms = [bond.GetOtherAtomIdx(atom_id) for bond in atom_obj.GetBonds()]
>>> other_atoms
[2, 3, 4, 6]
These values are ordered. It's tempting to conclude that this list also has an even parity. But recall that the original list was [0, 2, 3, 4]. The id 0 (the connection to the oxygen) has been replaced with the id 6 (the connection to the wildcard atom).

The permutation must use the same values, so I'll replace the 6 with a 0 and determine the parity of the resulting list:

>>> i = other_atoms.index(6)
>>> i
3
>>> other_atoms[i] = 0
>>> other_atoms
[2, 3, 4, 0]
>>> parity_shell(other_atoms)
1
This returned a 1 where the earlier parity call returned a 0, which means the parity is inverted, so I need to invert the chirality of the second atom:
>>> atom_obj.InvertChirality()

Now to check the re-assembled structure:

>>> fragmented_smiles = C(fragmented_mol)
>>> fragmented_smiles
'[*]O.[*][C@](F)(Cl)Br'
>>> 
>>> closure_smiles = convert_wildcards_to_closures(fragmented_smiles, (0, 0))
>>> closure_smiles
'O%90.[C@]%90(F)(Cl)Br'
>>> 
>>> Canon(closure_smiles)
'O[C@](F)(Cl)Br'
This matches the canonicalized input SMILES, so we're done.

An improved fragment function

I'll use a top-down process to describe the changes to fragment_simple() to make it work. What this doesn't show you is the several iterations I went through to make it look this nice.

At the top level, I need some code to record each cut atom's bond parity before making the cut, and, if the atom is chiral, some way to restore the correct chirality once it's connected to the new wildcard atom.

def fragment_chiral(mol, atom1, atom2):
    rwmol = Chem.RWMol(mol)

    # Store the old parity as 0 = even, 1 = odd, or None for no parity 
    atom1_parity = get_bond_parity(mol, atom1)
    atom2_parity = get_bond_parity(mol, atom2)
    
    rwmol.RemoveBond(atom1, atom2)
    wildcard1 = rwmol.AddAtom(Chem.Atom(0))
    wildcard2 = rwmol.AddAtom(Chem.Atom(0))
    new_bond1 = rwmol.AddBond(atom1, wildcard1, Chem.BondType.SINGLE)
    new_bond2 = rwmol.AddBond(atom2, wildcard2, Chem.BondType.SINGLE)

    # Restore the correct parity if there is a parity
    if atom1_parity is not None:
        set_bond_parity(rwmol, atom1, atom1_parity, atom2, wildcard1)
    if atom2_parity is not None:
        set_bond_parity(rwmol, atom2, atom2_parity, atom1, wildcard2)

    # (Later I'll find I also need to call ClearComputedProps().)
    return rwmol.GetMol()

To get an atom's bond permutation parity, check if it has tetrahedral chirality (it will be either clockwise or counter-clockwise). If it doesn't have tetrahedral chirality, return None. Otherwise use the neighboring atom ids to determine the parity:

from rdkit import Chem

CHI_TETRAHEDRAL_CW = Chem.ChiralType.CHI_TETRAHEDRAL_CW
CHI_TETRAHEDRAL_CCW = Chem.ChiralType.CHI_TETRAHEDRAL_CCW

def get_bond_parity(mol, atom_id):
    atom_obj = mol.GetAtomWithIdx(atom_id)
    
    # Return None unless it has tetrahedral chirality
    chiral_tag = atom_obj.GetChiralTag()
    if chiral_tag not in (CHI_TETRAHEDRAL_CW, CHI_TETRAHEDRAL_CCW):
        return None
    
    # Get the list of atom ids for each atom it's bonded to.
    other_atom_ids = [bond.GetOtherAtomIdx(atom_id) for bond in atom_obj.GetBonds()]
    
    # Use those ids to determine the parity
    return parity_shell(other_atom_ids)

To restore the parity, again get the list of neighboring atom ids, but this time from the fragmented molecule, where the atom is now connected to one of the new wildcard atoms. I need to map that wildcard id back to the original atom id before I can compute the new parity and, if it has changed, invert the chirality:

def set_bond_parity(mol, atom_id, old_parity, old_other_atom_id, new_other_atom_id):
    atom_obj = mol.GetAtomWithIdx(atom_id)
    # Get the list of atom ids for each atom it's bonded to.
    other_atom_ids = [bond.GetOtherAtomIdx(atom_id) for bond in atom_obj.GetBonds()]

    # Replace id from the new wildcard atom with the id of the original atom
    i = other_atom_ids.index(new_other_atom_id)
    other_atom_ids[i] = old_other_atom_id

    # Use those ids to determine the parity
    new_parity = parity_shell(other_atom_ids)
    if old_parity != new_parity:
        # If the parity has changed, invert the chirality
        atom_obj.InvertChirality()

Testing

I used a simple set of tests during the initial development where I split the bond between the first and second atom of a few structures and compared the result to a reference structure that I fragmented manually (and re-canonicalized):

# Create a canonical isomeric SMILES from a SMILES string.
# Used to put the manually-developed reference structures into canonical form.
def Canon(smiles):
    mol = Chem.MolFromSmiles(smiles)
    assert mol is not None, smiles
    return Chem.MolToSmiles(mol, isomericSmiles=True)

def simple_test():
    for smiles, expected in (
            ("CC", Canon("*C.*C")),
            ("F[C@](Cl)(Br)O", Canon("*F.*[C@](Cl)(Br)O")),
            ("F[C@@](Cl)(Br)O", Canon("*F.*[C@@](Cl)(Br)O")),
            ("F[C@@H](Br)O", Canon("*F.*[C@@H](Br)O")),
            ):
        mol = Chem.MolFromSmiles(smiles)
        fragmented_mol = fragment_chiral(mol, 0, 1)
        fragmented_smiles = Chem.MolToSmiles(fragmented_mol, isomericSmiles=True)
        if fragmented_smiles != expected:
            print("smiles:", smiles)
            print("fragmented:", fragmented_smiles)
            print("  expected:", expected)
These tests passed, so I developed some more extensive tests. My experience is that real-world chemistry is far more complex and interesting than the manual test cases I develop. After the basic tests are done, I do more extensive testing by processing a large number of structures from PubChem and then from ChEMBL.

(Incidentally, I process PubChem first because those files are generated by one tool - OEChem I believe - while ChEMBL files are generated by several different tools, each with a different way to handle chemistry. This makes the chemistry in ChEMBL more diverse.)

Need ClearComputedProps()?

The more extensive test processes every structure which contains a chiral atom (where the SMILES contains a '@'), cuts every bond between heavy atoms, so long as it's a single bond not in a ring, and puts the results back together to see if it matches the canonicalized input structure. The code isn't interesting enough to make specific comments about it. You can get the code at the end of this essay.
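The overall shape is simple, though. Here's a minimal sketch of the main loop, my reconstruction rather than the actual test code; it assumes fragment_chiral() and convert_wildcards_to_closures() from this essay and leaves out the record numbering and failure reporting details:

from rdkit import Chem
from smiles_syntax import convert_wildcards_to_closures

def check_roundtrip(smiles):
    # Cut each single, non-ring bond between heavy atoms, reassemble the
    # two fragments with ring-closure syntax, and compare to the input.
    # (The real test first filters to structures whose SMILES contains '@'.)
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return
    expected_smiles = Chem.MolToSmiles(mol, isomericSmiles=True)
    for bond in mol.GetBonds():
        if bond.GetBondType() != Chem.BondType.SINGLE or bond.IsInRing():
            continue
        atom1, atom2 = bond.GetBeginAtom(), bond.GetEndAtom()
        if atom1.GetAtomicNum() == 1 or atom2.GetAtomicNum() == 1:
            continue  # only cut bonds between heavy atoms
        fragmented_mol = fragment_chiral(mol, atom1.GetIdx(), atom2.GetIdx())
        fragmented_smiles = Chem.MolToSmiles(fragmented_mol, isomericSmiles=True)
        closure_smiles = convert_wildcards_to_closures(fragmented_smiles, (0, 0))
        final_smiles = Chem.MolToSmiles(Chem.MolFromSmiles(closure_smiles),
                                        isomericSmiles=True)
        if final_smiles != expected_smiles:
            print("FAILURE:", smiles, "atoms:", atom1.GetIdx(), atom2.GetIdx())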

The first error occurred quickly, and there were many errors. Here's the first one:

FAILURE in record 94
     input_smiles: C[C@H]1CC[C@@H](NCc2ccc3c(c2)Cc2c(-c4ccc(CC(=O)O)cc4)n[nH]c2-3)CC1
  begin/end atoms: 4 5
fragmented smiles: [*]NCc1ccc2c(c1)Cc1c(n[nH]c1-2)-c1ccc(cc1)CC(=O)O.[*][C@H]1CC[C@H](C)CC1
   closure smiles: N%90Cc1ccc2c(c1)Cc1c(n[nH]c1-2)-c1ccc(cc1)CC(=O)O.[C@@H]%901CC[C@H](C)CC1
     final smiles: C[C@H]1CC[C@H](NCc2ccc3c(c2)Cc2c(-c4ccc(CC(=O)O)cc4)n[nH]c2-3)CC1
  expected smiles: C[C@H]1CC[C@@H](NCc2ccc3c(c2)Cc2c(-c4ccc(CC(=O)O)cc4)n[nH]c2-3)CC1
Greg Landrum, the main RDKit developer, points to the solution. He writes: "after you break one or more bonds, you really, really should re-sanitize the molecule (or at least call ClearComputedProps()".

The modified code is:

def fragment_chiral(mol, atom1, atom2):
    rwmol = Chem.RWMol(mol)
    
    atom1_parity = get_bond_parity(mol, atom1)
    atom2_parity = get_bond_parity(mol, atom2)
    
    rwmol.RemoveBond(atom1, atom2)
    wildcard1 = rwmol.AddAtom(Chem.Atom(0))
    wildcard2 = rwmol.AddAtom(Chem.Atom(0))
    new_bond1 = rwmol.AddBond(atom1, wildcard1, Chem.BondType.SINGLE)
    new_bond2 = rwmol.AddBond(atom2, wildcard2, Chem.BondType.SINGLE)
    
    if atom1_parity is not None:
        set_bond_parity(rwmol, atom1, atom1_parity, atom2, wildcard1)
    if atom2_parity is not None:
        set_bond_parity(rwmol, atom2, atom2_parity, atom1, wildcard2)

    # After breaking bonds, should re-sanitize, or at least use ClearComputedProps()
    # See https://github.com/rdkit/rdkit/issues/1022#issuecomment-239355482
    new_mol = rwmol.GetMol()
    # (I will find that I need to call SanitizeMol().)
    new_mol.ClearComputedProps()
    return new_mol

Ring chirality failures

There were 200,748 chiral structures in my selected subset from PubChem. Of those, 232 unique structures had problems when I tried to cut a bond. Here's an example of one of the reported failures:

FAILURE in record 12906
     input_smiles: [O-][S@+](CC1(O)CCN(CCc2c[nH]c3ccc(F)cc23)CC1)c1ccccc1
  begin/end atoms: 0 1
fragmented smiles: [*][O-].[*][S@+](CC1(O)CCN(CCc2c[nH]c3ccc(F)cc23)CC1)c1ccccc1
   closure smiles: [O-]%90.[S@+]%90(CC1(O)CCN(CCc2c[nH]c3ccc(F)cc23)CC1)c1ccccc1
     final smiles: [O-][S@@+](CC1(O)CCN(CCc2c[nH]c3ccc(F)cc23)CC1)c1ccccc1
  expected smiles: [O-][S@+](CC1(O)CCN(CCc2c[nH]c3ccc(F)cc23)CC1)c1ccccc1

Here's the complete list of failing structures, which might make a good test set for other programs:

C-N=C(-S)[C@]1(c2cccnc2)CCCC[S@+]1[O-] 1
C=C(c1ccc([S@+]([O-])c2ccc(OC)cc2)cc1)C1CCN(C2CCCCC2)CC1 2
C=C(c1ccc([S@@+]([O-])c2ccc(OC)cc2)cc1)C1CCN(C2CCCCC2)CC1 3
C=CCCCC[N@+]1(C)C2CCC1C[C@H](CC(C#N)(c1ccccc1)c1ccccc1)C2 4
C=CCCCC[N@@+]1(C)C2CCC1C[C@H](CC(C#N)(c1ccccc1)c1ccccc1)C2 5
C=CC[S@+]([O-])C[C@H](N)C(=O)O 6
CC#CCOc1ccc([S@@+]([O-])[C@H](C(=O)NO)C(C)C)cc1 7
CC(=O)NC[C@H]1CN(c2ccc([S@+](C)[O-])cc2)C(=O)O1 8
CC(=O)Nc1cc(-c2[nH]c([S@+](C)[O-])nc2-c2ccc(F)cc2)ccn1 9
CC(=O)Nc1cc(-c2c(-c3ccc(F)cc3)nc([S@+](C)[O-])n2C)ccn1 10
CC(C(=O)O[C@@H]1CC2CCC(C1)[N@]2C)(c1ccccc1)c1ccccc1 11
CC(C)(C(=O)N[C@H]1C2CCC1C[C@@H](C(=O)O)C2)N1CCN(c2ccc(C(F)(F)F)cn2)CC1 12
CC(C)(C(=O)N[C@H]1C2CCCC1C[C@@H](C(=O)O)C2)N1CCN(c2ccc(C(F)(F)F)cn2)CC1 13
CC(C)(C(=O)N[C@H]1C2CCCC1C[C@@H](C(N)=O)C2)N1CCN(c2ccc(C(F)(F)F)cn2)CC1 14
CC(C)(C(=O)N[C@H]1C2CCCC1C[C@H](C(N)=O)C2)N1CCN(c2ccc(C(F)(F)F)cn2)CC1 15
CC(C)(Oc1ccc(C#N)cn1)C(=O)N[C@H]1C2CCCC1C[C@H](C(N)=O)C2 16
CC(C)(Oc1ccc(C#N)cn1)C(=O)N[C@H]1C2COCC1C[C@H](C(N)=O)C2 17
CC(C)(Oc1ccc(Cl)cc1)C(=O)N[C@H]1C2CCCC1C[C@@H](C(=O)O)C2 18
CC(C)(Oc1ccc(Cl)cc1)C(=O)N[C@H]1C2CCCC1C[C@@H](C(N)=O)C2 19
CC(C)(Oc1ccc(Cl)cc1)C(=O)N[C@H]1C2CCCC1C[C@H](C(=O)O)C2 20
CC(C)(Oc1ccc(Cl)cc1)C(=O)N[C@H]1C2CCCC1C[C@H](C(N)=O)C2 21
CC(C)(Oc1ccc(Cl)cc1)C(=O)N[C@H]1C2COCC1C[C@@H](C(N)=O)C2 22
CC(C)Cc1c(C#N)cc2ccccc2c1C(=O)N(C)C[C@@H](CCN1CCC(c2ccccc2[S@+](C)[O-])CC1)c1ccc(Cl)c(Cl)c1 23
CC(C)Nc1cc(-c2[nH]c([S@+](C)[O-])nc2-c2ccc(F)cc2)ccn1 24
CC(C)Nc1cc(-c2[nH]c([S@@+]([O-])C(C)C)nc2-c2ccc(F)cc2)ccn1 25
CC(C)OC(=O)N1CCC(Oc2ncnc3c2CCN3c2ccc([S@+](C)[O-])c(F)c2)CC1 26
CC(C)OC(=O)N1CCC(Oc2ncnc3c2CCN3c2ccc([S@@+](C)[O-])c(F)c2)CC1 27
CC(C)[C@@H](C)Nc1cc(-c2[nH]c([S@+](C)[O-])nc2-c2ccc(F)cc2)ccn1 28
CC(C)[C@H](C)Nc1cc(-c2[nH]c([S@+](C)[O-])nc2-c2ccc(F)cc2)ccn1 29
CC(C)[S@+]([O-])c1nc(-c2ccc(F)cc2)c(-c2ccnc(NC3CCCCC3)c2)[nH]1 30
CC1(C)C(=O)N([C@H]2C3CCCC2C[C@@H](C(N)=O)C3)CC1COc1ccc(C#N)cn1 31
CC1(C)C(=O)N([C@H]2C3CCCC2C[C@H](C(N)=O)C3)CC1COc1ccc(C#N)cn1 32
CC1(C)N=C(c2ccccc2)C(=O)N1[C@@H]1C(=O)N2[C@@H]1[S@+]([O-])C(C)(C)[C@@H]2C(=O)O 33
CC1(C)NC(c2ccccc2)C(=O)N1[C@@H]1C(=O)N2[C@@H]1[S@+]([O-])C(C)(C)[C@@H]2C(=O)O 34
CC1(C)NC(c2ccccc2)C(=O)N1[C@@H]1C(=O)N2[C@@H]1[S@@+]([O-])C(C)(C)[C@@H]2C(=O)O 35
CCCCCCCCCCCCCO[C@@H](CO)[C@H](O)C[S@@+]1C[C@@H](O)[C@H](O)[C@H]1CO 36
CCCCCCCCC[S@@+]([O-])CC(=O)OC 37
CCCCCCCCC[S@@+]([O-])CC(N)=O 38
CCCCCCCC[N@+]1(C)C2CCC1C[C@H](CC(C#N)(c1ccccc1)c1ccccc1)C2 39
CCCCCC[C@@H](C(=O)NO)[S@+]([O-])c1ccc(OC)cc1 40
CCCCCC[C@@H](C(=O)NO)[S@@+]([O-])c1ccc(OC)cc1 41
CCCCCC[N@+]1(C)C2CCC1C[C@H](CC(C#N)(c1ccccc1)c1ccccc1)C2 42
CCCCOCCOc1ccc(-c2ccc3c(c2)C=C(C(=O)Nc2ccc([S@@+]([O-])Cc4cncn4CCC)cc2)CCCCCN3CC(C)C)cc1 43
CCCCOCCOc1ccc(-c2ccc3c(c2)C=C(C(=O)Nc2ccc([S@@+]([O-])Cc4cncn4CCC)cc2)CCCCCN3CCC)cc1 44
CCCCOCCOc1ccc(-c2ccc3c(c2)C=C(C(=O)Nc2ccc([S@@+]([O-])Cc4cncn4CCC)cc2)CCCCN3CC(C)C)cc1 45
CCCCOCCOc1ccc(-c2ccc3c(c2)C=C(C(=O)Nc2ccc([S@@+]([O-])Cc4cncn4CCC)cc2)CCCCN3CCC)cc1 46
CCCCOCCOc1ccc(-c2ccc3c(c2)C=C(C(=O)Nc2ccc([S@@+]([O-])Cc4cncn4CCC)cc2)CCCN3CC(C)C)cc1 47
CCCCOCCOc1ccc(-c2ccc3c(c2)C=C(C(=O)Nc2ccc([S@@+]([O-])Cc4cncn4CCC)cc2)CCCN3CC(C)C)cc1.CS(=O)(=O)O 48
CCCCOCCOc1ccc(-c2ccc3c(c2)C=C(C(=O)Nc2ccc([S@@+]([O-])Cc4cncn4CCC)cc2)CCCN3CCC)cc1 49
CCCCOCCOc1ccc(-c2ccc3c(c2)C=C(C(=O)Nc2ccc([S@@+]([O-])Cc4cncn4CCC)cc2)CCCN3Cc2cnn(C)c2)cc1 50
CCCCOCCOc1ccc(-c2ccc3c(c2)C=C(C(=O)Nc2ccc([S@@+]([O-])Cc4nncn4CCC)cc2)CCCN3CC(C)C)cc1 51
CCCCOCCOc1ccc(-c2ccc3c(c2)C=C(C(=O)Nc2ccc([S@@+]([O-])Cc4nncn4CCC)cc2)CCCN3CCC)cc1 52
CCCC[N@+]1(C)C2CCC1C[C@H](CC(C#N)(c1ccccc1)c1ccccc1)C2 53
CCC[C@H]1CC[C@H]([C@H]2CC[C@@H](OC(=O)[C@H]3[C@H](c4ccc(O)cc4)[C@H](C(=O)O[C@H]4CC[C@@H]([C@H]5CC[C@H](CCC)CC5)CC4)[C@H]3c3ccc(O)cc3)CC2)CC1 54
CCCc1c(C#N)cc2ccccc2c1C(=O)N(C)C[C@@H](CCN1CCC(c2ccccc2[S@+](C)[O-])CC1)c1ccc(Cl)c(Cl)c1 55
CCN1C(=O)c2ccccc2[C@H]1[C@@H](CCN1CCC(c2ccccc2[S@+](C)[O-])CC1)c1ccc(Cl)c(Cl)c1 56
CCN1c2cc(C(=O)NCc3ccc(C#N)cc3)ccc2[S@@+]([O-])c2ccccc2C1=O 57
CCN1c2cc(C(=O)NCc3ccc(F)cc3)ccc2[S@@+]([O-])c2ccccc2C1=O 58
CCN1c2cc(C(=O)NCc3ccc(OC)cc3)ccc2[S@+]([O-])c2ccccc2C1=O 59
CCN1c2cc(C(=O)NCc3ccc(OC)cc3)ccc2[S@@+]([O-])c2ccccc2C1=O 60
CCN1c2cc(C(=O)N[C@H](C)c3ccc(Br)cc3)ccc2[S@@+]([O-])c2ccccc2C1=O 61
CCN1c2cc(C(=O)N[C@H](C)c3ccc4ccccc4c3)ccc2[S@@+]([O-])c2ccccc2C1=O 62
CCO[C@@H](CO)[C@H](O)C[S@@+]1C[C@@H](O)[C@H](O)[C@H]1CO 63
CC[C@H](NC(=O)c1c2ccccc2nc(-c2ccccc2)c1C[S@+](C)[O-])c1ccccc1 64
CC[C@H](NC(=O)c1c2ccccc2nc(-c2ccccc2)c1[S@+](C)[O-])c1ccccc1 65
CC[S@+]([O-])c1ccc(-c2coc3ccc(-c4ccc(C)o4)cc32)cc1 66
CCc1c(C#N)cc2ccccc2c1C(=O)N(C)C[C@@H](CCN1CCC(c2ccccc2[S@+](C)[O-])CC1)c1ccc(Cl)c(Cl)c1 67
CN(C[C@@H](CCN1CCC(c2ccc(Br)cc2[S@+](C)[O-])CC1)c1ccc(Cl)c(Cl)c1)C(=O)c1cc(C#N)cc2ccccc21 68
CN(C[C@@H](CCN1CCC(c2ccc(F)cc2[S@+](C)[O-])CC1)c1ccc(Cl)c(Cl)c1)C(=O)c1cc(C#N)cc2ccccc21 69
CN(C[C@@H](CCN1CCC(c2ccccc2[S@+](C)[O-])CC1)c1ccc(Cl)c(Cl)c1)C(=O)c1c2ccccc2cc(C#N)c1-c1ccccc1 70
CN(C[C@@H](CCN1CCC(c2ccccc2[S@+](C)[O-])CC1)c1ccc(Cl)c(Cl)c1)C(=O)c1cc(C#N)c(Br)c2ccccc12 71
CN(C[C@@H](CCN1CCC(c2ccccc2[S@+](C)[O-])CC1)c1ccc(Cl)c(Cl)c1)C(=O)c1cc(C#N)c(C#N)c2ccccc12 72
CN(C[C@@H](CCN1CCC(c2ccccc2[S@+](C)[O-])CC1)c1ccc(Cl)c(Cl)c1)C(=O)c1cc(C#N)c(F)c2ccccc12 73
CN(C[C@@H](CCN1CCC(c2ccccc2[S@+](C)[O-])CC1)c1ccc(Cl)c(Cl)c1)C(=O)c1cc(C#N)cc2cc(C#N)ccc21 74
CN(C[C@@H](CCN1CCC(c2ccccc2[S@+](C)[O-])CC1)c1ccc(Cl)c(Cl)c1)C(=O)c1cc(C#N)cc2cc(O)ccc21 75
CN(C[C@@H](CCN1CCC(c2ccccc2[S@+](C)[O-])CC1)c1ccc(Cl)c(Cl)c1)C(=O)c1cc(C#N)cc2ccccc21 76
CN1c2cc(C(=O)NCCN3CCCC3)ccc2[S@@+]([O-])c2ccccc2C1=O 77
CN1c2cc(C(=O)NCCc3cccs3)ccc2[S@+]([O-])c2ccccc2C1=O 78
CN1c2cc(C(=O)NCCc3cccs3)ccc2[S@@+]([O-])c2ccccc2C1=O 79
CN1c2cc(C(=O)NCc3ccc(Br)cc3)ccc2[S@@+]([O-])c2ccccc2C1=O 80
CN1c2cc(C(=O)NCc3ccc(Cl)cc3)ccc2[S@@+]([O-])c2ccccc2C1=O 81
CN1c2cc(C(=O)NCc3cccnc3)ccc2[S@@+]([O-])c2ccccc2C1=O 82
COC(=O)[C@H]1CC2COCC(C1)[C@H]2NC(=O)[C@@]1(C)CCCN1S(=O)(=O)c1cccc(Cl)c1C 83
COC(=O)c1ccc2[nH]c([S@+]([O-])Cc3nccc(OC)c3OC)nc2c1 84
COC(=O)c1ccc2[nH]c([S@@+]([O-])Cc3nccc(OC)c3OC)nc2c1 85
COC1(C[S@@+]([O-])c2ccccc2)CCN(CCc2c[nH]c3ccc(F)cc23)CC1 86
COCCCOc1ccnc(C[S@+]([O-])c2nc3ccccc3[nH]2)c1C 87
CO[C@@H](CO)[C@H](O)C[S@@+]1C[C@@H](O)[C@H](O)[C@H]1CO 88
CO[C@@H]([C@H](O)[C@@H](O)CO)[C@H](O)C[S@@+]1C[C@@H](O)[C@H](O)[C@H]1CO 89
COc1c(C#N)cc2ccccc2c1C(=O)N(C)C[C@@H](CCN1CCC(c2ccc(Br)cc2[S@+](C)[O-])CC1)c1ccc(Cl)c(Cl)c1 90
COc1c(C#N)cc2ccccc2c1C(=O)N(C)C[C@@H](CCN1CCC(c2ccccc2[S@+](C)[O-])CC1)c1ccc(Cl)c(Cl)c1 91
COc1cc(-C=C-OC(=O)[C@H]2CC[C@@H](N(C)[C@H]3CC[C@H](C(=O)O-C=C-c4cc(OC)c(OC)c(OC)c4)CC3)CC2)cc(OC)c1OC 92
COc1cc(C(=O)N2CCO[C@@](CCN3CCC4(CC3)c3ccccc3C[S@+]4[O-])(c3ccc(Cl)c(Cl)c3)C2)cc(OC)c1OC 93
COc1cc(C(=O)N2CCO[C@@](CCN3CCC4(CC3)c3ccccc3C[S@@+]4[O-])(c3ccc(Cl)c(Cl)c3)C2)cc(OC)c1OC 94
COc1cc(C(=O)N2CCO[C@](CCN3CCC4(CC3)c3ccccc3C[S@+]4[O-])(c3ccc(Cl)c(Cl)c3)C2)cc(OC)c1OC 95
COc1cc(C(=O)N2CCO[C@](CCN3CCC4(CC3)c3ccccc3C[S@@+]4[O-])(c3ccc(Cl)c(Cl)c3)C2)cc(OC)c1OC 96
COc1cc(OC(=O)[C@@H]2CC[C@H](N(C)[C@@H]3CC[C@@H](C(=O)Oc4cc(OC)c(OC)c(OC)c4)CC3)CC2)cc(OC)c1OC 97
COc1ccc(C2CCN(CC[C@H](CN(C)C(=O)c3c4ccccc4cc(C#N)c3OC)c3ccc(Cl)c(Cl)c3)CC2)c([S@+](C)[O-])c1 98
COc1ccc(C2CCN(CC[C@H](CN(C)C(=O)c3cc(C#N)cc4ccccc43)c3ccc(Cl)c(Cl)c3)CC2)c([S@+](C)[O-])c1 99
COc1ccc(N(C(=O)OC(C)(C)C)[C@@H]2C(C)=C[C@H]([S@@+]([O-])c3ccc(C)cc3)[C@@H]3C(=O)N(C)C(=O)[C@@H]32)cc1 100
COc1ccc([C@@H]2C3(CO)C4[N@](C)C5C6(CO)C([N@@](C)C3C4(CO)[C@H]6c3ccc(OC)cc3)C52CO)cc1 101
COc1ccc([S@+]([O-])c2ccc(C(=O)C3CCN(C4CCCCC4)CC3)cc2)cc1 102
COc1ccc([S@+]([O-])c2ccc(C(C#N)C3CCN(C4CCCCC4)CC3)cc2)cc1 103
COc1ccc([S@+]([O-])c2ccc(C(C#N)N3CCN(C4CCCCC4)CC3)cc2)cc1 104
COc1ccc([S@@+]([O-])c2ccc(C(=O)C3CCN(C4CCCCC4)CC3)cc2)cc1 105
COc1ccc([S@@+]([O-])c2ccc(C(C#N)C3CCN(C4CCCCC4)CC3)cc2)cc1 106
COc1ccc([S@@+]([O-])c2ccc(C(C#N)N3CCN(C4CCCCC4)CC3)cc2)cc1 107
COc1ccc2c(cc(C#N)cc2C(=O)N(C)C[C@@H](CCN2CCC(c3ccccc3[S@+](C)[O-])CC2)c2ccc(Cl)c(Cl)c2)c1 108
COc1ccc2cc(-c3nc(-c4ccc([S@+](C)[O-])cc4C)[nH]c3-c3ccncc3)ccc2c1 109
COc1ccc2cc(-c3nc(-c4ccc([S@@+](C)[O-])cc4C)[nH]c3-c3ccncc3)ccc2c1 110
COc1ccc2nc([S@@+]([O-])Cc3ncc(C)c(OC)c3C)[nH]c2c1 111
COc1ccnc(C[S@@+]([O-])c2nc3ccc(OC(F)F)cc3[nH]2)c1OC 112
CSC[S@+]([O-])C[C@H](CO)NC(=O)-C=C-c1c(C)[nH]c(=O)[nH]c1=O 113
CSC[S@@+]([O-])CC(CO)NC(=O)-C=C-c1c(C)nc(O)nc1O 114
C[C@@H](Nc1cc(-c2[nH]c([S@+](C)[O-])nc2-c2ccc(F)cc2)ccn1)c1ccccc1 115
C[C@@H]1CC[C@H]2[C@@H](C)[C@@H](CCC(=O)Nc3cccc([S@+](C)[O-])c3)O[C@@H]3O[C@@]4(C)CC[C@@H]1[C@]32OO4 116
C[C@@H]1O[C@H]([C@@H]2CCC[N+]2(C)C)C[S@+]1[O-] 117
C[C@@H]1O[C@H]([C@@H]2CCC[N+]2(C)C)C[S@@+]1[O-] 118
C[C@@H]1O[C@H]([C@H]2CCC[N+]2(C)C)C[S@+]1[O-] 119
C[C@@H]1O[C@H]([C@H]2CCC[N+]2(C)C)C[S@@+]1[O-] 120
C[C@H](Nc1cc(-c2[nH]c([S@+](C)[O-])nc2-c2ccc(F)cc2)ccn1)c1ccccc1 121
C[C@H]1O[C@@H]([C@@H]2CCCN2C)C[S@+]1[O-] 122
C[C@H]1O[C@@H]([C@@H]2CCCN2C)C[S@@+]1[O-] 123
C[C@H]1O[C@@H]([C@@H]2CCC[N+]2(C)C)C[S@+]1[O-] 124
C[C@H]1O[C@@H]([C@@H]2CCC[N+]2(C)C)C[S@@+]1[O-] 125
C[N+](C)(C)C[C@@H]1C[S@+]([O-])C(C2CCCCC2)(C2CCCCC2)O1 126
C[N+](C)(C)C[C@@H]1C[S@@+]([O-])C(C2CCCCC2)(C2CCCCC2)O1 127
C[N@+]1([O-])[C@H]2CC[C@H]1C[C@H](OC(=O)[C@@H](CO)c1ccccc1)C2 128
C[N@@+]1(CC2CC2)C2CCC1C[C@@H](CC(C#N)(c1ccccc1)c1ccccc1)C2 129
C[N@@+]1(CC2CC2)C2CCC1C[C@H](CC(C#N)(c1ccccc1)c1ccccc1)C2 130
C[N@@+]1(CCCOCc2ccccc2)C2CCC1C[C@H](CC(C#N)(c1ccccc1)c1ccccc1)C2 131
C[N@@+]1(CCCOc2ccccc2)C2CCC1C[C@H](CC(C#N)(c1ccccc1)c1ccccc1)C2 132
C[N@@+]1(CCCc2ccccc2)C2CCC1C[C@H](CC(C#N)(c1ccccc1)c1ccccc1)C2 133
C[N@@+]1(CCOCc2ccccc2)C2CCC1C[C@H](CC(C#N)(c1ccccc1)c1ccccc1)C2 134
C[N@@+]1(CCOc2ccccc2)C2CCC1C[C@H](CC(C#N)(c1ccccc1)c1ccccc1)C2 135
C[S@+]([O-])CCCCN=C=S 136
C[S@+]([O-])CC[C@@](CO)(C(=O)O[C@H]1CN2CCC1CC2)c1ccccc1 137
C[S@+]([O-])C[C@@H]1[C@@H](O)[C@]23CC[C@H]1C[C@H]2[C@]1(C)CCC[C@](C)(C(=O)O)[C@H]1CC3 138
C[S@+]([O-])C[C@](C)(O)[C@H]1OC(=O)C=C2[C@@]13O[C@@H]3[C@H]1OC(=O)[C@@]3(C)C=CC[C@@]2(C)[C@@H]13 139
C[S@+]([O-])c1ccc(C2=C(c3ccccc3)C(=O)OC2)cc1 140
C[S@+]([O-])c1cccc2c3c(n(Cc4ccc(Cl)cc4)c21)[C@@H](CC(=O)O)CCC3 141
C[S@+]([O-])c1nc(-c2ccc(F)cc2)c(-c2ccnc(NC(=O)Cc3ccc(F)cc3)c2)[nH]1 142
C[S@+]([O-])c1nc(-c2ccc(F)cc2)c(-c2ccnc(NC3CCCCC3)c2)[nH]1 143
C[S@+]([O-])c1nc(-c2ccc(F)cc2)c(-c2ccnc(NC3CCOCC3)c2)[nH]1 144
C[S@@+]([O-])CCCC-C(=N-OS(=O)(=O)[O-])S[C@@H]1O[C@H](CO)[C@@H](O)[C@H](O)[C@H]1O 145
C[S@@+]([O-])CCCCN=C=S 146
C[S@@+]([O-])C[C@@H]1[C@@H](O)[C@]23CC[C@H]1C[C@H]2[C@]1(C)CCC[C@](C)(C(=O)O)[C@H]1CC3 147
C[S@@+]([O-])Cc1ccc(C(=O)Nc2cccnc2C(=O)NCC2CCOCC2)c2ccccc12 148
Cc1c(C#N)cc(C(=O)N(C)C[C@@H](CCN2CCC(c3ccccc3[S@+](C)[O-])CC2)c2ccc(Cl)c(Cl)c2)c2ccccc12 149
Cc1c(C#N)cc2ccccc2c1C(=O)N(C)C[C@@H](CCN1CCC(c2ccccc2[S@+](C)[O-])CC1)c1ccc(Cl)c(Cl)c1 150
Cc1c(OCC(F)(F)F)ccnc1C[S@@+]([O-])c1nc2ccccc2[nH]1 151
Cc1cc(=O)c(Oc2ccc(Br)cc2F)c(-c2ccc([S@@+](C)[O-])cc2)o1 152
Cc1cc(=O)c(Oc2ccc(Cl)cc2F)c(-c2ccc([S@@+](C)[O-])cc2)o1 153
Cc1cc(=O)c(Oc2ccc(F)cc2F)c(-c2ccc([S@+](C)[O-])cc2)o1 154
Cc1ccc(-c2ccc3occ(-c4ccc([S@+](C)[O-])cc4)c3c2)o1 155
Cc1ccc(-c2ncc(Cl)cc2-c2ccc([S@+](C)[O-])cc2)cn1 156
Cc1ccc([S@+]([O-])-C(F)=C-c2ccccn2)cc1 157
Cc1ccc([S@+]([O-])c2occc2C=O)cc1 158
Cc1nc(O)nc(O)c1-C=C-C(=O)N[C@@H](CO)C[S@+]([O-])CCl 159
Cc1nc(O)nc(O)c1-C=C-C(=O)N[C@@H](CO)C[S@@+]([O-])CCl 160
NC(=O)C[S@+]([O-])C(c1ccccc1)c1ccccc1 161
NC(=O)C[S@@+]([O-])C(c1ccccc1)c1ccccc1 162
O.O.O.O.[Sr+2].COc1ccc2[n-]c([S@@+]([O-])Cc3ncc(C)c(OC)c3C)nc2c1.COc1ccc2[n-]c([S@@+]([O-])Cc3ncc(C)c(OC)c3C)nc2c1 163
O=C(O)CC-C=C-CC[C@H]1[C@H](OCc2ccc(-c3ccccc3)cc2)C[S@+]([O-])[C@@H]1c1cccnc1 164
O=C(O)CC-C=C-CC[C@H]1[C@H](OCc2ccc(-c3ccccc3)cc2)C[S@@+]([O-])[C@@H]1c1cccnc1 165
O=S(=O)([O-])OCCC[S@@+]1C[C@@H](O)[C@H](O)[C@H]1CO 166
O=S(=O)([O-])OC[C@@H](O)C[S@@+]1C[C@@H](O)[C@H](O)[C@H]1CO 167
O=S(=O)([O-])OC[C@H](O)C[S@@+]1C[C@@H](O)[C@H](O)[C@H]1CO 168
O=S(=O)([O-])O[C@@H](CO)CC[S@@+]1C[C@@H](O)[C@H](O)[C@H]1CO 169
O=S(=O)([O-])O[C@@H](CO)[C@H](O)C[S@@+]1C[C@@H](O)[C@H](O)[C@H]1CO 170
O=S(=O)([O-])O[C@@H]([C@H](O)C[S@@+]1C[C@@H](O)[C@H](O)[C@H]1CO)[C@@H](O)[C@@H](O)[C@@H](O)CO 171
O=S(=O)([O-])O[C@@H]([C@H](O)C[S@@+]1C[C@@H](O)[C@H](O)[C@H]1CO)[C@@H](O)[C@@H](O)[C@H](O)CO 172
O=S(=O)([O-])O[C@@H]([C@H](O)C[S@@+]1C[C@@H](O)[C@H](O)[C@H]1CO)[C@@H](O)[C@H](O)[C@@H](O)CO 173
O=S(=O)([O-])O[C@@H]([C@H](O)C[S@@+]1C[C@@H](O)[C@H](O)[C@H]1CO)[C@@H](O)[C@H](O)[C@H](O)CO 174
O=S(=O)([O-])O[C@@H]([C@H](O)[C@@H](O)CO)[C@H](O)C[S@@+]1C[C@@H](O)[C@H](O)[C@H]1CO 175
O=S(=O)([O-])O[C@H](CO)[C@@H](O)C[S@@+]1C[C@@H](O)[C@H](O)[C@H]1CO 176
O=S(=O)([O-])O[C@H]([C@H](O)C[S@@+]1C[C@@H](O)[C@H](O)[C@H]1CO)[C@@H](O)[C@@H](O)[C@@H](O)CO 177
O=S(=O)([O-])O[C@H]([C@H](O)C[S@@+]1C[C@@H](O)[C@H](O)[C@H]1CO)[C@@H](O)[C@@H](O)[C@H](O)CO 178
O=S(=O)([O-])O[C@H]([C@H](O)C[S@@+]1C[C@@H](O)[C@H](O)[C@H]1CO)[C@@H](O)[C@H](O)[C@@H](O)CO 179
O=S(=O)([O-])O[C@H]([C@H](O)C[S@@+]1C[C@@H](O)[C@H](O)[C@H]1CO)[C@@H](O)[C@H](O)[C@H](O)CO 180
OCC12C3[N@](Cc4ccc(O)cc4)C4C5(CO)C([N@@](Cc6ccc(O)cc6)C1C3(CO)[C@H]5c1ccccc1)C4(CO)[C@H]2c1ccccc1 181
OCC12C3[N@](Cc4ccc(OCc5ccccc5)cc4)C4C5(CO)C([N@@](Cc6ccc(OCc7ccccc7)cc6)C1C3(CO)[C@@H]5c1ccccc1)C4(CO)[C@@H]2c1ccccc1 182
OCC12C3[N@](Cc4cccc(OCc5ccccc5)c4)C4C5(CO)C([N@@](Cc6cccc(OCc7ccccc7)c6)C1C3(CO)[C@H]5c1ccccc1)C4(CO)[C@H]2c1ccccc1 183
OCC12C3[N@](Cc4ccccc4)C4C5(CO)C([N@@](Cc6ccccc6)C1C3(CO)[C@H]5c1ccc(O)cc1)C4(CO)[C@H]2c1ccc(O)cc1 184
OCC12C3[N@](Cc4ccccc4)C4C5(CO)C([N@@](Cc6ccccc6)C1C3(CO)[C@H]5c1ccc(OCc3ccccc3)cc1)C4(CO)[C@H]2c1ccc(OCc2ccccc2)cc1 185
OCC12C3[N@](Cc4ccccc4)C4C5(CO)C([N@@](Cc6ccccc6)C1C3(CO)[C@H]5c1cccc(O)c1)C4(CO)[C@H]2c1cccc(O)c1 186
OCC12C3[N@](Cc4ccccc4)C4C5(CO)C([N@@](Cc6ccccc6)C1C3(CO)[C@H]5c1ccccc1)C4(CO)[C@H]2c1ccccc1 187
OCC12C3[N@](Cc4cccnc4)C4C5(CO)C([N@@](Cc6cccnc6)C1C3(CO)[C@H]5c1ccccc1)C4(CO)[C@H]2c1ccccc1 188
OCC12C3[N@](Cc4ccncc4)C4C5(CO)C([N@@](Cc6ccncc6)C1C3(CO)[C@H]5c1ccccc1)C4(CO)[C@H]2c1ccccc1 189
OC[C@@H](O)C[S@@+]1C[C@@H](O)[C@H](O)[C@H]1CO 190
OC[C@@H](O)[C@@H](O)[C@H](O)[C@H](O)C[S@@+]1C[C@@H](O)[C@H](O)[C@H]1CO 191
OC[C@H](O)C[S@@+]1C[C@@H](O)[C@H](O)[C@H]1CO 192
OC[C@H](O)[C@@H](O)[C@@H](O)[C@H](O)C[S@@+]1C[C@@H](O)[C@H](O)[C@H]1CO 193
OC[C@H](O)[C@@H](O)[C@@H](O)[C@H](O)[C@H](O)C[S@@+]1C[C@@H](O)[C@H](O)[C@H]1CO 194
OC[C@H](O)[C@@H](O)[C@H](O)[C@H](O)C[S@@+]1C[C@@H](O)[C@H](O)[C@H]1CO 195
OC[C@H](OCc1ccccc1)[C@H](O)C[S@@+]1C[C@@H](O)[C@H](O)[C@H]1CO.F[B-](F)(F)F 196
[Br-].C=CCCCC[N@+]1(C)C2CCC1C[C@H](CC(C#N)(c1ccccc1)c1ccccc1)C2 197
[Br-].CCCCCCCC[N@+]1(C)C2CCC1C[C@H](CC(C#N)(c1ccccc1)c1ccccc1)C2 198
[Br-].CCCCCC[N@+]1(C)C2CCC1C[C@H](CC(C#N)(c1ccccc1)c1ccccc1)C2 199
[Br-].CCCC[N@+]1(C)C2CCC1C[C@H](CC(C#N)(c1ccccc1)c1ccccc1)C2 200
[Br-].C[N@@+]1(CC2CC2)C2CCC1C[C@H](CC(C#N)(c1ccccc1)c1ccccc1)C2 201
[Br-].C[N@@+]1(CCCOCc2ccccc2)C2CCC1C[C@H](CC(C#N)(c1ccccc1)c1ccccc1)C2 202
[Br-].C[N@@+]1(CCCOc2ccccc2)C2CCC1C[C@H](CC(C#N)(c1ccccc1)c1ccccc1)C2 203
[Br-].C[N@@+]1(CCCc2ccccc2)C2CCC1C[C@H](CC(C#N)(c1ccccc1)c1ccccc1)C2 204
[Br-].C[N@@+]1(CCOc2ccccc2)C2CCC1C[C@H](CC(C#N)(c1ccccc1)c1ccccc1)C2 205
[Cl-].CCCCCCCCCCCCCO[C@@H](CO)[C@H](O)C[S@@+]1C[C@@H](O)[C@H](O)[C@H]1CO 206
[Cl-].CCO[C@@H](CO)[C@H](O)C[S@@+]1C[C@@H](O)[C@H](O)[C@H]1CO 207
[Cl-].CO[C@@H](CO)[C@H](O)C[S@@+]1C[C@@H](O)[C@H](O)[C@H]1CO 208
[Cl-].CO[C@@H]([C@H](O)[C@@H](O)CO)[C@H](O)C[S@@+]1C[C@@H](O)[C@H](O)[C@H]1CO 209
[Cl-].OC[C@@H](O)C[S@@+]1C[C@@H](O)[C@H](O)[C@H]1CO 210
[Cl-].OC[C@@H](O)[C@@H](O)[C@H](O)[C@H](O)C[S@@+]1C[C@@H](O)[C@H](O)[C@H]1CO 211
[Cl-].OC[C@H](O)C[S@@+]1C[C@@H](O)[C@H](O)[C@H]1CO 212
[Cl-].OC[C@H](O)[C@@H](O)[C@@H](O)[C@H](O)C[S@@+]1C[C@@H](O)[C@H](O)[C@H]1CO 213
[Cl-].OC[C@H](O)[C@@H](O)[C@H](O)[C@H](O)C[S@@+]1C[C@@H](O)[C@H](O)[C@H]1CO 214
[I-].C[C@@H]1O[C@H]([C@@H]2CCC[N+]2(C)C)C[S@+]1[O-] 215
[I-].C[C@@H]1O[C@H]([C@@H]2CCC[N+]2(C)C)C[S@@+]1[O-] 216
[I-].C[C@@H]1O[C@H]([C@H]2CCC[N+]2(C)C)C[S@+]1[O-] 217
[I-].C[C@@H]1O[C@H]([C@H]2CCC[N+]2(C)C)C[S@@+]1[O-] 218
[I-].C[C@H]1O[C@@H]([C@@H]2CCC[N+]2(C)C)C[S@+]1[O-] 219
[I-].C[C@H]1O[C@@H]([C@@H]2CCC[N+]2(C)C)C[S@@+]1[O-] 220
[I-].C[N+](C)(C)C[C@@H]1C[S@+]([O-])C(C2CCCCC2)(C2CCCCC2)O1 221
[I-].C[N+](C)(C)C[C@@H]1C[S@@+]([O-])C(C2CCCCC2)(C2CCCCC2)O1 222
[I-].C[N@@+]1(CCOCc2ccccc2)C2CCC1C[C@H](CC(C#N)(c1ccccc1)c1ccccc1)C2 223
[K+].COc1ccc2[n-]c([S@@+]([O-])Cc3ncc(C)c(OC)c3C)nc2c1 224
[Mg+2].COc1ccc2[n-]c([S@@+]([O-])Cc3ncc(C)c(OC)c3C)nc2c1.COc1ccc2[n-]c([S@@+]([O-])Cc3ncc(C)c(OC)c3C)nc2c1 225
[Na+].COc1ccc2[n-]c([S@@+]([O-])Cc3ncc(C)c(OC)c3C)nc2c1 226
[O-]CC[C@H](O)C[S@@+]1C[C@@H](O)[C@H](O)[C@H]1CO 227
[O-]C[C@@H](O)[C@@H](O)C[S@@+]1C[C@@H](O)[C@H](O)[C@H]1CO 228
[O-][C@@H](CO)[C@H](O)C[S@@+]1C[C@@H](O)[C@H](O)[C@H]1CO 229
[O-][S@+](CC1(O)CCN(CCc2c[nH]c3ccc(F)cc23)CC1)c1ccccc1 230
[O-][S@+](Cc1cc(OCC2CC2)ccn1)c1nc2cc(F)ccc2[nH]1 231
[O-][S@@+](Cc1cc(OCC2CC2)ccn1)c1nc2cc(F)ccc2[nH]1 232

I've been trying to make sense of the 232 failures. Some observations: 181 of them involve a chiral sulfur, 34 a chiral nitrogen, 14 a chiral carbon at a bridgehead, and the remaining 3 other chiral carbons.
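A crude way to get counts like those is to classify each failing SMILES by the chirality markers it contains. Here's a minimal sketch; the failing_smiles name is mine and would hold the strings listed above, and the substring tests are only a rough screen, not real chemical perception:

def tally_failures(failing_smiles):
    # Rough classification: chiral sulfur first, then chiral nitrogen,
    # then everything else (some flavor of chiral carbon).
    counts = {"sulfur": 0, "nitrogen": 0, "carbon": 0}
    for smiles in failing_smiles:
        if "[S@" in smiles:
            counts["sulfur"] += 1
        elif "[N@" in smiles:
            counts["nitrogen"] += 1
        else:
            counts["carbon"] += 1
    return counts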

RDKit bug reports

The results of my investigations led to three RDKit bug reports:

In the first, Greg identified that FastFindRings() isn't putting the two chiral atoms into the same primitive ring, so AssignStereochemistry() isn't seeing that this is an instance of ring stereochemistry.

In the second, Greg points to the August 2015 thread titled "Stereochemistry - Differences between RDKit Indigo" on the RDKit mailing list. Greg comments about nitrogen chirality:

There are two things going on here in the RDKit: 1) Ring stereochemistry 2) stereochemistry about nitrogen centers. Let's start with the second, because it's easier: RDKit does not generally "believe in" stereochemistry around three coordinate nitrogens. ... Back to the first: ring stereochemistry. ... The way the RDKit handles this is something of a hack: it doesn't identify those atoms as chiral centers, but it does preserve the chiral tags when generating a canonical SMILES:

Need SanitizeMol(), not ClearComputedProps()

He proposes that as a workaround I sanitize the newly created molecule, so I replaced the call to ClearComputedProps() with one to "SanitizeMol()", near the end of fragment_chiral(), as shown here:

    # After breaking bonds, should re-sanitize or at least call
    # ClearComputedProps().
    # See https://github.com/rdkit/rdkit/issues/1022#issuecomment-239355482
    new_mol = rwmol.GetMol()
    #new_mol.ClearComputedProps()
    Chem.SanitizeMol(new_mol)
    return new_mol
With that in place, where there were 232 records which failed my test, now there are 195. All 181 of the chiral sulfurs still fail, 11 of the 34 chiral nitrogens still fail, the chiral carbon bridgeheads all pass, while the 3 remaining chiral carbons still fail.

(I also tested with both ClearComputedProps() and SanitizeMol(), but using both made no difference.)

While better, it's not substantially better. What's going on?

RDKit can produce non-canonical SMILES

At this point we're pushing the edge of what RDKit can handle. A few paragraphs ago I quoted Greg as saying that ring chirality is "something of a hack". I think that's the reason why, of the 232 records that cause a problem, 67 don't produce a stable SMILES string. That is, if I parse what should be a canonical SMILES string and re-canonicalize it, I get a different result. The canonicalization is bi-stable: re-canonicalization flips between two possibilities, with a different chirality assignment each time.

Here's a reproducible example if you want to try it out yourself. (By the time you read this it's likely been fixed!)

from rdkit import Chem

def Canon(smiles):
    mol = Chem.MolFromSmiles(smiles)
    return Chem.MolToSmiles(mol, isomericSmiles=True)

def check_if_canonical(smiles):
    s1 = Canon(smiles)
    s2 = Canon(s1)
    if s1 != s2:
        print("Failed to canonicalize", smiles)
        print(" input:", smiles)
        print("canon1:", s1)
        print("canon2:", s2)
        print("canon3:", Canon(s2)) 
    else:
        print("Passed", smiles)
    
for smiles in (
    "O[C@H]1CC2CCC(C1)[N@@]2C",
    "C[C@]1(c2cccnc2)CCCC[S@+]1O",
    "[C@]1C[S@+]1O"):
    check_if_canonical(smiles)
The output from this is:
Failed to canonicalize O[C@H]1CC2CCC(C1)[N@@]2C
 input: O[C@H]1CC2CCC(C1)[N@@]2C
canon1: C[N@]1C2CCC1C[C@@H](O)C2
canon2: C[N@]1C2CCC1C[C@H](O)C2
canon3: C[N@]1C2CCC1C[C@@H](O)C2
Failed to canonicalize C[C@]1(c2cccnc2)CCCC[S@+]1O
 input: C[C@]1(c2cccnc2)CCCC[S@+]1O
canon1: C[C@]1(c2cccnc2)CCCC[S@@+]1O
canon2: C[C@]1(c2cccnc2)CCCC[S@+]1O
canon3: C[C@]1(c2cccnc2)CCCC[S@@+]1O
Failed to canonicalize [C@]1C[S@+]1O
 input: [C@]1C[S@+]1O
canon1: O[S@@+]1[C]C1
canon2: O[S@+]1[C]C1
canon3: O[S@@+]1[C]C1

SanitizeMol adds some overhead

SanitizeMol() is a workaround for a problem which is under investigation. I'll use SanitizeMol() for the rest of this essay, but bear in mind that it adds overhead. In one timing test I did, the version with ClearComputedProps() took 107 seconds while the version with SanitizeMol() took 255 seconds.

You'll have to decide for yourself if the small number of additional correct structures is worth the extra time. And, hopefully it's been fixed by the time you read this essay so you don't need a workaround.

Bridgeheads

Many of the failures were due to chiral bridgehead atoms. I used the following two SMARTS to detect bridgeheads:

depiction of two bridgehead topologies
*~1~*~*(~*~*~2)~*~*~2~*~1
*~1~*~*(~*~*~*~2)~*~*~2~*~1
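Here's a sketch of how those patterns might be applied, using one of the failing nitrogen bridgehead structures from the list above:

from rdkit import Chem

# The two bridgehead SMARTS patterns shown above.
bridgehead_patterns = [
    Chem.MolFromSmarts("*~1~*~*(~*~*~2)~*~*~2~*~1"),
    Chem.MolFromSmarts("*~1~*~*(~*~*~*~2)~*~*~2~*~1"),
]

def has_bridgehead(smiles):
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return False
    return any(mol.HasSubstructMatch(pattern) for pattern in bridgehead_patterns)

# A tropane-like nitrogen bridgehead from the failure list; it should match.
print(has_bridgehead("CC(C(=O)O[C@@H]1CC2CCC(C1)[N@]2C)(c1ccccc1)c1ccccc1"))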
Before I added the SanitizeMol() call, there were 34 chiral nitrogen structures which failed. Of those 34, only 11 are still failures after adding the SanitizeMol(). Of those 11, one is a normal-looking bridgehead:
a nitrogen bridgehead that has a bistable SMILES

CC(C(=O)O[C@@H]1CC2CCC(C1)[N@]2C)(c1ccccc1)c1ccccc1
It's the only one of the simple nitrogen bridgehead structures which doesn't have a stable canonicalization. (I used the core bridgehead from this structure as the first test case in the previous section, where I showed a few bi-stable SMILES strings.)

The other 10 of the 11 nitrogen bridgehead failures have a more complex ring system, like:

a complex chiral nitrogen ring system
OCC12C3[N@](Cc4cccnc4)C4C5(CO)C([N@@](Cc6cccnc6)C1C3(CO)[C@H]5c1ccccc1)C4(CO)[C@H]2c1ccccc1
All of these have a bi-stable canonicalization.

I also looked at the chiral carbon bridgeheads which failed. Of the original 14, all 14 of them pass after I added the SanitizeMol() call.

The remaining structures

There are three chiral structures which fail even after sanitization, which do not contain a chiral nitrogen or chiral sulfur, and which do not contain a bridgehead. These are:

  
CCC[C@H]1CC[C@H]([C@H]2CC[C@@H](OC(=O)[C@H]3[C@H](c4ccc(O)cc4)[C@H](C(=O)O[C@H]4CC[C@@H]([C@H]5CC[C@H](CCC)CC5)CC4)[C@H]3c3ccc(O)cc3)CC2)CC1
COc1cc(-C=C-OC(=O)[C@H]2CC[C@@H](N(C)[C@H]3CC[C@H](C(=O)O-C=C-c4cc(OC)c(OC)c(OC)c4)CC3)CC2)cc(OC)c1OC
COc1cc(OC(=O)[C@@H]2CC[C@H](N(C)[C@@H]3CC[C@@H](C(=O)Oc4cc(OC)c(OC)c(OC)c4)CC3)CC2)cc(OC)c1OC
Upon investigation, all three seem to involve the ring chirality handling that Greg called a "hack". I did not investigate further.

The final code

That was a lot of text. And a lot of work. If you made it this far, congratulations. Oddly, I still have more to write about on the topic.

I'll leave you with the final version of the code, with various tweaks and comments that I didn't discuss in the essay. As a bonus, it includes a commented-out implementation of fragment_chiral() which uses RDKit's FragmentOnBonds() function; that is the function you should be using to fragment bonds.

# Cut an RDKit molecule on a specified bond, and replace the old terminals with wildcard atoms ("*").
# The code includes a test suite which depends on an external SMILES file.
#
# This code is meant as a study of the low-level operations. For production use,
# see the commented out function which uses RDKit's built-in FragmentOnBonds().
#
# Written by Andrew Dalke <dalke@dalkescientific.com>.

from __future__ import print_function

from rdkit import Chem

# You can get a copy of this library from:
# http://www.dalkescientific.com/writings/diary/archive/2016/08/09/fragment_achiral_molecules.html#smiles_syntax.py
from smiles_syntax import convert_wildcards_to_closures


CHI_TETRAHEDRAL_CW = Chem.ChiralType.CHI_TETRAHEDRAL_CW
CHI_TETRAHEDRAL_CCW = Chem.ChiralType.CHI_TETRAHEDRAL_CCW

def parity_shell(values):
    # Simple O(N^2) sort which counts the swaps; we only deal with at most 4 values
    values = list(values)
    N = len(values)
    num_swaps = 0
    for i in range(N-1):
        for j in range(i+1, N):
            if values[i] > values[j]:
                values[i], values[j] = values[j], values[i]
                num_swaps += 1
    return num_swaps % 2


def get_bond_parity(mol, atom_id):
    """Compute the parity of the atom's bond permutation

    Return None if it does not have tetrahedral chirality,
    0 for even parity, or 1 for odd parity.
    """
    atom_obj = mol.GetAtomWithIdx(atom_id)
    
    # Return None unless it has tetrahedral chirality
    chiral_tag = atom_obj.GetChiralTag()
    if chiral_tag not in (CHI_TETRAHEDRAL_CW, CHI_TETRAHEDRAL_CCW):
        return None
    
    # Get the list of atom ids for each atom it's bonded to.
    other_atom_ids = [bond.GetOtherAtomIdx(atom_id) for bond in atom_obj.GetBonds()]
    
    # Use those ids to determine the parity
    return parity_shell(other_atom_ids)


def set_bond_parity(mol, atom_id, old_parity, old_other_atom_id, new_other_atom_id):
    """Compute the new bond parity and flip chirality if needed to match the old parity"""
    
    atom_obj = mol.GetAtomWithIdx(atom_id)
    # Get the list of atom ids for each atom it's bonded to.
    other_atom_ids = [bond.GetOtherAtomIdx(atom_id) for bond in atom_obj.GetBonds()]
    
    # Replace id from the new wildcard atom with the id of the original atom
    i = other_atom_ids.index(new_other_atom_id)
    other_atom_ids[i] = old_other_atom_id
    
    # Use those ids to determine the parity
    new_parity = parity_shell(other_atom_ids)
    if old_parity != new_parity:
        # If the parity has changed, invert the chirality
        atom_obj.InvertChirality()

# You should really use the commented-out function below, which uses
# RDKit's own fragmentation code. Both do the same thing.

def fragment_chiral(mol, atom1, atom2):
    """Cut the bond between atom1 and atom2 and replace with connections to wildcard atoms

    Return the fragmented structure as a new molecule.
    """
    rwmol = Chem.RWMol(mol)
    
    atom1_parity = get_bond_parity(mol, atom1)
    atom2_parity = get_bond_parity(mol, atom2)
    
    rwmol.RemoveBond(atom1, atom2)
    wildcard1 = rwmol.AddAtom(Chem.Atom(0))
    wildcard2 = rwmol.AddAtom(Chem.Atom(0))
    new_bond1 = rwmol.AddBond(atom1, wildcard1, Chem.BondType.SINGLE)
    new_bond2 = rwmol.AddBond(atom2, wildcard2, Chem.BondType.SINGLE)
    
    if atom1_parity is not None:
        set_bond_parity(rwmol, atom1, atom1_parity, atom2, wildcard1)
    if atom2_parity is not None:
        set_bond_parity(rwmol, atom2, atom2_parity, atom1, wildcard2)
    
    # After breaking bonds you should re-sanitize. See
    # https://github.com/rdkit/rdkit/issues/1022#issuecomment-239355482
    # or at least call ClearComputedProps(). I found that SanitizeMol()
    # improves the handling of chiral carbon bridgeheads, though using it
    # roughly doubles the execution time. I'll use SanitizeMol() here;
    # note that FragmentOnBonds() itself only calls ClearComputedProps().
    new_mol = rwmol.GetMol()
    # If you don't sanitize then you must at least call ClearComputedProps()
    #new_mol.ClearComputedProps()
    Chem.SanitizeMol(new_mol)
    return new_mol

#### Use this code for production
## def fragment_chiral(mol, atom1, atom2):
##     bond = mol.GetBondBetweenAtoms(atom1, atom2)
##     new_mol = Chem.FragmentOnBonds(mol, [bond.GetIdx()], dummyLabels=[(0, 0)])
##     # FragmentOnBonds() calls ClearComputedProps() at the end.  There
##     # is a current bug report where, as a downstream effect, that may
##     # cause some chiralities to change, most notably on some
##     # bridgeheads. A workaround for now is to call SanitizeMol(),
##     # though that ends up more than doubling the time. I'll stay compatible
##     # with FragmentOnBonds() and not call it.
##     #Chem.SanitizeMol(new_mol)
##     return new_mol


##### ##### ##### ##### Test code ##### ##### ##### ##### #####

# Create a canonical isomeric SMILES from a SMILES string
# Used to put the manually-developed reference structures into canonical form.
def Canon(smiles):
    mol = Chem.MolFromSmiles(smiles)
    assert mol is not None, smiles
    return Chem.MolToSmiles(mol, isomericSmiles=True)
        
def simple_test():
    for smiles, expected in (
            ("CC", Canon("*C.*C")),
            ("F[C@](Cl)(Br)O", Canon("*F.*[C@](Cl)(Br)O")),
            ("F[C@@](Cl)(Br)O", Canon("*F.*[C@@](Cl)(Br)O")),
            ("F[C@@H](Br)O", Canon("*F.*[C@@H](Br)O")),
            ):
        mol = Chem.MolFromSmiles(smiles)
        fragmented_mol = fragment_chiral(mol, 0, 1)
        fragmented_smiles = Chem.MolToSmiles(fragmented_mol, isomericSmiles=True)
        if fragmented_smiles != expected:
            print("smiles:", smiles)
            print("fragmented:", fragmented_smiles)
            print("  expected:", expected)

# Match a single bond not in a ring
BOND_SMARTS = "[!#0;!#1]-!@[!#0;!#1]"
single_bond_pat = Chem.MolFromSmarts(BOND_SMARTS)


_bridgehead1_pat = Chem.MolFromSmarts("*~1~*~*(~*~*~*~2)~*~*~2~*~1")
_bridgehead2_pat = Chem.MolFromSmarts("*~1~*~*(~*~*~2)~*~*~2~*~1")

def is_bridgehead(mol):
    """Test if the molecule contains one of the bridgehead patterns"""
    return (mol.HasSubstructMatch(_bridgehead1_pat) or
            mol.HasSubstructMatch(_bridgehead2_pat))

def file_test():
    # Point this to a SMILES file to test
    filename = "/Users/dalke/databases/chembl_20_rdkit.smi"
    
    with open(filename) as infile:
        num_records = num_successes = num_failures = 0
        for lineno, line in enumerate(infile):
            # Give some progress feedback
            if lineno % 100 == 0:
                print("Processed", lineno, "lines and", num_records,
                      "records. Successes:", num_successes,
                      "Failures:", num_failures)
            
            # Only test structures with a chiral atom
            input_smiles = line.split()[0]
            if "@" not in input_smiles:
                continue
            
            # The code doesn't handle directional bonds. Convert them
            # to single bonds
            if "/" in input_smiles:
                input_smiles = input_smiles.replace("/", "-")
            if "\\" in input_smiles:
                input_smiles = input_smiles.replace("\\", "-")
            
            mol = Chem.MolFromSmiles(input_smiles)
            if mol is None:
                continue
            
            ### Uncomment/comment the 'continue' lines below to select which
            ### structures to test. As written, only structures containing
            ### a bridgehead are tested.
            if is_bridgehead(mol):
                pass
                #continue
            else:
                pass
                continue
            num_records += 1
            
            # I expect the reassembled structure to match this canonical SMILES
            expected_smiles = Chem.MolToSmiles(mol, isomericSmiles=True)
            
            # Cut each of the non-ring single bonds between two heavy atoms
            matches = mol.GetSubstructMatches(single_bond_pat)
            has_failure = False
            for begin_atom, end_atom in matches:
                # Fragment
                fragmented_mol = fragment_chiral(mol, begin_atom, end_atom)
                fragmented_smiles = Chem.MolToSmiles(fragmented_mol, isomericSmiles=True)
                assert "." in fragmented_smiles, fragmented_smiles # safety check
                
                # Convert the "*"s to the correct "%90" closures
                closure_smiles = convert_wildcards_to_closures(fragmented_smiles, (0, 0))
                assert "%90" in closure_smiles, closure_smiles # safety check
                closure_mol = Chem.MolFromSmiles(closure_smiles)
                
                # canonicalize and compare; report any mismatches
                final_smiles = Chem.MolToSmiles(closure_mol, isomericSmiles=True)
                if final_smiles != expected_smiles:
                    print("FAILURE in record", num_records+1)
                    print("     input_smiles:", input_smiles)
                    print("  begin/end atoms:", begin_atom, end_atom)
                    print("fragmented smiles:", fragmented_smiles)
                    print("   closure smiles:", closure_smiles)
                    print("     final smiles:", final_smiles)
                    print("  expected smiles:", expected_smiles)
                    has_failure = True
            if has_failure:
                num_failures += 1
            else:
                num_successes += 1
                #print("SUCCESS", input_smiles)
        
        print("Done. Records:", num_records, "Successes:", num_successes,
              "Failures:", num_failures)
            
if __name__ == "__main__":
    simple_test()
    file_test()

Thanks!

Thanks to Greg Landrum both for RDKit and for help in tracking down some of the stubborn cases. Thanks also to the University of Hamburg for SMARTSViewer, which I use as a SMILES structure viewer so I don't have to worry about bond type, aromaticity, or chiral re-interpretations.

Edits

This essay has had a few post-publication edits. I added more emphasis that this essay exists for pedagogical reasons, explained more clearly that SanitizeMol() is meant as a workaround for a problem currently under investigation, and moved the ChEMBL SDF analysis code to its own post.



Fragment achiral molecules in RDKit using low-level functions #

This essay shows a simple way to fragment a non-chiral molecule (okay, a "molecular graph") using low-level RDKit calls, and then a complex way to re-connect the SMILES fragments by manipulating the SMILES string at the syntax level and then recanonicalizing. (The correct way to fragment a bond in RDKit is to use FragmentOnBonds(), and not the code I cover here, which is meant as the basis for understanding how something like FragmentOnBonds() works.)

It is part of a series of essays on how to fragment a molecular graph.

I'll start by fragmenting the methoxy group of vanillin. (I wrote an essay on the vanillin functional groups.) I want to turn the SMILES of "c1(C=O)cc(OC)c(O)cc1" into two parts, '[*]OC' (the methoxy group) and 'O=Cc1cc[*]c(O)cc1' (the rest of the structure). The "[*]" terms are a way to represent an R-group in SMILES.

The method I'll use is to identify the bond and its two end atoms, delete the bond, and for each of the end atoms add a new "*" atom, which is an atom with atomic number 0.

I'll identify the bond manually, by counting the atom indices and identifying the substring which contains the methoxy:

                              1   ] atom indices, from 0 to 10.
            0  1 2 34 56 7 8 90   ]   (The top line is the tens.)
 vanillin = c1(C=O)cc(OC)c(O)cc1
                      ^^-- methoxy group
                    ^-^ break this bond between atom[4] and atom[5]
I want to break the bond between atoms at 4 and 5. I'll use RDKit for this. The default RDKit molecule does not support topological modifications, so I'll have to turn the molecule into a "RWMol" (Read/Write Molecule) to edit it, then convert the result back to a normal molecule:
>>> mol = Chem.MolFromSmiles("c1(C=O)cc(OC)c(O)cc1")
>>> rwmol = Chem.RWMol(mol)
>>> rwmol.RemoveBond(4, 5)
>>> new_mol = rwmol.GetMol()
>>> Chem.MolToSmiles(new_mol)
'CO.O=Cc1ccc(O)cc1'
Visual inspection shows that this broke the correct bond, but it's missing the new "[*]" atoms. I need to go back to the RWMol and add two atoms with atomic number 0 (which Daylight SMILES terms a 'wildcard' atom):
>>> rwmol.AddAtom(Chem.Atom(0))
11
>>> rwmol.AddAtom(Chem.Atom(0))
12
>>> new_mol = rwmol.GetMol()
>>> Chem.MolToSmiles(new_mol)
'CO.O=Cc1ccc(O)cc1.[*].[*]'
That's getting closer! Now I need to connect each new atom to the correct atom which was at the end of the deleted bond. Remember, the old atom ids were 4 and 5, and you can see from the above code that the new atom ids are 11 and 12. I'll connect atom 4 to 11 and atom 5 to 12, both using a single bond:
>>> rwmol.AddBond(4, 11, Chem.BondType.SINGLE)
11
>>> rwmol.AddBond(5, 12, Chem.BondType.SINGLE)
12
>>> new_mol = rwmol.GetMol()
>>> Chem.MolToSmiles(new_mol)
'[*]OC.[*]c1cc(C=O)ccc1O'
That's the output I'm looking for. Now to wrap it up in a function:
from rdkit import Chem

def fragment_simple(mol, atom1, atom2):
    rwmol = Chem.RWMol(mol)
    rwmol.RemoveBond(atom1, atom2)
    wildcard1 = rwmol.AddAtom(Chem.Atom(0))
    wildcard2 = rwmol.AddAtom(Chem.Atom(0))
    rwmol.AddBond(atom1, wildcard1, Chem.BondType.SINGLE) 
    rwmol.AddBond(atom2, wildcard2, Chem.BondType.SINGLE) 
    return rwmol.GetMol()
and test it out on the vanillin structure:
>>> vanillin = Chem.MolFromSmiles("c1(C=O)cc(OC)c(O)cc1")
>>> fragmented_vanillin = fragment_simple(vanillin, 4, 5)
>>> Chem.MolToSmiles(fragmented_vanillin)
'[*]OC.[*]c1cc(C=O)ccc1O'

Reassemble two fragments

The above code looks right, but the real test is to reassemble the fragments and see if it gives the same result. Rather than work with the molecule at the molecular graph level, I'll manipulate the SMILES string directly, using a ring closure method I described in 2004.

The term "ring closure" sounds like it's only used for rings, but it's actually simply an alternate way to specify a bond. As Daylight's SMILES theory documentation points out, "C1.C1 specifies the same molecule as CC(ethane)".

I'll replace the "[*]" symbols with the ring closure "%90", which is high enough that won't occur in the structures I deal with. I can't do a simple string substitution since that would turn '[*]OC' into '%90OC'. If the '[*]' is the first atom, or is after a '.', then I need to place the %90 after the second atom symbol, giving "O%90C.c%901cc(C=O)ccc1O". I'll canonicalize that string, and compare it to the canonicalized vanillin structure:

>>> test_smiles = "O%90C.c%901cc(C=O)ccc1O"
>>> Chem.CanonSmiles(test_smiles)
'COc1cc(C=O)ccc1O'
>>> Chem.MolToSmiles(vanillin)
'COc1cc(C=O)ccc1O'
As hoped for, they are the same.

Convert wildcards to closures

I apologize. I'm about to show you a big block of code with very little commentary besides the comments. Feel free to skip to the next section.

In the previous section, I manually transformed a SMILES string containing wildcards to one with ring closures so I could test if the fragmentation code was correct. I can automate that step, but it's a bit complicated and I won't go into the details. The key observation is that the wildcard atom is only connected by a single bond to the rest of the fragment. If the wildcard is the first atom term in the SMILES then the ring closure goes on the second atom term. Otherwise, the ring closure goes on the atom immediately before the wildcard atom. (The last is RDKit-specific because RDKit's canonical SMILES ends up placing branches with wildcard atoms before any other branches.)

The following code automates that process. I call the file "smiles_syntax.py" and it defines the new function "convert_wildcards_to_closures(smiles, offsets=None)". Here's an example:

>>> import smiles_syntax
>>> fragmented_smiles = "[*]OC.[*]c1cc(C=O)ccc1O"
>>> smiles_syntax.convert_wildcards_to_closures(fragmented_smiles)
'O%90C.c%911cc(C=O)ccc1O'
>>> smiles_syntax.convert_wildcards_to_closures(fragmented_smiles, (0, 0))
'O%90C.c%901cc(C=O)ccc1O'
>>> closure_smiles = smiles_syntax.convert_wildcards_to_closures(fragmented_smiles, (0, 0))
>>> from rdkit import Chem
>>> Chem.CanonSmiles(closure_smiles)
'COc1cc(C=O)ccc1O'
By default each wildcard atom gets a new closure number, starting from 90, which is why you see the "%90" and "%91". The offsets parameter specifies which offsets from 90 to use. With "(0, 0)", both wildcards get the '%90' ring closure.

This will be used later on to work with multiple cuts of the same structure.
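For a preview of what that looks like, here's a hand-made sketch with two cuts. The fragmented SMILES below is written by hand, not RDKit output: it's ethanolamine ("OCCN") cut at both carbon-heteroatom bonds. Wildcards which are given the same offset get the same closure number, so they bond back to each other:

>>> from rdkit import Chem
>>> from smiles_syntax import convert_wildcards_to_closures
>>> fragmented_smiles = "[*]CC[*].[*]O.[*]N"   # hand-made: OCCN cut at the O-C and C-N bonds
>>> closure_smiles = convert_wildcards_to_closures(fragmented_smiles, (0, 1, 0, 1))
>>> closure_smiles
'C%90C%91.O%90.N%91'
>>> Chem.CanonSmiles(closure_smiles) == Chem.CanonSmiles("OCCN")
True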

Here's the content of "smiles_syntax.py".

# I call this "smiles_syntax.py"
from __future__ import print_function

import re

# Match a '*' in the different forms that might occur,
# including with directional single bonds inside of ()s.
_wildcard_regex = " |\n".join(re.escape(regex) for regex in
  ("*", "[*]", "(*)", "([*])", "(/*)", "(/[*])", "(\\*)", "(\\[*])"))
_wildcard_pattern = re.compile(_wildcard_regex, re.X)

# Match the SMILES for an atom, followed by its closures
_atom_pattern = re.compile(r"""
(
 Cl? |             # Cl and Br are part of the organic subset
 Br? |
 [NOSPFIbcnosp] |  # as are these single-letter elements
 \[[^]]*\]         # everything else must be in []s
)
""", re.X)

def convert_wildcards_to_closures(smiles, offsets=None):
    # This is designed for RDKit's canonical SMILES output. It does
    # not handle all possible SMILES inputs.
    if offsets is None:
        # Use 0, 1, 2, ... up to the number of '*'s
        offsets = range(smiles.count("*"))
    closure_terms = []
    for offset in offsets:
        if not (0 <= offset <= 9):
            raise ValueError("offset %d out of range (must be from 0 to 9)"
                             % (offset,))
        closure_terms.append("%%%02d" % (90 + offset))
    
    new_smiles = smiles
    while 1:
        # Find the first '*'. If none are left, stop.
        wildcard_match = _wildcard_pattern.search(new_smiles)
        if wildcard_match is None:
            break

        closure_term = closure_terms.pop(0)
        
        wildcard_start = wildcard_match.start()
        if wildcard_start == 0 or new_smiles[wildcard_start-1] == ".":
            # At the start of the molecule or after a ".". Need to
            # put the closure after the second atom. Find the second
            # atom. Since we only ever break on single non-ring bonds,
            # and since the first atom is a terminal atom, the second
            # atom must either be immediately after the first atom, or
            # there is a directional bond between them.
            wildcard_end = wildcard_match.end()
            second_atom_match = _atom_pattern.match(new_smiles, wildcard_end)
            if second_atom_match is None:
                # There was no atom. Is it a "/" or "\"? If so,
                # we'll need to swap the direction when we move
                # to a closure after the second atom.
                bond_dir = new_smiles[wildcard_end:wildcard_end+1]
                if bond_dir == "/":
                    direction = "\\"
                elif bond_dir == "\\":
                    direction = "/"
                else:
                    raise AssertionError(new_smiles)
                # Look for the second atom, which must exist
                second_atom_match = _atom_pattern.match(new_smiles, wildcard_end+1)
                if second_atom_match is None:
                    raise AssertionError((new_smiles, new_smiles[wildcard_end:]))
            else:
                direction = ""

            second_atom_term = second_atom_match.group(1)
            # I changed the bond configuration, so I may need to
            # invert chirality of implicit chiral hydrogens.
            if "@@H" in second_atom_term:
                second_atom_term = second_atom_term.replace("@@H", "@H")
            elif "@H" in second_atom_term:
                second_atom_term = second_atom_term.replace("@H", "@@H")

            # Reassemble the string with the wildcard term deleted and
            # the new closure inserted directly after the second atom
            # (and before any of its closures).
            new_smiles = (new_smiles[:wildcard_start] + second_atom_term
                          + direction + closure_term
                          + new_smiles[second_atom_match.end():])
                    
        else:
            # The match is somewhere inside of the molecule, so we
            # assign the closure to the atom it's bonded to on the left.
            c = new_smiles[wildcard_start-1]
            if c == "(" or c == ")":
                # In principle, this could be something like "CCC(F)(Cl)[*]", 
                # where I would need to count the number of groups back to
                # the main atom, and flip chirality accordingly. Thankfully,
                # RDKit always puts the "[*]" terms immediately after the
                # preceding atom, so I don't need to worry.
                raise NotImplementedError("intermediate groups not supported",
                                          new_smiles, new_smiles[wildcard_start-1:])
            
            elif c in "CNcnOS]Pos0123456789ABDEFGHIJKLMQRTUVWXYZabdefghijklmpqrtuvwxyz":
                # Double-check that the previous character looks like part of an atom.
                wildcard_term = wildcard_match.group()
                # Preserve the direction, if present
                if "/" in wildcard_term:
                    direction = "/"
                elif "\\" in wildcard_term:
                    direction = "\\"
                else:
                    direction = ""
                new_smiles = (new_smiles[:wildcard_start] + direction + closure_term
                              + new_smiles[wildcard_match.end():])

            else:
                raise AssertionError((new_smiles, c, new_smiles[wildcard_start-1:]))

    return new_smiles

if __name__ == "__main__":
    for smiles in ("*C", "*/CO.*CN", "C*.C(*)N"):
        print(smiles, convert_wildcards_to_closures(smiles, (0,)*smiles.count("*")))

Test driver for fragment_simple()

While the above code looks right, it got that way only as the result of a lot of testing. There are a lot of corner cases, like deuterium and directional bonds, which caused the first versions of this function to fail.

I've found it best to pass a lot of real-world SMILES through code like this, to see if there are any problems. I usually draw from PubChem SMILES or ChEMBL. I'll show you what that looks like.

NOTE: the file I test against, "pubchem.smi", excludes structures with chiral terms. This is deliberate, for reasons I'll get to soon.

The "verify()" function, below, takes a molecule and a list of a pairs. For each pair of atom describing a bond, it fragments the molecule on that bond then reassembles it and tests that the result matches the input.

from __future__ import print_function  
 
from rdkit import Chem
from smiles_syntax import convert_wildcards_to_closures

def fragment_simple(mol, atom1, atom2):
    rwmol = Chem.RWMol(mol)
    rwmol.RemoveBond(atom1, atom2)
    wildcard1 = rwmol.AddAtom(Chem.Atom(0))
    wildcard2 = rwmol.AddAtom(Chem.Atom(0))
    rwmol.AddBond(atom1, wildcard1, Chem.BondType.SINGLE) 
    rwmol.AddBond(atom2, wildcard2, Chem.BondType.SINGLE) 
    return rwmol.GetMol()

def verify(mol, bond_atoms):
    # Even though I'm not checking for stereochemistry, some of
    # my input files contain isotopes. If I don't use isomeric
    # SMILES then the input '[2H]' gets represented as '[H]',
    # which during recanonicalization is placed on the main atom.
    # (RDKit 2016 added an option to disable that.)
    expected_smiles = Chem.MolToSmiles(mol, isomericSmiles=True)
    
    for begin_atom, end_atom in bond_atoms:
        fragmented_mol = fragment_simple(mol, begin_atom, end_atom)
        fragmented_smiles = Chem.MolToSmiles(fragmented_mol, isomericSmiles=True)
        assert "." in fragmented_smiles, fragmented_smiles # safety check
        
        closure_smiles = convert_wildcards_to_closures(fragmented_smiles, (0, 0))
        assert "%90" in closure_smiles, closure_smiles # safety check
        closure_mol = Chem.MolFromSmiles(closure_smiles)
	
        final_smiles = Chem.MolToSmiles(closure_mol, isomericSmiles=True)
        if final_smiles != expected_smiles:
            raise ValueError( (expected_smiles, begin_atom, end_atom, fragmented_smiles,
                               closure_smiles, final_smiles) )
I can run it manually, and also check what it does when I break one of the ring bonds:
>>> vanillin = Chem.MolFromSmiles("c1(C=O)cc(OC)c(O)cc1")
>>> verify(vanillin, [(4, 5)])
>>> verify(vanillin, [(3, 4)])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 12, in verify
AssertionError: [*]COc1([*])cc(C=O)ccc1O

I can have it break all single bonds which aren't in a ring and which aren't connected to a wildcard atom (#0) or a hydrogen (#1):

>>> single_bond_pat = Chem.MolFromSmarts("[!#0;!#1]-!@[!#0;!#1]")
>>> matches = vanillin.GetSubstructMatches(single_bond_pat)
>>> matches
((0, 1), (4, 5), (5, 6), (7, 8))
>>> verify(vanillin, matches)

Once I'm happy that the verification function is correct, I can put it in a test driver:

def main():
    # Match a single bond not in a ring
    single_bond_pat = Chem.MolFromSmarts("[!#0;!#1]-!@[!#0;!#1]")

    # NOTE! This file contains no chiral structures (nothing with a '@')
    with open("pubchem.smi") as infile:
        ## You might try processing every 100th record
        # import itertools
        # infile = itertools.islice(infile, None, None, 100)

        ## Or process the lines in random order
        # import random
        # lines = infile.readlines()
        # random.shuffle(lines)
        # infile = lines
        
        recno = -1
        num_tests = 0
        for recno, line in enumerate(infile):
            if recno % 100 == 0:
                print("Processing record:", recno, "#tests:", num_tests)
                
            smiles, id = line.split()
            if "*" in smiles:
                print("Skipping record with a '*':", repr(line))
                continue
            if "@" in smiles:
                print("Skipping record with a chiral '@':", repr(line))
                continue
            
            mol = Chem.MolFromSmiles(smiles)
            if mol is None:
                continue

            matches = mol.GetSubstructMatches(single_bond_pat)
            verify(mol, matches)
            num_tests += len(matches)
            
        recno += 1
        print("Processed", recno, "records and", num_tests, "tests")
            
if __name__ == "__main__":
    main()
The output from this looks like:
% python fragment2.py
Processing record: 0 #tests: 0
Processing record: 100 #tests: 1102
Processing record: 200 #tests: 2078
Processing record: 300 #tests: 3111
 ...
Processing record: 6800 #tests: 72322
[00:24:27] Explicit valence for atom # 2 Cl, 3, is greater than permitted
Processing record: 6900 #tests: 73275
  ...

fragment_simple() does not preserve chirality

That seems like a lot of test code just to show that the function, "fragment_simple()", is correct. It's only seven lines long, without any branches. It's the sort of function that should be easy to test by eye or with a few test cases.

Which is what I did the first ten or so times I wrote that function. But then I found out that the function doesn't preserve chirality. Here's an example of the verify() function failing on what should be a simple test case:

>>> mol = Chem.MolFromSmiles("C[C@](N)(C=O)c1ccccc1")
>>> verify(mol, [(1, 3)])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 21, in verify
ValueError: ('C[C@](N)(C=O)c1ccccc1', 1, 3, '[*]C=O.[*][C@@](C)(N)c1ccccc1',
'C%90=O.[C@@]%90(C)(N)c1ccccc1', 'C[C@@](N)(C=O)c1ccccc1')

You can see the problem in the fragmentation step, where the chirality somehow inverted:

>>> mol = Chem.MolFromSmiles("C[C@](N)(C=O)c1ccccc1")
>>> new_mol = fragment_simple(mol, 1, 3)
>>> Chem.MolToSmiles(new_mol, isomericSmiles=True)
'[*]C=O.[*][C@@](C)(N)c1ccccc1'
The second fragment should have "[C@]" instead of "[C@@]". Why did it flip?

The problem comes back to the "RemoveBond()/AddBond()" calls in fragment_simple(). RDKit stores the chirality as an atom property, but that property only has meaning relative to the permutation order of the atom's bonds. If the permutation order changes, then the perceived chirality changes even though the stored chirality property itself hasn't.
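One quick way to see that the stored property really didn't change is to compare the chiral tags directly (a small check, reusing mol and new_mol from above):

>>> mol.GetAtomWithIdx(1).GetChiralTag() == new_mol.GetAtomWithIdx(1).GetChiralTag()
True

The tag is identical on both atoms; what differs is the order of the neighbors around the atom.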

To demonstrate, I need a helper function to show the connection table:

_bond_symbol = {
    Chem.BondType.SINGLE: "-",
    Chem.BondType.DOUBLE: "=",
    Chem.BondType.TRIPLE: "#",
    Chem.BondType.AROMATIC: ":",
    }
def get_bond_info(from_atom_idx, bond):
    to_atom_idx = bond.GetBeginAtomIdx()
    if from_atom_idx == to_atom_idx:
        to_atom_idx = bond.GetEndAtomIdx()
    return _bond_symbol[bond.GetBondType()] + str(to_atom_idx)

# Example output line:
#   5 C -1 :6 :10 
# This says that atom[5] is a "C"arbon with 3 bonds.
# The first bond is a single bond to atom[1].
# The second bond is an aromatic bond to atom[6]. 
# The third bond is an aromatic bond to atom[10].

def show_connection_table(mol):
    for atom in mol.GetAtoms():
        bonds = " ".join(get_bond_info(atom.GetIdx(), bond) for bond in atom.GetBonds())
        print(atom.GetIdx(), atom.GetSymbol(), bonds)
then use it to show the connection table for the two molecules.
>>> show_connection_table(mol)
0 C -1
1 C -0 -2 -3 -5
2 N -1
3 C -1 =4
4 O =3
5 C -1 :6 :10  
6 C :5 :7
7 C :6 :8
8 C :7 :9
9 C :8 :10
10 C :9 :5
>>> show_connection_table(new_mol)
0 C -1
1 C -0 -2 -5 -11
2 N -1
3 C =4 -12
4 O =3
5 C -1 :6 :10
6 C :5 :7
7 C :6 :8
8 C :7 :9
9 C :8 :10
10 C :9 :5
11 * -1
12 * -3
I deleted the bond from atom[1] to atom[3]. It's atom[1] which has the chiral property so look at the second line of each of the outputs.

The original bond configuration order was (0, 2, 3, 5), while now it's (0, 2, 5, 11): the third bond, which went to the old atom[3], was removed, and a new fourth bond to the new wildcard atom[11] was added. The permutation order changed. The third and fourth bonds must be swapped to get back to the original permutation order. Each swap in the permutation order corresponds to a chirality inversion, which is why the output chirality is inverted.
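To make that concrete, here's a small worked sketch of the parity bookkeeping. The parity() helper below is just a plain swap-counting sort, not an RDKit call, and the neighbor lists are the ones shown above.

def parity(values):
    # Count the pairwise swaps needed to sort the list; return 0 for even, 1 for odd.
    values = list(values)
    num_swaps = 0
    for i in range(len(values) - 1):
        for j in range(i + 1, len(values)):
            if values[i] > values[j]:
                values[i], values[j] = values[j], values[i]
                num_swaps += 1
    return num_swaps % 2

old_order = [0, 2, 3, 5]    # neighbors of atom[1] before the cut
new_order = [0, 2, 5, 11]   # neighbors of atom[1] after the cut; 11 is the wildcard
# Treat the wildcard as standing in for the old neighbor (atom[3]) it replaced.
relabeled = [3 if atom_id == 11 else atom_id for atom_id in new_order]   # [0, 2, 5, 3]
print(parity(old_order), parity(relabeled))   # prints "0 1" - the parities differ, so the chirality must be inverted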

Stay tuned to this channel for the continuing story of how to fragment a molecule and preserve chirality after reassembly. (Remember, RDKit has a built-in function for this, FragmentOnBonds(). Use that if you want to fragment.)



Molecular fragments, R-groups, and functional groups #

For a change of pace, I figured I would do a basic chemistry lesson about molecular structures, instead of a more computer oriented blog post.

Chemists often think about a molecule as a core structure (usually a ring system) and a set of R-groups. Each R-group is attached to an atom in the core structure by a bond. Typically that bond is a single bond, and often "rotatable".

Here's an example of what I mean. The first image below shows the structure of vanillin, which is the primary taste behind vanilla. In the second image, I've circled the three R-groups in the structure.
Vanillin structure (the primary taste of vanilla)
Vanillin with the three R-groups identified

The R-groups in this case are R1=a carbonyl group (*-CH=O), R2=a methoxy group (*-O-CH3), and R3=a hydroxyl group (*-OH), where the "*" indicates where the R-group attaches to the core structure.

The R-group concept is flexible. Really it just means that you have a fixed group of connected atoms, which are connected along some bond to a variable group of atoms, and where the variable group is denoted R. Instead of looking at the core structure and a set of R-groups, I can invert the thinking and think of an R-group, like the carbonyl group, as "the core structure", and the rest of the vanillin as its R-group.

With that in mind, I'll replace the "*" with the "R" to get the groups "R-CH=O", "R-O-CH3", and "R-OH". (The "*" means that the fragment is connected to an atom at this point, but it's really just an alternative naming scheme for "R".)

All three of these group are also functional groups. Quoting Wikipedia, "functional groups are specific groups (moieties) of atoms or bonds within molecules that are responsible for the characteristic chemical reactions of those molecules. The same functional group will undergo the same or similar chemical reaction(s) regardless of the size of the molecule it is a part of."

The three corresponding functional groups are R1 = aldehyde, R2 = ether, and R3 = hydroxyl.

As the Wikipedia quote pointed out, if you have a reaction which acts on an aldehyde, you can likely use it on the aldehyde group of vanillin.

Vanillyl group and capsaicin

A functional group can also contain functional groups. I pointed to the three functional groups attached to the central ring of vanillin, but most of the vanillin structure is itself another functional group, the vanillyl group:
vanillyl functional group

Structures which contain a vanillyl group are called vanilloids. Vanillin is of course a vanilloid, but surprisingly so is capsaicin, the source of the "heat" in many a spicy food. Here's the capsaicin structure, with the vanillyl group circled:
capsaicin with vanillyl group circled

The feeling of heat comes about because the capsaicin binds to TrpV1 (the transient receptor potential cation channel subfamily V member 1), also known as the "capsaicin receptor". It's a nonselective receptor, which means that many things can cause it to activate. Quoting that Wikipedia page: "The best-known activators of TRPV1 are: temperature greater than 43 °C (109 °F); acidic conditions; capsaicin, the irritating compound in hot chili peppers; and allyl isothiocyanate, the pungent compound in mustard and wasabi." The same receptor detects temperature, capsaicin, and a compound in hot mustard and wasabi, which is why your body interprets them all as "hot."

Capsaicin is a member of the capsaicinoid family. All capsaicinoids are vanillyls, and all vanillyls are aldehydes. This sort of is-a family membership relationship in chemistry has led to many taxonomies and ontologies, including ChEBI.

But don't let my example or the existence of nomenclature lead you to the wrong conclusion that all R-groups are functional groups! An R-group, at least with the people I usually work with, is a more generic term used to describe a way of thinking about molecular structures.

QSAR modeling

QSAR (pronounced "QUE-SAR") is short for "quantitative structure-activity relationship", which is a mouthful. (I once travelled to the UK for a UK-QSAR meeting. The border inspector asked me where I was going, and I said "the UK-QSAR meeting; QSAR is ..." and I blanked on the expansion of that term! I was allowed across the border, so it couldn't have been that big of a mistake.)

QSAR deals with the development of models which relate chemical structure to its activity in a biological or chemical system. Looking at that, I realize I just moved the words around a bit, so I'll give a simple example.

Consider an activity, which I'll call "molecular weight". (This is more of a physical property than a chemical one, but I am trying to make it simple.) My model for molecular weight assumes that each atom has its own weight, and the total molecular weight is the sum of the individual atom weights. I can create a training set of molecules, and for each molecule determine its structure and molecular weight. With a bit of least-squares fitting, I can determine the individual atom weight contribution. Once I have that model, I can use it to predict the molecular weight of any molecule which contains atoms which the model knows about.
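To make the fitting step concrete, here's a toy sketch using NumPy. The atom counts and observed weights below are made up for illustration; they are not a real training set.

import numpy as np

# Each row is one training molecule; columns are counts of C, H, and O atoms.
atom_counts = np.array([
    [1, 4, 0],   # methane
    [2, 6, 0],   # ethane
    [1, 4, 1],   # methanol
    [2, 6, 1],   # ethanol
], dtype=float)
observed_weights = np.array([16.04, 30.07, 32.04, 46.07])

# Least-squares fit of one weight contribution per element.
contributions, residuals, rank, _ = np.linalg.lstsq(atom_counts, observed_weights, rcond=None)
print(contributions)   # roughly [12.01, 1.01, 16.00] - the atomic weights fall out of the fit

# The model then predicts the weight of any molecule from its atom counts.
print(atom_counts @ contributions)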

Obviously this model will be pretty accurate. It won't be perfect, because isotopic ratios can vary. (A chemical synthesized from fossil oil is slightly lighter and less radioactive than the same chemical derived from environmental sources, because the heavier radioactive 14C in fossil oil has decayed.) But for most uses it will be good enough.

A more chemically oriented property is the partition coefficient, measured in log units as "log P", which is a measure of the solubility in water compared to a type of oil. This gives a rough idea of if the molecule will tend to end up in hydrophobic regions like a cell membrane, or in aqueous regions like blood. One way to predict log P is with the atom-based approach I sketched for the molecular weight, where each atom type has a contribution to the overall measured log P. (This is sometimes called AlogP.)

In practice, atom-based solutions are not as accurate as fragment-based solutions. The molecular weight can be atom-centered because nearly all of the mass is in the atom's nucleus, which is well localized to the atom. But chemistry isn't really about atoms but about the electron density around atoms, and electrons are much less localized than nucleons. The density around an atom depends on the neighboring atoms and the configuration of the atoms in space.

As a way to improve on that, some methods look at the extended local environment (this is sometimes called XlogP) or at larger fragment contributions (for example, BioByte's ClogP). The more complex it is, the more compounds you need for the training and the slower the model. But hopefully the result is more accurate, so long as you don't overfit the model.

If you're really interested in the topic, Paul Beswick of the Sussex Drug Discovery Centre wrote a nice summary on the different nuances in log P prediction.

Matched molecular pairs

Every major method from data mining, and most of the minor methods, have been applied to QSAR models. The history is also quite long. There are cheminformatics papers back from the 1970s looking at supervised and unsupervised learning, building on even earlier work on clustering applied to biological systems.

A problem with most of these is the black-box nature. The data is noisy, and the quantum nature of chemistry isn't that good of a match to data mining tools, so these predictions are used more often to guide a pharmaceutical chemist than to make solid predictions. This means the conclusions should be interpretable by the chemist. Try getting your neural net to give a chemically reasonable explanation of why it predicted as it did!

Matched molecular pair (MMP) analysis is a more chemist-oriented QSAR method, with relatively little mathematics beyond simple statistics. Chemists have long looked at activities in simple series, like replacing a methyl (*-CH3) with an ethyl (*-CH2-CH3) or propyl (*-CH2-CH2-CH3), or replacing a fluorine with a heavier halogen like a chlorine or bromine. These can form consistent trends across a wide range of structures, and chemists have used these observations to develop techniques for how to, say, improve the solubility of a drug candidate.

MMP systematizes this analysis over all considered fragments, including not just R-groups (which are connected to the rest of the structure by one bond) but also so-called "core" structures with two or three R-groups attached to them. For example, if the known structures can be described as "A-B-C", "A-D-C", "E-B-F" and "E-D-F", with activities of 1.2, 1.5, 2.3, and 2.6 respectively, then we can do the following analysis:

  A-B-C transforms to A-D-C with an activity shift of 0.3.
  E-B-F transforms to E-D-F with an activity shift of 0.3.
  Both transforms can be described as R1-B-R2 to R1-D-R2.
  Perhaps R1-B-R2 to R1-D-R2 in general causes a shift of 0.3?

It's not quite as easy as this, because the molecular fragments aren't so easily identified. A molecule might be described as "A-B-C", as well as "E-Q-F" and "E-H" and "C-T(-P)-A", where "T" has three R-groups connected to it.
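Setting that complication aside, here's a toy sketch of the bookkeeping for the simple example above. There's no chemistry in it, just dictionary lookups over the made-up fragments and activities:

from collections import defaultdict

# Made-up structures described as (R1, middle, R2), with their activities.
activities = {
    ("A", "B", "C"): 1.2,
    ("A", "D", "C"): 1.5,
    ("E", "B", "F"): 2.3,
    ("E", "D", "F"): 2.6,
}

# For every structure containing "B", look for its matched pair with "D"
# in the same R1/R2 context, and record the activity shift.
shifts = defaultdict(list)
for (r1, middle, r2), activity in activities.items():
    if middle == "B":
        partner_activity = activities.get((r1, "D", r2))
        if partner_activity is not None:
            shifts[("B", "D")].append(partner_activity - activity)

for transform, deltas in shifts.items():
    print(transform, "average shift:", sum(deltas) / len(deltas))
# -> ('B', 'D') average shift: about 0.3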

Thanks

Thanks to EPAM Life Sciences for their Ketcher tool, which I used for the structure depictions that weren't public domain on Wikipedia.



Reading ASCII file in Python3.5 is 2-3x faster as bytes than string #

I'm porting chemfp from Python2 to Python3. I read a lot of ASCII files. I'm trying to figure out if it's better to read them as binary bytes or as text strings.

No matter how I tweak Python3's open() parameters, I can't get the string read performance to within a factor of 2 of the bytes read performance. As I haven't seen much discussion of this, I figured I would document it here.

chemfp reads chemistry file formats which are specified as ASCII. They contain user-specified fields which are 8-bit clean, so sometimes people use them to encode non-ASCII data. For example, the SD tag field "price" might include the price in £GBP or €EUR, and include the currency symbol either as Latin-1 or UTF-8. (I haven't come across other encodings, but I've also never worked with SD files used internally in, say, a Japanese pharmaceutical company.)

These are text files, so it makes sense to read them as text, right? The main problem is that reading in "r" mode is a lot slower than reading in "rb" mode. Here's my benchmark, which uses Python 3.5.2 on a Mac OS X 10.10.5 machine to read the first 10MiB from a 3.1GiB file:

% python -V
Python 3.5.2
% python -m timeit 'open("chembl_21.sdf", "r").read(10*1024*1024)'
100 loops, best of 3: 10.3 msec per loop
% python -m timeit 'open("chembl_21.sdf", "rb").read(10*1024*1024)'
100 loops, best of 3: 3.74 msec per loop
The Unicode string read() is much slower than the byte string read(), with a performance ratio of 2.75. (I'll give all numbers in ratios.)

Python2 had a similar problem. I originally used "U"niversal newline mode in chemfp to read the text files in FPS format, but found that if I switched from "rU" to "rb", and wrote my code to support both '\n' and '\r\n' conventions, I could double my overall system read performance - the "U" option gives a 10x slowdown!

% python2.7 -m timeit 'open("chembl_21.sdf", "rb").read(10*1024*1024)'
100 loops, best of 3: 3.7 msec per loop
% python2.7 -m timeit 'open("chembl_21.sdf", "rU").read(10*1024*1024)'
10 loops, best of 3: 36.7 msec per loop

This observation is not new. A quick Duck Duck Go search found a 2015 blog post by Nelson Minar which looked into the same slowdown.

The Python3 open() function takes more parameters than Python2, including 'newline', which affects how the text mode reader identifies newlines, and 'encoding':

open(file, mode='r', buffering=-1, encoding=None, errors=None, newline=None, closefd=True)
  ...
    newline controls how universal newlines works (it only applies to text
    mode). It can be None, '', '\n', '\r', and '\r\n'.  It works as
    follows:
    
    * On input, if newline is None, universal newlines mode is
      enabled. Lines in the input can end in '\n', '\r', or '\r\n', and
      these are translated into '\n' before being returned to the
      caller. If it is '', universal newline mode is enabled, but line
      endings are returned to the caller untranslated. If it has any of
      the other legal values, input lines are only terminated by the given
      string, and the line ending is returned to the caller untranslated.

I'll 'deuniversalize' the text reader and benchmark newline="\n" and newline="":

% python -m timeit 'open("chembl_21.sdf", "rb").read(10*1024*1024)'
100 loops, best of 3: 3.81 msec per loop
% python -m timeit 'open("chembl_21.sdf", "r", newline="\n").read(10*1024*1024)'
100 loops, best of 3: 8.8 msec per loop
% python -m timeit 'open("chembl_21.sdf", "r", newline="").read(10*1024*1024)'
100 loops, best of 3: 10.2 msec per loop
% python -m timeit 'open("chembl_21.sdf", "r", newline=None).read(10*1024*1024)'
100 loops, best of 3: 10.2 msec per loop
The ratio of 2.3 for the newline="\n" slowdown is better than the 2.75 for universal newlines and the newline="" case that Nelson Minar tested, but it's still less than half the performance of the byte reader.

I also wondered if the encoding made a difference:

% python -m timeit 'open("chembl_21.sdf", "r", newline="\n", encoding="ascii").read(10*1024*1024)'
100 loops, best of 3: 8.8 msec per loop
% python -m timeit 'open("chembl_21.sdf", "r", newline="\n", encoding="utf8").read(10*1024*1024)'
100 loops, best of 3: 8.8 msec per loop
% python -m timeit 'open("chembl_21.sdf", "r", newline="\n", encoding="latin-1").read(10*1024*1024)'
100 loops, best of 3: 10.1 msec per loop
My benchmark shows that the ASCII and UTF-8 encodings are equally fast, and Latin-1 is 14% slower, even though my data set contains only ASCII. I did not expect any difference. I assume a lot of time has been spent making the UTF-8 code go fast, but I don't know why the Latin-1 reader is noticeably slower on ASCII data.

Nelson Minar also tested the codecs.open() performance, so I'll repeat it:
% python -m timeit -s 'import codecs' 'codecs.open("chembl_21.sdf", "r").read(10*1024*1024)'
100 loops, best of 3: 10.2 msec per loop
I noticed no performance difference between codecs.open() and the builtin open() for this test case.

I'm left with a bit of a quandary. I work with ASCII text data, with only the occasional non-ASCII field. For example, chemfp has specialized code to read an id tag and encoded fingerprint field from an SD file. In rare and non-standard cases, the handful of characters in the id/title line might be non-ASCII, but the hex-encoded fingerprint is never anything other than ASCII. It makes sense to use the text reader. But if I use the text reader, it will decode everything in each record (typically 2K-8K bytes), when I only need to decode at most 100 bytes of the record.

In chemfp, I used to have a two-pass solution to find records in an SD file. The first pass found the fields of interest, and the second counted newlines for better error reporting. I found that even that level of data re-scanning caused an observable slowdown, so I shouldn't be surprised that an extra pass to check for non-ASCII characters might also be a problem. But, two-fold slowdown?

This performance overhead leads me to conclude that I need to process my performance critical files as bytes, rather than strings, and delay the byte-to-string decoding as much as possible.
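In practice that looks something like the following sketch (a hypothetical example, not chemfp's actual reader): keep the record as bytes, slice out the few fields that matter, and decode only those.

# Read as bytes; decode only the title line, and only when it's needed.
with open("chembl_21.sdf", "rb") as infile:
    title_bytes = infile.readline().rstrip(b"\r\n")
    title = title_bytes.decode("utf-8", errors="replace")
    print(title)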

RDKit and (non-)Unicode

I checked with the RDKit, which is a cheminformatics toolkit. The core is in C++, with Python extensions through Boost.Python. It treats the files as bytes, and lazily exposes the data to Python as Unicode. For example, if I place a byte string which is not valid UTF-8 in the title or a tag field, then it will read and write the data without problems, because the data is stored in C++ data structures based on the byte string. But if I try to get the properties from Python, I get a UnicodeDecodeError.

Here's an example. First, I'll get a valid record, which is all ASCII:

>>> block = open("chembl_21.sdf", "rb").read(10000)
>>> i = block.find(b"$$$$\n")
>>> i
973
>>> record = block[:i+5]
>>> chr(max(record))
'm'
I'll use the RDKit to parse the record, then show that I can read the molecule id, which comes from the first line (the title line) of the file:
>>> from rdkit import Chem
>>> mol = Chem.MolFromMolBlock(record)
>>> mol.GetProp("_Name")
'CHEMBL153534'
If I then create a 'bad_record' by prefixing the byte 0xC2 to the record, then I can still process the record, but I cannot get the title:
>>> bad_record = b"\xC2" + record
>>> bad_mol = Chem.MolFromMolBlock(bad_record)
>>> bad_mol.GetNumAtoms()
16
>>> bad_mol.GetProp("_Name")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc2 in position 0: invalid continuation byte
This is because the byte 0xC2 starts a two-byte UTF-8 sequence but isn't followed by a valid continuation byte, and the RDKit converts the underlying bytes to a Python string using the UTF-8 codec with strict error handling, which fails on invalid data.

I can even use C++ to write the string to a file, since the C++ code treats everything as bytes:

>>> outfile = Chem.SDWriter("/dev/tty")
>>> outfile.write(mol); outfile.flush()
CHEMBL153534
     RDKit          2D

 16 17  0  0  0  0  0  0  0  0999 V2000
   ...
>>> outfile.write(bad_mol); outfile.flush()
?CHEMBL153534
     RDKit          2D

 16 17  0  0  0  0  0  0  0  0999 V2000
 ...
(The symbol could not be represented in my UTF-8 terminal, so it uses a "?".)

On the other hand, I get a UnicodeDecodeError if I use a Python file object:

>>> from io import BytesIO
>>> f = BytesIO()
>>> outfile = Chem.SDWriter(f)
>>> outfile.write(bad_mol)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc2 in position 0: invalid continuation byte
(It doesn't matter if I replace the BytesIO with a StringIO. It hasn't gotten to that point yet. When the SDWriter() is initialized with a Python file handle then told to write a molecule, it writes the molecule to a C++ byte string, converts that to a Python string, and passes that Python string to the file handle. The failure here is in the C++ to Python translation.)

The simple conclusion from this is the same as the punchline from the old joke "Doctor, doctor, it hurts when I do this." "Then don't do that." But it's too simple. SD files come from all sorts of places, including sources which may use '\xC2' as the Latin-1 encoding of Â. You don't want your seemingly well-tested system to crash because of something like this.

I'm not sure what the right solution is for the RDKit, but I can conclude that I need a test case something like this for any of my SD readers, and that correct support for the bytes/string translation is not easy.

Want to leave a comment?

If you have a better suggestion than using bytes, like a faster way to read ASCII text as Unicode strings, or have some insight into why reading an ASCII file as a string is so relatively slow in Python3, let me know.

But don't get me wrong. I do scientific programming, which with rare exceptions is centered around the 7-bit ASCII world of the 1960s. Python3 has made great efforts to make Unicode parsing and decoding fast, which is important for most real-world data outside of my niche.



Fun with SMILES I: Does an element exist? #

I first learned about SMILES in 1998, when I went to work for Bioreason and used the Daylight toolkit as part of a system to apply machine learning methods to high-throughput screening. Yet even after 18 years, I continue to find new ways to work with SMILES. I recently came up with methods which I haven't seen described elsewhere, so I figured I would share those discoveries. Then I started adding more examples. This is part 1.

Check for an element using a substring search

I like to work with SMILES at the syntax level. This is a light-weight approach based on string processing. A heavy-weight approach is to use a cheminformatics toolkit which can interpret the full SMILES chemistry model.

For example, if you have a set of valid SMILES strings and you want to select only those which contain at least one bromine atom, then the light-weight approach is to find the strings which contain the substring "Br" while the heavy-weight approach is to parse the string into a molecule object, iterate over the atoms, and select the molecule if one of the atoms is a bromine.

def lightweight_has_bromine_atom(smiles):
    return "Br" in smiles

from rdkit import Chem
def heavyweight_has_bromine_atom(smiles):
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        # SMILES could not be parsed
        return False
    return any(atom for atom in mol.GetAtoms() if atom.GetAtomicNum() == 35)
The light-weight approach works because a valid SMILES string contains the two characters "Br" if and only if the structure has a bromine.

Test the 'Br' search code

It's easy to make a subtle mistake with light-weight methods, so it's extra important to include good tests, and preferably also use a cross-comparison with a heavy-weight equivalent to verify that the tests are correct. Here is a simple test case for the above functions:

# Pass in the function to test
def _test_has_bromine_atom(has_bromine):
    for smiles in (
            "Br",
            "[Br]",
            "[80Br]",
            "CCBr",
            "BCCBr",
            "B[Br]",
            ):
        assert has_bromine(smiles), smiles
    for smiles in (
            "B",
            "c1ccccc1B",
            "B[Cr]",
            ):
        assert not has_bromine(smiles), smiles

# Test the light-weight and heavy-weight functions
def test_has_bromine_atom():
    _test_has_bromine_atom(lightweight_has_bromine_atom)
    _test_has_bromine_atom(heavyweight_has_bromine_atom)
(Once you've cross-verified that the test cases are correct, you can often remove the heavy-weight versions that were part of the bootstrap process.)

Why use a light-weight method?

You might use a light-weight approach like this because string processing is much faster than fully processing a molecule. It's fast enough that it can even be used as part of substructure screening before doing the full atom-by-atom substructure test.

For an example that came up recently in one of my projects, the user can skip any structures with more than N non-hydrogen atoms, typically around 50, and structures with fewer than R rings, typically 1. If a couple of simple requirements about the SMILES format are met, then it's possible to do both tests before even touching the toolkit, which can save a lot of time if many structures can be skipped.

In future essays I'll describe how to do those two tests.

The heavy-weight approach has its own advantages. It can detect if an input like "BrBr" is valid or invalid, it can handle a wider variety of notations, and it requires less cleverness to use correctly. Indeed, this entire set of essays is all about "clever" ways to work on notation, and is part of a tradition that goes back to at least the 1950s, when people would do an often approximate substructure search directly on the WLN syntax using punched-card machines.

Check for an element using a regular expression

You can test for many other elements using a simple substring search, like chlorine/"Cl", vanadium/"V", and mercury/"Hg", because Br, Cl, Hg, and many other element symbols appear in a valid SMILES string if and only if the given element is part of the structure.

Some elements can't be matched by a simple substring test, either because the element symbol is a substring of a larger element symbol or because there are other ways for the characters of the element symbol to appear in a SMILES string. These can sometimes be handled with a regular expression.

Fluorine

You cannot identify fluorine-containing structures by a substring test for "F", unless you know there are no iron/"Fe", francium/"Fr", fermium/"Fm", or flerovium/"Fl" atoms in the SMILES data set. Realistically, the last three have maximum known half-lives of 22 minutes, 100.5 days, and 5.5 seconds, so you're very likely to know that your compound collection doesn't contain those structures, but iron is pretty common.

Instead of using a substring test, a light-weight search might look for an "F" which is either at the end of the string (as in "CF") or isn't immediately followed by one of the letters in "[erml]". I'll implement this using the regular expression:

F($|[^erml]) # regular expression to test if a SMILES contains a fluorine atom

Here's the full implementation and a small test suite:

 
import re

_f_atom_pat = re.compile(r"F($|[^erml])")   # the regular expression shown above

def has_fluorine_atom(smiles):
    return _f_atom_pat.search(smiles) is not None

def test_has_fluorine_atom():
    for smiles in (
            "F",
            "[F]",
            "[19F]",
            "[19F-]",
            "F[Fe]",
            "[Fe]CF",
            "Fc1ccccc1",
            "c1ccccc1F",
            "c1cc(F)ccc1",
        ):
        assert has_fluorine_atom(smiles), smiles
    for smiles in (
            "[Fe]",
            "[Fr]",
            "[Fm]",
            "[Fl]",
        ):
        assert not has_fluorine_atom(smiles), smiles

You might think you could check for an "F" so long as it isn't followed by a lowercase letter, using a pattern like "F($|[^a-z])". That would match "F", but not "Fe" or "Fr". It does not work because it would also miss the "F" in "Fc1ccccc1".

The hard part of using a light-weight approach is figuring out corner cases like this. In addition to hand-made tests like the ones you just saw, I typically cross-compare against a heavy-weight toolkit using a large, diverse SMILES collection, to help reduce the chances that I overlooked something.

Scandium

It's almost true that any test for a two-letter symbol can be written as a simple substring test. One of the most notable counter-examples is scandium.

A search for "Sc" will indeed identify all SMILES with a scandium atom. It will also match structures like "Sc1ccncc1", where a sulfur is attached to an aromatic carbon. Most data sets contain many more organosulfur compounds than scandium-containing molecules, so the matches from a substring search for "Sc" will mostly be false positives.

Instead, a scandium search will need to test for "Sc" inside of square brackets. According to OpenSMILES the content of a "bracket_atom" is:

bracket_atom ::= '[' isotope? symbol chiral? hcount? charge? class? ']'
(I use the OpenSMILES definition instead of the Daylight SMILES definition because the latter is less strict about the order of the modifiers after the atom symbol. Daylight SMILES allows both "[CH+]" and "[C+H]", while OpenSMILES only allows the first. There is near-universal agreement among SMILES generation tools to produce only the first form, and the restriction is less ambiguous, less error-prone (what does "[C+H-]" mean?), and easier to reason about at the syntax level.)

I can match a scandium by looking for the character "[" followed by an optional isotope number, followed by "Sc". There's no need to match the rest of the bracket atom term, so I'll use the following regular expression:

\[[0-9]*Sc   # regular expression to test if a SMILES contains a scandium atom
which leads to the following implementation and test:

_scandium_atom_pat = re.compile(r"\[[0-9]*Sc")
def has_scandium_atom(smiles):
    return _scandium_atom_pat.search(smiles) is not None

def test_has_scandium_atom():
    for smiles in (
            "[Sc]",
            "C[Sc]",
            "[Sc]C",
            "[45Sc]",
            "[45ScH+]",
            "[46Sc-3]",
            "Cl[Scl](Cl)Cl",
            ):
        assert has_scandium_atom(smiles), smiles
    for smiles in (
            "Sc1ccccc1",
            "[S]c1ccccc1",
            "[S]C",
            "[Sb]C",
            ):
        assert not has_scandium_atom(smiles), smiles

I used scandium as the example because it's likely to give a large number of false positives. I could have given similar examples with aromatic boron, because lead/"Pb" looks the same as a phosphorus/"P" connected to an aromatic boron/"b", and niobium/"Nb" looks the same as a nitrogen/"N" connected to an aromatic boron.

If you don't want any false positives for Pb or Nb, you'll have to use a regular expression like "\[[0-9]*Pb" or "\[[0-9]*Nb". On the other hand, aromatic boron is rare, especially compared to lead, so a substring search for "Pb" or "Nb" will have few false positives. That said, if you justify the approximate substring search over the exact regular expression test for performance reasons, please measure the difference to see if it's meaningful.
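
For example, an exact test for lead, in the same style as the scandium test from earlier, might look like this (the pattern and function names are my own):

_lead_atom_pat = re.compile(r"\[[0-9]*Pb")

def has_lead_atom(smiles):
    return _lead_atom_pat.search(smiles) is not None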

Cobalt and osmium

Cobalt/"Co" and osmium/"Os" are interesting middle cases which highlight the difference between valid SMILES syntax and valid chemistry. In principle the substring "Co" can also match an aliphatic carbon attached to an aromatic oxygen, and "Os" can match an aliphatic oxygen attached to an aromatic sulfur.

However, that's chemically unreasonable. An aromatic atom must be part of an aromatic ring. The bond to the preceding aliphatic atom is not one of that ring's bonds, so the aromatic atom needs two more (aromatic) bonds to close its ring, for three connections in all. For oxygen that can only happen if the atom is charged, but an organic-subset "o" written outside of brackets can't carry a charge. For sulfur, well, I don't know enough chemistry to reject it outright, but I searched PubChem and found no matches for the SMARTS "So(:a):a".
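
If you want to repeat that sort of check against your own collection, a minimal RDKit sketch might look like the following. This only illustrates the idea; the variable and function names are my own, and it is not how I ran the PubChem search:

from rdkit import Chem

_query = Chem.MolFromSmarts("So(:a):a")

def matches_query(smiles):
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:   # skip records which don't parse
        return False
    return mol.HasSubstructMatch(_query)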

This means it's safe to search a chemical database for "Co" and likely also for "Os" and not get any false positives.

Aromatic atoms don't have to indicate aromatic ring membership

With one odd exception.

The lower-case letter doesn't really mean the atom must be in an aromatic ring. Most toolkits interpret it as a hint that the atom is sp2 hybridized as part of the input aromaticity processing. Here's an example from RDKit with an aromatic oxygen in the input, but with no other aromatic atoms, and where the toolkit doesn't actually perceive it as aromatic:

>>> from rdkit import Chem
>>> Chem.CanonSmiles("C1CCCCCo1")
'C1CCCOCC1'
(Here's an obscure detail for you. Daylight SMILES allows "cc" as a SMILES representation of ethene, according to Weininger's chapter at page 85 of the Handbook of Chemoinformatics. This is not universally accepted by other toolkits.)

If you have an aromatic oxygen which isn't part of an aromatic ring then a substring test for "Co" has a higher chance of giving a false positive. However, you are not going to come across cases like this in real data sets and I'm not going to worry about non-portable SMILES like that.

Carbons

This section uses an advanced regular expression syntax. I don't suggest you use this in real code, at least not without a lot more testing. I present it here more to show off what can be done with a light-weight approach.

It's difficult, but not impossible, to test if a carbon atom is present. One problem is that the aliphatic carbon "C" is the first letter of 11 other atomic element symbols, including chlorine. Another problem is that the aromatic carbon "c" is the second letter of element symbols like actinium/"Ac", scandium/"Sc", and technetium/"Tc". That right away tells you that substring tests aren't possible. The third problem is that "C", "c", and "Cl" are all part of the organic subset, that is, the symbols which don't need to be in square brackets.

It's not that hard to detect a "C" or "c" inside of square brackets, using a regular expression like the ones you've seen earlier. Here I'll use the "verbose" regular expression syntax, which lets me write the expression over multiple lines and annotate each component:

 \[            # inside of brackets
 [0-9]*        # skip any isotope
 [Cc]          # either 'C' or 'c'
 [^a-z]        # and not part of a two-letter element

The harder problem is how to detect a C or c which is outside of square brackets. For that, I use a negative lookahead assertion to test that the match is not followed by some other pattern. For example, here's a pattern to check that a "C" is not followed by an "l":

 C             # C
 (?!l)         # not followed by an 'l' (which would be chlorine) ...

I'll also require that it not be followed by a run of characters containing no "[" or "]" and then a closing "]", because that would mean the match is inside a bracket atom:

 C             # C
 (?!l)         # not followed by an 'l' (which would be chlorine) ...
 (?![^][]*\])  # .. and not inside of brackets

This isn't the best of regular expressions, because it may scan all the way to the end of the string looking for a square bracket character which might not be there. It's possible to construct better regular expressions, perhaps by expressing the optional chirality, optional hydrogen count, and so on, but that would be more complicated. A less complicated improvement is to note that the characters "()=#" are not allowed inside of bracket atoms but are common outside of them, so the negative lookahead assertion can stop as soon as it finds one of those characters. I'll leave working out the details to you.
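
Here's a sketch of what that refinement might look like. I haven't tested it beyond convincing myself that it should behave the same as the version below on valid SMILES; the only change is that the lookahead's character class also excludes "()=#", so the scan can stop at the first character which cannot occur inside a bracket atom:

 C                 # C
 (?!l)             # not followed by an 'l' (which would be chlorine) ...
 (?![^][()=#]*\])  # .. and not inside of brackets; the scan stops at (, ), =, #, or [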

Here it is all put together, with some basic tests:

# WARNING: I haven't fully tested this. Let me know of any problems.
_carbon_atom_pat = re.compile(r"""
(
 \[            # inside of brackets
 [0-9]*        # skip any isotope
 [Cc]          # either 'C' or 'c'
 [^a-z]        # and not part of a two-letter element
) | (
 C             # C (in the organic subset)
 (?!l)         # not followed by an 'l' (which would be chlorine) ...
 (?![^][]*\])  # .. and not inside of brackets
) | (
 c             # c (in the organic subset)
 (?![^][]*\])  # .. and not inside of brackets
)
""", re.X)  # re.X and re.VERBOSE enable verbose mode syntax

def has_carbon_atom(smiles):
    return _carbon_atom_pat.search(smiles) is not None

def test_has_carbon_atom():
    for smiles in (
            "C",
            "[C]",
            "[12C]",
            "[CH4]",
            "NC",
            "ClNC",
            "c1ccccc1",
            "c1ccccc1[Sc]",
            "n1nnnc1",
            "C[Cl]",
            "C[S]",
            "[cH]1[cH][cH][cH][cH][cH]1",
            "n1nnn[cH]1",
            "N[13C@](P)(O)Br",
            ):
        assert has_carbon_atom(smiles), smiles
    for smiles in (
            "Cl",
            "[Cl]",
            "[35Cl]",
            "O=O",
            "[Al]",
            "[Cl+]",
            "[Sc]",
            ):
        assert not has_carbon_atom(smiles), smiles

It isn't pretty, but I think it does work, and it gives you a fast way to find the inorganic (carbon-free) structures in your data set.

Digressing a bit, it might be possible to use a negative lookbehind assertion instead. Python's lookbehind assertions, like those in most regular expression implementations, are limited to fixed-length patterns. That's okay, since real-world isotope values need at most three digits, so there are only four lengths to enumerate. Even with that limitation, I wasn't able to make a valid regular expression which passed the tests.

Hydrogens

It's impossible to count the number of hydrogens without having most of a SMILES parser. How is simple string analysis supposed to know that "O=O" has no hydrogens while "C=C" has four? Then add support for charges, atoms with multiple allowed valences, and aromaticity.

The only exceptions are when the hydrogens are written as explicit atoms, like "[C]([H])([H])([H])[H]", or with an explicitly specified implicit hydrogen count, like "[CH4]". Both forms are rare.

If they are all written as explicit atoms, with no isotope, charge, or other atom properties, then a substring test for "[H]" will work. Otherwise, use a scandium-like regular expression:

\[[0-9]*H[^a-z]
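
Put together, a minimal helper based on that pattern might be (the names are my own):

_explicit_h_pat = re.compile(r"\[[0-9]*H[^a-z]")

def has_explicit_hydrogen(smiles):
    # Only sees hydrogens written as their own bracket atoms, like "[H]" or "[2H+]";
    # it knows nothing about implicit hydrogens or hcounts like "[CH4]".
    return _explicit_h_pat.search(smiles) is not None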

As for explicitly specified implicit hydrogens, stay tuned for my next essay, where I will go into atom-based properties.

Counts and multiple matches

I started this essay by pointing out that a substring test for "Br" is equivalent to testing if the structure contains a bromine. This can easily be extended to count the number of bromine atoms, by counting the number of times "Br" occurs in the SMILES string.

def get_num_bromine_atoms(smiles):
    return smiles.count("Br")

In fact, every substring test can be converted to a count function, so "smiles.count("V")" gives the number of vanadium atoms.

Count based on regular expressions

The regular expression tests can also be turned into a count function, by using the "findall()" method instead of the "search()" method. For example, the following counts the number of scandium atoms in a SMILES string using the _scandium_atom_pat from earlier:

_scandium_atom_pat = re.compile(r"\[[0-9]*Sc")
def get_num_scandium_atoms(smiles):
    return len(_scandium_atom_pat.findall(smiles))

along with some basic tests:

def test_get_num_scandium_atoms():
    for expected_count, smiles in (
            (0, "Sc1ccccc1"),
            (0, "[S]c1ccccc1"),
            (0, "[S]C"),
            (0, "[Sb]C"),
            (1, "[Sc]"),
            (1, "C[Sc]"),
            (1, "[Sc]C"),
            (1, "[45Sc]"),
            (1, "[45ScH+]"),
            (1, "[46Sc-3]"),
            (1, "Cl[Scl](Cl)Cl"),
            (2, "[Sc]Sc[Sc]"), # not a valid SMILES
            (3, "[Sc][S]Scc[Sc][45S]c[Sc]"), # also not a valid SMILES
            ):
        got_count = get_num_scandium_atoms(smiles)
        assert got_count == expected_count, (smiles, got_count, expected_count)

If you do this, make sure that your regular expression pattern doesn't accidentally consume additional characters. For example, you can test for boron by looking for a "B" followed by something that isn't one of the letters in "[aehikr]", perhaps with the regular expression:

B[^aehikr] # finds boron, but consumes the next character, so it matches both characters of "BB"

This would report that the SMILES "BB" contains a boron, but the single match consumes both the first and the second B, since "B" is not one of those lowercase letters. If you use this regular expression with findall(), the search continues after the second B, so you end up with one match instead of two boron atoms.

To fix this, switch from "[^aehikr]" to the negative lookahead assertion "(?![aehikr])":

B(?![aehikr]) # only matches boron element symbols

NOTE! This will also match some chiral specifications, like "@TB1". However, I have never seen those outside of the SMILES specification. If your data set does contain them, then you will need to use more advanced techniques, like the tokenizer approach I will present in a future essay.
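
Put together, a boron counter along those lines might look like this (my own names, and only lightly tested):

_boron_atom_pat = re.compile(r"B(?![aehikr])")

def get_num_boron_atoms(smiles):
    # Counts "B" element symbols; per the note above, an unusual chiral
    # specification like "@TB1" would also be counted.
    return len(_boron_atom_pat.findall(smiles))

def test_get_num_boron_atoms():
    for expected_count, smiles in (
            (0, "Br"),
            (0, "[Be]"),
            (1, "[B]"),
            (1, "B(F)(F)F"),
            (2, "BB"),  # two boron symbols, whatever the chemistry
            ):
        got_count = get_num_boron_atoms(smiles)
        assert got_count == expected_count, (smiles, got_count, expected_count)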

Match multiple elements based on regular expressions

It's also possible to combine multiple regular expressions into one larger regular expression, to detect if all of them are present using a single call to search(), rather than using findall() to get match counts. Here is a regular expression which matches at least two scandium atoms:

\[[0-9]*Sc.*\[[0-9]*Sc
(You might also test the performance with the non-greedy ".*?" instead of the greedy ".*"; my tests with "grep -P" showed they were the same.)

However, down this route lie two different forms of combinatorial madness. If there are n scandium atoms and you ask the pattern for n+1, then the backtracker will still try all 2^n possible partial matches before giving up. This will slow things down. You may want to consider using a substring count first, then doing the regular expression test only to filter out false positives.
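
Here is a sketch of that idea for the two-scandium case, reusing _scandium_atom_pat from earlier (the function name is mine):

def has_at_least_two_scandium_atoms(smiles):
    # Cheap reject: every real "[...Sc" also contains the substring "Sc", so
    # fewer than two substring hits means fewer than two scandium atoms.
    if smiles.count("Sc") < 2:
        return False
    # Exact count on the remaining candidates.
    return len(_scandium_atom_pat.findall(smiles)) >= 2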

The other combinatorial explosion comes if you want to find a SMILES string which contains multiple different element types. As an example, if you want the (admittedly unlikely) structures with 1 Pb, 1 Sc, and 1 Hg atom, you'll have to write out all six possible orderings of those terms in the combined regular expression.
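
If you are calling from Python, rather than building a single pattern for a tool like grep, it may be simpler to sidestep the ordering problem and test each element separately. Here's a sketch (the helper name is my own), reusing _scandium_atom_pat and _lead_atom_pat from earlier and a plain substring test for mercury:

def has_pb_sc_and_hg(smiles):
    return (_lead_atom_pat.search(smiles) is not None
            and _scandium_atom_pat.search(smiles) is not None
            and "Hg" in smiles)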

Really, it's much better to tokenize and process each atom than it is to force everything into a regular expression test. I'll describe how to do that in a future essay (and link it from here once I've written it).



Copyright © 2001-2013 Andrew Dalke Scientific AB