Dalke Scientific Software: More science. Less time. Products
[ previous | newer ]     /home/writings/diary/archive/2005/05/07/attachment_points

Attachment points

Last December I wrote an essay describing how to generate a combinitorial library with SMILES. Last month someone asked me how to make the fragment libraries in the first place. There are two ways: old-skool and next-gen. The one I outlined is old-skool. It uses the Daylight SMILES syntax that's been around since the 1980s. The next-gen style uses an OpenEye SMILES extension for external attachment points

To demonstrate them I'll need a set of compounds for the core and the side group. I am not a chemist. My background is physics, math and computer science and my graduate work was in computational biophysics of protein structures. The reaction I'm most familar with is amino acid synthesis. For this example I'll assume any primary amide (C-[NH3+]) can be made to react with any carboxyl (C(=O)[O-]) and ignore the real chemistry.

I'll use a couple of the PubChem data sets as my source library. Specifically, compounds 500001-510000 because I already had it for previous examples and 1-10000 because it contains some amides. From these I'll extract the compounds that have an one and only one amide group and those with one and only one carboxyl. I'll also skip any compound which has more than one of each group (don't want to make a polymer), compounds which have more than one component (most likely salts), or compounds with a net charge.

A proper system should do a better job than this. For example, if there's an amide then the structure file may have a negatively charged salt to balance the overall charge. That should be okay if matched with a similarly charged carboxyl structure. I don't want to handle all those details because that's not the point of this essay.

The following program reads through the data files and saves the SMILES and compound id of the matching amide and carboxyl groups to the SMILES files amide.smi and carboxyl.smi. Because I was curious I also printed some information about which compounds did not pass some of the filters. I made one helper function named read_multiple_files() to make it easier to read all of the compounds from a list of structure filenames.

from openeye.oechem import *

def read_multiple_files(*filenames):
    """iterate over each of the OEGraphMols in the named files"""
    istrm = oemolistream()
    for filename in filenames:
        # I should report problems on failure
        if istrm.open(filename):
            for cmpd in istrm.GetOEGraphMols():
                yield cmpd
            istrm.close()

def main():
    # Recognize the two sides of the reaction
    amide = OESubSearch("C[NH3+]")
    carboxyl = OESubSearch("[C](=O)[O-]")

    # Save the SMILES and compound id for each matching compound
    amide_out = open("amide.smi", "w")
    carboxyl_out = open("carboxyl.smi", "w")

    for cmpd in read_multiple_files(
        "/Users/dalke/databases/compounds_000001_010000.sdf.gz",
        "/Users/dalke/databases/compounds_500001_510000.sdf.gz",
        ):
        # Skip anything with a net charge
        charge = OENetCharge(cmpd)
        if charge:
            continue

        # Get the match counts and compound id
        n_amide = len(list(amide.Match(cmpd)))
        n_carboxyl = len(list(carboxyl.Match(cmpd)))
        # Skip compounds that don't have either group
        if not (n_amide or n_carboxyl):
            continue
        
        id = OEGetSDData(cmpd, "PUBCHEM_COMPOUND_CID")
        smi = OECreateCanSmiString(cmpd)

        # Skip compounds with salts or other components
        if "." in smi:
            print id, "has multiple components:", smi
            continue

        if n_amide:
            if n_carboxyl:
                print id, "has both groups:", smi
            elif n_amide > 1:
                print id, "has", n_amide, "amide groups:", smi
            else:
                # n_amide == 1
                print id, "has one amide group:", smi
                amide_out.write("%s %s\n" % (smi, id))
        else:
            # n_amide == 0
            if n_carboxyl > 1:
                print id, "has", n_carboxyl, "carboxyl groups:", smi
            else:
                print id, "has one carboxyl group:", smi
                carboxyl_out.write("%s %s\n" % (smi, id))

if __name__ == "__main__":
    # Suppress warning messages
    OEThrow.SetLevel(OEErrorLevel_Error)
    main()

Running the program produced two short files:

% wc amide.smi carboxyl.smi 
       2       4      65 amide.smi
      15      30     675 carboxyl.smi
      17      34     740 total
% cat amide.smi
C(CN(CCN)N(N=O)[O-])[NH3+] 4518
CC(C)N(CCC[NH3+])N(N=O)[O-] 4520
% cat carboxyl.smi
CC(=O)OC(CC(=O)[O-])C[N+](C)(C)C 1
C[N+](C)(C)CC(=O)[O-] 247
C[N+](C)(C)CC(CC(=O)[O-])O 288
CCCCCCCCCCCCCCCC(=O)OC(CC(=O)[O-])C[N+](C)(C)C 461
C[N+]1(CCCC1C(=O)[O-])C 554
C[N+](C)(C)CC=CC(=O)[O-] 589
C[N+](C)(C)CCCC(=O)[O-] 725
C[N+]1(CCCC1)CC2=C(N3C(C(C3=O)NC(=O)C(=NOC)c4csc(n4)N)SC2)C(=O)[O-] 2622
CON=C(c1nc(sn1)N)C(=O)NC2C3N(C2=O)C(=C(CS3)C[n+]4ccn5c4cccn5)C(=O)[O-] 2639
CC(C)(C(=O)O)ON=C(c1csc(n1)N)C(=O)NC2C3N(C2=O)C(=C(CS3)C[n+]4ccccc4)C(=O)[O-] 2650
C[n+]1ccccc1C(=O)[O-] 3620
C[S+](CCC(C(=O)[O-])N)CC1C(C(C(O1)n2cnc3c2ncnc3N)O)O 5136
C[n+]1cccc(c1)C(=O)[O-] 5570
c1cc[n+](cc1)CC2=C(N3C(C(C3=O)NC(=O)Cc4cccs4)SC2)C(=O)[O-] 5773
c1cc(c(cc1N[N+]#N)O)C(=O)[O-] 504537
% 

I want to transform the amides so the [NH3+] group looses a hydrogen and the positive charge and gains an attachment point. In old-skool style I'll label the attachment point with the incomplete ring closure %90, so the two amide SMILES strings will become

C(CN(CCN)N(N=O)[O-])[NH2]%90 4518
CC(C)N(CCC[NH2]%90)N(N=O)[O-] 4520
There's no way to generate that in traditional SMILES. The trick is to make a fake atom with atomic weight 0 (atomic symbol "*") and post-process the SMILES string to convert the "*" into a "%90". That works because that's the only way an asterisk can be in a SMILES string so it can be manipulated without needing a full SMILES parser.

If there are several attachment points or a need to tag multiple atoms then the trick is to encode that via atomic weights and generate the isomeric SMILES. The SMILES pattern "[\d+" can only occur because of the atomic weight. So if there's a * atom with atomic weight 90 the resulting isomeric SMILES will have "[90*]" which could be post-processed to become "%90". I don't need that nuance for this essay.

I'll show the steps interactively before writing the program.

    make the molecule and the substructure search pattern

>>> mol = OEMol()
>>> OEParseSmiles(mol, "CC(C)N(CCC[NH3+])N(N=O)[O-]")
True
>>> pat = OESubSearch("C[NH3+]")

    Do the substructure search; by construction there's only one match

>>> matches = pat.Match(mol)
>>> match = list(matches)[0]
>>> match
<C OEChem::OEMatchBase instance at _00344950_p_OEChem__OEMatchBase>

    Get the match atom for the nitrogen.  The order is that of the
    SMARTS pattern and [NH3+] is the 2nd pattern

>>> match_atom = list(match.GetAtoms())[1]
>>> match_atom
<C OEChem::OEMatchPair<(OEChem::OEAtomBase)> instance at
 _0186b418_p_OEChem__OEMatchPairTOEChem__OEAtomBase_t>

    The match atom is a pair of the query atom and the target atom

>>> match_atom.target.GetAtomicNum()
7
>>> match_atom.target.GetImplicitHCount()
3
>>> match_atom.target.GetFormalCharge()
1
>>> 

    Change the implicit H-count and formal charge

>>> match_atom.target.SetImplicitHCount(2)
True
>>> match_atom.target.SetFormalCharge(0)
True

    Add a new "dummy" atom to the amide, with atomic number 0
    
>>> dummy_atom = mol.NewAtom(0)
>>> mol.NewBond(match_atom.target, dummy_atom, 1)
<C OEChem::OEBondBase instance at _0186fcf0_p_OEChem__OEBondBase>

    I can't use the canonical SMILES string because the *
    can become the first atom in the canonical ordering

>>> OECreateCanSmiString(mol)
'*[NH2]CCCN(C(C)C)N(N=O)[O-]'

    Instead I'll use the lower-level function and without the
    OESMILESFlag::Canonical (or any other) flag.  This uses atom
    creation order, so the newly created atom will never be first

>>> OECreateSmiString(mol, 0)
'CC(C)N(CCC[NH2]*)N(N=O)[O-]'
>>> smi = OECreateSmiString(mol, 0)
>>> smi_with_attachment = smi.replace("*", "%90")
>>> smi_with_attachment
'CC(C)N(CCC[NH2]%90)N(N=O)[O-]'

The last part about the atom ordering requires some knowledge of SMILES and how SMILES strings are generated. The "*" is an atom symbol. When I replace it with a "%90" then it becomes a bond symbol. A atom symbol in SMILES can go anywhere a bond symbol can go, but not vice versa. A bond cannot be the first symbol in the SMILES string or be after a dot disconnect or another bond. This new dummy atom is singly bonded to only one other atom so the only worry is if the dummy atom becomes the first atom in the output SMILES. In that case the substitution would create an illegal SMILES string.

The algorithm to create a SMILES string starts by assigning a number to each atom. The first atom symbol in the string comes from the atom with the lowest number. The rest of the string is created by building a spanning tree rooted at that first atom. Ambiguities in the spanning are broken by looking at the atom numbering. The result is the SMILES string for that ordering.

If the number assignment algorithm is invariant up to molecular symmetry then it will always generate the same SMILES string for a given structure and the corresponding string is a unique, or canonical SMILES string.

To ensure that the "*" is not the first character in the output I need to make sure some other atom has priority. (Remember, this is old-skool style; I'll cover a better but OpenEye-specific approach in a bit). To control if canonical atom numbering is done in OEChem, use the OECreateSmiString() function. If the flag OESMILESFlag::Canonical (or OESMILESFlag_Canonical in PyOEChem) is set, it uses a canonical ordering. If not set, it uses the atom's index. That is, the first atom created for a molecule (more technically, each connected component) will always be the first atom in its non-canonical SMILES. The dummy atom is the last atom created and since the SMARTS matched I know there's at least one other carbon in the structure which means the non-canonical ordering will never have the dummy atom first.

In the Daylight toolkit you can go one step further and manually set the arborder of each atom. The dt_arbsmiles() function computes a compounds SMILES using this ordering instead of the standard canonical one computed by dt_cansmiles().

By the way, canonical atom number takes non-linear time. It may be exponential for the general case but molecules are degree limited so I think in practice it's polynomial time. If you need a SMILES string for a large protein and don't need it to be canonical then you should generate the non-canonical SMILES instead. It might save you a few seconds.

Here's a program that converts the existing amide.smi and prints a new SMILES file with "%90" attachment points on the amides.

# Make an attachment point on the primary amide of each
# compound in the file "amide.smi"

from openeye.oechem import *

amide_pattern = OESubSearch("C[NH3+]")

def make_amide_attachment(mol, attachment = "%90"):
    matches = amide_pattern.Match(mol)

    # Assume there is one match and that I only need
    # to change the first.
    match = list(matches)[0]

    # Get the nitrogen
    target = list(match.GetAtoms())[1].target

    # I know it's a [NH3+] so set the H-count and charge
    # without additional checking.
    target.SetImplicitHCount(2)
    target.SetFormalCharge(0)

    # Create a new dummy atom and bond it to the nitrogen
    dummy = mol.NewAtom(0)
    mol.NewBond(target, dummy, 1)

    # Make the non-canonical SMILES string so the "*" won't be
    # first, then replace the "*" with the given attachment text.
    smi = OECreateSmiString(mol, 0)
    return smi.replace("*", attachment)
    
def main():
    filename = "amide.smi"

    mol = OEMol()
    for line in open(filename):
        smi, id = line.split()

        if not OEParseSmiles(mol, smi):
            raise AssertionError("Cannot parse %r" % (smi,))
        
        new_smi = make_amide_attachment(mol)
        print new_smi, id

        # reset for the next compound
        mol.Clear()

if __name__ == "__main__":
    main()
When run it gives the desired result:
C(CN(CCN)N(N=O)[O-])[NH2]%90 4518
CC(C)N(CCC[NH2]%90)N(N=O)[O-] 4520

It works, but it's only half the solution. I need to convert the carboxyls as well. Here's another program which does that. The only real difference is I remove the [O-] atom and attach the dummy atom to the carbon of the carboxyl. It was easier to delete the oxygen and create the new atom than it was to transmute it.

from openeye.oechem import *

carboxyl_pattern = OESubSearch("[C](=O)[O-]")

def make_carboxyl_attachment(mol, attachment = "%90"):
    matches = carboxyl_pattern.Match(mol)
    
    # Assume there is one match and that I only need
    # to change the first.
    match = list(matches)[0]

    match_atoms = list(match.GetAtoms())
    C_atom = match_atoms[0].target
    O_atom = match_atoms[2].target

    # Replace the oxygen with a dummy atom
    mol.DeleteAtom(O_atom)

    dummy_atom = mol.NewAtom(0)
    mol.NewBond(C_atom, dummy_atom, 1)

    # Make the non-canonical SMILES string so the "*" won't be
    # first, then replace the "*" with the given attachment text.
    smi = OECreateSmiString(mol, 0)
    return smi.replace("*", attachment)

def main():
    filename = "carboxyl.smi"

    mol = OEMol()
    for line in open(filename):
        smi, id = line.split()

        if not OEParseSmiles(mol, smi):
            print "Cannot parse %r" % (smi,)
            mol.Clear()
            continue
        
        new_smi = make_carboxyl_attachment(mol)
        print new_smi, id
        
        mol.Clear()

if __name__ == "__main__":
    main()
and the output looks right, with
CC(=O)OC(CC(=O)%90)C[N+](C)(C)C 1
C[N+](C)(C)CC(=O)%90 247
C[N+](C)(C)CC(CC(=O)%90)O 288
CCCCCCCCCCCCCCCC(=O)OC(CC(=O)%90)C[N+](C)(C)C 461
C[N+]1(CCCC1C(=O)%90)C 554
C[N+](C)(C)CC=CC(=O)%90 589
C[N+](C)(C)CCCC(=O)%90 725
C[N+]1(CCCC1)CC2=C(N3C(C(C3=O)NC(=O)C(=NOC)c4csc(n4)N)SC2)C(=O)%90 2622
CON=C(c1nc(sn1)N)C(=O)NC2C3N(C2=O)C(=C(CS3)C[n+]4ccn5c4cccn5)C(=O)%90 2639
CC(C)(C(=O)O)ON=C(c1csc(n1)N)C(=O)NC2C3N(C2=O)C(=C(CS3)C[n+]4ccccc4)C(=O)%90 2650
C[n+]1ccccc1C(=O)%90 3620
C[S+](CCC(C(=O)%90)N)CC1C(C(C(O1)n2cnc3c2ncnc3N)O)O 5136
C[n+]1cccc(c1)C(=O)%90 5570
c1cc[n+](cc1)CC2=C(N3C(C(C3=O)NC(=O)Cc4cccs4)SC2)C(=O)%90 5773
c1cc(c(cc1N[N+]#N)O)C(=O)%90 504537

There's one fundamental problem with this solution. It's tedious. That increases the chances of writing buggy code and makes it harder to review and maintain the code.

A cleaner solution, somewhere between old-skool and next-gen, is to describe the transformation as a reaction. SMILES is a notation for chemical compounds. SMIRKS is a notation for chemical reactions. I can describe the two reactions implemented above as

amide_smirks = "[C:1][NH3+]>>[C:1][NH2]*"
carboxyl_smirks = "[C:1](=[O:2])[O-]>>[C:1](=[O:2])*"

The first SMIRKS says to map the carbon to itself (that's what the ":1" atom mapping means) and to remove the "[NH3+]" and replace it with the two atoms "[NH2]*". I tried using the atom map ":2" but it looks like OEChem's SMIRKS implementation, which is in beta now, doesn't yet support that. Daylight's should be fine but I don't have ready access to it.

The second SMIRKS says to keep the carbon and the double bonded oxygen of the carboxyl but to remove the negatively charged oxygen and attach a dummy atom instead. I didn't need the ":2"; the result would have been the same for this case. An advantage of keeping the atom is that this preserves other atomic information, like the atomic weight ("isotope" in OEChem).

To use a SMIRKS in OEChem, start by making an OEUniMolecularRxn() and Init() it with the SMIRKS pattern. This is exactly parallel to the normal way to create OESubSearch() objects for a SMARTS search. You can pass the pattern string into the constructor but I tend to do so only when I know the pattern is valid. I was experimenting with different SMIRKS so wasn't sure.

The reaction object is a callable object (meaning it's called like a function). It takes the molecule to transform and manipulates it in-place. Here's an example which converts any non-aromatic oxygen connected to any carbon into a sulpher.

>>> rxn = OEUniMolecularRxn()
>>> rxn.Init("[#6:1][O:2]>>[#6:1][S:2]")
True
>>> mol = OEMol()
>>> OEParseSmiles(mol, "OCc1ccccc1O")
True
>>> OECreateSmiString(mol, 0)
'OCc1ccccc1O'
>>> rxn(mol)
True
>>> OECreateSmiString(mol, 0)
'SCc1ccccc1S'
>>> 
What's neat is if I don't use the ":2" atom map for the O->S transition then the order of the atoms in the resulting non-canonical SMILES changes. In this form the the first atom (an oxygen) was deleted and a new atom, the sulpher, appended to the atom list. This makes the non-aromatic carbon be the first atom in the atom list, and hence the first atom in the non-canonical SMILES output.
>>> rxn = OEUniMolecularRxn()
>>> rxn.Init("[#6:1]O>>[#6:1]S")
True
>>> mol = OEMol()
>>> OEParseSmiles(mol, "OCc1ccccc1O")
True
>>> OECreateSmiString(mol, 0)
'OCc1ccccc1O'
>>> rxn(mol)
True
>>> OECreateSmiString(mol, 0)
'C(c1ccccc1S)S'
>>> 
(Well, I thought it was neat.)

Here's code that uses a reaction to put a dummy atom at the right spot of the the two data sets. As you can see it's smaller, more readable, and just generally nicer.

from openeye.oechem import *

amide_smirks = "[C:1][NH3+]>>[C:1][NH2]*"
carboxyl_smirks = "[C:1](=[O:2])[O-]>>[C:1](=[O:2])*"

mol = OEGraphMol()
for (filename, smirks) in (
    ("amide.smi", amide_smirks),
    ("carboxyl.smi", carboxyl_smirks)):

    rxn = OEUniMolecularRxn()
    if not rxn.Init(smirks, True):
        raise SystemExit("Cannot parse SMIRKS: %r" % (smirks,))

    print " ==== processing", filename, "===="
    for line in open(filename):
        smi, id = line.split()
        if not OEParseSmiles(mol, smi):
            raise SystemExit("Cannot parse SMILES: %r" % (smi,))

        # Transform the molecule using the current reaction
        rxn(mol)

        new_smi = OECreateSmiString(mol, 0).replace("*", "%90")
        print new_smi, id

        # Reset for the next time through the loop
        mol.Clear()
Here's the output
 ==== processing amide.smi ====
C(CN(CCN)N(N=O)[O-])N%90 4518
CC(C)N(CCCN%90)N(N=O)[O-] 4520
 ==== processing carboxyl.smi ====
CC(=O)OC(CC(=O)%90)C[N+](C)(C)C 1
C[N+](C)(C)CC(=O)%90 247
C[N+](C)(C)CC(CC(=O)%90)O 288
CCCCCCCCCCCCCCCC(=O)OC(CC(=O)%90)C[N+](C)(C)C 461
C[N+]1(CCCC1C(=O)%90)C 554
C[N+](C)(C)CC=CC(=O)%90 589
C[N+](C)(C)CCCC(=O)%90 725
C[N+]1(CCCC1)CC2=C(N3C(C(C3=O)NC(=O)C(=NOC)c4csc(n4)N)SC2)C(=O)%90 2622
CON=C(c1nc(sn1)N)C(=O)NC2C3N(C2=O)C(=C(CS3)C[n+]4ccn5c4cccn5)C(=O)%90 2639
CC(C)(C(=O)O)ON=C(c1csc(n1)N)C(=O)NC2C3N(C2=O)C(=C(CS3)C[n+]4ccccc4)C(=O)%90 2650
C[n+]1ccccc1C(=O)%90 3620
C[S+](CCC(C(=O)%90)N)CC1C(C(C(O1)n2cnc3c2ncnc3N)O)O 5136
C[n+]1cccc(c1)C(=O)%90 5570
c1cc[n+](cc1)CC2=C(N3C(C(C3=O)NC(=O)Cc4cccs4)SC2)C(=O)%90 5773
c1cc(c(cc1N[N+]#N)O)C(=O)%90 504537
and to double-check that I can mix fragments from the two libraries
>>> mol = OEMol()
>>> OEParseSmiles(mol, "CC(C)N(CCCN%90)N(N=O)[O-].C[N+]1(CCCC1C(=O)%90)C")
True
>>> OECreateCanSmiString(mol)
'CC(C)N(CCCNC(=O)C1CCC[N+]1(C)C)N(N=O)[O-]'
>>> 

That was old-skool style. It used knowledge of how SMILES generation works to create a SMILES that could be easily converted to a fragment. The approach works with both Daylight's and OpenEye's toolkits.

OEChem extended SMILES to support external bond attachment points. Quoting from the documentation:

The major advantage of these semantics, inspired by Daylight's CHUCKLES, is that it allows convenient enumeration of combinatorial libraries using string concatenation. For example, three components of a library may be specified as C&1CCC&2, F&1 and Br&2. The using the same notation C&1CCC&2.F&1.Br&2 is interpreted as the reaction product, i.e. FCCCCBr.
It's trival to change the above code to generate "&1" instead of "%90". One of the advantages of the & notation over hijacking ring closures is they the two values are in different numbering spaces. "%12" and "&12" do not refer to the same bond connection so I don't need to use an absurdly high ring closure number to avoid possible collisions.

The numbering space used is the same space at the RGroup attachement points and the atom maps. That is, "&1" is identical to "R1" and to "[*:1]". (Does that last one look familiar? :)

The new notation has exactly the same limitations as ring closures (ie, it can't be the first token in the SMILES string). There are no real advantages to using it in the post-processing system I described earlier in this essay. The advantage is that OEChem knows how to generate external bond attachment points directly from the molecule object, without having to resort to post-processing.

The OECreateSmiString() function takes a flag parameter. This is a bit-wise OR of many available flags. One I mentioned earlier is OESMILESFlag::Canonical which tells the function to generate a canonical atom numbering. Another is OESMILESFlag::ExtBonds. Quoting from the documentation:

The OESMILESFlag::ExtBonds flag controls whether atoms with atomic number zero (as determined by the OEAtomBase::GetAtomicNum method), and a non-zero map index (as determined by the OEAtomBase::GetMapIdx method) should be generated the external bond `&1' notation. In this notation, the integer value following the `&' corresponds to the atom's map index. When this flag isn't set, such atoms are written in the Daylight convention `[*:1]'.
The default flags (without the prefix) are RGroups|AtomMaps|Canonical. The first says to write atoms with an atom map and atomic number of 0 using R-group notation. The second says that any remaining atoms with an atom map are written the "[C:1]" notation. The last generates a canonical SMILES. Here's a nice example showing how the different combinations contribute:
>>> mol = OEMol()
>>> OEParseSmiles(mol, "C[*:1]")
True
>>> OECreateSmiString(mol, OESMILESFlag_DEFAULT)
'[R1]C'

    Expanding the DEFAULT flags into its constituent parts

>>> OECreateSmiString(mol, OESMILESFlag_RGroups | OESMILESFlag_AtomMaps | OESMILESFlag_Canonical)
'[R1]C'
>>> OECreateSmiString(mol, OESMILESFlag_AtomMaps | OESMILESFlag_Canonical)'
[*:1]C'
>>> OECreateSmiString(mol, OESMILESFlag_AtomMaps)
'C[*:1]'
>>> OECreateSmiString(mol, OESMILESFlag_ExtBonds)
'C&1'
>>> 
For this essay all I need is the ExtBonds flag. By the way, the SMILES generation knows that the external bond cannot be the first term of the output SMILES so there's no need to worry about that possibility. Feel free to use Canonical if you need it.

If you make the attachment structures "by hand" with something like the make_amide_attachment() and make_carboxyl_attachment() functions then there are two changes to make: give the newly created dummy atom an atom map index and add the ExtBonds flag to the OECreateSmiString() call. Here's how to modify one of the functions defined earlier:

def make_amide_attachment(mol, attachment = "%90"):
    matches = amide_pattern.Match(mol)

    # Assume there is one match and that I only need
    # to change the first.
    match = list(matches)[0]

    # Get the nitrogen
    target = list(match.GetAtoms())[1].target

    # I know it's a [NH3+] so set the H-count and charge
    # without additional checking.
    target.SetImplicitHCount(2)
    target.SetFormalCharge(0)

    # Create a new dummy atom which will have an external bond
    # attachment point label of 1, then bond it to the nitrogen
    dummy = mol.NewAtom(0)
    dummy.SetMapIdx(1)
    mol.NewBond(target, dummy, 1)

    # use the external bond attachment point labeling
    return OECreateSmiString(mol, OESMILESFlag_ExtBonds)

Unfortunately the external bond attachment point notation doesn't interact well with the reaction transforms because both use atom maps. The atom maps in a SMIRKS must be balanced. As far as I can tell there's no way to specify that the newly created dummy atom has an unmatched atom map number or R-Group. Instead I'll need to set the atom map index of any dummy atoms after doing the transform. It's still a post-processing step but one that works directly on the data structure instead of on the syntax, so it's less likely to fail in strange ways.

The changes are minor so here's the updated version of the SMIRKS-based fragment generator:

from openeye.oechem import *

amide_smirks = "[C:1][NH3+]>>[C:1][NH2]*"
carboxyl_smirks = "[C:1](=[O:2])[O-]>>[C:1](=[O:2])*"

# Give each dummy atoms a unique map index
def set_dummy_atom_map_indicies(mol):
    idx = 1
    for atom in mol.GetAtoms():
        if atom.GetAtomicNum() == 0:
            atom.SetMapIdx(idx)
            idx = idx + 1
    # I could check that one and only one dummy atom was
    # created but I won't worry about that for now.

mol = OEGraphMol()
for (filename, smirks) in (
    ("amide.smi", amide_smirks),
    ("carboxyl.smi", carboxyl_smirks)):

    rxn = OEUniMolecularRxn()
    if not rxn.Init(smirks, True):
        raise SystemExit("Cannot parse SMIRKS: %r" % (smirks,))

    print " ==== processing", filename, "===="
    for line in open(filename):
        smi, id = line.split()
        if not OEParseSmiles(mol, smi):
            raise SystemExit("Cannot parse SMILES: %r" % (smi,))

        # Transform the molecule using the current reaction
        rxn(mol)

        set_dummy_atom_map_indicies(mol)
        
        new_smi = OECreateSmiString(mol, OESMILESFlag_ExtBonds)
        print new_smi, id

        # Reset for the next time through the loop
        mol.Clear()
and its output is
 ==== processing amide.smi ====
C(CN(CCN)N(N=O)[O-])N&1 4518
CC(C)N(CCCN&1)N(N=O)[O-] 4520
 ==== processing carboxyl.smi ====
CC(=O)OC(CC&1=O)C[N+](C)(C)C 1
C[N+](C)(C)CC&1=O 247
C[N+](C)(C)CC(CC&1=O)O 288
CCCCCCCCCCCCCCCC(=O)OC(CC&1=O)C[N+](C)(C)C 461
C[N+]1(CCCC1C&1=O)C 554
C[N+](C)(C)CC=CC&1=O 589
C[N+](C)(C)CCCC&1=O 725
C[N+]1(CCCC1)CC2=C(N3C(C(C3=O)NC(=O)C(=NOC)c4csc(n4)N)SC2)C&1=O 2622
CON=C(c1nc(sn1)N)C(=O)NC2C3N(C2=O)C(=C(CS3)C[n+]4ccn5c4cccn5)C&1=O 2639
CC(C)(C(=O)O)ON=C(c1csc(n1)N)C(=O)NC2C3N(C2=O)C(=C(CS3)C[n+]4ccccc4)C&1=O 2650
C[n+]1ccccc1C&1=O 3620
C[S+](CCC(C&1=O)N)CC1C(C(C(O1)n2cnc3c2ncnc3N)O)O 5136
C[n+]1cccc(c1)C&1=O 5570
c1cc[n+](cc1)CC2=C(N3C(C(C3=O)NC(=O)Cc4cccs4)SC2)C&1=O 5773
c1cc(c(cc1N[N+]#N)O)C&1=O 504537

In the SMIRKS approach the details of how to transform the molecule into a fragment are completely described by the SMIRKS. There's no need to write Python code for it. With very little work you can change the last code sample into a program which takes the SMIRKS on the command-line and a list of filenames (or use stdin) and generates a fragment for each the compounds in those files.


Andrew Dalke is an independent consultant focusing on software development for computational chemistry and biology. Need contract programming, help, or training? Contact me



Copyright © 2001-2013 Andrew Dalke Scientific AB