Dalke Scientific Software: More science. Less time.
/home/writings/diary/archive/2004/12/12/library_generation_with_smiles

Combinatorial Library Generation with SMILES

Someone recently asked me how to generate a combinatorial library given a set of fragments.

For the non-chemist readers, combinatorial chemistry uses a core structure and reactions that can attach fragments to the core at given points. This lets chemists search a structure family to find a compound that's "better" in a chemistry space with dimensions including effectiveness, toxicity, digestibility, and ability to reach the right part of the body. (This is for pharmaceutical chemistry; combinatorial chemistry can be used in other domains.)

A core may have one, two, or more fragment attachment points, so many new compounds can be created with this technique. Companies use robots to generate the new compounds and test them against the target, which might be a protein or cell. There can be well over 100,000 tests in an assay. I've worked with a couple of companies to develop tools that help the scientists better understand these sorts of data sets.

To limit the number of compounds created, many people generate virtual libraries and use software to pick the compounds that will be tested via the robots. If the software were good enough we wouldn't need the robots. We've a long way to go.

The person who emailed me asked whether any software is available to generate these virtual libraries. He had been using SMILES strings for the core and fragments and simply concatenating them. That doesn't work in general because it allows at most two attachment points on the core: one at the front and one at the back of the SMILES string.

The easiest way to do this is with ring closures. Suppose the core structure is O1CNCCC1 with attachment points on the 3rd and 5th atoms (the N and the third C). Pick very high ring closure numbers not likely to be seen in real life, like 90 and 91, and add them to the appropriate atoms. The '%' is needed in SMILES for closure numbers greater than 9.

The result is O1CN%90CC%91C1.
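The labeling step can be sketched in a few lines of Python. This is only an illustration, not a full SMILES parser: the atom pattern is a simplifying assumption that recognizes bracket atoms and the organic-subset symbols, and the function name label_attachment_points is made up for this example.

```python
import re

# Simplified atom matcher (an assumption, not a complete SMILES grammar):
# bracket atoms, then the two-letter halogens, then one-letter
# organic-subset atoms, aliphatic and aromatic.
ATOM_PATTERN = re.compile(r"\[[^\]]+\]|Cl|Br|[BCNOPSFI]|[bcnops]")

def label_attachment_points(smiles, labels):
    """Insert %NN ring closures right after the chosen atoms.

    `labels` maps a 1-based atom position to a closure number (10..99).
    """
    pieces = []
    last = 0
    for position, match in enumerate(ATOM_PATTERN.finditer(smiles), start=1):
        # Copy everything up to and including this atom symbol.
        pieces.append(smiles[last:match.end()])
        last = match.end()
        if position in labels:
            pieces.append(f"%{labels[position]:02d}")
    pieces.append(smiles[last:])  # trailing closures, branches, etc.
    return "".join(pieces)

labeled_core = label_attachment_points("O1CNCCC1", {3: 90, 5: 91})
print(labeled_core)  # O1CN%90CC%91C1
```

The new closure is written immediately after the atom symbol, ahead of any ring closure digits the atom already has, which is still valid SMILES.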

Use the same sort of trick to label the fragments. Suppose a fragment is OC=CC=C- and the terminal carbon (the "C-") is to be attached to the nitrogen. The ring closure number for the N is 90 so label the terminal carbon the same, as OC=CC=C%90. To make it easier on me, assume a methyl is attached at the core's C attachment point labeled 91. The corresponding fragment in SMILES is C%91.

To make it all work, concatenate the three strings using the dot disconnect character. The result is

 O1CN%90CC%91C1.OC=CC=C%90.C%91

That's all that's required. When the SMILES parser puts the molecule together it matches the two %90 and the two %91 ring closures to stitch the three parts together.
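Extending this from one compound to a whole virtual library is a matter of looping over every combination of fragments. A minimal Python sketch, assuming the fragments are already labeled with the right closure numbers; the fragment lists beyond the two examples above are invented for illustration, and enumerate_library is a hypothetical helper name:

```python
from itertools import product

def enumerate_library(core, fragment_sets):
    """Yield one dot-joined SMILES per combination of fragments.

    `fragment_sets` holds one list of labeled fragments per attachment point.
    """
    for combo in product(*fragment_sets):
        yield ".".join((core,) + combo)

core = "O1CN%90CC%91C1"
# Hypothetical fragment lists; only the first entry of each comes from the text.
n_fragments = ["OC=CC=C%90", "C%90", "CC%90"]  # attach at closure 90 (the N)
c_fragments = ["C%91", "O%91"]                 # attach at closure 91 (the C)

library = list(enumerate_library(core, [n_fragments, c_fragments]))
print(len(library))  # 6
print(library[0])    # O1CN%90CC%91C1.OC=CC=C%90.C%91
```

Each output string still needs to go through a real SMILES toolkit to be parsed and canonicalized; the sketch only does the text assembly.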

The dot disconnect only says there isn't an implicit bond between the atoms on either side of it. It doesn't mean that the two atoms can't be covalently bonded through ring closures, nor that they must be in different connected subgraphs. ("Connected subgraph" is another way of saying "covalently bonded molecule".)

The same fragment library might be used for two different attachment points. Because the '%' character only occurs in SMILES before a two-digit ring closure, you can label all your fragment terminals with, say, "%99" and use simple text substitution as needed for the given core attachment point.
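The substitution itself is a one-line string replace. A small Python sketch, assuming every fragment in the shared library carries the "%99" placeholder exactly once; attach_at is a hypothetical helper name:

```python
def attach_at(fragment, closure, placeholder="%99"):
    """Rewrite the placeholder closure to the one used by the core."""
    if not 10 <= closure <= 99:
        raise ValueError("expected a two-digit closure number")
    return fragment.replace(placeholder, f"%{closure}")

generic_fragments = ["OC=CC=C%99", "C%99"]

# The same library, relabeled for each of the core's attachment points.
for_nitrogen = [attach_at(f, 90) for f in generic_fragments]
for_carbon = [attach_at(f, 91) for f in generic_fragments]

print(for_nitrogen)  # ['OC=CC=C%90', 'C%90']
print(for_carbon)    # ['OC=CC=C%91', 'C%91']
```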

Make sure the bond types match across the ring closure. C%90.C%90 and C%90.C-%90 are both the same as CC, and C=%90.C%90 is the same as C=C, but C=%90.C-%90 is illegal because the two explicit bond types conflict. You'll need to be even more careful with chirality to make sure the order of the core and fragments is correct.
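A quick text-level check can catch conflicting bond types before the strings reach a SMILES parser. The following Python sketch is an assumption-laden illustration: it only looks at two-digit '%' closures, only at the bond symbols -, =, # and :, and it does nothing about chirality. The function name is hypothetical.

```python
import re

# An optional bond symbol directly before a two-digit '%' ring closure.
CLOSURE = re.compile(r"([-=#:]?)%(\d\d)")

def closure_bond_conflicts(smiles):
    """Return the closure numbers (as strings) whose two ends
    specify different explicit bond types."""
    open_bonds = {}  # closure number -> bond symbol seen at the first end
    conflicts = []
    for bond, number in CLOSURE.findall(smiles):
        if number not in open_bonds:
            open_bonds[number] = bond
            continue
        first = open_bonds.pop(number)  # closure is now matched
        # An unspecified bond on one end is compatible with anything.
        if first and bond and first != bond:
            conflicts.append(number)
    return conflicts

print(closure_bond_conflicts("C%90.C-%90"))   # []
print(closure_bond_conflicts("C=%90.C%90"))   # []
print(closure_bond_conflicts("C=%90.C-%90"))  # ['90']
```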

It's very cool that a text editor and a couple of shell commands are all that's needed to make a virtual library using SMILES.


Andrew Dalke is an independent consultant focusing on software development for computational chemistry and biology. Need contract programming, help, or training? Contact me



Copyright © 2001-2013 Andrew Dalke Scientific AB