

ChEMBL target sets association network #

This is part 3 of a three-part series in generating fingerprint-based set similarities using chemfp. Read part 1 to see some of the ways to compare two fingerprint sets, and part 2 where I figure out how to use the ChEMBL bioactivity data.

I usually work with entity-based similarities. I have a molecule X and I want to find other molecules which are similar to it. Set-based similarities are a bit different. I think of them as comparing two objects by the clouds around them. Instead of comparing two proteins based on a more intrinsic property like their sequence or 3D structure, I might want to compare two proteins by the types of molecules which bind to them or affect them. This might reveal if two proteins have similar binding pockets or are involved in the same chemical pathway.

Before jumping into the nitty-gritty, I thought I would try a non-molecular example. Suppose you read Neal Stephenson's Cryptonomicon and enjoy it so much that you want to read more like it. "Like it" can mean many things: books that talk about technology, books which combine two different time periods, books with historical fiction taking place during the Second World War, books with many vignettes, and so on. For some, Foucault's Pendulum is like Cryptonomicon. And of course "like" can mean other books by the same author, or even by the same publishing house or imprint.

The "entity" in this case is a book. While there are many possible scoring functions, the end result is a book or list of books, likely ranked in preference order.

Suppose however you have read many books by Stephenson and want to find another author like him. Here too there are many ways to make a comparison. One is to use the book similarity function. For each author under consideration, compare all of that author's books to all of Stephenson's books, and come up with some aggregate scoring function to give the set similarity. Use that to figure out the most similar author.

If you repeat this many times you can create a network of authors, associated by similarity based on their books.

IC50 activity sets from ChEMBL

Back into the world of molecules. I want to compare target proteins in human assays based on the set of molecules with an IC50 better than 1 micromolar against each target. I'll use the ChEMBL 21 bioactivity data to generate the data.

The following SQL query is based on an example Iain Wallace sent me, adapted to use the SQLite console. I'll first turn on the timer and have it save the output to the file "chembl_sets.tsv", as tab-separated fields. The query looks for "single protein" targets in humans (tax_id=9606), where the assay activity is an IC50 better than 1000 nM. For each of those assays, it gets the target name and the ChEMBL id for the compound used in the assay.

% sqlite3 chembl_21.db
SQLite version 3.8.5 2014-08-15 22:37:57
Enter ".help" for usage hints.
sqlite> .timer on
sqlite> .mode tabs
sqlite> .output chembl_sets.tsv
sqlite> select distinct target_dictionary.pref_name, molecule_dictionary.chembl_id
   ...>   from target_dictionary, assays, activities, molecule_dictionary
   ...>  where target_dictionary.tax_id = 9606
   ...>    and target_dictionary.target_type = "SINGLE PROTEIN"
   ...>    and target_dictionary.tid = assays.tid
   ...>    and assays.assay_id = activities.assay_id
   ...>    and activities.published_type = "IC50"
   ...>    and activities.standard_units = "nM"
   ...>    and activities.standard_value < 1000
   ...>    and activities.molregno = molecule_dictionary.molregno;
Run Time: real 25.771 user 4.440337 sys 2.698867
sqlite> .quit

Most of the 26 seconds was likely spent in reading from the hard disk. However, do note that this was after I did an ANALYZE on some of the tables in the database. Without the ANALYZE, I suspect the query will take a lot longer.

The above console commands produce the file "chembl_sets.tsv" where the first few lines for me look like:

Beta-1 adrenergic receptor      CHEMBL305153
Dopamine D4 receptor    CHEMBL303519
Endothelin-converting enzyme 1  CHEMBL415967
Neprilysin      CHEMBL415967
FK506-binding protein 1A        CHEMBL140442
Coagulation factor X    CHEMBL117716
Trypsin I       CHEMBL117716
Retinoid X receptor alpha       CHEMBL111217
Epidermal growth factor receptor erbB1  CHEMBL68920
Receptor protein-tyrosine kinase erbB-2 CHEMBL68920
Epidermal growth factor receptor erbB1  CHEMBL69960
Receptor protein-tyrosine kinase erbB-2 CHEMBL69960
Proteinase-activated receptor 1 CHEMBL330643
Tyrosine-protein kinase LCK     CHEMBL69638
Neuropeptide Y receptor type 5  CHEMBL75193
Gonadotropin-releasing hormone receptor CHEMBL65614
 ...
(That's a copy&paste of the terminal output, which doesn't preserve the tab characters.)

Compare two real sets

In an earlier essay I came up with an ad hoc comparison function I called "sss()":

from chemfp import search

def sss(arena1, arena2, threshold=0.8):
  results = search.threshold_tanimoto_search_arena(
                 arena1, arena2, threshold=threshold)
  # The scaling factor of 300 was chosen so that random ChEMBL
  # subsets have a score of about 0.01. It doesn't matter for
  # ranking as it's used as an arbitrary scaling factor.
  similarity = 300 * results.cumulative_score_all() / (len(arena1) * len(arena2))
  return similarity
What score does it give to two sets which are similar?

Load set data

The file "chembl_sets.tsv" which I just created in SQLite contains the set names and the compounds ids which are in the set. The file "chembl_21.rdkit2048.fpb" created in part 1 contains compound ids and fingerprints. I can combine the two to get the fingerprints for each compound id. The first step is to read the set information, which I do with the following function:

import collections

# Load a file with lines of the form <set_name> <tab> <compound_id>
# Example:  "DNA-dependent protein kinase\tCHEMBL104450\n"
# Return a dictionary mapping set name to a list of all of its ids
def load_set_members(filename):
    set_members_table = collections.defaultdict(list)
    with open(filename) as infile:
        for line in infile:
            set_name, chembl_id = line.rstrip("\n").split("\t")
            set_members_table[set_name].append(chembl_id)
    # Turn the defaultdict into a dict so that a lookup of
    # a name which doesn't exist raises an exception instead
    # of creating and returning an empty list.
    return dict(set_members_table)
I'll use that function to read the set file. I also looked through the list of target names and guessed that "Estrogen receptor alpha" and "Estrogen receptor beta" might be similar, so I'll use that as my initial test case:
>>> set_members_table = load_set_members("chembl_sets.tsv")
>>> len(set_members_table["Estrogen receptor alpha"])
912
>>> len(set_members_table["Estrogen receptor beta"])
1066
>>> len(set(set_members_table["Estrogen receptor alpha"])
...   & set(set_members_table["Estrogen receptor beta"]))
697

Load set fingerprints

Next I'll extract the fingerprints from the FPB file. The easiest way is to find the index for each of the compound ids and pass the list of indices to the copy() method.

>>> import chemfp
>>> chembl_21 = chemfp.load_fingerprints("chembl_21.rdkit2048.fpb")
>>> target_ids1 = set_members_table["Estrogen receptor alpha"]
>>> indices1 = [chembl_21.get_index_by_id(target_id) for target_id in target_ids1]
>>> target1 = chembl_21.copy(indices1)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "chemfp/arena.py", line 576, in copy
    new_indices.append(range_check[i])
TypeError: sequence index must be integer, not 'NoneType'
Well, that was unexpected. What happened? It looks like there are 8 None elements in the list of indices:
>>> indices1.count(None)
8
How could I have an activity for a compound, but not have the compound?

There can be a few reasons. Perhaps ChEMBL didn't include the structure data. Except they do. Perhaps RDKit couldn't parse the record. Except it could. The real clue came because Iain Wallace also sent me the dataset he generated with his sample SQL. There are 76 additions in my file which aren't in his, including 8 estrogen receptor alpha records:

% diff iain_sorted.tsv chembl_sets_sorted.tsv
5697a5698
> Acetylcholinesterase	CHEMBL2448138
7776a7778
> Adenosine A3 receptor	CHEMBL1386
9350a9353
> Alpha-2a adrenergic receptor	CHEMBL1366
9410a9414
> Alpha-2a adrenergic receptor	CHEMBL508338
9455a9460
> Alpha-2b adrenergic receptor	CHEMBL1366
9479a9485
  ...
54058a54091
> Epidermal growth factor receptor erbB1	CHEMBL1909064
58244a58278,58279
> Estrogen receptor alpha	CHEMBL219003
> Estrogen receptor alpha	CHEMBL219004
58246a58282
> Estrogen receptor alpha	CHEMBL219390
58247a58284
> Estrogen receptor alpha	CHEMBL219763
58506a58544
> Estrogen receptor alpha	CHEMBL373625
58555a58594
> Estrogen receptor alpha	CHEMBL385993
58556a58596,58597
> Estrogen receptor alpha	CHEMBL386948
> Estrogen receptor alpha	CHEMBL387335
65855a65897
> Glutathione reductase	CHEMBL2068507
65874a65917
 ...
He generated his data from one of the dump files for a server-based database, like MySQL. I generated my data from the SQLite dump file. My guess is that the SQLite file was generated slightly later, and includes a few records which were added during that time delta.

The reason my fingerprint file doesn't contain the entries is that the chembl_21.sdf file I used was also generated from the first snapshot, so doesn't include those new structures.

At least, that's my working theory. It's also pure coincidence that I happened to start with one of the few set names where this was a problem.

I'll write a function to skip ids when the id can't be found in the fingerprint arena:

def create_subset(arena, ids):
    indices = []
    for id in ids:
        idx = arena.get_index_by_id(id)
        if idx is not None:
          indices.append(idx)
    return arena.copy(indices=indices)

>>> target1 = create_subset(chembl_21, set_members_table["Estrogen receptor alpha"])
>>> target2 = create_subset(chembl_21, set_members_table["Estrogen receptor beta"])

Compare two fingerprint sets

With the two data sets in hand, it's a simple matter of calling the scoring function:

>>> sss(target1, target2)
4.986669863305914
I'll make a little helper function to compare two classes by name:
def compare(name1, name2):
    target1 = create_subset(chembl_21, set_members_table[name1])
    target2 = create_subset(chembl_21, set_members_table[name2])
    return sss(target1, target2)
and use it to compare a few other sets, judiciously chosen after I implemented the next section:
>>> compare("Estrogen receptor alpha", "Estrogen sulfotransferase")
6.107895939764545
>>> compare("Estrogen receptor alpha", "Vitamin D receptor")
6.107895939764545
>>> compare("Dopamine D5 receptor", "Histamine H1 receptor")
110.6279251170047
>>> compare("Dopamine D5 receptor", "Histamine H2 receptor")
110.6279251170047
>>> compare("Histamine H1 receptor", "Histamine H2 receptor")
24.037402406896796
>>> compare("NEDD8-activating enzyme E1 regulatory subunit", "RNase L")
100.0
A real chemist or biologist would need to tell me if these make sense.

Compare all sets

I can use the code from the previous section to generate the N*(N-1)/2 comparisons of every set to every other set. The general algorithm is to load each of the sets into its own object, which I'll call a "Subset" instance. A Subset has a "name" and an "arena", and I'll ignore empty subsets (like "Transient receptor potential cation channel subfamily M member 6"):

class Subset(object):
    def __init__(self, name, arena):
        self.name = name
        self.arena = arena

def load_subsets(arena, set_members_table):
    subsets = []
    for i, (name, ids) in enumerate(sorted(set_members_table.items())):
        set_arena = create_subset(arena, ids)
        if not set_arena:
            sys.stderr.write("No members: %s\n" % (name,))
            continue
        subsets.append(Subset(name, set_arena))
    return subsets
This returns N subsets. I'll iterate over the upper-triangle of the comparison matrix to generate all pair scores. If the score is at least 1.0, I'll print the score and the two set ids. Otherwise I won't report anything.

I also include some progress and run-time information to stderr, which makes the code a bit more complicated to read, but helps soothe the nerves of at least this observer.

Here's the full code, which I saved into the file "set_compare.py":

# set_compare.py
from __future__ import print_function, division

import sys
import collections
import chemfp
from chemfp import search
import time

# Load a file with lines of the form <set_name> <tab> <compound_id>
# Example:  "DNA-dependent protein kinase\tCHEMBL104450\n"
# Return a dictionary mapping set name to a list of all of its ids
def load_set_members(filename):
    set_members_table = collections.defaultdict(list)
    with open(filename) as infile:
        for line in infile:
            set_name, chembl_id = line.rstrip("\n").split("\t")
            set_members_table[set_name].append(chembl_id)
    # Turn the defaultdict into a dict so that a lookup of
    # a name which doesn't exist raises an exception instead
    # of creating and returning an empty list.
    return dict(set_members_table)


# Ad hoc scoring function using the sum of scores.
# Please don't use this for real work unless you've validated it.
def sss(arena1, arena2, threshold=0.8):
  results = search.threshold_tanimoto_search_arena(
                 arena1, arena2, threshold=threshold)
  # The scaling factor of 300 was chosen so that random ChEMBL
  # subsets have a score of about 0.01. It doesn't matter for
  # ranking as it's used as an arbitrary scaling factor.
  similarity = 300 * results.cumulative_score_all() / (len(arena1) * len(arena2))
  return similarity


# Make a subset arena of the given arena using the given ids
def create_subset(arena, ids):
    indices = []
    for id in ids:
        idx = arena.get_index_by_id(id)
        if idx is not None:
          indices.append(idx)
    return arena.copy(indices=indices)

class Subset(object):
    def __init__(self, name, arena):
        self.name = name
        self.arena = arena

def load_subsets(arena, set_members_table):
    subsets = []
    for i, (name, ids) in enumerate(sorted(set_members_table.items())):
        set_arena = create_subset(arena, ids)
        if not set_arena:
            sys.stderr.write("No members: %s\n" % (name,))
            continue
        subsets.append(Subset(name, set_arena))
    return subsets

def main():
    chembl_21 = chemfp.load_fingerprints("chembl_21.rdkit2048.fpb")
    set_members_table = load_set_members("chembl_sets.tsv")
    start_time = time.time()
    subsets = load_subsets(chembl_21, set_members_table)
    load_time = time.time()

    N = len(subsets)
    for i in range(N-1):
        sys.stderr.write("\rProcessing %d/%d" % (i, N-1))
        subset1 = subsets[i]
        for j in range(i+1, N):
            subset2 = subsets[j]
            score = sss(subset1.arena, subset2.arena)
            if score > 1.0:
                sys.stderr.write("\r                     \r")
                print("%.2f\t%s\t%s" % (score, subset1.name, subset2.name))
                sys.stderr.write("\rProcessing %d/%d" % (i, N-1))
        sys.stderr.write("\r                     \r")
    compare_time = time.time()
    
    print("load time:", load_time-start_time, file=sys.stderr)
    print("compare time:", compare_time-load_time, file=sys.stderr)

if __name__ == "__main__":
    main()

I ran it like this, which ran with 4 OpenMP threads:

% python set_compare.py > set_compare_output.txt
No members: Transient receptor potential cation channel subfamily M member 6
No members: Transient receptor potential cation channel subfamily V member 2
No members: Transient receptor potential cation channel subfamily V member 5
No members: Voltage-gated P/Q-type calcium channel alpha-1A subunit
load time: 4.57191491127
compare time: 491.902451038
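
(A side note on the thread count: the Tanimoto searches are the OpenMP part. If I remember the names correctly, chemfp lets you query and change the number of threads from Python, and setting the OMP_NUM_THREADS environment variable before starting also works. A sketch:

import chemfp

print(chemfp.get_num_threads())   # how many OpenMP threads the searches will use
chemfp.set_num_threads(2)         # e.g. restrict the searches to 2 threads

I didn't need to change it for this run.)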

That command found 5410 set comparisons. I'll show the 5 smallest, 5 largest, and values near the median and quartiles:

% wc -l set_compare_output.txt
    5410 set_compare_output.txt
% sort -n set_compare_output.txt | head -5
1.00	Apoptosis regulator Bcl-2	Membrane-associated guanylate kinase-related 3
1.00	Cyclin-dependent kinase 2	Testis-specific serine/threonine-protein kinase 2
1.00	Cytochrome P450 1B1	Tubulin beta-1 chain
1.00	Fructose-1,6-bisphosphatase	Receptor tyrosine-protein kinase erbB-3
1.00	Glucagon receptor	Serine/threonine-protein kinase GAK
% awk 'NR==int(5410*1/4)' set_compare_output.txt  # first quartile
1.55	Cyclin-dependent kinase 4	Cyclin-dependent kinase-like 1
% awk 'NR==5410/2' set_compare_output.txt  # median
1.24	Glyceraldehyde-3-phosphate dehydrogenase liver	Histone-lysine N-methyltransferase, H3 lysine-9 specific 3
% awk 'NR==int(5410*3/4)' set_compare_output.txt  # third quartile
7.62	Neuropeptide Y receptor type 1	Neuropeptide Y receptor type 2
% sort -n set_compare_output.txt | tail -5
300.00	Serine palmitoyltransferase 1	Serine palmitoyltransferase 2
300.00	Serine/threonine-protein kinase 24	Serine/threonine-protein kinase MST1
300.00	Sodium channel protein type I alpha subunit	Sodium channel protein type XI alpha subunit 
300.00	Ubiquitin carboxyl-terminal hydrolase 10	Ubiquitin carboxyl-terminal hydrolase 13
300.00	Xaa-Pro aminopeptidase 2	Xaa-Pro dipeptidase
The largest possible score in my similarity metric is 300.0: the cumulative score can be at most len(arena1)*len(arena2), which happens when every fingerprint in one set has a Tanimoto score of 1.0 against every fingerprint in the other, and that cancels the denominator to leave just the 300 scaling factor. These last few lines indicate perfect matches.

Pre-compile sets (advanced topic)

It took 4-5 seconds to load the dataset. This was less than 1% of the overall run-time, so optimizing it is usually not worthwhile. However, suppose you want to write a set similarity web service. You'll often end up reloading the server during development, and the 4-5 second wait each time will become annoying.

One possibility is to create an FPB file for each of the subsets. The problem is there are over 1,400 sets. By design, chemfp memory-maps each FPB file. Each memory map uses a file descriptor, and many OSes limit the number of file descriptors that a process may use. On my Mac, the default resource limit ("rlimit") is 256, though that can be increased.
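
For what it's worth, the soft limit can be checked and raised for the current shell before starting Python; the 256 is my Mac's default and the 4096 below is just an example value:

% ulimit -n
256
% ulimit -n 4096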

The way I usually solve this in chemfp is to store all of the subsets sequentially in a single FPB file, and have a ".ranges" file which specifies the start/end range for each set. By default the FPB file reorders the fingerprints by popcount, so the fingerprints with 0 on-bits come first, then those with 1 on-bit, etc. When they are ordered that way, chemfp can create the sublinear search index.

I can disable reordering, so that the fingerprints are stored in the same order they were added to the FPB file. If I know that set A is between indices start and end then I can use arena[start:end] to get the subset.

Create an FPB file with the sets in input order

The following program reads the chembl_21.rdkit2048.fpb and chembl_sets.tsv files to compile a single FPB file named chembl_sets.fpb with the fingerprint sets, in order, and a range file named "chembl_sets.ranges" with the start/end indices and name of each set.

from __future__ import print_function, division

import collections
import chemfp

# Load a file with lines of the form <set_name> <tab> <compound_id>
# Example:  "DNA-dependent protein kinase\tCHEMBL104450\n"
# Return a dictionary mapping set name to a list of all of its ids
def load_set_members(filename):
    set_members_table = collections.defaultdict(list)
    with open(filename) as infile:
        for line in infile:
            set_name, chembl_id = line.rstrip("\n").split("\t")
            set_members_table[set_name].append(chembl_id)
    # Turn the defaultdict into a dict so that a lookup of
    # a name which doesn't exist raises an exception instead
    # of creating and returning an empty list.
    return dict(set_members_table)

# Return a list of arena indices for the given ids, skipping ids not present in the arena
def get_indices(arena, ids):
    indices = []
    for id in ids:
        idx = arena.get_index_by_id(id)
        if idx is None:
            continue
        indices.append(idx)
    return indices

def main():
    chembl_21 = chemfp.load_fingerprints("chembl_21.rdkit2048.fpb")
    set_members_table = load_set_members("chembl_sets.tsv")

    with open("chembl_sets.ranges", "w") as range_file:
        with chemfp.open_fingerprint_writer("chembl_sets.fpb", reorder=False,
                                            metadata=chembl_21.metadata) as writer:
            start_index = 0
            for name, chembl_ids in sorted(set_members_table.items()):
                indices = get_indices(chembl_21, chembl_ids)
                if not indices:
                    continue  # no ids found
                end_index = start_index + len(indices)
                
                range_file.write("%d\t%d\t%s\n" % (start_index, end_index, name))
                writer.write_fingerprints(chembl_21.copy(indices=indices))
                start_index = end_index

if __name__ == "__main__":
    main()
I ran it and generated the two files. The "chembl_sets.ranges" file starts with the following:
0	77	1-acylglycerol-3-phosphate O-acyltransferase beta
77	1834	11-beta-hydroxysteroid dehydrogenase 1
1834	1881	11-beta-hydroxysteroid dehydrogenase 2
1881	1885	14-3-3 protein gamma
1885	1992	15-hydroxyprostaglandin dehydrogenase [NAD+]
1992	2016	2-acylglycerol O-acyltransferase 2
2016	2019	25-hydroxyvitamin D-1 alpha hydroxylase, mitochondrial
2019	2025	26S proteasome non-ATPase regulatory subunit 14
2025	2027	3-beta-hydroxysteroid dehydrogenase/delta 5--<4-isomerase type I
2027	2030	3-keto-steroid reductase

If you've been following along then nothing in this code should be new, except this line:

                writer.write_fingerprints(chembl_21.copy(indices=indices))
I use the indices to make a new arena containing just those fingerprints. (By default these fingerprints will be in popcount order, which comes in useful in a bit.) I then pass the new arena to write_fingerprints(). This function takes an iterator of (id, fingerprint) pairs, which is what an arena returns if you try to iterate over it.

The end result is to save the selected ids and fingerprints to a file, in popcount order.
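
To make that concrete, here's a tiny sketch (the copy's file name is made up for the example) showing that iterating over an arena gives (id, fingerprint) pairs, which is why an arena can be passed straight to write_fingerprints():

from __future__ import print_function
import chemfp

arena = chemfp.load_fingerprints("chembl_sets.fpb")
for id, fp in arena:
    print(id, len(fp))   # a ChEMBL id and a 2048-bit (256 byte) fingerprint
    break

# write_fingerprints() consumes the same sort of (id, fingerprint) iterator.
with chemfp.open_fingerprint_writer("sets_copy.fpb",
                                    metadata=arena.metadata) as writer:
    writer.write_fingerprints(arena)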

Search using pre-compiled fingerprints

It's tempting to load the FPB file, get the set arena using the start/end from the ranges file, and do the comparison. This will work, but it will be slow. The FPB file is not in popcount order and has no popcount index. This means that chemfp's sublinear search optimization will not be used.

If the arena is not indexed by popcount then its slice, like unordered_arena[start:end] will also not be indexed by popcount. Making a simple copy() doesn't help because the copy() by default preserves the ordering. That is, the copy will be ordered if and only if the original arena is ordered.

Instead, I need to tell the copy() to always reorder, with:

unordered_arena[start:end].copy(reorder=True)

As a further optimization, if the copy(reorder=True) notices that the input is already in popcount order then it will skip the step to sort() the fingerprints by popcount. That's the case for us since the compiled chembl_sets.fpb file was created by passing ordered subset arenas to write_fingerprints(), and iterating through ordered arenas returns the (id, fingerprints) in increasing popcount order.

I changed the previous "set_compare.py" program to use the compiled sets file. I call this new program "set_compare2.py". The new code is the function "load_subsets()", which reads the .ranges file, gets subsets from the compiled FPB file, and makes an ordered copy of it.

# set_compare2.py
from __future__ import print_function, division

import sys
import collections
import chemfp
from chemfp import search
import time

# Ad hoc scoring function using the sum of scores.
# Please don't use this for real work unless you've validated it.
def sss(arena1, arena2, threshold=0.8):
  results = search.threshold_tanimoto_search_arena(
                 arena1, arena2, threshold=threshold)
  # The scaling factor of 300 was chosen so that random ChEMBL
  # subsets have a score of about 0.01. It doesn't matter for
  # ranking as it's used as an arbitrary scaling factor.
  similarity = 300 * results.cumulative_score_all() / (len(arena1) * len(arena2))
  return similarity

class Subset(object):
    def __init__(self, name, arena):
        self.name = name
        self.arena = arena
        
def load_subsets(arena, filename):
    subsets = []
    with open(filename) as range_file:
        for line in range_file:
            start, end, name = line.rstrip("\n").split("\t")
            start = int(start)
            end = int(end)
            subset = Subset(name, arena[start:end].copy(reorder=True))
            subsets.append(subset)
    return subsets

def main():
    chembl_sets_arena = chemfp.load_fingerprints("chembl_sets.fpb")
    start_time = time.time()
    subsets = load_subsets(chembl_sets_arena, "chembl_sets.ranges")
    load_time = time.time()

    N = len(subsets)
    for i in range(N-1):
        sys.stderr.write("\rProcessing %d/%d" % (i, N-1))
        subset1 = subsets[i]
        for j in range(i+1, N):
            subset2 = subsets[j]
            score = sss(subset1.arena, subset2.arena)
            if score > 1.0:
                sys.stderr.write("\r                     \r")
                print("%.2f\t%s\t%s" % (score, subset1.name, subset2.name))
                sys.stderr.write("\rProcessing %d/%d" % (i, N-1))
        sys.stderr.write("\r                     \r")
    compare_time = time.time()
    
    print("load time:", load_time-start_time, file=sys.stderr)
    print("compare time:", compare_time-load_time, file=sys.stderr)

if __name__ == "__main__":
    main()

I ran "set_compare2.py" and compared the results to "set_compare.py". The output files were exactly the same, as expected. The biggest difference was the load time:

# times from set_compare.py
load time: 4.57191491127
compare time: 491.902451038

# times from set_compare2.py
load time: 0.801107883453
compare time: 451.236635923
That saves nearly 4 seconds of load time. The run time also looks faster, but I think that's due to the variability of my desktop, which also has a few Firefox windows and more than a few tabs open.

Z-score

Up until now I've been using an ad hoc scoring function with an arbitrary scaling factor to make it so the background score is 0.01 across the range of input set sizes. It has a few flaws: 1) it has a maximum score of 300, which is unusual, 2) no one will understand how to interpret it, without specific experience with it, and 3) the standard deviation is a function of the input set sizes.

One common technique to remove those flaws is to transform the score into a "z-score" or "standard score":

zscore = (score - background_score) / background_standard_deviation
The "background score" is about 0.01 for my scoring function, but the "background standard deviation" varies based on the input set sizes. I can determine it by generating enough (let's say 25) pairs of subsets containing randomly selecting fingerprints, and with the same sizes as the pair I'm interested in, then computing the standard deviation of all of those comparisons. With this approach my "300*" scaling factor is irrelevant as the numerator and denominator are equally scaled. The division by the product of the sizes also disappears, for the same reason.

You likely see the downside of this scoring function - each Z-score requires 25 additional Tanimoto searches!

There are a few ways to mitigate this. First, while I have 1,400+ sets, there are only 365 different set sizes. I can reuse values for multiple comparisons of the same pair of sizes, which will reduce the number of tests I need to do by about a factor of 15 (there are roughly 1408*1407/2 ≈ 990,000 set pairs, but only 365*366/2 ≈ 67,000 distinct pairs of sizes). Second, random sets rarely have matches with a 0.8 similarity. The sublinear index should quickly reject those obvious mismatches. Third, I could cache the values to the file system, database, or other permanent storage and re-use them for future searches. It won't help the first time, but I end up with mistakes in my code, or have something I want to tweak, and will often re-run the code many times before it finally works.

Fourth, and not something I'll do in this essay (if at all), I could do some curve fitting and come up with an equation which does a reasonable job of interpolating or predicting values.

(Oh, my. I was certainly optimistic by using 25 samples. After over an hour it had only processed 229 of 1408 elements. I killed the code, changed the number of samples to 10, and restarted it. I'm glad I put that cache in! If you do this for real, you might vary the number of samples as a function of the sizes. It's much harder getting a non-zero standard deviation for a 1x1 comparison than for a 1000x1000 comparison.)

The biggest change to this code is a new "ZScorer" class, which replaces the "sss" function. The main entry point is "compute_zscore()". That in turn calls "compute_raw_score()" to compute the cumulative score between two arenas, and "compute_background_values()", which computes the mean and standard deviation for the requested set sizes. The latter function also caches to an internal dictionary.

The ZScorer class also has a way to load and save the cache to a file. Before computing the set similarities I ask it to "load()" from the file named "zscore.cache". I then wrapped the main code in a try/finally block so I can save() the cache whether the code ran to completion or was interrupted by a ^C or coding error.

I also modified the scoring threshold so the Z-score had to be at least 10 (that is, the difference from the average background must be at least 10 standard deviations).

To make the data a bit more useful, I also included information about the number of elements in each set. These are node properties easily determined by getting the length of the corresponding arena. I also added the number of ids common to both sets (this is a new edge property). For this I turned to Python's native 'set' type. Each Subset instance makes a set of arena ids:

class Subset(object):
    def __init__(self, name, arena):
        self.name = name
        self.arena = arena
        self.id_set = set(arena.ids) # used to find intersection counts
so the number of elements in common is:
num_in_common = len(subset1.id_set & subset2.id_set)

As a final note before presenting the code, I called this program "set_zcompare.py". It's based on the original "set_compare.py" and does not use the pre-compiled FPB file of "set_compare2.py".

# I call this "set_zcompare.py"
from __future__ import print_function, division

import sys
import random
import collections
import time

import numpy as np

import chemfp
from chemfp import search

# Support both Python 2 and Python 3
try:
  xrange # Check if this is Python 2
except NameError:
  xrange = range # Use 'range' for Python 3
  
# Load a file with lines of the form <set_name> <tab> <compound_id>
# Example:  "DNA-dependent protein kinase\tCHEMBL104450\n"
# Return a dictionary mapping set name to a list of all of its ids
def load_set_members(filename):
    set_members_table = collections.defaultdict(list)
    with open(filename) as infile:
        for line in infile:
            set_name, chembl_id = line.rstrip("\n").split("\t")
            set_members_table[set_name].append(chembl_id)
    # Turn the defaultdict into a dict so that a lookup of
    # a name which doesn't exist raises an exception instead
    # of creating and returning an empty list.
    return dict(set_members_table)


# Z-score based on the sum of the similar scores.  Please don't use
# this scoring function for real work unless you've validated it.

def make_random_subarena(arena, n):
  indices = random.sample(xrange(len(arena)), n)
  return arena.copy(indices=indices)

class ZScorer(object):
    def __init__(self, arena, threshold=0.8, num_samples=25):
        self.arena = arena
        self.threshold = threshold
        self.num_samples = num_samples
        self.cached_values = {}

    def compute_zscore(self, arena1, arena2):
        # The main entry point
        score = self.compute_raw_score(arena1, arena2)
        mean, std = self.compute_background_values(len(arena1), len(arena2))
        if std == 0.0:
            return 0.0
        return (score - mean) / std

    def compute_raw_score(self, arena1, arena2):
        return search.threshold_tanimoto_search_arena(
            arena1, arena2, threshold=self.threshold).cumulative_score_all()

    def compute_background_values(self, i, j):
        # The scoring function is symmetric so normalize so i <= j
        if i > j:
            i, j = j, i
        # Check if it exists in the cache
        key = (i, j)
        try:
            return self.cached_values[key]
        except KeyError:
            pass

        # Does not already exist, so compute the mean and standard deviation
        scores = []
        for _ in range(self.num_samples):
            subarena1 = make_random_subarena(self.arena, i)
            subarena2 = make_random_subarena(self.arena, j)
            scores.append(self.compute_raw_score(subarena1, subarena2))

        mean, std = np.mean(scores), np.std(scores)
        values = (mean, std)
        self.cached_values[key] = values # cache the result and return it
        return values

    def load(self, filename):
        # Load values from filename into self.cached_values
        try:
            infile = open(filename)
        except IOError:
            sys.stderr.write("Warning: cache file %r does not exist\n" % (filename,))
            return
        
        with infile:
            for line in infile:
                terms = line.split()
                i, j, mean, std = int(terms[0]), int(terms[1]), float(terms[2]), float(terms[3])
                self.cached_values[(i, j)] = (mean, std)
            
    def save(self, filename):
        # Save values from self.cached_values into the named file
        with open(filename, "w") as outfile:
            for key, value in sorted(self.cached_values.items()):
                i, j = key
                mean, std = value
                outfile.write("%s %s %s %s\n" % (i, j, mean, std))
    
# Make a subset arena of the given arena using the given ids
def create_subset(arena, ids):
    indices = []
    for id in ids:
        idx = arena.get_index_by_id(id)
        if idx is not None:
          indices.append(idx)
    return arena.copy(indices=indices)

class Subset(object):
    def __init__(self, name, arena):
        self.name = name
        self.arena = arena
        self.id_set = set(arena.ids) # used to find intersection counts

def load_subsets(arena, set_members_table):
    subsets = []
    for i, (name, ids) in enumerate(sorted(set_members_table.items())):
        set_arena = create_subset(arena, ids)
        if not set_arena:
            sys.stderr.write("No members: %s\n" % (name,))
            continue
        subsets.append(Subset(name, set_arena))

    return subsets

                        

def main():
    chembl_21 = chemfp.load_fingerprints("chembl_21.rdkit2048.fpb")
    set_members_table = load_set_members("chembl_sets.tsv")
    zscorer = ZScorer(chembl_21, threshold=0.8, num_samples=10) # 25 took too much time
    zscorer.load("zscore.cache")
    try:
        start_time = time.time()
        subsets = load_subsets(chembl_21, set_members_table)
        load_time = time.time()

        print("name1\tsize1\tname2\tsize2\t#in_common\tzscore") # write a header
        N = len(subsets)
        for i in range(N-1):
            sys.stderr.write("\rProcessing %d/%d" % (i, N-1))
            subset1 = subsets[i]
            for j in range(i+1, N):
                subset2 = subsets[j]
                zscore = zscorer.compute_zscore(subset1.arena, subset2.arena)
                if zscore > 10.0:  # Use 10 standard deviations as my threshold of importance
                    sys.stderr.write("\r                     \r")
                    num_in_common = len(subset1.id_set & subset2.id_set)
                    print("%s\t%d\t%s\t%d\t%d\t%.2f" % (subset1.name, len(subset1.arena),
                                                        subset2.name, len(subset2.arena),
                                                        num_in_common, zscore))
                    sys.stderr.write("\rProcessing %d/%d" % (i, N-1))
            sys.stderr.write("\r                     \r")
        compare_time = time.time()
    finally:
        zscorer.save("zscore.cache")
    
    print("load time:", load_time-start_time, file=sys.stderr)
    print("compare time:", compare_time-load_time, file=sys.stderr)

if __name__ == "__main__":
    main()

I ran this and saved the 6351 lines of output to "chembl_target_network_zscore_10.tsv". The first few lines look like:
name1	size1	name2	size2	#in_common	zscore
1-acylglycerol-3-phosphate O-acyltransferase beta	77	TRAF2- and NCK-interacting kinase	16	0	23.56
11-beta-hydroxysteroid dehydrogenase 1	1757	11-beta-hydroxysteroid dehydrogenase 2	47	35	92.93
14-3-3 protein gamma	4	Androgen Receptor	847	0	13.05
14-3-3 protein gamma	4	Histone deacetylase 1	1598	0	12.41
14-3-3 protein gamma	4	Histone deacetylase 6	799	0	20.43
14-3-3 protein gamma	4	Histone deacetylase 8	396	0	33.06
15-hydroxyprostaglandin dehydrogenase [NAD+]	107	Aldehyde dehydrogenase 1A1	16	1	91.58
15-hydroxyprostaglandin dehydrogenase [NAD+]	107	Serine/threonine-protein kinase PIM1 	540	0	17.87
15-hydroxyprostaglandin dehydrogenase [NAD+]	107	Serine/threonine-protein kinase PIM2	346	0	17.48
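
Since the output is a plain tab-separated file with the Z-score in the sixth column, it's easy to carve out only the strongest associations. Something along these lines (keeping the header line) gives the reduced file I use in the next section:

% awk -F'\t' 'NR==1 || $6 >= 100' chembl_target_network_zscore_10.tsv > chembl_target_network_zscore_100.tsv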

Network Visualization

Code, code, everywhere, and not an image to see! How do I know that the above code works as I expected, much less gives useful information which a biochemist might be interested in?

Association network visualization is widely used in bioinformatics. With some pointers from Iain, I downloaded the Java-based Cytoscape 3.4 and used it to visualize the Z-score results from the previous section.

Actually, that network proved too complicated to visualize. I used awk to reduce it to 1450 node pairs with a Z-score of at least 100 (the filter sketched at the end of the previous section), in the file named chembl_target_network_zscore_100.tsv. I loaded the file into Cytoscape (File->Import->Network->File...), selected chembl_target_network_zscore_100.tsv, made sure the 'name1' column was the query, 'size1' a query attribute, 'name2' was the target, 'size2' a target attribute, '#in_common' an integer edge attribute, and 'zscore' a floating point edge attribute. I tried different layouts, but the "preferred layout" seemed best. Here's a screenshot of one part:

[Screenshot: a portion of the ChEMBL target sets network with a Z-score of at least 100]

It looks reasonable, in that I know dopamine and serotonin are related. I'm neither a chemist nor biologist, nor do I play one on the internet. My goal was to show how you might use chemfp to do this sort of analysis. This image shows I'm in the right neighborhood, so it's good enough for me to say I'm done.



ChEMBL bioactivity data #

This is part 2 of a three-part series in generating fingerprint-based set similarities using chemfp. Read part 1 to see some of the ways to compare two fingerprint sets, and part 3 where I put everything together and visualize the results.

I almost only use ChEMBL structure files. I download the .sdf files and process them. ChEMBL also supplies bioactivity data, which I've never worked with. Iain Wallace suggested I look to it as a source of compound set data, and provided some example SQL queries. This blog post is primarily a set of notes for myself as I experiment with the queries and learn more about what is in the data file.

There is one bit of general advice. If you're going to use the SQLite dump from ChEMBL, make sure that you have run "ANALYZE" at least on the tables of interest. This may take a few hours. I'm downloading ChEMBL-22.1 to see if it comes pre-analyzed. If it doesn't, I'll ask them to do so as part of their releases. (26 March 2017: I verified that 22.1 does not come pre-analyzed and sent them an email. Within a few hours, on a Sunday afternoon(!), I got an email thanking me for the suggestion. They only started shipping SQLite dumps with v21, and "will definitely include ANALYZE command when producing the dump" in the future. You won't need this piece of advice in the future.)
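
To be concrete, the analysis can be done from the sqlite3 console. The table list here is just my guess at the ones which matter for the queries in this post; a bare "ANALYZE" does every table, which takes longer:

% sqlite3 chembl_21.db
sqlite> ANALYZE activities;
sqlite> ANALYZE assays;
sqlite> ANALYZE target_dictionary;
sqlite> ANALYZE molecule_dictionary;
sqlite> ANALYZE compound_structures;
sqlite> .quit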

For those playing along from home (or the office, or wherever fine SQL database engines may be found), I downloaded the SQLite dump for ChEMBL 21, which is a lovely 2542883 KB (or 2.4 GB) compressed, and 12 GB uncompressed. That link also includes dumps for MySQL, Oracle, and Postgres, as well as schema documentation.

Unpack it the usual way (it takes a while to unpack 12GB), cd into the directory, and open the database using sqlite console:

% tar xf chembl_21_sqlite.tar.gz
% cd chembl_21_sqlite
% sqlite3 chembl_21.db
SQLite version 3.8.5 2014-08-15 22:37:57
Enter ".help" for usage hints.
sqlite>

compound_structures

The 'compound_structures' table looks interesting. How many structures are there?

sqlite> select count(*) from compound_structures;
1583897
Wow. Just .. wow. That took several minutes to execute. This is a problem I've had before with large databases. SQLite doesn't store the total table size, so the initial count(*) ends up doing a full table scan. This brings in every B-tree node from disk, which requires a lot of random seeks for my poor hard disk made of spinning rust. (Hmm, Crucial says I can get a replacement 500GB SSD for only EUR 168. Hmmm.)

The second time and onwards is just fine, thanks to the power of caching.

What do the structures look like? I decided to show only a few of the smallest structures to keep the results from overflowing the screen:

sqlite> select molregno, standard_inchi, standard_inchi_key, canonical_smiles
   ...>   from compound_structures where length(canonical_smiles) < 10 limit 4;
1813|InChI=1S/C4H11NO/c1-2-3-4-6-5/h2-5H2,1H3|WCVVIGQKJZLJDB-UHFFFAOYSA-N|CCCCON
3838|InChI=1S/C2H4INO/c3-1-2(4)5/h1H2,(H2,4,5)|PGLTVOMIXTUURA-UHFFFAOYSA-N|NC(=O)CI
4092|InChI=1S/C4H6N2/c5-4-6-2-1-3-6/h1-3H2|VEYKJLZUWWNWAL-UHFFFAOYSA-N|N#CN1CCC1
4730|InChI=1S/CH4N2O2/c2-1(4)3-5/h5H,(H3,2,3,4)|VSNHCAURESNICA-UHFFFAOYSA-N|NC(=O)NO

For fun, are there canonical SMILES which are listed multiple times? There are a few, so I decided to narrow it down to those with more than 2 instances. (None occur more than 3 times.)

sqlite> select canonical_smiles, count(*) from compound_structures group by canonical_smiles having count(*) > 2;
CC(C)Nc1cc(ccn1)c2[nH]c(nc2c3ccc(F)cc3)[S+](C)[O-]|3
CC(C)Nc1cc(ccn1)c2[nH]c(nc2c3ccc(F)cc3)[S+]([O-])C(C)C|3
CC(C)[C@@H](C)Nc1cc(ccn1)c2[nH]c(nc2c3ccc(F)cc3)[S+](C)[O-]|3
CC(C)[C@H](C)Nc1cc(ccn1)c2[nH]c(nc2c3ccc(F)cc3)[S+](C)[O-]|3
CC(C)[S+]([O-])c1nc(c2ccc(F)cc2)c([nH]1)c3ccnc(NC4CCCCC4)c3|3
  ...
Here are more details about the first output where the same SMILES is used multiple times:
sqlite> select molregno, standard_inchi from compound_structures
   ...>  where canonical_smiles = "CC(C)Nc1cc(ccn1)c2[nH]c(nc2c3ccc(F)cc3)[S+](C)[O-]";
1144470|InChI=1S/C18H19FN4OS/c1-11(2)21-15-10-13(8-9-20-15)17-16(22-18(23-17)25(3)24)12-4-6-14(19)7-5-12/h4-11H,1-3H3,(H,20,21)(H,22,23)
1144471|InChI=1S/C18H19FN4OS/c1-11(2)21-15-10-13(8-9-20-15)17-16(22-18(23-17)25(3)24)12-4-6-14(19)7-5-12/h4-11H,1-3H3,(H,20,21)(H,22,23)/t25-/m1/s1
1144472|InChI=1S/C18H19FN4OS/c1-11(2)21-15-10-13(8-9-20-15)17-16(22-18(23-17)25(3)24)12-4-6-14(19)7-5-12/h4-11H,1-3H3,(H,20,21)(H,22,23)/t25-/m0/s1
The differences are in the "/t(isotopic:stereo:sp3)", "/m(fixed_:stereo:sp3:inverted)", and "/s(fixed_H:stereo_type=abs)" layers. Got that?

I don't. I used the techniques of the next section to get the molfiles for each structure. The differences are in the bonds between atoms 23/24 (the sulfoxide, represented in charge-separated form) and atoms 23/25 (the methyl on the sulfur). The molfile for the first record has no assigned bond stereochemistry, the second has a down flag for the sulfoxide, and the third has a down flag for the methyl.

molfile column in compound_structures

There's a "molfile" entry. Does it really include the structure as a raw MDL molfile? Yes, yes it does:

sqlite> select molfile from compound_structures where molregno = 805;

          11280714442D 1   1.00000     0.00000     0

  8  8  0     0  0            999 V2000
    6.0750   -2.5667    0.0000 C   0  0  0  0  0  0           0  0  0
    5.3625   -2.9792    0.0000 N   0  0  3  0  0  0           0  0  0
    6.7917   -2.9792    0.0000 N   0  0  0  0  0  0           0  0  0
    5.3625   -3.8042    0.0000 C   0  0  0  0  0  0           0  0  0
    4.6542   -2.5667    0.0000 C   0  0  0  0  0  0           0  0  0
    6.0750   -1.7417    0.0000 C   0  0  0  0  0  0           0  0  0
    4.6542   -1.7417    0.0000 C   0  0  0  0  0  0           0  0  0
    5.3625   -1.3292    0.0000 C   0  0  0  0  0  0           0  0  0
  2  1  1  0     0  0
  3  1  2  0     0  0
  4  2  1  0     0  0
  5  2  1  0     0  0
  6  1  1  0     0  0
  7  8  1  0     0  0
  8  6  1  0     0  0
  7  5  1  0     0  0
M  END

Why did I choose molregno = 805? I looked for a structure with 8 atoms and 8 bonds by searching for the substring "  8  8  0", which is in the counts line. (It's not a perfect solution, but rather a good-enough one.)

sqlite> select molregno from compound_structures where molfile LIKE "%  8  8  0%" limit 1;
805
I bet with a bit of effort I could count the number of rings by using the molfile to get the atom and bond counts and using the number of "."s in the canonical_smiles to get the number of fragments.
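
If I were to try it, a rough sketch might look like the following. It assumes the V2000 counts line is the fourth line of the molfile, and it counts rings as the circuit rank (bonds - atoms + fragments), where the fragment count is one more than the number of "."s:

import sqlite3

db = sqlite3.connect("chembl_21.db")
molfile, smiles = db.execute(
    "select molfile, canonical_smiles from compound_structures"
    " where molregno = 805").fetchone()

counts_line = molfile.split("\n")[3]    # the V2000 counts line
num_atoms = int(counts_line[0:3])
num_bonds = int(counts_line[3:6])
num_fragments = smiles.count(".") + 1

num_rings = num_bonds - num_atoms + num_fragments
print(num_rings)    # 1 for molregno 805, which has a single piperidine ring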

compound_properties and molecule_dictionary tables

The compound_properties table stores some molecular properties. I'll get the number of heavy atoms, the number of aromatic rings, and the full molecular weight for structure 805.

sqlite> select heavy_atoms, aromatic_rings, full_mwt from compound_properties where molregno = 805;
8|0|112.17
I've been using "805", which is an internal identifier. What's its public ChEMBL id?
sqlite> select chembl_id from molecule_dictionary where molregno = 805;
CHEMBL266980
What are some of the records with only 1 or 2 atoms?
sqlite> select chembl_id, heavy_atoms from molecule_dictionary, compound_properties
   ...>  where molecule_dictionary.molregno = compound_properties.molregno
   ...>  and heavy_atoms < 3 limit 5;
CHEMBL1098659|1
CHEMBL115849|2
CHEMBL1160819|1
CHEMBL116336|2
CHEMBL116838|2

InChI and heavy atom count for large structures

I showed that some of the SMILES were used for two or three records. What about the InChI string? I started with:

sqlite> select molregno, standard_inchi, count(*) from compound_structures
   ...>   group by standard_inchi having count(*) > 1;
1378059||9
After 10 minutes with no other output, I gave up. Those 9 occurrences have a NULL value, that is:
sqlite> select count(*) from compound_structures where standard_inchi is NULL;
9
I was confused at first because there are SMILES strings (I'll show only the first 40 characters), so there is structure information. The heavy atom count is also NULL:
sqlite> select compound_structures.molregno, heavy_atoms, substr(canonical_smiles, 1, 40)
   ...>  from compound_structures, compound_properties
   ...>  where standard_inchi is NULL and compound_structures.molregno = compound_properties.molregno;
615447||CC1=CN([C@H]2C[C@H](OP(=O)(O)OC[C@H]3O[C
615448||CC1=CN([C@H]2C[C@H](OP(=O)(O)OC[C@H]3O[C
615449||CC1=CN([C@H]2C[C@H](OP(=O)(O)OC[C@H]3O[C
615450||CC1=CN([C@H]2C[C@H](OP(=O)(O)OC[C@H]3O[C
615451||CC1=CN([C@H]2C[C@H](OP(=O)(O)OC[C@H]3O[C
1053861||CN(C)P(=O)(OC[C@@H]1CN(C[C@@H](O1)n2cnc3
1053864||CN(C)P(=O)(OC[C@@H]1CN(C[C@@H](O1)n2cnc3
1053865||CN(C)P(=O)(OC[C@@H]1CN(C[C@@H](O1)N2C=CC
1378059||CC[C@H](C)[C@H](NC(=O)[C@H](CCCNC(=N)N)N
Then I realized it's because the schema specifies the "heavy_atoms" field as "NUMBER(3,0)". While SQLite ignores that limit, it looks like ChEMBL doesn't try to store a count above 999.

What I'll do instead is get the molecular formula, which shows that there are over 600 heavy atoms in those structures:

sqlite> select chembl_id, full_molformula
   ...>   from compound_structures, compound_properties, molecule_dictionary
   ...>   where standard_inchi is NULL
   ...>   and compound_structures.molregno = compound_properties.molregno
   ...>   and compound_structures.molregno = molecule_dictionary.molregno;
CHEMBL1077162|C318H381N118O208P29
CHEMBL1077163|C319H383N118O209P29
CHEMBL1077164|C318H382N119O208P29
CHEMBL1077165|C325H387N118O209P29
CHEMBL1631334|C361H574N194O98P24S
CHEMBL1631337|C367H606N172O113P24
CHEMBL1631338|C362H600N180O106P24
CHEMBL2105789|C380H614N112O113S9
Those are some large structures! The reason there are no InChIs for them is that InChI didn't support large molecules until version 1.05, which came out in early 2017. Before then, InChI only supported 1024 atoms. Which is normally fine as most compounds are small (hence "small molecule chemistry"). In fact, there aren't any records with more than 79 heavy atoms:
sqlite> select heavy_atoms, count(*) from compound_properties
   ...>   where heavy_atoms > 70 group by heavy_atoms;
71|364
72|207
73|46
74|29
75|3
76|7
78|2
79|2
How in the world do these large structures have 600+ atoms? Are they peptides? Mmm, no, not all. The first 8 contain a lot of phosphorus. I'm guessing some sort of nucleic acid. The last might be a protein. Perhaps I can get a clue from the chemical name, which is in the compound_records table. Here's an example using the molregno 805 from earlier:
sqlite> select * from compound_records where molregno = 805;
1063|805|14385|14|1-Methyl-piperidin-(2Z)-ylideneamine|1|
Some of the names of the 600+ atom molecules are too long, so I'll limit the output to the first 50 characters of the name:
sqlite> select chembl_id, full_molformula, substr(compound_name, 1, 50)
   ...>   from compound_structures, molecule_dictionary, compound_properties, compound_records
   ...>  where standard_inchi is NULL
   ...>    and compound_structures.molregno = molecule_dictionary.molregno
   ...>    and compound_structures.molregno = compound_properties.molregno
   ...>    and compound_structures.molregno = compound_records.molregno;
CHEMBL1077161|C307H368N116O200P28|{[(2R,3S,4R,5R)-5-(4-amino-2-oxo-1,2-dihydropyrimi
CHEMBL1077162|C318H381N118O208P29|{[(2R,3S,5R)-2-{[({[(2R,3S,4R,5R)-5-(4-amino-2-oxo
CHEMBL1077163|C319H383N118O209P29|{[(2R,3S,5R)-2-{[({[(2R,3S,4R,5R)-5-(4-amino-2-oxo
CHEMBL1077164|C318H382N119O208P29|{[(2R,3S,5R)-2-{[({[(2R,3S,4R,5R)-5-(4-amino-2-oxo
CHEMBL1077165|C325H387N118O209P29|{[(2R,3S,5R)-2-{[({[(2R,3S,4R,5R)-5-(4-amino-2-oxo
CHEMBL1631334|C361H574N194O98P24S|HRV-EnteroX
CHEMBL1631337|C367H606N172O113P24|PV-5'term
CHEMBL1631338|C362H600N180O106P24|PV-L4
CHEMBL2105789|C380H614N112O113S9|Mirostipen
That didn't help much, but I could at least do a web search for some of the names. For example, HRV-EnteroX is a PPMO (peptide-conjugated phosphorodiamidate morpholino oligomers), which is where those phosphorous atoms come from.

The names weren't really much help, and the images at ChEMBL were too small to make sense of the structures, so I looked at them over at PubChem. HRV-EnteroX looks like a 12-mer peptide conjugated to about 25 morpholino oligomers. Mirostipen looks like a peptide. CHEMBL1077161 looks like a nucleic acid strand.

I don't think there's anything interesting to explore in this direction so I'll move on.

Assay data

I'll take a look at assay data, which I deal with a lot less often than I do structure data. How many assays are there?
sqlite> select count(*) from assays;
1212831
Okay, and how many of them are human assays? For that I need the NCBI taxonomy id. Iain's example code uses 9606, which the NCBI web site tells me is for Homo sapiens. I don't think there's a table in the SQLite data dump with all of the taxonomy ids. The organism_class table says only:
sqlite> select * from organism_class where tax_id = 9606;
7|9606|Eukaryotes|Mammalia|Primates
The assay table "assay_organism" column stores the "[n]ame of the organism for the assay system", with the caution "[m]ay differ from the target organism (e.g., for a human protein expressed in non-human cells, or pathogen-infected human cells)." I'll throw caution to the wind and check that field:
sqlite> select count(*) from assays where assay_organism = "Homo sapiens";
291143
sqlite> select assay_organism, count(*) from assays
   ...>  where assay_tax_id = 9606 group by assay_organism;
|17
Homo sapiens|291135
sqlite> select count(*) from assays where assay_tax_id = 9606 and assay_organism is NULL;
17
It looks like 9606 is indeed for humans.

Assay activities

What sort of assay activities are there?
sqlite> select distinct published_type from activities;

 ED50
 Transactivation
%
% Cell Death
 ...
AUC
AUC (0-24h)
AUC (0-4h)
AUC (0-infinity)
 ...
Change
Change HDL -C
Change MAP
Change TC
 ...
Okay, quite a few. There appear to be some typos as well:
sqlite> select published_type, count(*) from activities where published_type in ("Activity", "A ctivity",
   ...>   "Ac tivity", "Act ivity", "Acti vity", "Activ ity", "Activi ty", "Activit y", "Activty")
   ...>   group by published_type;
A ctivity|1
Activ ity|2
Activit y|1
Activity|700337
Activty|1
After another 20 minutes of data exploration, I realized that there are two different types. The "published_type" is what the assayer published, while there's also a "standard_type", which looks to be a normalized value by ChEMBL:
sqlite> select published_type, standard_type from activities
   ...>   where published_type in ("A ctivity", "Activ ity", "Activit y", "Activty");
A ctivity|Activity
Activ ity|Activity
Activ ity|Activity
Activit y|Activity
Activty|Activity
There are many ways to publish a report with IC50 data. I'll show only those that end with "IC50".
sqlite> select distinct published_type from activities where published_type like "%IC50";
-Log IC50
-Log IC50/IC50
-logIC50
Average IC50
CC50/IC50
CCIC50
CIC IC50
CIC50
Change in IC50
Cytotoxicity IC50
Decrease in IC50
EIC50
FIC50
Fold IC50
I/IC50
IC50
IC50/IC50
Increase in IC50
Log 1/IC50
Log IC50
MBIC50
MIC50
Mean IC50
RIC50
Ratio CC50/IC50
Ratio CIC95/IC50
Ratio ED50/MIC50
Ratio IC50
Ratio LC50/IC50
Ratio LD50/IC50
Ratio pIC50
Ratio plasma concentration/IC50
Relative ET-A IC50
Relative IC50
TBPS IC50
TC50/IC50
Time above IC50
fIC50
log1/IC50
logIC50
pIC50
pMIC50
rIC50
The "p" prefix, as in "pIC50", is shorthand for "-log", so "-Log IC50", "Log 1/IC50", and "pIC50" are almost certainly the same units. Let's see:
sqlite> select distinct published_type, standard_type from activities
   ...>   where published_type in ("-Log IC50", "Log 1/IC50", "pIC50");
-Log IC50|IC50
-Log IC50|pIC50
-Log IC50|-Log IC50
Log 1/IC50|IC50
Log 1/IC50|Log 1/IC50
pIC50|IC50
pIC50|pIC50
pIC50|Log IC50
pIC50|-Log IC50
Well color me confused. Oh! There's a "standard_flag", which "[s]hows whether the standardised columns have been curated/set (1) or just default to the published data (0)." Perhaps that will help enlighten me.
sqlite> select distinct published_type, standard_flag, standard_type from activities
   ...>  where published_type in ("-Log IC50", "Log 1/IC50", "pIC50");
-Log IC50|1|IC50
-Log IC50|1|pIC50
-Log IC50|0|-Log IC50
Log 1/IC50|1|IC50
Log 1/IC50|0|Log 1/IC50
pIC50|1|IC50
pIC50|1|pIC50
pIC50|0|Log IC50
pIC50|1|-Log IC50
pIC50|0|pIC50
Nope, I still don't understand what's going on. I'll assume it's all tied to the complexities of data curation. For now, I'll assume that the data set is nice and clean.

IC50 types

Let's look at the "IC50" values only. How do the "published_type" and "standard_type" columns compare to each other?

sqlite> select published_type, standard_type, count(*) from activities
   ...>   where published_type = "IC50" group by standard_type;
IC50|% Max Response|21
IC50|Change|2
IC50|Control|1
IC50|Electrophysiological activity|6
IC50|Fold Inc IC50|1
IC50|Fold change IC50|12
IC50|IC50|1526202
IC50|IC50 ratio|1
IC50|Inhibition|12
IC50|Log IC50|20
IC50|Ratio IC50|40
IC50|SI|4
IC50|T/C|1
sqlite> select published_type, standard_type, count(*) from activities
   ...>   where standard_type = "IC50" group by published_type;
-Log IC50|IC50|1736
-Log IC50(M)|IC50|28
-Log IC50(nM)|IC50|39
-logIC50|IC50|84
3.3|IC50|1
Absolute IC50 (CHOP)|IC50|940
Absolute IC50 (XBP1)|IC50|940
Average IC50|IC50|34
CIC50|IC50|6
I 50|IC50|202
I-50|IC50|25
I50|IC50|6059
IC50|IC50|1526202
IC50 |IC50|52
IC50 app|IC50|39
IC50 max|IC50|90
IC50 ratio|IC50|2
IC50(app)|IC50|457
IC50_Mean|IC50|12272
IC50_uM|IC50|20
ID50|IC50|3
Log 1/I50|IC50|280
Log 1/IC50|IC50|589
Log 1/IC50(nM)|IC50|88
Log IC50|IC50|7013
Log IC50(M)|IC50|3599
Log IC50(nM)|IC50|77
Log IC50(uM)|IC50|28
Mean IC50|IC50|1
NA|IC50|5
NT|IC50|20
log(1/IC50)|IC50|1016
pI50|IC50|2386
pIC50|IC50|43031
pIC50(mM)|IC50|71
pIC50(nM)|IC50|107
pIC50(uM)|IC50|419
Yeah, I'm going to throw my hands up here, declare "I'm a programmer, Jim, not a bioactivity specialist", and simply use the published_type of IC50.

IC50 activity values

How are the IC50 values measured? Here too I need to choose between "published_units" and "standard_units". A quick look at the two shows that the standard_units are less diverse.

sqlite>
   ...>
select standard_units, count(*) from activities where published_type = "IC50"
  group by standard_units;
|167556
%|148
% conc|70
10'-11uM|1
10'-4umol/L|1
M ml-1|15
equiv|64
fg ml-1|1
g/ha|40
kJ m-2|20
liposomes ml-1|5
mMequiv|38
mg kg-1|248
mg.min/m3|4
mg/kg/day|1
milliequivalent|22
min|9
ml|3
mmol/Kg|10
mol|6
molar ratio|198
nA|6
nM|1296169
nM g-1|1
nM kg-1|4
nM unit-1|7
nmol/Kg|1
nmol/mg|5
nmol/min|1
ppm|208
ppm g dm^-3|7
uL|7
uM hr|1
uM tube-1|9
uM well-1|52
uM-1|25
uM-1 s-1|1
ucm|6
ucm s-1|2
ug|168
ug cm-2|1
ug g-1|2
ug well-1|12
ug.mL-1|61139
ug/g|16
umol kg-1|3
umol.kg-1|8
umol/dm3|2
"Less diverse", but still diverse. By far the most common is "nM for "nanomolar", which is the only unit I expected. How many IC50s have an activities better than 1 micromolar, which is 1000 nM?

sqlite>
   ...>
select count(*) from activities where published_type = "IC50"
   and standard_value < 1000 and standard_units = "nM";
483041
That's fully 483041/1212831 = 40% of the assays in the data dump.

How many of the IC50s are in humans? For that I need a join with the assays table using the assay_id:

sqlite>
   ...>
   ...>
   ...>
   ...>
select count(*) from activities, assays
 where published_type = "IC50"
   and standard_value < 1000 and standard_units = "nM"
   and activities.assay_id = assays.assay_id
   and assay_tax_id = 9606;
240916
About 1/2 of them are in humans.

Assay target type from target_dictionary

What are the possible assay targets in humans? That information is stored in the target_dictionary:

sqlite> select distinct target_type from target_dictionary where tax_id = 9606;
SINGLE PROTEIN
ORGANISM
CELL-LINE
UNKNOWN
PROTEIN COMPLEX
SUBCELLULAR
TISSUE
NUCLEIC-ACID
PROTEIN FAMILY
PROTEIN-PROTEIN INTERACTION
PROTEIN COMPLEX GROUP
SELECTIVITY GROUP
CHIMERIC PROTEIN
MACROMOLECULE
SMALL MOLECULE
OLIGOSACCHARIDE

Remember earlier when I threw caution to the wind? How many of the assays are actually against human targets? I can join on the target id "tid" to compare the taxon id in the target vs. the taxon id in the assay:

sqlite>
   ...>
   ...>
select count(*) from assays, target_dictionary
 where assays.tid = target_dictionary.tid
   and target_dictionary.tax_id = 9606;
301810
sqlite>
   ...>
   ...>
select count(*) from assays, target_dictionary
 where assays.tid = target_dictionary.tid
   and assays.assay_tax_id = 9606;
291152

Compare assay organisms with target organism

What are some of the non-human assay organisms where the target is humans?

sqlite>
   ...>
   ...>
   ...>
   ...>
select distinct assay_organism from assays, target_dictionary
 where assays.tid = target_dictionary.tid
   and assays.assay_tax_id != 9606
   and target_dictionary.tax_id = 9606
 limit 10;
rice
Saccharomyces cerevisiae
Oryza sativa
Rattus norvegicus
Sus scrofa
Cavia porcellus
Oryctolagus cuniculus
Canis lupus familiaris
Proteus vulgaris
Salmonella enterica subsp. enterica serovar Typhi

Compounds tested against a target name

I'm interested in the "SINGLE PROTEIN" target names in humans. The target name is a manually curated field.

sqlite> select distinct pref_name from target_dictionary where tax_id = 9606 limit 5;
Maltase-glucoamylase
Sulfonylurea receptor 2
Phosphodiesterase 5A
Voltage-gated T-type calcium channel alpha-1H subunit
Dihydrofolate reductase
What structures are used in "Dihydrofolate reductase" assays? This requires three table joins, one on 'tid' to go from target_dictionary to assays, another on 'assay_id' to get to the activity, and another on 'molregno' to go from activities to molecule_dictionary so I can get the compound's chembl_id. (To make it more interesting, three of the tables have a chembl_id column.)
sqlite>
   ...>
   ...>
   ...>
   ...>
   ...>
   ...>
select distinct molecule_dictionary.chembl_id
  from target_dictionary, assays, activities, molecule_dictionary
 where target_dictionary.pref_name = "Dihydrofolate reductase"
   and target_dictionary.tid = assays.tid
   and assays.assay_id = activities.assay_id
   and activities.molregno = molecule_dictionary.molregno
 limit 10;
CHEMBL1679
CHEMBL429694
CHEMBL106699
CHEMBL422095
CHEMBL1161155
CHEMBL350033
CHEMBL34259
CHEMBL56282
CHEMBL173175
CHEMBL173901
sqlite>
   ...>
   ...>
   ...>
   ...>
   ...>
select count(distinct molecule_dictionary.chembl_id)
   from target_dictionary, assays, activities, molecule_dictionary
  where target_dictionary.pref_name = "Dihydrofolate reductase"
    and target_dictionary.tid = assays.tid
    and assays.assay_id = activities.assay_id
    and activities.molregno = molecule_dictionary.molregno;
3466
There are 3466 of these, including non-human assays. I'll limit it to human ones only:
sqlite>
   ...>
   ...>
   ...>
   ...>
   ...>
   ...>
select count(distinct molecule_dictionary.chembl_id)
   from target_dictionary, assays, activities, molecule_dictionary
  where target_dictionary.pref_name = "Dihydrofolate reductase"
    and target_dictionary.tax_id = 9606
    and target_dictionary.tid = assays.tid
    and assays.assay_id = activities.assay_id
    and activities.molregno = molecule_dictionary.molregno;
1386
I'll further limit it to those with an IC50 of under 1 micromolar:
sqlite>
sqlite>
   ...>
   ...>
   ...>
   ...>
   ...>
   ...>
   ...>
   ...>
   ...>
.timer on
select count(distinct molecule_dictionary.chembl_id)
   from target_dictionary, assays, activities, molecule_dictionary
  where target_dictionary.pref_name = "Dihydrofolate reductase"
    and target_dictionary.tax_id = 9606
    and target_dictionary.tid = assays.tid
    and assays.assay_id = activities.assay_id
    and activities.published_type = "IC50"
    and activities.standard_units = "nM"
    and activities.standard_value < 1000
    and activities.molregno = molecule_dictionary.molregno;
255
Run Time: real 174.561 user 18.073715 sys 23.285346
I turned on the timer to show that the query took about 3 minutes! I repeated it to ensure that it wasn't a simple cache issue. Still about 3 minutes.

ANALYZE the tables

The earlier query, without the activity filter, took 5.7 seconds when the data wasn't cached, and 0.017 seconds when cached. It found 1386 matches. The new query takes almost 3 minutes more to filter those 1386 matches down to 255. That should not happen.

This is a strong indication that the query planner used the wrong plan. I've had this happen before. My solution then was to "ANALYZE" the tables, which "gathers statistics about tables and indices and stores the collected information in internal tables of the database where the query optimizer can access the information and use it to help make better query planning choices."
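
If you want to see which plan SQLite actually chose, you can ask it with EXPLAIN QUERY PLAN. Here's a minimal sketch using Python's sqlite3 module (the exact plan output format depends on the SQLite version):

import sqlite3

conn = sqlite3.connect("chembl_21.db")
# Prefix the slow query with EXPLAIN QUERY PLAN to see the join order
# and index choices, without actually running the query.
plan = conn.execute("""
EXPLAIN QUERY PLAN
select count(distinct molecule_dictionary.chembl_id)
   from target_dictionary, assays, activities, molecule_dictionary
  where target_dictionary.pref_name = 'Dihydrofolate reductase'
    and target_dictionary.tax_id = 9606
    and target_dictionary.tid = assays.tid
    and assays.assay_id = activities.assay_id
    and activities.published_type = 'IC50'
    and activities.standard_units = 'nM'
    and activities.standard_value < 1000
    and activities.molregno = molecule_dictionary.molregno
""").fetchall()
for row in plan:
    print(row)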

ANALYZE can take a while, so I limited it to the tables of interest.

sqlite> analyze target_dictionary;
Run Time: real 0.212 user 0.024173 sys 0.016268
sqlite> analyze assays;
Run Time: real 248.184 user 5.890109 sys 4.793236
sqlite> analyze activities;
Run Time: real 6742.390 user 97.862790 sys 129.854073
sqlite> analyze molecule_dictionary;
Run Time: real 33.879 user 2.195662 sys 2.043848
Yes, it took almost 2 hours to analyze the activities table. But it was worth it from a pure performance view. I ran the above code twice, with this pattern:
% sudo purge  # clear the filesystem cache
% sqlite3 chembl_21.db  # start SQLite
SQLite version 3.8.5 2014-08-15 22:37:57
Enter ".help" for usage hints.
sqlite> .timer on
sqlite> .... previous query, with filter for IC50 < 1uM ...
255
Run Time: real 8.595 user 0.038847 sys 0.141945
sqlite> .... repeat query using a warm cache
255
Run Time: real 0.009 user 0.005255 sys 0.003653
Nice! Now I only need to do about 60 such queries to justify the overall analysis time.



Fingerprint set similarity #

This is part 1 of a three-part series in generating fingerprint-based set similarities using chemfp. Read part 2 where I figure out how to use the ChEMBL bioactivity data and part 3 where I put everything together and visualize the results.

Someone recently asked how I might use chemfp to compute the similarity between two sets of fingerprints. This would be used for something like SEA (Similarity Ensemble Approach), which "relates proteins based on the set-wise chemical similarity among their ligands. It can be used to rapidly search large compound databases and to build cross-target similarity maps."

The underlying techniques of document set similarity are used in many fields. Bioinformatics, for example, uses it to analyze gene expression microarrays to find correlations between genes which might indicate they affect similar pathways. That said, I am no expert on the topic. I implemented a SEA-like algorithm for one client, but I don't know enough to judge the effectiveness of one approach over another, nor do I know the underlying statistics.

But that's okay. I'll assume that you know these details, and just want to know how to use chemfp to get the raw numbers to feed into your scoring function.

The first step is to create and load fingerprints. I have some pre-computed fingerprints for ChEMBL 21, so I'll use that as my example. I'll use the 2048-bit RDKit fingerprints. To generate them yourself, you can use the command-line tool rdkit2fps, like this:

% rdkit2fps chembl_21.sdf.gz -o chembl_21.rdkit2048.fps.gz

The FPS file is a text-based, line-oriented format which is easy for both humans and software to read and write. Easy, however, does not mean fast. It takes about 20 seconds to load this fps.gz file:

>>> import time
>>> import chemfp
>>> filename = "/Users/dalke/databases/chembl_21.rdkit.rdkit2048.fps.gz"
>>> t1 = time.time(); chemfp.load_fingerprints(filename);t2=time.time()
<chemfp.arena.FingerprintArena object at 0x1021aef90>
>>> t2-t1
19.388361930847168

Chemfp 2.0 introduced the FPB binary file format, which is more structured around the chemfp internal data structures. It's very fast for chemfp to load, but much more complicated. All of the chemfp tools know how to read and write the FPB format, so you could create the FPB file like this:

% rdkit2fps chembl_21.sdf.gz -o chembl_21.rdkit2048.fpb
If you already have an FPS (or gzip-ed FPS) file then use the fpcat tool to convert it to an FPB file:
% fpcat chembl_21.rdkit2048.fps.gz -o chembl_21.rdkit2048.fpb
The conversion took about 32 seconds on my 7 year old laptop.

I'll switch back to Python and show why I developed the FPB format. The first time I loaded it took 0.4 seconds:

>>> t1=time.time(); chemfp.load_fingerprints("chembl_21.rdkit2048.fpb");t2=time.time()
<chemfp.fpb_io.FPBFingerprintArena object at 0x113809790>
>>> t2-t1
0.3760688304901123
Much of that time is likely waiting for the hard disk. When I open it again, the data is already in disk cache, so the load time is now under 0.1 seconds:
>>> t1=time.time(); chemfp.load_fingerprints("chembl_21.rdkit2048.fpb");t2=time.time()
<chemfp.fpb_io.FPBFingerprintArena object at 0x113809650>
>>> t2-t1
0.06340503692626953
That's much better than the 20 seconds it took to read from the gzipped FPS file. For what it's worth, for the RDKit 2048-bit fingerprint, the fps.gz file is slightly more compact than the .fpb file:
% ls -l chembl_21.rdkit2048.fpb chembl_21.rdkit2048.fps.gz
-rw-r--r--  1 dalke  staff  416031861 May 12  2016 chembl_21.rdkit2048.fps.gz
-rw-r--r--  1 dalke  staff  457103145 Mar 18 02:36 chembl_21.rdkit2048.fpb

Select a subset

In chemfp, a set of fingerprints is called an "arena". (I struggled for a while to come up with a good name. "A 'hand' of fingerprints" was humorous, but in the end I re-used the name from memory management.) I want to use a subset of the arena. The best solution is to use the arena.copy() method. By itself it makes a copy of the entire data set:

>>> import chemfp
>>> arena = chemfp.load_fingerprints("chembl_21.rdkit2048.fpb")
>>> subarena = arena.copy()
>>> len(arena), len(subarena)
(1583839, 1583839)

Arenas act like an array of (id, fingerprint) pairs. The first element is term [0], the second [1], and so on. For example:

>>> arena[1234]
(u'CHEMBL79699', '\x04\x00\x00\x08 ... many bytes omitted ... \x00\xc0')
>>> arena[5678]
(u'CHEMBL291214', '\x00\x00\x00\x08 ... many bytes omitted ... \x00 \x80')

I'll make another copy, but this time ask chemfp to only copy indices 1234 and 5678:

>>> subarena = arena.copy(indices=[1234, 5678])
>>> len(subarena)
2
>>> subarena.ids
[u'CHEMBL79699', u'CHEMBL291214']
>>> subarena[0]
(u'CHEMBL79699', '\x04\x00\x00\x08 ... many bytes omitted ... \x00\xc0')
(Notes: by default the subarena may rearrange the fingerprints for better search performance. If you want the new arena to have the fingerprints in the same order as the list of indices then also pass in "reorder=False". Also, the u'' here shows that I'm using chemfp 3.0 on Python 2.7. Earlier versions of chemfp return both the id and fingerprint as byte strings, while chemfp 3.0 returns the id as a Unicode string and the fingerprint as a byte string. Under Python 2, Unicode strings are shown using the 'u' prefix. Under Python 3, byte strings are represented with the 'b' prefix.)
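
For example, here's a small sketch (reusing the two indices from above) showing that reorder=False preserves the requested order:

# With reorder=False the copy keeps the fingerprints in the order given by
# the indices list, rather than the popcount-sorted order used for searching.
subarena = arena.copy(indices=[5678, 1234], reorder=False)
assert list(subarena.ids) == [arena.ids[5678], arena.ids[1234]]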

Make a random subset

I need some subsets so I can compare them. To get started I'll just use randomly selected elements. This is pleasantly easy, using a technique I learned from reading the Python documentation for random.sample:

To choose a sample from a range of integers, use an xrange() object as an argument. This is especially fast and space efficient for sampling from a large population: sample(xrange(10000000), 60).
Note: the xrange() object is Python 2 specific. If you are using Python 3 then replace "xrange" with "range".

Here are some examples where I sample 4 values (without replacement) from 0...99:

>>> import random
>>> random.sample(xrange(100), 4)
[68, 49, 79, 38]
>>> random.sample(xrange(100), 4)
[77, 57, 33, 43]
>>> random.sample(xrange(100), 4)
[32, 70, 13, 53]
I'll change this to get 4 random indices into the arena:
>>> random.sample(xrange(len(arena)), 4)
[1089497, 196869, 1474590, 376331]
>>> random.sample(xrange(len(arena)), 4)
[161904, 1136119, 1455737, 518548]
>>> random.sample(xrange(len(arena)), 4)
[1092929, 218864, 1436330, 1357672]

Here's a function to create a random subarena of size n, in easy-to-copy form, which I'll then use interactively:

import random
def make_subset(arena, n):
    indices = random.sample(xrange(len(arena)), n)
    return arena.copy(indices=indices)

>>> a = make_subset(arena, 4)
>>> len(a)
4
>>> a.ids
[u'CHEMBL1448181', u'CHEMBL1532656', u'CHEMBL3480613', u'CHEMBL1765556']

Jaccard index between two sets

The Jaccard index is a widely used scoring function. (In cheminformatics we know it better as the Tanimoto, for historical reasons tied more to how knowledge diffuses than to strict date priority.) Simply put, it's defined as:

                number of elements common to both A and B
Jaccard(A, B) = -----------------------------------------
                    number of unique elements in A or B
The harder part is to turn this into something meaningful. What does it mean for a fingerprint record to be common to both A and B?

One possibility is to look at the identifier, and say that a record is common to A and B if and only if the record's identifier is in both sets. However, it would be nice to capture some sense of chemical similarity. We could instead say that a record is in common if and only if both sets contain a record with the same fingerprint, but requiring 100% similarity is very strict.

Let's lower the threshold a bit, and say that a record 'a' in A is common to B if and only if there is at least one record 'b' in B such that the fingerprint Tanimoto(a, b) >= 0.8.

If you think about this for a bit you'll see a couple of things. First off, this definition is not symmetric. If some 'a' has at least one similar 'b' then that 'b' has at least one similar 'a', but a single 'a' might be the nearest neighbor of two different 'b's, so searching A against B can give a different count than searching B against A. (The asymmetry could be fixed by also including the results of a search of B against A. I'm not going to consider that approach.) Second, "has a neighbor in B" can be implemented as a k=1 nearest neighbor search of A against B, which is a built-in chemfp function.

I'll make two subsets with 1,000 elements each, randomly sampled from the ChEMBL-21 arena:

>>> subset1 = make_subset(arena, 1000)
>>> subset2 = make_subset(arena, 1000)
then use chemfp to do the k=1 nearest-neighbor search between two arenas, with a minimum required score of 0.8:
>>> from chemfp import search
>>> results = search.knearest_tanimoto_search_arena(subset1, subset2, k=1, threshold=0.8)

Most randomly chosen fingerprints are below the 0.8 threshold, so they find 0 neighbors, but some of them have 1 neighbor:

>>> len(results[0])
0
>>> len(results[1])
0
>>> len(results[26])
1
There are a few ways to count the number of elements in common. You can use normal Python functions, like:
>>> sum(len(result) for result in results)
24
A faster way is to let chemfp count all of the hits for you:
>>> results.count_all()
24

Okay, so Tanimoto similarity gives the numerator to the Jaccard set similarity. What about the denominator, which is "the number of unique elements in A or B"?

For that I'll count the total number of unique identifiers in both A and B. You saw earlier that arena.ids gives the list of identifiers for the arena. I'll use those to make a Python set made of the union of the two ids, then get the size of the resulting set:

>>> len(set(subset1.ids).union(subset2.ids))
2000
Thus, the Jaccard index of the two sets is 24/2000, or 0.012. They are not that similar.

A slightly more interesting case

The denominator of 2000 is rather boring, as it's 1000+1000, that is, the sum of the individual set sizes. That's because the two randomly chosen sets have no elements in common:

>>> len(set(subset1.ids).intersection(subset2.ids))
0
I'll make some random samples until there are at least 3 elements in common:
>>> while len(set(subset1.ids).intersection(subset2.ids)) < 3:
...   subset2 = make_subset(arena, 1000)
... 
>>> len(set(subset1.ids).intersection(subset2.ids))
4
>>> set(subset1.ids).intersection(subset2.ids)
set([u'CHEMBL64050', u'CHEMBL129835', u'CHEMBL255712', u'CHEMBL88792'])
>>> len(set(subset1.ids).union(subset2.ids))
1996
These two subsets have 40 elements in common:
>>> search.knearest_tanimoto_search_arena(subset1, subset2, k=1, threshold=0.8).count_all()
40
giving them a Jaccard index of 40/1996, or 0.02. That's better than 0.012, but still not very similar.

Estimating background similarity

I want to get some idea of the average similarity between two randomly chosen sets, at different sizes and threshold scores. I'll define a 'jaccard' function to help with the calculation:

from __future__ import division  # for Python 2.x, make 1/2 be 0.5, not 0
from chemfp import search

# Note: this isn't quite the Jaccard similarity as the asymmetric
# numerator means that often jaccard(A, B) != jaccard(B, A)
def jaccard(arena1, arena2, threshold=0.8):
  numerator = search.knearest_tanimoto_search_arena(
                 arena1, arena2, k=1, threshold=threshold).count_all()
  denominator = len(set(arena1.ids).union(arena2.ids))
  return numerator / denominator

>>> jaccard(subset1, subset2) 
0.02004008016032064 
>>> jaccard(make_subset(arena, 1000), make_subset(arena, 1000))
0.017508754377188594
>>> jaccard(make_subset(arena, 1000), make_subset(arena, 1000))
0.021
>>> jaccard(make_subset(arena, 1000), make_subset(arena, 1000))
0.0155

It varies quite a bit, so I'll compute 50 scores and get some information about the min, max, average, and standard deviation:

>>> scores = []
>>> for i in range(50):
...   scores.append(jaccard(make_subset(arena, 1000), make_subset(arena, 1000)))
... 
>>> min(scores), max(scores)
(0.008, 0.022)

What about the average and standard deviation? Python 3.4 introduced the statistics module, which has those as built-in functions. While chemfp 3.0 supports Python 3.5 and greater, I want this blog post to work on both Python 2.7 and 3.3, so I'll import some functionality from numpy:

>>> import numpy as np 
>>> np.mean(scores)
0.014819138037618857
>>> np.std(scores)
0.0025880786854096645
(One obvious question is, is standard deviation an appropriate characterization, or am I hammering a square block into a round hole? I'll let someone more knowledgeable about statistics figure that out.)
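
If you are on Python 3.4 or later and don't want the numpy dependency, a quick sketch using the statistics module gives equivalent numbers:

# statistics.mean() matches np.mean(); use pstdev() (the population standard
# deviation) to match np.std()'s default, not stdev() (the sample standard deviation).
import statistics
print(statistics.mean(scores))
print(statistics.pstdev(scores))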

I then wrote a function to show the mean and standard deviation of 100 Jaccard scores, across the NxM cross product of some set sizes, that is size 10x10, size 10x20, ... 10x2000, 20x10, ... up to size 2000x2000. Here's the function, which is mostly taken up by code that tries to format things nicely:

def compute_score_table(arena):
  sizes = (10, 20, 50, 100, 200, 500, 1000, 2000)
  print("     " + "".join(str(size).center(12) for size in sizes))
  for size1 in sizes:
    output1 = str(size1).rjust(4) + "   "
    output2 = "       "
    for size2 in sizes:
      scores = []
      for i in range(100):
          score = jaccard(make_subset(arena, size1), make_subset(arena, size2))
          scores.append(score)
      output1 += ("%.3f +/-" % (np.mean(scores),)).ljust(12) 
      output2 += ("  %.5f" % (np.std(scores),)).ljust(12)
    print(output1) 
    print(output2)

The output shows the average and standard deviations on alternate lines, so I could make better use of vertical space:

>>> compute_score_table(arena)
          10          20          50         100         200         500         1000        2000    
  10   0.000 +/-   0.000 +/-   0.000 +/-   0.000 +/-   0.000 +/-   0.000 +/-   0.000 +/-   0.000 +/-   
         0.00000     0.00000     0.00233     0.00127     0.00121     0.00059     0.00036     0.00033   
  20   0.000 +/-   0.000 +/-   0.001 +/-   0.001 +/-   0.001 +/-   0.001 +/-   0.001 +/-   0.001 +/-   
         0.00332     0.00000     0.00345     0.00266     0.00199     0.00115     0.00068     0.00054   
  50   0.000 +/-   0.001 +/-   0.001 +/-   0.002 +/-   0.002 +/-   0.001 +/-   0.001 +/-   0.001 +/-   
         0.00233     0.00371     0.00271     0.00318     0.00322     0.00146     0.00097     0.00067   
 100   0.000 +/-   0.001 +/-   0.001 +/-   0.001 +/-   0.003 +/-   0.003 +/-   0.003 +/-   0.002 +/-   
         0.00127     0.00198     0.00256     0.00256     0.00382     0.00193     0.00174     0.00121   
 200   0.000 +/-   0.001 +/-   0.002 +/-   0.002 +/-   0.004 +/-   0.005 +/-   0.005 +/-   0.005 +/-   
         0.00121     0.00276     0.00316     0.00311     0.00305     0.00286     0.00208     0.00119   
 500   0.000 +/-   0.001 +/-   0.002 +/-   0.003 +/-   0.006 +/-   0.009 +/-   0.010 +/-   0.010 +/-   
         0.00110     0.00108     0.00236     0.00298     0.00314     0.00318     0.00251     0.00177   
1000   0.000 +/-   0.001 +/-   0.002 +/-   0.004 +/-   0.006 +/-   0.012 +/-   0.014 +/-   0.016 +/-   
         0.00061     0.00132     0.00225     0.00280     0.00291     0.00275     0.00272     0.00254   
2000   0.001 +/-   0.001 +/-   0.002 +/-   0.004 +/-   0.007 +/-   0.014 +/-   0.019 +/-   0.024 +/-   
         0.00128     0.00130     0.00192     0.00269     0.00293     0.00303     0.00309     0.00250   
(By the way, if all of the values are 0.0 then you are likely using Python 2.7 and have omitted the "from __future__ import division" line when you defined the jaccard() function. I did that a few times when developing this essay.)

My jaccard() function is not symmetric, which you can verify by comparing the i by j terms with the j by i terms, as in 1000x2000 vs 2000x1000. Their respective values are 0.016+/-0.00254 and 0.019+/-0.00309. The averages are multiple standard deviations away from each other.

The scores are also a function of set size, which is not what you want for a good scoring function. The 2000x2000 term has an average score of 0.024 while the 1000x1000 term has an average score of 0.014. The elements in the sets are randomly selected, so I expected the two comparisons to have about the same score. This bias towards larger sets is built into my similarity definition, so I conclude that my jaccard() implementation is not a good scoring function for sets.

There are ways to improve the scoring function, for example by normalizing to the background average/standard deviation to give a Z-score, or by better use of statistics. That is "left to the student as an exercise."
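
To give a feel for the Z-score idea, here's a minimal sketch built on the jaccard() and make_subset() functions above; the number of background samples is an arbitrary choice, not a recommendation:

import numpy as np

def jaccard_zscore(arena, arena1, arena2, num_background=100, threshold=0.8):
    # Background distribution: scores between random subsets of the same sizes.
    background = [jaccard(make_subset(arena, len(arena1)),
                          make_subset(arena, len(arena2)), threshold)
                  for _ in range(num_background)]
    mean = np.mean(background)
    std = np.std(background)
    if std == 0.0:
        return 0.0  # degenerate background; a Z-score isn't meaningful here
    # How many standard deviations above the random background is the observed score?
    return (jaccard(arena1, arena2, threshold) - mean) / std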

Here's the final code in a single block so you can use it more easily:

from __future__ import division

import random
import numpy as np
import chemfp
from chemfp import search

# Support both Python 2 and Python 3
try:
  xrange # Check if this is Python 2
except NameError:
  xrange = range # Use 'range' for Python 3 

def make_subset(arena, n):
  indices = random.sample(xrange(len(arena)), n)
  return arena.copy(indices=indices)

# Note: this isn't quite the Jaccard similarity as the asymmetric
# numerator means that often jaccard(A, B) != jaccard(B, A)
def jaccard(arena1, arena2, threshold=0.8):
  numerator = search.knearest_tanimoto_search_arena(
                 arena1, arena2, k=1, threshold=threshold).count_all()
  denominator = len(set(arena1.ids).union(arena2.ids))
  return numerator / denominator

def compute_score_table(arena):
  sizes = (10, 20, 50, 100, 200, 500, 1000, 2000)
  print("     " + "".join(str(size).center(12) for size in sizes))
  for size1 in sizes:
    output1 = str(size1).rjust(4) + "   "
    output2 = "       "
    for size2 in sizes:
      scores = []
      for i in range(100):
          score = jaccard(make_subset(arena, size1), make_subset(arena, size2))
          scores.append(score)
      output1 += ("%.3f +/-" % (np.mean(scores),)).ljust(12) 
      output2 += ("  %.5f" % (np.std(scores),)).ljust(12)
    print(output1) 
    print(output2) 


arena = chemfp.load_fingerprints("chembl_21.rdkit2048.fpb")
compute_score_table(arena)

Details about the best match

While it isn't needed for the set-to-set comparison, you might be interested in knowing more about the closest match in the k=1 nearest search. It's easy to get the list of ids and scores for the non-empty rows:

>>> subset1 = make_subset(arena, 1000)
>>> subset2 = make_subset(arena, 1000)
>>> results = search.knearest_tanimoto_search_arena(
...     subset1, subset2, k=1, threshold=0.8)
...
>>> for result in results:
...   if result:
...     print("%-13s %.3f" % result.get_ids_and_scores()[0])
... 
CHEMBL386477  0.926
CHEMBL386477  1.000
CHEMBL275879  0.954
CHEMBL2138462 0.921
CHEMBL83932   0.813
CHEMBL2403669 0.805
CHEMBL2372083 0.819
CHEMBL1213903 0.988
CHEMBL1574001 0.901
CHEMBL2337012 0.913
CHEMBL370254  0.806
CHEMBL555410  0.908
CHEMBL2337012 0.897
CHEMBL1683180 0.822
CHEMBL73244   0.801
CHEMBL605903  0.831
CHEMBL1834630 0.806
CHEMBL1475722 0.821
CHEMBL65904   0.912
CHEMBL2106627 0.812
CHEMBL108353  0.816
CHEMBL108353  0.825

It becomes a bit more complicated if you also want information about the query and its fingerprint, so let's do that. I'll also print the query id, the number of 1-bits in the query and target fingerprint (better known as the popcount), and the number of 1-bits in their fingerprint intersection, that is, the number of bits in common. Here's the code:

import random
import numpy as np
import chemfp
from chemfp import search, bitops

# Support both Python 2 and Python 3
try:
  xrange # Check if this is Python 2
except NameError:
  xrange = range # Use 'range' for Python 3 

def make_subset(arena, n):
    indices = random.sample(xrange(len(arena)), n)
    return arena.copy(indices=indices)

arena = chemfp.load_fingerprints("chembl_21.rdkit2048.fpb")

subset1 = make_subset(arena, 1000)
subset2 = make_subset(arena, 1000)
results = search.knearest_tanimoto_search_arena(
  subset1, subset2, k=1, threshold=0.8)

print("  query_id     #bits    target_id    #bits common  score")
for i, result in enumerate(results):
    if not result:
        continue   # skip elements with no matches
    query_id, query_fp = subset1[i]
    num_query_bits = bitops.byte_popcount(query_fp)
    j, score = result.get_indices_and_scores()[0]
    target_id, target_fp = subset2[j]
    num_target_bits = bitops.byte_popcount(target_fp)
    in_common = bitops.byte_intersect_popcount(query_fp, target_fp)
    print("%-13s %5d   %-13s %5d  %5d   %.4f" %
          (query_id, num_query_bits, target_id, num_target_bits, in_common, score))
It generates the following output:
  query_id     #bits    target_id    #bits common  score
CHEMBL340793    710   CHEMBL443821    657    657   0.9254
CHEMBL2115272   710   CHEMBL1096675   758    667   0.8327
CHEMBL262239    761   CHEMBL411281    761    761   1.0000
CHEMBL529191    821   CHEMBL3434725   843    784   0.8909
CHEMBL389943    868   CHEMBL414961    887    807   0.8513
CHEMBL350381   1008   CHEMBL162288    899    899   0.8919
CHEMBL1407603  1042   CHEMBL1610293  1001    955   0.8778
CHEMBL484872   1058   CHEMBL1610293  1001    967   0.8855
CHEMBL381366   1066   CHEMBL426130   1060    985   0.8633
CHEMBL2058350  1098   CHEMBL2058354  1005   1005   0.9153
CHEMBL1898269  1112   CHEMBL1893417  1049    982   0.8329
CHEMBL132434   1123   CHEMBL15324    1158   1079   0.8977
CHEMBL332868   1128   CHEMBL77685    1160   1052   0.8511
CHEMBL2143647  1188   CHEMBL1893417  1049   1008   0.8202
CHEMBL1711444  1189   CHEMBL1310059  1231   1102   0.8361
CHEMBL3249400  1232   CHEMBL2371463  1128   1062   0.8182
CHEMBL1367025  1242   CHEMBL1352709  1345   1213   0.8828
CHEMBL77799    1290   CHEMBL1081652  1218   1135   0.8267
CHEMBL2303747  1296   CHEMBL15324    1158   1096   0.8071
CHEMBL451899   1656   CHEMBL477245   1551   1433   0.8078
CHEMBL362938   1708   CHEMBL1731136  1966   1641   0.8072
CHEMBL525425   1730   CHEMBL1731136  1966   1658   0.8135
CHEMBL516074   1742   CHEMBL3113501  1876   1691   0.8775
CHEMBL2029145  1764   CHEMBL2304149  1713   1706   0.9633
CHEMBL298265   1865   CHEMBL1731136  1966   1788   0.8752
CHEMBL373910   1906   CHEMBL1731136  1966   1830   0.8962

Cumulative value scoring methods

I earlier pointed out two big problems with the jaccard() implementation I proposed. It isn't symmetric (jaccard(a, b) is rarely equal to jaccard(b, a)), and it has a bias towards larger sizes.

There's another problem. Suppose A and B both have n elements, with no identifiers in common (making the denominator 2n), and suppose also that each element in A has exactly one neighbor in B with at least a 0.8 similarity. Then the jaccard() method will report a similarity of n/2n = 0.5.

Now consider a set C which again has n elements, with no identifiers in common with A. But this time each element of A has 5 neighbors in C, and vice versa. The jaccard() function only checks if there is at least one neighbor, so it will still return a score of 0.5, even though sets A and C are more compact and closer to each other than A was to B.

One way to capture that information is to look at the total number of matches between A and B, and between A and C. Instead of doing a k=1 nearest neighbor search, I'll do a threshold search and count the total number of matches between the two:

>>> A = make_subset(arena, 1000)
>>> B = make_subset(arena, 1000)
>>> C = make_subset(arena, 1000)
>>> results_AB = search.threshold_tanimoto_search_arena(A, B, threshold=0.8)
>>> results_AB.count_all()
34
>>> results_AC = search.threshold_tanimoto_search_arena(A, C, threshold=0.8)
>>> results_AC.count_all()
49
This measure is also symmetric, that is, comparing (A, B) and (B, A) gives the same value.
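
Here's a quick sanity check of that claim, reusing A, B, and results_AB from above; each pair with a score of at least 0.8 is found once in either direction, so the counts match:

# Search in the other direction and compare the total number of matches.
results_BA = search.threshold_tanimoto_search_arena(B, A, threshold=0.8)
assert results_BA.count_all() == results_AB.count_all()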

One possible concern is that the match count assumes that all similarities are of equal importance, so a similarity of 0.8 is just as strong as 1.0. A more subtle method might use the cumulative sum of the scores in AxB and AxC, which is returned by the cumulative_score_all() method:

>>> results_AB.cumulative_score_all()
29.499029322763217
>>> results_AC.cumulative_score_all()
42.23963171858597

I pointed out earlier that my naive jaccard() implementation is biased towards comparing large sets. A similar bias applies when using counts or cumulative scores. With randomly selected elements, the count or cumulative scores will scale as the product of the set sizes:

>>> def get_count(n, m, repeat=100):
...   count = 0
...   for i in range(repeat):
...     subset1 = make_subset(arena, n)
...     subset2 = make_subset(arena, m)
...     results = search.threshold_tanimoto_search_arena(subset1, subset2)
...     count += results.count_all()
...   return count
... 
>>> get_count(1000, 1000)
29709
>>> get_count(1000, 1000)
29803
>>> get_count(2000, 1000)
58237
>>> get_count(2000, 1000)
61533
>>> get_count(3000, 1000)
91164
>>> get_count(3000, 1000)
91833
>>> get_count(3000, 3000)
265121
Change the count_all() to cumulative_score_all() to get similar results using the sum of all the scores.
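
For example, here's a sketch of that substitution, with the same structure as get_count():

def get_cumulative_score(n, m, repeat=100):
    # Sum the scores of all matches instead of counting them.
    total = 0.0
    for i in range(repeat):
        subset1 = make_subset(arena, n)
        subset2 = make_subset(arena, m)
        results = search.threshold_tanimoto_search_arena(subset1, subset2)
        total += results.cumulative_score_all()
    return total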

As I mentioned, an advantage of the count and cumulative score approaches over the k=1 nearest-neighbor search is that they are symmetric. Compare the following to the (3000, 1000) output of the previous code and you'll see they give about the same values:

>>> get_count(1000, 3000)
91185
>>> get_count(1000, 3000)
90586

The joy of floating point arithmetic

This is an aside. Here are two alternative ways to compute the cumulative sum of all of the scores, the first by computing the cumulative_score() of each individual result, and the second by computing the sum of the sum of each list of scores using Python.

>>> sum(result.cumulative_score() for result in results_AB)
29.499029322763214
>>> sum(sum(result.get_scores()) for result in results_AB)
29.499029322763214
It really bugs me that the cumulative_score_all() and the other two approaches differ by -3.552713678800501e-15, which is the least significant bit of the two representations:
>>> sum(sum(result.get_scores()) for result in results_AB).hex()
'0x1.d7fc062bd0356p+4'
>>> results_AB.cumulative_score_all().hex()
'0x1.d7fc062bd0357p+4'

After a bit of research, I confirmed that it's not due to a chemfp bug, but rather because floating point addition using doubles is not associative. If I sum up all of the scores in order, I get the value from cumulative_score_all():

>>> scores = []
>>> for result in results_AB:
...   scores.extend(result.get_scores())
... 
>>> sum(scores)
29.499029322763217

If I group the scores so I first sum up each row, I get the other value:

>>> scores = []
>>> for result in results_AB:
...   scores.append(sum(result.get_scores()))
... 
>>> sum(scores)
29.499029322763214

Minor differences in the last significant digit are an all too common consequence of using floating point numbers, so while it annoyed me, non-associative addition was the first explanation I thought of.
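
The same effect is easy to reproduce with plain Python floats, no chemfp needed:

>>> (0.1 + 0.2) + 0.3
0.6000000000000001
>>> 0.1 + (0.2 + 0.3)
0.6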

Cumulative score-based similarity

In an earlier section I showed that the total count or the cumulative score between two randomly chosen sets scales as the product of the two set sizes. What about normalizing the cumulative score by that product?

There's probably a name for this similarity measure, but I don't know much about this field and don't know the name. I'll call it "sss" for "sum of scores similarity".

# I don't actually use this one
def sss(arena1, arena2, threshold=0.8):
  results = search.threshold_tanimoto_search_arena(
                 arena1, arena2, threshold=threshold)
  return results.cumulative_score_all() / (len(arena1) * len(arena2))
After some experimentation, I decided to scale the score by 300, because that results in two sets with randomly chosen terms having a similarity of 0.01.
# This is the one I use.
def sss(arena1, arena2, threshold=0.8):
  results = search.threshold_tanimoto_search_arena(
                 arena1, arena2, threshold=threshold)
  # The scaling factor of 300 was chosen so that random ChEMBL
  # subsets have a score of about 0.01. It doesn't matter for
  # ranking as it's used as an arbitrary scaling factor.
  similarity = 300 * results.cumulative_score_all() / (len(arena1) * len(arena2))
  return similarity
I made a variant of the similarity table I used before, this time with the sss() function as my similarity measure, instead of jaccard():
def compute_sss_table(arena):
  sizes = (10, 20, 50, 100, 200, 500, 1000, 2000)
  print("     " + "".join(str(size).center(12) for size in sizes))
  for size1 in sizes:
    output1 = str(size1).rjust(4) + "   "
    output2 = "       "
    for size2 in sizes:
      scores = []
      for i in range(100):
          score = sss(make_subset(arena, size1), make_subset(arena, size2))
          scores.append(score)
      output1 += ("%.3f +/-" % (np.mean(scores),)).ljust(12) 
      output2 += ("  %.5f" % (np.std(scores),)).ljust(12)
    print(output1) 
    print(output2) 
The output table is very close to symmetric under transposition, and with a nearly constant similarity score of 0.011:
          10          20          50         100         200         500         1000        2000    
  10   0.024 +/-   0.012 +/-   0.005 +/-   0.013 +/-   0.013 +/-   0.015 +/-   0.008 +/-   0.011 +/-   
         0.23916     0.12318     0.05309     0.06569     0.04212     0.03571     0.02265     0.01623   
  20   0.012 +/-   0.013 +/-   0.008 +/-   0.013 +/-   0.013 +/-   0.010 +/-   0.012 +/-   0.016 +/-   
         0.12176     0.08863     0.05592     0.03930     0.02858     0.01905     0.02362     0.02322   
  50   0.005 +/-   0.028 +/-   0.018 +/-   0.010 +/-   0.010 +/-   0.010 +/-   0.011 +/-   0.012 +/-   
         0.04816     0.09862     0.04181     0.02301     0.02007     0.01040     0.01078     0.01331   
 100   0.010 +/-   0.006 +/-   0.011 +/-   0.012 +/-   0.013 +/-   0.011 +/-   0.011 +/-   0.011 +/-   
         0.05136     0.02660     0.02348     0.01906     0.01725     0.00920     0.00927     0.00801   
 200   0.011 +/-   0.013 +/-   0.012 +/-   0.011 +/-   0.010 +/-   0.010 +/-   0.012 +/-   0.011 +/-   
         0.05310     0.03826     0.02014     0.01716     0.00940     0.00745     0.00716     0.00496   
 500   0.013 +/-   0.008 +/-   0.011 +/-   0.012 +/-   0.012 +/-   0.011 +/-   0.012 +/-   0.012 +/-   
         0.03374     0.01458     0.01471     0.01177     0.00873     0.00597     0.00509     0.00354   
1000   0.009 +/-   0.012 +/-   0.010 +/-   0.012 +/-   0.012 +/-   0.012 +/-   0.012 +/-   0.012 +/-   
         0.02291     0.02247     0.01068     0.00878     0.00639     0.00508     0.00347     0.00300   
2000   0.011 +/-   0.011 +/-   0.013 +/-   0.013 +/-   0.011 +/-   0.011 +/-   0.011 +/-   0.011 +/-   
         0.01723     0.01354     0.01446     0.01031     0.00552     0.00332     0.00275     0.00209   
This is much more like what I expect from a similarity function!

Effect of threshold on the sss score

I picked a threshold of 0.8 because that's a pretty common value and few will object to it. It might be that a different threshold is better. How does the sss score change as a function of the threshold?

One obvious implementation is to call sss() with a different threshold each time:

>>> arena = chemfp.load_fingerprints("chembl_21.rdkit2048.fpb")
>>> subset1 = make_subset(arena, 1000)
>>> subset2 = make_subset(arena, 1000)
>>> sss(subset1, subset2, threshold=0.9)
0.0014073436068272435
>>> sss(subset1, subset2, threshold=0.8)
0.006656850064948667
>>> sss(subset1, subset2, threshold=0.7)
0.038875826785241686
>>> sss(subset1, subset2, threshold=0.6)
0.2569320521578928
(As you can see, while I managed to normalize the score as a function of set size, it's still threshold dependent.)

The obvious implementation is perfectly reasonable, but there's a more computationally efficient way. You'll notice that each calculation redoes the one before it, because every pair with at least 0.9 similarity also passes the 0.8 threshold, and so on.

Chemfp has another way to get the same values. The cumulative_score_all() method takes extra arguments to select the min and max values to use, and if the summation should include or exclude the endpoints. I'll get a SearchResults instance and ask for its help:

>>> results = search.threshold_tanimoto_search_arena(
...                  subset1, subset2, threshold=0.6)
>>> 
>>> help(results.cumulative_score_all)
cumulative_score_all(self, min_score=None, max_score=None, interval='[]')
method of chemfp.search.SearchResults instance
    The sum of all scores in all rows which are between *min_score* and *max_score*
    
    Using the default parameters this returns the sum of all of
    the scores in all of the results. With a specified range this
    returns the sum of all of the scores in that range. The
    cumulative score is also known as the raw score.
    
    The default *min_score* of None is equivalent to -infinity.
    The default *max_score* of None is equivalent to +infinity.
    
    The *interval* parameter describes the interval end
    conditions. The default of "[]" uses a closed interval,
    where min_score <= score <= max_score. The interval "()"
    uses the open interval where min_score < score < max_score.
    The half-open/half-closed intervals "(]" and "[)" are
    also supported.
    
    :param min_score: the minimum score in the range.
    :type min_score: a float, or None for -infinity
    :param max_score: the maximum score in the range.
    :type max_score: a float, or None for +infinity
    :param interval: specify if the end points are open or closed.
    :type interval: one of "[]", "()", "(]", "[)"
    :returns: a floating point count
I'll use the min_score parameter. First, I'll find all fingerprints with a similarity threshold of 0.6, which is the lowest score I'm concerned about, and, to reduce complexity later on, I'll merge the "300*" and the product of the sizes into a single scaling factor:
>>> results = search.threshold_tanimoto_search_arena(
...                  subset1, subset2, threshold=0.6)
>>> scaling = 300/len(subset1)/len(subset2) 
Then I'll compute the scores for each threshold, and get the same results as earlier.
>>> scaling*results.cumulative_score_all(min_score=0.9)
0.0014073436068272435
>>> scaling*results.cumulative_score_all(min_score=0.8)
0.006656850064948667
>>> scaling*results.cumulative_score_all(min_score=0.7)
0.038875826785241686
>>> scaling*results.cumulative_score_all(min_score=0.6)
0.2569320521578928

For fun, I put it in a loop and computed more thresholds:

>>> scaling = 300/len(subset1)/len(subset2) 
>>> for threshold in (0.99, 0.95, 0.90, 0.85, 0.8, 0.75, 0.7, 0.65, 0.6):
...   score = scaling * results.cumulative_score_all(min_score=threshold)
...   print("%.2f %7.4f" % (threshold, score))
... 
0.99  0.0003
0.95  0.0006
0.90  0.0014
0.85  0.0030
0.80  0.0067
0.75  0.0166
0.70  0.0389
0.65  0.0921
0.60  0.2569
A quick test now shows that computing the similarities once and finding the cumulative sum for the 9 threshold levels is about 4x faster than doing 9 independent threshold calculations.
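
Here's a sketch of the kind of quick test I mean, reusing subset1, subset2, and sss() from above; your timings will differ:

import time

thresholds = (0.99, 0.95, 0.90, 0.85, 0.8, 0.75, 0.7, 0.65, 0.6)

# 9 independent threshold searches
t1 = time.time()
for threshold in thresholds:
    sss(subset1, subset2, threshold=threshold)
t2 = time.time()

# one search at the lowest threshold, reused for all 9 cumulative sums
results = search.threshold_tanimoto_search_arena(subset1, subset2, threshold=0.6)
scaling = 300.0 / (len(subset1) * len(subset2))  # float to avoid Python 2 integer division
for threshold in thresholds:
    scaling * results.cumulative_score_all(min_score=threshold)
t3 = time.time()

print("independent searches: %.2f s   shared search: %.2f s" % (t2 - t1, t3 - t2))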



Weininger's Realization #

Dave Weininger passed away recently. He was very well known in the chemical informatics community because of his contribution to the field and his personality. Dave and Yosi Taitz founded Daylight Chemical Information Systems to turn some of these ideas into a business, back in the 1980s. It was very profitable. (As a bit of trivia, the "Day" in "Daylight" comes from "Dave And Yosi".)

Some of the key ideas that Dave and Daylight introduced are SMILES, SMARTS, and fingerprints (both the name and the hash-based approach). Together these made for a new way to handle chemical information search, and do so in significantly less memory. The key realization, which I think led to the business success of the company, is that the cost of memory was decreasing faster than chemical information was being created. This trend, combined with the memory savings of SMILES and fingerprints, made it possible to store a corporate dataset in RAM, and do chemical searches about 10,000 times faster than the previous generation of hard-disk based tools, and do it before any competition could. I call this "Weininger's Realization". As a result, the Daylight Thor and Merlin databases, along with the chemistry toolkits, became part of the core infrastructure of many pharmaceutical companies.

I don't know if there was a specific "a-ha" moment when that realization occurred. It certainly wasn't what drove Dave to work on those ideas in the first place. He was a revolutionary, a Prometheus who wanted to take chemical information from what he derisively called 'the high priests' and bring it to the people.

An interest of mine in the last few years is to understand more about the history of chemical information. The best way I know to describe the impact of Dave and Daylight is to take some of the concepts back to the roots.

You may also be interested in reading Anthony Nicholls' description of some of the ways that Dave influenced him, and Derek Lowe's appreciation of SMILES.

Errors and Omissions

Before I get there, I want to emphasize that the success of Daylight cannot be attributed to just Dave, or Dave and Yosi. Dave's brother Art and his father Joseph were coauthors on the SMILES canonicalization paper. The company hired people to help with the development, both as employees and consultants. I don't know the details of who did what, so I will say "Dave and Daylight" and hopefully reduce the all too easy tendency to give all the credit to the most visible and charismatic person.

I'm unfortunately going to omit many parts of the Daylight technologies, like SMIRKS, where I don't know enough about the topic or its effect on cheminformatics. I'll also omit other important but invisible aspects of Daylight, like documentation or the work Craig James did to make the database servers more robust to system failures. Unfortunately, it's the jockeys and horses which attract the limelight, not those who muck the stables or shoe the horses.

Also, I wrote this essay mostly from what I have in my head and from presentations I've given, which means I've almost certainly made mistakes that could be fixed by going to my notes and primary sources. Over time I hope to spot and fix those mistakes in this essay. Please let me know of anything you want me to change or improve.

Dyson and Wiswesser notations

SMILES is a "line notation", that is, a molecular representation which can be described as a line of text. Many people reading this may have only a vague idea of the history of line notations. Without that history, it's hard to understand what helped make SMILES successful.

The original line notations were developed in the 1800s. By the late 1800s chemists began to systematize the language into what is now called the IUPAC nomenclature. For example, caffeine is "1,3,7-trimethylpurine-2,6-dione". The basics of this system are taught in high school chemistry class. It takes years of specialized training to learn how to generate the correct name for complex structures.

Chemical nomenclature helps chemists index the world's information about chemical structures. In short, if you can assign a unique name to a chemical structure (a "canonical" name), then you can use standard library science techniques to find information about the structure.

The IUPAC nomenclature was developed when books and index cards were the best way to organize data. Punched card machines brought a new way of thinking about line notations. In 1946, G. Malcolm Dyson proposed a new line notation meant for punched cards. The Dyson notation was developed as a way to mechanize the process of organizing and publishing a chemical structure index. It became a formal IUPAC notation in 1960, but was already on its last legs and dead within a few years. While it might have been useful for mechanical punched card machines, it wasn't easily repurposed for the computer needs of the 1960s. For one, it depended on superscripts and subscripts, and used characters which didn't exist on the IBM punched cards.

William J. Wiswesser in 1949 proposed the Wiswesser Line Notation, universally called WLN, which could be represented in EBCDIC and (later) ASCII in a single line of text. More importantly, unlike the Dyson notation, which follows the IUPAC nomenclature tradition of starting with the longest carbon chain, WLN focuses on functional groups, and encodes many functional groups directly as symbols.

Chemists tend to be more interested in functional groups, and want to search based on those groups. For many types of searches, WLN acts as its own screen, that is, it's possible to do some types of substructure search directly on the symbols of the WLN, without having to convert the name into a molecular structure for a full substructure search. To search for structures containing a single sulfur, look for WLNs with a single occurrence of S, but not VS or US or SU. The chemical information scientists of the 1960s and 1970s developed several hundred such clever pattern searches to make effective use of the relatively limited hardware of that era.

WLNs started to disappear in the early 1980s, before SMILES came on the scene. Wendy Warr summarized the advantages and disadvantages of WLNs in 1982. She wrote "The principle disadvantage of WLN is that it is not user friendly. This can only be overcome by programs which will derive a canonical WLN from something else (but no one has yet produced a cost-effective program to do this for over 90% of compounds), by writing programs to generate canonical connection tables from noncanonical WLNs, or by accepting the intervention of a skilled "middle man"."

Dyson/IUPAC and WLNs were just two of dozens, if not hundreds, of proposed line notations. Nearly every proposal suffered from a fatal flaw - it could not easily be automated on a computer. Most required postgraduate-level knowledge of chemistry, and were error-prone. The more rigorous proposals evaluated the number of mistakes made during data entry.

Among the few exceptions are the "parentheses free" notations from a pair of papers from 1964, one by Hiz and the other by Eisman, in the same issue of the Journal of Chemical Documentation. To modern eyes, they look very much like SMILES, but represented in a postfix notation. Indeed, the Eisman paper gives a very SMILES-like notation for a tree structure, "H(N(C(HC(CIC(N(HH)C(HN(IH)))I)H)H))", and a less SMILES-like notation for a cyclic structure, before describing how to convert them into a postfix form.

I consider the parentheses-free nomenclatures a precursor to SMILES, but they were not influential to the larger field of chemical information. I find this a bit odd, and part of my research has been to try and figure out why. It's not like it had no influence. A version of this notation was in the Chemical Information and Data System (CIDS) project in the 1960s and early 1970s. In 1965, "CIDS was the first system to demonstrate online [that is, not batch processing] searching of chemical structures", and CIDS wasn't completely unknown in the chemical information field.

But most of the field in the 1970s went for WLN for a line notation, or a canonical connection table.

SMILES

Dave did not know about the parentheses free line notations when he started work on SMILES, but he drew from similar roots in linguistics. Dave was influenced by Chomsky's writings on linguistics. Hiz, mentioned earlier, was at the Department of Linguistics at the University of Pennsylvania, and that's also where Eugene Garfield did his PhD work on the linguistics of chemical nomenclature.

Dave's interest in chemical representations started when he was a kid. His father, Joseph Weininger, was a chemist at G.E., with several patents to his name. He would draw pictures of chemical compounds for Dave, and Dave would, at a non-chemical level, describe how they were put together. These seeds grew into what became SMILES.

SMILES as we know it started when Dave was working for the EPA in Duluth. They needed to develop a database of environmental compounds, to be entered by untrained staff. (For the full story of this, read Ed Regis's book "The Info Mesa: Science, Business, and New Age Alchemy on the Santa Fe Plateau.") As I recall, SMILES was going to be the internal language, with a GUI for data entry, but it turned out that SMILES was easy enough for untrained data entry people to write directly.

And it's simple. I've taught the basics of SMILES to non-chemist programmers in a matter of minutes, while WLN, Dyson, and InChI, as examples of other line notations, are much more difficult to generate either by hand or by machine. Granted, those three notations have canonicalization rules built into them, which is part of the difficulty. Still, I asked Dave why something like SMILES didn't appear earlier, given that the underlying concepts existed in the literature by then.

He said he believes it's because the generation of people before him didn't grow up with the software development background. I think he's right. When I go to a non-chemist programmer, I say "it's a spanning tree with special rules to connect the cycles", and they understand. But that vocabulary was still new in the 1960s, and very specialized.

There's also some conservatism in how people work. Dyson defended the Dyson/IUPAC notation, saying that it was better than WLN because it was based on the longest carbon chain principle that chemists were already familiar with, even though the underlying reasons for that choice were becoming less important because of computer search. People know what works, and it's hard to change that mindset when new techniques become available.

Why the name SMILES? Dave Cosgrove told me it was because when Dave would tell people he had a new line notation, most would "groan, or worse. When he demonstrated it to them, that would rapidly turn to a broad smile as they realised how simple and powerful it is."

Exchange vs. Canonical SMILES

I not infrequently come across people who say that SMILES is a proprietary format. I disagree. I think the reason for the disagreement is that two different concepts go under the name "SMILES". SMILES is an exchange language for chemistry, and it's an identifier for chemical database search. Only the second is proprietary.

Dave wanted SMILES as a way for chemists from around the world and through time to communicate. SMILES describes a certain molecular valence model view of the chemistry. This does not need to be canonical, because you can always do that yourself once you have the information. I can specify hydrogen cyanide as "C#N", "N#C", or "[H][C]#[N]" and you will be able to know what I am talking about, without needing to consult some large IUPAC standard.

He wanted people to use SMILES that way, without restriction. The first SMILES paper describes the grammar. Later work at Daylight in the 1990s extended SMILES to handle isotopes and stereochemistry. (This was originally called "isomeric SMILES", but it's the SMILES that people think of when they want a SMILES.) Daylight published the updated grammar on their website. It was later included as part of Gasteiger's "Handbook of Chemoinformatics: From Data to Knowledge in 4 Volumes". Dave also helped people at other companies develop their own SMILES parsers.

To say that SMILES as an exchange format is proprietary is opposite to what Dave wanted and what Daylight did.

What is proprietary is the canonicalization algorithm. The second SMILES paper describes the CANGEN algorithm, although it is incomplete and doesn't actually work. Nor does it handle stereochemistry, which was added years later. Even internally at Daylight, it took many years to work out all of the bugs in the implementation.

There's a good financial reason to keep the algorithm proprietary. People were willing to pay a lot of money for a good, fast chemical database, and the canonicalization algorithm was a key part of Daylight's Thor and Merlin database servers. In business speak, this is part of Daylight's "secret sauce".

On the other hand, there's little reason why that algorithm should be published. Abstractly speaking, publishing it would mean that different tools would generate the same canonical SMILES, so a federated data search would reduce to a text search, rather than require a re-canonicalization. This is one of the goals of the InChI project, but they discovered that Google didn't index the long InChI strings in a chemically useful way. They created the InChI key as a solution. SMILES has the same problem and would need a similar solution.

Noel O'Boyle published a paper pointing out that the InChI canonicalization assignment could be used to assign the atom ordering for a SMILES string. This would give a universal SMILES that anyone could implement. There's been very little uptake of that idea, which gives a feel of how little demand there is. [Edited 20 March 2017: Noel points out that CDK uses Universal SMILES for canonical isomeric SMILES generation, so there is some uptake.]

Sometimes people also bring up the governance model when deciding if something is proprietary or not, or point to the copyright restrictions on the specification. I don't agree with these interpretations, and would gladly talk about them at a conference meeting if you're interested.

Line notations and connection tables

There are decades of debate on the advantages of line notations over connection tables, or vice versa. In short, connection tables are easy to understand and parse into an internal molecule data structure, while line notations are usually more compact and can be printed on a single line. And in either case, at some point you need to turn the text representation into a data structure and treat it as a chemical compound rather than a string.

Line notations are a sort of intellectual challenge. This alone seems to justify some of the many papers proposing a new line notation. By comparison, Open Babel alone supports over 160 connection table formats, and there are untold more in-house or internal formats. Very few of these formats have ever been published, except perhaps in an appendix in a manual.

Programmers like simple formats because they are easy to parse, often easy to parse quickly, and easy to maintain. Going back to the Warr quote earlier, it's hard to parse WLN efficiently.

On the other hand, line notations fit better with text-oriented systems. Back in the 1960s and 1970s, ISI (the Institute for Scientific Information) indexed a large subset of the chemistry literature and distributed the WLNs as paper publications, in a permuted table to help chemists search the publication by hand. ISI was a big proponent of WLN. And of course it was easy to put a WLN on a punched card and search it mechanically, without an expensive computer.

Even now, a lot of people use Excel or Spotfire to display their tabular data. It's very convenient to store the SMILES as a "word" in a text cell.

Line notations also tend to be smaller than connection tables. As an example, the connection table lines from the PubChem SD files (excluding the tag data) average about 4K per record. The PUBCHEM_OPENEYE_ISO_SMILES tag values average about 60 bytes in length.

Don't take the factor of 70 as being all that meaningful. The molfile format is not particularly compact, PubChem includes a bunch of "0" entries which could be omitted, and the molfile stores the coordinates, which the SMILES does not. The CAS search system in the late 1980s used about 256 bytes for each compact connection table, which is still 4x larger than the equivalent SMILES.

Dave is right. SMILES, unlike most earlier line notations, really is built with computer parsing in mind. Its context-free grammar is easy to parse using a simple stack, though still not as easy as a connection table. It doesn't require much in the way of lookup tables or state information. There's also a pretty natural mapping from the SMILES to the corresponding topology.
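
To make "simple stack" concrete, here's a toy parser of my own (not Daylight's code, and far from a real SMILES parser) which handles only unbracketed organic-subset atoms, branches, and single-digit ring closures, and extracts the bare topology. The only interesting parser state is the stack of open branches:

def parse_smiles(smiles):
    # Toy topology parser: organic-subset atoms, branches, and
    # single-digit ring closures only. No bracket atoms, no "%nn"
    # ring closures, no dots, and bond orders are ignored.
    atoms = []    # element symbol for each atom index
    bonds = []    # (from_atom, to_atom) pairs
    prev = None   # index of the most recently added atom
    stack = []    # saved 'prev' values for open branches
    rings = {}    # ring-closure digit -> atom index waiting for a partner
    i = 0
    while i < len(smiles):
        c = smiles[i]
        if c == "(":
            stack.append(prev)
        elif c == ")":
            prev = stack.pop()
        elif c.isdigit():
            if c in rings:
                bonds.append((rings.pop(c), prev))
            else:
                rings[c] = prev
        elif c in "-=#:/\\":
            pass  # skip bond symbols; this sketch only records connectivity
        else:
            symbol = smiles[i:i+2] if smiles[i:i+2] in ("Cl", "Br") else c
            i += len(symbol) - 1
            atoms.append(symbol)
            if prev is not None:
                bonds.append((prev, len(atoms) - 1))
            prev = len(atoms) - 1
        i += 1
    return atoms, bonds

For aspirin, written as "O=C(Oc1ccccc1C(=O)O)C", parse_smiles() returns 13 atoms and 13 bonds, matching aspirin's 13 heavy atoms and single ring.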

What happens if you had a really fast SMILES parser? As a thought experiment which doesn't reflect real hardware, suppose you could convert 60 bytes of SMILES string to a molecule data structure faster than you could read the additional 400 bytes of connection table data. (Let's say the 10 GHz CPU is connected to the data through a 2400 baud modem.) Then clearly it's best to use a SMILES, even if it takes longer to process.

A goal for the Daylight toolkit was to make SMILES parsing so fast that there was no reason to store structures in a special internal binary format or data structure. Instead, when it's needed, parse the SMILES into a molecule object, use the molecule, and throw it away.

On the topic of muck and horseshoes, as I recall Daylight hired an outside company at one point to go through the code and optimize it for performance.

SMARTS

SMARTS is a natural recasting and extension of the SMILES grammar to define a related grammar for substructure search.

I started in chemical information in the late 1990s, with the Daylight toolkit and a background which included university courses on computer grammars like regular expressions. The analogy that SMARTS is to SMILES as regular expressions are to strings seemed obvious, and I modeled my PyDaylight API on the equivalent Python regular expression API.

Only years later did I start to get interested in the history of chemical information, though I've only gotten up to the early 1970s so there's a big gap that I'm missing. Clearly there were molecular query representations before SMARTS. What I haven't found is a query line notation, much less one implemented in multiple systems. This is a topic I need to research more.

The term "fingerprint"

Fingerprints are a core part of modern cheminformatics. I was therefore surprised to discover that Daylight introduced the term "fingerprint" to the field, around 1990 or so.

The concept existed before then. Adamson and Bush did some of the initial work in using fingerprint similarity as a proxy for molecular similarity in 1973, and Willett and Winterman's 1986 papers [1, 2] (the latter also with Bawden) reevaluated the earlier work and informed the world of the effectiveness of the Tanimoto. (We call it "Tanimoto" instead of "Jaccard" precisely because of those Sheffield papers.)

But up until the early 1990s, published papers referred to fingerprints as the "screening set" or "bit screens", terms which describe the source of the fingerprint data, and didn't reify them into an independent concept. The very first papers which used "fingerprint" were by Yvonne Martin, an early Daylight user at Abbott, and John Barnard, who used "fingerprint", in quotes, in reference specifically to Daylight technology.

I spent a while trying to figure out the etymology of the term. I asked Dave about it, but it isn't the sort of topic which interests him, and he didn't remember. "Fingerprint" already existed in chemistry, for IR spectra, and the methods for matching spectra are somewhat similar to those of cheminformatics fingerprint similarity, but not enough for me to be happy with the connection. Early in his career Dave wrote software for a mass spectrometer, so I'm also not rejecting the possibility.

The term "fingerprint" was also used in cryptographic hash functions, like "Fingerprinting by Random Polynomial" by Rabin (1981). However, these fingerprints can only be used to test if two fingerprints are identical. They are specifically designed to make it hard to use the fingerprints to test for similarity of the source data.

I've also found many papers talking about image fingerprints or audio fingerprints which can be used for both identity and similarity testing; so-called "perceptual hashes". However, their use of "fingerprint" seems to have started a good decade after Daylight popularized it in cheminformatics.

Hash fingerprints

Daylight needed a new name for fingerprints because they used a new approach to screening.

Fingerprint-like molecular descriptions go back to at least the Frear code of the 1940s. Nearly all of the work in the intervening 45 years was focused on finding fragments, or fragment patterns, which would improve substructure screens.

Screen selection was driven almost entirely by economics. Screens are cheap, with data storage as the primary cost. Atom-by-atom matching, on the other hand, had a very expensive CPU cost. The more effective the screens, the better the screenout, the lower the overall cost for an exact substructure search.

The best screen would have around 50% selection/rejection on each bit, with no correlation between the bits. If that could exist, then an effective screen for 1 million structures would need only 20 bits (since 2^20 is about a million). It doesn't exist, because few fragments meet those criteria. The Sheffield group in the early 1970s (who often quoted Mooers as the source of the observation) looked instead at more generic fragment descriptions, rather than specific patterns. This approach was further refined by BASIC in Basel and then at CAS to become the CAS Online screens. These are likely the pinnacle of 1970s screen development.

Even then, it had about 4000 patterns assigned to 2048 bits. (Multiple rare fragments can be assigned to the same bit with little reduction in selectivity.)

A problem with a fragment dictionary is that it can't take advantage of unknown fragments. Suppose your fragment dictionary is optimized for pharmaceutical compounds, then someone does a search for plutonium. If there isn't an existing fragment definition like "transuranic element" or "unusual atom", then the screen will not be able to reject any structures. Instead, it will slowly go through the entire data set only to return no matches.

This specific problem is well known, and the reason for the "OTHER" bit of the MACCS keys. However, other types of queries may still have an excessive number of false positives during screening.

Daylight's new approach was the enumeration-based hash fingerprint. Enumerate all subgraphs of a certain size and type (traditionally all linear paths with up to 7 atoms), choose a canonical order, and use the atom and bond types in that order to generate a hash value. Use this value to seed a pseudo-random number generator, then generate a few values to set bits in the fingerprint; the specific number depends on the size of the subgraph. (The details of the hash function and the number of bits to set were also part of Daylight's "secret sauce.")
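
Here's a rough sketch of that idea in Python with RDKit, just to make the moving parts visible. The path enumeration, the "canonical order" (I simply take the lexicographically smaller direction), the hash function, and the fixed two bits per path are all my stand-ins; the real Daylight choices, as noted, were part of the secret sauce.

import hashlib
import random
from rdkit import Chem

def linear_paths(mol, max_len=7):
    # Enumerate every simple linear atom path of 1..max_len atoms.
    # Each path is found in both directions; the invariant below
    # makes the direction irrelevant, so both hash to the same bits.
    paths = []
    def extend(path):
        paths.append(tuple(path))
        if len(path) == max_len:
            return
        for nbr in mol.GetAtomWithIdx(path[-1]).GetNeighbors():
            if nbr.GetIdx() not in path:
                extend(path + [nbr.GetIdx()])
    for atom in mol.GetAtoms():
        extend([atom.GetIdx()])
    return paths

def path_invariant(mol, path):
    # Describe a path by its atom symbols and bond types, using the
    # lexicographically smaller direction as the "canonical" order.
    def describe(p):
        terms = [mol.GetAtomWithIdx(p[0]).GetSymbol()]
        for a1, a2 in zip(p, p[1:]):
            terms.append(str(mol.GetBondBetweenAtoms(a1, a2).GetBondType()))
            terms.append(mol.GetAtomWithIdx(a2).GetSymbol())
        return "|".join(terms)
    return min(describe(list(path)), describe(list(path)[::-1]))

def hash_fingerprint(mol, num_bits=1024, bits_per_path=2):
    fp = [0] * num_bits
    for path in linear_paths(mol):
        # Hash the canonical path description, seed a PRNG with it,
        # and let the PRNG pick which bits to set.
        invariant = path_invariant(mol, path)
        seed = int(hashlib.md5(invariant.encode("utf8")).hexdigest(), 16)
        rng = random.Random(seed)
        for _ in range(bits_per_path):
            fp[rng.randrange(num_bits)] = 1
    return fp

For example, sum(hash_fingerprint(Chem.MolFromSmiles("c1ccccc1O"))) reports how many bits are set for phenol. The key property is that no fragment dictionary is involved: any subgraph, even a plutonium-containing one, contributes bits.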

Information theory was not new in chemical information. Calvin Mooers developed superimposed coding back in the 1940s in part to improve the information density of chemical information on punched cards (and later complained about how computer scientists rediscovered it as hash tables). The Sheffield group also used information theory to guide their understanding of the screen selection problem. Feldman and Hodes in the 1970s developed screens by the full enumeration of common subgraphs of the target set and a variant of Mooers' superimposed coding.

But Daylight was able to combine information theory and computer science theory (i.e., hash tables) to develop a fingerprint generation technique which was completely new. And I do mean completely new.

Remember how I mentioned there are SMILES-like line notations in the literature, even if people never really used them? I've looked hard, and only with a large stretch of optimism can I find anything like the Daylight fingerprints before Daylight, and mostly as handwaving proposals for what might be possible. Nowadays, almost every fingerprint is based on a variation of the hash approach, rather than a fragment dictionary.

In addition, because of the higher information density, Daylight fingerprints were effective as both a substructure screen and a similarity fingerprint using only 1024 bits, instead of the 2048 bits of the CAS Online screen. This will be important in the next section.

Chemical data vs. compute power and storage size

CAS had 4 million chemical records in 1968. The only cost-effective way to store that data was on tape. Companies could and did order data tapes from CAS for use on their own corporate computers.

Software is designed for the hardware, so the early systems were built first for tape drives and then, as they became affordable, for the random-access capabilities of hard disks. A substructure search would first check against the screens to reject obvious mismatches, then for each of the remaining candidates, read the corresponding record off disk and do the full atom-by-atom match.

Apparently Derek Price came up with the "law of exponential increase" in "Science Since Babylon" (1961), which describes how science information has exponential growth. I've only heard about that second hand. Chemical data is no exception, and its exponential growth was noted, I believe, in the 1950s by Perry.

In their 1971 textbook, Lynch et al. observed that the doubling period was about 12 years. I've recalculated that number over a longer baseline, and it still holds. CAS had 4 million structures in 1968 and 100 million structures in 2015, which is a doubling every 10-11 years.
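
The back-of-the-envelope arithmetic, using just those two data points:

import math

# 4 million structures in 1968, 100 million in 2015.
years = 2015 - 1968
growth = 100e6 / 4e6
print(years * math.log(2) / math.log(growth))  # about 10.1 years per doubling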

On the other hand, computers have gotten more powerful at a faster rate. For decades the effective computing power doubled every 2 years, and the amount of RAM and data storage for constant dollars has doubled even faster than that.

In retrospect it's clear that at some point it would be possible to store all of the world's chemistry data, or at least a corporate database, in memory.

Weininger's Realization

Disks are slow. Remember how the atom-by-atom search needed to pull a record off the disk? That means the computer needs to move the disk arm to the right spot and wait for the data to come by, while the CPU simply waits. If the data were in RAM then it would be 10,000x faster to fetch a randomly chosen record.

Putting it all in RAM sounds like the right solution, but in the early 1980s memory was something like $2,000 per MB while hard disk space was about $300 per MB. One million compounds with 256 bytes per connection table and 256 bytes per screen requires almost 500MB of space. No wonder people kept the data on disk.
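
A rough sketch of that arithmetic, using the prices just mentioned (actual configurations and prices varied, of course):

# One million compounds, 256 bytes of connection table plus
# 256 bytes of screens each, at early-1980s storage prices.
n_structures = 1000 * 1000
bytes_each = 256 + 256
megabytes = n_structures * bytes_each / float(2**20)  # just under 500 MB
print(megabytes * 2000)  # in RAM at ~$2,000/MB: roughly $1,000,000
print(megabytes * 300)   # on disk at ~$300/MB: roughly $150,000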

By 1990, RAM was about $80/MB while hard disk storage was $4/MB, while the amount of chemistry data had only doubled.

Dave, or at least someone at Daylight, must have realized that the two different exponential growth rates make for a game changer, and that the Daylight approach would give them a head start over the established vendors. This is explicit in the Daylight documentation for Merlin:

The basic idea behind Merlin is that data in a computer's main memory can be manipulated roughly five orders of magnitude faster than data on its disk. Throughout the history of computers, there has been a price-capacity-speed tradeoff for data storage: Large-capacity storage (tapes, drums, disks, CD-ROMS) is affordable but slow; high-speed storage ("core", RAM) is expensive but fast. Until recently, high-speed memory was so costly that even a modest amount of chemical information had to be stored on tapes or disks.

But technology has a way of overwhelming problems like this. The amount of chemical information is growing at an alarming rate, but the size of computer memories is growing even faster: at an exponential rate. In the mid-1980's it became possible for a moderately large minicomputer to fit a chemical database of several tens of thousands of structures into its memory. By the early 1990's, a desktop "workstation" could be purchased that could hold all of the known chemicals in the world (ca. 15 million structures) in its memory, along with a bit of information about each.

On the surface, in-memory operations seem like a straightforward good deal: A computer's memory is typically 105 times faster than its disk, so everything you could do on disk is 100000 times faster when you do it in memory. But these simple numbers, while impressive, don't capture the real differences between disk- and memory-based searches:

Scaling down

I pointed out earlier that SMILES and fingerprints take up less space. I estimate it was 1/3 the space of what CAS needed, which is the only comparison I've been able to figure out. That let Daylight scale up to larger data sets for a given price, but also scale down to smaller hardware.

Let's say you had 250,000 structures in the early 1990s. With the Daylight system you would need just under 128 MB of RAM, which meant you could buy a Sun 3, which maxed out at 128 MB, instead of a more expensive computer.

It still requires a lot of RAM, and that's where Yosi comes in. His background was in hardware sales, and he knew how to get a computer with a lot of RAM in it. Once the system was ready, Dave and his brother Art put it in the back of a van and went around the country giving demos to potential customers, often to much astonishment that it could be so fast.

I think the price of RAM was the most important hardware factor to the success of Daylight, but it's not the only one. When I presented some of these ideas at Novartis in 2015, Bernhard Rohde correctly pointed out that decreasing price of hardware also meant that computers were no longer big purchase items bought and managed by IT, but something that even individual researchers could buy. That's another aspect of scaling down.

While Daylight did sell to corporate IT, their heart was in providing tools and especially toolkits to the programmer-chemists who would further develop solutions for their company.

Success and competitors

By about 1990, Daylight was a market success. I have no real idea how much profit the company made, but it was enough that Dave bought his own planes, including a fighter jet. When I was at the Daylight Summer School in 1998, the students over dinner came up with a minimum of $15 million in income, and a maximum of $2 million in expenses.

It was also a scientific success, measured by the number of people talking about SMILES and fingerprints in the literature.

I am not a market analyst so I can't give that context; I'm more of a scholar. I've been reading through the old issues of JCICS (now titled JCIM) trying to identify the breakthrough transition for Daylight. There is no bright line, but there are tantalizing hints between the lines.

In 1992 or so (I'll have to track it down), there's a review of a database vendor's product. The reviewer mentions that the vendor plans to have an in-memory database the next year. I can't help but interpret it as a competitor responding to the new Daylight system, and having to deal with customers who now understood Weininger's Realization.

Dave is a revolutionary

The decreasing price of RAM and hardware may help explain Daylight's market success, but Dave wasn't driven by trying to be the next big company. You can see that in how the company acted. Before the Oracle cartridge, they catered more towards the programmer-chemist. They sold VMS and then Unix database servers and toolkits, with somewhat primitive database clients written using the XView widget toolkit for X. I remember Dave once saying that the clients were meant as examples of what users could do, rather than as complete applications. A different sort of company would have developed Windows clients and servers, more tools for non-programmer chemists, and focused more on selling enterprise solutions to IT and middle managers.

A different sort of company would have tried to be the next MDL. Dave didn't think they were having fun at MDL, so why would he want to be like them?

Dave was driven by the idea of bringing chemical information away from the "high priests" who held the secret knowledge of how to make things work. Look at SMILES - Dyson and WLN required extensive training, while SMILES could be taught to non-chemists in an hour or so. Look at fingerprints - the CAS Online screens were the result of years of research in picking out just the right fragments, based on close analysis of the types of queries people do, while the hash fingerprints can be implemented in a day. Look even at the Daylight depictions, which were well known as being ugly. But Dave liked to point out that the code, at least originally, needed only 4K. That's the sort of code a non-priest could understand, and the sort of justification a revolutionary could appreciate.

Dave is a successful revolutionary, which is rare. SMILES, SMARTS and fingerprints are part of the way we think about modern cheminformatics. Innumerable tools implement them, or variations of them.

High priests of chemical information

Revolutionary zeal is powerful. I remember hearing Dave's "high priests" talk back in 1998 and feeling empowered: yes, even as a new person in the field, cheminformatics was something I could take on on my own.

As I learn more about the history of the field, I've also learned that Dave's view is not that uncommon. In the post-war era the new field of information retrieval wanted to challenge the high priests of library science. (Unfortunately I can't find that reference now.)

Michael Lynch would have been the high priest of chemical information if there ever was one. Yet at ICCS 1988 he comments "I can recollect very little sense, in 1973, that this revolution was imminent. Georges Anderla .. noted that the future impact of very large scale integration (VLSI) was evident only to a very few at that time, so that he quite properly based his projections on the characteristics of the mainframe and minicomputer types then extant. As a result, he noted, he quite failed to see, first, that the PC would result in expertise becoming vastly more widely disseminated, with power passing out of the hands of the small priesthood of computer experts, thus tapping a huge reservoir of innovative thinking, and, second, that the workstation, rather than the dumb terminal, would become standard."

A few years ago I talked with Egon Willighagen. He is one of the CDK developers and an advocate of free and open source software for chemical information. He also used the metaphor of taking information from the high priests to the people, but in his case he meant the previous generation of closed commercial tools, like the Daylight toolkit.

Indeed, one way to think of it is that Dave the firebrand became the high priest of Daylight, where only the priests of Daylight control the secret knowledge of fingerprint generation and canonicalization.

That's why I no longer like the metaphor. Lynch and the Sheffield group published many books and papers, including multiple textbooks on how to work with chemical information. Dave and Daylight did a lot of work to disseminate the Daylight way of thinking about cheminformatics. These are not high priests hoarding occult knowledge, but humans trying to do the right thing in an imperfect world.

There's also danger in the metaphor. Firebrand revolutionaries don't tend to know history. Perhaps some of the temple should be saved? At the very least there might be bad feelings if you declare your ideas revolutionary only to find out that not only are they not new, but you are talking to the previous revolutionary who proposed them.

John Barnard told me a story of Dave and Lynch meeting at ICCS in 1988, I believe. Dave explained how his fingerprint algorithm worked. Lynch commented something like "so it's like Calvin Mooers' superimposed coding"? Lynch knew his history, and he was correct - fingerprints and superimposed coding are related, though not the same. Dave did not know the history or how to answer the question.

My view has its own danger. With 75 years of chemical information history, one might feel a paralysis of not doing something out of worry that it's been done before and you just don't know about it.

Leaving Daylight

In the early 2000s Dave became less interested in chemical information. He had grown increasingly disenchanted with the ethics of how pharmaceutical companies work and do research. Those who swear allegiance to money would have no problem just making a sale and taking the profit, but he was more interested in ideas and people. He had plenty of money.

He was concerned about the ethics of how people used his work, though I don't know how big a role that was overall. Early on, when Daylight was still located at Pomona, they got a purchase request from the US military. He didn't want the Daylight tools to be used to make chemical weapons, and put off fulfilling the sale. The military group that wanted the software contacted him and pointed out that they were actually going to use the tools to develop defenses against chemical weapons, which helped change his mind.

I think the Daylight Oracle cartridge marked the big transition for Daylight. The 1990s was the end of custom domain-specific databases like Thor and Merlin and the rise of object-relational databases with domain-specific extensions. Norah MacCuish (then Shemetulskis) helped show how the Daylight functionality could work as an Informix DataBlade. Oracle then developed their data cartridge, and most of the industry, including the corporate IT that used the Daylight servers, followed suit.

Most of Daylight's income came from the databases, not the toolkit. Companies would rather use widely-used technology because it's easier to find people with the skills to use it, and there can be better integration with other tools. If Daylight didn't switch, then companies would turn to competing products which, while perhaps not as elegant, were a better fit for IT needs. Daylight did switch, and the Daylight user group meetings (known as "MUG" because it started off as the Medchem User Group) started being filled by DBAs and IT support, not the geek programmer-chemists who were interested in linguistics and information theory and the other topics that excited Dave.

It didn't help that the Daylight tools were somewhat insular and static. The toolkit didn't have an officially supported way to import SD files, though there was a user-contributed package for that, which Daylight staff did maintain. Ragu Bharadwaj developed Daylight's JavaGrins chemical sketcher, which was a first step towards a more GUI-centered Daylight system, but also the only step. The RDKit, as an example of a modern cheminformatics toolkit, includes algorithms for Murcko scaffolds, matched molecular pairs, and maximum common substructure, and continues to get new ones. But Daylight didn't go that route either.

What I think it comes down to is that Dave was really interested in databases and compound collections. Daylight was also a reseller of commercial chemical databases, and Dave enjoyed looking through the different sets. He told me there was a different feel to the chemistry done in different countries, and he could spot that by looking at the data. He was less interested in the other parts of cheminformatics, and as the importance of old-school "chemical information" diminished in the newly renamed field of "chem(o)informatics", so too did his interest, as did the interest from Daylight customers.

Post-Daylight

Dave left chemical information and switched to other topics. He tried to make theobromine-free chocolate, for reasons I don't really understand, though I see now that many people buy carob as a chocolate alternative because it's theobromine free and thus stimulant free. He was also interested in binaural recording and hearing in general. He bought the house next door to turn it into a binaural recording studio.

He became very big into solar power, as a way to help improve the world. He bought 12 solar panels, which he called The Twelve Muses, from a German company and installed them at his houses. These were sun trackers, to maximize the power. Now, Santa Fe is at 2,300 m/7,000 ft. elevation, in a dry, near-desert environment. He managed to overload the transformers because the panels produced a lot more power than the German manufacturer expected. Once that was fixed, both houses were solar powered, plus he had several electric cars and motorcycles, and could feed power back into the grid for profit. He provided a recharging station for the homebrew electric car community who wanted to drive from, say, Albuquerque to Los Alamos. (This was years before Teslas were on the market.) Because he had his own DC power source, he was also able to provide a balanced power system to his recording studio and minimize power hum noise.

He tried to persuade the state of New Mexico to invest more in solar power. It makes sense, and he showed that it was possible. But while he was still a revolutionary, he was not a politician, and wasn't able to make the change he wanted.

When I last saw him in late 2015, he was making a cookbook of New Mexico desserts.

Dave will always have a place in my heart.

Andrew Dalke
Trollhättan, Sweden
2 December 2016

Changes

20 March 2017: Noel O'Boyle pointed out that CDK uses Universal SMILES for canonical isomeric SMILES generation; I added a note about that in the "Exchange vs. Canonical SMILES" section.

Fragment by copy and trim #

This is part of a series of essays on how to fragment a molecular graph using RDKit. These are meant to describe the low-level steps that go into fragmentation, and the ways to think about testing. To fragment a molecule in RDKit, use FragmentOnBonds(). The essays in this series are:

Why another fragmentation method?

How do you tell if an algorithm is correct? Sometimes you can inspect the code. More often you have test cases where you know the correct answer. For complex algorithms, especially when using real-world data, testing is even harder. It's not always clear where the edge cases are, and the real world is often more complicated than you think.

Another validation solution is to write two different algorithms which accomplish the same thing, and compare them with each other. I did that in my essay about different ways to evaluate the parity of a permutation order. That cross-validation testing was good enough, because it's easy to compute all possible input orders.

In the previous essay in this series, I cross-validated that fragment_chiral() and fragment_on_bonds() give the same results. They did. That isn't surprising because they implement essentially the same algorithm. Not only that, but I looked at the FragmentOnBonds() implementation when I found out my first fragment_chiral() didn't give the same answers. (I didn't know I needed to call ClearComputedProps() after editing the structure.)

Cross-validation works better when the two algorithms are less similar. The way you think about a problem and the way you implement a solution are connected. Two similar solutions may be the result of thinking about things the same way, and come with the same blind spots. (This could also be a design goal, if you want the new implementation to be "bug compatible" with the old.)

I've made enough hints that you likely know what I'm leading up to. RDKit's FragmentOnBonds() code currently (in mid-2016) converts directional bonds (the "/" and "\" bonds used for E-Z stereochemistry around double bonds) into non-directional single bonds.

>>> from rdkit import Chem
>>> mol = Chem.MolFromSmiles("F/C=C/F")
>>> fragmented_mol = Chem.FragmentOnBonds(mol, [0], dummyLabels=[(0,0)])
>>> Chem.MolToSmiles(fragmented_mol, isomericSmiles=True)
'[*]C=CF.[*]F'
when I expected the last line to be
'[*]/C=C/F.[*]F'

This may or may not be what you want. Just in case, I've submitted a patch to discuss changing it in RDKit, so your experience in the future might be different than mine in the past.

I bring it up now to show how cross-validation of similar algorithms isn't always enough to figure out if the algorithms do as you expect.

Validate through re-assembly

As a reminder, in an earlier essay I did a much stronger validation test. I took the fragments, reassembled them via SMILES syntax manipulation, and compared the canonicalized result to the canonicalized input structure. These should always match, to within limits of the underlying chemistry support.

They didn't, because my original code didn't support directional bonds. Instead, in that essay I pre-processed the input SMILES string to replace the "/" and "\" characters with a "-". That gives the equivalent chemistry without directional bonds.

I could use that validation technique again, but I want to explore more of what it means to cross-validate by using a different fragmentation implementation.

Fragment by 'copy and trim'

Dave Cosgrove, on the RDKit mailing list, described the solution he uses to cut a bond and preserve chirality in toolkits which don't provide a high-level equivalent to FragmentOnBonds(). This method only works on non-ring bonds, which isn't a problem for me as I want to fragment R-groups. (As I discovered while doing the final testing, my implementation of the method also assumes there is only one molecule in the structure.)

To make a fragment from a molecule, trim away all the atoms which aren't part of the fragment, except for the atom connected to the fragment. Convert that atom into the wildcard atom "*" by setting its atomic number, charge, and number of hydrogens to 0, and setting the chirality and aromaticity flags correctly. The image below shows the steps applied to cutting the ester off of aspirin:

steps to trim a fragment of aspirin

This method never touches the bonds connected to a chiral atom of a fragment, so chirality or other bond properties (like bond direction!) aren't accidentally removed by breaking the bond and making a new one in its place.

A downside is that I need to copy and trim twice in order to get both fragments after cutting a chain bond. I can predict now the performance won't be as fast as FragmentOnBonds().

Let's try it out.

Trim using the interactive shell

I'll use aspirin as my reference structure and cut the bond to the ester (the R-OCOCH3), which is the bond between atoms 2 and 3.

aspirin with atoms labeled by index
I want the result to be the same as using FragmentOnBonds() to cut that bond:
>>> from rdkit import Chem
>>> aspirin = Chem.MolFromSmiles("O=C(Oc1ccccc1C(=O)O)C")
>>> bond_idx = aspirin.GetBondBetweenAtoms(2, 3).GetIdx()
>>> bond_idx
2
>>> fragmented_mol = Chem.FragmentOnBonds(aspirin, [bond_idx], dummyLabels=[(0,0)])
>>> Chem.MolToSmiles(fragmented_mol, isomericSmiles=True)
'[*]OC(C)=O.[*]c1ccccc1C(=O)O'

I'll need a way to find all of the atoms which are on the ester side of the bond. I don't think there's a toolkit function to get all the atoms on one side of a bond, so I'll write one myself.

Atom 2 is the connection point on the ester to the rest of the aspirin structure. I'll start with that atom, show that it's an oxygen, and get information about its bonds:

>>> atom = aspirin.GetAtomWithIdx(2)
>>> atom.GetSymbol()
'O'
>>> [bond.GetBondType() for bond in atom.GetBonds()]
[rdkit.Chem.rdchem.BondType.SINGLE, rdkit.Chem.rdchem.BondType.SINGLE]
I'll use the bond object to get to the atom on the other side of the bond from atom 2, and report more detailed information about the atom at the other end of each bond:
>>> for bond in atom.GetBonds():
...   other_atom = bond.GetOtherAtom(atom)
...   print(other_atom.GetIdx(), bond.GetBondType(), other_atom.GetSymbol(), other_atom.GetIsAromatic())
... 
1 SINGLE C False
3 SINGLE C True
This says that atom 1 is an aliphatic carbon, and atom 3 is an aromatic carbon. This matches the structure diagram I used earlier.

I know that atom 3 is the other end of the bond which was cut, so I can stop there. What about atom 1? What is it connected to?

>>> next_atom = aspirin.GetAtomWithIdx(1)
>>> [b.GetOtherAtom(next_atom).GetIdx() for b in next_atom.GetBonds()]
[0, 2, 12]
I've already processed atom 2, so only atoms 0 and 12 are new.

What are those atoms connected to?

>>> next_atom = aspirin.GetAtomWithIdx(0)
>>> [b.GetOtherAtom(next_atom).GetIdx() for b in next_atom.GetBonds()]
[1]
>>> next_atom = aspirin.GetAtomWithIdx(12)
>>> [b.GetOtherAtom(next_atom).GetIdx() for b in next_atom.GetBonds()]
[1]
Only atom 1, which I've already seen before so I've found all of the atoms on the ester side of the bond.

Semi-automated depth-first search

There are more atoms on the other side of the bond, so I'll have to automate it somewhat. This sort of graph search is often implemented as a depth-first search (DFS) or a breadth-first search (BFS). Python lists easily work as stacks, which makes DFS slightly more natural to implement.
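
As a reminder of why a Python list nudges you towards DFS (plain Python, nothing RDKit-specific):
>>> stack = ["a", "b", "c"]
>>> stack.pop()           # a list pops from the end: last in, first out (DFS)
'c'
>>> from collections import deque
>>> queue = deque(["a", "b", "c"])
>>> queue.popleft()       # a deque pops from the front: first in, first out (BFS)
'a'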

To give you an idea of what I'm about to explain, here's an animation of the aspirin depth-first search:

depth-first search of an aspirin fragment

If there is the chance of a ring (called a "cycle" in graph theory), then there are two ways to get to the same atom. To prevent processing the same atom multiple times, I'll set up a set named "seen_ids", which contains the atom indices I've seen before. Since this is the start, I've only seen the two atoms which are at the ends of the bond to cut.

>>> seen_ids = set([2, 3])
I also store a list of the atoms which are on this side of the bond, which at this point is only atom 3, and a stack (a Python list) of the atoms I need to process further:
>>> atom_ids = [3]
>>> stack = [aspirin.GetAtomWithIdx(3)]
I'll start by getting the top of the stack (the last element of the list) and looking at its neighbors:
>>> atom = stack.pop()
>>> neighbor_ids = [b.GetOtherAtom(atom).GetIdx() for b in atom.GetBonds()]
>>> neighbor_ids
[2, 4, 8]
I'll need to filter out the neighbors I've already seen. I'll write a helper function for that. I'll need both the atom objects and the atom ids, so this will return two lists, one for each:
def get_atoms_to_visit(atom, seen_ids):
    neighbor_ids = []
    neighbor_atoms = []
    for bond in atom.GetBonds():
        neighbor_atom = bond.GetOtherAtom(atom)
        neighbor_id = neighbor_atom.GetIdx()
        if neighbor_id not in seen_ids:
            neighbor_ids.append(neighbor_id) 
            neighbor_atoms.append(neighbor_atom)
    
    return neighbor_ids, neighbor_atoms
and use it:
>>> atom_ids_to_visit, atoms_to_visit = get_atoms_to_visit(atom, seen_ids)
>>> atom_ids_to_visit
[4, 8]
>>> [a.GetSymbol() for a in atoms_to_visit]
['C', 'C']
These atom ids are connected to the original atom, so add them all to atom_ids.
>>> atom_ids.extend(atom_ids_to_visit) 
>>> atom_ids
[3, 4, 8]
These haven't been seen before, so add the atoms (not the atom ids) to the stack of items to process:
>>> stack.extend(atoms_to_visit)
>>> stack
[<rdkit.Chem.rdchem.Atom object at 0x111bf37b0>, <rdkit.Chem.rdchem.Atom object at 0x111bf3760>]
>>> [a.GetIdx() for a in stack]
[4, 8]
Now that they've been seen, I don't need to process them again, so add the new ids to the set of seen ids:
>>> seen_ids.update(atom_ids_to_visit)
>>> seen_ids
{8, 2, 3, 4}

That's the basic loop. The stack isn't empty, so pop the top of the stack to get a new atom object, and repeat the earlier steps:

>>> atom = stack.pop()
>>> atom.GetIdx()
8
>>> atom_ids_to_visit, atoms_to_visit = get_atoms_to_visit(atom, seen_ids)
>>> atom_ids_to_visit
[7, 9]
>>> atom_ids.extend(atom_ids_to_visit) 
>>> atom_ids 
[3, 4, 8, 7, 9]
>>> stack.extend(atoms_to_visit)
>>> [a.GetIdx() for a in stack]
[4, 7, 9]
>>> seen_ids.update(atom_ids_to_visit)
>>> seen_ids
{2, 3, 4, 7, 8, 9}
Then another loop. I'll stick it in a while loop to process the stack until it's empty, and I'll only have it print the new atom ids added to the list:
>>> while stack:
...     atom = stack.pop()
...     atom_ids_to_visit, atoms_to_visit = get_atoms_to_visit(atom, seen_ids)
...     atom_ids.extend(atom_ids_to_visit)
...     stack.extend(atoms_to_visit)
...     seen_ids.update(atom_ids_to_visit)
...     print("added atoms", atom_ids_to_visit)
... 
added atoms [10, 11]
added atoms []
added atoms []
added atoms [6]
added atoms [5]
added atoms []
added atoms []
The final set of atoms is:
>>> atom_ids
[3, 4, 8, 7, 9, 10, 11, 6, 5]

Fully automated - fragment_trim()

In this section I'll put all the pieces together to make fragment_trim(), a function which implements the trim algorithm.

Find atoms to delete

The trim algorithm needs to know which atoms to delete. That will be all of the atoms on one side of the bond, except for the atom which is at the end of the bond. (I'll turn that atom into a "*" atom.) I'll use the above code to get the atom ids to delete:

# Look for neighbor atoms of 'atom' which haven't been seen before  
def get_atoms_to_visit(atom, seen_ids):
    neighbor_ids = []
    neighbor_atoms = []
    for bond in atom.GetBonds():
        neighbor_atom = bond.GetOtherAtom(atom)
        neighbor_id = neighbor_atom.GetIdx()
        if neighbor_id not in seen_ids:
            neighbor_ids.append(neighbor_id) 
            neighbor_atoms.append(neighbor_atom)
    
    return neighbor_ids, neighbor_atoms

# Find all of the atoms connected to 'start_atom_id', except for 'ignore_atom_id'
def find_atom_ids_to_delete(mol, start_atom_id, ignore_atom_id):
    seen_ids = {start_atom_id, ignore_atom_id}
    atom_ids = [] # Will only delete the newly found atoms, not the start atom
    stack = [mol.GetAtomWithIdx(start_atom_id)]
    
    # Use a depth-first search to find the connected atoms
    while stack:
        atom = stack.pop()
        atom_ids_to_visit, atoms_to_visit = get_atoms_to_visit(atom, seen_ids)
        atom_ids.extend(atom_ids_to_visit)
        stack.extend(atoms_to_visit)
        seen_ids.update(atom_ids_to_visit)
    
    return atom_ids
and try it out:
>>> from rdkit import Chem
>>> aspirin = Chem.MolFromSmiles("O=C(Oc1ccccc1C(=O)O)C")
>>> find_atom_ids_to_delete(aspirin, 3, 2)
[1, 0, 12]
>>> find_atom_ids_to_delete(aspirin, 2, 3)
[4, 8, 7, 9, 10, 11, 6, 5]
That looks right. (I left out the other testing I did to get this right.)

Trim atoms

The trim function has two parts. The first is to turn the attachment point into the wildcard atom "*", which I'll do by removing any charges, hydrogens, chirality, or other atom properties. This atom will not be deleted. The second is to remove the other atoms on that side of the bond.

Removing atoms from a molecule is slightly tricky. The atom indices are reset after each atom is removed. (While I often use a variable name like "atom_id", the atom id is not a permanent id, but simply the current atom index.) When atom "i" is deleted, all of the atoms with a larger id "j" get the new id "j-1". That is, the larger ids are all shifted down one to fill in the gap.

This becomes a problem because I have a list of ids to delete, but as I delete them the real ids change. For example, if I delete atoms 0 and 1 from a two atom molecule, in that order, I will get an exception:

>>> rwmol = Chem.RWMol(Chem.MolFromSmiles("C#N"))
>>> rwmol.RemoveAtom(0)
>>> rwmol.RemoveAtom(1)
[13:54:02] 

****
Range Error
idx
Violation occurred on line 155 in file /Users/dalke/ftps/rdkit-Release_2016_03_1/Code/GraphMol/ROMol.cpp
Failed Expression: 1 <= 0
****

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: Range Error
	idx
	Violation occurred on line 155 in file Code/GraphMol/ROMol.cpp
	Failed Expression: 1 <= 0
	RDKIT: 2016.03.1
	BOOST: 1_60
This is because after RemoveAtom(0) the old atom with id 1 gets the id 0. As another example, if I remove atoms 0 and 1 from a three atom molecule, I'll end up with what was originally the second atom:
>>> rwmol = Chem.RWMol(Chem.MolFromSmiles("CON"))
>>> rwmol.RemoveAtom(0)
>>> rwmol.RemoveAtom(1)
>>> Chem.MolToSmiles(rwmol)
'O'
Originally the ids were [C=0, O=1, N=2]. The RemoveAtom(0) removed the C and renumbered the remaining atoms, giving [O=0, N=1]. The RemoveAtom(1) then removed the N, leaving [O=0] as the remaining atom.

While you could pay attention to the renumbering and adjust the index of the atom to delete, the far easier solution is to sort the atom ids, and delete starting from the largest id.

Here is the function to turn the attachment atom into a wildcard and to delete the other atoms:

def trim_atoms(mol, wildcard_atom_id, atom_ids_to_delete):
    rwmol = Chem.RWMol(mol)
    # Change the attachment atom into a wildcard ("*") atom
    atom = rwmol.GetAtomWithIdx(wildcard_atom_id)
    atom.SetAtomicNum(0)
    atom.SetIsotope(0)
    atom.SetFormalCharge(0)
    atom.SetIsAromatic(False)
    atom.SetNumExplicitHs(0)
    
    # Remove the other atoms.
    # RDKit will renumber after every RemoveAtom() so
    # remove from highest to lowest atom index.
    # If not sorted, may get a "RuntimeError: Range Error"
    for atom_id in sorted(atom_ids_to_delete, reverse=True):
        rwmol.RemoveAtom(atom_id)
    
    # Don't need to ClearComputedProps() because that will
    # be done by CombineMols()
    return rwmol.GetMol()
(If this were a more general-purpose function I would need to call ClearComputedProps(), which is needed after any molecular editing in RDKit. I don't need to do this here because it will be done during CombineMols(), which is coming up.)

How did I figure out which properties needed to be changed to make a wildcard atom? I tracked them down during cross-validation tests of an earlier version of this algorithm.

I'll try out the new code:

>>> fragment1 = trim_atoms(aspirin, 2, [1, 0, 12])
>>> Chem.MolToSmiles(fragment1, isomericSmiles=True)
'[*]c1ccccc1C(=O)O'
>>> fragment2 = trim_atoms(aspirin, 3, [4, 8, 7, 9, 10, 11, 6, 5])
>>> Chem.MolToSmiles(fragment2, isomericSmiles=True)
'[*]OC(C)=O'

fragment_trim()

The high-level fragment code is straightforward, and I don't think it requires explanation:

def fragment_trim(mol, atom1, atom2):
    # Find the fragments on either side of a bond
    delete_atom_ids1 = find_atom_ids_to_delete(mol, atom1, atom2)
    fragment1 = trim_atoms(mol, atom1, delete_atom_ids1)
    
    delete_atom_ids2 = find_atom_ids_to_delete(mol, atom2, atom1)
    fragment2 = trim_atoms(mol, atom2, delete_atom_ids2)
    
    # Merge the two fragments.
    # (CombineMols() does the required ClearComputedProps())
    return Chem.CombineMols(fragment1, fragment2)

I'll check that it does what I think it should do:

>>> new_mol = fragment_trim(aspirin, 2, 3)
>>> Chem.MolToSmiles(new_mol)
'[*]OC(C)=O.[*]c1ccccc1C(=O)O'
This matches the FragmentOnBonds() output at the start of this essay.

Testing

I used the same cross-validation method I did in my previous essay. I parsed structures from ChEMBL and checked that fragment_trim() produces the same results as FragmentOnBonds(). It doesn't. The first failure mode surprised me:

fragment_trim() only works with a single connected structure

Here's one of the failure reports:

Mismatch: record: 61 id: CHEMBL1203109 smiles: Cl.CNC(=O)c1cc(C(O)CNC(C)CCc2ccc3c(c2)OCO3)ccc1O cut: 1 2
   smiles_trim: Cl.Cl.[*]C.[*]NC(=O)c1cc(C(O)CNC(C)CCc2ccc3c(c2)OCO3)ccc1O
smiles_on_bond: Cl.[*]C.[*]NC(=O)c1cc(C(O)CNC(C)CCc2ccc3c(c2)OCO3)ccc1O
Somehow the "Cl" appears twice. After thinking about it I realized that my implementation of the copy and trim algorithm assumes that the input structure is strongly connected, that is that it only contains a single molecule. Each copy contains a chlorine atom, so when I CombineMols() I end up with two chlorine atoms.

I could fix my implementation to handle this, but as I'm only interested in cross-validation, the easier solution is to never process a structure with multiple molecules. I decided that if the input SMILES contains multiple molecules (that is, if the SMILES contains the dot disconnection symbol ".") then I would choose the component which has the most characters, using the following:

if "." in smiles:
    # The fragment_trim() method only works on a connected molecule.
    # Arbitrarily pick the component with the longest SMILES string.
    smiles = max(smiles.split("."), key=len)

Directional bonds

After 1000 records and 31450 tests there were 362 mismatches. All of them appeared to be related to directional bonds, like this mismatch report:

Mismatch: record: 124 id: CHEMBL445177 smiles: CCCCCC/C=C\CCCCCCCc1cccc(O)c1C(=O)O cut: 7 8
   smiles_trim: [*]/C=C\CCCCCC.[*]CCCCCCCc1cccc(O)c1C(=O)O
smiles_on_bond: [*]C=CCCCCCC.[*]CCCCCCCc1cccc(O)c1C(=O)O
I knew this would happen coming into this essay, but it's still nice to see. It demonstrates that fragment_trim() is a better way to cross-validate FragmentOnBonds() than my earlier fragment_chiral() algorithm.

It's possible that other mismatches are hidden in the hundreds of reports, so I removed directional bonds from the SMILES and processed again:

if "/" in smiles:
    smiles = smiles.replace("/", "")
if "\\" in smiles:
    smiles = smiles.replace("\\", "")
That testing showed I forgot to include "atom.SetIsotope(0)" when I converted the attachment atom into a wildcard atom. Without it, the algorithm turned an "[18F]" into an "[18*]". That surprised me because I copied the code from a working earlier version of the algorithm. Looking into it, I realized I left that line out during the copy&paste. This is why regression tests using diverse real-world data are important.

I stopped the testing after 100,000 records. Here is the final status line:
Processed 100000 records, 100000 molecules, 1120618 tests, 0 mismatches.  T_trim: 2846.79 T_on_bond 258.18
While the trim code is slower than the built-in function (about 11x slower in this run), the point of this exercise isn't to find the fastest implementation but to have a better way to cross-validate the original code, which it has done.

The final code

Here is the final code:

from __future__ import print_function

import datetime
import time

from rdkit import Chem

# Look for neighbor atoms of 'atom' which haven't been seen before  
def get_atoms_to_visit(atom, seen_ids):
    neighbor_ids = []
    neighbor_atoms = []
    for bond in atom.GetBonds():
        neighbor_atom = bond.GetOtherAtom(atom)
        neighbor_id = neighbor_atom.GetIdx()
        if neighbor_id not in seen_ids:
            neighbor_ids.append(neighbor_id) 
            neighbor_atoms.append(neighbor_atom)
    
    return neighbor_ids, neighbor_atoms

# Find all of the atoms connected to 'start_atom_id', except for 'ignore_atom_id'
def find_atom_ids_to_delete(mol, start_atom_id, ignore_atom_id):
    seen_ids = {ignore_atom_id, start_atom_id}
    atom_ids = [] # Will only delete the newly found atoms, not the start atom
    stack = [mol.GetAtomWithIdx(start_atom_id)]

    # Use a depth-first search to find the connected atoms
    while stack:
        atom = stack.pop()
        atom_ids_to_visit, atoms_to_visit = get_atoms_to_visit(atom, seen_ids)
        atom_ids.extend(atom_ids_to_visit)
        stack.extend(atoms_to_visit)
        seen_ids.update(atom_ids_to_visit)
    
    return atom_ids

def trim_atoms(mol, wildcard_atom_id, atom_ids_to_delete):
    rwmol = Chem.RWMol(mol)
    # Change the attachment atom into a wildcard ("*") atom
    atom = rwmol.GetAtomWithIdx(wildcard_atom_id)
    atom.SetAtomicNum(0)
    atom.SetIsotope(0)
    atom.SetFormalCharge(0)
    atom.SetIsAromatic(False)
    atom.SetNumExplicitHs(0)
    
    # Remove the other atoms.
    # RDKit will renumber after every RemoveAtom() so
    # remove from highest to lowest atom index.
    # If not sorted, may get a "RuntimeError: Range Error"
    for atom_id in sorted(atom_ids_to_delete, reverse=True):
        rwmol.RemoveAtom(atom_id)
    
    # Don't need to ClearComputedProps() because that will
    # be done by CombineMols()
    return rwmol.GetMol()
    
def fragment_trim(mol, atom1, atom2):
    # Find the fragments on either side of a bond
    delete_atom_ids1 = find_atom_ids_to_delete(mol, atom1, atom2)
    fragment1 = trim_atoms(mol, atom1, delete_atom_ids1)
    
    delete_atom_ids2 = find_atom_ids_to_delete(mol, atom2, atom1)
    fragment2 = trim_atoms(mol, atom2, delete_atom_ids2)

    # Merge the two fragments.
    # (CombineMols() does the required ClearComputedProps())
    return Chem.CombineMols(fragment1, fragment2)

####### Cross-validation code

def fragment_on_bond(mol, atom1, atom2):
    bond = mol.GetBondBetweenAtoms(atom1, atom2)
    return Chem.FragmentOnBonds(mol, [bond.GetIdx()], dummyLabels=[(0, 0)])


def read_records(filename):
    with open(filename) as infile:
        for recno, line in enumerate(infile):
            smiles, id = line.split()
            if "*" in smiles:
                continue
            yield recno, id, smiles
            
def cross_validate():
    # Match a single bond not in a ring
    BOND_SMARTS = "[!#0;!#1]-!@[!#0;!#1]"
    single_bond_pat = Chem.MolFromSmarts(BOND_SMARTS)
    
    num_mols = num_tests = num_mismatches = 0
    time_trim = time_on_bond = 0.0

    # Helper function to print status information. Since
    # the function is defined inside of another function,
    # it has access to variables in the outer function.
    def print_status():
        print("Processed {} records, {} molecules, {} tests, {} mismatches."
              "  T_trim: {:.2f} T_on_bond {:.2f}"
              .format(recno, num_mols, num_tests, num_mismatches,
                      time_trim, time_on_bond))
    
    filename = "/Users/dalke/databases/chembl_20_rdkit.smi"
    start_time = datetime.datetime.now()
    for recno, id, smiles in read_records(filename):
        if recno % 100 == 0:
            print_status()

        if "." in smiles:
            # The fragment_trim() method only works on a connected molecule.
            # Arbitrarily pick the component with the longest SMILES string.
            smiles = max(smiles.split("."), key=len)
            
        ## Remove directional bonds to see if there are mismatches
        ## which are not caused by directional bonds.
        #if "/" in smiles:
        #    smiles = smiles.replace("/", "")
        #if "\\" in smiles:
        #    smiles = smiles.replace("\\", "")
        
        mol = Chem.MolFromSmiles(smiles)
        if mol is None:
            continue
        num_mols += 1

        # Find all of the single bonds
        matches = mol.GetSubstructMatches(single_bond_pat)

        for a1, a2 in matches:
            # For each pair of atom indices, split on that bond
            # and determine the canonicalized fragmented molecule.
            # Do it for both fragment_trim() ...
            t1 = time.time()
            mol_trim = fragment_trim(mol, a1, a2)
            t2 = time.time()
            time_trim += t2-t1
            smiles_trim = Chem.MolToSmiles(mol_trim, isomericSmiles=True)

            # ... and for fragment_on_bond()
            t1 = time.time()
            mol_on_bond = fragment_on_bond(mol, a1, a2)
            t2 = time.time()
            time_on_bond += t2-t1
            smiles_on_bond = Chem.MolToSmiles(mol_on_bond, isomericSmiles=True)

            # Report any mismatches and keep on going.
            num_tests += 1
            if smiles_trim != smiles_on_bond:
                print("Mismatch: record: {} id: {} smiles: {} cut: {} {}".format(
                      recno, id, smiles, a1, a2))
                print("   smiles_trim:", smiles_trim)
                print("smiles_on_bond:", smiles_on_bond)
                num_mismatches += 1
    
    end_time = datetime.datetime.now()
    elapsed_time = str(end_time - start_time)
    elapsed_time = elapsed_time.partition(".")[0]  # don't display subseconds
    
    print("Done.")
    print_status()
    print("elapsed time:", elapsed_time)

if __name__ == "__main__":
    cross_validate()


Software mentioned in ChEMBL SD records #

When I work with real-world data sets, I usually start with PubChem before going on to ChEMBL. Why? The PubChem data is all generated by one chemistry toolkit - I'm pretty sure it's OEChem - while ChEMBL data comes from many sources.

To get a sense of the diversity, I processed the ChEMBL 21 release to get the second line of each record. The ctfile format specification says that if non-blank then the 8 characters starting in the third position should contain the program name. The documentation includes a description of the fields:

IIPPPPPPPPMMDDYYHHmmddSS…
where the program name is the PPPPPPPP field. Here are a few examples:
  SciTegic12231509382D
  Mrv0541 05211314572D          
          08180812482D 1   1.00000     0.00000     0
  Symyx   11300911332D 1   1.00000     0.00000     0
The third program wishes to remain anonymous.

Extract program names from the ChEMBL 21 SDF

I'll write a program to extract those program names and count how many times each one occurs. I don't need a general-purpose SD file reader because ChEMBL uses a subset of the SD format. For example, there are only two lines in each record which start with "CHEMBL", the title line (the first line of the record) and the data line after the "chembl_id" tag.

My code reads line by line through the file. The first time it sees a "CHEMBL" line, that's the title line, so the following line (the second line of the record) contains the data. Then when it sees "> <chembl_id>" it knows to skip the following line, which will also have a CHEMBL on it.

There are two oddities here. First, the gzip reader returns byte strings. I decided to do the pattern matching on the byte strings to ignore the overhead of converting everything to Unicode when I only need a few characters from the file. Second, a Python file object is its own iterator, so I can use "infile" in both the for-loop and in the body itself.

import sys, gzip
from collections import Counter
counter = Counter()
with gzip.open("/Users/dalke/databases/chembl_21.sdf.gz", "rb") as infile:
  for line in infile:
      # Get the line after the title line (which starts with 'CHEMBL')
      if line[:6] == b"CHEMBL":
          next_line = next(infile)
          # Print unique program names
          program = next_line[2:10]
          if program not in counter:
              print("New:", repr(str(program, "ascii")))
          counter[program] += 1
      
      if line[:13] == b"> <chembl_id>":
          ignore = next(infile) # skip the CHEMBL
  
  print("Done. Here are the counts for each program seen.")
  for name, count in counter.most_common():
    print(repr(str(name, "ascii")), count)

Program counts

Here is the table of counts:
Program name     count
------------   -------
'SciTegic'     1105617
' '             326445
'CDK 9'          69145
'Mrv0541 '       30962
''               24531
'CDK 6'          10805
'CDK 1'           8642
'CDK 5'           4771
'CDK 8'           2209
'-ISIS- '          281
'Symyx '           171
'CDK 7'            144
'-OEChem-'          96
'Marvin '           61
'Mrv0540 '          13
' RDKit'             3
'CDK 3'              1
The different CDK entries aren't version numbers; they exist because CDK didn't format the line correctly. The specification states that the first few fields of a non-blank line are supposed to be:

IIPPPPPPPPMMDDYYHHmmddSS…
while the CDK lines use:
  CDK    7/28/10,10:58
  CDK    8/10/10,12:22
  CDK    9/16/09,9:40
  CDK    10/7/09,10:42
  CDK    11/9/10,11:20
where the date starts one byte too early. The date also isn't in the specified format.

Further examination shows these were only generated in 2009 and 2010. The current CDK implementation is correct, and a 'git annotate' suggests it was fixed in 2010.
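
As a small follow-up of my own (using the counter from the script above), the CDK variants can be folded together into a single count:

# All of the 'CDK ...' entries are really one program; the digit after the
# name is the first digit of the date, not a version number.
cdk_total = sum(count for name, count in counter.items() if name.startswith(b"CDK"))
print("CDK total:", cdk_total)   # 95717, summing the CDK rows in the table above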

I don't think anyone uses that line for anything, so I don't see the point of changing anything, and it's not worthwhile to even ask ChEMBL to fix these old records.



Use FragmentOnBonds to fragment a molecule in RDKit #

This is part of a continuing series on how to fragment a molecular graph. So far I have covered:

Those were mostly pedagogical. They describe the low-level details of how to cut a bond to fragment a molecule into two parts, and still maintain correct chirality.

If you are writing production code, you should almost certainly use the built-in RDKit function FragmentOnBonds() to fragment a molecule, and not use any of the code from those essays.

FragmentOnBonds

FragmentOnBonds() will fragment a molecule given a list of bond identifiers. In default use it attaches a new "dummy" atom (what I've been calling a wildcard atom) to each of the two atoms of a broken bond. It also sets the isotope label of each newly created dummy atom to the index of the atom at the other end of the broken bond, that is, the atom which the dummy replaces.

>>> from rdkit import Chem
>>> vanillin = Chem.MolFromSmiles("c1(C=O)cc(OC)c(O)cc1")
>>> bond = vanillin.GetBondBetweenAtoms(4, 5)
>>> bond
<rdkit.Chem.rdchem.Bond object at 0x102bec8a0>
>>> bond.GetIdx()
4
>>> new_mol = Chem.FragmentOnBonds(vanillin, [bond.GetIdx()])
>>> Chem.MolToSmiles(new_mol, isomericSmiles=True)
'[4*]OC.[5*]c1cc(C=O)ccc1O'
I don't want a label, so I'll specify 0 for the atom labels. (An unlabeled atom has an isotope of zero.)
>>> new_mol = Chem.FragmentOnBonds(vanillin, [bond.GetIdx()], dummyLabels=[(0, 0)])
>>> Chem.MolToSmiles(new_mol, isomericSmiles=True)
'[*]OC.[*]c1cc(C=O)ccc1O'

I'll turn this into a function that I can use for testing:

def fragment_on_bond(mol, a1, a2):
    bond = mol.GetBondBetweenAtoms(a1, a2)
    return Chem.FragmentOnBonds(mol, [bond.GetIdx()], addDummies=True, dummyLabels=[(0, 0)])
That's certainly much easier than fragment_chiral()!
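
For example, applying it to the vanillin molecule and the same bond between atoms 4 and 5 from the interactive session above should reproduce the unlabeled output:

>>> new_mol = fragment_on_bond(vanillin, 4, 5)
>>> Chem.MolToSmiles(new_mol, isomericSmiles=True)
'[*]OC.[*]c1cc(C=O)ccc1O'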

FragmentOnBonds implementation

The RDKit source code is available, so we can look at how this built-in function is implemented. A search for "FragmentOnBonds":

Code/GraphMol/Wrap/MolOps.cpp
1631:      python::def("FragmentOnBonds", fragmentOnBondsHelper,
plus some investigation shows that the actual C++ code has the name "fragmentOnBonds". The RDKit convention is to use an initial uppercase letter for Python functions and an initial lowercase letter for C++ functions.

A search this time for "fragmentOnBonds" has a few more hits, including this very suggestive one:

Code/GraphMol/ChemTransforms/MolFragmenter.cpp
320:        ROMol *nm=fragmentOnBonds(mol,fragmentHere,addDummies,dummyLabelsHere,bondTypesHere,lCutsPerAtom);
329:    ROMol *fragmentOnBonds(const ROMol &mol, const std::vector<unsigned int> &bondIndices,
That second result is the start of the fragmentOnBonds() function definition. Here's the key part, with commentary.

The function can fragment multiple bonds. It iterates through the bonds in order. For each one it gets the bond type and the begin and end atoms, and if requested it stores information about the number of cuts per atom.

  for (unsigned int i = 0; i < bondsToRemove.size(); ++i) {
    const Bond *bond = bondsToRemove[i];
    unsigned int bidx = bond->getBeginAtomIdx();
    unsigned int eidx = bond->getEndAtomIdx();
    Bond::BondType bT = bond->getBondType();
    res->removeBond(bidx, eidx);
    if (nCutsPerAtom) {
      (*nCutsPerAtom)[bidx] += 1;
      (*nCutsPerAtom)[eidx] += 1;
    }
FragmentOnBonds() by default will add "dummy" atoms (atoms with atomic number 0), but this can be disabled. If enabled, it creates the two new atoms with atomic number 0. By default it sets each dummy's isotope to the index of the atom at the other end of the broken bond, but the labels can also be specified with the dummyLabels parameter:
    if (addDummies) {
      Atom *at1, *at2;
      at1 = new Atom(0);
      at2 = new Atom(0);
      if (dummyLabels) {
        at1->setIsotope((*dummyLabels)[i].first);
        at2->setIsotope((*dummyLabels)[i].second);
      } else {
        at1->setIsotope(bidx);
        at2->setIsotope(eidx);
      }
Next, make the bond from the old terminal atoms to the new wildcard atoms. By default the bond type for the new bonds is the same as the bond which was broken (determined earlier). Otherwise, there's an option to specify an alternate bond type.
      unsigned int idx1 = res->addAtom(at1, false, true);
      if (bondTypes) bT = (*bondTypes)[i];
      res->addBond(eidx, at1->getIdx(), bT);
      unsigned int idx2 = res->addAtom(at2, false, true);
      res->addBond(bidx, at2->getIdx(), bT);

The last part does what the comment says it does; if the atom has a tetrahedral chirality then check if the permutation order has changed and invert the chirality if needed:
      // figure out if we need to change the stereo tags on the atoms:
      if (mol.getAtomWithIdx(bidx)->getChiralTag() ==
              Atom::CHI_TETRAHEDRAL_CCW ||
          mol.getAtomWithIdx(bidx)->getChiralTag() ==
              Atom::CHI_TETRAHEDRAL_CW) {
        checkChiralityPostMove(mol, mol.getAtomWithIdx(bidx),
                               res->getAtomWithIdx(bidx),
                               mol.getBondBetweenAtoms(bidx, eidx));
      }
      if (mol.getAtomWithIdx(eidx)->getChiralTag() ==
              Atom::CHI_TETRAHEDRAL_CCW ||
          mol.getAtomWithIdx(eidx)->getChiralTag() ==
              Atom::CHI_TETRAHEDRAL_CW) {
        checkChiralityPostMove(mol, mol.getAtomWithIdx(eidx),
                               res->getAtomWithIdx(eidx),
                               mol.getBondBetweenAtoms(bidx, eidx));
}
The code to check if the chirality has changed iterates through all of the bonds of the old atom ("oAt") to make a list ("newOrder") containing the indices of those bonds. (The "OEDGE_ITER" is part of the adapter for working with the Boost Graph library. It's an "out_edge_iterator".)
void checkChiralityPostMove(const ROMol &mol, const Atom *oAt, Atom *nAt,
                            const Bond *bond) {
  INT_LIST newOrder;
  ROMol::OEDGE_ITER beg, end;
  boost::tie(beg, end) = mol.getAtomBonds(oAt);
  while (beg != end) {
    const BOND_SPTR obond = mol[*beg];
    ++beg;
The code knows that the RemoveBond()/AddBond() combination places the bond to the new dummy atom at the end of the bond list, so when it reaches the old bond it simply skips it. Once it's gone through the other bonds, it adds the old bond's index to the end of the list, as a stand-in for the new bond in its new position. Interestingly, this placement isn't something I can depend on, because it's an implementation detail which might change in the future. That's why my own code does the substitution explicitly rather than relying on position.
    if (obond.get() == bond) {
      continue;
    }
    newOrder.push_back(obond->getIdx());
  }
  newOrder.push_back(bond->getIdx());
The last bit is to compute the permutation order, or as it says, "perturbation order". I believe that is a typo. Digging down a bit, it uses a selection sort to count the number of swaps needed to make the list of bonds around "oAt" match the "newOrder" list. The result, modulo two, gives the parity. If the parity is odd, invert the chirality:
  unsigned int nSwaps = oAt->getPerturbationOrder(newOrder);
  if (nSwaps % 2) nAt->invertChirality();
}
This was a bit different than my approach, which compared the parity before and after, but it gives the same results.

Jumping back to the fragmentOnBonds() function, the last bit of the function is:

  res->clearComputedProps();
  return static_cast<ROMol *>(res);
This is needed because some of the computed properties may depend on bond patterns which are no longer present. Note that it does not use SanitizeMol(), which is something I investigated in the previous essay.

While the details are a bit different between my fragment_chiral() and FragmentOnBonds(), the general approach is the same.

A possible optimization

I mentioned that checkChiralityPostMove() uses insider knowledge. It knows that the deleted bond is removed from the list and the new bond added at the end. It uses this information to build the "newOrder" list with atom indices in the correct order, before determining the difference in parity between that list and the list of atom indices around the new bond.

There's an easier way. Count the number of bonds between where the deleted bond was and the end of the list, and take the result modulo 2. For example, if the transformation were:

initial neighbors = [1, 6, 5, 4]
   - remove bond to atom 5
neighbors after delete = [1, 6, 4]
   - add bond to new atom 9
final neighbors = [1, 6, 4, 9]
then the bond to atom 5 was in the third position, with one bond (to atom 4) between it and the end of the list. The parity is therefore odd. I added this suggestion to the RDKit issue tracker as #1033.

The proof is simple. Assume there are n elements after the bond to delete. Then n swaps are needed to bring it to the end, where it can be replaced with the connection to the new atom. Hence the parity change is n%2.
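
Here's a minimal sketch of that shortcut (my own illustration, not RDKit code; the function name is made up). The only information needed is the position of the deleted bond in the neighbor list:

def parity_change_on_replacement(neighbors, deleted):
    # neighbors: bond (or atom) ids around the atom, in their original order.
    # deleted: the id being removed; its replacement is appended at the end.
    # The parity change is the number of entries after the deleted one, modulo 2.
    n_after = len(neighbors) - 1 - neighbors.index(deleted)
    return n_after % 2

# The example above: remove the bond to atom 5 from [1, 6, 5, 4].
# One entry (the 4) comes after it, so the parity change is odd.
assert parity_change_on_replacement([1, 6, 5, 4], 5) == 1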

Not using SanitizeMol for this investigation

In the earlier essay in this series, "Fragment chiral molecules in RDKit", I investigated some failure cases where the chirality wasn't preserved. Greg Landrum pointed out: "after you break one or more bonds, you really, really should re-sanitize the molecule (or at least call ClearComputedProps())". I found that if I added SanitizeMol() then some of the error cases I got when using only ClearComputedProps() were no longer errors. This may be because of a bug in FastFindRings. Greg, in the second link, writes: "There is a workaround until this is fixed; explicitly sanitize your molecules or just call GetSymmSSSR() before you generate SMILES."

So that's what I did. However, even without SanitizeMol() there were only 232 failures out of 1.4M+ input structures, and adding SanitizeMol() didn't fix all of the remaining errors. On the other hand, the performance overhead of SanitizeMol() is (as expected) larger than that of changing a few bonds around. In one timing test, processing 10,000 structures with fragment_chiral() took 107 seconds without SanitizeMol() and 255 seconds with it, so the overall processing takes about twice as long. (There is an additional constant overhead to parse the SMILES and canonicalize the fragmented molecule, which I did not include.)

For this pedagogical investigation, I don't think the performance impact is worth the slightly lower error rate, so I will not include SanitizeMol() in this essay. That means I can use my new fragment_on_bond() function as-is, and I'll modify fragment_chiral() so it calls ClearComputedProps() instead of SanitizeMol().

In any case, the explicit call to SanitizeMol() is a workaround. I expect ClearComputedProps() will suffice in the RDKit of the future world that you, the reader, inhabit.

Cross-validation

In the earlier essay I tested the fragment_chiral() function by attempting to re-connect the fragments. If the canonicalized result doesn't match the canonicalized input structure then there's a problem. (And there were a small number of problems, at the edge of RDKit's ability to handle chemistry.)

That test ignored what SMILES calls "directional bonds", the "/" and "\" in "F/C=C\F", which is how SMILES denotes double-bond stereochemistry (the "E-Z notation" more familiar to a chemist). I ignored those bonds for the simple reason that FragmentOnBonds() doesn't preserve that information when it cuts the bond. In a future essay I'll show how to implement that ability.

Instead of fully testing fragment_on_bond(), I'll do a weaker test where I compare its results to fragment_chiral(). I say it's weaker because what it shows is that the two are likely bug-compatible, which is different from being correct. It's also weaker because the two implementation methods are very similar. A stronger test would use a different way to fragment.

I think of this as a cross-validation test. As Wikipedia helpfully points out, that term is overloaded. Statistics uses a different definition than chemistry; the chemistry usage is very similar to the one in verification and validation, so I think I'm right to use the term.

The test code is not interesting, so I'll put it at the end and not discuss it in detail. The results weren't interesting either. There were no mismatches. I just had to wait for a few hours while I let it process the 1.4M+ structures from ChEMBL 20.

Here are the last few lines of output:

Processed 1455762 records, 1455763 molecules, 16017700 tests, 0 mismatches.  T_chiral: 5912.26 T_on_bond 3554.22
elapsed time: 8:30:15
The two implementations gave identical results, and the time needed for fragment_chiral() is about 1.7x that needed for fragment_on_bond(). The fragment_on_bond() code is shorter, faster, and easier to understand, so you should always use it instead of fragment_chiral().

The code

The fragment_chiral() function here is a copy of the one from the previous essay, except I replaced the SanitizeMol(new_mol) call with new_mol.ClearComputedProps() for 2x the performance, at the expense of a few incorrect structures. (The SanitizeMol() is a work-around for a current bug report. It shouldn't be needed in the long term.) I put the code here so the test code is self-contained, except of course you'll need your own dataset.

from __future__ import print_function
import time
import datetime
from rdkit import Chem

# Two different methods to cut a bond and fragment an RDKit molecule.
#   - fragment_on_bond() uses RDKit's FragmentOnBonds()
#   - fragment_chiral() uses lower-level API calls

# This also includes a cross-validation function which checks that the
# two methods produce the same fragments as output.
# The final two lines when I evaluated ChEMBL20 were:
#
# Processed 1455762 records, 1455763 molecules, 16017700 tests, 0 mismatches.  T_chiral: 5912.26 T_on_bond 3554.22
# elapsed time: 8:30:15
#
# The timings show that fragment_on_bond() is about 1.7x the performance of fragment_chiral().


#### Fragment using a high-level RDKit API function

def fragment_on_bond(mol, atom1, atom2):
    bond = mol.GetBondBetweenAtoms(atom1, atom2)
    new_mol = Chem.FragmentOnBonds(mol, [bond.GetIdx()], dummyLabels=[(0, 0)])
    # FragmentOnBonds() calls ClearComputedProps() at the end.  There
    # is a current bug report where, as a downstream effect, that may
    # cause some chiralities to change, most notably on some
    # bridgeheads.. A workaround for now is to call SanitizeMol(),
    # though that ends up tripling the time. I'll stay compatible
    # with FragmentOnBonds() and not call it.
    #Chem.SanitizeMol(new_mol)

    return new_mol


#### Fragment using low-level RDKit API functions

# See http://www.dalkescientific.com/writings/diary/archive/2016/08/14/fragment_chiral_molecules.html
# for implementation discussion

CHI_TETRAHEDRAL_CW = Chem.ChiralType.CHI_TETRAHEDRAL_CW
CHI_TETRAHEDRAL_CCW = Chem.ChiralType.CHI_TETRAHEDRAL_CCW

def parity_shell(values):
    # Simple Shell sort; while O(N^2), we only deal with at most 4 values
    # See http://www.dalkescientific.com/writings/diary/archive/2016/08/15/fragment_parity_calculation.html
    # for faster versions for fixed-size lists.
    values = list(values)
    N = len(values)
    num_swaps = 0
    for i in range(N-1):
        for j in range(i+1, N):
            if values[i] > values[j]:
                values[i], values[j] = values[j], values[i]
                num_swaps += 1
    return num_swaps % 2


def get_bond_parity(mol, atom_id):
    """Compute the parity of the atom's bond permutation

    Return None if it does not have tetrahedral chirality,
    0 for even parity, or 1 for odd parity.
    """
    atom_obj = mol.GetAtomWithIdx(atom_id)
    
    # Return None unless it has tetrahedral chirality
    chiral_tag = atom_obj.GetChiralTag()
    if chiral_tag not in (CHI_TETRAHEDRAL_CW, CHI_TETRAHEDRAL_CCW):
        return None
    
    # Get the ids of the atoms it's bonded to.
    other_atom_ids = [bond.GetOtherAtomIdx(atom_id) for bond in atom_obj.GetBonds()]
    
    # Use those ids to determine the parity
    return parity_shell(other_atom_ids)


def set_bond_parity(mol, atom_id, old_parity, old_other_atom_id, new_other_atom_id):
    """Compute the new bond parity and flip chirality if needed to match the old parity"""
    
    atom_obj = mol.GetAtomWithIdx(atom_id)
    # Get the ids of the atoms it's bonded to.
    other_atom_ids = [bond.GetOtherAtomIdx(atom_id) for bond in atom_obj.GetBonds()]
    
    # Replace id from the new wildcard atom with the id of the original atom
    i = other_atom_ids.index(new_other_atom_id)
    other_atom_ids[i] = old_other_atom_id
    
    # Use those ids to determine the parity
    new_parity = parity_shell(other_atom_ids)
    if old_parity != new_parity:
        # If the parity has changed, invert the chirality
        atom_obj.InvertChirality()

def fragment_chiral(mol, atom1, atom2):
    """Cut the bond between atom1 and atom2 and replace with connections to wildcard atoms

    Return the fragmented structure as a new molecule.
    """
    rwmol = Chem.RWMol(mol)
    
    atom1_parity = get_bond_parity(mol, atom1)
    atom2_parity = get_bond_parity(mol, atom2)
    
    rwmol.RemoveBond(atom1, atom2)
    wildcard1 = rwmol.AddAtom(Chem.Atom(0))
    wildcard2 = rwmol.AddAtom(Chem.Atom(0))
    new_bond1 = rwmol.AddBond(atom1, wildcard1, Chem.BondType.SINGLE)
    new_bond2 = rwmol.AddBond(atom2, wildcard2, Chem.BondType.SINGLE)
    
    if atom1_parity is not None:
        set_bond_parity(rwmol, atom1, atom1_parity, atom2, wildcard1)
    if atom2_parity is not None:
        set_bond_parity(rwmol, atom2, atom2_parity, atom1, wildcard2)
    
    # After breaking bonds, should re-sanitize See
    # https://github.com/rdkit/rdkit/issues/1022#issuecomment-239355482
    # or at least call ClearComputedProps().  I found that
    # SanitizeMol() improves chiral carbon bridgeheads handling,
    # though using it doubles the execution time. I'll stick with
    # ClearComputedProps(), which matches what FragmentOnBonds() does
    new_mol = rwmol.GetMol()
    # I must ClearComputedProps() after editing the structure.
    new_mol.ClearComputedProps()
    #Chem.SanitizeMol(new_mol)
    return new_mol

####### Cross-validation code

def read_records(filename):
    with open(filename) as infile:
        for recno, line in enumerate(infile):
            smiles, id = line.split()
            if "*" in smiles:
                continue
            yield recno, id, smiles

def cross_validate():
    # Match a single bond not in a ring
    BOND_SMARTS = "[!#0;!#1]-!@[!#0;!#1]"
    single_bond_pat = Chem.MolFromSmarts(BOND_SMARTS)
    
    num_mols = num_tests = num_mismatches = 0
    time_chiral = time_on_bond = 0.0

    # Helper function to print status information. Since
    # the function is defined inside of another function,
    # it has access to variables in the outer function.
    def print_status():
        print("Processed {} records, {} molecules, {} tests, {} mismatches."
              "  T_chiral: {:.2f} T_on_bond {:.2f}"
              .format(recno, num_mols, num_tests, num_mismatches,
                      time_chiral, time_on_bond))
    
    filename = "/Users/dalke/databases/chembl_20_rdkit.smi"
    start_time = datetime.datetime.now()
    for recno, id, smiles in read_records(filename):
        if recno % 1000 == 0:
            print_status()

        mol = Chem.MolFromSmiles(smiles)
        if mol is None:
            continue
        num_mols += 1

        # Find all of the single bonds
        matches = mol.GetSubstructMatches(single_bond_pat)

        for a1, a2 in matches:
            # For each pair of atom indices, split on that bond
            # and determine the canonicalized fragmented molecule.
            # Do it for both fragment_chiral() ...
            t1 = time.time()
            mol_chiral = fragment_chiral(mol, a1, a2)
            t2 = time.time()
            time_chiral += t2-t1
            smiles_chiral = Chem.MolToSmiles(mol_chiral, isomericSmiles=True)

            # ... and for fragment_on_bond()
            t1 = time.time()
            mol_on_bond = fragment_on_bond(mol, a1, a2)
            t2 = time.time()
            time_on_bond += t2-t1
            smiles_on_bond = Chem.MolToSmiles(mol_on_bond, isomericSmiles=True)

            # Report any mismatches and keep on going.
            num_tests += 1
            if smiles_chiral != smiles_on_bond:
                print("Mismatch: record: {} id: {} smiles: {} cut: {} {}".format(
                      recno, id, smiles, a1, a2))
                print(" smiles_chiral:", smiles_chiral)
                print("smiles_on_bond:", smiles_on_bond)
                num_mismatches += 1
    
    end_time = datetime.datetime.now()
    elapsed_time = str(end_time - start_time)
    elapsed_time = elapsed_time.partition(".")[0]  # don't display subseconds
    
    print("Done.")
    print_status()
    print("elapsed time:", elapsed_time)
                
if __name__ == "__main__":
    cross_validate()



Faster parity calculation #

In the previous essay I needed to determine the parity of a permutation. I used a Shell sort and counted the number of swaps needed to order the list. The parity is even (or "0") if the number of swaps is even, otherwise it's odd (or "1"). The final code was:

def parity_shell(values):
    # Simple Shell sort; while O(N^2), we only deal with at most 4 values 
    values = list(values)
    N = len(values)
    num_swaps = 0
    for i in range(N-1):
        for j in range(i+1, N):
            if values[i] > values[j]:
                values[i], values[j] = values[j], values[i]
                num_swaps += 1
    return num_swaps % 2

I chose this implementation because it's easy to understand, and any failure case is easily found. However, it's not fast.

It's tempting to use a better sort method. The Shell sort takes quadratic time in the number of elements, while others take O(N*ln(N)) time in the asymptotic case.

However, an asymptotic analysis is pointless for this case. The code will only ever receive 3 terms (if there is a chiral hydrogen) or 4 terms because the code will only ever be called for tetrahedral chirality.

Sorting networks

The first time I worked on this problem, I used a sorting network. A sorting network works on a fixed number of elements. It uses a pre-determined set of pairwise comparisons, each followed by a swap if needed. These are often used where code branches are expensive, like in hardware or on a GPU. A sorting network also takes constant time, so it can help minimize timing side-channel attacks, where the time to sort may give some insight into what is being sorted.

A general algorithm to find an optimal sorting network for a given number of elements N isn't known, though there are non-optimal constructions like Bose-Nelson and Batcher's odd-even mergesort, and optimal solutions are known up to N=10.

John M. Gamble has a CGI script which will generate a sorting network for a given number of elements and choice of algorithm. For N=4 it generates:

N=4 elements: SWAP(0, 1); SWAP(2, 3); SWAP(0, 2); SWAP(1, 3); SWAP(1, 2);
where the SWAP would modify the elements of an array in-place. Here's one way to turn those instructions into a 4-element sort for Python:
def sort4(data):
  if data[1] < data[0]:  # SWAP(0, 1)
    data[0], data[1] = data[1], data[0]
  
  if data[3] < data[2]:  # SWAP(2, 3)
    data[2], data[3] = data[3], data[2]
  
  if data[2] < data[0]:  # SWAP(0, 2)
    data[0], data[2] = data[2], data[0]
  
  if data[3] < data[1]:  # SWAP(1, 3)
    data[1], data[3] = data[3], data[1]
  
  if data[2] < data[1]:  # SWAP(1, 2)
    data[1], data[2] = data[2], data[1]

As a test, I'll sort every permutation of four values and make sure the result is sorted. I could write the test cases out manually, but it's easier to use the "permutations()" function from Python's itertools module, as in this example with 3 values:

>>> list(itertools.permutations([1, 2, 3]))
[(1, 2, 3), (1, 3, 2), (2, 1, 3), (2, 3, 1), (3, 1, 2), (3, 2, 1)]
Here's the test, which confirms that the function sorts correctly:
>>> for permutation in itertools.permutations([0, 1, 2, 3]):
...   permutation = list(permutation) # Convert the tuple to a list so sort4() can swap elements
...   sort4(permutation)
...   if permutation != [0, 1, 2, 3]:
...     print("ERROR:", permutation)
... 
>>>

I think it's obvious how to turn this into a parity function by adding a swap counter. If the input array cannot be modified then the parity function needs to make a copy of the array first. That's what parity_shell() does. (NOTE: the day after writing this I realized I'm thinking like a C programmer. In Python the better solution is to copy the items into local variables.)

No need to sort

A sorting network with D comparisons always does all D of them, but that many comparisons aren't always needed. The reason is simple: if you think of the network as a decision tree, where each comparison is a branch, then D comparisons give 2^D leaves. This must be at least as large as N!, where N is the number of elements in the list. For example, with N=4 there are 4! = 24 permutations, so at least 5 comparisons are needed because 2^4 = 16 < 24 <= 32 = 2^5. But N! for N>2 is never a power of 2, so there will always be some unused leaves.

I would like to minimize the number of comparisons. I would also like to avoid modifying the array in place by actually sorting it.

The key realization is that there's no need to sort in order to determine the parity. For example, if there are only two elements in the list, then the parity is as simple as testing

def two_element_parity(x):
  assert len(x) == 2
  return x[0] > x[1]

The three element parity is a bit harder to do by hand:

def three_element_parity(x):
  assert len(x) == 3
  if x[0] < x[1]:
    if x[1] < x[2]:
      return 0      # 1, 2, 3
    elif x[0] < x[2]:
      return 1      # 1, 3, 2
    else:
      return 0      # 2, 3, 1
  elif x[0] < x[2]:
    return 1        # 2, 1, 3
  elif x[1] < x[2]:
    return 0        # 3, 1, 2
  else:
    return 1        # 3, 2, 1
It's complicated enough that it took several attempts before it was correct. I had to fix it using the following test code, which uses parity_shell() as a reference because I'm confident that it gives the correct values. (A useful development technique is to write something that you know works, even if it's slow, so you can use it to test more complicated code which better fits your needs.)

The test code is:

def test_three_element_parity():
  for x in itertools.permutations([1,2,3]):
    p1 = parity_shell(x)
    p2 = three_element_parity(x)
    if p1 != p2:
      print("MISMATCH", x, p1, p2)
    else:
      print("Match", x, p1, p2)
which gives the output:
>>> test_three_element_parity()
Match (1, 2, 3) 0 0
Match (1, 3, 2) 1 1
Match (2, 1, 3) 1 1
Match (2, 3, 1) 0 0
Match (3, 1, 2) 0 0
Match (3, 2, 1) 1 1

A debugging technique

As I said, it took a couple of iterations to get correct code. Sometimes I wasn't sure which branch was used to get a 0 or 1, so during development I added a second field to each return value to serve as a tag. The code looked like:

def three_element_parity(x):
  assert len(x) == 3
  if x[0] < x[1]:
    if x[1] < x[2]:
      return 0,1      # 1, 2, 3
    elif x[0] < x[2]:
      return 1,2      # 1, 3, 2
    else:
      return 0,3      # 2, 3, 1
  …
which meant I could see which path was in error. Here's one of the debug outputs using that technique:
>>> test_three_element_parity()
MISMATCH (1, 2, 3) 0 (0, 0)
MISMATCH (1, 3, 2) 1 (1, 1)
MISMATCH (2, 1, 3) 1 (1, 3)
MISMATCH (2, 3, 1) 0 (0, 2)
MISMATCH (3, 1, 2) 0 (0, 4)
MISMATCH (3, 2, 1) 1 (1, 5)
Of course the "MISMATCH" label is now misleading and I need to compare things by eye, but with this few elements that's fine. For more complicated code I would modify the test code as well.

Brute force solution

The last time I worked on this problem I turned the sorting network for N=4 into a decision tree. With 5 swaps there are 2^5 = 32 terminal nodes, but only N! = 4! = 24 of them will be used. I pruned the rest by hand, which is possible with only 32 leaves.

I thought this time I would come up with some clever way to handle this, and pulled out Knuth's "The Art of Computer Programming" for a pointer, since it has a lot about optimal sorts and sorting networks. Oddly, "parity" wasn't in the index.

There's probably some interesting pattern I could use to figure out which code paths to use, but N is small, so I decided to brute force it.

I'll start with all of the possible permutations:

>>> permutations = list(itertools.permutations(range(3)))
>>> permutations
[(0, 1, 2), (0, 2, 1), (1, 0, 2), (1, 2, 0), (2, 0, 1), (2, 1, 0)]
I want to build a decision tree where each leaf contains only one permutation. Each decision will be made by choosing two indices to use for the comparison test. I'll go through the permutations: if a permutation's values at those indices are in order then it goes into the "lt_permutations" list ("lt" is short for "less than"), otherwise it goes into the "gt_permutations" list.

For now, I'll assume the first pair of indices to swap is (0, 1):

>>> lt_permutations = []
>>> gt_permutations = []
>>> for permutation in permutations:
...   if permutation[0] < permutation[1]:
...     lt_permutations.append(permutation)
...   else:
...     gt_permutations.append(permutation)
... 
>>> lt_permutations
[(0, 1, 2), (0, 2, 1), (1, 2, 0)]
>>> gt_permutations
[(1, 0, 2), (2, 0, 1), (2, 1, 0)]

I'll turn the above into a utility function:

def partition_permutations(i, j, permutations):
    lt_permutations = []
    gt_permutations = []
    for permutation in permutations:
        if permutation[i] < permutation[j]:
            lt_permutations.append(permutation)
        else:
            gt_permutations.append(permutation)
    return lt_permutations, gt_permutations
and use it to further partition the 'lt_permutations' on the pair (1, 2):
>>> lt_permutations2, gt_permutations2 = partition_permutations(1, 2, lt_permutations)
>>> lt_permutations2
[(0, 1, 2)]
>>> gt_permutations2
[(0, 2, 1), (1, 2, 0)]
The lt_permutations2 list contains one element, so this time I'll partition gt_permutations2 using the swap index pair (0, 2):
>>> lt_permutations3, gt_permutations3 = partition_permutations(0, 2, gt_permutations2)
>>> lt_permutations3
[(0, 2, 1)]
>>> gt_permutations3
[(1, 2, 0)]
Each partitioning corresponds to additional if-statements until there is only one element in the branch. I want to use the above information to make a decision tree which looks like:
def parity3(data):
  if data[0] < data[1]:
    if data[1] < data[2];
      return 0 # parity of (0, 1, 2)
    else:
      if data[0] < data[2]:
        return 1 # parity of (0, 2, 1)
      else:
        return 0 # parity of (1, 2, 0)
  ...

Partition scoring

In the previous section I partioned using the successive pairs (0, 1), (1, 2) and (0, 2). These are pretty obvious. What should I use for N=4 or higher? In truth, I could likely use same swap pairs as from the sorting network, but I decided to continue with brute force.

For N items there are N*(N-1)/2 possible swap pairs.

>>> n = 4
>>> swap_pairs = [(i, j) for i in range(n-1) for j in range(i+1, n)]
>>> swap_pairs
[(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
I decided to pick the pair which best partitions the set of permutations in half. For each pair, I partition the given permutations and use the absolute value of the difference between the sizes of the "less than" and "greater than" subsets as the score (lower is better).
def get_partition_score(swap_pair, permutations):
    i, j = swap_pair
    num_lt = num_gt = 0
    for permutation in permutations:
        if permutation[i] < permutation[j]:
            num_lt += 1
        else:
            num_gt += 1
    return abs(num_lt - num_gt)
I'll create all permutations for 4 terms and score each of the pairs on the results:
>>> permutations = list(itertools.permutations(range(n)))
>>> len(permutations)
24
>>> permutations[:3]
[(0, 1, 2, 3), (0, 1, 3, 2), (0, 2, 1, 3)]
>>> for swap_pair in swap_pairs:
...   print(swap_pair, get_partition_score(swap_pair, permutations))
... 
(0, 1) 0
(0, 2) 0
(0, 3) 0
(1, 2) 0
(1, 3) 0
(2, 3) 0
Every pair scores 0, which means all of them partition the permutations equally well. I'll use the first, (0, 1), and partition into two sets of 12 each:
>>> lt_permutations, gt_permutations = partition_permutations(0, 1, permutations)
>>> lt_permutations
[(0, 1, 2, 3), (0, 1, 3, 2), (0, 2, 1, 3), (0, 2, 3, 1), (0, 3, 1, 2),
(0, 3, 2, 1), (1, 2, 0, 3), (1, 2, 3, 0), (1, 3, 0, 2), (1, 3, 2, 0),
(2, 3, 0, 1), (2, 3, 1, 0)]
>>> len(lt_permutations), len(gt_permutations)
(12, 12)
I'll redo the scoring with the "less than" permutations:
>>> permutations = lt_permutations
>>> for swap_pair in swap_pairs:
...   print(swap_pair, get_partition_score(swap_pair, permutations))
... 
(0, 1) 12
(0, 2) 4
(0, 3) 4
(1, 2) 4
(1, 3) 4
(2, 3) 0
Obviously it does no good to use (0, 1) again because all of these permutations are already ordered at those positions. Most of the other pairs are also partially ordered, so using them leads to an imbalanced 4-8 split, but (2, 3) gives a perfect split. I'll use it for the next partitioning, again select the "less-than" subset, and re-score:
>>> lt_permutations, gt_permutations = partition_permutations(2, 3, permutations)
>>> lt_permutations
[(0, 1, 2, 3), (0, 2, 1, 3), (0, 3, 1, 2), (1, 2, 0, 3), (1, 3, 0, 2), (2, 3, 0, 1)]
>>> gt_permutations
[(0, 1, 3, 2), (0, 2, 3, 1), (0, 3, 2, 1), (1, 2, 3, 0), (1, 3, 2, 0), (2, 3, 1, 0)]
>>> permutations = lt_permutations
>>> for swap_pair in swap_pairs:
...   print(swap_pair, get_partition_score(swap_pair, permutations))
... 
(0, 1) 6
(0, 2) 0
(0, 3) 4
(1, 2) 4
(1, 3) 0
(2, 3) 6

Repeat this process until only one permutation is left, and use the parity of that permutation as the return value.

Code generation

I'll combine the above code and put it into a program which generates Python code to compute the parity of a list of N distinct items. It uses recursion. The main entry point is "generate_parity_function()", which sets up the data for the recursive function "_generate_comparison()". That function identifies the best pair of indices to use for the comparison, then calls itself to process each side.

On the other hand, if there's only one permutation left in the list, then there's nothing more to do but compute the parity of that permutation and use it as the return value for that case.

import itertools

def parity_shell(values):
    # Simple Shell sort; while O(N^2), we only deal with at most 4 values 
    values = list(values)
    N = len(values)
    num_swaps = 0
    for i in range(N-1):
        for j in range(i+1, N):
            if values[i] > values[j]:
                values[i], values[j] = values[j], values[i]
                num_swaps += 1
    return num_swaps % 2

def get_partition_score(swap_pair, permutations):
    i, j = swap_pair
    num_lt = num_gt = 0
    for permutation in permutations:
        if permutation[i] < permutation[j]:
            num_lt += 1
        else:
            num_gt += 1
    return abs(num_lt - num_gt)

def partition_permutations(i, j, permutations):
    lt_permutations = []
    gt_permutations = []
    for permutation in permutations:
        if permutation[i] < permutation[j]:
            lt_permutations.append(permutation)
        else:
            gt_permutations.append(permutation)
    return lt_permutations, gt_permutations

def generate_parity_function(n):
    print("def parity{}(data):".format(n))
    permutations = list(itertools.permutations(range(n)))
    swap_pairs = [(i, j) for i in range(n-1) for j in range(i+1, n)]
    _generate_comparison(permutations, swap_pairs, "  ")

def _generate_comparison(permutations, swap_pairs, indent):
    if len(permutations) == 1:
        parity = parity_shell(permutations[0])
        print(indent + "return {} # {} ".format(parity, permutations[0]))
        return
    
    swap_pair = min(swap_pairs, key=lambda x: get_partition_score(x, permutations))
    # Delete the swap pair because it can't be used again.
    # (Not strictly needed as it will always have the worst score.)
    del swap_pairs[swap_pairs.index(swap_pair)]

    # I could have a case where the lt subset has 0 elements while the
    # gt subset has 1 element. Rather than have the 'if' block do nothing,
    # I'll swap the comparison indices and swap branches.
    i, j = swap_pair
    lt_permutations, gt_permutations = partition_permutations(i, j, permutations)
    if not lt_permutations:
        lt_permutations, gt_permutations = gt_permutations, lt_permutations
        i, j = j, i
    
    print(indent + "if data[{i}] < data[{j}]:".format(i=i, j=j))
    # Need to copy the swap_pairs because the 'else' case may reuse a pair.
    _generate_comparison(lt_permutations, swap_pairs[:], indent+"  ")
    if gt_permutations:
        print(indent + "else:")
        _generate_comparison(gt_permutations, swap_pairs, indent+"  ")
    

if __name__ == "__main__":
    import sys
    n = 4
    if sys.argv[1:]:
        n = int(sys.argv[1])
    generate_parity_function(n)

The output for n=2 elements is the expected trivial case:

def parity2(data):
  if data[0] < data[1]:
    return 0 # (0, 1) 
  else:
    return 1 # (1, 0) 
For n=3 it's a bit more complicated.
def parity3(data):
  if data[0] < data[1]:
    if data[0] < data[2]:
      if data[1] < data[2]:
        return 0 # (0, 1, 2) 
      else:
        return 1 # (0, 2, 1) 
    else:
      return 0 # (1, 2, 0) 
  else:
    if data[0] < data[2]:
      return 1 # (1, 0, 2) 
    else:
      if data[1] < data[2]:
        return 0 # (2, 0, 1) 
      else:
        return 1 # (2, 1, 0) 
and for n=4 elements, well, you can see why I wrote a program to help generate the function:
def parity4(data):
  if data[0] < data[1]:
    if data[2] < data[3]:
      if data[0] < data[2]:
        if data[1] < data[2]:
          return 0 # (0, 1, 2, 3) 
        else:
          if data[1] < data[3]:
            return 1 # (0, 2, 1, 3) 
          else:
            return 0 # (0, 3, 1, 2) 
      else:
        if data[0] < data[3]:
          if data[1] < data[3]:
            return 0 # (1, 2, 0, 3) 
          else:
            return 1 # (1, 3, 0, 2) 
        else:
          return 0 # (2, 3, 0, 1) 
    else:
      if data[0] < data[3]:
        if data[1] < data[2]:
          if data[1] < data[3]:
            return 1 # (0, 1, 3, 2) 
          else:
            return 0 # (0, 2, 3, 1) 
        else:
          return 1 # (0, 3, 2, 1) 
      else:
        if data[0] < data[2]:
          if data[1] < data[2]:
            return 1 # (1, 2, 3, 0) 
          else:
            return 0 # (1, 3, 2, 0) 
        else:
          return 1 # (2, 3, 1, 0) 
  else:
    if data[2] < data[3]:
      if data[0] < data[3]:
        if data[0] < data[2]:
          return 1 # (1, 0, 2, 3) 
        else:
          if data[1] < data[2]:
            return 0 # (2, 0, 1, 3) 
          else:
            return 1 # (2, 1, 0, 3) 
      else:
        if data[1] < data[2]:
          return 1 # (3, 0, 1, 2) 
        else:
          if data[1] < data[3]:
            return 0 # (3, 1, 0, 2) 
          else:
            return 1 # (3, 2, 0, 1) 
    else:
      if data[0] < data[2]:
        if data[0] < data[3]:
          return 0 # (1, 0, 3, 2) 
        else:
          if data[1] < data[3]:
            return 1 # (2, 0, 3, 1) 
          else:
            return 0 # (2, 1, 3, 0) 
      else:
        if data[1] < data[2]:
          if data[1] < data[3]:
            return 0 # (3, 0, 2, 1) 
          else:
            return 1 # (3, 1, 2, 0) 
        else:
          return 0 # (3, 2, 1, 0) 

The test code is essentially the same as "test_three_element_parity()", so I won't include it here.

Evaluation

I don't think it makes much sense to use this function beyond n=5 because there's so much code. Here's a table of the number of lines of code it generates for different values of n:

# elements   # lines
----------   -------
    2              5
    3             17
    4             71
    5            349
    6          2,159
    7         15,119
    8        120,959
    9      1,088,662
This appears to be roughly factorial growth (each row is roughly n times the previous one), which is what it should be. For my case, n=4, so 71 lines is not a problem.

I wrote some timing code which does 100,000 random selections from the possible permutations and compares the performance of the parityN() function with parity_shell(). To put them on a more even basis, I changed the parity_shell() implementation so it mutates the input values rather than making a temporary list. The timing code for parity5() looks like:

import itertools

def parity_shell(values):
    # Simple Shell sort; while O(N^2), we only deal with at most 4 values 
    #values = list(values)
    N = len(values)
    num_swaps = 0
    for i in range(N-1):
        for j in range(i+1, N):
            if values[i] > values[j]:
                values[i], values[j] = values[j], values[i]
                num_swaps += 1
    return num_swaps % 2


if __name__ == "__main__":
    import random
    import time
    permutations = list(itertools.permutations(range(5)))
    perms = [list(random.choice(permutations)) for i in range(100000)]
    t1 = time.time()
    p1 = [parity5(perm) for perm in perms]
    t2 = time.time()
    p2 = [parity_shell(perm) for perm in perms]
    t3 = time.time()
    if p1 != p2:
      print("oops")
    print("parity5:", t2-t1, "parity_shell:", t3-t2)
    print("ratio:", (t2-t1)/(t3-t2))
The decision tree version is consistently 5-6x faster than the Shell sort version across all the sizes I tested.

A performance improvement

By the way, I was able to raise the performance to 9x faster by switching to local variables rather than an array index each time. Here's the start of parity4() with that change:

def parity4(data):
  data0,data1,data2,data3 = data
  if data0 < data1:
    if data2 < data3:
      if data0 < data2:
        if data1 < data2:
          return 0 # (0, 1, 2, 3) 
        else:
          if data1 < data3:
            return 1 # (0, 2, 1, 3) 
          else:
            return 0 # (0, 3, 1, 2) 
      else:
        if data0 < data3:
          if data1 < data3:
            return 0 # (1, 2, 0, 3) 
          else:
            return 1 # (1, 3, 0, 2) 
        else:
          return 0 # (2, 3, 0, 1) 
    else:
      if data0 < data3:
        if data1 < data2:
          if data1 < data3:
            return 1 # (0, 1, 3, 2) 
          else:
            return 0 # (0, 2, 3, 1) 
        else:
          return 1 # (0, 3, 2, 1) 
      else:
        if data0 < data2:
          if data1 < data2:
            return 1 # (1, 2, 3, 0) 
          else:
            return 0 # (1, 3, 2, 0) 
        else:
          return 1 # (2, 3, 1, 0) 
  else:
    if data2 < data3:
      if data0 < data3:
        if data0 < data2:
          return 1 # (1, 0, 2, 3) 
        else:
          if data1 < data2:
            return 0 # (2, 0, 1, 3) 
          else:
            return 1 # (2, 1, 0, 3) 
      else:
        if data1 < data2:
          return 1 # (3, 0, 1, 2) 
        else:
          if data1 < data3:
            return 0 # (3, 1, 0, 2) 
          else:
            return 1 # (3, 2, 0, 1) 
    else:
      if data0 < data2:
        if data0 < data3:
          return 0 # (1, 0, 3, 2) 
        else:
          if data1 < data3:
            return 1 # (2, 0, 3, 1) 
          else:
            return 0 # (2, 1, 3, 0) 
      else:
        if data1 < data2:
          if data1 < data3:
            return 0 # (3, 0, 2, 1) 
          else:
            return 1 # (3, 1, 2, 0) 
        else:
          return 0 # (3, 2, 1, 0) 
It's easy to change the code so it generates this version instead, so I won't show it.

Are the if/else:s worthwhile?

This is written the day after I wrote the previous text. In the above, I minimized the number of comparisons, at the expense of a lot of code generation. But the sorting network doesn't add that much overhead, and it's clearly a lot easier to implement. The real test shouldn't be how much faster the if/else decision tree is than the Shell sort solution, but how much faster it is than the sorting network.

So I did that. Here's the N=4 and N=5 code, which is clearly simpler than the 71 and 349 lines of code from the decision tree:

def parity4_network(data):
    # N=4 Bose-Nelson sorting network from http://pages.ripco.net/~jgamble/nw.html 
    num_swaps = 0
    x0, x1, x2, x3 = data
    if x1 < x0:
        num_swaps += 1
        x0, x1 = x1, x0
  
    if x3 < x2:
        num_swaps += 1
        x2, x3 = x3, x2
  
    if x2 < x0:
        num_swaps += 1
        x0, x2 = x2, x0
  
    if x3 < x1:
        num_swaps += 1
        x1, x3 = x3, x1
  
    if x2 < x1:
        num_swaps += 1
        x1, x2 = x2, x1
  
    return num_swaps % 2
    
def parity5_network(data):
    # N=5 Bose-Nelson sorting network from http://pages.ripco.net/~jgamble/nw.html 
    num_swaps = 0
    x0, x1, x2, x3, x4 = data
    if x1 < x0:
        num_swaps += 1
        x0, x1 = x1, x0
    
    if x4 < x3:
        num_swaps += 1
        x3, x4 = x4, x3
    
    if x4 < x2:
        num_swaps += 1
        x2, x4 = x4, x2
    
    if x3 < x2:
        num_swaps += 1
        x2, x3 = x3, x2
    
    if x3 < x0:
        num_swaps += 1
        x0, x3 = x3, x0
    
    if x2 < x0:
        num_swaps += 1
        x0, x2 = x2, x0
    
    if x4 < x1:
        num_swaps += 1
        x1, x4 = x4, x1
    
    if x3 < x1:
        num_swaps += 1
        x1, x3 = x3, x1
    
    if x2 < x1:
        num_swaps += 1
        x1, x2 = x2, x1
    
    return num_swaps % 2
My timing tests say that the N=4 sorting network takes 1.4x the time of the decision tree with local variables, and the N=5 sorting network takes 1.7x the time. The decision tree is clearly faster.

If you are really interested in performance then you could push this into C or switch to PyPy. But sometimes you just need the code to be fast enough, which is why it can be worthwhile to explore different solutions and know their tradeoffs.



Fragment chiral molecules in RDKit using low-level functions #

In the previous essay, I showed that the simple fragmentation function doesn't preserve chirality after making a single cut. Here's the function definition:

from rdkit import Chem

# Only works correctly for achiral molecules
def fragment_simple(mol, atom1, atom2):
    rwmol = Chem.RWMol(mol)
    rwmol.RemoveBond(atom1, atom2)
    wildcard1 = rwmol.AddAtom(Chem.Atom(0))
    wildcard2 = rwmol.AddAtom(Chem.Atom(0))
    rwmol.AddBond(atom1, wildcard1, Chem.BondType.SINGLE) 
    rwmol.AddBond(atom2, wildcard2, Chem.BondType.SINGLE) 
    return rwmol.GetMol()
The reason is that the RemoveBond()/AddBond() combination can change the permutation order of the bonds around an atom, which inverts the chirality. Here's the relevant part of the connection table from the end of that essay:
  connections from atom 1 (as bond type + other atom index)
1 C -0 -2 -3 -5   original structure
1 C -0 -2 -5 -11  modified; bond to atom 3 is now a bond to atom 11
             ^^^--- was bond to atom 3
         ^^^^^^^--- these two bonds swapped position = inverted chirality

I'll now show how to improve the code to handle chirality. (Note: this essay is pedagogical. To fragment in RDKit use FragmentOnBonds().)

Parity of a permutation

There's no way from Python to go in and change the permutation order of RDKit's bond list for an atom. Instead, I need to detect if the permutation order has changed, and if so, un-invert the atom's chirality.

I say "un-invert" because we only need to deal with tetrahedral chirality, which has only two chirality types. SMILES supports more complicated chiralities, like octahedral (for example, "@OH19"), which can't be written simply as "@" or "@@". However, I've never seen them in use.

With only two possibilities, this reduces to determining the "parity" of the permutation. There are only two possible parities. I'll call one "even" and the other "odd", though in code I'll use 0 for even and 1 for odd.

A list of values in increasing order, like (1, 2, 9), has an even parity. If I swap two values then it has odd parity. Both (2, 1, 9) and (9, 2, 1) have odd parity, because each needs only one swap to put it in sorted order. With another swap, such as (2, 9, 1), the permutation order is back to even parity. The parity of a permutation is the number of pairwise swaps needed to order the list, modulo 2. If the result is 0 then it has even parity, if the result is 1 then it has odd parity.

One way to compute the parity is to sort the list and count the number of swaps needed. Since there will only be a handful of bonds, I can use a simple sort like the Shell sort:

def parity_shell(values):
    # Simple Shell sort; while O(N^2), we only deal with at most 4 values 
    values = list(values)
    N = len(values)
    num_swaps = 0
    for i in range(N-1):
        for j in range(i+1, N):
            if values[i] > values[j]:
                values[i], values[j] = values[j], values[i]
                num_swaps += 1
    return num_swaps % 2
(I have another essay where I show how to write a faster parity function for a fixed number of values.)

I'll test it with a few different cases to see if it gives the expected results:

>>> parity_shell( (1, 2, 9) )
0
>>> parity_shell( (2, 1, 9) )
1
>>> parity_shell( (2, 9, 1) )
0
>>> parity_shell( (2, 1, 9) )
1
>>> parity_shell( (1, 3, 9, 8) )
1
There are faster and better ways to determine the parity. I find it best to start with the most obviously correct solution first.

Determine an atom's parity

The next step is to determine the configuration order before and after attaching the dummy atom. I'll use the fragment_simple() and parity_shell() functions I defined earlier, and define a couple of helper functions to create an isomeric canonical SMILES from a molecule or SMILES string.

from rdkit import Chem

def C(mol): # Create a canonical isomeric SMILES from a molecule
  return Chem.MolToSmiles(mol, isomericSmiles=True)

def Canon(smiles): # Create a canonical isomeric SMILES from a SMILES string
  return C(Chem.MolFromSmiles(smiles))

The permutation order is based on which atoms are connected to a given atom. I'll parse a simple chiral structure (which is already in canonical form) and get the ids of the atoms bonded to the second atom. (The second atom has an index of 1.)

>>> mol = Chem.MolFromSmiles("O[C@](F)(Cl)Br")
>>> C(mol)
'O[C@](F)(Cl)Br'
>>> 
>>> atom_id = 1
>>> atom_obj = mol.GetAtomWithIdx(atom_id)
>>> other_atoms = [bond.GetOtherAtomIdx(atom_id) for bond in atom_obj.GetBonds()]
>>> other_atoms
[0, 2, 3, 4]
The list values are in order, so you won't be surprised it has a parity of 0 ("even"):
>>> parity_shell(other_atoms)
0

I'll use the fragment_simple() function to fragment between the oxygen and the chiral carbon:

>>> fragmented_mol = fragment_simple(mol, 0, 1)
>>> fragmented_smiles = C(fragmented_mol)
>>> fragmented_smiles
'[*]O.[*][C@@](F)(Cl)Br'
then use the convert_wildcards_to_closures() function from the previous essay to re-connect the fragments and produce a canonical SMILES from it:
>>> from smiles_syntax import convert_wildcards_to_closures
>>> 
>>> closure_smiles = convert_wildcards_to_closures(fragmented_smiles, (0, 0))
>>> closure_smiles
'O%90.[C@@]%90(F)(Cl)Br'
>>> 
>>> Canon(closure_smiles)
'O[C@@](F)(Cl)Br'

If you compare this to the canonicalized input SMILES you'll see the chirality is inverted from what it should be. I'll see if I can detect that from the list of neighbor atoms to the new atom 1 of the fragmented molecule:

>>> atom_id = 1
>>> atom_obj = fragmented_mol.GetAtomWithIdx(atom_id)
>>> other_atoms = [bond.GetOtherAtomIdx(atom_id) for bond in atom_obj.GetBonds()]
>>> other_atoms
[2, 3, 4, 6]
These values are ordered. It's tempting to conclude that this list also has an even parity. But recall that the original list was [0, 2, 3, 4]. The id 0 (the connection to the oxygen) has been replaced with the id 6 (the connection to the wildcard atom).

The permutation must use the same values, so I'll replace the 6 with a 0 and determine the parity of the resulting list:

>>> i = other_atoms.index(6)
>>> i
3
>>> other_atoms[i] = 0
>>> other_atoms
[2, 3, 4, 0]
>>> parity_shell(other_atoms)
1
This returned a 1 where the earlier parity call returned a 0, which means the parity is inverted, so I need to invert the chirality of the second atom:
>>> atom_obj.InvertChirality()

Now to check the re-assembled structure:

>>> fragmented_smiles = C(fragmented_mol)
>>> fragmented_smiles
'[*]O.[*][C@](F)(Cl)Br'
>>> 
>>> closure_smiles = convert_wildcards_to_closures(fragmented_smiles, (0, 0))
>>> closure_smiles
'O%90.[C@]%90(F)(Cl)Br'
>>> 
>>> Canon(closure_smiles)
'O[C@](F)(Cl)Br'
This matches the canonicalized input SMILES, so we're done.

An improved fragment function

I'll use a top-down process to describe the changes to fragment_simple() to make it work. What this doesn't show you is the several iterations I went through to make it look this nice.

At the top level, I need some code to record whether each end atom is chiral before making the cut, and, if it is, some way to restore the correct chirality once it's connected to the new wildcard atom.

def fragment_chiral(mol, atom1, atom2):
    rwmol = Chem.RWMol(mol)

    # Store the old parity as 0 = even, 1 = odd, or None for no parity 
    atom1_parity = get_bond_parity(mol, atom1)
    atom2_parity = get_bond_parity(mol, atom2)
    
    rwmol.RemoveBond(atom1, atom2)
    wildcard1 = rwmol.AddAtom(Chem.Atom(0))
    wildcard2 = rwmol.AddAtom(Chem.Atom(0))
    new_bond1 = rwmol.AddBond(atom1, wildcard1, Chem.BondType.SINGLE)
    new_bond2 = rwmol.AddBond(atom2, wildcard2, Chem.BondType.SINGLE)

    # Restore the correct parity if there is a parity
    if atom1_parity is not None:
        set_bond_parity(rwmol, atom1, atom1_parity, atom2, wildcard1)
    if atom2_parity is not None:
        set_bond_parity(rwmol, atom2, atom2_parity, atom1, wildcard2)

    # (Later I'll find I also need to call ClearComputedProps().)
    return rwmol.GetMol()

To get an atom's bond permutation parity, check if it has tetrahedral chirality (either clockwise or counter-clockwise). If it doesn't, return None. Otherwise use the neighboring atom ids to determine the parity:

from rdkit import Chem

CHI_TETRAHEDRAL_CW = Chem.ChiralType.CHI_TETRAHEDRAL_CW
CHI_TETRAHEDRAL_CCW = Chem.ChiralType.CHI_TETRAHEDRAL_CCW

def get_bond_parity(mol, atom_id):
    atom_obj = mol.GetAtomWithIdx(atom_id)
    
    # Return None unless it has tetrahedral chirality
    chiral_tag = atom_obj.GetChiralTag()
    if chiral_tag not in (CHI_TETRAHEDRAL_CW, CHI_TETRAHEDRAL_CCW):
        return None
    
    # Get the ids of the atoms it's bonded to.
    other_atom_ids = [bond.GetOtherAtomIdx(atom_id) for bond in atom_obj.GetBonds()]
    
    # Use those ids to determine the parity
    return parity_shell(other_atom_ids)

To restore the parity, again get the list of neighboring atom ids, but this time from the fragmented molecule. This will be connected to one of the new wildcard atoms. I need to map that back to the original atom index before I can compute the parity and, if it's changed, invert the chirality:

def set_bond_parity(mol, atom_id, old_parity, old_other_atom_id, new_other_atom_id):
    atom_obj = mol.GetAtomWithIdx(atom_id)
    # Get the ids of the atoms it's bonded to.
    other_atom_ids = [bond.GetOtherAtomIdx(atom_id) for bond in atom_obj.GetBonds()]

    # Replace id from the new wildcard atom with the id of the original atom
    i = other_atom_ids.index(new_other_atom_id)
    other_atom_ids[i] = old_other_atom_id

    # Use those ids to determine the parity
    new_parity = parity_shell(other_atom_ids)
    if old_parity != new_parity:
        # If the parity has changed, invert the chirality
        atom_obj.InvertChirality()

Testing

I used a simple set of tests during the initial development where I split the bond between the first and second atom of a few structures and compared the result to a reference structure that I fragmented manually (and re-canonicalized):

# Create a canonical isomeric SMILES from a SMILES string.
# Used to put the manually-developed reference structures into canonical form.
def Canon(smiles):
    mol = Chem.MolFromSmiles(smiles)
    assert mol is not None, smiles
    return Chem.MolToSmiles(mol, isomericSmiles=True)

def simple_test():
    for smiles, expected in (
            ("CC", Canon("*C.*C")),
            ("F[C@](Cl)(Br)O", Canon("*F.*[C@](Cl)(Br)O")),
            ("F[C@@](Cl)(Br)O", Canon("*F.*[C@@](Cl)(Br)O")),
            ("F[C@@H](Br)O", Canon("*F.*[C@@H](Br)O")),
            ):
        mol = Chem.MolFromSmiles(smiles)
        fragmented_mol = fragment_chiral(mol, 0, 1)
        fragmented_smiles = Chem.MolToSmiles(fragmented_mol, isomericSmiles=True)
        if fragmented_smiles != expected:
            print("smiles:", smiles)
            print("fragmented:", fragmented_smiles)
            print("  expected:", expected)
These tests passed, so I developed some more extensive tests. My experience is that real-world chemistry is far more complex and interesting than the manual test cases I develop. After the basic tests are done, I do more extensive testing by processing a large number of structures from PubChem and then from ChEMBL.

(Incidentally, I process PubChem first because those files are generated by one tool - OEChem I believe - while ChEMBL files are generated by several different tools, each with a different way to handle chemistry. This makes the chemistry in ChEMBL more diverse.)

Need ClearComputedProps()?

The more extensive test processes every structure which contains a chiral atom (where the SMILES contains a '@'), cuts every bond between heavy atoms, so long as it's a single bond not in a ring, and puts the results back together to see if it matches the canonicalized input structure. The code isn't interesting enough to make specific comments about it. You can get the code at the end of this essay.
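To make the failure reports below easier to follow, here's a minimal sketch of the round-trip check for a single cut. It uses fragment_chiral() from above and convert_wildcards_to_closures() from the earlier essay; the helper name, the SMILES, and the atom indices are only for illustration:

from rdkit import Chem
from smiles_syntax import convert_wildcards_to_closures

def check_one_cut(input_smiles, atom1, atom2):
    # The reassembled structure should match the canonicalized input
    mol = Chem.MolFromSmiles(input_smiles)
    expected_smiles = Chem.MolToSmiles(mol, isomericSmiles=True)

    # Cut the bond and get the two-fragment SMILES with wildcard atoms
    fragmented_mol = fragment_chiral(mol, atom1, atom2)
    fragmented_smiles = Chem.MolToSmiles(fragmented_mol, isomericSmiles=True)

    # Convert the "*" atoms into matching "%90" closures and reassemble
    closure_smiles = convert_wildcards_to_closures(fragmented_smiles, (0, 0))
    closure_mol = Chem.MolFromSmiles(closure_smiles)
    final_smiles = Chem.MolToSmiles(closure_mol, isomericSmiles=True)
    return final_smiles == expected_smiles

print(check_one_cut("O[C@](F)(Cl)Br", 0, 1))  # expect True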

The first error occurred quickly, and there were many more after it. Here's the first one:

FAILURE in record 94
     input_smiles: C[C@H]1CC[C@@H](NCc2ccc3c(c2)Cc2c(-c4ccc(CC(=O)O)cc4)n[nH]c2-3)CC1
  begin/end atoms: 4 5
fragmented smiles: [*]NCc1ccc2c(c1)Cc1c(n[nH]c1-2)-c1ccc(cc1)CC(=O)O.[*][C@H]1CC[C@H](C)CC1
   closure smiles: N%90Cc1ccc2c(c1)Cc1c(n[nH]c1-2)-c1ccc(cc1)CC(=O)O.[C@@H]%901CC[C@H](C)CC1
     final smiles: C[C@H]1CC[C@H](NCc2ccc3c(c2)Cc2c(-c4ccc(CC(=O)O)cc4)n[nH]c2-3)CC1
  expected smiles: C[C@H]1CC[C@@H](NCc2ccc3c(c2)Cc2c(-c4ccc(CC(=O)O)cc4)n[nH]c2-3)CC1
Greg Landrum, the main RDKit developer, points to the solution. He writes: "after you break one or more bonds, you really, really should re-sanitize the molecule (or at least call ClearComputedProps()".

The modified code is:

def fragment_chiral(mol, atom1, atom2):
    rwmol = Chem.RWMol(mol)
    
    atom1_parity = get_bond_parity(mol, atom1)
    atom2_parity = get_bond_parity(mol, atom2)
    
    rwmol.RemoveBond(atom1, atom2)
    wildcard1 = rwmol.AddAtom(Chem.Atom(0))
    wildcard2 = rwmol.AddAtom(Chem.Atom(0))
    new_bond1 = rwmol.AddBond(atom1, wildcard1, Chem.BondType.SINGLE)
    new_bond2 = rwmol.AddBond(atom2, wildcard2, Chem.BondType.SINGLE)
    
    if atom1_parity is not None:
        set_bond_parity(rwmol, atom1, atom1_parity, atom2, wildcard1)
    if atom2_parity is not None:
        set_bond_parity(rwmol, atom2, atom2_parity, atom1, wildcard2)

    # After breaking bonds, should re-sanitize, or at least use ClearComputedProps()
    # See https://github.com/rdkit/rdkit/issues/1022#issuecomment-239355482
    new_mol = rwmol.GetMol()
    # (I will find that I need to call SanitizeMol().)
    new_mol.ClearComputedProps()
    return new_mol

Ring chirality failures

There were 200,748 chiral structures in my selected subset from PubChem. Of those, 232 unique structures caused problems when I cut a bond. Here's an example of one of the reported failures:

FAILURE in record 12906
     input_smiles: [O-][S@+](CC1(O)CCN(CCc2c[nH]c3ccc(F)cc23)CC1)c1ccccc1
  begin/end atoms: 0 1
fragmented smiles: [*][O-].[*][S@+](CC1(O)CCN(CCc2c[nH]c3ccc(F)cc23)CC1)c1ccccc1
   closure smiles: [O-]%90.[S@+]%90(CC1(O)CCN(CCc2c[nH]c3ccc(F)cc23)CC1)c1ccccc1
     final smiles: [O-][S@@+](CC1(O)CCN(CCc2c[nH]c3ccc(F)cc23)CC1)c1ccccc1
  expected smiles: [O-][S@+](CC1(O)CCN(CCc2c[nH]c3ccc(F)cc23)CC1)c1ccccc1

Here's the complete list of failing structures, which might make a good test set for other programs:

C-N=C(-S)[C@]1(c2cccnc2)CCCC[S@+]1[O-] 1
C=C(c1ccc([S@+]([O-])c2ccc(OC)cc2)cc1)C1CCN(C2CCCCC2)CC1 2
C=C(c1ccc([S@@+]([O-])c2ccc(OC)cc2)cc1)C1CCN(C2CCCCC2)CC1 3
C=CCCCC[N@+]1(C)C2CCC1C[C@H](CC(C#N)(c1ccccc1)c1ccccc1)C2 4
C=CCCCC[N@@+]1(C)C2CCC1C[C@H](CC(C#N)(c1ccccc1)c1ccccc1)C2 5
C=CC[S@+]([O-])C[C@H](N)C(=O)O 6
CC#CCOc1ccc([S@@+]([O-])[C@H](C(=O)NO)C(C)C)cc1 7
CC(=O)NC[C@H]1CN(c2ccc([S@+](C)[O-])cc2)C(=O)O1 8
CC(=O)Nc1cc(-c2[nH]c([S@+](C)[O-])nc2-c2ccc(F)cc2)ccn1 9
CC(=O)Nc1cc(-c2c(-c3ccc(F)cc3)nc([S@+](C)[O-])n2C)ccn1 10
CC(C(=O)O[C@@H]1CC2CCC(C1)[N@]2C)(c1ccccc1)c1ccccc1 11
CC(C)(C(=O)N[C@H]1C2CCC1C[C@@H](C(=O)O)C2)N1CCN(c2ccc(C(F)(F)F)cn2)CC1 12
CC(C)(C(=O)N[C@H]1C2CCCC1C[C@@H](C(=O)O)C2)N1CCN(c2ccc(C(F)(F)F)cn2)CC1 13
CC(C)(C(=O)N[C@H]1C2CCCC1C[C@@H](C(N)=O)C2)N1CCN(c2ccc(C(F)(F)F)cn2)CC1 14
CC(C)(C(=O)N[C@H]1C2CCCC1C[C@H](C(N)=O)C2)N1CCN(c2ccc(C(F)(F)F)cn2)CC1 15
CC(C)(Oc1ccc(C#N)cn1)C(=O)N[C@H]1C2CCCC1C[C@H](C(N)=O)C2 16
CC(C)(Oc1ccc(C#N)cn1)C(=O)N[C@H]1C2COCC1C[C@H](C(N)=O)C2 17
CC(C)(Oc1ccc(Cl)cc1)C(=O)N[C@H]1C2CCCC1C[C@@H](C(=O)O)C2 18
CC(C)(Oc1ccc(Cl)cc1)C(=O)N[C@H]1C2CCCC1C[C@@H](C(N)=O)C2 19
CC(C)(Oc1ccc(Cl)cc1)C(=O)N[C@H]1C2CCCC1C[C@H](C(=O)O)C2 20
CC(C)(Oc1ccc(Cl)cc1)C(=O)N[C@H]1C2CCCC1C[C@H](C(N)=O)C2 21
CC(C)(Oc1ccc(Cl)cc1)C(=O)N[C@H]1C2COCC1C[C@@H](C(N)=O)C2 22
CC(C)Cc1c(C#N)cc2ccccc2c1C(=O)N(C)C[C@@H](CCN1CCC(c2ccccc2[S@+](C)[O-])CC1)c1ccc(Cl)c(Cl)c1 23
CC(C)Nc1cc(-c2[nH]c([S@+](C)[O-])nc2-c2ccc(F)cc2)ccn1 24
CC(C)Nc1cc(-c2[nH]c([S@@+]([O-])C(C)C)nc2-c2ccc(F)cc2)ccn1 25
CC(C)OC(=O)N1CCC(Oc2ncnc3c2CCN3c2ccc([S@+](C)[O-])c(F)c2)CC1 26
CC(C)OC(=O)N1CCC(Oc2ncnc3c2CCN3c2ccc([S@@+](C)[O-])c(F)c2)CC1 27
CC(C)[C@@H](C)Nc1cc(-c2[nH]c([S@+](C)[O-])nc2-c2ccc(F)cc2)ccn1 28
CC(C)[C@H](C)Nc1cc(-c2[nH]c([S@+](C)[O-])nc2-c2ccc(F)cc2)ccn1 29
CC(C)[S@+]([O-])c1nc(-c2ccc(F)cc2)c(-c2ccnc(NC3CCCCC3)c2)[nH]1 30
CC1(C)C(=O)N([C@H]2C3CCCC2C[C@@H](C(N)=O)C3)CC1COc1ccc(C#N)cn1 31
CC1(C)C(=O)N([C@H]2C3CCCC2C[C@H](C(N)=O)C3)CC1COc1ccc(C#N)cn1 32
CC1(C)N=C(c2ccccc2)C(=O)N1[C@@H]1C(=O)N2[C@@H]1[S@+]([O-])C(C)(C)[C@@H]2C(=O)O 33
CC1(C)NC(c2ccccc2)C(=O)N1[C@@H]1C(=O)N2[C@@H]1[S@+]([O-])C(C)(C)[C@@H]2C(=O)O 34
CC1(C)NC(c2ccccc2)C(=O)N1[C@@H]1C(=O)N2[C@@H]1[S@@+]([O-])C(C)(C)[C@@H]2C(=O)O 35
CCCCCCCCCCCCCO[C@@H](CO)[C@H](O)C[S@@+]1C[C@@H](O)[C@H](O)[C@H]1CO 36
CCCCCCCCC[S@@+]([O-])CC(=O)OC 37
CCCCCCCCC[S@@+]([O-])CC(N)=O 38
CCCCCCCC[N@+]1(C)C2CCC1C[C@H](CC(C#N)(c1ccccc1)c1ccccc1)C2 39
CCCCCC[C@@H](C(=O)NO)[S@+]([O-])c1ccc(OC)cc1 40
CCCCCC[C@@H](C(=O)NO)[S@@+]([O-])c1ccc(OC)cc1 41
CCCCCC[N@+]1(C)C2CCC1C[C@H](CC(C#N)(c1ccccc1)c1ccccc1)C2 42
CCCCOCCOc1ccc(-c2ccc3c(c2)C=C(C(=O)Nc2ccc([S@@+]([O-])Cc4cncn4CCC)cc2)CCCCCN3CC(C)C)cc1 43
CCCCOCCOc1ccc(-c2ccc3c(c2)C=C(C(=O)Nc2ccc([S@@+]([O-])Cc4cncn4CCC)cc2)CCCCCN3CCC)cc1 44
CCCCOCCOc1ccc(-c2ccc3c(c2)C=C(C(=O)Nc2ccc([S@@+]([O-])Cc4cncn4CCC)cc2)CCCCN3CC(C)C)cc1 45
CCCCOCCOc1ccc(-c2ccc3c(c2)C=C(C(=O)Nc2ccc([S@@+]([O-])Cc4cncn4CCC)cc2)CCCCN3CCC)cc1 46
CCCCOCCOc1ccc(-c2ccc3c(c2)C=C(C(=O)Nc2ccc([S@@+]([O-])Cc4cncn4CCC)cc2)CCCN3CC(C)C)cc1 47
CCCCOCCOc1ccc(-c2ccc3c(c2)C=C(C(=O)Nc2ccc([S@@+]([O-])Cc4cncn4CCC)cc2)CCCN3CC(C)C)cc1.CS(=O)(=O)O 48
CCCCOCCOc1ccc(-c2ccc3c(c2)C=C(C(=O)Nc2ccc([S@@+]([O-])Cc4cncn4CCC)cc2)CCCN3CCC)cc1 49
CCCCOCCOc1ccc(-c2ccc3c(c2)C=C(C(=O)Nc2ccc([S@@+]([O-])Cc4cncn4CCC)cc2)CCCN3Cc2cnn(C)c2)cc1 50
CCCCOCCOc1ccc(-c2ccc3c(c2)C=C(C(=O)Nc2ccc([S@@+]([O-])Cc4nncn4CCC)cc2)CCCN3CC(C)C)cc1 51
CCCCOCCOc1ccc(-c2ccc3c(c2)C=C(C(=O)Nc2ccc([S@@+]([O-])Cc4nncn4CCC)cc2)CCCN3CCC)cc1 52
CCCC[N@+]1(C)C2CCC1C[C@H](CC(C#N)(c1ccccc1)c1ccccc1)C2 53
CCC[C@H]1CC[C@H]([C@H]2CC[C@@H](OC(=O)[C@H]3[C@H](c4ccc(O)cc4)[C@H](C(=O)O[C@H]4CC[C@@H]([C@H]5CC[C@H](CCC)CC5)CC4)[C@H]3c3ccc(O)cc3)CC2)CC1 54
CCCc1c(C#N)cc2ccccc2c1C(=O)N(C)C[C@@H](CCN1CCC(c2ccccc2[S@+](C)[O-])CC1)c1ccc(Cl)c(Cl)c1 55
CCN1C(=O)c2ccccc2[C@H]1[C@@H](CCN1CCC(c2ccccc2[S@+](C)[O-])CC1)c1ccc(Cl)c(Cl)c1 56
CCN1c2cc(C(=O)NCc3ccc(C#N)cc3)ccc2[S@@+]([O-])c2ccccc2C1=O 57
CCN1c2cc(C(=O)NCc3ccc(F)cc3)ccc2[S@@+]([O-])c2ccccc2C1=O 58
CCN1c2cc(C(=O)NCc3ccc(OC)cc3)ccc2[S@+]([O-])c2ccccc2C1=O 59
CCN1c2cc(C(=O)NCc3ccc(OC)cc3)ccc2[S@@+]([O-])c2ccccc2C1=O 60
CCN1c2cc(C(=O)N[C@H](C)c3ccc(Br)cc3)ccc2[S@@+]([O-])c2ccccc2C1=O 61
CCN1c2cc(C(=O)N[C@H](C)c3ccc4ccccc4c3)ccc2[S@@+]([O-])c2ccccc2C1=O 62
CCO[C@@H](CO)[C@H](O)C[S@@+]1C[C@@H](O)[C@H](O)[C@H]1CO 63
CC[C@H](NC(=O)c1c2ccccc2nc(-c2ccccc2)c1C[S@+](C)[O-])c1ccccc1 64
CC[C@H](NC(=O)c1c2ccccc2nc(-c2ccccc2)c1[S@+](C)[O-])c1ccccc1 65
CC[S@+]([O-])c1ccc(-c2coc3ccc(-c4ccc(C)o4)cc32)cc1 66
CCc1c(C#N)cc2ccccc2c1C(=O)N(C)C[C@@H](CCN1CCC(c2ccccc2[S@+](C)[O-])CC1)c1ccc(Cl)c(Cl)c1 67
CN(C[C@@H](CCN1CCC(c2ccc(Br)cc2[S@+](C)[O-])CC1)c1ccc(Cl)c(Cl)c1)C(=O)c1cc(C#N)cc2ccccc21 68
CN(C[C@@H](CCN1CCC(c2ccc(F)cc2[S@+](C)[O-])CC1)c1ccc(Cl)c(Cl)c1)C(=O)c1cc(C#N)cc2ccccc21 69
CN(C[C@@H](CCN1CCC(c2ccccc2[S@+](C)[O-])CC1)c1ccc(Cl)c(Cl)c1)C(=O)c1c2ccccc2cc(C#N)c1-c1ccccc1 70
CN(C[C@@H](CCN1CCC(c2ccccc2[S@+](C)[O-])CC1)c1ccc(Cl)c(Cl)c1)C(=O)c1cc(C#N)c(Br)c2ccccc12 71
CN(C[C@@H](CCN1CCC(c2ccccc2[S@+](C)[O-])CC1)c1ccc(Cl)c(Cl)c1)C(=O)c1cc(C#N)c(C#N)c2ccccc12 72
CN(C[C@@H](CCN1CCC(c2ccccc2[S@+](C)[O-])CC1)c1ccc(Cl)c(Cl)c1)C(=O)c1cc(C#N)c(F)c2ccccc12 73
CN(C[C@@H](CCN1CCC(c2ccccc2[S@+](C)[O-])CC1)c1ccc(Cl)c(Cl)c1)C(=O)c1cc(C#N)cc2cc(C#N)ccc21 74
CN(C[C@@H](CCN1CCC(c2ccccc2[S@+](C)[O-])CC1)c1ccc(Cl)c(Cl)c1)C(=O)c1cc(C#N)cc2cc(O)ccc21 75
CN(C[C@@H](CCN1CCC(c2ccccc2[S@+](C)[O-])CC1)c1ccc(Cl)c(Cl)c1)C(=O)c1cc(C#N)cc2ccccc21 76
CN1c2cc(C(=O)NCCN3CCCC3)ccc2[S@@+]([O-])c2ccccc2C1=O 77
CN1c2cc(C(=O)NCCc3cccs3)ccc2[S@+]([O-])c2ccccc2C1=O 78
CN1c2cc(C(=O)NCCc3cccs3)ccc2[S@@+]([O-])c2ccccc2C1=O 79
CN1c2cc(C(=O)NCc3ccc(Br)cc3)ccc2[S@@+]([O-])c2ccccc2C1=O 80
CN1c2cc(C(=O)NCc3ccc(Cl)cc3)ccc2[S@@+]([O-])c2ccccc2C1=O 81
CN1c2cc(C(=O)NCc3cccnc3)ccc2[S@@+]([O-])c2ccccc2C1=O 82
COC(=O)[C@H]1CC2COCC(C1)[C@H]2NC(=O)[C@@]1(C)CCCN1S(=O)(=O)c1cccc(Cl)c1C 83
COC(=O)c1ccc2[nH]c([S@+]([O-])Cc3nccc(OC)c3OC)nc2c1 84
COC(=O)c1ccc2[nH]c([S@@+]([O-])Cc3nccc(OC)c3OC)nc2c1 85
COC1(C[S@@+]([O-])c2ccccc2)CCN(CCc2c[nH]c3ccc(F)cc23)CC1 86
COCCCOc1ccnc(C[S@+]([O-])c2nc3ccccc3[nH]2)c1C 87
CO[C@@H](CO)[C@H](O)C[S@@+]1C[C@@H](O)[C@H](O)[C@H]1CO 88
CO[C@@H]([C@H](O)[C@@H](O)CO)[C@H](O)C[S@@+]1C[C@@H](O)[C@H](O)[C@H]1CO 89
COc1c(C#N)cc2ccccc2c1C(=O)N(C)C[C@@H](CCN1CCC(c2ccc(Br)cc2[S@+](C)[O-])CC1)c1ccc(Cl)c(Cl)c1 90
COc1c(C#N)cc2ccccc2c1C(=O)N(C)C[C@@H](CCN1CCC(c2ccccc2[S@+](C)[O-])CC1)c1ccc(Cl)c(Cl)c1 91
COc1cc(-C=C-OC(=O)[C@H]2CC[C@@H](N(C)[C@H]3CC[C@H](C(=O)O-C=C-c4cc(OC)c(OC)c(OC)c4)CC3)CC2)cc(OC)c1OC 92
COc1cc(C(=O)N2CCO[C@@](CCN3CCC4(CC3)c3ccccc3C[S@+]4[O-])(c3ccc(Cl)c(Cl)c3)C2)cc(OC)c1OC 93
COc1cc(C(=O)N2CCO[C@@](CCN3CCC4(CC3)c3ccccc3C[S@@+]4[O-])(c3ccc(Cl)c(Cl)c3)C2)cc(OC)c1OC 94
COc1cc(C(=O)N2CCO[C@](CCN3CCC4(CC3)c3ccccc3C[S@+]4[O-])(c3ccc(Cl)c(Cl)c3)C2)cc(OC)c1OC 95
COc1cc(C(=O)N2CCO[C@](CCN3CCC4(CC3)c3ccccc3C[S@@+]4[O-])(c3ccc(Cl)c(Cl)c3)C2)cc(OC)c1OC 96
COc1cc(OC(=O)[C@@H]2CC[C@H](N(C)[C@@H]3CC[C@@H](C(=O)Oc4cc(OC)c(OC)c(OC)c4)CC3)CC2)cc(OC)c1OC 97
COc1ccc(C2CCN(CC[C@H](CN(C)C(=O)c3c4ccccc4cc(C#N)c3OC)c3ccc(Cl)c(Cl)c3)CC2)c([S@+](C)[O-])c1 98
COc1ccc(C2CCN(CC[C@H](CN(C)C(=O)c3cc(C#N)cc4ccccc43)c3ccc(Cl)c(Cl)c3)CC2)c([S@+](C)[O-])c1 99
COc1ccc(N(C(=O)OC(C)(C)C)[C@@H]2C(C)=C[C@H]([S@@+]([O-])c3ccc(C)cc3)[C@@H]3C(=O)N(C)C(=O)[C@@H]32)cc1 100
COc1ccc([C@@H]2C3(CO)C4[N@](C)C5C6(CO)C([N@@](C)C3C4(CO)[C@H]6c3ccc(OC)cc3)C52CO)cc1 101
COc1ccc([S@+]([O-])c2ccc(C(=O)C3CCN(C4CCCCC4)CC3)cc2)cc1 102
COc1ccc([S@+]([O-])c2ccc(C(C#N)C3CCN(C4CCCCC4)CC3)cc2)cc1 103
COc1ccc([S@+]([O-])c2ccc(C(C#N)N3CCN(C4CCCCC4)CC3)cc2)cc1 104
COc1ccc([S@@+]([O-])c2ccc(C(=O)C3CCN(C4CCCCC4)CC3)cc2)cc1 105
COc1ccc([S@@+]([O-])c2ccc(C(C#N)C3CCN(C4CCCCC4)CC3)cc2)cc1 106
COc1ccc([S@@+]([O-])c2ccc(C(C#N)N3CCN(C4CCCCC4)CC3)cc2)cc1 107
COc1ccc2c(cc(C#N)cc2C(=O)N(C)C[C@@H](CCN2CCC(c3ccccc3[S@+](C)[O-])CC2)c2ccc(Cl)c(Cl)c2)c1 108
COc1ccc2cc(-c3nc(-c4ccc([S@+](C)[O-])cc4C)[nH]c3-c3ccncc3)ccc2c1 109
COc1ccc2cc(-c3nc(-c4ccc([S@@+](C)[O-])cc4C)[nH]c3-c3ccncc3)ccc2c1 110
COc1ccc2nc([S@@+]([O-])Cc3ncc(C)c(OC)c3C)[nH]c2c1 111
COc1ccnc(C[S@@+]([O-])c2nc3ccc(OC(F)F)cc3[nH]2)c1OC 112
CSC[S@+]([O-])C[C@H](CO)NC(=O)-C=C-c1c(C)[nH]c(=O)[nH]c1=O 113
CSC[S@@+]([O-])CC(CO)NC(=O)-C=C-c1c(C)nc(O)nc1O 114
C[C@@H](Nc1cc(-c2[nH]c([S@+](C)[O-])nc2-c2ccc(F)cc2)ccn1)c1ccccc1 115
C[C@@H]1CC[C@H]2[C@@H](C)[C@@H](CCC(=O)Nc3cccc([S@+](C)[O-])c3)O[C@@H]3O[C@@]4(C)CC[C@@H]1[C@]32OO4 116
C[C@@H]1O[C@H]([C@@H]2CCC[N+]2(C)C)C[S@+]1[O-] 117
C[C@@H]1O[C@H]([C@@H]2CCC[N+]2(C)C)C[S@@+]1[O-] 118
C[C@@H]1O[C@H]([C@H]2CCC[N+]2(C)C)C[S@+]1[O-] 119
C[C@@H]1O[C@H]([C@H]2CCC[N+]2(C)C)C[S@@+]1[O-] 120
C[C@H](Nc1cc(-c2[nH]c([S@+](C)[O-])nc2-c2ccc(F)cc2)ccn1)c1ccccc1 121
C[C@H]1O[C@@H]([C@@H]2CCCN2C)C[S@+]1[O-] 122
C[C@H]1O[C@@H]([C@@H]2CCCN2C)C[S@@+]1[O-] 123
C[C@H]1O[C@@H]([C@@H]2CCC[N+]2(C)C)C[S@+]1[O-] 124
C[C@H]1O[C@@H]([C@@H]2CCC[N+]2(C)C)C[S@@+]1[O-] 125
C[N+](C)(C)C[C@@H]1C[S@+]([O-])C(C2CCCCC2)(C2CCCCC2)O1 126
C[N+](C)(C)C[C@@H]1C[S@@+]([O-])C(C2CCCCC2)(C2CCCCC2)O1 127
C[N@+]1([O-])[C@H]2CC[C@H]1C[C@H](OC(=O)[C@@H](CO)c1ccccc1)C2 128
C[N@@+]1(CC2CC2)C2CCC1C[C@@H](CC(C#N)(c1ccccc1)c1ccccc1)C2 129
C[N@@+]1(CC2CC2)C2CCC1C[C@H](CC(C#N)(c1ccccc1)c1ccccc1)C2 130
C[N@@+]1(CCCOCc2ccccc2)C2CCC1C[C@H](CC(C#N)(c1ccccc1)c1ccccc1)C2 131
C[N@@+]1(CCCOc2ccccc2)C2CCC1C[C@H](CC(C#N)(c1ccccc1)c1ccccc1)C2 132
C[N@@+]1(CCCc2ccccc2)C2CCC1C[C@H](CC(C#N)(c1ccccc1)c1ccccc1)C2 133
C[N@@+]1(CCOCc2ccccc2)C2CCC1C[C@H](CC(C#N)(c1ccccc1)c1ccccc1)C2 134
C[N@@+]1(CCOc2ccccc2)C2CCC1C[C@H](CC(C#N)(c1ccccc1)c1ccccc1)C2 135
C[S@+]([O-])CCCCN=C=S 136
C[S@+]([O-])CC[C@@](CO)(C(=O)O[C@H]1CN2CCC1CC2)c1ccccc1 137
C[S@+]([O-])C[C@@H]1[C@@H](O)[C@]23CC[C@H]1C[C@H]2[C@]1(C)CCC[C@](C)(C(=O)O)[C@H]1CC3 138
C[S@+]([O-])C[C@](C)(O)[C@H]1OC(=O)C=C2[C@@]13O[C@@H]3[C@H]1OC(=O)[C@@]3(C)C=CC[C@@]2(C)[C@@H]13 139
C[S@+]([O-])c1ccc(C2=C(c3ccccc3)C(=O)OC2)cc1 140
C[S@+]([O-])c1cccc2c3c(n(Cc4ccc(Cl)cc4)c21)[C@@H](CC(=O)O)CCC3 141
C[S@+]([O-])c1nc(-c2ccc(F)cc2)c(-c2ccnc(NC(=O)Cc3ccc(F)cc3)c2)[nH]1 142
C[S@+]([O-])c1nc(-c2ccc(F)cc2)c(-c2ccnc(NC3CCCCC3)c2)[nH]1 143
C[S@+]([O-])c1nc(-c2ccc(F)cc2)c(-c2ccnc(NC3CCOCC3)c2)[nH]1 144
C[S@@+]([O-])CCCC-C(=N-OS(=O)(=O)[O-])S[C@@H]1O[C@H](CO)[C@@H](O)[C@H](O)[C@H]1O 145
C[S@@+]([O-])CCCCN=C=S 146
C[S@@+]([O-])C[C@@H]1[C@@H](O)[C@]23CC[C@H]1C[C@H]2[C@]1(C)CCC[C@](C)(C(=O)O)[C@H]1CC3 147
C[S@@+]([O-])Cc1ccc(C(=O)Nc2cccnc2C(=O)NCC2CCOCC2)c2ccccc12 148
Cc1c(C#N)cc(C(=O)N(C)C[C@@H](CCN2CCC(c3ccccc3[S@+](C)[O-])CC2)c2ccc(Cl)c(Cl)c2)c2ccccc12 149
Cc1c(C#N)cc2ccccc2c1C(=O)N(C)C[C@@H](CCN1CCC(c2ccccc2[S@+](C)[O-])CC1)c1ccc(Cl)c(Cl)c1 150
Cc1c(OCC(F)(F)F)ccnc1C[S@@+]([O-])c1nc2ccccc2[nH]1 151
Cc1cc(=O)c(Oc2ccc(Br)cc2F)c(-c2ccc([S@@+](C)[O-])cc2)o1 152
Cc1cc(=O)c(Oc2ccc(Cl)cc2F)c(-c2ccc([S@@+](C)[O-])cc2)o1 153
Cc1cc(=O)c(Oc2ccc(F)cc2F)c(-c2ccc([S@+](C)[O-])cc2)o1 154
Cc1ccc(-c2ccc3occ(-c4ccc([S@+](C)[O-])cc4)c3c2)o1 155
Cc1ccc(-c2ncc(Cl)cc2-c2ccc([S@+](C)[O-])cc2)cn1 156
Cc1ccc([S@+]([O-])-C(F)=C-c2ccccn2)cc1 157
Cc1ccc([S@+]([O-])c2occc2C=O)cc1 158
Cc1nc(O)nc(O)c1-C=C-C(=O)N[C@@H](CO)C[S@+]([O-])CCl 159
Cc1nc(O)nc(O)c1-C=C-C(=O)N[C@@H](CO)C[S@@+]([O-])CCl 160
NC(=O)C[S@+]([O-])C(c1ccccc1)c1ccccc1 161
NC(=O)C[S@@+]([O-])C(c1ccccc1)c1ccccc1 162
O.O.O.O.[Sr+2].COc1ccc2[n-]c([S@@+]([O-])Cc3ncc(C)c(OC)c3C)nc2c1.COc1ccc2[n-]c([S@@+]([O-])Cc3ncc(C)c(OC)c3C)nc2c1 163
O=C(O)CC-C=C-CC[C@H]1[C@H](OCc2ccc(-c3ccccc3)cc2)C[S@+]([O-])[C@@H]1c1cccnc1 164
O=C(O)CC-C=C-CC[C@H]1[C@H](OCc2ccc(-c3ccccc3)cc2)C[S@@+]([O-])[C@@H]1c1cccnc1 165
O=S(=O)([O-])OCCC[S@@+]1C[C@@H](O)[C@H](O)[C@H]1CO 166
O=S(=O)([O-])OC[C@@H](O)C[S@@+]1C[C@@H](O)[C@H](O)[C@H]1CO 167
O=S(=O)([O-])OC[C@H](O)C[S@@+]1C[C@@H](O)[C@H](O)[C@H]1CO 168
O=S(=O)([O-])O[C@@H](CO)CC[S@@+]1C[C@@H](O)[C@H](O)[C@H]1CO 169
O=S(=O)([O-])O[C@@H](CO)[C@H](O)C[S@@+]1C[C@@H](O)[C@H](O)[C@H]1CO 170
O=S(=O)([O-])O[C@@H]([C@H](O)C[S@@+]1C[C@@H](O)[C@H](O)[C@H]1CO)[C@@H](O)[C@@H](O)[C@@H](O)CO 171
O=S(=O)([O-])O[C@@H]([C@H](O)C[S@@+]1C[C@@H](O)[C@H](O)[C@H]1CO)[C@@H](O)[C@@H](O)[C@H](O)CO 172
O=S(=O)([O-])O[C@@H]([C@H](O)C[S@@+]1C[C@@H](O)[C@H](O)[C@H]1CO)[C@@H](O)[C@H](O)[C@@H](O)CO 173
O=S(=O)([O-])O[C@@H]([C@H](O)C[S@@+]1C[C@@H](O)[C@H](O)[C@H]1CO)[C@@H](O)[C@H](O)[C@H](O)CO 174
O=S(=O)([O-])O[C@@H]([C@H](O)[C@@H](O)CO)[C@H](O)C[S@@+]1C[C@@H](O)[C@H](O)[C@H]1CO 175
O=S(=O)([O-])O[C@H](CO)[C@@H](O)C[S@@+]1C[C@@H](O)[C@H](O)[C@H]1CO 176
O=S(=O)([O-])O[C@H]([C@H](O)C[S@@+]1C[C@@H](O)[C@H](O)[C@H]1CO)[C@@H](O)[C@@H](O)[C@@H](O)CO 177
O=S(=O)([O-])O[C@H]([C@H](O)C[S@@+]1C[C@@H](O)[C@H](O)[C@H]1CO)[C@@H](O)[C@@H](O)[C@H](O)CO 178
O=S(=O)([O-])O[C@H]([C@H](O)C[S@@+]1C[C@@H](O)[C@H](O)[C@H]1CO)[C@@H](O)[C@H](O)[C@@H](O)CO 179
O=S(=O)([O-])O[C@H]([C@H](O)C[S@@+]1C[C@@H](O)[C@H](O)[C@H]1CO)[C@@H](O)[C@H](O)[C@H](O)CO 180
OCC12C3[N@](Cc4ccc(O)cc4)C4C5(CO)C([N@@](Cc6ccc(O)cc6)C1C3(CO)[C@H]5c1ccccc1)C4(CO)[C@H]2c1ccccc1 181
OCC12C3[N@](Cc4ccc(OCc5ccccc5)cc4)C4C5(CO)C([N@@](Cc6ccc(OCc7ccccc7)cc6)C1C3(CO)[C@@H]5c1ccccc1)C4(CO)[C@@H]2c1ccccc1 182
OCC12C3[N@](Cc4cccc(OCc5ccccc5)c4)C4C5(CO)C([N@@](Cc6cccc(OCc7ccccc7)c6)C1C3(CO)[C@H]5c1ccccc1)C4(CO)[C@H]2c1ccccc1 183
OCC12C3[N@](Cc4ccccc4)C4C5(CO)C([N@@](Cc6ccccc6)C1C3(CO)[C@H]5c1ccc(O)cc1)C4(CO)[C@H]2c1ccc(O)cc1 184
OCC12C3[N@](Cc4ccccc4)C4C5(CO)C([N@@](Cc6ccccc6)C1C3(CO)[C@H]5c1ccc(OCc3ccccc3)cc1)C4(CO)[C@H]2c1ccc(OCc2ccccc2)cc1 185
OCC12C3[N@](Cc4ccccc4)C4C5(CO)C([N@@](Cc6ccccc6)C1C3(CO)[C@H]5c1cccc(O)c1)C4(CO)[C@H]2c1cccc(O)c1 186
OCC12C3[N@](Cc4ccccc4)C4C5(CO)C([N@@](Cc6ccccc6)C1C3(CO)[C@H]5c1ccccc1)C4(CO)[C@H]2c1ccccc1 187
OCC12C3[N@](Cc4cccnc4)C4C5(CO)C([N@@](Cc6cccnc6)C1C3(CO)[C@H]5c1ccccc1)C4(CO)[C@H]2c1ccccc1 188
OCC12C3[N@](Cc4ccncc4)C4C5(CO)C([N@@](Cc6ccncc6)C1C3(CO)[C@H]5c1ccccc1)C4(CO)[C@H]2c1ccccc1 189
OC[C@@H](O)C[S@@+]1C[C@@H](O)[C@H](O)[C@H]1CO 190
OC[C@@H](O)[C@@H](O)[C@H](O)[C@H](O)C[S@@+]1C[C@@H](O)[C@H](O)[C@H]1CO 191
OC[C@H](O)C[S@@+]1C[C@@H](O)[C@H](O)[C@H]1CO 192
OC[C@H](O)[C@@H](O)[C@@H](O)[C@H](O)C[S@@+]1C[C@@H](O)[C@H](O)[C@H]1CO 193
OC[C@H](O)[C@@H](O)[C@@H](O)[C@H](O)[C@H](O)C[S@@+]1C[C@@H](O)[C@H](O)[C@H]1CO 194
OC[C@H](O)[C@@H](O)[C@H](O)[C@H](O)C[S@@+]1C[C@@H](O)[C@H](O)[C@H]1CO 195
OC[C@H](OCc1ccccc1)[C@H](O)C[S@@+]1C[C@@H](O)[C@H](O)[C@H]1CO.F[B-](F)(F)F 196
[Br-].C=CCCCC[N@+]1(C)C2CCC1C[C@H](CC(C#N)(c1ccccc1)c1ccccc1)C2 197
[Br-].CCCCCCCC[N@+]1(C)C2CCC1C[C@H](CC(C#N)(c1ccccc1)c1ccccc1)C2 198
[Br-].CCCCCC[N@+]1(C)C2CCC1C[C@H](CC(C#N)(c1ccccc1)c1ccccc1)C2 199
[Br-].CCCC[N@+]1(C)C2CCC1C[C@H](CC(C#N)(c1ccccc1)c1ccccc1)C2 200
[Br-].C[N@@+]1(CC2CC2)C2CCC1C[C@H](CC(C#N)(c1ccccc1)c1ccccc1)C2 201
[Br-].C[N@@+]1(CCCOCc2ccccc2)C2CCC1C[C@H](CC(C#N)(c1ccccc1)c1ccccc1)C2 202
[Br-].C[N@@+]1(CCCOc2ccccc2)C2CCC1C[C@H](CC(C#N)(c1ccccc1)c1ccccc1)C2 203
[Br-].C[N@@+]1(CCCc2ccccc2)C2CCC1C[C@H](CC(C#N)(c1ccccc1)c1ccccc1)C2 204
[Br-].C[N@@+]1(CCOc2ccccc2)C2CCC1C[C@H](CC(C#N)(c1ccccc1)c1ccccc1)C2 205
[Cl-].CCCCCCCCCCCCCO[C@@H](CO)[C@H](O)C[S@@+]1C[C@@H](O)[C@H](O)[C@H]1CO 206
[Cl-].CCO[C@@H](CO)[C@H](O)C[S@@+]1C[C@@H](O)[C@H](O)[C@H]1CO 207
[Cl-].CO[C@@H](CO)[C@H](O)C[S@@+]1C[C@@H](O)[C@H](O)[C@H]1CO 208
[Cl-].CO[C@@H]([C@H](O)[C@@H](O)CO)[C@H](O)C[S@@+]1C[C@@H](O)[C@H](O)[C@H]1CO 209
[Cl-].OC[C@@H](O)C[S@@+]1C[C@@H](O)[C@H](O)[C@H]1CO 210
[Cl-].OC[C@@H](O)[C@@H](O)[C@H](O)[C@H](O)C[S@@+]1C[C@@H](O)[C@H](O)[C@H]1CO 211
[Cl-].OC[C@H](O)C[S@@+]1C[C@@H](O)[C@H](O)[C@H]1CO 212
[Cl-].OC[C@H](O)[C@@H](O)[C@@H](O)[C@H](O)C[S@@+]1C[C@@H](O)[C@H](O)[C@H]1CO 213
[Cl-].OC[C@H](O)[C@@H](O)[C@H](O)[C@H](O)C[S@@+]1C[C@@H](O)[C@H](O)[C@H]1CO 214
[I-].C[C@@H]1O[C@H]([C@@H]2CCC[N+]2(C)C)C[S@+]1[O-] 215
[I-].C[C@@H]1O[C@H]([C@@H]2CCC[N+]2(C)C)C[S@@+]1[O-] 216
[I-].C[C@@H]1O[C@H]([C@H]2CCC[N+]2(C)C)C[S@+]1[O-] 217
[I-].C[C@@H]1O[C@H]([C@H]2CCC[N+]2(C)C)C[S@@+]1[O-] 218
[I-].C[C@H]1O[C@@H]([C@@H]2CCC[N+]2(C)C)C[S@+]1[O-] 219
[I-].C[C@H]1O[C@@H]([C@@H]2CCC[N+]2(C)C)C[S@@+]1[O-] 220
[I-].C[N+](C)(C)C[C@@H]1C[S@+]([O-])C(C2CCCCC2)(C2CCCCC2)O1 221
[I-].C[N+](C)(C)C[C@@H]1C[S@@+]([O-])C(C2CCCCC2)(C2CCCCC2)O1 222
[I-].C[N@@+]1(CCOCc2ccccc2)C2CCC1C[C@H](CC(C#N)(c1ccccc1)c1ccccc1)C2 223
[K+].COc1ccc2[n-]c([S@@+]([O-])Cc3ncc(C)c(OC)c3C)nc2c1 224
[Mg+2].COc1ccc2[n-]c([S@@+]([O-])Cc3ncc(C)c(OC)c3C)nc2c1.COc1ccc2[n-]c([S@@+]([O-])Cc3ncc(C)c(OC)c3C)nc2c1 225
[Na+].COc1ccc2[n-]c([S@@+]([O-])Cc3ncc(C)c(OC)c3C)nc2c1 226
[O-]CC[C@H](O)C[S@@+]1C[C@@H](O)[C@H](O)[C@H]1CO 227
[O-]C[C@@H](O)[C@@H](O)C[S@@+]1C[C@@H](O)[C@H](O)[C@H]1CO 228
[O-][C@@H](CO)[C@H](O)C[S@@+]1C[C@@H](O)[C@H](O)[C@H]1CO 229
[O-][S@+](CC1(O)CCN(CCc2c[nH]c3ccc(F)cc23)CC1)c1ccccc1 230
[O-][S@+](Cc1cc(OCC2CC2)ccn1)c1nc2cc(F)ccc2[nH]1 231
[O-][S@@+](Cc1cc(OCC2CC2)ccn1)c1nc2cc(F)ccc2[nH]1 232

I've been trying to make sense of the 232 failures. The following sections describe some observations.

RDKit bug reports

The results of my investigations led to three RDKit bug reports:

In the first, Greg identified that FastFindRings() isn't putting the two chiral atoms into the same primitive ring, so AssignStereochemistry() isn't seeing that this is an instance of ring stereochemistry.

In the second, Greg points to the August 2015 thread titled "Stereochemistry - Differences between RDKit Indigo" on the RDKit mailing list. He comments about nitrogen chirality:

There are two things going on here in the RDKit: 1) Ring stereochemistry 2) stereochemistry about nitrogen centers. Let's start with the second, because it's easier: RDKit does not generally "believe in" stereochemistry around three coordinate nitrogens. ... Back to the first: ring stereochemistry. ... The way the RDKit handles this is something of a hack: it doesn't identify those atoms as chiral centers, but it does preserve the chiral tags when generating a canonical SMILES:

Need SanitizeMol(), not ClearComputedProps()

As a workaround, he proposes that I sanitize the newly created molecule, so I replaced the call to ClearComputedProps() with one to SanitizeMol() near the end of fragment_chiral(), as shown here:

    # After breaking bonds, should re-sanitize or at least call
    # ClearComputedProps().
    # See https://github.com/rdkit/rdkit/issues/1022#issuecomment-239355482
    new_mol = rwmol.GetMol()
    #new_mol.ClearComputedProps()
    Chem.SanitizeMol(new_mol)
    return new_mol
With that in place, the number of records which fail my test drops from 232 to 195: all 181 of the chiral sulfurs still fail, 11 of the 34 chiral nitrogens still fail, all of the chiral carbon bridgeheads now pass, and the 3 remaining chiral carbons still fail.

(I also tested with both ClearComputedProps() and SanitizeMol(), but using both made no difference.)
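
If you want to reproduce a breakdown along those lines, here's a rough sketch of one way to tally the failing SMILES by the element of each chiral center. The filename "failures.smi" is made up for the example, and since a structure can contain more than one kind of chiral atom the counts won't exactly match the numbers above; it only gives the flavor:

import re
from collections import Counter

counts = Counter()
with open("failures.smi") as infile:   # hypothetical file holding the SMILES list above
    for line in infile:
        smiles = line.split()[0]
        # Pull out the element symbol of each chiral bracket atom,
        # e.g. "[S@@+]" -> "S", "[N@]" -> "N", "[C@@H]" -> "C"
        elements = set(re.findall(r"\[([A-Z][a-z]?)@", smiles))
        counts.update(elements)

for element, count in counts.most_common():
    print(element, count)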

While better, it's not substantially better. What's going on?

RDKit can produce non-canonical SMILES

At this point we're pushing the edge of what RDKit can handle. A few paragraphs ago I quoted Greg as saying that the ring stereochemistry handling is "something of a hack". I think that's the reason why, of the 232 records that cause a problem, 67 of them don't produce a stable SMILES string. That is, if I parse what should be a canonical SMILES string and re-canonicalize it, I get a different result. The canonicalization is bi-stable: re-canonicalization swaps between two possibilities, with a different chirality assignment each time.

Here's a reproducible example if you want to try it out yourself. (By the time you read this it's likely been fixed!)

from rdkit import Chem

def Canon(smiles):
    mol = Chem.MolFromSmiles(smiles)
    return Chem.MolToSmiles(mol, isomericSmiles=True)

def check_if_canonical(smiles):
    s1 = Canon(smiles)
    s2 = Canon(s1)
    if s1 != s2:
        print("Failed to canonicalize", smiles)
        print(" input:", smiles)
        print("canon1:", s1)
        print("canon2:", s2)
        print("canon3:", Canon(s2)) 
    else:
        print("Passed", smiles)
    
for smiles in (
    "O[C@H]1CC2CCC(C1)[N@@]2C",
    "C[C@]1(c2cccnc2)CCCC[S@+]1O",
    "[C@]1C[S@+]1O"):
    check_if_canonical(smiles)
The output from this is:
Failed to canonicalize O[C@H]1CC2CCC(C1)[N@@]2C
 input: O[C@H]1CC2CCC(C1)[N@@]2C
canon1: C[N@]1C2CCC1C[C@@H](O)C2
canon2: C[N@]1C2CCC1C[C@H](O)C2
canon3: C[N@]1C2CCC1C[C@@H](O)C2
Failed to canonicalize C[C@]1(c2cccnc2)CCCC[S@+]1O
 input: C[C@]1(c2cccnc2)CCCC[S@+]1O
canon1: C[C@]1(c2cccnc2)CCCC[S@@+]1O
canon2: C[C@]1(c2cccnc2)CCCC[S@+]1O
canon3: C[C@]1(c2cccnc2)CCCC[S@@+]1O
Failed to canonicalize [C@]1C[S@+]1O
 input: [C@]1C[S@+]1O
canon1: O[S@@+]1[C]C1
canon2: O[S@+]1[C]C1
canon3: O[S@@+]1[C]C1

SanitizeMol adds some overhead

The SanitizeMol() call is a workaround for a problem which is under investigation. I'll use SanitizeMol() for the rest of this essay, but bear in mind that it adds overhead. In one timing test I did, the version with ClearComputedProps() took 107 seconds while the version with SanitizeMol() took 255 seconds.

You'll have to decide for yourself if the small number of additional correct structures is worth the extra time. Hopefully the underlying problem will have been fixed by the time you read this essay, so you won't need a workaround at all.
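
If you want a rough feel for that overhead on your own machine, a micro-benchmark along these lines works. The fragment_only() helper, the SMILES, the bond choice, and the repeat count are all made up for this sketch; it strips the fragmentation down to just the bond surgery so the comparison isolates the cleanup step:

import time
from rdkit import Chem

def fragment_only(mol, atom1, atom2):
    # Just the bond surgery, without the chirality bookkeeping
    rwmol = Chem.RWMol(mol)
    rwmol.RemoveBond(atom1, atom2)
    wildcard1 = rwmol.AddAtom(Chem.Atom(0))
    wildcard2 = rwmol.AddAtom(Chem.Atom(0))
    rwmol.AddBond(atom1, wildcard1, Chem.BondType.SINGLE)
    rwmol.AddBond(atom2, wildcard2, Chem.BondType.SINGLE)
    return rwmol.GetMol()

mol = Chem.MolFromSmiles("C[C@H]1CC[C@@H](NCc2ccccc2)CC1")
N = 10000

t0 = time.time()
for _ in range(N):
    fragment_only(mol, 0, 1).ClearComputedProps()
t1 = time.time()
for _ in range(N):
    Chem.SanitizeMol(fragment_only(mol, 0, 1))
t2 = time.time()

print("ClearComputedProps: %.1f s" % (t1 - t0))
print("       SanitizeMol: %.1f s" % (t2 - t1))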

Bridgeheads

Many of the failures were due to chiral bridgehead atoms. I used the following two SMARTS to detect bridgeheads:

[depiction of two bridgehead topologies]
*~1~*~*(~*~*~2)~*~*~2~*~1
*~1~*~*(~*~*~*~2)~*~*~2~*~1
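
For example, checked against the tropane ester from the failure list (structure 11 above), the bicyclo[3.2.1] ring system should match one of the two patterns. A quick illustrative check, separate from the real test harness at the end of the essay:

from rdkit import Chem

bridgehead1_pat = Chem.MolFromSmarts("*~1~*~*(~*~*~2)~*~*~2~*~1")
bridgehead2_pat = Chem.MolFromSmarts("*~1~*~*(~*~*~*~2)~*~*~2~*~1")

# Structure 11 from the failure list; it contains a tropane-like
# bicyclo[3.2.1] ring system with a chiral bridgehead nitrogen.
mol = Chem.MolFromSmiles("CC(C(=O)O[C@@H]1CC2CCC(C1)[N@]2C)(c1ccccc1)c1ccccc1")
print(mol.HasSubstructMatch(bridgehead1_pat) or
      mol.HasSubstructMatch(bridgehead2_pat))   # should print True
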
Before I added the SanitizeMol() call, there were 34 chiral nitrogen structures which failed. Of those 34, only 11 are still failures after adding the SanitizeMol() call. Of those 11, one is a normal-looking bridgehead:
[a nitrogen bridgehead that has a bistable SMILES]

CC(C(=O)O[C@@H]1CC2CCC(C1)[N@]2C)(c1ccccc1)c1ccccc1
It's the only one of the simple nitrogen bridgehead structures which doesn't have a stable canonicalization. (I used the core bridgehead from this structure as the first test case in the previous section, where I showed a few bi-stable SMILES strings.)

The other 10 of the 11 nitrogen bridgehead failures have a more complex ring system, like:

[a complex chiral nitrogen ring system]
OCC12C3[N@](Cc4cccnc4)C4C5(CO)C([N@@](Cc6cccnc6)C1C3(CO)[C@H]5c1ccccc1)C4(CO)[C@H]2c1ccccc1
All of these have a bi-stable canonicalization.

I also looked at the chiral carbon bridgeheads which failed. Of the original 14, all 14 of them pass after I added the SanitizeMol() call.

The remaining structures

There are three chiral structures which fail even after sanitization, which do not contain a chiral nitrogen or chiral sulfur, and which do not contain a bridgehead. These are:

  
CCC[C@H]1CC[C@H]([C@H]2CC[C@@H](OC(=O)[C@H]3[C@H](c4ccc(O)cc4)[C@H](C(=O)O[C@H]4CC[C@@H]([C@H]5CC[C@H](CCC)CC5)CC4)[C@H]3c3ccc(O)cc3)CC2)CC1
COc1cc(-C=C-OC(=O)[C@H]2CC[C@@H](N(C)[C@H]3CC[C@H](C(=O)O-C=C-c4cc(OC)c(OC)c(OC)c4)CC3)CC2)cc(OC)c1OC
COc1cc(OC(=O)[C@@H]2CC[C@H](N(C)[C@@H]3CC[C@@H](C(=O)Oc4cc(OC)c(OC)c(OC)c4)CC3)CC2)cc(OC)c1OC
All three seem to involve the ring stereochemistry handling that Greg described as "something of a hack". I did not investigate further.

The final code

That was a lot of text. And a lot of work. If you made it this far, congratulations. Oddly enough, I still have more to write about on this topic.

I'll leave you with the final version of the code, with various tweaks and comments that I didn't discuss in the essay. As a bonus, it includes a commented-out implementation of fragment_chiral() based on RDKit's FragmentOnBonds(), which is the function you should actually use to fragment bonds.

# Cut an RDKit molecule on a specified bond, and replace the old terminals with wildcard atoms ("*").
# The code includes test suite which depends on an external SMILES file.
#
# This code is meant as a study of the low-level operations. For production use,
# see the commented out function which uses RDKit's built-in FragmentOnBonds().
#
# Written by Andrew Dalke <dalke@dalkescientific.com>.

from __future__ import print_function

from rdkit import Chem

# You can get a copy of this library from:
# http://www.dalkescientific.com/writings/diary/archive/2016/08/09/fragment_achiral_molecules.html#smiles_syntax.py
from smiles_syntax import convert_wildcards_to_closures


CHI_TETRAHEDRAL_CW = Chem.ChiralType.CHI_TETRAHEDRAL_CW
CHI_TETRAHEDRAL_CCW = Chem.ChiralType.CHI_TETRAHEDRAL_CCW

def parity_shell(values):
    # Count swaps using a simple O(N^2) exchange sort; there are at most 4 values
    values = list(values)
    N = len(values)
    num_swaps = 0
    for i in range(N-1):
        for j in range(i+1, N):
            if values[i] > values[j]:
                values[i], values[j] = values[j], values[i]
                num_swaps += 1
    return num_swaps % 2


def get_bond_parity(mol, atom_id):
    """Compute the parity of the atom's bond permutation

    Return None if it does not have tetrahedral chirality,
    0 for even parity, or 1 for odd parity.
    """
    atom_obj = mol.GetAtomWithIdx(atom_id)
    
    # Return None unless it has tetrahedral chirality
    chiral_tag = atom_obj.GetChiralTag()
    if chiral_tag not in (CHI_TETRAHEDRAL_CW, CHI_TETRAHEDRAL_CCW):
        return None
    
    # Get the list of atom ids for each atom it's bonded to.
    other_atom_ids = [bond.GetOtherAtomIdx(atom_id) for bond in atom_obj.GetBonds()]
    
    # Use those ids to determine the parity
    return parity_shell(other_atom_ids)


def set_bond_parity(mol, atom_id, old_parity, old_other_atom_id, new_other_atom_id):
    """Compute the new bond parity and flip chirality if needed to match the old parity"""
    
    atom_obj = mol.GetAtomWithIdx(atom_id)
    # Get the list of atom ids for each atom it's bonded to.
    other_atom_ids = [bond.GetOtherAtomIdx(atom_id) for bond in atom_obj.GetBonds()]
    
    # Replace id from the new wildcard atom with the id of the original atom
    i = other_atom_ids.index(new_other_atom_id)
    other_atom_ids[i] = old_other_atom_id
    
    # Use those ids to determine the parity
    new_parity = parity_shell(other_atom_ids)
    if old_parity != new_parity:
        # If the parity has changed, invert the chirality
        atom_obj.InvertChirality()

# You should really use the commented-out function below, which uses
# RDKit's own fragmentation code. Both do the same thing.

def fragment_chiral(mol, atom1, atom2):
    """Cut the bond between atom1 and atom2 and replace with connections to wildcard atoms

    Return the fragmented structure as a new molecule.
    """
    rwmol = Chem.RWMol(mol)
    
    atom1_parity = get_bond_parity(mol, atom1)
    atom2_parity = get_bond_parity(mol, atom2)
    
    rwmol.RemoveBond(atom1, atom2)
    wildcard1 = rwmol.AddAtom(Chem.Atom(0))
    wildcard2 = rwmol.AddAtom(Chem.Atom(0))
    new_bond1 = rwmol.AddBond(atom1, wildcard1, Chem.BondType.SINGLE)
    new_bond2 = rwmol.AddBond(atom2, wildcard2, Chem.BondType.SINGLE)
    
    if atom1_parity is not None:
        set_bond_parity(rwmol, atom1, atom1_parity, atom2, wildcard1)
    if atom2_parity is not None:
        set_bond_parity(rwmol, atom2, atom2_parity, atom1, wildcard2)
    
    # After breaking bonds you should re-sanitize (see
    # https://github.com/rdkit/rdkit/issues/1022#issuecomment-239355482)
    # or at least call ClearComputedProps(). I found that SanitizeMol()
    # improves the handling of chiral carbon bridgeheads, though using
    # it doubles the execution time. I use SanitizeMol() here; the
    # FragmentOnBonds()-based version below sticks with ClearComputedProps().
    new_mol = rwmol.GetMol()
    # If you don't sanitize then you must at least call ClearComputedProps()
    #new_mol.ClearComputedProps()
    Chem.SanitizeMol(new_mol)
    return new_mol

#### Use this code for production
## def fragment_chiral(mol, atom1, atom2):
##     bond = mol.GetBondBetweenAtoms(atom1, atom2)
##     new_mol = Chem.FragmentOnBonds(mol, [bond.GetIdx()], dummyLabels=[(0, 0)])
##     # FragmentOnBonds() calls ClearComputedProps() at the end.  There
##     # is a current bug report where, as a downstream effect, that may
##     # cause some chiralities to change, most notably on some
##     # bridgeheads. A workaround for now is to call SanitizeMol(),
##     # though that ends up tripling the time. I'll stay compatible
##     # with FragmentOnBonds() and not call it.
##     #Chem.SanitizeMol(new_mol)
##     return new_mol


##### ##### ##### ##### Test code ##### ##### ##### ##### #####

# Create a canonical isomeric SMILES from a SMILES string
# Used to put the manually-developed reference structures into canonical form.
def Canon(smiles):
    mol = Chem.MolFromSmiles(smiles)
    assert mol is not None, smiles
    return Chem.MolToSmiles(mol, isomericSmiles=True)
        
def simple_test():
    for smiles, expected in (
            ("CC", Canon("*C.*C")),
            ("F[C@](Cl)(Br)O", Canon("*F.*[C@](Cl)(Br)O")),
            ("F[C@@](Cl)(Br)O", Canon("*F.*[C@@](Cl)(Br)O")),
            ("F[C@@H](Br)O", Canon("*F.*[C@@H](Br)O")),
            ):
        mol = Chem.MolFromSmiles(smiles)
        fragmented_mol = fragment_chiral(mol, 0, 1)
        fragmented_smiles = Chem.MolToSmiles(fragmented_mol, isomericSmiles=True)
        if fragmented_smiles != expected:
            print("smiles:", smiles)
            print("fragmented:", fragmented_smiles)
            print("  expected:", expected)

# Match a single bond not in a ring
BOND_SMARTS = "[!#0;!#1]-!@[!#0;!#1]"
single_bond_pat = Chem.MolFromSmarts(BOND_SMARTS)


_bridgehead1_pat = Chem.MolFromSmarts("*~1~*~*(~*~*~*~2)~*~*~2~*~1")
_bridgehead2_pat = Chem.MolFromSmarts("*~1~*~*(~*~*~2)~*~*~2~*~1")

def is_bridgehead(mol):
    """Test if the molecule contains one of the bridgehead patterns"""
    return (mol.HasSubstructMatch(_bridgehead1_pat) or
            mol.HasSubstructMatch(_bridgehead2_pat))

def file_test():
    # Point this to a SMILES file to test
    filename = "/Users/dalke/databases/chembl_20_rdkit.smi"
    
    with open(filename) as infile:
        num_records = num_successes = num_failures = 0
        for lineno, line in enumerate(infile):
            # Give some progress feedback
            if lineno % 100 == 0:
                print("Processed", lineno, "lines and", num_records,
                      "records. Successes:", num_successes,
                      "Failures:", num_failures)
            
            # Only test structures with a chiral atom
            input_smiles = line.split()[0]
            if "@" not in input_smiles:
                continue
            
            # The code doesn't handle directional bonds. Convert them
            # to single bonds
            if "/" in input_smiles:
                input_smiles = input_smiles.replace("/", "-")
            if "\\" in input_smiles:
                input_smiles = input_smiles.replace("\\", "-")
            
            mol = Chem.MolFromSmiles(input_smiles)
            if mol is None:
                continue
            
            ### Uncomment/comment the "continue" lines as appropriate.
            ### As written, only structures containing a bridgehead are
            ### tested; swap the two "continue" lines to test the others.
            if is_bridgehead(mol):
                pass
                #continue
            else:
                pass
                continue
            num_records += 1
            
            # I expect the reassembled structure to match this canonical SMILES
            expected_smiles = Chem.MolToSmiles(mol, isomericSmiles=True)
            
            # Cut each of the non-ring single bonds between two heavy atoms
            matches = mol.GetSubstructMatches(single_bond_pat)
            has_failure = False
            for begin_atom, end_atom in matches:
                # Fragment
                fragmented_mol = fragment_chiral(mol, begin_atom, end_atom)
                fragmented_smiles = Chem.MolToSmiles(fragmented_mol, isomericSmiles=True)
                assert "." in fragmented_smiles, fragmented_smiles # safety check
                
                # Convert the "*"s to the correct "%90" closures
                closure_smiles = convert_wildcards_to_closures(fragmented_smiles, (0, 0))
                assert "%90" in closure_smiles, closure_smiles # safety check
                closure_mol = Chem.MolFromSmiles(closure_smiles)
                
                # canonicalize and compare; report any mismatches
                final_smiles = Chem.MolToSmiles(closure_mol, isomericSmiles=True)
                if final_smiles != expected_smiles:
                    print("FAILURE in record", num_records+1)
                    print("     input_smiles:", input_smiles)
                    print("  begin/end atoms:", begin_atom, end_atom)
                    print("fragmented smiles:", fragmented_smiles)
                    print("   closure smiles:", closure_smiles)
                    print("     final smiles:", final_smiles)
                    print("  expected smiles:", expected_smiles)
                    has_failure = True
            if has_failure:
                num_failures += 1
            else:
                num_successes += 1
                #print("SUCCESS", input_smiles)
        
        print("Done. Records:", num_records, "Successes:", num_successes,
              "Failures:", num_failures)
            
if __name__ == "__main__":
    simple_test()
    file_test()

Thanks!

Thanks to Greg Landrum both for RDKit and for help in tracking down some of the stubborn cases. Thanks also to the University of Hamburg for SMARTSViewer, which I use as a SMILES structure viewer so I don't have to worry about bond type, aromaticity, or chiral re-interpretations.

Edits

This essay has had a few post-publication edits. I added more emphasis that this essay exists for pedagogical reasons, described more clearly that SanitizeMol() is meant as a solution to work around a problem currently under investigation, and moved the ChEMBL SDF analysis code to its own post.



Copyright © 2001-2013 Andrew Dalke Scientific AB