
ChEMBL target sets association network

This is part 3 of a three-part series on generating fingerprint-based set similarities using chemfp. Read part 1 to see some of the ways to compare two fingerprint sets, and part 2 where I figure out how to use the ChEMBL bioactivity data.

I usually work with entity-based similarities. I have a molecule X and I want to find other molecules which are similar to it. Set-based similarities are a bit different. I think of them as comparing two objects by the clouds around them. Instead of comparing two proteins based on a more intrinsic property like their sequence or 3D structure, I might want to compare two proteins by the types of molecules which bind to them or affect them. This might reveal if two proteins have similar binding pockets or are involved in the same chemical pathway.

Before jumping into the nitty-gritty, I thought I would try a non-molecular example. Suppose you read Neal Stephenson's Cryptonomicon and enjoy it so much that you want to read more like it. "Like it" can mean many things: books that talk about technology, books which combine two different time periods, books with historical fiction taking place during the Second World War, books with many vignettes, and so on. For some, Foucault's Pendulum is like Cryptonomicon. And of course "like" can mean other books by the same author, or even by the same publishing house or imprint.

The "entity" in this case is a book. While there are many possible scoring functions, the end result is a book or list of books, likely ranked in preference order.

Suppose however you have read many books by Stephenson and want to find another author like him. Here too there are many ways to make a comparison. One is to use the book similarity function. For each author under consideration, compare all of that author's books to all of Stephenson's books, and come up with some aggregate scoring function to give the set similarity. Use that to figure out the most similar author.

If you repeat this many times you can create a network of authors, associated by similarity based on their books.
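
To make that concrete, here is a minimal Python sketch of the all-pairs aggregation idea. The pairwise scorer and the data are made up for illustration; only the shape of the computation matters, and it's essentially the same shape the fingerprint set comparisons later in this essay will use.

# A made-up book-to-book scorer: Jaccard similarity of the title words.
def book_similarity(book1, book2):
    words1, words2 = set(book1.lower().split()), set(book2.lower().split())
    return len(words1 & words2) / float(len(words1 | words2))

# Compare all of one author's books against all of the other author's books,
# then aggregate the pairwise scores, here by taking the mean.
def author_similarity(books1, books2):
    scores = [book_similarity(b1, b2) for b1 in books1 for b2 in books2]
    return sum(scores) / len(scores)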

IC50 activity sets from ChEMBL

Back into the world of molecules. I want to compare target proteins in human assays based on the set of molecules with an IC50 better than 1 micromolar against each target. I'll use the ChEMBL 21 bioactivity data to generate the data set.

The following SQL query is based on an example Iain Wallace sent me, adapted to use the SQLite console. I'll first turn on the timer and have it save the output to the file "chembl_sets.tsv", as tab-separated fields. The query looks for "single protein" targets in humans (tax_id=9606), where the assay activity is an IC50 better than 1000 nM. For each of those assays, get the target name and the ChEMBL id for the compound used in the assay.

% sqlite3 chembl_21.db
SQLite version 3.8.5 2014-08-15 22:37:57
Enter ".help" for usage hints.
sqlite> .timer on
sqlite> .mode tabs
sqlite> .output chembl_sets.tsv
sqlite> select distinct target_dictionary.pref_name, molecule_dictionary.chembl_id
   ...>   from target_dictionary, assays, activities, molecule_dictionary
   ...>  where target_dictionary.tax_id = 9606
   ...>    and target_dictionary.target_type = "SINGLE PROTEIN"
   ...>    and target_dictionary.tid = assays.tid
   ...>    and assays.assay_id = activities.assay_id
   ...>    and activities.published_type = "IC50"
   ...>    and activities.standard_units = "nM"
   ...>    and activities.standard_value < 1000
   ...>    and activities.molregno = molecule_dictionary.molregno;
Run Time: real 25.771 user 4.440337 sys 2.698867
sqlite> .quit
Most of the 26 seconds was likely spent reading from the hard disk. However, do note that this was after I did an ANALYZE on some of the tables in the database. Without the ANALYZE, I suspect the query would take a lot longer.
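
For reference, the ANALYZE step is just a few more console commands along these lines. The choice of tables here is illustrative; I don't recall exactly which ones I analyzed:

sqlite> ANALYZE assays;
sqlite> ANALYZE activities;
sqlite> ANALYZE target_dictionary;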

The SQL query produces the file "chembl_sets.tsv", where the first few lines for me look like:

Beta-1 adrenergic receptor      CHEMBL305153
Dopamine D4 receptor    CHEMBL303519
Endothelin-converting enzyme 1  CHEMBL415967
Neprilysin      CHEMBL415967
FK506-binding protein 1A        CHEMBL140442
Coagulation factor X    CHEMBL117716
Trypsin I       CHEMBL117716
Retinoid X receptor alpha       CHEMBL111217
Epidermal growth factor receptor erbB1  CHEMBL68920
Receptor protein-tyrosine kinase erbB-2 CHEMBL68920
Epidermal growth factor receptor erbB1  CHEMBL69960
Receptor protein-tyrosine kinase erbB-2 CHEMBL69960
Proteinase-activated receptor 1 CHEMBL330643
Tyrosine-protein kinase LCK     CHEMBL69638
Neuropeptide Y receptor type 5  CHEMBL75193
Gonadotropin-releasing hormone receptor CHEMBL65614
 ...
(That's a copy&paste of the terminal output, which doesn't preserve the tab characters.)

Compare two real sets

In an earlier essay I came up with an ad hoc comparison function I called "sss()":

from chemfp import search

def sss(arena1, arena2, threshold=0.8):
  results = search.threshold_tanimoto_search_arena(
                 arena1, arena2, threshold=threshold)
  # The scaling factor of 300 was chosen so that random ChEMBL
  # subsets have a score of about 0.01. It doesn't matter for
  # ranking as it's used as an arbitrary scaling factor.
  similarity = 300 * results.cumulative_score_all() / (len(arena1) * len(arena2))
  return similarity
What score does it give to two sets which are similar?

Load set data

The file "chembl_sets.tsv" which I just created in SQLite contains the set names and the compounds ids which are in the set. The file "chembl_21.rdkit2048.fpb" created in part 1 contains compound ids and fingerprints. I can combine the two to get the fingerprints for each compound id. The first step is to read the set information, which I do with the following function:

import collections

# Load a file with lines of the form <set_name> <tab> <compound_id>
# Example:  "DNA-dependent protein kinase\tCHEMBL104450\n"
# Return a dictionary mapping set name to a list of all of its ids
def load_set_members(filename):
    set_members_table = collections.defaultdict(list)
    with open(filename) as infile:
        for line in infile:
            set_name, chembl_id = line.rstrip("\n").split("\t")
            set_members_table[set_name].append(chembl_id)
    # Turn the defaultdict into a dict so that a lookup of
    # a name which doesn't exist raises an exception instead
    # of creating and returning an empty list.
    return dict(set_members_table)
I'll use that function to read the set file. I also looked through the list of target names and guessed that "Estrogen receptor alpha" and "Estrogen receptor beta" might be similar, so I'll use that as my initial test case:
>>> set_members_table = load_set_members("chembl_sets.tsv")
>>> len(set_members_table["Estrogen receptor alpha"])
912
>>> len(set_members_table["Estrogen receptor beta"])
1066
>>> len(set(set_members_table["Estrogen receptor alpha"])
...   & set(set_members_table["Estrogen receptor beta"]))
697

Load set fingerprints

Next I'll extract the fingerprints from the FPB file. The easiest way is to find the index for each of the compound ids and pass the list of indices to the copy() method.

>>> import chemfp
>>> chembl_21 = chemfp.load_fingerprints("chembl_21.rdkit2048.fpb")
>>> target_ids1 = set_members_table["Estrogen receptor alpha"]
>>> indices1 = [chembl_21.get_index_by_id(target_id) for target_id in target_ids1]
>>> target1 = chembl_21.copy(indices1)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "chemfp/arena.py", line 576, in copy
    new_indices.append(range_check[i])
TypeError: sequence index must be integer, not 'NoneType'
Well, that was unexpected. What happened? It looks like there are 8 None elements in the list of indices:
>>> indices1.count(None)
8
How could I have an activity for a compound, but not have the compound?

There can be a few reasons. Perhaps ChEMBL didn't include the structure data. Except it does. Perhaps RDKit couldn't parse the records. Except it could. The real clue came because Iain Wallace also sent me the dataset he generated with his sample SQL. There are 76 entries in my file which aren't in his, including 8 estrogen receptor alpha records:

% diff iain_sorted.tsv chembl_sets_sorted.tsv
5697a5698
> Acetylcholinesterase	CHEMBL2448138
7776a7778
> Adenosine A3 receptor	CHEMBL1386
9350a9353
> Alpha-2a adrenergic receptor	CHEMBL1366
9410a9414
> Alpha-2a adrenergic receptor	CHEMBL508338
9455a9460
> Alpha-2b adrenergic receptor	CHEMBL1366
9479a9485
  ...
54058a54091
> Epidermal growth factor receptor erbB1	CHEMBL1909064
58244a58278,58279
> Estrogen receptor alpha	CHEMBL219003
> Estrogen receptor alpha	CHEMBL219004
58246a58282
> Estrogen receptor alpha	CHEMBL219390
58247a58284
> Estrogen receptor alpha	CHEMBL219763
58506a58544
> Estrogen receptor alpha	CHEMBL373625
58555a58594
> Estrogen receptor alpha	CHEMBL385993
58556a58596,58597
> Estrogen receptor alpha	CHEMBL386948
> Estrogen receptor alpha	CHEMBL387335
65855a65897
> Glutathione reductase	CHEMBL2068507
65874a65917
 ...
He generated his data from one of the dump files for a server-based database, like MySQL. I generated my data from the SQLite dump file. My guess is that the SQLite file was generated slightly later, and includes a few records which were added during that time delta.

The reason my fingerprint file doesn't contain the entries is that the chembl_21.sdf file I used was also generated from the first snapshot, so doesn't include those new structures.

At least, that's my working theory. It's also pure coincidence that I happened to start with one of the few set names where this was a problem.

I'll write a function to skip ids when the id can't be found in the fingerprint arena:

def create_subset(arena, ids):
    indices = []
    for id in ids:
        idx = arena.get_index_by_id(id)
        if idx is not None:
          indices.append(idx)
    return arena.copy(indices=indices)

>>> target1 = create_subset(chembl_21, set_members_table["Estrogen receptor alpha"])
>>> target2 = create_subset(chembl_21, set_members_table["Estrogen receptor beta"])

Compare two fingerprint sets

With the two data sets in hand, it's a simple matter of calling the scoring function:

>>> sss(target1, target2)
4.986669863305914
I'll make a little helper function to compare two sets by name:
def compare(name1, name2):
    target1 = create_subset(chembl_21, set_members_table[name1])
    target2 = create_subset(chembl_21, set_members_table[name2])
    return sss(target1, target2)
and use it to compare a few other sets, judiciously chosen after I implemented the next section:
>>> compare("Estrogen receptor alpha", "Estrogen sulfotransferase")
6.107895939764545
>>> compare("Estrogen receptor alpha", "Vitamin D receptor")
6.107895939764545
>>> compare("Dopamine D5 receptor", "Histamine H1 receptor")
110.6279251170047
>>> compare("Dopamine D5 receptor", "Histamine H2 receptor")
110.6279251170047
>>> compare("Histamine H1 receptor", "Histamine H2 receptor")
24.037402406896796
>>> compare("NEDD8-activating enzyme E1 regulatory subunit", "RNase L")
100.0
A real chemist or biologist would need to tell me if these make sense.

Compare all sets

I can use the code from the previous section to generate the N*(N-1)/2 comparisons of every set to every other set. The general algorithm is to load each of the sets into its own object, which I'll call a "Subset" instance. A Subset has a "name" and an "arena", and I'll ignore empty subsets (like "Transient receptor potential cation channel subfamily M member 6"):

class Subset(object):
    def __init__(self, name, arena):
        self.name = name
        self.arena = arena

def load_subsets(arena, set_members_table):
    subsets = []
    for i, (name, ids) in enumerate(sorted(set_members_table.items())):
        set_arena = create_subset(arena, ids)
        if not set_arena:
            sys.stderr.write("No members: %s\n" % (name,))
            continue
        subsets.append(Subset(name, set_arena))
    return subsets
This returns N subsets. I'll iterate over the upper-triangle of the comparison matrix to generate all pair scores. If the score is greater than 1.0, I'll print the score and the two set names. Otherwise I won't report anything.

I also include some progress and run-time information to stderr, which makes the code a bit more complicated to read, but helps soothe the nerves of at least this observer.

Here's the full code, which I saved into the file "set_compare.py":

# set_compare.py
from __future__ import print_function, division

import sys
import collections
import chemfp
from chemfp import search
import time

# Load a file with lines of the form <set_name> <tab> <compound_id>
# Example:  "DNA-dependent protein kinase\tCHEMBL104450\n"
# Return a dictionary mapping set name to a list of all of its ids
def load_set_members(filename):
    set_members_table = collections.defaultdict(list)
    with open(filename) as infile:
        for line in infile:
            set_name, chembl_id = line.rstrip("\n").split("\t")
            set_members_table[set_name].append(chembl_id)
    # Turn the defaultdict into a dict so that a lookup of
    # a name which doesn't exist raises an exception instead
    # of creating and returning an empty list.
    return dict(set_members_table)


# Ad hoc scoring function using the sum of scores.
# Please don't use this for real work unless you've validated it.
def sss(arena1, arena2, threshold=0.8):
  results = search.threshold_tanimoto_search_arena(
                 arena1, arena2, threshold=threshold)
  # The scaling factor of 300 was chosen so that random ChEMBL
  # subsets have a score of about 0.01. It doesn't matter for
  # ranking as it's used as an arbitrary scaling factor.
  similarity = 300 * results.cumulative_score_all() / (len(arena1) * len(arena2))
  return similarity


# Make a subset arena of the given arena using the given ids
def create_subset(arena, ids):
    indices = []
    for id in ids:
        idx = arena.get_index_by_id(id)
        if idx is not None:
          indices.append(idx)
    return arena.copy(indices=indices)

class Subset(object):
    def __init__(self, name, arena):
        self.name = name
        self.arena = arena

def load_subsets(arena, set_members_table):
    subsets = []
    for i, (name, ids) in enumerate(sorted(set_members_table.items())):
        set_arena = create_subset(arena, ids)
        if not set_arena:
            sys.stderr.write("No members: %s\n" % (name,))
            continue
        subsets.append(Subset(name, set_arena))
    return subsets

def main():
    chembl_21 = chemfp.load_fingerprints("chembl_21.rdkit2048.fpb")
    set_members_table = load_set_members("chembl_sets.tsv")
    start_time = time.time()
    subsets = load_subsets(chembl_21, set_members_table)
    load_time = time.time()

    N = len(subsets)
    for i in range(N-1):
        sys.stderr.write("\rProcessing %d/%d" % (i, N-1))
        subset1 = subsets[i]
        for j in range(i+1, N):
            subset2 = subsets[j]
            score = sss(subset1.arena, subset2.arena)
            if score > 1.0:
                sys.stderr.write("\r                     \r")
                print("%.2f\t%s\t%s" % (score, subset1.name, subset2.name))
                sys.stderr.write("\rProcessing %d/%d" % (i, N-1))
        sys.stderr.write("\r                     \r")
    compare_time = time.time()
    
    print("load time:", load_time-start_time, file=sys.stderr)
    print("compare time:", compare_time-load_time, file=sys.stderr)

if __name__ == "__main__":
    main()

I ran it like this (it used 4 OpenMP threads):

% python set_compare.py > set_compare_output.txt
No members: Transient receptor potential cation channel subfamily M member 6
No members: Transient receptor potential cation channel subfamily V member 2
No members: Transient receptor potential cation channel subfamily V member 5
No members: Voltage-gated P/Q-type calcium channel alpha-1A subunit
load time: 4.57191491127
compare time: 491.902451038

That command found 5410 set comparisons. I'll show the 5 smallest, 5 largest, and values near the median and quartiles:

% wc -l set_compare_output.txt
    5410 set_compare_output.txt
% sort -n set_compare_output.txt | head -5
1.00	Apoptosis regulator Bcl-2	Membrane-associated guanylate kinase-related 3
1.00	Cyclin-dependent kinase 2	Testis-specific serine/threonine-protein kinase 2
1.00	Cytochrome P450 1B1	Tubulin beta-1 chain
1.00	Fructose-1,6-bisphosphatase	Receptor tyrosine-protein kinase erbB-3
1.00	Glucagon receptor	Serine/threonine-protein kinase GAK
% awk 'NR==int(5410*1/4)' set_compare_output.txt  # first quartile
1.55	Cyclin-dependent kinase 4	Cyclin-dependent kinase-like 1
% awk 'NR==5410/2' set_compare_output.txt  # median
1.24	Glyceraldehyde-3-phosphate dehydrogenase liver	Histone-lysine N-methyltransferase, H3 lysine-9 specific 3
% awk 'NR==int(5410*3/4)' set_compare_output.txt  # third quartile
7.62	Neuropeptide Y receptor type 1	Neuropeptide Y receptor type 2
% sort -n set_compare_output.txt | tail -5
300.00	Serine palmitoyltransferase 1	Serine palmitoyltransferase 2
300.00	Serine/threonine-protein kinase 24	Serine/threonine-protein kinase MST1
300.00	Sodium channel protein type I alpha subunit	Sodium channel protein type XI alpha subunit 
300.00	Ubiquitin carboxyl-terminal hydrolase 10	Ubiquitin carboxyl-terminal hydrolase 13
300.00	Xaa-Pro aminopeptidase 2	Xaa-Pro dipeptidase
The largest possible score with my similarity metric is 300.0, which happens when every fingerprint in one set matches every fingerprint in the other with a Tanimoto score of 1.0; in that case cumulative_score_all() equals len(arena1)*len(arena2) and the score reduces to the scaling factor. These last few lines indicate essentially perfect matches between the two sets.

Pre-compile sets (advanced topic)

It took 4-5 seconds to load the dataset. This was less than 1% of the overall run-time, so optimizing it is usually not worthwhile. However, suppose you want to write a set similarity web service. You'll often end up reloading the server during development, and the 4-5 second wait each time will become annoying.

One possibility is to create an FPB file for each of the subsets. The problem is that there are over 1,400 sets. By design, chemfp memory-maps each FPB file it opens. Each memory map uses a file descriptor, and many OSes limit the number of file descriptors a process may use. On my Mac, the default resource limit ("rlimit") is 256, though that can be increased.
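
As an aside, in a bash or zsh shell you can check and raise the per-process limit with ulimit. This is only a sketch; the 256 is the default I just mentioned, and your system may behave differently:

% ulimit -n
256
% ulimit -n 4096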

The way I usually solve this in chemfp is to store all of the subsets sequentially in a single FPB file, and have a ".ranges" file which specifies the start/end range for each set. By default the FPB file reorders the fingerprints by popcount, so the fingerprints with 0 on-bits come first, then those with 1 on-bit, etc. When the fingerprints are ordered that way, chemfp can build its sublinear search index.

I can disable reordering, so that the fingerprints are stored in the same order they were added to the FPB file. If I know that set A is between indices start and end then I can use arena[start:end] to get the subset.

Create an FPB file with the sets in input order

The following program reads the chembl_21.rdkit2048.fpb and chembl_sets.tsv files to compile a single FPB file named chembl_sets.fpb with the fingerprint sets, in order, and a range file named "chembl_sets.ranges" with the start/end indices of each set.

from __future__ import print_function, division

import collections
import chemfp

# Load a file with lines of the form <set_name> <tab> <compound_id>
# Example:  "DNA-dependent protein kinase\tCHEMBL104450\n"
# Return a dictionary mapping set name to a list of all of its ids
def load_set_members(filename):
    set_members_table = collections.defaultdict(list)
    with open(filename) as infile:
        for line in infile:
            set_name, chembl_id = line.rstrip("\n").split("\t")
            set_members_table[set_name].append(chembl_id)
    # Turn the defaultdict into a dict so that a lookup of
    # a name which doesn't exist raises an exception instead
    # of creating and returning an empty list.
    return dict(set_members_table)

# Return a list of arena indices for the given ids, skipping ids
# which aren't in the arena
def get_indices(arena, ids):
    indices = []
    for id in ids:
        idx = arena.get_index_by_id(id)
        if idx is None:
            continue
        indices.append(idx)
    return indices

def main():
    chembl_21 = chemfp.load_fingerprints("chembl_21.rdkit2048.fpb")
    set_members_table = load_set_members("chembl_sets.tsv")

    with open("chembl_sets.ranges", "w") as range_file:
        with chemfp.open_fingerprint_writer("chembl_sets.fpb", reorder=False,
                                            metadata=chembl_21.metadata) as writer:
            start_index = 0
            for name, chembl_ids in sorted(set_members_table.items()):
                indices = get_indices(chembl_21, chembl_ids)
                if not indices:
                    continue  # no ids found
                end_index = start_index + len(indices)
                
                range_file.write("%d\t%d\t%s\n" % (start_index, end_index, name))
                writer.write_fingerprints(chembl_21.copy(indices=indices))
                start_index = end_index

if __name__ == "__main__":
    main()
I ran it and generated the two files. The "chembl_sets.ranges" file starts with the following:
0	77	1-acylglycerol-3-phosphate O-acyltransferase beta
77	1834	11-beta-hydroxysteroid dehydrogenase 1
1834	1881	11-beta-hydroxysteroid dehydrogenase 2
1881	1885	14-3-3 protein gamma
1885	1992	15-hydroxyprostaglandin dehydrogenase [NAD+]
1992	2016	2-acylglycerol O-acyltransferase 2
2016	2019	25-hydroxyvitamin D-1 alpha hydroxylase, mitochondrial
2019	2025	26S proteasome non-ATPase regulatory subunit 14
2025	2027	3-beta-hydroxysteroid dehydrogenase/delta 5--<4-isomerase type I
2027	2030	3-keto-steroid reductase

If you've been following along then nothing in the code should be new, except this line:

                writer.write_fingerprints(chembl_21.copy(indices=indices))
I use the indices to make a new arena containing just those fingerprints. (By default these fingerprints will be in popcount order, which comes in useful in a bit.) I then pass the new arena to write_fingerprints(). This function takes an iterator of (id, fingerprint) pairs, which is what an arena yields when you iterate over it.

The end result is to save the selected ids and fingerprints to a file, in popcount order.
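
As a quick sanity check of that last point, here's a small sketch which iterates over a three-fingerprint copy of the chembl_21 arena from earlier. Each iteration step yields an (id, fingerprint) pair:

# Iterating over an arena yields (id, fingerprint) pairs, which is why
# an arena can be passed directly to write_fingerprints().
subset = chembl_21.copy(indices=[0, 1, 2])
for chembl_id, fp in subset:
    print("%s %d" % (chembl_id, len(fp)))  # fp is the fingerprint as a byte string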

Search using pre-compiled fingerprints

It's tempting to load the FPB file, get the set arena using the start/end from the ranges file, and do the comparison. This will work, but it will be slow. The FPB file is not in popcount order and has no popcount index. This means that chemfp's sublinear search optimization will not be used.

If the arena is not indexed by popcount then a slice like unordered_arena[start:end] will also not be indexed by popcount. Making a simple copy() doesn't help because copy() by default preserves the ordering. That is, the copy will be ordered if and only if the original arena is ordered.

Instead, I need to tell the copy() to always reorder, with:

unordered_arena[start:end].copy(reorder=True)

As a further optimization, if copy(reorder=True) notices that the input is already in popcount order then it skips the step to sort the fingerprints by popcount. That's the case for us, since the compiled chembl_sets.fpb file was created by passing ordered subset arenas to write_fingerprints(), and iterating through an ordered arena returns the (id, fingerprint) pairs in increasing popcount order.

I changed the previous "set_compare.py" program to use the compiled sets file. I call this new program "set_compare2.py". The new code is the function "load_subsets()", which reads the .ranges file, gets each subset from the compiled FPB file, and makes an ordered copy of it.

# set_compare2.py
from __future__ import print_function, division

import sys
import collections
import chemfp
from chemfp import search
import time

# Ad hoc scoring function using the sum of scores.
# Please don't use this for real work unless you've validated it.
def sss(arena1, arena2, threshold=0.8):
  results = search.threshold_tanimoto_search_arena(
                 arena1, arena2, threshold=threshold)
  # The scaling factor of 300 was chosen so that random ChEMBL
  # subsets have a score of about 0.01. It doesn't matter for
  # ranking as it's used as an arbitrary scaling factor.
  similarity = 300 * results.cumulative_score_all() / (len(arena1) * len(arena2))
  return similarity

class Subset(object):
    def __init__(self, name, arena):
        self.name = name
        self.arena = arena
        
def load_subsets(arena, filename):
    subsets = []
    with open(filename) as range_file:
        for line in range_file:
            start, end, name = line.rstrip("\n").split("\t")
            start = int(start)
            end = int(end)
            subset = Subset(name, arena[start:end].copy(reorder=True))
            subsets.append(subset)
    return subsets

def main():
    chembl_sets_arena = chemfp.load_fingerprints("chembl_sets.fpb")
    start_time = time.time()
    subsets = load_subsets(chembl_sets_arena, "chembl_sets.ranges")
    load_time = time.time()

    N = len(subsets)
    for i in range(N-1):
        sys.stderr.write("\rProcessing %d/%d" % (i, N-1))
        subset1 = subsets[i]
        for j in range(i+1, N):
            subset2 = subsets[j]
            score = sss(subset1.arena, subset2.arena)
            if score > 1.0:
                sys.stderr.write("\r                     \r")
                print("%.2f\t%s\t%s" % (score, subset1.name, subset2.name))
                sys.stderr.write("\rProcessing %d/%d" % (i, N-1))
        sys.stderr.write("\r                     \r")
    compare_time = time.time()
    
    print("load time:", load_time-start_time, file=sys.stderr)
    print("compare time:", compare_time-load_time, file=sys.stderr)

if __name__ == "__main__":
    main()

I ran "set_compare2.py" and compared the results to "set_compare.py". The output files were exactly the same, as expected. The biggest difference was the load time:

# times from set_compare.py
load time: 4.57191491127
compare time: 491.902451038

# times from set_compare2.py
load time: 0.801107883453
compare time: 451.236635923
That saves nearly 4 seconds of load time. The run time also looks faster, but I think that's due to the variability of my desktop, which also has a few Firefox windows and more than a few tabs open.

Z-score

Up until now I've been using an ad hoc scoring function with an arbitrary scaling factor chosen so that the background score is about 0.01 across the range of input set sizes. It has a few flaws: 1) it has a maximum score of 300, which is unusual, 2) no one will know how to interpret it without specific experience with it, and 3) the standard deviation is a function of the input set sizes.

One common technique to remove those flaws is to transform the score into a "z-score" or "standard score":

zscore = (score - background_score) / background_standard_deviation
The "background score" is about 0.01 for my scoring function, but the "background standard deviation" varies based on the input set sizes. I can determine it by generating enough (let's say 25) pairs of subsets containing randomly selecting fingerprints, and with the same sizes as the pair I'm interested in, then computing the standard deviation of all of those comparisons. With this approach my "300*" scaling factor is irrelevant as the numerator and denominator are equally scaled. The division by the product of the sizes also disappears, for the same reason.

You likely see the downside of this scoring function - each Z-score requires 25 additional Tanimoto searches!

There are a few ways to mitigate this. First, while I have 1,400+ sets, there are only 365 different set sizes (a quick way to check this is sketched after the next paragraph). I can reuse values for multiple comparisons with the same pair of sizes, which reduces the number of background tests I need to do by about a factor of 15. Second, random sets rarely have matches with a 0.8 similarity. The sublinear index should quickly reject those obvious mismatches. Third, I could cache the values to the file system, database, or other permanent storage and re-use them for future searches. It won't help the first time, but I usually end up with mistakes in my code, or something I want to tweak, and often re-run the code many times before it finally works.

Fourth, and not something I'll do in this essay (if at all), I could do some curve fitting and come up with an equation which does a reasonable job of interpolating or predicting values.
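
Here's the quick check of the first point, assuming the set_members_table from load_set_members() is still in memory:

# Count the number of distinct set sizes among all of the sets.
distinct_sizes = set(len(ids) for ids in set_members_table.values())
print("%d sets, %d distinct sizes" % (len(set_members_table), len(distinct_sizes)))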

(Oh, my. I was certainly optimistic by using 25 samples. After over an hour it had only processed 229 of 1408 elements. I killed the code, changed the number of samples to 10, and restarted it. I'm glad I put that cache in! If you do this for real, you might vary the number of samples as a function of the sizes. It's much harder getting a non-zero standard deviation for a 1x1 comparison than for a 1000x1000 comparison.)

The biggest change to this code is a new "ZScorer" class, which replaces the "sss" function. The main entry point is "compute_zscore()". That in turn calls "compute_raw_score()" to compute the cumulative score between two arenas, and "compute_background_values()", which computes the mean and standard deviation for the requested set sizes. The latter function also caches to an internal dictionary.

The ZScorer class also has a way to load and save the cache to a file. Before computing the set similarities I ask it to load() from the file named "zscore.cache". I then wrapped the main code in a try/finally block so I can save() the cache whether the code ran to completion or was interrupted by a ^C or a coding error.

I also modified the scoring threshold so the Z-score had to be at least 10 (that is, the difference from the average background must be at least 10 standard deviations).

To make the data a bit more useful, I also included information about the number of elements in each set. These are node properties, easily determined by getting the length of the corresponding arena. I also added the number of ids common to both sets (this is a new edge property). For this I turned to Python's native 'set' type. Each Subset instance makes a set of arena ids:

class Subset(object):
    def __init__(self, name, arena):
        self.name = name
        self.arena = arena
        self.id_set = set(arena.ids) # used to find intersection counts
so the number of elements in common is:
num_in_common = len(subset1.id_set & subset2.id_set)

As a final note before presenting the code, I called this program "set_zcompare.py". It's based on the original "set_compare.py" and does not use the pre-compiled FPB file of "set_compare2.py".

# I call this "set_zcompare.py"
from __future__ import print_function, division

import sys
import random
import collections
import time

import numpy as np

import chemfp
from chemfp import search

# Support both Python 2 and Python 3
try:
  xrange # Check if this is Python 2
except NameError:
  xrange = range # Use 'range' for Python 3
  
# Load a file with lines of the form <set_name> <tab> <compound_id>
# Example:  "DNA-dependent protein kinase\tCHEMBL104450\n"
# Return a dictionary mapping set name to a list of all of its ids
def load_set_members(filename):
    set_members_table = collections.defaultdict(list)
    with open(filename) as infile:
        for line in infile:
            set_name, chembl_id = line.rstrip("\n").split("\t")
            set_members_table[set_name].append(chembl_id)
    # Turn the defaultdict into a dict so that a lookup of
    # a name which doesn't exist raises an exception instead
    # of creating and returning an empty list.
    return dict(set_members_table)


# Z-score based on the sum of the similar scores.  Please don't use
# this scoring function for real work unless you've validated it.

def make_random_subarena(arena, n):
  indices = random.sample(xrange(len(arena)), n)
  return arena.copy(indices=indices)

class ZScorer(object):
    def __init__(self, arena, threshold=0.8, num_samples=25):
        self.arena = arena
        self.threshold = threshold
        self.num_samples = num_samples
        self.cached_values = {}

    def compute_zscore(self, arena1, arena2):
        # The main entry point
        score = self.compute_raw_score(arena1, arena2)
        mean, std = self.compute_background_values(len(arena1), len(arena2))
        if std == 0.0:
            return 0.0
        return (score - mean) / std

    def compute_raw_score(self, arena1, arena2):
        return search.threshold_tanimoto_search_arena(
            arena1, arena2, threshold=self.threshold).cumulative_score_all()

    def compute_background_values(self, i, j):
        # The scoring function is symmetric so normalize so i <= j
        if i > j:
            i, j = j, i
        # Check if it exists in the cache
        key = (i, j)
        try:
            return self.cached_values[key]
        except KeyError:
            pass

        # Does not already exist, so compute the mean and standard deviation
        scores = []
        for _ in range(self.num_samples):
            subarena1 = make_random_subarena(self.arena, i)
            subarena2 = make_random_subarena(self.arena, j)
            scores.append(self.compute_raw_score(subarena1, subarena2))

        mean, std = np.mean(scores), np.std(scores)
        values = (mean, std)
        self.cached_values[key] = values # cache the result and return it
        return values

    def load(self, filename):
        # Load values from filename into self.cached_values
        try:
            infile = open(filename)
        except IOError:
            sys.stderr.write("Warning: cache file %r does not exist\n" % (filename,))
            return

        with infile:
            for line in infile:
                terms = line.split()
                i, j, mean, std = int(terms[0]), int(terms[1]), float(terms[2]), float(terms[3])
                self.cached_values[(i, j)] = (mean, std)
            
    def save(self, filename):
        # Save values from self.cached_values into the named file
        with open(filename, "w") as outfile:
            for key, value in sorted(self.cached_values.items()):
                i, j = key
                mean, std = value
                outfile.write("%s %s %s %s\n" % (i, j, mean, std))
    
# Make a subset arena of the given arena using the given ids
def create_subset(arena, ids):
    indices = []
    for id in ids:
        idx = arena.get_index_by_id(id)
        if idx is not None:
          indices.append(idx)
    return arena.copy(indices=indices)

class Subset(object):
    def __init__(self, name, arena):
        self.name = name
        self.arena = arena
        self.id_set = set(arena.ids) # used to find intersection counts

def load_subsets(arena, set_members_table):
    subsets = []
    for i, (name, ids) in enumerate(sorted(set_members_table.items())):
        set_arena = create_subset(arena, ids)
        if not set_arena:
            sys.stderr.write("No members: %s\n" % (name,))
            continue
        subsets.append(Subset(name, set_arena))

    return subsets

                        

def main():
    chembl_21 = chemfp.load_fingerprints("chembl_21.rdkit2048.fpb")
    set_members_table = load_set_members("chembl_sets.tsv")
    zscorer = ZScorer(chembl_21, threshold=0.8, num_samples=10) # 25 took too much time
    zscorer.load("zscore.cache")
    try:
        start_time = time.time()
        subsets = load_subsets(chembl_21, set_members_table)
        load_time = time.time()

        print("name1\tsize1\tname2\tsize2\t#in_common\tzscore") # write a header
        N = len(subsets)
        for i in range(N-1):
            sys.stderr.write("\rProcessing %d/%d" % (i, N-1))
            subset1 = subsets[i]
            for j in range(i+1, N):
                subset2 = subsets[j]
                zscore = zscorer.compute_zscore(subset1.arena, subset2.arena)
                if zscore > 10.0:  # Use 10 standard deviations as my threshold of importance
                    sys.stderr.write("\r                     \r")
                    num_in_common = len(subset1.id_set & subset2.id_set)
                    print("%s\t%d\t%s\t%d\t%d\t%.2f" % (subset1.name, len(subset1.arena),
                                                        subset2.name, len(subset2.arena),
                                                        num_in_common, zscore))
                    sys.stderr.write("\rProcessing %d/%d" % (i, N-1))
            sys.stderr.write("\r                     \r")
        compare_time = time.time()
    finally:
        zscorer.save("zscore.cache")
    
    print("load time:", load_time-start_time, file=sys.stderr)
    print("compare time:", compare_time-load_time, file=sys.stderr)

if __name__ == "__main__":
    main()

I ran this and saved the 6351 lines of output to "chembl_target_network_zscore_10.tsv". The first few lines look like:
name1	size1	name2	size2	#in_common	zscore
1-acylglycerol-3-phosphate O-acyltransferase beta	77	TRAF2- and NCK-interacting kinase	16	0	23.56
11-beta-hydroxysteroid dehydrogenase 1	1757	11-beta-hydroxysteroid dehydrogenase 2	47	35	92.93
14-3-3 protein gamma	4	Androgen Receptor	847	0	13.05
14-3-3 protein gamma	4	Histone deacetylase 1	1598	0	12.41
14-3-3 protein gamma	4	Histone deacetylase 6	799	0	20.43
14-3-3 protein gamma	4	Histone deacetylase 8	396	0	33.06
15-hydroxyprostaglandin dehydrogenase [NAD+]	107	Aldehyde dehydrogenase 1A1	16	1	91.58
15-hydroxyprostaglandin dehydrogenase [NAD+]	107	Serine/threonine-protein kinase PIM1 	540	0	17.87
15-hydroxyprostaglandin dehydrogenase [NAD+]	107	Serine/threonine-protein kinase PIM2	346	0	17.48

Network Visualization

Code, code, everywhere, and not an image to see! How do I know that the above code works as I expected, much less gives useful information that a biochemist might be interested in?

Association network visualization is widely used in bioinformatics. With some pointers from Iain, I downloaded the Java-based Cytoscape 3.4 and used it to visualize the Z-score results from the previous section.

Actually, that network proved too complicated to visualize. I used awk to reduce it to 1450 node pairs with a Z-score of at least 100, in the file named chembl_target_network_zscore_100.tsv. I loaded the file into Cytoscape (File->Import->Network->File...), selected chembl_target_network_zscore_100.tsv, made sure the 'name1' column was the query, 'size1' a query attribute, 'name2' the target, 'size2' a target attribute, '#in_common' an integer edge attribute, and 'zscore' a floating point edge attribute. I tried different layouts, but the "preferred layout" seemed best. Here's a screenshot of one part:

[Screenshot: a portion of the ChEMBL target sets network with a Z-score of at least 100]

It looks reasonable, in that I know dopamine and serotonin are related. I'm neither a chemist nor biologist, nor do I play one on the internet. My goal was to show how you might use chemfp to do this sort of analysis. This image shows I'm in the right neighborhood, so it's good enough for me to say I'm done.


Andrew Dalke is an independent consultant focusing on software development for computational chemistry and biology. Need contract programming, help, or training? Contact me



Copyright © 2001-2013 Andrew Dalke Scientific AB