Using RDKit BulkTanimotoSimilarity

[ previous | newer ] /home/writings/diary/archive/2020/10/02/using_rdkit_bulktanimotosimilarity

Using RDKit BulkTanimotoSimilarity

This is part of a series of essays on how to write a similarity search program for the RDKit Morgan fingerprints distributed by ChEMBL.

In the first essay of this series I developed a program to do a Tanimoto similarity search of chembl_27.fps. I evaluated three different data types: an RDKit ExplicitBitVect, a GMP mpz integer, and a byte string. I found that the byte string was fastest, at about 5 seconds per search. All three programs worked by parsing and testing each FPS record in sequence. A scan search like this requires essentially constant memory overhead no matter the size of the FPS file.

In the third essay of this series I developed an in-memory Tanimoto search program with two stages. The first stage read all of the fingerprints into memory, and the second did the linear Tanimoto calculation and tests of all of the fingerprints. This program took 4-5 seconds to load the fingerprints, and about 0.5 seconds to search. (Yesterday's essay added the BitBound algorithm, which prunes some of the search space and is increasingly effective as the similarity threshold increases.)

In this essay I'll revisit the choice of data type and try an in-memory search using RDKit's ExplicitBitVect() and BulkTanimotoSimilarity(). I find that the load time is about 32 seconds and the search time is about 0.6 seconds per query.

If you really want high-performance cheminformatics fingerprint similarity search, try chemfp.

Loading ExplicitBitVect fingerprints

On Wednesday I ended up with chembl_in_memory_search.py. It is not hard to change the code so it uses an ExplicitBitVect instead of a byte string for the fingerprints. The result is at chembl_bulk_tanimoto.py.

The two differences are:

get_query_fp() returns the ExplicitBitVect() directly from GetMorganFingerprintAsBitVect() rather than convert it to a byte string.
read_fingerprints() uses DataStructs.CreateFromFPSText() to parse the hex-encoded fingerprints into an ExplicitBitVect(), instead of using bytes.fromhex() to create a byte string.

Load the data sets

I'm going to use BulkTanimotoSimilarity() to compute the similarity scores for each fingerprint. That takes an RDKit fingerprint as the query and a list (or tuple) of RDKit fingerprints as the targets. However, the reader returns iterator of (id, fingerprint) pairs. I'll need to convert that into two parallel lists; one with ids and one with fingerprints.

The obvious way to do that is something like:

target_ids = []
target_fingerprints = []
for target_id, target_fingerprints in fingerprint_reader:
    target_ids.append(target_id)
target_fingerprints.append(target_fingerprint)

However, there's an easy but non-obvious way to do that, using zip(*values).

zip(*values)

The zip() built-in function lets you iterate over zero or more lists in parallel. For example, if you have a list of identifiers and an associated list of SMILES then you might print them out together like this:

>>> id_list = ["CHEBI:18153", "CHEBI:18407", "CHEBI:27518", "CHEBI:33602"]
>>> smiles_list = ["C=C", "C#N", "C#C", "BBB"]
>>> for id, smiles in zip(id_list, smiles_list):
...     print(f"{id} is {smiles}")
...
CHEBI:18153 is C=C
CHEBI:18407 is C#N
CHEBI:27518 is C#C
CHEBI:33602 is BBB

The construct f(*args) is special syntax to unpack an argument list. It uses args[0] as the first argument, args[1] as the second, and so on. For example, suppose data contains a list of elements and I want to print the list out, with one element per line:

>>> data = [('CHEBI:18153', 'C=C'), ('CHEBI:18407', 'C#N'), ('CHEBI:27518', 'C#C'), ('CHEBI:33602', 'BBB')]
>>> print(*data, sep="\n")
('CHEBI:18153', 'C=C')
('CHEBI:18407', 'C#N')
('CHEBI:27518', 'C#C')
('CHEBI:33602', 'BBB')

The print(*data, sep="\n") call is identical to print(data[0], data[1], data[2], data[3], sep="\n"), which prints each element, with a newline between each one (and the default end="\n" printing the final newline.)

The zip() and the f(*args) work together to give the following:

>>> data = [('CHEBI:18153', 'C=C'), ('CHEBI:18407', 'C#N'), ('CHEBI:27518', 'C#C'), ('CHEBI:33602', 'BBB')]
>>> id_list, smiles_list = list(zip(*data))
>>> id_list
('CHEBI:18153', 'CHEBI:18407', 'CHEBI:27518', 'CHEBI:33602')
>>> smiles_list
('C=C', 'C#N', 'C#C', 'BBB')

The step-by-step transformations are:

list(zip(*data)) is the same as:
list(zip(data[0], data[1], data[2], data[3])) is the same as:
list(zip(('CHEBI:18153', 'C=C'), ('CHEBI:18407', 'C#N'), ('CHEBI:27518', 'C#C'), ('CHEBI:33602', 'BBB'))) is the same as:
[('CHEBI:18153', 'CHEBI:18407', 'CHEBI:27518', 'CHEBI:33602'), ('C=C', 'C#N', 'C#C', 'BBB')] (because zip yields all of first elements, then all of the second elements, and there are only two elements in each parameter).

As a result, id_list contains all of the first terms in the data elements, and smiles_list contains all of the second terms in the data elements.

For some people it may help to think of zip(*args) as working together to generate the transpose of args.

Actually loading the data sets

All that discussion was there to show how the old code, which read the ids and fingerprints into a list of (id, fingerprint) pairs using the following:

targets = list(read_fingerprints(open(args.targets)))

was transformed into reading a list of ids and a list of fingerprints by using the following:

target_ids, target_fingerprints = list(zip(*read_fingerprints(open(args.targets))))

Easy to write. Not obvious to understand the first few times you come across that idiom!

BulkTanimotoSimilarity()

RDKit's BulkTanimotoSimilarity() takes a query fingerprint and a list of target fingerprints, and returns a list of scores, one for each target fingerprint. It's straight-forward to adapt the original in-memory search program to use this bulk function:

# Compute the score with each of the targets.
scores = DataStructs.BulkTanimotoSimilarity(query_fp, target_fingerprints)
# Find the records with a high enough similarity.
for i, score in enumerate(scores):
    if score >= threshold:
        # Need to get the corresponding id
        target_id = target_ids[i]
        has_match = True
        print(f"{query_id}\t{target_id}\t{score:.3f}")

Since this function computes all of the Tanimoto scores, you could easily modify this code to sort it or use a priority queue and find the nearest k neighbors.

Also, RDKit implements many Bulk* functions (BulkAllBitSimilarity, BulkAsymmetricSimilarity, BulkBraunBlanquetSimilarity, BulkCosineSimilarity, BulkDiceSimilarity, BulkKulczynskiSimilarity, BulkMcConnaugheySimilarity, BulkOnBitSimilarity, BulkRogotGoldbergSimilarity, BulkRusselSimilarity, BulkSokalSimilarity, BulkTanimotoSimilarity, BulkTverskySimilarity) with the same function call API, so it's easy to modify the code to use one of these alterate comparison methods.

Timings

The result is chembl_bulk_tanimoto.py. I kept all of the timing code I developed for Wednesday's chembl_in_memory_search.py, which I'll enable so I can compare the two:

% python chembl_in_memory_search.py --queries wikipedia_100.smi --times > old.out
load: 5.27 search: 52.63 #queries: 100 #/sec: 1.90
% python chembl_bulk_tanimoto.py --queries wikipedia_100.smi --times > new.out
load: 32.92 search: 60.22 #queries: 100 #/sec: 1.66
% cmp old.out new.out

It's always a relief to see that a cross-comparison gives the same results!

I'm not surprised the RDKit search time is a bit slower than the raw byte string search code beause the RDKit implementation does more work, for examples, the BulkTanimotoSimilarity() function accepts ExplicitBitVect() and SparseBitVect() fingerprint types (the latter is for sparse fingerprints) so there's some overhead figuring out the right way to do the computation, and it double-checks that the fingerprints are of the same length. While the programs I wrote will only work for 2048-bit fingerprints and may cause Python itself to crash if you don't use it correctly.

Are there obvious ways to speed things up?

The same BitBound algorithm in yesterday's essay can be applied to improve the BulkTanimotoSimilarity() performance. However, most of the time is not spent in computing fingerprints! I used kernprof with the first 100 Wikipedia SMILES to profile the program and found almost 70% of the time is spent iterating through the scores:

         .... lines omitted ...
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
         .... lines omitted ...
   115         1   32567959.0 32567959.0      9.7      target_ids, target_fingerprints = list(zip(*read_fingerprints(open(args.targets))))
         .... lines omitted ...
   129                                                   # Compute the score with each of the targets.
   130       100   40625764.0 406257.6     12.1          scores = DataStructs.BulkTanimotoSimilarity(query_fp, target_fingerprints)
   131                                                   # Find the records with a high enough similarity.
   132 194141100  133618300.0      0.7     39.9          for i, score in enumerate(scores):
   133 194141000  127750939.0      0.7     38.2              if score >= threshold:
         .... lines omitted ...

The only way to make this significantly faster is push more work into C.

If you are interested in improving the performance of the parts of RDKit I covered here, I suggest you look into speeding up DataStructs.CreateFromFPSText(), which seems to take rather longer than I think it should.

Andrew Dalke is an independent consultant focusing on software development for computational chemistry and biology. Need contract programming, help, or training? Contact me