Simple FPS fingerprint similarity search: variations on a theme

[ previous | newer ] /home/writings/diary/archive/2020/09/28/simple_fps_fingerprint_search

Simple FPS fingerprint similarity search: variations on a theme

This is part of a series of essays on how to write a similarity search program for the RDKit Morgan fingerprints distributed by ChEMBL.

Simple FPS fingerprint similarity search: variations on a theme
Simple k-NN FPS Tanimoto and cosine similarity search
Simple in-memory ChEMBL similarity search
Simple BitBound ChEMBL similarity search
Using RDKit BulkTanimotoSimilarity
Faster in-memory ChEMBL search by using more C
Faster BitBound ChEMBL search by using more C
Even faster in-memory search with intersection popcount
Cache and reuse popcount-sorted ChEMBL fingerprints

It's easy to write a fingerprint search tool. Peter Willett tells a story about how very soon after he, Winterman, and Bawden published Implementation of nearest-neighbor searching in an online chemical structure search system (1986) (which described their nearest-neighbor similarity search implementation and observed that Tanimoto similarity gave more satisfactory results than cosine similarity), he heard from a company which wrote their own implementation, on a Friday afternoon, and found it to be very useful.

Now, my memory of his story may be missing in the details, but the key point is that it's always been easy to write a fingerprint similarity search tool. So, let's do it!

I'll call my program ssimsearch because it's going to be a simplified version of chemfp's simsearch command-line tool. In fact, I'll hard-code just about everything, with only the bare minimum of checking.

In tomorrow's essay, I extend this for k-nearest neighbors and allow cosine similarity as an alternative to Tanimoto similarity. In Wednesday's essay I implement an in-memory version which has a 5 second startup cost but faster search time. In Thursday's essay I add support for BitBound, which has an 8 second startup cost but even faster search time, especially for higher thresholds. In Friday's essay I revisit using RDKit's ExplicitBitVect and show that most of the time is spent parsing the fingerprint data, not searching.

If you want to know more about cheminformatics fingerprints, see my Molecular fingerprints, background essay from 2008.

What's in chembl_27.fps?

To make it even easier, it will be hard-coded so it can only search "chembl_27.fps" in the local directory, which is the decompressed contents of chembl_27.fps.gz.

Let's take a look at the first few lines of that file, which is in FPS format:

% fold -w 90 chembl_27.fps | head -18
#FPS1
#num_bits=2048
#type=RDKit-Morgan/1 radius=2 fpSize=2048 useFeatures=0 useChirality=0 useBondTypes=1
#software=RDKit/2019.09.2
#source=chembl_27.fps.gz
#date=2020-05-18 15:43:14.060167
000804000000000000000000000000000010000000000000000000000000000000000000000000000000008000
000004000000000000400000000100000000000000000000010000000000000000000000000000081000000000
200000000000000000008000000000000000000888000000000080000000000010000000000000000000020000
000000000000000002000108002040000000000000000005000000020001000000000000001000000002000000
000000000000000000000000000000800000000000200000000000000000010000000004000000000000000000
00000000000080000002000000000000000008000000000000000000000000	CHEMBL153534
020800000002000000800300000030000010810000200000220000400800040000000008400000000000008004
0000050004000000000040004000000400000000000800020000b0000000000000030000040000000102000002
700009008008000004408000000000000200000800040000004088000002000004200000022c10000000020000
0000004844000000020180084000000010000000000000000080040400002000000000000010000a0000000000
01101004000500000000008400014010800000000000a500020400000000002900008800000001490000000200
00008020008002020082000004802040804000800000000000100001800000	CHEMBL440060

The line starting with "#" are part of the header. The first line identifies the file format, the other header lines contain key/value-formatted metadata. In this case, the file contains 2048-bit Morgan fingerprints from RDKit version 2019.09.2, and it was created on 18 May 2020.

fingerprint type

The RDKit code to generate a fingerprint corresponding to the type parmaeters, given a RDMol molecule object, is:

from rdkit.Chem import rdMolDescriptors

query_rd_fp = rdMolDescriptors.GetMorganFingerprintAsBitVect(
    mol, radius=2, nBits=2048, useChirality=0,
    useBondTypes=1, useFeatures=0)

This returns a Datastructs.ExplicitBitVect instance.

Fingerprint lines

After the header lines are the fingerprint lines. There is one record per line, with tab-separated fields. The first field is the hex-encoded fingerprint, the second is the identifier. The other fields are user-defined - chemfp ignores them. (They could contain SMILES, property values, etc.)

The FPS format uses hex-encoding because it's very easy in most programs to convert hex-strings into byte-strings. Here's a Python3 example using a hex-encoded MACCS fingerprint for caffeine:

>>> bytes.fromhex("000000003000000001d414d91323915380f138ea1f")
b'\x00\x00\x00\x000\x00\x00\x00\x01\xd4\x14\xd9\x13#\x91S\x80\xf18\xea\x1f'

In Python 2 it was:

>>> "000000003000000001d414d91323915380f138ea1f".decode("hex")
'\x00\x00\x00\x000\x00\x00\x00\x01\xd4\x14\xd9\x13#\x91S\x80\xf18\xea\x1f'

How to represent fingerprints and compute the Tanimoto?

Data representation and data calculation go hand-in-hand. I'll need to compute the binary Tanimoto, which is defined as popcount(fp1&fp2)/popcount(fp1|fp2) where | and & are the usual bitwise and and or operators, and popcount() is the number of bits in the result. Bitwise-and and -or are available in standard Python integers but popcount() is not.

I can think of three approaches:

Turn the hex-encoded fingerprint into an RDKit ExplicitBitVect() and have RDKit compute the Tanimoto for me;
Turn both into a third data type where it's easy to compute the Tanimoto;
Turn the RDKit fingerprint into a byte string and figure out some way to compute the Tanimoto between them.

I'm writing this essay for pedagogical reasons, so I'm going to explore all three options.

Using RDKit's ExplicitBitVect()

RDKit's CreateFromFPS() function parses a hex-encoded fingerprint into an ExplicitBitVect(), and TanimotoSimilarity() computes the Tanimoto similarity between two fingerprints, which makes it easy to write a search program. It's so easy, I'll let the comments speak for themselves:

from rdkit import Chem
from rdkit.Chem import rdMolDescriptors
from rdkit import DataStructs

## Input parameters
smiles = "CN1C=NC2=C1C(=O)N(C(=O)N2C)C"
threshold = 0.7

## Convert the SMILES into an RDKit fingerprint
mol = Chem.MolFromSmiles(smiles)
if mol is None:
    raise SystemExit(f"Cannot parse SMILES {smiles!r}")

query_rd_fp = rdMolDescriptors.GetMorganFingerprintAsBitVect(
    mol, radius=2, nBits=2048, useChirality=0,
    useBondTypes=1, useFeatures=0)

## Search chembl_27.fps

# Skip the header; make sure I skipped only to the end of the header
infile = open("chembl_27.fps")
for i in range(6):
    line = infile.readline()
assert line.startswith("#date")

# Process each line, convert to an RDKit fingerprint, and compare
for line in infile:
    target_hex_fp, target_id = line.split()
    target_rd_fp = DataStructs.CreateFromFPSText(target_hex_fp)
    score = DataStructs.TanimotoSimilarity(query_rd_fp, target_rd_fp)
    if score >= threshold:
        print(f"{target_id}\t{score}")

The output from running this program is:

CHEMBL113	1.0
CHEMBL1232048	0.7096774193548387

CHEMBL113 is caffeine, so getting the expected score of 1.0 shows I didn't mess up on the GetMorganFingerprintAsBitVect() parameters. CHEMBL1232048 is bisdionin C, with a caffeine-caffeine linker, so that also makes sense.

You can surely see that this is not a complicated program, and can easily be done in an afternoon! The hardest part is probably finding out which functions to call and how to call them.

However, this search took 30 seconds. Can we do better?

Using gmpy2 integers

I could convert a hex-encoded fingerprint to a Python integer and do the bitwise-and and -or operations, but there's no way to get the popcount() of a Python integer directly. (I could convert the Python integer into a byte string and compute the popcount of that, but then why not just work on byte strings directly?)

Instead, I'll try using the GNU Multiple Precision Arithmetic Library through the gmpy2 Python wrapper. The integer-like mpz object implements the bitwise operation and the popcount method, and has an optimized popcount implementation. Here's an example using the RDKit MACCS keys for caffeine and theobromine:

>>> import gmpy2
>>> caffeine = gmpy2.mpz("000000003000000001d414d91323915380f138ea1f", 16)
>>> theobromine = gmpy2.mpz("000000003000000001d414d91323915380e178ea1f", 16)
>>> gmpy2.popcount(theobromine & caffeine)
45
>>> gmpy2.popcount(theobromine | caffeine)
47
>>> gmpy2.popcount(theobromine & caffeine) / gmpy2.popcount(theobromine | caffeine)
0.9574468085106383

(I used chemfp to verify that it computed the same MACCS Tanimoto similarity.)

That, plus the knowledge that BitVectToFPSText() turns an ExplicitBitVect into a hex-encoded fingerprint, makes it straight-forward to try out a GMPY2-based search system:

from rdkit import Chem
from rdkit.Chem import rdMolDescriptors
from rdkit import DataStructs
import gmpy2

## Input parameters
smiles = "CN1C=NC2=C1C(=O)N(C(=O)N2C)C"
threshold = 0.7

## Convert the SMILES into an RDKit fingerprint then into a GMP mpz integer
mol = Chem.MolFromSmiles(smiles)
if mol is None:
    raise SystemExit(f"Cannot parse SMILES {smiles!r}")

query_rd_fp = rdMolDescriptors.GetMorganFingerprintAsBitVect(
    mol, radius=2, nBits=2048, useChirality=0,
    useBondTypes=1, useFeatures=0)

query_gmp_fp = gmpy2.mpz(DataStructs.BitVectToFPSText(query_rd_fp), 16)

## Search chembl_27.fps

# Skip the header; make sure I skipped only to the end of the header
infile = open("chembl_27.fps")
for i in range(6):
    line = infile.readline()
assert line.startswith("#date")

# Process each line, convert to a GMP mpz integer, and compare
for line in infile:
    target_hex_fp, target_id = line.split()
    target_gmp_fp = gmpy2.mpz(target_hex_fp, 16)
    score = gmpy2.popcount(query_gmp_fp & target_gmp_fp) / gmpy2.popcount(query_gmp_fp | target_gmp_fp)
    if score >= threshold:
        print(f"{target_id}\t{score}")

The GMP-based search takes about 8 seconds, which is a nice speedup from 30 seconds.

One downside of this approach is that each of the bitwise-operators returns a new Python object, which is used only to find the popcount. That's a lot of object creation, which means a lot of object overhead.

Using byte string + my own Tanimoto function

The third option is to use native Python byte strings. This has less overhead, but Python offers no fast way to compute the Tanimoto between two byte strings. I could make one in Python using, say, a byte-based lookup table, but it's clear that a C or C++ extension will be faster, and there are compiler intrinsics for many compilers (gcc/clang, Visual Studio), or std::popcount (for C++) to make it fast.

CFFI - C foreign-function interface for Python

While I could use the Python/C API to develop an extension, that documentation points out Third party tools like Cython, cffi, SWIG and Numba offer both simpler and more sophisticated approaches to creating C and C++ extensions for Python. I'll use cffi, which implements a way for Python code to call C functions directly.

This includes a way to specify the source code for the extension. Here's an example, which will create the extension "_popc" with a function called byte_tanimoto_256 which is hard-coded to compute the Tanimoto between two 2048-bit/256-byte bytesstrings.

# I call this file "popc.py". It should work for gcc and clang.
from cffi import FFI

ffibuilder = FFI()

# Create a Python module which can be imported via "import _popc".
# It will contain the single function byte_tanimoto_256() which
# expects two byte strings of length 256 bytes - exactly!
ffibuilder.set_source("_popc", r"""

#include <stdint.h>

static double
byte_tanimoto_256(const unsigned char *fp1, const unsigned char *fp2) {
    int num_words = 2048 / 64;
    int union_popcount = 0, intersect_popcount = 0;

    /* Interpret as 64-bit integers and assume possible mis-alignment is okay. */
    uint64_t *fp1_64 = (uint64_t *) fp1, *fp2_64 = (uint64_t *) fp2;

    for (int i=0; i<num_words; i++) {
        intersect_popcount += __builtin_popcountll(fp1_64[i] & fp2_64[i]);
        union_popcount += __builtin_popcountll(fp1_64[i] | fp2_64[i]);
    }
    if (union_popcount == 0) {
        return 0.0;
    }
    return ((double) intersect_popcount) / union_popcount;
}

""",
    # Tell the compiler to always expect the POPCNT instruction will be available.
    extra_compile_args=["-mpopcnt"])

# Tell cffi to export the above function for use by Python.
ffibuilder.cdef("""
double byte_tanimoto_256(unsigned char *fp1, unsigned char *fp2);
""")

if __name__ == "__main__":
    ffibuilder.compile(verbose=True)

(If I don't include -mpopcnt then the compiler will generate slower code that works even on older Intel-compatible CPUs that don't support the POPCNT instruction added to SSE4a in 2007.)

I then run the program to compile the _popc module:

% python popc.py
generating ./_popc.c
the current directory is '/Users/dalke/demo'
running build_ext
building '_popc' extension
gcc -Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -g -fwrapv -O3 -Wall -I/Users/dalke/venvs/py38-2020-8/include -I/Users/dalke/local-3.8/include/python3.8 -c _popc.c -o ./_popc.o -mpopcnt
gcc -bundle -undefined dynamic_lookup ./_popc.o -o ./_popc.cpython-38-darwin.so

Finally, I'll test it on a couple of strings to check that nothing is obviously wrong:

>>> import _popc
>>> _popc.lib.byte_tanimoto_256(b"\1"*256, b"\3"*256)
0.5
>>> _popc.lib.byte_tanimoto_256(b"\1"*255 + b"\3", b"\3"*256)
0.501953125
>>> 257/512
0.501953125

A bytestring search implementation

Putting it all together, here's an implementation which uses the _popc module I created in the previous subsection:

from rdkit import Chem
from rdkit.Chem import rdMolDescriptors
from rdkit import DataStructs
from _popc.lib import byte_tanimoto_256

## Input parameters
smiles = "CN1C=NC2=C1C(=O)N(C(=O)N2C)C"
threshold = 0.7

## Convert the SMILES into Morgan fingerprint byte string
mol = Chem.MolFromSmiles(smiles)
if mol is None:
    raise SystemExit(f"Cannot parse SMILES {smiles!r}")

query_rd_fp = rdMolDescriptors.GetMorganFingerprintAsBitVect(
    mol, radius=2, nBits=2048, useChirality=0,
    useBondTypes=1, useFeatures=0)

query_byte_fp = DataStructs.BitVectToBinaryText(query_rd_fp)

## Search chembl_27.fps

# Skip the header; make sure I skipped only to the end of the header
infile = open("chembl_27.fps")
for i in range(6):
    line = infile.readline()
assert line.startswith("#date")

# Process each line, convert to a byte string, and compare
for line in infile:
    target_hex_fp, target_id = line.split()
    target_byte_fp = bytes.fromhex(target_hex_fp)
    score = byte_tanimoto_256(query_byte_fp, target_byte_fp)
    if score >= threshold:
        print(f"{target_id}\t{score}")

This takes a bit over 5 seconds. Recall that the GMP-based search takes about 8 seconds, and the RDKit-based search takes about 30 seconds.

Chemfp's simsearch performance

While it's easy to write a fingerprint search tool, it's hard to make a fast fingerprint search tool. There are many obvious ways to improve the performance of the above code. For examples: 1) local variable name looks in a Python function are faster than using module variables, so simply moving the module-level program into its own function increases performance by 8%, 2) profiling shows that looping over the lines takes the longest time, followed by the split() and then the conversion from hex to a byte, so reducing that overhead is the place to look next, not optimizing the popcount any more.

Instead of going through that process yourself, you could use chemfp's simsearch tool, which is fairly well optimized already. A caffeine search takes under 2 seconds, and the implementation supports many different fingerprint types and sizes:

% time simsearch --query "CN1C=NC2=C1C(=O)N(C(=O)N2C)C" --query-format smistring chembl_27.fps --threshold 0.7
#Simsearch/1
#num_bits=2048
#type=Tanimoto k=all threshold=0.7
#software=chemfp/3.4.1
#targets=chembl_27.fps
#target_source=chembl_27.fps.gz
2	Query1	CHEMBL113	1.00000	CHEMBL1232048	0.70968
1.876u 0.263s 0:01.98 107.5%	0+0k 0+0io 426pf+0w

You can test it out for yourself by installing chemfp (under the Base License Agreement) using:

python -m pip install chemfp -i https://chemfp.com/packages/

The no-cost Base License Agreement says you may do single-query FPS searches against any sized FPS file, for in-house purposes. If you are interested in faster performance, contact me for a license key to try out the binary FPB format, which is faster to load and supports optimized in-memory searches.

Andrew Dalke is an independent consultant focusing on software development for computational chemistry and biology. Need contract programming, help, or training? Contact me