Dalke Scientific Software: More science. Less time. Products
[ previous | newer ]     /home/writings/diary/archive/2020/09/30/simple_in_memory_search

Simple in-memory ChEMBL similarity search

This is part of a series of essays on how to write a similarity search program for the RDKit Morgan fingerprints distributed by ChEMBL.

  1. Simple FPS fingerprint similarity search: variations on a theme
  2. Simple k-NN FPS Tanimoto and cosine similarity search
  3. Simple in-memory ChEMBL similarity search
  4. Simple BitBound ChEMBL similarity search
  5. Using RDKit BulkTanimotoSimilarity
  6. Faster in-memory ChEMBL search by using more C
  7. Faster BitBound ChEMBL search by using more C
  8. Even faster in-memory search with intersection popcount
  9. Cache and reuse popcount-sorted ChEMBL fingerprints

In the previous two essays I showed how to search chembl_27.fps to find records with similar fingerprints to a query fingerprint, then how to implement a nearest-neighbor search and replace Tanimoto similarity with cosine similarity. The final program took about 5 seconds.

In this essay I'll show how to increase the search performance by shifting more of the work to a load step. This sort of precomputation can be useful if the load step is done once, with the extra overhead shared over multiple searches.

If you want to skip all the discussion, here are the instructions to use the final code, which assumes you have a gcc or clang C compiler and that RDKit's Python bindings are installed:

  1. Download chembl_27.fps.gz.
  2. Download popc.py.
  3. Run python popc.py to generate the _popc module.
  4. Download chembl_in_memory_search.py.
  5. Run python chembl_in_memory_search.py --times to do the default caffeine search and see the internal timings.
  6. Run python chembl_in_memory_search.py --help to see the command-line help, then try your own queries, including from a SMILES or FPS file.

The new program averages about 0.5 seconds per query, with an initial 4-5 second load time.

In tomorrow's essay, I'll describe how to implement the BitBound algorithm, which speeds up most searches by reducing the number of fingerprints to test. I'll revisit this in-memory search algorithm next week when I implement more of the search code in C.

Measuring FPS parsing overhead

I profiled the program and found that most of the time was spent in parsing the FPS file, not in computing the Tanimoto. To profile, I wrapped the entire program into a function (the profiler I'm using works on functions), used the @profile decorator, which tells kernprof which function(s) to profile, profiled the result with:

% kernprof -l ssimsearch_bytes.py

then viewed the profiler results with

% python -m line_profiler ssimsearch_bytes.py.lprof
Timer unit: 1e-06 s

Total time: 13.0334 s
File: ssimsearch_bytes.py
Function: main at line 1

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     1                                           @profile
     2                                           def main():
     3         1       6859.0   6859.0      0.1      from _popc.lib import byte_tanimoto_256
     4
     5         1     224418.0 224418.0      1.7      from rdkit import Chem
     6         1          6.0      6.0      0.0      from rdkit import DataStructs
     7         1       5629.0   5629.0      0.0      from rdkit.Chem import rdMolDescriptors
     8
     9         1          1.0      1.0      0.0      smiles = "CN1C=NC2=C1C(=O)N(C(=O)N2C)C"
    10         1          1.0      1.0      0.0      threshold = 0.7
    11
    12         1      11504.0  11504.0      0.1      mol = Chem.MolFromSmiles(smiles)
    13         1          2.0      2.0      0.0      if mol is None:
    14                                                   raise SystemExit(f"Cannot parse SMILES {smiles!r}")
    15
    16         2         98.0     49.0      0.0      rd_query_fp = rdMolDescriptors.GetMorganFingerprintAsBitVect(
    17         1          1.0      1.0      0.0          mol, radius=2, nBits=2048, useChirality=0,
    18         1          0.0      0.0      0.0          useBondTypes=1, useFeatures=0)
    19         1         15.0     15.0      0.0      query_fp = DataStructs.BitVectToBinaryText(rd_query_fp)
    20
    21
    22         1         97.0     97.0      0.0      infile = open("chembl_27.fps")
    23         7          8.0      1.1      0.0      for i in range(6):
    24         6         32.0      5.3      0.0          line = infile.readline()
    25         1          3.0      3.0      0.0      assert line.startswith("#date")
    26
    27   1941411    3356477.0      1.7     25.8      for line in infile:
    28   1941410    3207597.0      1.7     24.6          target_hex_fp, target_id = line.split()
    29   1941410    2605370.0      1.3     20.0          target_fp = bytes.fromhex(target_hex_fp)
    30
    31   1941410    2071065.0      1.1     15.9          score = byte_tanimoto_256(query_fp, target_fp)
    32   1941410    1544099.0      0.8     11.8          if score >= threshold:
    33         2         98.0     49.0      0.0              print(f"{target_id}\t{score}")

Lines 27-29 are all part of parsing the FPS file, and they take 70% of the search time. The actual Tanimoto calculation (line 31) and score test (line 32) take together about 25% of the time.

Multiple queries

What if you want to search for multiple queries? One implementation might do the full search of the first query, then of the second, and so on. This would take about 5*N seconds.

The above profile suggests that if you could load the fingerprint data into a more easily processed form, where the pre-processing was done once, then the overall performance would be faster. This may be useful in a web service where extra work can be done during startup in order to improve the per-request performance.

I changed the main loop (lines 27-33 in the above) so there are two stages. Stage 1 loads all of the fingerprints into memory, and stage 2 searches those fingerprints, and added timing measurements for each stage:

    import time
    t1 = time.time()
    targets = []
    for line in infile:
        target_hex_fp, target_id = line.split()
        target_fp = bytes.fromhex(target_hex_fp)
        targets.append( (target_id, target_fp) )

    t2 = time.time()
    for target_id, target_fp in targets:
        score = byte_tanimoto(query_fp, target_fp)
        if score >= threshold:
            print(f"{target_id}\t{score}")

This reports:

Load: 4.64 Search: 0.79

What multiple queries?

Previously I supported a hard-coded SMILES string, or a user-defined SMILES string on the command-line. If there are going to be multiple queries, what are they, and where do they come from?

I decided I was going to support a SMILES file or FPS file as input, based on if the given query filename ends with .smi or not. I already have a way to read_fingerprints() from an FPS file, and I have a way to compute the RDKit Morgan fingerprint given a SMILES, so all I need is a way to read a SMILES file that yields the same (id, fingerprint) pairs as read_fingerprints().

SMILES file

There are two main types of SMILES files: Daylight-like and CSV-like. A Daylight SMILES file contain one record per line. Each line contains a SMILES followed by a whitespace character followed by a molecule title/id, up to the rest of the line.

A CSV-like SMILES file contains a SMILES followed by a space or tab seperator character, followed by an molecule title/id, followed optionally by successive seperator characters and additional fields. In addition, the first line may contain column headers.

I'm going to support Daylight-like SMILES file.

SMILES file parser

It's pretty easy to read either SMILES file variant, though more complex if you want full configurability. In short, for every line in the file, split on the first whitespace. The first field is the SMILES, the second is the identifier. Convert the SMILES to a fingerprint, or skip it if the SMILES could not be processed. That give the needed (id, fingerprint) pairs:

# Read a SMILES file, convert the SMILES to a fingerprint,
# and yield (id, fingerprint) pairs for each parsable record.
def read_structure_fingerprints_from_smiles_file(infile):
    for lineno, line in enumerate(infile, 1):
        # Expecting the format: <SMILES> + whitespace + <title> + "\n"
        smiles, id = line[:-1].split(None, 1)
        
        try:
            # Get the corresponding RDKit Morgan fingerprint
            fp = get_query_fp(smiles)
        except ValueError as err:
            # Skip records which can't be parsed
            sys.stderr.write(f"ERROR: {err}: Skipping line {lineno}\n")
            continue
        
        yield id, fp

Reporting search results

Since there may be multiple queries, each with 0 or more hits, I need some way to figure out which query matches which targets. I decided to use a simple three-column tab-separated format with the query id in the first column the target id in the second column, and the score in the third, like the following:

Amygdalin	CHEMBL2303964	0.796
Amygdalin	CHEMBL2303958	0.700
Boron nitride	*	*
Bicarbonate	CHEMBL1353	0.875
Bicarbonate	CHEMBL2106975	0.875
Benzoic acid	CHEMBL541	1.000

A problem with this format is there's no obvious way to say that a query has no hits. Perhaps those should be omitted from the output? What I decided was to have one output line for that case, with * for the target id and score. You can see that in the Boron nitride line, above.

A search implementation

Assuming targets is a list of (target id, target fingerprint) pairs, and query_reader is an iterator of (query id, query fingerprint) pairs (either from read_fingerprints() or read_structure_fingerprints_from_smiles_file()) then the following shows the structure of the main search loop and reporting.

for query_id, query_fp in query_reader:
    has_match = False
    
    # Check each of the targets
    for target_id, target_fp in targets:
        score = similarity_function(query_fp, target_fp)
        # If it's similar enough, print details.
        if score >= threshold:
            has_match = True
            print(f"{query_id}\t{target_id}\t{score:.3f}")
    
    # Print something in case there were no matches.
    if not has_match:
        print(f"{query_id}\t*\t*")

(The final code includes a count of the number of queries, which is used for reporting.)

Command-line options with argparse

There are several Python packages for parsing command-line arguments. I use argparse because it's part of Python's standard library. Many are fond of click as a third-party package.

I decided I wanted to allow a single query SMILES input or input from either a SMILES file or an FPS file. I also wanted to be able to change the threshold, specify the location of the chembl_27.fps file (in case it isn't in the current directory), and support a --times option to report the load and search times.

Here's what the --help from the program looks like:

usage: chembl_in_memory_search.py [-h] [--threshold THRESHOLD]
          [--query SMILES | --queries FILENAME] [--times] [FILENAME]
in-memory ChEMBL fingerprint search

positional arguments:
  FILENAME              location of chembl_27.fps

optional arguments:
  -h, --help            show this help message and exit
  --threshold THRESHOLD, -t THRESHOLD
                        minimum threshold (default: 0.7)
  --query SMILES        query smiles
  --queries FILENAME    input query file (SMILES or FPS)
  --times               print timing information to stderr

This sort of configuration is pretty easy to do with argparse. The hardest thing is that the single-query SMILES search and the file-based search are exclusive, meaning they can't both be specified at the same time.

parser = argparse.ArgumentParser(
    description = "in-memory ChEMBL fingerprint search")

parser.add_argument("--threshold", "-t", type=float, default=0.7,
                        help="minimum threshold (default: 0.7)")

g = parser.add_mutually_exclusive_group()
g.add_argument("--query", metavar="SMILES",
                   help="query smiles")
g.add_argument("--queries", metavar="FILENAME",
                   help="input query file (SMILES or FPS)")

parser.add_argument("--times", action="store_true",
                        help="print timing information to stderr")

parser.add_argument("targets", metavar="FILENAME", default="chembl_27.fps", nargs="?",
                        help="location of chembl_27.fps")

An argparse.ArgumentParser() instance (like parser in the above) is usually used something like:

args = parser.parse_args()
print("Threshold:", args.threshold)

That is, parse_args() parses the command line and returns an argparse.Namespace() instance with the parsed command-line parameters accessible as attributes.

Setting up the query reader

The last tricky part to explain is how to handle both --query SMILES and --queries FILENAME. I decided to have a function which returns a query reader, that is, something which can be iterated over to get the successive pairs of (query id, query fingerprint). I have all the parts - I just need to hook them together correctly. And, if neither option is specified, I'll have the program search for caffeine, with a warning that that's what it will do.

# Helper function to figure out how the queries are specified
def open_query_reader(args):
    if args.query is None:
        if args.queries is None:
            # Default search
            sys.stderr.write("No query specified, using caffeine.\n")
            query_fp = get_query_fp("CN1C=NC2=C1C(=O)N(C(=O)N2C)C")
            return [("caffeine", query_fp)]
        
        elif args.queries.endswith(".smi"):
            # If --queries ends with ".smi"
            return read_structure_fingerprints_from_smiles_file(open(args.queries))
        else:
            # Otherwise, assume it's an FPS file
            return read_fingerprints(open(args.queries))

    else:
        # User specified a --query SMILES
        try:
            query_fp = get_query_fp(args.query)
        except ValueError as err:
            raise SystemExit("Cannot parse --query SMILES {args.query!r}")
        return [("Query", query_fp)]

chembl_in_memory_search.py

I've walked through all of the the important parts, though there are a few missing connector pieces. Rather than show it here, you can download chembl_in_memory_search.py for yourself. You'll need to download and run popc.py in order to generate the _popc Python module. (See Monday's essay for more details.)

Search Wikipedia structures

The fun Wikipedia Chemical Structure Explorer is an interactive tool to explore the chemical structures scrapped from Wikipedia. The structures can be downloaded as a SMILES file. It currently contains 17,846 structures, but it's NOT a SMILES file because the id and SMILES fields are swapped. I'll extract the first 10 to use for my timings and use awk to swap the two fields (the -F\t configures it to use the tab character as the field seperator):

% head -10 smiles.txt | awk '-F\t' '{print $2, $1}' > wikipedia_10.smi

Many if not most of the Wikipedia structures are in ChEMBL, along with similar structures, so I'll use a high similarity threshold in order to limit the number of lines in the output:

% python chembl_in_memory_search.py --queries wikipedia_10.smi --threshold 0.9
Ammonia	CHEMBL1160819	1.000
Aspirin	CHEMBL25	1.000
Acetylene	CHEMBL116336	1.000
Adenosine triphosphate	CHEMBL14249	1.000
Adenosine triphosphate	CHEMBL339385	0.942
Adenosine triphosphate	CHEMBL437508	0.942
Adenosine triphosphate	CHEMBL490984	1.000
Adenosine triphosphate	CHEMBL591905	0.942
Adenosine triphosphate	CHEMBL407938	0.942
Adenosine triphosphate	CHEMBL14830	0.981
Adenosine triphosphate	CHEMBL1331383	0.942
Adenosine triphosphate	CHEMBL1626485	0.942
Adenosine triphosphate	CHEMBL1610462	0.925
Adenosine triphosphate	CHEMBL1162201	0.942
Adenosine triphosphate	CHEMBL4298329	0.942
Ampicillin	CHEMBL174	1.000
Ampicillin	CHEMBL453388	0.978
Ampicillin	CHEMBL1078033	0.978
Ampicillin	CHEMBL1473908	1.000
Chemistry of ascorbic acid	CHEMBL196	1.000
Chemistry of ascorbic acid	CHEMBL486293	1.000
Chemistry of ascorbic acid	CHEMBL1161421	1.000
Chemistry of ascorbic acid	CHEMBL196	1.000
Chemistry of ascorbic acid	CHEMBL486293	1.000
Chemistry of ascorbic acid	CHEMBL1161421	1.000
Amphetamine	CHEMBL405	1.000
Amphetamine	CHEMBL19393	1.000
Amphetamine	CHEMBL612	1.000
Amphetamine	CHEMBL287386	0.950
Amphetamine	CHEMBL554211	0.950
Amphetamine	CHEMBL1788294	0.950
Aspartame	CHEMBL171679	1.000
Aspartame	CHEMBL312245	1.000
Aspartame	CHEMBL2110238	1.000
Amoxicillin	CHEMBL1082	1.000
Amoxicillin	CHEMBL1367635	1.000
Amoxicillin	CHEMBL1433952	1.000

If I add the --times option, then I see I'm doing about 2 searches per second:

load: 4.57 search: 5.01 #queries: 10 #/sec: 2.00

The equivalent chemfp search

Chemfp is a high-performance cheminformatics similarity search package I sell. Its command-line is similar to the chembl_in_memory_search.py program I developed above. I'll try it out:

% simsearch --queries wikipedia_10.smi --threshold 0.9 --times chembl_27.fps > /dev/null
open 0.00 read 0.00 search 8.55 output 0.00 total 8.56

This means it takes about 9 seconds to search the file, which is about the same time as the simple Python code I wrote!

--scan vs. --memory searches

That was unexpected. The problem is that chemfp has two search modes. The --scan mode searches the file chunk by chunk (like Monday' essay), while --memory loads the data set into memory before searching.

Years ago, the tradeoff was about 20 queries. Now I see it's much smaller! If I ask it to do an in-memory search then it takes only 5.7 seconds total:

% simsearch --queries wikipedia_10.smi --threshold 0.9 --times chembl_27.fps --memory > /dev/null
open 5.60 read 0.00 search 0.07 output 0.00 total 5.67

This means it took 5.6 seconds to read the fingerprints and organize them for fast search, and the search itself took only 0.07 seconds.

Looks like it's time to re-tune that parameter!

chemfp's FPB format

One reason I didn't notice this until now is that I tend to work with chemfp's FPB format, which is optimized for fast loading. It takes all of the FPS readers over 4 seconds to parse the file to put the data into a format which is more appropriate for faster searches.

The FPB format organizes the fingerprint data so it can loaded quickly. In fact, if the file is uncompressed it can be memory-mapped and used directly - for some searches chemfp doesn't even need to load the full data set!

First, I'll convert the FPS file to FPB format:

% fpcat chembl_27.fps -o chembl_27.fpb

I'll then do the same search with 10 Wikipedia queries using a warm disk cache:

% simsearch --queries wikipedia_10.smi --threshold 0.9 --times chembl_27.fpb > /dev/null
open 0.00 read 0.00 search 0.22 output 0.00 total 0.23

That's about 22 milliseconds per query, with very little load time needed. This is fast enough to include a ChEMBL similarity search in a scripting environment.

chemfp with 1,000 Wikipedia queries

I then tried with nearly 1,000 Wikipedia queries (RDKit could not parse a couple of structures; I've omitted RDKit's warning and error messages) using chemfp and the FPS and FPB formats:

% simsearch --queries wikipedia_1000.smi --threshold 0.9 --times chembl_27.fps > /dev/null
open 5.92 read 0.26 search 2.58 output 0.01 total 8.77
% simsearch --queries wikipedia_1000.smi --threshold 0.9 --times chembl_27.fpb > /dev/null
open 0.02 read 0.25 search 2.77 output 0.02 total 3.07
With 1,000 queries the read time is now noticeable; that includes the time for RDKit to parse the SMILES string and generate the Morgan fingerprint. In case you are curious, the performance is over 100x faster than the simple in-memory program I developed for this essay:
% python chembl_in_memory_search.py --queries wikipedia_1000.smi --threshold 0.9 --times > /dev/null
load: 4.69 search: 499.65 #queries: 994 #/sec: 1.99

chemfp advertisement

You can install and test chemfp chemfp (under the Base License Agreement) by doing:

python -m pip install chemfp -i https://chemfp.com/packages/
The Base License Agreement does not allow you to create FPB files or do an in-memory search with more than 50,000 fingerprints. If you are interested in those features, take a look at the other licensing options then contact me to ask for an evaluation license key.


Andrew Dalke is an independent consultant focusing on software development for computational chemistry and biology. Need contract programming, help, or training? Contact me



Copyright © 2001-2020 Andrew Dalke Scientific AB