Similarity web service

[ previous | newer ] /home/writings/diary/archive/2014/11/28/similarity_web_service

Similarity web service

Now that I've indexed ChEMBL for record lookup and similarity search, I can write a simple interactive web server. By "interactive" I mean that it will show you results, if available, as you enter the SMILES or record id. This is possible because chemfp takes under 0.1 seconds to search the 1.4 million fingerprints of ChEMBL and report the results. By "simple" I mean there will be no sketcher and a plain looking user interface. I'll also stick to using jQuery for the browser-side code; I usually use Knockout for this sort of app.

I also decided to try out the Flask web development framework instead of using Django. This has the advantage that my web service code fits into a single file.

Here's what the result looks like, so you can see where I'm going:

Multiple threads

Flask is a multi-threaded server, which means I need to configure a few things to work correctly.

The record data is in a SQLite database. SQLite is threadsafe but "Older SQLite versions had issues with sharing connections between threads. That's why the Python module disallows sharing connections and cursors between threads."

I don't write to the database and I'll be careful and create a new cursor for each request, so I'm safe. However, I do need to disable that check, which I do through the check_same_thread parameter:

import sqlite3

database_filename = "chembl_19.sqlite3"
db = sqlite3.connect(database_filename, check_same_thread=False)

Chemfp can be compiled for OpenMP, which is the default. Unfortunately, on the Mac, multi-threaded code and OpenMP don't mix. I need to tell chemfp to always use the non-OpenMP code path, as otherwise the program will crash on the first similarity search:

import chemfp

chemfp.set_num_threads(1)

Load the data sets

I'll open the SQLite database and make sure it has compounds, then open the fingerprint file and make sure it has fingerprints. Hopefully the two are the same! I made sure when I created the fingerprint file to include the fingerprint type in the metadata. Now I can use the type string to get the fingerprint type object, and from that I can get the correct toolkit to use.

database_filename = "chembl_19.sqlite3"
fingerprint_filename = "chembl_19.fpb"

db = sqlite3.connect(database_filename, check_same_thread=False)
cursor = db.cursor()
try:
    num_compounds, = cursor.execute("SELECT COUNT(*) FROM Compounds").fetchone()
except sqlite3.ChemFPError, err:
    raise SystemExit("Cannot connect to SQLite database %r: %s" % (database_filename, err))
del cursor  # remove the global cursor so there's no chance of accidental reuse

if num_compounds == 0:
    sys.stderr.write("WARNING: Database %r contains no compounds." % (database_filename,))

# Load the fingerprints
arena = chemfp.load_fingerprints(fingerprint_filename)

# Use the fingerprint type string from the metadata to get the fingerprint type object
try:
    fptype = arena.get_fingerprint_type()
except chemfp.ChemFPError, err:
    raise SystemExit("Unable to load the fingerprint type for fingerprint file %r: %s"
                     % (fingerprint_filename, err))

# Get the toolkit associated with this fingerprint type
toolkit = fptype.toolkit
if toolkit is None:
    raise SystemExit("No toolkit available to generate fingerprints of type %r"
                     % (fptype.base_name,))

# Report information about the number of compounds and fingerprints
if num_compounds == len(arena):
    sys.stderr.write("Loaded %d compounds and fingerprints of type %r\n"
                     % (num_compounds, fptype.get_type()))
else:
    sys.stderr.write("Loaded %d compounds and %d fingerprints of type %r\n"
                     % (num_compounds, len(arena), fptype.get_type()))

When my server start (or in debug mode, when it restarts) it prints:

Loaded 1404532 compounds and fingerprints of type 'RDKit-Fingerprint/2 minPath=1 maxPath=7 fpSize=2048 nBitsPerHash=2 useHs=1'

The FPB format for chemfp-1.2 was developed in part so that web server reloads, which is important during web development, would be quick. I basically don't notice the time it takes for the above code to run.

Defining the web service API

1) I'll have a single similarity search API, named '/search', which take the following parameters:

q: the query, which is a SMILES string or a record id
k: the number of neighbors to find (default: 3)
threshold: the minimum required threshold for a valid result
oformat: the output format. "ids" return the match identifiers, one per line; "smiles" returns a tab-separated file with the SMILES in the first column, id in the second, and score in the third; "sdf" returns the original SD records, each with a new field called "SCORE" containing the similarity score, and "json" returns a JSON document containing the ids, SMILES, scores, and a few other pieces of information.

2) I'll have the home page ('/') return an interactive search brower.

3) I'll have a way to get the record for CHEMBL286759 by going to a URL like "/CHEMBL286759.sdf". In Flask terminolgy, I need to match anything of the form "/<string:id>.sdf", though if the given id doesn't exist I'll need to return a 404 NOT FOUND error.

All of them require hooking on to the Flask 'app', which is defined like this:

from flask import Flask, Response, jsonify, request

app = Flask(__name__)

Get a record given its id

I'll start with the easiest, where "/CHEMBL286759.sdf" returns the original SDF record for CHEMBL286759.

# get a record given its id, or return "404 Not Found"
@app.route("/<string:id>.sdf")
def get_record(id):
    cursor = db.cursor()
    cursor.execute("SELECT record FROM Compounds WHERE id = ?", (id,))
    row = cursor.fetchone()
    if not row:
        page_not_found()

    record = zlib.decompress(row[0])
    return Response(record, mimetype="text/plain")

I don't think it needs much explanation, but it might help to look at the interactive example from my previous essay.

This calls a error handler function to report the 404 NOT FOUND error, which is:

@app.errorhandler(404)
def page_not_found(error=None):
    return "Not found", 404

Similarity search - parse query arguments

The similarity search code is pretty involved, but mostly because it need to handle different input and output cases. The input contains a lot of code like this:

# Similarity search
@app.route("/search")
def similarity_search():
    ...
    k = request.args.get("k", "3")
    ...
    try:
        k = int(k)
    except ValueError:
        return _error_response("bad value for k")

which get the query parameter "k" if it exists, or uses the default of "3" if it doesn't. It then converts it into an integer, and if that fails it returns an error response. That function constructs a JSON document:

def _error_response(errmsg):
    return jsonify({
       "status": "error",
       "message": errmsg,
       "columns": ["id", "score", "smiles", "exact match"],
       "hits": [],
       })

Similarity search - handle the SMILES or record id

The "q" parameter is more complicated, because it can be a SMILES string or a query id. I'll resolve this by trying to parse it as a SMILES string. If that fails, I'll see if there's a record in the database with that name, which also means I need a new cursor for the database connection:

def similarity_search():
    ...
    q = request.args.get("q", "").strip()
    ...
    cursor = db.cursor()
    
    # Flask queries are Unicode strings, but SMILES and record identifiers
    # can only be ASCII, so convert and report any failures
    try:
        q = str(q)
    except UnicodeEncodeError:
        return _error_response("non-ASCII query")
    errmsg, mol = _get_molecule(cursor, q)
    if errmsg:
        return _error_response(errmsg)
    ...

Here's where I implement the business logic. Given the toolkit, parse the query as a SMILES. If that fails (chemfp.ParseError is also a ValueError) then get the SMILES for the given id. If that fails, or is the SMILES cannot parsed, then report an error. Otherwise return the newly parsed molecule object. (It's actually a two-value result, where the first is the error message and the second is the molecule. At least one of them will be None.)

def _get_molecule(cursor, query):
    if not query:
        return None, None

    # Is this a valid SMILES?
    try:
        mol = toolkit.parse_molecule(query, "smistring")
    except ValueError, err:
        # No. Then is this an identifier?
        match = cursor.execute("SELECT smiles FROM Compounds WHERE id = ?", (query,)).fetchone()
        if not match:
            return "Not a valid SMILES or identifier", None
        smiles = match[0]
        try:
            mol = toolkit.parse_molecule(smiles, "smistring")
        except ValueError:
            return "Invalid SMILES in database!", None

    return None, mol

Similarity search - compute the fingerprint

It's only a couple of lines to compute the fingerprint for the molecule and search the FPB file for the k-nearest neighbors within the given threshold.

    fp = fptype.compute_fingerprint(mol)
    hits = arena.knearest_tanimoto_search_fp(fp, k=k, threshold=threshold)

Similarity search - handle the oformat parameter and return ids

The 'oformat', short for 'output format', specifies the desired output format if there are no errors. The "ids" output format returns the hits as a list of target identifiers, with one id per line.

def similarity_search():
    ...
    oformat = request.args.get("oformat", "json")
    ...
    if oformat not in ("json", "ids", "smiles", "sdf"):
        oformat = "json"
    ...
    hits = arena.knearest_tanimoto_search_fp(fp, k=k, threshold=threshold)
    ...
    if oformat == "ids":
        return Response("\n".join(hits.get_ids()) + "\n", mimetype="text/plain")
    ...

Similarity search - oformat == "smiles"

Returning SMILES is a bit more complicated because I need to get the SMILES string for each hit id, and that requires a database lookup. I decided to make a function call "get_columns()", which takes the list of requested columns and the hits of interest, and searches for the intersection of them.

As far as I can tell, there's no easy way to ask for all of the rows for a given list of ids. There is the "WHERE id IN (?, ?, ... ?)" construction, but I need a '?' for each query id. That's not hard to do, because this is Python, but I feel like there should be a better way to make the query.

def get_columns(cursor, columns, hits):
    sql_query = "SELECT %s FROM Compounds WHERE id IN (" % (", ".join(columns),)
    sql_query += ",".join("?"*len(hits)) + ")"
    cursor.execute(sql_query, tuple(hits.get_ids()))
    return cursor

I use the get_columns() function to get a list of 2-element rows, which I convert into a dictionary. There's the off chance that the SQLite database and the fingerprint database can get out of synchronization. A loop through the hits, which has the ids and scores, also gives me the SMILES. If a given id is missing, I'll use "MISSING" as the SMILES string for it. And I'll format the score to three decimal places.

    if oformat == "smiles":
        id_to_cansmiles = dict(get_columns(cursor, ["id", "smiles"], hits))
        lines = []
        for id, score in hits.get_ids_and_scores():
            smiles = id_to_cansmiles.get(id, "MISSING")
            lines.append("%s\t%s\t%.3f\n" % (smiles, id, score))
        return Response("".join(lines), mimetype="text/plain")

The text_toolkit and "text molecules"

The SDF output is a bit more complicated because I need to decompress the compressed SD record, and because I want to modify the original record to have a new tag containing the "SCORE". For that I'll use the text_toolkit, but this time for its "text molecule" toolkit API. It implements the same toolkit API as rdkit_toolkit, openeye_toolkit and rdkit_toolkit, but instead of being real molecule objects that knows chemistry, it's really just a light-weight container for the record, with enough knowledge of SD records to do some light changes to the tags.

Here's what I mean. I'll start with a record (I'll leave out the left side of Python's interactive shell to make it easier to copy and paste).

record = """\
CHEMBL17564
  SciTegic10311311082D

  1  0  0  0  0  0            999 V2000
    0.0000    0.0000    0.0000 C   0  0
M  END
> 
CHEMBL17564

$$$$

I'll then "parse" the record into a text molecule, using the text_toolkit, and some of the SD tag API:

>>> from chemfp import text_toolkit
>>> text_mol = text_toolkit.parse_molecule(record, "sdf")
>>> text_mol.get_tag("chembl_id")
'CHEMBL17564'
>>> text_mol.get_tag_pairs()
[('chembl_id', 'CHEMBL17564')]
>>> text_mol.add_tag("SCORE", "0.567")
>>> text_mol.get_tag_pairs()
[('chembl_id', 'CHEMBL17564'), ('SCORE', '0.567')]

Now I'll convert the modified record back into a string, with the new tag(s) at the end:

>>> print text_toolkit.create_string(text_mol, "sdf")
CHEMBL17564
  SciTegic10311311082D

  1  0  0  0  0  0            999 V2000
    0.0000    0.0000    0.0000 C   0  0
M  END
> <chembl_id>
CHEMBL17564

> <SCORE>
0.567

$$$$

As a side note, I think this API is too complicated. A future version of chemfp will likely have an "add_sdf_tag" which operates directly on strings rather than through text molecule objects.

Similarity search - oformat == "sdf"

Once you know how to use the text_toolkit to add an SD tag, the SDF output format handler is pretty straight-forward:

    if oformat == "sdf":
        id_to_sdfz = dict(get_columns(cursor, ["id", "record"], hits))
        records = []
        for id, score in hits.get_ids_and_scores():
            recordz = id_to_sdfz.get(id, None)
            if recordz is None:
                continue
            record = zlib.decompress(recordz)
            text_mol = text_toolkit.parse_molecule(record, "sdf")
            text_mol.add_tag("SCORE", "%.3f" % (score,))
            writers.append(text_toolkit.create_string(text_mol, "sdf"))

        return Response("".join(writer), mimetype="text/plain")

Similarity search - oformat == "json"

I decided that I want the JSON output to have the id, score, and SMILES string, and also have a flag (either 0 or 1) if it matches the query SMILES:

    query_cansmiles = toolkit.create_string(mol, "canstring")

I then need to get the ids and smiles for the hits

    id_to_cansmiles = dict(get_columns(cursor, ["id", "smiles"], hits))

get the results, as a 4-element tuple of (id, score, SMILES, and match flag)

    results = []
    for id, score in hits.get_ids_and_scores():
        cansmiles = id_to_cansmiles.get(id, None)
        results.append( (id, score, cansmiles, query_cansmiles == cansmiles) )

and return the results, along with some timing information I haven't yet described:

    return jsonify({
        "status": "ok",
        "message": "",
        "search_time": fp_search_time,
        "lookup_time": smiles_lookup_time,
        "total_time": time.time() - start_time,
        "hits": results,
        })

The GUI

The code to display the search page is simple, because it's just returning a HTML as a string.

@app.route("/")
def homepage():
    return Response(js_code, mimetype="text/html")

But how does the Javascript work? You should be aware that I am not as good of a Javascript programmer as I am a Python programmer. What you see here is advice, and not a recommendation as for best practices.

The main part of it says that if any of the values for the "threshold", "k", or "smiles" form elements should change, then call make_request(), and when the form starts also call make_request().

  $("#threshold").change(make_request);
  $("#k").change(make_request);
  $("#smiles").on("input", make_request);
  make_request();  // In case fields remain after a browser refresh.

The make_request() function starts by getting request parameters from the different form element, though if the query is empty it won't do a new search.

var make_request = function() {
  var k = $("#k").val();
  var threshold = $("#threshold").val();  
  var smiles = $("#smiles").val();
  if (smiles === "") {
    $("#request").text("(empty query)");
    return;
  }

Otherwise it reports the query parameters and starts an AJAX request with the correct parameters. This request can either succeed or fail. I'll leave out the success part for now. On failure you can see that it adds a message to the "error" text and clears out the "message" text.

  $("#request").text("threshold=" + threshold + " k=" + k + " q=" + smiles);
  $.ajax({
      url: "search",
      data: [
        {name: "q", value: smiles},
        {name: "k", value: k},
        {name: "threshold", value: threshold}
        ]
  }).done(function(data) {

       ... code to call on success .. see below ....

  }).fail(function() {
    $("#error").text("Server failure");
    $("#message").text("");
  });

"Success" is relative. It might successfully report that the SMILES string was invalid, so unless the status is "ok" I'll report the resulting message as an error:

    if (data["status"] !== "ok") {
      $("#error").text(data["message"]);
      $("#message").text("");
      return;
    }

Otherwise clear the error message and report some timing numbers.

    $("#error").text("");
    var hits = data["hits"];
    var round_trip_time = (new Date() - start_time) / 1000.0;
    $("#message").text("Found " + hits.length + " hits. Times for " +
                       " Search: " + data["search_time"].toFixed(3) +
                       " Lookup: " + data["lookup_time"].toFixed(3) +
                       " Handler: " + data["total_time"].toFixed(3) +
                       " Round-trip: " + round_trip_time.toFixed(3) + " seconds.");

The timings here are all in seconds, and reported to 3 digits. The meanings are:

Search: the time for the fingerprint search
Lookup: the time to get the SMILES strings
Handler: the total time in my Flask handler
Round-trip: the elapsed time from just before the AJAX request to just before reporting the time

I then clear the table rows that contain the results, and convert the results into new rows. It's a bit more complicated than I would like; this is where client-side templates can be nice.

    $(".result-row").remove();
    var tbody = $("tbody");

    $.each(data["hits"], function(rowno, row) {
        var tr = $(document.createElement("tr")).addClass("result-row");
        tbody.append(tr);
        $.each(row, function(colno, value) {
            var ele = $(document.createElement("td"));
            if (colno == 0) {
              // The identifier, wrapped in a hyperlink to the orginal SD record
              var a = $(document.createElement("a"));
              a.attr({"href": value + ".sdf"});
              a.text(value);
              ele.append(a);
            } else if (colno == 1) {
              // The score, to three decimals
              ele.addClass("score");
              ele.text(value.toFixed(3));
            } else {
              ele.text(value);
            }
            tr.append(ele);
        });
    });

The last part is to change the "Download" button. Because searches are so fast, all I do is tell it the new query parameters, and let the user choose the download format. The form download is actually a regular query against the "/search" handler. This is easier than storing all of the ids and scores from the current result and making a new API to somehow resynthesize the correct results.

Here's the form I'm updating:

<form action="search" method="GET">
<input type="hidden" name="q">
<input type="hidden" name="k">
<input type="hidden" name="threshold">
Choose format:
<select name="oformat">
< <option value="ids">ids</option>
< <option value="smiles">SMILES</option>
< <option value="sdf" SELECTED>SDF</option>
< <option value="json">JSON</option>
</select>
<input id="submit-button" type="submit" disabled="disabled" value="Download" />
</form>

and the update code is:

    $("form input[name=q]").val(smiles);
    $("form input[name=k]").val(k);
    $("form input[name=threshold]").val(threshold);
    $("#submit-button").prop("disabled", data["hits"].length == 0);
    if (data["hits"].length == 1) {
      $("#submit-button").val("Download 1 record");
    } else {
      $("#submit-button").val("Download " + data["hits"].length + " records");
    }

I also did a bit extra to report the number of available records to download, and don't allow a download if there are no records.

The final code

I left out a few pieces here and there, like most of the HTML and some of the timing details. Here's the final code:

import sys
import sqlite3
import zlib
import time

from flask import Flask, Response, jsonify, request

import chemfp
from chemfp import text_toolkit


database_filename = "chembl_19.sqlite3"
fingerprint_filename = "chembl_19.fpb"

# Threads and OpenMP don't work well under Mac OS X.
# Tell chemfp to not use OpenMP.
chemfp.set_num_threads(1)


db = sqlite3.connect(database_filename, check_same_thread=False)
cursor = db.cursor()
try:
    num_compounds, = cursor.execute("SELECT COUNT(*) FROM Compounds").fetchone()
except sqlite3.ChemFPError, err:
    raise SystemExit("Cannot connect to SQLite database %r: %s" % (database_filename, err))
del cursor  # remove the global cursor so there's no chance of accidental reuse

if num_compounds == 0:
    sys.stderr.write("WARNING: Database %r contains no compounds." % (database_filename,))

# Load the fingerprints
arena = chemfp.load_fingerprints(fingerprint_filename)

# Use the fingerprint type string from the metadata to get the fingerprint type object
try:
    fptype = arena.get_fingerprint_type()
except chemfp.ChemFPError, err:
    raise SystemExit("Unable to load the fingerprint type for fingerprint file %r: %s"
                     % (fingerprint_filename, err))

# Get the toolkit associated with this fingerprint type
toolkit = fptype.toolkit
if toolkit is None:
    raise SystemExit("No toolkit available to generate fingerprints of type %r"
                     % (fptype.base_name,))

# Report information about the number of compounds and fingerprints
if num_compounds == len(arena):
    sys.stderr.write("Loaded %d compounds and fingerprints of type %r\n"
                     % (num_compounds, fptype.get_type()))
else:
    sys.stderr.write("Loaded %d compounds and %d fingerprints of type %r\n"
                     % (num_compounds, len(arena), fptype.get_type()))

app = Flask(__name__)

@app.errorhandler(404)
def page_not_found(error=None):
    return "Not found", 404

js_code = """<html>
<head>
 <title>Search demo</title>
 <script type="text/javascript" src="https://code.jquery.com/jquery-2.1.1.min.js"></script>
 <style>
#error {
  color: red;
}
table {
  border-collapse: collapse;
}
table tr:first-child + tr td {
 border-top: 1px solid black;
}
table tr:last-child td {
 border-bottom: 1px solid black;
}
.score {
  width: 5em;
}
 </style>
</head>
<body>
<h2>Search demo</h2>
Target contains NUM_FINGERPRINTS fingerprints of type FINGERPRINT_TYPE<br />
<br />
<div>
 k: <select id="k">
   <option value="1">1</option>
   <option value="3">3</option>
   <option value="5" SELECTED>5</option>
   <option value="10">10</option>
   <option value="100">100</option>
   <option value="1000">1000</option>
  </select>
 threshold: <select id="threshold">
   <option value="1.0">100%</option>
   <option value="0.99">99%</option>
   <option value="0.98">98%</option>
   <option value="0.95">95%</option>
   <option value="0.9">90%</option>
   <option value="0.8 SELECTED">80%</option>
   <option value="0.7">70%</option>
   <option value="0.5">50%</option>
   <option value="0.2">20%</option>
   <option value="0.0">0%</option>
  </select>
 SMILES or id: <input id="smiles" size="50"></input>
</div>
<div>
<br />
Request: <span id="request"></span><br />
Response: <span id="error"></span><span id="message">(no response available)</span>
<table>
<tr><th>id</th><th class="score">score</th><th>SMILES</th><th>Exact match?</th></tr>
</table>
</div>
<form action="search" method="GET">
<input type="hidden" name="q">
<input type="hidden" name="k">
<input type="hidden" name="threshold">
Choose format:
<select name="oformat">
  <option value="ids">ids</option>
  <option value="smiles">SMILES</option>
  <option value="sdf" SELECTED>SDF</option>
  <option value="json">JSON</option>
</select>
<input id="submit-button" type="submit" disabled="disabled" value="Download" />
</form>

<script type="text/javascript">
var make_request = function() {
  var k = $("#k").val();
  var threshold = $("#threshold").val();  
  var smiles = $("#smiles").val();
  if (smiles === "") {
    $("#request").text("(empty query)");
    return;
  }
  var start_time = new Date();
  $("#request").text("threshold=" + threshold + " k=" + k + " q=" + smiles);
  $.ajax({
      url: "search",
      data: [
        {name: "q", value: smiles},
        {name: "k", value: k},
        {name: "threshold", value: threshold}
        ]
  }).done(function(data) {
    if (data["status"] !== "ok") {
      $("#error").text(data["message"]);
      $("#message").text("");
      return;
    }
    // Summarize the timings
    $("#error").text("");
    var hits = data["hits"];
    var round_trip_time = (new Date() - start_time) / 1000.0;
    $("#message").text("Found " + hits.length + " hits. Times for " +
                       " Search: " + data["search_time"].toFixed(3) +
                       " Lookup: " + data["lookup_time"].toFixed(3) +
                       " Handler: " + data["total_time"].toFixed(3) +
                       " Round-trip: " + round_trip_time.toFixed(3) + " seconds.");

    // Display the results as rows in the table
    $(".result-row").remove();
    var tbody = $("tbody");

    $.each(data["hits"], function(rowno, row) {
        var tr = $(document.createElement("tr")).addClass("result-row");
        tbody.append(tr);
        $.each(row, function(colno, value) {
            var ele = $(document.createElement("td"));
            if (colno == 0) {
              // The identifier, wrapped in a hyperlink to the orginal SD record
              var a = $(document.createElement("a"));
              a.attr({"href": value + ".sdf"});
              a.text(value);
              ele.append(a);
            } else if (colno == 1) {
              // The score, to three decimals
              ele.addClass("score");
              ele.text(value.toFixed(3));
            } else {
              ele.text(value);
            }
            tr.append(ele);
        });
    });
    // Configure the download form
    $("form input[name=q]").val(smiles);
    $("form input[name=k]").val(k);
    $("form input[name=threshold]").val(threshold);
    $("#submit-button").prop("disabled", data["hits"].length == 0);
    if (data["hits"].length == 1) {
      $("#submit-button").val("Download 1 record");
    } else {
      $("#submit-button").val("Download " + data["hits"].length + " records");
    }
    
  }).fail(function() {
    $("#error").text("Server failure");
    $("#message").text("");
  });

};

$().ready(function() {
  $("#threshold").change(make_request);
  $("#k").change(make_request);
  $("#smiles").on("input", make_request);
  make_request();  // In case fields remain after a browser refresh.
});
</script>

</body>
""".replace("NUM_FINGERPRINTS", str(len(arena))).replace("FINGERPRINT_TYPE", fptype.get_type())


# Return the client-side application
@app.route("/")
def homepage():
    return Response(js_code, mimetype="text/html")
       

# Parse the query as either a SMILES string or a database entry
# Return the (error string, None), or (None, parsed molecule)
def _get_molecule(cursor, query):
    if not query:
        return None, None

    # Is this a valid SMILES?
    try:
        mol = toolkit.parse_molecule(query, "smistring")
    except ValueError, err:
        # No. Then is this an identifier?
        match = cursor.execute("SELECT smiles FROM Compounds WHERE id = ?", (query,)).fetchone()
        if not match:
            return "Not a valid SMILES or identifier", None
        smiles = match[0]
        try:
            mol = toolkit.parse_molecule(smiles, "smistring")
        except ValueError:
            return "Invalid SMILES in database!", None

    return None, mol

# Return an error document in JSON format
def _error_response(errmsg):
    return jsonify({
       "status": "error",
       "message": errmsg,
       "columns": ["id", "score", "smiles", "exact match"],
       "hits": [],
       })

# Get the selected columns for the given hits
def get_columns(cursor, columns, hits):
    sql_query = "SELECT %s FROM Compounds WHERE id IN (" % (", ".join(columns),)
    sql_query += ",".join("?"*len(hits)) + ")"
    cursor.execute(sql_query, tuple(hits.get_ids()))
    return cursor


# Similarity search
@app.route("/search")
def similarity_search():
    start_time = time.time()
    q = request.args.get("q", "").strip()
    k = request.args.get("k", "3")
    oformat = request.args.get("oformat", "json")
    threshold = request.args.get("threshold", "0.0")

    cursor = db.cursor()

    # Flask queries are Unicode strings, but SMILES and record identifiers
    # can only be ASCII, so convert and report any failures
    try:
        q = str(q)
    except UnicodeEncodeError:
        return _error_response("non-ASCII query")
    errmsg, mol = _get_molecule(cursor, q)
    if errmsg:
        return _error_response(errmsg)
    try:
        k = int(k)
    except ValueError:
        return _error_response("bad value for k")
    try:
        threshold = float(threshold)
    except ValueError:
        return _error_response("bad value for threshold")
    if oformat not in ("json", "ids", "smiles", "sdf"):
        oformat = "json"

    if mol is None:
        return jsonify({
          "status": "ok",
          "message": "",
          "search_time": 0.0,
          "lookup_time": 0.0,
          "total_time": time.time() - start_time,
          "hits": [],
        })
    
    # Compute the fingerprint and do the similarity search
    fp = fptype.compute_fingerprint(mol)
    t1 = time.time()
    hits = arena.knearest_tanimoto_search_fp(fp, k=k, threshold=threshold)
    fp_search_time = time.time() - t1

    if oformat == "ids":
        return Response("\n".join(hits.get_ids()) + "\n", mimetype="text/plain")
    
    if oformat == "smiles":
        id_to_cansmiles = dict(get_columns(cursor, ["id", "smiles"], hits))
        lines = []
        for id, score in hits.get_ids_and_scores():
            smiles = id_to_cansmiles.get(id, "MISSING")
            lines.append("%s\t%s\t%.3f\n" % (smiles, id, score))
        return Response("".join(lines), mimetype="text/plain")

    if oformat == "sdf":
        id_to_sdfz = dict(get_columns(cursor, ["id", "record"], hits))
        records = []
        for id, score in hits.get_ids_and_scores():
            recordz = id_to_sdfz.get(id, None)
            if recordz is None:
                continue
            record = zlib.decompress(recordz)
            text_mol = text_toolkit.parse_molecule(record, "sdf")
            text_mol.add_tag("SCORE", "%.3f" % (score,))
            writers.append(text_toolkit.create_string(text_mol, "sdf"))

        return Response("".join(writer), mimetype="text/plain")

    # Otherwise it's a json output format
    query_cansmiles = toolkit.create_string(mol, "canstring")
    t1 = time.time()
    id_to_cansmiles = dict(get_columns(cursor, ["id", "smiles"], hits))
    smiles_lookup_time = time.time() - t1

    # get the (id, score, cansmiles, and SMILES match flag)
    results = []
    for id, score in hits.get_ids_and_scores():
        cansmiles = id_to_cansmiles.get(id, None)
        results.append( (id, score, cansmiles, query_cansmiles == cansmiles) )

    return jsonify({
        "status": "ok",
        "message": "",
        "search_time": fp_search_time,
        "lookup_time": smiles_lookup_time,
        "total_time": time.time() - start_time,
        "hits": results,
        })
    

# get a record given its id, or return "404 Not Found"
@app.route("/<string:id>.sdf")
def get_record(id):
    cursor = db.cursor()
    cursor.execute("SELECT record FROM Compounds WHERE id = ?", (id,))
    row = cursor.fetchone()
    if not row:
        page_not_found()

    record = zlib.decompress(row[0])
    return Response(record, mimetype="text/plain")

if __name__ == "__main__":
    #app.run(debug=False, host="0.0.0.0")
    app.run(debug=True)

Andrew Dalke is an independent consultant focusing on software development for computational chemistry and biology. Need contract programming, help, or training? Contact me