
Similarity web service #

Now that I've indexed ChEMBL for record lookup and similarity search, I can write a simple interactive web server. By "interactive" I mean that it will show you results, if available, as you enter the SMILES or record id. This is possible because chemfp takes under 0.1 seconds to search the 1.4 million fingerprints of ChEMBL and report the results. By "simple" I mean there will be no sketcher, just a plain-looking user interface. I'll also stick to using jQuery for the browser-side code; I usually use Knockout for this sort of app.

I also decided to try out the Flask web development framework instead of using Django. This has the advantage that my web service code fits into a single file.

Here's what the result looks like, so you can see where I'm going:

Multiple threads

Flask is a multi-threaded server, which means I need to configure a few things so my code works correctly.

The record data is in a SQLite database. SQLite is threadsafe but "Older SQLite versions had issues with sharing connections between threads. That's why the Python module disallows sharing connections and cursors between threads."

I don't write to the database, and I'll be careful to create a new cursor for each request, so I'm safe. However, I do need to disable that check, which I do through the check_same_thread parameter:

import sqlite3

database_filename = "chembl_19.sqlite3"
db = sqlite3.connect(database_filename, check_same_thread=False)
Chemfp can be compiled for OpenMP, which is the default. Unfortunately, on the Mac, multi-threaded code and OpenMP don't mix. I need to tell chemfp to always use the non-OpenMP code path, as otherwise the program will crash on the first similarity search:
import chemfp

chemfp.set_num_threads(1)

Load the data sets

I'll open the SQLite database and make sure it has compounds, then open the fingerprint file and make sure it has fingerprints. Hopefully the two are the same! I made sure when I created the fingerprint file to include the fingerprint type in the metadata. Now I can use the type string to get the fingerprint type object, and from that I can get the correct toolkit to use.

database_filename = "chembl_19.sqlite3"
fingerprint_filename = "chembl_19.fpb"

db = sqlite3.connect(database_filename, check_same_thread=False)
cursor = db.cursor()
try:
    num_compounds, = cursor.execute("SELECT COUNT(*) FROM Compounds").fetchone()
except sqlite3.Error, err:
    raise SystemExit("Cannot connect to SQLite database %r: %s" % (database_filename, err))
del cursor  # remove the global cursor so there's no chance of accidental reuse

if num_compounds == 0:
    sys.stderr.write("WARNING: Database %r contains no compounds." % (database_filename,))

# Load the fingerprints
arena = chemfp.load_fingerprints(fingerprint_filename)

# Use the fingerprint type string from the metadata to get the fingerprint type object
try:
    fptype = arena.get_fingerprint_type()
except chemfp.ChemFPError, err:
    raise SystemExit("Unable to load the fingerprint type for fingerprint file %r: %s"
                     % (fingerprint_filename, err))

# Get the toolkit associated with this fingerprint type
toolkit = fptype.toolkit
if toolkit is None:
    raise SystemExit("No toolkit available to generate fingerprints of type %r"
                     % (fptype.base_name,))

# Report information about the number of compounds and fingerprints
if num_compounds == len(arena):
    sys.stderr.write("Loaded %d compounds and fingerprints of type %r\n"
                     % (num_compounds, fptype.get_type()))
else:
    sys.stderr.write("Loaded %d compounds and %d fingerprints of type %r\n"
                     % (num_compounds, len(arena), fptype.get_type()))
When my server starts (or, in debug mode, when it restarts) it prints:
Loaded 1404532 compounds and fingerprints of type 'RDKit-Fingerprint/2 minPath=1 maxPath=7 fpSize=2048 nBitsPerHash=2 useHs=1'
The FPB format in chemfp-1.2 was developed in part so that web server reloads, which happen often during web development, would be quick. I basically don't notice the time it takes for the above code to run.
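
To get a feel for that, here's a minimal timing sketch. It isn't part of the original server code, and it assumes the chembl_19.fpb file built in the indexing essay below is in the current directory.

import time
import chemfp

# Time how long it takes to open the FPB file as a search arena.
t0 = time.time()
arena = chemfp.load_fingerprints("chembl_19.fpb")
print "Loaded %d fingerprints in %.3f seconds" % (len(arena), time.time() - t0)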

Defining the web service API

1) I'll have a single similarity search API, named '/search', which takes the following parameters: 'q' (the query, either a SMILES string or a record id), 'k' (the number of nearest neighbors to return; default 3), 'threshold' (the minimum similarity score; default 0.0), and 'oformat' (the output format, one of 'json', 'ids', 'smiles', or 'sdf'; default 'json').

2) I'll have the home page ('/') return an interactive search browser.

3) I'll have a way to get the record for CHEMBL286759 by going to a URL like "/CHEMBL286759.sdf". In Flask terminology, I need to match anything of the form "/<string:id>.sdf", and if the given id doesn't exist I'll return a 404 NOT FOUND error.
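
To make the API concrete, here's a sketch of how a client might call the '/search' endpoint. It isn't part of the essay's code; it assumes the server is running locally on Flask's default port, 5000.

import json
import urllib
import urllib2

# Build the query string and ask for the JSON output format.
params = urllib.urlencode({"q": "CNC(=O)c1[nH]c2ccc(Cl)cc2c1Sc1ccccc1",
                           "k": 5, "threshold": 0.7, "oformat": "json"})
reply = urllib2.urlopen("http://localhost:5000/search?" + params).read()
# Each hit is an (id, score, SMILES, exact-match flag) list.
for id, score, smiles, is_match in json.loads(reply)["hits"]:
    print id, score, smiles, is_match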

All of them require hooking on to the Flask 'app', which is defined like this:

from flask import Flask, Response, jsonify, request

app = Flask(__name__)

Get a record given its id

I'll start with the easiest, where "/CHEMBL286759.sdf" returns the original SDF record for CHEMBL286759.

# get a record given its id, or return "404 Not Found"
@app.route("/<string:id>.sdf")
def get_record(id):
    cursor = db.cursor()
    cursor.execute("SELECT record FROM Compounds WHERE id = ?", (id,))
    row = cursor.fetchone()
    if not row:
        return page_not_found()

    record = zlib.decompress(row[0])
    return Response(record, mimetype="text/plain")

I don't think it needs much explanation, but it might help to look at the interactive example from my previous essay.

This calls an error handler function to report the 404 NOT FOUND error, which is:

@app.errorhandler(404)
def page_not_found(error=None):
    return "Not found", 404

Similarity search - parse query arguments

The similarity search code is pretty involved, mostly because it needs to handle different input and output cases. The input handling contains a lot of code like this:

# Similarity search
@app.route("/search")
def similarity_search():
    ...
    k = request.args.get("k", "3")
    ...
    try:
        k = int(k)
    except ValueError:
        return _error_response("bad value for k")
which gets the query parameter "k" if it exists, or uses the default of "3" if it doesn't. It then converts the value into an integer, and if that fails it returns an error response. The _error_response() function constructs a JSON document:
def _error_response(errmsg):
    return jsonify({
       "status": "error",
       "message": errmsg,
       "columns": ["id", "score", "smiles", "exact match"],
       "hits": [],
       })

Similarity search - handle the SMILES or record id

The "q" parameter is more complicated, because it can be a SMILES string or a query id. I'll resolve this by trying to parse it as a SMILES string. If that fails, I'll see if there's a record in the database with that name, which also means I need a new cursor for the database connection:

def similarity_search():
    ...
    q = request.args.get("q", "").strip()
    ...
    cursor = db.cursor()
    
    # Flask queries are Unicode strings, but SMILES and record identifiers
    # can only be ASCII, so convert and report any failures
    try:
        q = str(q)
    except UnicodeEncodeError:
        return _error_response("non-ASCII query")
    errmsg, mol = _get_molecule(cursor, q)
    if errmsg:
        return _error_response(errmsg)
    ...
Here's where I implement the business logic. Given the toolkit, parse the query as a SMILES. If that fails (a chemfp.ParseError is also a ValueError), then look up the SMILES for the given id. If there is no such id, or if that SMILES cannot be parsed, then report an error. Otherwise return the newly parsed molecule object. (It's actually a two-value result, where the first is the error message and the second is the molecule. At least one of them will be None.)
def _get_molecule(cursor, query):
    if not query:
        return None, None

    # Is this a valid SMILES?
    try:
        mol = toolkit.parse_molecule(query, "smistring")
    except ValueError, err:
        # No. Then is this an identifier?
        match = cursor.execute("SELECT smiles FROM Compounds WHERE id = ?", (query,)).fetchone()
        if not match:
            return "Not a valid SMILES or identifier", None
        smiles = match[0]
        try:
            mol = toolkit.parse_molecule(smiles, "smistring")
        except ValueError:
            return "Invalid SMILES in database!", None

    return None, mol

Similarity search - compute the fingerprint

It's only a couple of lines to compute the fingerprint for the molecule and search the FPB file for the k-nearest neighbors within the given threshold.

    fp = fptype.compute_fingerprint(mol)
    hits = arena.knearest_tanimoto_search_fp(fp, k=k, threshold=threshold)

Similarity search - handle the oformat parameter and return ids

The 'oformat', short for 'output format', specifies the desired output format if there are no errors. The "ids" output format returns the hits as a list of target identifiers, with one id per line.

def similarity_search():
    ...
    oformat = request.args.get("oformat", "json")
    ...
    if oformat not in ("json", "ids", "smiles", "sdf"):
        oformat = "json"
    ...
    hits = arena.knearest_tanimoto_search_fp(fp, k=k, threshold=threshold)
    ...
    if oformat == "ids":
        return Response("\n".join(hits.get_ids()) + "\n", mimetype="text/plain")
    ...

Similarity search - oformat == "smiles"

Returning SMILES is a bit more complicated because I need to get the SMILES string for each hit id, and that requires a database lookup. I decided to make a function called get_columns(), which takes the list of requested columns and the hits of interest, and fetches those columns for the hit ids.

As far as I can tell, there's no easy way to ask for all of the rows for a given list of ids. There is the "WHERE id IN (?, ?, ... ?)" construction, but I need a '?' for each query id. That's not hard to do, because this is Python, but I feel like there should be a better way to make the query.

def get_columns(cursor, columns, hits):
    sql_query = "SELECT %s FROM Compounds WHERE id IN (" % (", ".join(columns),)
    sql_query += ",".join("?"*len(hits)) + ")"
    cursor.execute(sql_query, tuple(hits.get_ids()))
    return cursor
I use the get_columns() function to get a list of 2-element rows, which I convert into a dictionary. There's the off chance that the SQLite database and the fingerprint database can get out of synchronization. A loop through the hits, which has the ids and scores, also gives me the SMILES. If a given id is missing, I'll use "MISSING" as the SMILES string for it. And I'll format the score to three decimal places.
    if oformat == "smiles":
        id_to_cansmiles = dict(get_columns(cursor, ["id", "smiles"], hits))
        lines = []
        for id, score in hits.get_ids_and_scores():
            smiles = id_to_cansmiles.get(id, "MISSING")
            lines.append("%s\t%s\t%.3f\n" % (smiles, id, score))
        return Response("".join(lines), mimetype="text/plain")

The text_toolkit and "text molecules"

The SDF output is a bit more complicated because I need to decompress the compressed SD record, and because I want to modify the original record to have a new tag containing the "SCORE". For that I'll use the text_toolkit, but this time for its "text molecule" API. It implements the same toolkit API as rdkit_toolkit, openeye_toolkit and openbabel_toolkit, but instead of real molecule objects that know chemistry, a text molecule is really just a light-weight container for the record, with enough knowledge of SD records to make some light changes to the tags.

Here's what I mean. I'll start with a record (I'll leave out the left side of Python's interactive shell to make it easier to copy and paste).

record = """\
CHEMBL17564
  SciTegic10311311082D

  1  0  0  0  0  0            999 V2000
    0.0000    0.0000    0.0000 C   0  0
M  END
> <chembl_id>
CHEMBL17564

$$$$
"""
I'll then "parse" the record into a text molecule using the text_toolkit, and then use some of the SD tag API:
>>> from chemfp import text_toolkit
>>> text_mol = text_toolkit.parse_molecule(record, "sdf")
>>> text_mol.get_tag("chembl_id")
'CHEMBL17564'
>>> text_mol.get_tag_pairs()
[('chembl_id', 'CHEMBL17564')]
>>> text_mol.add_tag("SCORE", "0.567")
>>> text_mol.get_tag_pairs()
[('chembl_id', 'CHEMBL17564'), ('SCORE', '0.567')]
Now I'll convert the modified record back into a string, with the new tag(s) at the end:
>>> print text_toolkit.create_string(text_mol, "sdf")
CHEMBL17564
  SciTegic10311311082D

  1  0  0  0  0  0            999 V2000
    0.0000    0.0000    0.0000 C   0  0
M  END
> <chembl_id>
CHEMBL17564

> <SCORE>
0.567

$$$$
As a side note, I think this API is too complicated. A future version of chemfp will likely have an "add_sdf_tag" which operates directly on strings rather than through text molecule objects.

Similarity search - oformat == "sdf"

Once you know how to use the text_toolkit to add an SD tag, the SDF output format handler is pretty straight-forward:

    if oformat == "sdf":
        id_to_sdfz = dict(get_columns(cursor, ["id", "record"], hits))
        records = []
        for id, score in hits.get_ids_and_scores():
            recordz = id_to_sdfz.get(id, None)
            if recordz is None:
                continue
            record = zlib.decompress(recordz)
            text_mol = text_toolkit.parse_molecule(record, "sdf")
            text_mol.add_tag("SCORE", "%.3f" % (score,))
            records.append(text_toolkit.create_string(text_mol, "sdf"))

        return Response("".join(records), mimetype="text/plain")

Similarity search - oformat == "json"

I decided that I want the JSON output to have the id, score, and SMILES string for each hit, plus a flag indicating whether the hit's canonical SMILES exactly matches the query's:

    query_cansmiles = toolkit.create_string(mol, "canstring")
I then need to get the ids and smiles for the hits
    id_to_cansmiles = dict(get_columns(cursor, ["id", "smiles"], hits))
get the results, as a 4-element tuple of (id, score, SMILES, and match flag)
    results = []
    for id, score in hits.get_ids_and_scores():
        cansmiles = id_to_cansmiles.get(id, None)
        results.append( (id, score, cansmiles, query_cansmiles == cansmiles) )
and return the results, along with some timing information I haven't yet described:
    return jsonify({
        "status": "ok",
        "message": "",
        "search_time": fp_search_time,
        "lookup_time": smiles_lookup_time,
        "total_time": time.time() - start_time,
        "hits": results,
        })

The GUI

The code to display the search page is simple, because it just returns the HTML as a string.

@app.route("/")
def homepage():
    return Response(js_code, mimetype="text/html")
But how does the Javascript work? You should be aware that I am not as good a Javascript programmer as I am a Python programmer. What you see here works, but don't take it as a recommendation for best practices.

The main part of it says that if any of the values for the "threshold", "k", or "smiles" form elements change, then call make_request(). It also calls make_request() once when the page first loads.

  $("#threshold").change(make_request);
  $("#k").change(make_request);
  $("#smiles").on("input", make_request);
  make_request();  // In case fields remain after a browser refresh.
The make_request() function starts by getting the request parameters from the different form elements, though if the query is empty it won't do a new search.
var make_request = function() {
  var k = $("#k").val();
  var threshold = $("#threshold").val();  
  var smiles = $("#smiles").val();
  if (smiles === "") {
    $("#request").text("(empty query)");
    return;
  }
Otherwise it reports the query parameters and starts an AJAX request with the correct parameters. This request can either succeed or fail. I'll leave out the success part for now. On failure you can see that it adds a message to the "error" text and clears out the "message" text.
  $("#request").text("threshold=" + threshold + " k=" + k + " q=" + smiles);
  $.ajax({
      url: "search",
      data: [
        {name: "q", value: smiles},
        {name: "k", value: k},
        {name: "threshold", value: threshold}
        ]
  }).done(function(data) {

       ... code to call on success .. see below ....

  }).fail(function() {
    $("#error").text("Server failure");
    $("#message").text("");
  });

"Success" is relative. It might successfully report that the SMILES string was invalid, so unless the status is "ok" I'll report the resulting message as an error:

    if (data["status"] !== "ok") {
      $("#error").text(data["message"]);
      $("#message").text("");
      return;
    }
Otherwise clear the error message and report some timing numbers.
    $("#error").text("");
    var hits = data["hits"];
    var round_trip_time = (new Date() - start_time) / 1000.0;
    $("#message").text("Found " + hits.length + " hits. Times for " +
                       " Search: " + data["search_time"].toFixed(3) +
                       " Lookup: " + data["lookup_time"].toFixed(3) +
                       " Handler: " + data["total_time"].toFixed(3) +
                       " Round-trip: " + round_trip_time.toFixed(3) + " seconds.");
The timings here are all in seconds, reported to 3 decimal places. "Search" is the fingerprint similarity search time, "Lookup" is the time to get the hit SMILES from SQLite, "Handler" is the total time spent in the server's /search handler, and "Round-trip" is the full request time as measured by the browser.

I then clear the table rows that contain the results, and convert the results into new rows. It's a bit more complicated than I would like; this is where client-side templates can be nice.

    $(".result-row").remove();
    var tbody = $("tbody");

    $.each(data["hits"], function(rowno, row) {
        var tr = $(document.createElement("tr")).addClass("result-row");
        tbody.append(tr);
        $.each(row, function(colno, value) {
            var ele = $(document.createElement("td"));
            if (colno == 0) {
              // The identifier, wrapped in a hyperlink to the original SD record
              var a = $(document.createElement("a"));
              a.attr({"href": value + ".sdf"});
              a.text(value);
              ele.append(a);
            } else if (colno == 1) {
              // The score, to three decimals
              ele.addClass("score");
              ele.text(value.toFixed(3));
            } else {
              ele.text(value);
            }
            tr.append(ele);
        });
    });

The last part is to change the "Download" button. Because searches are so fast, all I do is tell it the new query parameters, and let the user choose the download format. The form download is actually a regular query against the "/search" handler. This is easier than storing all of the ids and scores from the current result and making a new API to somehow resynthesize the correct results.

Here's the form I'm updating:

<form action="search" method="GET">
<input type="hidden" name="q">
<input type="hidden" name="k">
<input type="hidden" name="threshold">
Choose format:
<select name="oformat">
< <option value="ids">ids</option>
< <option value="smiles">SMILES</option>
< <option value="sdf" SELECTED>SDF</option>
< <option value="json">JSON</option>
</select>
<input id="submit-button" type="submit" disabled="disabled" value="Download" />
</form>
and the update code is:
    $("form input[name=q]").val(smiles);
    $("form input[name=k]").val(k);
    $("form input[name=threshold]").val(threshold);
    $("#submit-button").prop("disabled", data["hits"].length == 0);
    if (data["hits"].length == 1) {
      $("#submit-button").val("Download 1 record");
    } else {
      $("#submit-button").val("Download " + data["hits"].length + " records");
    }
I also did a bit extra to report the number of records available to download, and to disable the download button when there are none.

The final code

I left out a few pieces here and there, like most of the HTML and some of the timing details. Here's the final code:

import sys
import sqlite3
import zlib
import time

from flask import Flask, Response, jsonify, request

import chemfp
from chemfp import text_toolkit


database_filename = "chembl_19.sqlite3"
fingerprint_filename = "chembl_19.fpb"

# Threads and OpenMP don't work well under Mac OS X.
# Tell chemfp to not use OpenMP.
chemfp.set_num_threads(1)


db = sqlite3.connect(database_filename, check_same_thread=False)
cursor = db.cursor()
try:
    num_compounds, = cursor.execute("SELECT COUNT(*) FROM Compounds").fetchone()
except sqlite3.Error, err:
    raise SystemExit("Cannot connect to SQLite database %r: %s" % (database_filename, err))
del cursor  # remove the global cursor so there's no chance of accidental reuse

if num_compounds == 0:
    sys.stderr.write("WARNING: Database %r contains no compounds." % (database_filename,))

# Load the fingerprints
arena = chemfp.load_fingerprints(fingerprint_filename)

# Use the fingerprint type string from the metadata to get the fingerprint type object
try:
    fptype = arena.get_fingerprint_type()
except chemfp.ChemFPError, err:
    raise SystemExit("Unable to load the fingerprint type for fingerprint file %r: %s"
                     % (fingerprint_filename, err))

# Get the toolkit associated with this fingerprint type
toolkit = fptype.toolkit
if toolkit is None:
    raise SystemExit("No toolkit available to generate fingerprints of type %r"
                     % (fptype.base_name,))

# Report information about the number of compounds and fingerprints
if num_compounds == len(arena):
    sys.stderr.write("Loaded %d compounds and fingerprints of type %r\n"
                     % (num_compounds, fptype.get_type()))
else:
    sys.stderr.write("Loaded %d compounds and %d fingerprints of type %r\n"
                     % (num_compounds, len(arena), fptype.get_type()))

app = Flask(__name__)

@app.errorhandler(404)
def page_not_found(error=None):
    return "Not found", 404

js_code = """<html>
<head>
 <title>Search demo</title>
 <script type="text/javascript" src="https://code.jquery.com/jquery-2.1.1.min.js"></script>
 <style>
#error {
  color: red;
}
table {
  border-collapse: collapse;
}
table tr:first-child + tr td {
 border-top: 1px solid black;
}
table tr:last-child td {
 border-bottom: 1px solid black;
}
.score {
  width: 5em;
}
 </style>
</head>
<body>
<h2>Search demo</h2>
Target contains NUM_FINGERPRINTS fingerprints of type FINGERPRINT_TYPE<br />
<br />
<div>
 k: <select id="k">
   <option value="1">1</option>
   <option value="3">3</option>
   <option value="5" SELECTED>5</option>
   <option value="10">10</option>
   <option value="100">100</option>
   <option value="1000">1000</option>
  </select>
 threshold: <select id="threshold">
   <option value="1.0">100%</option>
   <option value="0.99">99%</option>
   <option value="0.98">98%</option>
   <option value="0.95">95%</option>
   <option value="0.9">90%</option>
   <option value="0.8 SELECTED">80%</option>
   <option value="0.7">70%</option>
   <option value="0.5">50%</option>
   <option value="0.2">20%</option>
   <option value="0.0">0%</option>
  </select>
 SMILES or id: <input id="smiles" size="50"></input>
</div>
<div>
<br />
Request: <span id="request"></span><br />
Response: <span id="error"></span><span id="message">(no response available)</span>
<table>
<tr><th>id</th><th class="score">score</th><th>SMILES</th><th>Exact match?</th></tr>
</table>
</div>
<form action="search" method="GET">
<input type="hidden" name="q">
<input type="hidden" name="k">
<input type="hidden" name="threshold">
Choose format:
<select name="oformat">
  <option value="ids">ids</option>
  <option value="smiles">SMILES</option>
  <option value="sdf" SELECTED>SDF</option>
  <option value="json">JSON</option>
</select>
<input id="submit-button" type="submit" disabled="disabled" value="Download" />
</form>

<script type="text/javascript">
var make_request = function() {
  var k = $("#k").val();
  var threshold = $("#threshold").val();  
  var smiles = $("#smiles").val();
  if (smiles === "") {
    $("#request").text("(empty query)");
    return;
  }
  var start_time = new Date();
  $("#request").text("threshold=" + threshold + " k=" + k + " q=" + smiles);
  $.ajax({
      url: "search",
      data: [
        {name: "q", value: smiles},
        {name: "k", value: k},
        {name: "threshold", value: threshold}
        ]
  }).done(function(data) {
    if (data["status"] !== "ok") {
      $("#error").text(data["message"]);
      $("#message").text("");
      return;
    }
    // Summarize the timings
    $("#error").text("");
    var hits = data["hits"];
    var round_trip_time = (new Date() - start_time) / 1000.0;
    $("#message").text("Found " + hits.length + " hits. Times for " +
                       " Search: " + data["search_time"].toFixed(3) +
                       " Lookup: " + data["lookup_time"].toFixed(3) +
                       " Handler: " + data["total_time"].toFixed(3) +
                       " Round-trip: " + round_trip_time.toFixed(3) + " seconds.");

    // Display the results as rows in the table
    $(".result-row").remove();
    var tbody = $("tbody");

    $.each(data["hits"], function(rowno, row) {
        var tr = $(document.createElement("tr")).addClass("result-row");
        tbody.append(tr);
        $.each(row, function(colno, value) {
            var ele = $(document.createElement("td"));
            if (colno == 0) {
              // The identifier, wrapped in a hyperlink to the original SD record
              var a = $(document.createElement("a"));
              a.attr({"href": value + ".sdf"});
              a.text(value);
              ele.append(a);
            } else if (colno == 1) {
              // The score, to three decimals
              ele.addClass("score");
              ele.text(value.toFixed(3));
            } else {
              ele.text(value);
            }
            tr.append(ele);
        });
    });
    // Configure the download form
    $("form input[name=q]").val(smiles);
    $("form input[name=k]").val(k);
    $("form input[name=threshold]").val(threshold);
    $("#submit-button").prop("disabled", data["hits"].length == 0);
    if (data["hits"].length == 1) {
      $("#submit-button").val("Download 1 record");
    } else {
      $("#submit-button").val("Download " + data["hits"].length + " records");
    }
    
  }).fail(function() {
    $("#error").text("Server failure");
    $("#message").text("");
  });

};

$().ready(function() {
  $("#threshold").change(make_request);
  $("#k").change(make_request);
  $("#smiles").on("input", make_request);
  make_request();  // In case fields remain after a browser refresh.
});
</script>

</body>
</html>
""".replace("NUM_FINGERPRINTS", str(len(arena))).replace("FINGERPRINT_TYPE", fptype.get_type())


# Return the client-side application
@app.route("/")
def homepage():
    return Response(js_code, mimetype="text/html")
       

# Parse the query as either a SMILES string or a database entry
# Return the (error string, None), or (None, parsed molecule)
def _get_molecule(cursor, query):
    if not query:
        return None, None

    # Is this a valid SMILES?
    try:
        mol = toolkit.parse_molecule(query, "smistring")
    except ValueError, err:
        # No. Then is this an identifier?
        match = cursor.execute("SELECT smiles FROM Compounds WHERE id = ?", (query,)).fetchone()
        if not match:
            return "Not a valid SMILES or identifier", None
        smiles = match[0]
        try:
            mol = toolkit.parse_molecule(smiles, "smistring")
        except ValueError:
            return "Invalid SMILES in database!", None

    return None, mol

# Return an error document in JSON format
def _error_response(errmsg):
    return jsonify({
       "status": "error",
       "message": errmsg,
       "columns": ["id", "score", "smiles", "exact match"],
       "hits": [],
       })

# Get the selected columns for the given hits
def get_columns(cursor, columns, hits):
    sql_query = "SELECT %s FROM Compounds WHERE id IN (" % (", ".join(columns),)
    sql_query += ",".join("?"*len(hits)) + ")"
    cursor.execute(sql_query, tuple(hits.get_ids()))
    return cursor


# Similarity search
@app.route("/search")
def similarity_search():
    start_time = time.time()
    q = request.args.get("q", "").strip()
    k = request.args.get("k", "3")
    oformat = request.args.get("oformat", "json")
    threshold = request.args.get("threshold", "0.0")

    cursor = db.cursor()

    # Flask queries are Unicode strings, but SMILES and record identifiers
    # can only be ASCII, so convert and report any failures
    try:
        q = str(q)
    except UnicodeEncodeError:
        return _error_response("non-ASCII query")
    errmsg, mol = _get_molecule(cursor, q)
    if errmsg:
        return _error_response(errmsg)
    try:
        k = int(k)
    except ValueError:
        return _error_response("bad value for k")
    try:
        threshold = float(threshold)
    except ValueError:
        return _error_response("bad value for threshold")
    if oformat not in ("json", "ids", "smiles", "sdf"):
        oformat = "json"

    if mol is None:
        return jsonify({
          "status": "ok",
          "message": "",
          "search_time": 0.0,
          "lookup_time": 0.0,
          "total_time": time.time() - start_time,
          "hits": [],
        })
    
    # Compute the fingerprint and do the similarity search
    fp = fptype.compute_fingerprint(mol)
    t1 = time.time()
    hits = arena.knearest_tanimoto_search_fp(fp, k=k, threshold=threshold)
    fp_search_time = time.time() - t1

    if oformat == "ids":
        return Response("\n".join(hits.get_ids()) + "\n", mimetype="text/plain")
    
    if oformat == "smiles":
        id_to_cansmiles = dict(get_columns(cursor, ["id", "smiles"], hits))
        lines = []
        for id, score in hits.get_ids_and_scores():
            smiles = id_to_cansmiles.get(id, "MISSING")
            lines.append("%s\t%s\t%.3f\n" % (smiles, id, score))
        return Response("".join(lines), mimetype="text/plain")

    if oformat == "sdf":
        id_to_sdfz = dict(get_columns(cursor, ["id", "record"], hits))
        records = []
        for id, score in hits.get_ids_and_scores():
            recordz = id_to_sdfz.get(id, None)
            if recordz is None:
                continue
            record = zlib.decompress(recordz)
            text_mol = text_toolkit.parse_molecule(record, "sdf")
            text_mol.add_tag("SCORE", "%.3f" % (score,))
            records.append(text_toolkit.create_string(text_mol, "sdf"))

        return Response("".join(records), mimetype="text/plain")

    # Otherwise it's a json output format
    query_cansmiles = toolkit.create_string(mol, "canstring")
    t1 = time.time()
    id_to_cansmiles = dict(get_columns(cursor, ["id", "smiles"], hits))
    smiles_lookup_time = time.time() - t1

    # get the (id, score, cansmiles, and SMILES match flag)
    results = []
    for id, score in hits.get_ids_and_scores():
        cansmiles = id_to_cansmiles.get(id, None)
        results.append( (id, score, cansmiles, query_cansmiles == cansmiles) )

    return jsonify({
        "status": "ok",
        "message": "",
        "search_time": fp_search_time,
        "lookup_time": smiles_lookup_time,
        "total_time": time.time() - start_time,
        "hits": results,
        })
    

# get a record given its id, or return "404 Not Found"
@app.route("/<string:id>.sdf")
def get_record(id):
    cursor = db.cursor()
    cursor.execute("SELECT record FROM Compounds WHERE id = ?", (id,))
    row = cursor.fetchone()
    if not row:
        return page_not_found()

    record = zlib.decompress(row[0])
    return Response(record, mimetype="text/plain")

if __name__ == "__main__":
    #app.run(debug=False, host="0.0.0.0")
    app.run(debug=True)


Indexing ChEMBL for chemistry search #

I've just downloaded the SD file for ChEMBL-19. I'm happy that after many years they have finally placed the record identifier in the title line of each record! The title line used to be blank, or occasionally contained text that was essentially junk.

Now I want to be able to work with the data in the file. More specifically, I want a simple cheminformatics system where I can look up the original record given its identifier, look up records by their canonical SMILES, and do a fast similarity search given a query structure,

and I want to show how to use the new features in chemfp-1.2 to do this.

Parse records, find the SMILES, and add to the SQLite database

The easiest approach is to set up two files. The first is a SQLite database containing a Compounds table. The table will contain three columns: the record identifier, the original record (as a compressed string), and the canonical SMILES string. SQLite is a very widely used SQL database, and it's included as part of the standard Python distribution. Here's the SQL that I'll use to create the database schema.

CREATE TABLE Compounds (
   id TEXT PRIMARY KEY NOT NULL,
   record BLOB,
   smiles TEXT)
which when embedded in Python code looks like:
import sqlite3

database_filename = "chembl_19.sqlite3"
conn = sqlite3.connect(database_filename)
cursor = conn.cursor()

cursor.execute("""
    CREATE TABLE Compounds (
       id TEXT PRIMARY KEY NOT NULL,
       record BLOB,
       smiles TEXT)
       """)

I need some way to get each SD record, so I can save it as a compressed blob in the database and create the canonical SMILES string for it. I pointed out in my previous essay that chemfp includes a "read_sdf_ids_and_records" function in its text_toolkit submodule, which reads each record and also pulls out the identifier. The code for this, along with commented-out code to support older ChEMBL releases, which only stored the identifier in the "chembl_id" tag, looks like:

from chemfp import text_toolkit

filename = "/Users/dalke/databases/chembl_19.sdf.gz"
id_tag = None # Use the title as the id
#id_tag = "chembl_id" # Uncomment for pre-ChEMBL-19 releases

sdf_reader = text_toolkit.read_sdf_ids_and_records(filename, id_tag=id_tag, errors="ignore")

for (id, record) in sdf_reader:
   # work with the identifier and SDF record
In practice I also want to track the current record number. I can use the enumerate() function for that, and I'll seed it so that the first record is 1.
recno = 0
for recno, (id, record) in enumerate(sdf_reader, 1):
   # work with the identifier and SDF record
You might wonder why I set recno to 0 before doing the iteration. This handles the case where the input file is empty. If that happens, the loop body never executes, so recno keeps whatever value it had before; if it was never assigned at all, later code that uses it would raise a NameError. Here's an example:
>>> for i, c in enumerate(""):
...   print i, c
... 
>>> print i
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'i' is not defined

I want to include the isomeric canonical SMILES string. Chemfp-1.2 has a common toolkit API to work with RDKit, OpenEye/OEChem, and Open Babel, so I'll simply write "toolkit" to mean whichever of "rdkit_toolkit", "openeye_toolkit", or "openbabel_toolkit" you prefer. The conversion is straightforward: turn the SDF record into a toolkit molecule, then turn the toolkit molecule into the "smistring" (that's the chemfp format name for "isomeric canonical SMILES string").

toolkit_mol = toolkit.parse_molecule(record, "sdf")
smistring = toolkit.create_string(toolkit_mol, "smistring")
It's possible that the toolkit doesn't handle a given record, and unlikely but still possible that the toolkit can't make a SMILES string for the molecule. If either happens, these functions raise a chemfp.ParseError, which I can catch, ignore, and continue on to the next molecule.
recno = 0
for recno, (id, record) in enumerate(sdf_reader, 1):
    try:
        toolkit_mol = toolkit.parse_molecule(record, "sdf")
        smistring = toolkit.create_string(toolkit_mol, "smistring")
    except chemfp.ParseError:
        continue
    .... work with the id, SDF record, and SMILES string ...
All that's left for the SQLite database is to compress the SDF record using the zlib library, and add the new row to the database.
    # Compress the record (result is about 1/3rd of the original string)
    compressed_record = zlib.compress(record)

    # Save into the database and write to the fingerprint file
    conn.execute("""
        INSERT INTO Compounds (id, record, smiles)
                    VALUES (?, ?, ?)""",
        (id, sqlite3.Binary(compressed_record), smistring))
Compression is useful because the uncompressed file takes up 2.8GB while the original compressed file takes 470 MB for a 6x compression ratio. Compressing each record independently isn't as effective, because the compressor can't share any of its analysis between records, but it still manages about a 3x compression ratio.
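
If you want to check the per-record ratio for yourself, here's a quick sketch of mine (not from the essay) that compresses the first record in the file; it reuses the ChEMBL filename from above.

import zlib
from chemfp import text_toolkit

# Grab the first SD record and see how well it compresses on its own.
filename = "/Users/dalke/databases/chembl_19.sdf.gz"
reader = text_toolkit.read_sdf_ids_and_records(filename)
id, record = next(iter(reader))
compressed_record = zlib.compress(record)
print "%s: %d bytes -> %d bytes (%.1fx smaller)" % (
    id, len(record), len(compressed_record),
    float(len(record)) / len(compressed_record))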

However, the compressed string is a byte string with NUL characters and non-ASCII data. The BLOB and the sqlite3.Binary() tell SQLite and Python's sqlite3 API to treat the data as a byte string and not as a Unicode string.

The last bit of code creates an index on the SMILES column (so I can see if a given SMILES exists, or find the records with a given SMILES), then commits the new data and closes the database.

conn.execute("CREATE INDEX Compounds_by_smiles on Compounds (smiles)")

conn.commit()
conn.close()

Generate fingerprint and save to the FPB fingerprint file

The other file I need is an FPB fingerprint file containing the fingerprints for similarity search. I'll use RDKit's hash fingerprints, and since I'm using RDKit to generate the fingerprints I might as well use RDKit as the toolkit to convert the SDF record into a SMILES string.

fingerprint_type = "RDKit-Fingerprint fpSize=2048"
fptype = chemfp.get_fingerprint_type(fingerprint_type)
toolkit = fptype.toolkit

I'll ask chemfp to open a fingerprint writer. I want it to include the canonical fingerprint type string in the metadata, and for good record-keeping I'll also record the original source filename for the structures. The fingerprint type's get_metadata() helper creates the correct metadata object, which I can pass to the writer.

filename = "/Users/dalke/databases/chembl_19.sdf.gz"
fingerprint_filename = "chembl_19.fpb"
metadata = fptype.get_metadata(sources=[filename])
fp_writer = chemfp.open_fingerprint_writer(fingerprint_filename, metadata=metadata)

Then for each toolkit molecule, generate the fingerprint and save the id and the fingerprint to the file.

recno = 0
for recno, (id, record) in enumerate(sdf_reader, 1):
    # This part is shared with the SQLite builder
    try:
        toolkit_mol = toolkit.parse_molecule(record, "sdf")
        smistring = toolkit.create_string(toolkit_mol, "smistring")
    except chemfp.ParseError:
        continue
    ...
    # Compute the fingerprint and save to the file
    fp = fptype.compute_fingerprint(toolkit_mol)

    fp_writer.write_fingerprint(id, fp)
   
Once finished, I'll close the file explicitly, rather than depend on the garbage collector to do it for me.
fp_writer.close()

Putting it all together

You've now seen the details; here it is all put together, along with some progress information and a summary report when it's complete.

import sys
import sqlite3
import zlib
import time

import chemfp
from chemfp import text_toolkit

# Configuration information

filename = "/Users/dalke/databases/chembl_19.sdf.gz"
id_tag = None # Use the title as the id
#id_tag = "chembl_id" # Uncomment for pre-ChEMBL-19 releases

fingerprint_type = "RDKit-Fingerprint fpSize=2048"

database_filename = "chembl_19.sqlite3"
fingerprint_filename = "chembl_19.fpb"

# Open the output database and create the schema

#import os  # Uncomment to delete the database before trying again
#os.unlink(database_filename)
conn = sqlite3.connect(database_filename)
cursor = conn.cursor()

cursor.execute("""
    CREATE TABLE Compounds (
       id TEXT PRIMARY KEY NOT NULL,
       record BLOB,
       smiles TEXT)
       """)

# Use the fingerprint type string to get the fingerprint type
# and the toolkit for that type

fptype = chemfp.get_fingerprint_type(fingerprint_type)
toolkit = fptype.toolkit

# Open the fingerprint writer, along with the correct metadata.
# (It handles the canonical fingerprint type string automatically.
# I give it a bit more information about the source.)

metadata = fptype.get_metadata(sources=[filename])
fp_writer = chemfp.open_fingerprint_writer(fingerprint_filename, metadata=metadata)

# Iterate through the file as a set of SDF records.

sdf_reader = text_toolkit.read_sdf_ids_and_records(filename, id_tag=id_tag, errors="ignore")
start_time = time.time()

recno = 0
for recno, (id, record) in enumerate(sdf_reader, 1):
    # Display some progress information
    if recno == 1 or recno % 1000 == 0:
        dt = time.time() - start_time
        sys.stderr.write("Processing record %d. Elapsed time: %.1f seconds\n" % (recno, dt))

    # Convert the record into a molecule and then get the SMILES and fingerprint.
    # Ignore any conversion errors.
    
    try:
        toolkit_mol = toolkit.parse_molecule(record, "sdf")
        smistring = toolkit.create_string(toolkit_mol, "smistring")
    except chemfp.ParseError:
        continue

    # Compute the fingerprint
    fp = fptype.compute_fingerprint(toolkit_mol)

    # Compress the record (result is about 1/3rd of the original string)
    compressed_record = zlib.compress(record)

    # Save into the database and write to the fingerprint file
    conn.execute("""
        INSERT INTO Compounds (id, record, smiles)
                    VALUES (?, ?, ?)""",
        (id, sqlite3.Binary(compressed_record), smistring))
    
    # Save the id and fingerprint to the output fingerprint file
    fp_writer.write_fingerprint(id, fp)

# How long did it take?
end_read_time = time.time()
dt = end_read_time - start_time
sys.stderr.write("Processed %d records in %.1f seconds. %.2f records/sec\n" %
                 (recno, dt, recno/dt))

# Create the index to be able to search on SMILES.
conn.execute("CREATE INDEX Compounds_by_smiles on Compounds (smiles)")
dt = time.time() - end_read_time
sys.stderr.write("Indexing took %.1f seconds.\n" % (dt,))

# Save everything
conn.commit()
conn.close()
fp_writer.close()

I ran the above code, and at the end it reports:

Processed 1404752 records in 7243.6 seconds. 193.93 records/sec
Indexing took 29.1 seconds.

When I profile the code, I can see that most of the time is spent creating the fingerprint. When I first ran this code I used an older version of RDKit, which took twice as long. I'm glad that RDKit has gotten faster over time.
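
The essay doesn't show the profiling itself, so here's a rough sketch of one way to see where the time goes: time the parse and fingerprint steps separately over the first 10,000 records. This is my own sketch, and it reuses the filename, id_tag, toolkit, and fptype variables from the full script above.

import time
import chemfp
from chemfp import text_toolkit

parse_time = fp_time = 0.0
sdf_reader = text_toolkit.read_sdf_ids_and_records(filename, id_tag=id_tag, errors="ignore")
for recno, (id, record) in enumerate(sdf_reader, 1):
    t0 = time.time()
    try:
        toolkit_mol = toolkit.parse_molecule(record, "sdf")
    except chemfp.ParseError:
        continue
    t1 = time.time()
    fp = fptype.compute_fingerprint(toolkit_mol)
    t2 = time.time()
    parse_time += t1 - t0
    fp_time += t2 - t1
    if recno == 10000:
        break

print "First %d records: parse %.1f s, fingerprint %.1f s" % (recno, parse_time, fp_time)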

Let's see if it works, first using the sqlite interactive shell:

% sqlite3 chembl_19.sqlite3 
SQLite version 3.6.12
Enter ".help" for instructions
Enter SQL statements terminated with a ";"
sqlite> select count(*) from Compounds;
1404532
(Oddly, it takes a long time to compute this count the first time I run it, but once done I can quit and restart sqlite and it's fast.)

How about some more interesting queries?

sqlite> select smiles from Compounds where id = "CHEMBL23456";
CNC(=O)c1[nH]c2ccc(Cl)cc2c1Sc1ccccc1
sqlite> select id, smiles from Compounds limit 5;
CHEMBL153534|Cc1cc(-c2csc(N=C(N)N)n2)cn1C
CHEMBL440060|CC[C@H](C)[C@H](NC(=O)[C@H](CC(C)C)NC(=O)[C@@H](NC(=O)[C@@H](N)CCSC
)[C@@H](C)O)C(=O)NCC(=O)N[C@@H](C)C(=O)N[C@@H](C)C(=O)N[C@@H](Cc1c[nH]cn1)C(=O)N
[C@@H](CC(N)=O)C(=O)NCC(=O)N[C@@H](C)C(=O)N[C@@H](C)C(=O)N[C@@H](CCC(N)=O)C(=O)N
[C@@H](CC(C)C)C(=O)N[C@@H](CC(C)C)C(=O)N[C@@H](CCCN=C(N)N)C(=O)N[C@@H](CCC(N)=O)
C(=O)N[C@@H](CC(C)C)C(=O)N[C@@H](CCCN=C(N)N)C(=O)NCC(=O)N[C@@H](CCC(N)=O)C(=O)N[
C@@H](CC(C)C)C(=O)NCC(=O)N1CCC[C@H]1C(=O)N1CCC[C@H]1C(=O)NCC(=O)N[C@@H](CO)C(=O)
N[C@@H](CCCN=C(N)N)C(N)=O
CHEMBL440067|CC(C)[C@@H]1NC(=O)[C@@H](CCCCN)NC(=O)[C@H](Cc2c[nH]c3ccccc23)NC(=O)
[C@@H](Cc2c[nH]cn2)NC(=O)[C@H](NC(=O)[C@@H](N)Cc2ccc3ccccc3c2)CSSC[C@@H](C(=O)N[
C@@H](Cc2cccc(-c3ccccc3)c2)C(N)=O)NC1=O
CHEMBL440245|CCCC[C@H](NC(=O)[C@H](CC(C)C)NC(=O)[C@H](CCCCN)NC(=O)[C@H](CCCN=C(N
)N)NC(=O)[C@H](CC(N)=O)NC(=O)[C@H](CO)NC(=O)[C@H](Cc1c[nH]cn1)NC(=O)[C@H](C)NC(=
O)[C@H](CCC(N)=O)NC(=O)[C@H](CCC(N)=O)NC(=O)[C@H](C)NC(=O)[C@H](CC(C)C)NC(=O)[C@
H](CCC(N)=O)NC(=O)[C@@H]1CCCCNC(=O)CC[C@H](NC(=O)[C@H](CC(C)C)NC(=O)[C@@H](NC(=O
)[C@H](CCC(=O)O)NC(=O)[C@H](CCCN=C(N)N)NC(=O)[C@H](CC(C)C)NC(=O)[C@H](CC(C)C)NC(
=O)[C@H](Cc2c[nH]cn2)NC(=O)[C@H](N)Cc2ccccc2)C(C)C)C(=O)N[C@@H](CCCC)C(=O)N[C@@H
](C)C(=O)N[C@@H](CCCN=C(N)N)C(=O)N[C@@H](C)C(=O)N1)C(=O)N[C@@H](CCC(=O)O)C(=O)N[
C@H](C(=O)N[C@H](C(=O)C(N)=O)[C@@H](C)CC)[C@@H](C)CC
CHEMBL440249|CC(C)C[C@@H]1NC(=O)CNC(=O)[C@H](c2ccc(O)cc2)NC(=O)[C@@H]([C@@H](C)O
)NC(=O)[C@H](c2ccc(O[C@H]3O[C@H](CO)[C@@H](O)[C@H](O)[C@@H]3O[C@H]3O[C@H](CO)[C@
@H](O)[C@H](O)[C@@H]3O)cc2)NC(=O)[C@@H](CCCN)NC(=O)[C@H](Cc2ccccc2)NC(=O)[C@H]([
C@@H](C)O)NC(=O)[C@@H](c2ccc(O)cc2)NC(=O)[C@H](c2ccc(O)cc2)NC(=O)[C@@H](C(C)C)NC
(=O)[C@@H](CCCN)NC(=O)[C@@H](c2ccc(O)cc2)NC(=O)[C@@H](CNC(=O)[C@H](CC(N)=O)NC(=O
)Cc2cccc3ccccc32)[C@@H](C(N)=O)OC(=O)[C@H](c2ccc(O)c(Cl)c2)NC(=O)[C@@H](C)NC1=O
sqlite> select id, smiles from Compounds order by length(smiles) limit 5;
CHEMBL17564|C
CHEMBL1098659|O
CHEMBL1160819|N
CHEMBL1200739|S
CHEMBL1233550|I
sqlite> .quit
That last query takes a little while because the database isn't indexed by SMILES length.

Okay, now what about the FPB file? We know from the previous code that CHEMBL23456 has the SMILES CNC(=O)c1[nH]c2ccc(Cl)cc2c1Sc1ccccc1, so I'll search for its nearest 5 neighbors.

% time simsearch --query 'CNC(=O)c1[nH]c2ccc(Cl)cc2c1Sc1ccccc1' -k 5 chembl_19.fpb
#Simsearch/1
#num_bits=2048
#type=Tanimoto k=5 threshold=0.0
#software=chemfp/1.2b4
#targets=chembl_19.fpb
#query_sources=/Users/dalke/databases/chembl_19.sdf.gz
#target_sources=/Users/dalke/databases/chembl_19.sdf.gz
5	Query1	CHEMBL23456	1.00000	CHEMBL180330	0.94105	CHEMBL286759	0.93508
CHEMBL314618	0.93295	CHEMBL382002	0.92812
0.262u 0.139s 0:00.47 82.9%	0+0k 0+0io 0pf+0w
You can verify that CHEMBL23456 is in the output, with the expected similarity of 1.0.

Also, take a look at the timings. Because of the new binary format in chemfp-1.2, it takes less than 0.5 seconds to search 1.4 million fingerprints from ChEMBL, from the command-line. I'm rather proud of that.
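
Here's a sketch of the same k=5 search done from Python instead of the command line. It isn't from the essay, but it only uses chemfp calls shown elsewhere on this page.

import chemfp

# Load the FPB file, get its fingerprint type, and search with a query SMILES.
arena = chemfp.load_fingerprints("chembl_19.fpb")
fptype = arena.get_fingerprint_type()
fp = fptype.parse_molecule_fingerprint("CNC(=O)c1[nH]c2ccc(Cl)cc2c1Sc1ccccc1", "smistring")
for id, score in arena.knearest_tanimoto_search_fp(fp, k=5, threshold=0.0).get_ids_and_scores():
    print "%s %.5f" % (id, score)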

Finally, I'll use Python's interactive shell to connect to the database and display the uncompressed SD record for CHEMBL286759.

>>> import sqlite3
>>> db = sqlite3.connect("chembl_19.sqlite3")
>>> c = db.cursor()
>>> c.execute('SELECT record FROM Compounds WHERE id =  "CHEMBL286759"')
<sqlite3.Cursor object at 0x100d690e0>
>>> record, = c.fetchone()
>>> record
<read-write buffer ptr 0x100c64280, size 385 at 0x100c64240>
>>> 
>>> import zlib
>>> print zlib.decompress(record)
CHEMBL286759
          11280714492D 1   1.00000     0.00000     0

 20 22  0     0  0            999 V2000
    6.7292   -2.0542    0.0000 C   0  0  0  0  0  0           0  0  0
    6.2417   -1.3875    0.0000 C   0  0  0  0  0  0           0  0  0
    6.2417   -2.7292    0.0000 N   0  0  0  0  0  0           0  0  0
    5.4542   -1.6417    0.0000 C   0  0  0  0  0  0           0  0  0
    5.4542   -2.4750    0.0000 C   0  0  0  0  0  0           0  0  0
    7.5542   -2.0500    0.0000 C   0  0  0  0  0  0           0  0  0
    6.4917   -0.6042    0.0000 S   0  0  0  0  0  0           0  0  0
    4.7375   -1.2292    0.0000 C   0  0  0  0  0  0           0  0  0
    7.9625   -1.3292    0.0000 O   0  0  0  0  0  0           0  0  0
    4.7375   -2.8875    0.0000 C   0  0  0  0  0  0           0  0  0
    7.9667   -2.7625    0.0000 N   0  0  0  0  0  0           0  0  0
    4.0167   -1.6417    0.0000 C   0  0  0  0  0  0           0  0  0
    5.9417    0.0083    0.0000 C   0  0  0  0  0  0           0  0  0
    4.0167   -2.4750    0.0000 C   0  0  0  0  0  0           0  0  0
    3.3000   -1.2292    0.0000 Cl  0  0  0  0  0  0           0  0  0
    6.2000    0.7958    0.0000 C   0  0  0  0  0  0           0  0  0
    5.1417   -0.1667    0.0000 C   0  0  0  0  0  0           0  0  0
    4.5875    0.4458    0.0000 C   0  0  0  0  0  0           0  0  0
    5.6417    1.4083    0.0000 C   0  0  0  0  0  0           0  0  0
    4.8375    1.2333    0.0000 C   0  0  0  0  0  0           0  0  0
  2  1  2  0     0  0
  3  1  1  0     0  0
  4  2  1  0     0  0
  5  3  1  0     0  0
  6  1  1  0     0  0
  7  2  1  0     0  0
  8  4  1  0     0  0
  9  6  2  0     0  0
 10  5  1  0     0  0
 11  6  1  0     0  0
 12  8  2  0     0  0
 13  7  1  0     0  0
 14 10  2  0     0  0
 15 12  1  0     0  0
 16 13  2  0     0  0
 17 13  1  0     0  0
 18 17  2  0     0  0
 19 16  1  0     0  0
 20 18  1  0     0  0
  4  5  2  0     0  0
 14 12  1  0     0  0
 20 19  2  0     0  0
M  END
> <chembl_id>
CHEMBL286759

$$$$

Simple variations

You can easily see how to expand the database. You likely also want to store and index the non-isomeric canonical SMILES, which you can do by asking for the "cansmiles" format. You might also want to pre-compute and store molecular properties, like cLogP and molecular weight, along with a searchable index. It's very easy to make your own specialized database once you know how.
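
For example, here's a small sketch of mine (not from the essay) of computing two such properties. It assumes the RDKit toolkit is in use, so the chemfp molecule is an ordinary RDKit molecule that RDKit's own descriptor functions accept; the resulting values could go into extra "mw" and "clogp" columns with their own indexes.

from chemfp import rdkit_toolkit
from rdkit.Chem import Descriptors, Crippen

# Parse aspirin and compute its molecular weight and Crippen cLogP.
mol = rdkit_toolkit.parse_molecule("CC(=O)Oc1ccccc1C(=O)O", "smistring")
print "MW: %.1f  cLogP: %.2f" % (Descriptors.MolWt(mol), Crippen.MolLogP(mol))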

Of course, I really would like to integrate the SQL query with the fingerprint query, to search for things like "similar records with a given logP range." It's possible to hack these a bit, but what I would like to do some day is write a chemfp virtual table for SQLite - a sort of similarity cartridge - so they can be integrated. I just need funding. ;)



MACCS in RDKit and Open Babel #

In this essay I'll compare the RDKit and Open Babel MACCS implementations at a somewhat coarse level. The main goal of this will be to show how some of the new chemfp-1.2 features make it easier to do cross-toolkit fingerprint analysis.

Fingerprint type API

It's hard to make things easy. I want to be able to generate a fingerprint given the fingerprint type string and a structure record as a string. Here are some examples of fingerprint type strings:

RDKit-Fingerprint fpSize=1024
OpenEye-Path numbits=1024 maxbonds=6
OpenBabel-FP2
You can look at the names and figure out that they use different cheminformatics toolkits, each with its own API for parsing a chemical record and generating a fingerprint. That's complicated.

Chemfp solves it in the usual way, with an extra layer of indirection. There's a common fingerprint type API, with back-end implementations for OEChem, Open Babel, and RDKit. Here's how it looks in action:

import chemfp

smiles = "c1ccccc1O"

for typestring in (
     # Iterate through fingerprints from three different toolkits
     "RDKit-Fingerprint fpSize=1024",
     "OpenEye-Path numbits=1024 maxbonds=6",
     "OpenBabel-FP2"):
    
    # Use the fingerprint type string to get a FingerprintType object
    fptype = chemfp.get_fingerprint_type(typestring)
    
    # The fingerprint type can parse a SMILES string and return a fingerprint
    fp = fptype.parse_molecule_fingerprint(smiles, "smistring")
    
    # The fingerprint type also has a way to get the canonical type string.
    print fptype.get_type()
    
    # Display the fingerprint in hex, and wrapped over several lines
    hex_fp = fp.encode("hex")
    for i in range(0, len(hex_fp), 60):
        print hex_fp[i:i+60]
    print
This generates the following output:
RDKit-Fingerprint/2 minPath=1 maxPath=7 fpSize=1024 nBitsPerHash=2 useHs=1
040000000000100000000000800000000000000002040000040041000000
000080000000400000400004080800000001000000000008000000208020
000000000000000000000800000000000100000000112000000000000000
040000000001000000010000000220040200020008000000000200004085
0000000000000000

OpenEye-Path/2 numbits=1024 minbonds=0 maxbonds=6 atype=Arom|AtmNum|Chiral|EqHalo|FCharge|HvyDeg|Hyb btype=Order|Chiral
200000000000000000000000008090000000000000000000800010000000
00800020000000100200000200008042000000000a000000000000100000
000000000400020000000001000000000000000000000000000004000010
00000000200802000000004008010000100000000000008a002184000000
2000000000000004

OpenBabel-FP2/1
000000000000000000000200000000000000000000000000000000000000
000000000000000000000000000800000000000002000000000000000000
000000000800000000000000000000000200000000800000000000004008
000000000000000000000000000200000000000000000002000000000020
0800000000000000

Use parse_molecule() to parse the record once

How does RDKit's fingerprint density change as a function of fingerprint length? It's not hard to find out using the above example, though instead of creating a type string containing the size information, this time I'll pass the fingerprint parameters as the second argument to get_fingerprint_type().

import chemfp
from chemfp import bitops

smiles = "CC(=O)Oc1ccccc1C(=O)O"

print "fpSize popcount density"

for size in (256, 512, 1024, 2048, 4096):
    fptype = chemfp.get_fingerprint_type("RDKit-Fingerprint", {"fpSize": size})
    # The next line converts the SMILES to a molecule each time through the loop.
    fp = fptype.parse_molecule_fingerprint(smiles, "smistring")
    popcount = bitops.byte_popcount(fp)
    print "%6d   %4d    %.3f" % (size, popcount, float(popcount)/size)
For the curious, the output is:
fpSize popcount density
   256    208    0.812
   512    271    0.529
  1024    319    0.312
  2048    354    0.173
  4096    375    0.092

If you look at the code for a bit you'll see that it parses the same SMILES string 5 times. This seems a bit excessive. Instead, I would like to parse the SMILES string once, to get an RDKit molecule, and pass that molecule to the different fingerprint types. I can do this directly with the RDKit API, as in:

from rdkit import Chem
rdmol = Chem.MolFromSmiles(smiles)
I instead created a common cross-toolkit API, where "rdkit_toolkit", "openeye_toolkit", and "openbabel_toolkit" all share the same toolkit API. Here I use chemfp.rdkit_toolkit to make a molecule:
from chemfp import rdkit_toolkit
rdmol = rdkit_toolkit.parse_molecule(smiles, "smistring")
and the same code would work with the other toolkits.

At this point it looks more complicated, or at least longer, to call the chemfp toolkit API than use RDKit's native API. I hope to convince you in a bit why it's useful.

I'll rewrite the density code to use the toolkit API and parse the SMILES string only once. This time I call the compute_fingerprint() method of the fingerprint type to compute the fingerprint given a toolkit molecule.

import chemfp
from chemfp import bitops
from chemfp import rdkit_toolkit

smiles = "CC(=O)Oc1ccccc1C(=O)O"

print "fpSize popcount density"

# Parse the SMILES once, into an RDKit molecule
rdmol = rdkit_toolkit.parse_molecule(smiles, "smistring")

for size in (256, 512, 1024, 2048, 4096):
    fptype = chemfp.get_fingerprint_type("RDKit-Fingerprint", {"fpSize": size})
    fp = fptype.compute_fingerprint(rdmol)
    popcount = bitops.byte_popcount(fp)
    print "%6d   %4d    %.3f" % (size, popcount, float(popcount)/size)
As expected, the output doesn't change, so I won't show it again.

How to use the right toolkit for a given fingerprint type

The above code worked because I knew to use the rdkit_toolkit for the "RDKit-Fingerprint" type. What if I want to compare the fingerprint density for different sizes and different fingerprint types, including ones from other toolkits? I could keep track of them manually, but in chemfp each fingerprint type knows its toolkit, which makes things easier.

import chemfp
from chemfp import bitops

smiles = "CC(=O)Oc1ccccc1C(=O)O"

print "fpSize popcount density"

for typename, sizefield in (
        ("RDKit-Fingerprint", "fpSize"),
        ("RDKit-Morgan radius=1", "fpSize"),
        ("RDKit-Morgan radius=2", "fpSize"),
        ("RDKit-Morgan radius=3", "fpSize"),
        ("OpenEye-Path", "numbits"),
        ("OpenEye-Circular maxradius=1", "numbits"),
        ("OpenEye-Circular maxradius=2", "numbits"),
        ("OpenEye-Circular maxradius=3", "numbits")):

    # Each fingerprint type knows its toolkit.
    # Use that to create the toolkit-appropriate molecule
    fptype = chemfp.get_fingerprint_type(typename)
    
    toolkit = fptype.toolkit
    mol = toolkit.parse_molecule(smiles, "smistring")

    print "="*23, typename
    for size in (256, 512, 1024, 2048, 4096):
        fptype = chemfp.get_fingerprint_type(typename, {sizefield: size})
        fp = fptype.compute_fingerprint(mol)
        popcount = bitops.byte_popcount(fp)
        print "%6d   %4d    %.3f" % (size, popcount, float(popcount)/size)
The output of this is:
fpSize popcount density
======================= RDKit-Fingerprint
   256    208    0.812
   512    271    0.529
  1024    319    0.312
  2048    354    0.173
  4096    375    0.092
======================= RDKit-Morgan radius=1
   256     16    0.062
   512     16    0.031
  1024     16    0.016
  2048     16    0.008
  4096     17    0.004
======================= RDKit-Morgan radius=2
   256     24    0.094
   512     24    0.047
  1024     24    0.023
  2048     24    0.012
  4096     25    0.006
======================= RDKit-Morgan radius=3
   256     31    0.121
   512     31    0.061
  1024     31    0.030
  2048     31    0.015
  4096     32    0.008
======================= OpenEye-Path
   256    124    0.484
   512    142    0.277
  1024    153    0.149
  2048    161    0.079
  4096    164    0.040
======================= OpenEye-Circular maxradius=1
   256     15    0.059
   512     16    0.031
  1024     16    0.016
  2048     16    0.008
  4096     16    0.004
======================= OpenEye-Circular maxradius=2
   256     23    0.090
   512     24    0.047
  1024     24    0.023
  2048     24    0.012
  4096     24    0.006
======================= OpenEye-Circular maxradius=3
   256     29    0.113
   512     31    0.061
  1024     31    0.030
  2048     31    0.015
  4096     31    0.008
This new code once again parses the SMILES more often than it needs to. I only need to parse the string once for all of the OpenEye toolkit fingerprint calculations and once for RDKit's. I'll make a little cache where the key is "toolkit.name" (which is the string "openeye" or "rdkit" or "openbabel") and the value is the corresponding toolkit molecule. The following is a truncated version of the result, since the other details aren't so important and would get in the way.
# The cache key is the toolkit name
mol_cache = {}
for typename, sizefield in (
        ("RDKit-Fingerprint", "fpSize"),
              # ... removed many lines
        ("OpenEye-Circular maxradius=3", "numbits")):

    # Each fingerprint type knows its toolkit.
    # Use that to create the toolkit-appropriate molecule
    fptype = chemfp.get_fingerprint_type(typename)
    toolkit = fptype.toolkit
    if toolkit.name in mol_cache:
        mol = mol_cache[toolkit.name]
    else:
        mol = mol_cache[toolkit.name] = toolkit.parse_molecule(smiles, "smistring")

    # ... compute and display the fingerprints

Use a fingerprint type to read a structure file

Now for something a bit different. Ever wondered if the RDKit and Open Babel MACCS implementations give the same answers? They should be close, since they use the same SMARTS definitions, but there will likely be differences in chemistry perception. It's very easy to generate the MACCS fingerprints given a single fingerprint type. The fingerprint type has a "read_molecule_fingerprints()" method which takes the structure filename and returns the (id, fp) pairs. Here's what that looks like when I analyze a copy of ChEBI:

import chemfp
from chemfp.bitops import byte_popcount

# Choose one or the other, depending on your preference.
fptype = chemfp.get_fingerprint_type("RDKit-MACCS166")
#fptype = chemfp.get_fingerprint_type("OpenBabel-MACCS")

filename = "/Users/dalke/databases/ChEBI_lite.sdf.gz"
total_popcount = 0
with fptype.read_molecule_fingerprints(filename, errors="ignore") as reader:
    for recno, (id, fp) in enumerate(reader, 1):
        total_popcount += byte_popcount(fp)

print "#records %d  total popcount: %d  average popcount: %.1f  average density: %.3f" % (
    recno, total_popcount, total_popcount/float(recno), total_popcount/float(recno)/fptype.num_bits)
This gives the output for RDKit of:
#records 18928  total popcount: 637448  average popcount: 33.7  average density: 0.203
When I run it for Open Babel's MACCS I get:
#records 19640  total popcount: 649935  average popcount: 33.1  average density: 0.199
These are close, but not identical. It's not clear why there's a difference. Part of the reason is surely because RDKit and Open Babel aren't able to process all of the records. It's also likely that they have differences in aromaticity. But there may be other reasons.

It may simply be that RDKit and Open Babel fail on different records, so the real analysis should only include the ones that both can handle.

Read SD records using the text_toolkit

I need some way to analyze only the records that both RDKit and Open Babel can understand. I can't use the fingerprint type's own parser API because it's difficult to keep the two parsers synchronized; I don't know if the RDKit parser skipped a record or the Open Babel one did.

Chemfp includes a "text_toolkit" module, with functions to parse individual SMILES and SDF records from the corresponding files. I can use that to get each SDF record as a string, then use parse_molecule_fingerprint() to get the fingerprint from that string. The functions will raise a chemfp.ParseError exception (which is also a ValueError) if there was a problem. If there is an exception, I'll skip that record.

The code to read SDF records can be as simple as:

for id, sdf_record in text_toolkit.read_sdf_ids_and_records(filename):
    ... sdf_record contains the record as a string ...
though in practice I'll use a context-manager so the file is closed automatically. Here's the full code:
import chemfp
from chemfp import text_toolkit
from chemfp.bitops import byte_popcount, byte_intersect_popcount

rdkit_maccs_fptype = chemfp.get_fingerprint_type("RDKit-MACCS166")
openbabel_maccs_fptype = chemfp.get_fingerprint_type("OpenBabel-MACCS")
assert rdkit_maccs_fptype.num_bits == openbabel_maccs_fptype.num_bits == 166

filename = "/Users/dalke/databases/ChEBI_lite.sdf.gz"

total_rdkit_popcount = total_openbabel_popcount = 0
only_rdkit_popcount = only_openbabel_popcount = 0
num_skipped = 0

with text_toolkit.read_sdf_ids_and_records(filename) as sdf_reader:
    for recno, (id, sdf_record) in enumerate(sdf_reader, 1):
        # Convert the record into a fingerprint. If there is a problem,
        # keep track of the skip count and go on to the next record.
        try:
            rdkit_fp = rdkit_maccs_fptype.parse_molecule_fingerprint(sdf_record, "sdf")
            openbabel_fp = openbabel_maccs_fptype.parse_molecule_fingerprint(sdf_record, "sdf")
        except chemfp.ParseError:
            num_skipped += 1
            continue
        # Figure out the number of bits in each fingerprint and number of bits in common
        rdkit_popcount = byte_popcount(rdkit_fp)
        openbabel_popcount = byte_popcount(openbabel_fp)
        intersect_popcount = byte_intersect_popcount(rdkit_fp, openbabel_fp)

        # Keep track of overall statistics for each molecule.
        total_rdkit_popcount += rdkit_popcount
        total_openbabel_popcount += openbabel_popcount
        only_rdkit_popcount += rdkit_popcount - intersect_popcount
        only_openbabel_popcount += openbabel_popcount - intersect_popcount
        
# Report the results
print "#records %d  #skipped %d" % (recno, num_skipped)
num_valid_records = recno - num_skipped
print "RDKit popcount: %d  RDKit only: %d average popcount: %.1f  average density: %.3f" % (
    total_rdkit_popcount, only_rdkit_popcount, total_rdkit_popcount/float(num_valid_records),
    total_rdkit_popcount/float(num_valid_records)/166)
print "Open Babel popcount: %d  OB only: %d average popcount: %.1f  average density: %.3f" % (
    total_openbabel_popcount, only_openbabel_popcount, total_openbabel_popcount/float(num_valid_records),
    total_openbabel_popcount/float(num_valid_records)/166)
#records 19640  #skipped 712
RDKit popcount: 637448  RDKit only: 9034 average popcount: 33.7  average density: 0.203
Open Babel popcount: 629173  OB only: 759 average popcount: 33.2  average density: 0.200

That's interesting. RDKit has a slightly higher popcount than Open Babel, and both have bits set which aren't set in the other. Which bits are different? For that I'll use a decoder function which returns the list of "on" bit positions in a fingerprint:

>>> from chemfp import encodings
>>> encodings.to_on_bitlist("\0\0\1\xf1")
[16, 24, 28, 29, 30, 31]
>>> 
(Remember that chemfp uses a mixed-endian encoding, where bytes are little endian and bits in a byte are big endian.)
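
Here's a small pure-Python sketch of that decoding, written only to reproduce the example above (it is not chemfp's actual implementation):

def decode_on_bits(fp):
    # Byte i holds bits 8*i through 8*i+7; within a byte, the
    # low-order bit comes first.
    bits = []
    for byteno, byte in enumerate(fp):
        value = ord(byte)
        for i in range(8):
            if value & (1 << i):
                bits.append(8 * byteno + i)
    return bits

print decode_on_bits("\0\0\1\xf1")   # [16, 24, 28, 29, 30, 31]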

The following will find where the two MACCS implementations have different fingerprints, determine which bits are set in one implementation but not the other, and report a few of the records which are different. (ChEBI stores the record identifier in the "ChEBI ID" tag instead of the title, so I use the 'id_tag' parameter to extract the actual record id.)

import chemfp
from chemfp import text_toolkit
from chemfp.bitops import byte_popcount, byte_intersect_popcount
from chemfp.encodings import to_on_bitlist

from collections import defaultdict

rdkit_maccs_fptype = chemfp.get_fingerprint_type("RDKit-MACCS166")
openbabel_maccs_fptype = chemfp.get_fingerprint_type("OpenBabel-MACCS")
assert rdkit_maccs_fptype.num_bits == openbabel_maccs_fptype.num_bits == 166

filename = "/Users/dalke/databases/ChEBI_lite.sdf.gz"

extra_rdkit = defaultdict(list)
extra_openbabel = defaultdict(list)

with text_toolkit.read_sdf_ids_and_records(filename, id_tag="ChEBI ID") as sdf_reader:
    for recno, (id, sdf_record) in enumerate(sdf_reader, 1):
        try:
            rdkit_fp = rdkit_maccs_fptype.parse_molecule_fingerprint(sdf_record, "sdf")
            openbabel_fp = openbabel_maccs_fptype.parse_molecule_fingerprint(sdf_record, "sdf")
        except chemfp.ParseError:
            continue
        if rdkit_fp == openbabel_fp:
            continue
        in_rdkit = set(to_on_bitlist(rdkit_fp))
        in_openbabel = set(to_on_bitlist(openbabel_fp))

        # Record the identifier for each bit in one fingerprint which is not in the other
        for bitno in (in_rdkit - in_openbabel):
            extra_rdkit[bitno].append(id)
        for bitno in (in_openbabel - in_rdkit):
            extra_openbabel[bitno].append(id)
            
        
print "Extra in RDKit:", sum(len(ids) for ids in extra_rdkit.values())
for bitno, ids in sorted(extra_rdkit.items()):
    examples = repr(ids[:5])[1:-1]
    if len(ids) > 5:
        examples += " ..."
    print "%3d: %d extra = %s" % (bitno, len(ids), examples)

print
print "Extra in Open Babel:", sum(len(ids) for ids in extra_openbabel.values())
for bitno, ids in sorted(extra_openbabel.items()):
    examples = repr(ids[:5])[1:-1]
    if len(ids) > 5:
        examples += " ..."
    print "%3d: %d extra = %s" % (bitno, len(ids), examples)

The results are a bit lengthy - good thing I don't report all of the identifiers for each one!

Extra in RDKit: 9034
  2: 31 extra = 'CHEBI:30444', 'CHEBI:33313', 'CHEBI:37340', 'CHEBI:37341', 'CHEBI:37342' ...
 20: 4 extra = 'CHEBI:27581', 'CHEBI:51461', 'CHEBI:51467', 'CHEBI:51474'
 23: 12 extra = 'CHEBI:15724', 'CHEBI:17735', 'CHEBI:3611', 'CHEBI:3612', 'CHEBI:29811' ...
 42: 13 extra = 'CHEBI:21143', 'CHEBI:33633', 'CHEBI:33722', 'CHEBI:33723', 'CHEBI:33724' ...
 43: 3036 extra = 'CHEBI:7963', 'CHEBI:11230', 'CHEBI:11714', 'CHEBI:12194', 'CHEBI:15431' ...
 44: 34 extra = 'CHEBI:27581', 'CHEBI:32713', 'CHEBI:32715', 'CHEBI:32724', 'CHEBI:32726' ...
 48: 29 extra = 'CHEBI:15378', 'CHEBI:15724', 'CHEBI:17045', 'CHEBI:17735', 'CHEBI:3611' ...
 49: 10 extra = 'CHEBI:23066', 'CHEBI:32713', 'CHEBI:32715', 'CHEBI:32724', 'CHEBI:32726' ...
 52: 12 extra = 'CHEBI:21143', 'CHEBI:31196', 'CHEBI:33722', 'CHEBI:33723', 'CHEBI:33724' ...
 53: 1 extra = 'CHEBI:33633'
 58: 12 extra = 'CHEBI:228275', 'CHEBI:51895', 'CHEBI:52273', 'CHEBI:52298', 'CHEBI:52299' ...
 64: 4 extra = 'CHEBI:57255', 'CHEBI:38600', 'CHEBI:57718', 'CHEBI:58571'
 67: 6 extra = 'CHEBI:33130', 'CHEBI:29229', 'CHEBI:31197', 'CHEBI:33633', 'CHEBI:52699' ...
 68: 33 extra = 'CHEBI:47402', 'CHEBI:49464', 'CHEBI:21143', 'CHEBI:29230', 'CHEBI:29275' ...
 73: 2 extra = 'CHEBI:51730', 'CHEBI:51736'
 75: 7 extra = 'CHEBI:27581', 'CHEBI:32713', 'CHEBI:32715', 'CHEBI:32724', 'CHEBI:32726' ...
 77: 6 extra = 'CHEBI:38010', 'CHEBI:51595', 'CHEBI:52729', 'CHEBI:52748', 'CHEBI:52749' ...
 83: 29 extra = 'CHEBI:27899', 'CHEBI:21549', 'CHEBI:29275', 'CHEBI:30027', 'CHEBI:30312' ...
 85: 2 extra = 'CHEBI:51730', 'CHEBI:51736'
 89: 5 extra = 'CHEBI:31197', 'CHEBI:32715', 'CHEBI:32726', 'CHEBI:32734', 'CHEBI:58416'
 90: 7 extra = 'CHEBI:32713', 'CHEBI:32715', 'CHEBI:32724', 'CHEBI:32726', 'CHEBI:32729' ...
 92: 2 extra = 'CHEBI:51730', 'CHEBI:51736'
 98: 8 extra = 'CHEBI:27581', 'CHEBI:32713', 'CHEBI:32715', 'CHEBI:32724', 'CHEBI:32726' ...
100: 1 extra = 'CHEBI:60153'
107: 2 extra = 'CHEBI:23066', 'CHEBI:51736'
112: 33 extra = 'CHEBI:15342', 'CHEBI:52781', 'CHEBI:57255', 'CHEBI:33826', 'CHEBI:33831' ...
113: 1 extra = 'CHEBI:51736'
114: 5 extra = 'CHEBI:23066', 'CHEBI:51736', 'CHEBI:55429', 'CHEBI:55504', 'CHEBI:55506'
115: 2 extra = 'CHEBI:51736', 'CHEBI:55504'
117: 1 extra = 'CHEBI:51736'
118: 6 extra = 'CHEBI:38010', 'CHEBI:51595', 'CHEBI:52729', 'CHEBI:52748', 'CHEBI:52749' ...
121: 1 extra = 'CHEBI:60153'
124: 4500 extra = 'CHEBI:1734', 'CHEBI:2303', 'CHEBI:2914', 'CHEBI:3013', 'CHEBI:3048' ...
125: 3 extra = 'CHEBI:41981', 'CHEBI:29374', 'CHEBI:29375'
127: 1 extra = 'CHEBI:51736'
128: 1 extra = 'CHEBI:51736'
130: 18 extra = 'CHEBI:21143', 'CHEBI:29229', 'CHEBI:32715', 'CHEBI:32726', 'CHEBI:32734' ...
134: 86 extra = 'CHEBI:15852', 'CHEBI:15982', 'CHEBI:16048', 'CHEBI:16304', 'CHEBI:16379' ...
137: 1 extra = 'CHEBI:51736'
140: 3 extra = 'CHEBI:51730', 'CHEBI:51736', 'CHEBI:55429'
143: 63 extra = 'CHEBI:15342', 'CHEBI:15955', 'CHEBI:16379', 'CHEBI:16848', 'CHEBI:17679' ...
146: 1 extra = 'CHEBI:51736'
147: 1 extra = 'CHEBI:60153'
148: 5 extra = 'CHEBI:23066', 'CHEBI:51730', 'CHEBI:51736', 'CHEBI:55429', 'CHEBI:55504'
150: 2 extra = 'CHEBI:29275', 'CHEBI:58416'
152: 1 extra = 'CHEBI:51736'
156: 17 extra = 'CHEBI:52781', 'CHEBI:51222', 'CHEBI:51224', 'CHEBI:51225', 'CHEBI:51899' ...
157: 12 extra = 'CHEBI:16531', 'CHEBI:17675', 'CHEBI:50375', 'CHEBI:2311', 'CHEBI:11573' ...
159: 7 extra = 'CHEBI:23066', 'CHEBI:36771', 'CHEBI:48079', 'CHEBI:51730', 'CHEBI:51736' ...
161: 21 extra = 'CHEBI:57255', 'CHEBI:30858', 'CHEBI:33134', 'CHEBI:33668', 'CHEBI:33826' ...
165: 934 extra = 'CHEBI:3243', 'CHEBI:3385', 'CHEBI:3395', 'CHEBI:3567', 'CHEBI:3637' ...

Extra in Open Babel: 759
  1: 1 extra = 'CHEBI:33397'
  2: 11 extra = 'CHEBI:49920', 'CHEBI:30437', 'CHEBI:30439', 'CHEBI:30440', 'CHEBI:30442' ...
 25: 58 extra = 'CHEBI:15430', 'CHEBI:15436', 'CHEBI:18023', 'CHEBI:27484', 'CHEBI:27609' ...
 42: 7 extra = 'CHEBI:29779', 'CHEBI:30163', 'CHEBI:32060', 'CHEBI:35836', 'CHEBI:36143' ...
 44: 62 extra = 'CHEBI:15430', 'CHEBI:15436', 'CHEBI:18023', 'CHEBI:27484', 'CHEBI:27609' ...
 49: 39 extra = 'CHEBI:15430', 'CHEBI:15436', 'CHEBI:18023', 'CHEBI:27484', 'CHEBI:27609' ...
 52: 5 extra = 'CHEBI:9016', 'CHEBI:22634', 'CHEBI:32060', 'CHEBI:50953', 'CHEBI:50958'
 53: 9 extra = 'CHEBI:22634', 'CHEBI:30003', 'CHEBI:30004', 'CHEBI:30598', 'CHEBI:32743' ...
 62: 11 extra = 'CHEBI:15724', 'CHEBI:17735', 'CHEBI:3611', 'CHEBI:3612', 'CHEBI:29811' ...
 64: 34 extra = 'CHEBI:15852', 'CHEBI:15982', 'CHEBI:16304', 'CHEBI:16379', 'CHEBI:17439' ...
 67: 24 extra = 'CHEBI:9016', 'CHEBI:22634', 'CHEBI:29278', 'CHEBI:30101', 'CHEBI:30104' ...
 68: 37 extra = 'CHEBI:16144', 'CHEBI:44951', 'CHEBI:9016', 'CHEBI:22634', 'CHEBI:29278' ...
 75: 62 extra = 'CHEBI:15430', 'CHEBI:15436', 'CHEBI:18023', 'CHEBI:27484', 'CHEBI:27609' ...
 77: 67 extra = 'CHEBI:15430', 'CHEBI:15436', 'CHEBI:18023', 'CHEBI:27484', 'CHEBI:27609' ...
 81: 9 extra = 'CHEBI:22634', 'CHEBI:30003', 'CHEBI:30004', 'CHEBI:32060', 'CHEBI:32743' ...
 83: 18 extra = 'CHEBI:161680', 'CHEBI:9016', 'CHEBI:29278', 'CHEBI:29422', 'CHEBI:30104' ...
 89: 1 extra = 'CHEBI:37179'
 90: 2 extra = 'CHEBI:32060', 'CHEBI:37178'
 98: 58 extra = 'CHEBI:15436', 'CHEBI:18023', 'CHEBI:27484', 'CHEBI:27609', 'CHEBI:28421' ...
103: 2 extra = 'CHEBI:36143', 'CHEBI:37179'
105: 1 extra = 'CHEBI:55504'
112: 17 extra = 'CHEBI:40071', 'CHEBI:595389', 'CHEBI:8247', 'CHEBI:32014', 'CHEBI:36603' ...
118: 76 extra = 'CHEBI:15430', 'CHEBI:15436', 'CHEBI:15724', 'CHEBI:17735', 'CHEBI:18023' ...
130: 32 extra = 'CHEBI:29278', 'CHEBI:29779', 'CHEBI:30003', 'CHEBI:30004', 'CHEBI:30101' ...
134: 4 extra = 'CHEBI:34921', 'CHEBI:51597', 'CHEBI:53328', 'CHEBI:60153'
135: 27 extra = 'CHEBI:49706', 'CHEBI:29221', 'CHEBI:29270', 'CHEBI:29271', 'CHEBI:29420' ...
143: 10 extra = 'CHEBI:5835', 'CHEBI:33089', 'CHEBI:36618', 'CHEBI:36645', 'CHEBI:51962' ...
150: 3 extra = 'CHEBI:29240', 'CHEBI:29278', 'CHEBI:30227'
156: 2 extra = 'CHEBI:51204', 'CHEBI:53327'
157: 62 extra = 'CHEBI:15430', 'CHEBI:15436', 'CHEBI:18023', 'CHEBI:27484', 'CHEBI:27609' ...
159: 4 extra = 'CHEBI:29360', 'CHEBI:29434', 'CHEBI:29437', 'CHEBI:29440'
161: 4 extra = 'CHEBI:36603', 'CHEBI:36618', 'CHEBI:36645', 'CHEBI:60153'

Why are there big differences?

MACCS key 44

The second biggest difference is bit 43, where RDKit has 3036 extra matches and Open Babel has 0 extra matches. This corresponds to MACCS key 44, which is "OTHER". The difference is due to timing. In last month's essay I tracked down the definition for OTHER, which had been missing from Open Babel, RDKit, and CDK. I reported it to the respective projects, and the updated definition is now part of the main code base.

RDKit has had a new release since last month, which I'm using; Open Babel has not. I'll demonstrate by asking the chemfp toolkit API for information about the underlying toolkit versions:
>>> from chemfp import rdkit_toolkit, openbabel_toolkit
>>> rdkit_toolkit.software
'RDKit/2014.09.1'
>>> openbabel_toolkit.software
'OpenBabel/2.3.2'

MACCS key 125

The biggest difference is bit 124/key 125. RDKit has 4500 matches that weren't found in Open Babel, while Open Babel has no extra matches. This key is defined as "Aromatic Ring > 1".

This key cannot be expressed as a single SMARTS pattern. RDKit has special code to handle that case. Open Babel does not. The current Open Babel code assumes that a fingerprint can be defined by a set of SMARTS patterns, with one SMARTS per bit/key, and there's no provision to indicate an alternative.

As a result, Open Babel will always set a 0 for that bit.
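
For a sense of the sort of special-case code this key needs, here's a sketch using the RDKit API directly (my own illustration, not RDKit's actual MACCS implementation):

from rdkit import Chem

def has_multiple_aromatic_rings(mol):
    # Count the SSSR rings in which every bond is aromatic.
    num_aromatic = 0
    for bond_ids in mol.GetRingInfo().BondRings():
        if all(mol.GetBondWithIdx(b).GetIsAromatic() for b in bond_ids):
            num_aromatic += 1
    return num_aromatic > 1

print has_multiple_aromatic_rings(Chem.MolFromSmiles("c1ccc2ccccc2c1"))  # naphthalene: True
print has_multiple_aromatic_rings(Chem.MolFromSmiles("c1ccccc1O"))       # phenol: False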

Open Babel MACCS bit count distribution

This got me to wonder about the number of records which set each bit, so I wrote a short program for that.

import chemfp
from chemfp.encodings import to_on_bitlist

fptype = chemfp.get_fingerprint_type("OpenBabel-MACCS")

filename = "/Users/dalke/databases/ChEBI_lite.sdf.gz"

# Initialize a list of bit counts to 0.
counts = [0 for i in range(fptype.num_bits)]

with fptype.read_molecule_fingerprints(filename) as fp_reader:
    for (id, fp) in fp_reader:
        # Keep track of the total bit count
        for bitno in to_on_bitlist(fp):
            counts[bitno] += 1

# Report the result
for bitno, count in enumerate(counts):
    print "%3d: %6d" % (bitno, count)
Here is the Open Babel MACCS bit use distribution:
  0:      0
  1:      2
  2:    405
  3:     31
  4:     14
  5:     44
  6:    209
  7:    407
  8:    306
  9:    125
 10:    452
 11:    160
 12:    115
 13:    144
 14:     35
 15:    232
 16:    131
 17:    204
 18:    622
 19:    102
 20:     97
 21:    438
 22:    192
 23:    666
 24:    687
 25:    870
 26:    200
 27:    242
 28:   2654
 29:    422
 30:    402
 31:    153
 32:    214
 33:    701
 34:    311
 35:    730
 36:    740
 37:   1508
 38:    706
 39:    753
 40:    275
 41:    583
 42:   2069
 43:      0
 44:    665
 45:    226
 46:    590
 47:   3519
 48:   5666
 49:   2429
 50:    621
 51:    452
 52:   5505
 53:   6337
 54:    916
 55:    248
 56:   4981
 57:   1013
 58:    551
 59:    969
 60:   1012
 61:   2398
 62:    335
 63:    565
 64:   3477
 65:   3222
 66:   1315
 67:    211
 68:   3119
 69:    490
 70:    754
 71:   6860
 72:   1111
 73:   3312
 74:   2734
 75:   3010
 76:   2988
 77:   1245
 78:   2880
 79:   3308
 80:   1801
 81:   5228
 82:   4477
 83:   3579
 84:   3668
 85:   2435
 86:   1060
 87:   2839
 88:   7561
 89:   8478
 90:   7778
 91:   4308
 92:   2861
 93:   1967
 94:   6416
 95:   5504
 96:   5064
 97:   6145
 98:   4558
 99:   4120
100:   5355
101:   4878
102:   1340
103:   7468
104:   5827
105:   5600
106:   1540
107:   4975
108:   5852
109:   4748
110:   5774
111:   7248
112:   3699
113:   2494
114:   5190
115:   5297
116:   4970
117:   7356
118:   1667
119:   5755
120:   5254
121:   4356
122:   9127
123:   6298
124:      0
125:   6236
126:   8584
127:   6237
128:   6218
129:   4972
130:  10503
131:   9283
132:   4075
133:   2230
134:   2722
135:   7841
136:   8606
137:   5036
138:  11348
139:   9332
140:   3954
141:   6373
142:   8584
143:   4266
144:   6504
145:  11568
146:   7385
147:   7141
148:   6480
149:   8560
150:   7651
151:  10272
152:   9338
153:  11775
154:  11310
155:   9060
156:  13650
157:   9125
158:  14337
159:  10031
160:  10299
161:   8148
162:  11358
163:  16302
164:  12939
165:      0
You can see that bits 0, 43, 124, and 165 (or keys 1, 44, 125, and 166) are never set. I've already discussed 44 and 125. Key 1 is "Isotope" and key 166 is "Fragment", neither of which has a SMARTS definition. (I believe the SMARTS definitions should be "[!*0]" and "(*).(*)", but I'll need to test if those are sufficiently cross-toolkit ... Nope, RDKit doesn't seem to support component-level SMARTS.)

RDMACCS - a more cross-toolkit MACCS implementation

Chemfp includes the "RDMACCS" fingerprints, my attempt at a cross-toolkit implementation of the MACCS keys. They start with the same SMARTS definitions as RDKit (though I haven't yet added key 44) and include toolkit-specific implementations for the keys which can't be expressed in SMARTS.

When I redo the same comparisons using the following fingerprint types:

rdkit_maccs_fptype = chemfp.get_fingerprint_type("RDMACCS-RDKit")
openbabel_maccs_fptype = chemfp.get_fingerprint_type("RDMACCS-OpenBabel")
I get answers which are much closer together:
#records 19640  #skipped 712
RDKit popcount: 634388  RDKit only: 528 average popcount: 33.5  average density: 0.202
Open Babel popcount: 634698  OB only: 838 average popcount: 33.5  average density: 0.202
I haven't yet done the research to figure out the source of all of the differences, but from work I did in 2011 on Dealing with SSSR I know that aromaticity perception is likely an important factor.

I'll leave you with a bitwise comparison so you can get a sense of the differences for yourself:

Extra in RDKit: 528
 20: 4 extra = 'CHEBI:27581', 'CHEBI:51461', 'CHEBI:51467', 'CHEBI:51474'
 23: 12 extra = 'CHEBI:15724', 'CHEBI:17735', 'CHEBI:3611', 'CHEBI:3612', 'CHEBI:29811' ...
 42: 13 extra = 'CHEBI:21143', 'CHEBI:33633', 'CHEBI:33722', 'CHEBI:33723', 'CHEBI:33724' ...
 44: 34 extra = 'CHEBI:27581', 'CHEBI:32713', 'CHEBI:32715', 'CHEBI:32724', 'CHEBI:32726' ...
 48: 18 extra = 'CHEBI:15724', 'CHEBI:17045', 'CHEBI:17735', 'CHEBI:3611', 'CHEBI:3612' ...
 49: 10 extra = 'CHEBI:23066', 'CHEBI:32713', 'CHEBI:32715', 'CHEBI:32724', 'CHEBI:32726' ...
 52: 12 extra = 'CHEBI:21143', 'CHEBI:31196', 'CHEBI:33722', 'CHEBI:33723', 'CHEBI:33724' ...
 53: 1 extra = 'CHEBI:33633'
 58: 12 extra = 'CHEBI:228275', 'CHEBI:51895', 'CHEBI:52273', 'CHEBI:52298', 'CHEBI:52299' ...
 64: 4 extra = 'CHEBI:57255', 'CHEBI:38600', 'CHEBI:57718', 'CHEBI:58571'
 67: 6 extra = 'CHEBI:33130', 'CHEBI:29229', 'CHEBI:31197', 'CHEBI:33633', 'CHEBI:52699' ...
 68: 33 extra = 'CHEBI:47402', 'CHEBI:49464', 'CHEBI:21143', 'CHEBI:29230', 'CHEBI:29275' ...
 73: 2 extra = 'CHEBI:51730', 'CHEBI:51736'
 75: 7 extra = 'CHEBI:27581', 'CHEBI:32713', 'CHEBI:32715', 'CHEBI:32724', 'CHEBI:32726' ...
 77: 6 extra = 'CHEBI:38010', 'CHEBI:51595', 'CHEBI:52729', 'CHEBI:52748', 'CHEBI:52749' ...
 83: 29 extra = 'CHEBI:27899', 'CHEBI:21549', 'CHEBI:29275', 'CHEBI:30027', 'CHEBI:30312' ...
 85: 2 extra = 'CHEBI:51730', 'CHEBI:51736'
 89: 5 extra = 'CHEBI:31197', 'CHEBI:32715', 'CHEBI:32726', 'CHEBI:32734', 'CHEBI:58416'
 90: 7 extra = 'CHEBI:32713', 'CHEBI:32715', 'CHEBI:32724', 'CHEBI:32726', 'CHEBI:32729' ...
 92: 2 extra = 'CHEBI:51730', 'CHEBI:51736'
 98: 8 extra = 'CHEBI:27581', 'CHEBI:32713', 'CHEBI:32715', 'CHEBI:32724', 'CHEBI:32726' ...
107: 2 extra = 'CHEBI:23066', 'CHEBI:51736'
112: 33 extra = 'CHEBI:15342', 'CHEBI:52781', 'CHEBI:57255', 'CHEBI:33826', 'CHEBI:33831' ...
113: 1 extra = 'CHEBI:51736'
114: 5 extra = 'CHEBI:23066', 'CHEBI:51736', 'CHEBI:55429', 'CHEBI:55504', 'CHEBI:55506'
115: 2 extra = 'CHEBI:51736', 'CHEBI:55504'
117: 1 extra = 'CHEBI:51736'
118: 6 extra = 'CHEBI:38010', 'CHEBI:51595', 'CHEBI:52729', 'CHEBI:52748', 'CHEBI:52749' ...
124: 12 extra = 'CHEBI:18023', 'CHEBI:35895', 'CHEBI:36303', 'CHEBI:37372', 'CHEBI:52583' ...
127: 1 extra = 'CHEBI:51736'
128: 1 extra = 'CHEBI:51736'
130: 18 extra = 'CHEBI:21143', 'CHEBI:29229', 'CHEBI:32715', 'CHEBI:32726', 'CHEBI:32734' ...
134: 86 extra = 'CHEBI:15852', 'CHEBI:15982', 'CHEBI:16048', 'CHEBI:16304', 'CHEBI:16379' ...
137: 1 extra = 'CHEBI:51736'
140: 3 extra = 'CHEBI:51730', 'CHEBI:51736', 'CHEBI:55429'
143: 63 extra = 'CHEBI:15342', 'CHEBI:15955', 'CHEBI:16379', 'CHEBI:16848', 'CHEBI:17679' ...
146: 1 extra = 'CHEBI:51736'
148: 5 extra = 'CHEBI:23066', 'CHEBI:51730', 'CHEBI:51736', 'CHEBI:55429', 'CHEBI:55504'
150: 2 extra = 'CHEBI:29275', 'CHEBI:58416'
152: 1 extra = 'CHEBI:51736'
156: 17 extra = 'CHEBI:52781', 'CHEBI:51222', 'CHEBI:51224', 'CHEBI:51225', 'CHEBI:51899' ...
157: 12 extra = 'CHEBI:16531', 'CHEBI:17675', 'CHEBI:50375', 'CHEBI:2311', 'CHEBI:11573' ...
159: 7 extra = 'CHEBI:23066', 'CHEBI:36771', 'CHEBI:48079', 'CHEBI:51730', 'CHEBI:51736' ...
161: 21 extra = 'CHEBI:57255', 'CHEBI:30858', 'CHEBI:33134', 'CHEBI:33668', 'CHEBI:33826' ...

Extra in Open Babel: 838
 25: 58 extra = 'CHEBI:15430', 'CHEBI:15436', 'CHEBI:18023', 'CHEBI:27484', 'CHEBI:27609' ...
 42: 7 extra = 'CHEBI:29779', 'CHEBI:30163', 'CHEBI:32060', 'CHEBI:35836', 'CHEBI:36143' ...
 44: 62 extra = 'CHEBI:15430', 'CHEBI:15436', 'CHEBI:18023', 'CHEBI:27484', 'CHEBI:27609' ...
 48: 4 extra = 'CHEBI:29344', 'CHEBI:29940', 'CHEBI:30147', 'CHEBI:30148'
 49: 39 extra = 'CHEBI:15430', 'CHEBI:15436', 'CHEBI:18023', 'CHEBI:27484', 'CHEBI:27609' ...
 52: 5 extra = 'CHEBI:9016', 'CHEBI:22634', 'CHEBI:32060', 'CHEBI:50953', 'CHEBI:50958'
 53: 9 extra = 'CHEBI:22634', 'CHEBI:30003', 'CHEBI:30004', 'CHEBI:30598', 'CHEBI:32743' ...
 62: 11 extra = 'CHEBI:15724', 'CHEBI:17735', 'CHEBI:3611', 'CHEBI:3612', 'CHEBI:29811' ...
 64: 34 extra = 'CHEBI:15852', 'CHEBI:15982', 'CHEBI:16304', 'CHEBI:16379', 'CHEBI:17439' ...
 67: 24 extra = 'CHEBI:9016', 'CHEBI:22634', 'CHEBI:29278', 'CHEBI:30101', 'CHEBI:30104' ...
 68: 37 extra = 'CHEBI:16144', 'CHEBI:44951', 'CHEBI:9016', 'CHEBI:22634', 'CHEBI:29278' ...
 75: 62 extra = 'CHEBI:15430', 'CHEBI:15436', 'CHEBI:18023', 'CHEBI:27484', 'CHEBI:27609' ...
 77: 67 extra = 'CHEBI:15430', 'CHEBI:15436', 'CHEBI:18023', 'CHEBI:27484', 'CHEBI:27609' ...
 81: 9 extra = 'CHEBI:22634', 'CHEBI:30003', 'CHEBI:30004', 'CHEBI:32060', 'CHEBI:32743' ...
 83: 17 extra = 'CHEBI:161680', 'CHEBI:9016', 'CHEBI:29278', 'CHEBI:30104', 'CHEBI:30227' ...
 89: 1 extra = 'CHEBI:37179'
 90: 2 extra = 'CHEBI:32060', 'CHEBI:37178'
 98: 58 extra = 'CHEBI:15436', 'CHEBI:18023', 'CHEBI:27484', 'CHEBI:27609', 'CHEBI:28421' ...
103: 2 extra = 'CHEBI:36143', 'CHEBI:37179'
105: 1 extra = 'CHEBI:55504'
112: 17 extra = 'CHEBI:40071', 'CHEBI:595389', 'CHEBI:8247', 'CHEBI:32014', 'CHEBI:36603' ...
118: 76 extra = 'CHEBI:15430', 'CHEBI:15436', 'CHEBI:15724', 'CHEBI:17735', 'CHEBI:18023' ...
124: 88 extra = 'CHEBI:15852', 'CHEBI:15955', 'CHEBI:15982', 'CHEBI:16304', 'CHEBI:16379' ...
130: 32 extra = 'CHEBI:29278', 'CHEBI:29779', 'CHEBI:30003', 'CHEBI:30004', 'CHEBI:30101' ...
134: 4 extra = 'CHEBI:34921', 'CHEBI:51597', 'CHEBI:53328', 'CHEBI:60153'
135: 27 extra = 'CHEBI:49706', 'CHEBI:29221', 'CHEBI:29270', 'CHEBI:29271', 'CHEBI:29420' ...
143: 10 extra = 'CHEBI:5835', 'CHEBI:33089', 'CHEBI:36618', 'CHEBI:36645', 'CHEBI:51962' ...
150: 3 extra = 'CHEBI:29240', 'CHEBI:29278', 'CHEBI:30227'
156: 2 extra = 'CHEBI:51204', 'CHEBI:53327'
157: 62 extra = 'CHEBI:15430', 'CHEBI:15436', 'CHEBI:18023', 'CHEBI:27484', 'CHEBI:27609' ...
159: 4 extra = 'CHEBI:29360', 'CHEBI:29434', 'CHEBI:29437', 'CHEBI:29440'
161: 4 extra = 'CHEBI:36603', 'CHEBI:36618', 'CHEBI:36645', 'CHEBI:60153'
As far as I'm aware, no one has done an extensive analysis of the differences between the available MACCS implementations. Are you up for the challenge?



MACCS key 44 #

The MACCS 166 keys are one of the mainstay fingerprints of cheminformatics, especially regarding molecular similarity. It's rather odd, really, since they were developed for substructure screening and not similarity. I suppose that Jaccard would agree that any relatively diverse feature vector can likely be used to measure similarity, whether it be Alpine biomes or chemical structures.

Here's a bit of dirty laundry that you'll not read in the literature. There are a lot of MACCS implementations, and they don't agree fully with each other. The differences are likely small, but as far as I can tell, no one has really investigated if it's a problem, or noted that it might be a problem.

I'll structure the explanation around key #44. What is the definition for key 44?

To start, there is no publication describing the MACCS 166 public keys. All of the citations for them either say some variation of "MDL did it" or cite the 2002 paper which reoptimized the keys for similarity ([PDF]). The thing is, just about everyone uses the "unoptimized" definitions, so this is, technically, the wrong citation. (Why do people use it? Tradition, perhaps, or because it feels better to have a real citation rather than a nebulous one.)

Instead, the definitions appear to have come from ISIS/Base, and have been passed around from person to person through informal means. I haven't used the MDL software and can't verify the source myself. There's a relatively recent whitepaper from Accelrys titled "The Keys to Understanding MDL Keyset Technology" which says they are defined in the file "eksfil.dat". A Google search finds 8 results for "eksfil.dat". All are tied to that white paper. The PDF has creation and modification dates of 31 August 2011, and Archive.org first saw that URL on 11 October 2011.

It's easy to see that the reoptimization fingerprint is not the same as the 166 keys that everyone uses. You'll find that many places say that key 44 is defined as "OTHER". Table 5 of the reoptimization paper has an entry for '"other" atom type', but there's nothing which assigns it to key 44. You can't even try to infer some sort of implicit ordering because the previous entry in table 5 is "isotope", which is key 1 in the MACCS 166 keys, and two entries later is "halogen", which is key 134.

If you cite Durant, Leland, Henry, and Nourse (2002) as your reference to the MACCS 166 bit public keys then you are doing your readers a disservice. Those define different fingerprints than you used. Just go ahead and cite "MACCS keys. MDL Information Systems" and if the reviewer complains that it's a bad citation, point them to this essay and ask them for the correct one. Then tell me what they said. If Accelrys complains then they need to suggest the correct citation and put it in their white paper. Even better would be a formal publication and a validation suite. (I can dream, can't I?)

In practice, many people use the MACCS keys as interpreted by the implementers of some piece of software. I used "interpreted by" because "implemented by" is too strong. There are ambiguities in the definition, mistakes in the implementations, and differences in chemical interpretation, compounded by a lack of any sort of comprehensive validation suite.

Let's take key 44, "OTHER". Remember how the definition comes from an internal MDL data file? What does "OTHER" mean? RDKit defines it as '?' in MACCSkeys.py to indicate that it has no definition for that key. That line has a commit date of 2006-05-06. RDKit's lack of a definition is notable because Open Babel, CDK, a user-contributed implementation for ChemAxon, and many others reuse the RDKit SMARTS definitions. All of them omit key 44.

Others have implemented key 44. TJ O'Donnell, in "Design and Use of Relational Databases in Chemistry" (2009), defines it as the SMARTS [!#6!#7!#8!#15!#16!#9!#17!#35]. MayaChemTools defines it in code as an atom whose element number is not in "1|6|7|8|9|14|15|16|17|35|53". (See _IsOtherAtom.)

These are the ones where I have access to the source and could investigate without much effort.

Both the whitepaper and the reoptimization paper define what "other" means, and the whitepaper does so specifically in the context of the MACCS 166 keys. It says:

"Other" atoms include any atoms other than H, C, N, O, Si, P, S, F, Cl, Br, and I, and is abbreviated "Z".
This appears definite and final. Going back to the three different implementation genealogies: RDKit and its many spinoffs don't have a definition, so by definition aren't correct. O'Donnell's is close, but the SMARTS pattern omits hydrogen, silicon, and iodine. And MayaChemTools gets it exactly correct.

Good job, Manish Sud!
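
For what it's worth, here's one way the whitepaper definition could be expressed as a SMARTS pattern (my own sketch, not taken from any of the implementations above, and untested across toolkits):

from rdkit import Chem

# "Other" = any atom that is not H, C, N, O, Si, P, S, F, Cl, Br, or I.
OTHER_ATOM = Chem.MolFromSmarts("[!#1;!#6;!#7;!#8;!#14;!#15;!#16;!#9;!#17;!#35;!#53]")

print Chem.MolFromSmiles("C[Se]C").HasSubstructMatch(OTHER_ATOM)     # True; selenium is "other"
print Chem.MolFromSmiles("C[Si](C)C").HasSubstructMatch(OTHER_ATOM)  # False; silicon is in the list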

Are these MACCS variations really a problem?

No. Not really. Well, maybe. It depends on who you are.

When used for similarity, a dead bit just makes things more similar, because there are fewer ways to distinguish between molecules. In this case, too, key 44 is rare: only a handful of molecules contain "other" atoms (like the gold in auranofin), so when characterizing a database it's likely fine.

You don't need to trust my own gut feeling. You can read the RDKit documentation and see "The MACCS keys were critically evaluated and compared to other MACCS implementations in Q3 2008. In cases where the public keys are fully defined, things looked pretty good."

Okay, so you're hesitant about the keys which aren't "fully defined"? No need to despair. Roger Sayle ported the RDKit patterns (without key 44) over to ChemAxon, and reported:

This work is heavily based upon the previous implementation by Miklos Vargyas, and the SMARTS definitions developed and refined by Greg Landrum and Andrew Dalke. This implementation achieves ~65% on the standard Briem and Lessel benchmark, i.e. almost identical to the expected value for MACCS keys reported in the literature by MDL and others.

(NB: All I did was proofread the RDKit SMARTS and find a few places that needed fixing.)

The MACCS 166 keys are a blunt tool, designed for substructure search and repurposed for similarity largely because they were already present and easy to generate. 2D similarity search is another blunt tool. That's not to say they are horrible or worthless! A rock is a blunt tool for making an ax, but we used stone axes quite effectively throughout the Neolithic.

Just don't treat the MACCS 166 keys as a good luck charm, or as some sort of arcane relic passed down by the ancients. There are limitations in the definition and limitations in the implementation. Different tools will give different answers, and if you don't understand your tools they may turn on you.

And when you write a paper, be honest with your readers. If you are using the RDKit implementation of the MACCS keys, or a derived version in another toolkit (and assuming they haven't been changed since I wrote this essay), point out that you are only using 164 of those 166 bits.

Homework assignment

For a warmup exercise, what is the other unimplemented bit in the RDKit MACCS definition?

For your homework assignment, use two different programs to compute the MACCS keys for a large data set and see 1) how many bits are different (e.g., the sum of the Manhattan distances between the fingerprints for each record, or come up with a better measure), 2) how many times the nearest neighbor changes, and 3) (bonus points) how often those differences are due to differences in how a key is interpreted and how often they are due to different toolkit aromaticity/chemistry perception methods.
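
To get you started on part 1, here's a minimal sketch of a per-record bit-difference count, reusing the bitops functions from earlier (a Hamming distance; feel free to replace it with a better measure):

from chemfp.bitops import byte_popcount, byte_intersect_popcount

def num_differing_bits(fp1, fp2):
    # Bits set in one fingerprint but not in the other.
    return (byte_popcount(fp1) + byte_popcount(fp2)
            - 2 * byte_intersect_popcount(fp1, fp2))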

I expect a paper in a journal by the end of next year. :).

(Then again, for all I know this is one of those negative results papers that's so hard to publish. "9 different MACCS key implementations produce identical MACCS keys!" doesn't sound exciting, does it?)



chemfp's format API #

This is part of a series of essays describing the reasons behind chemfp's new APIs. This one, about the new format API, is a lot shorter than the previous one on parse_molecule() and parse_id_and_molecule().

It's sometimes useful to know which cheminformatics formats are available, if only to display a help message or pulldown menu. The get_formats() toolkit function returns a list of available formats, as 'Format' instances.

>>> from chemfp import rdkit_toolkit as T
>>> T.get_formats()
[Format('rdkit/canstring'), Format('rdkit/inchikey'),
Format('rdkit/usmstring'), Format('rdkit/smistring'),
Format('rdkit/molfile'), Format('rdkit/usm'),
Format('rdkit/inchikeystring'), Format('rdkit/sdf'),
Format('rdkit/can'), Format('rdkit/smi'), Format('rdkit/inchi'),
Format('rdkit/rdbinmol'), Format('rdkit/inchistring')]
(The next version of chemfp will likely support RDKit's relatively new PDB reader.)

You can ask a format for its name, or see if it is an input format or output format by checking "is_input_format" and "is_output_format", respectively. If you just want the list of input formats or output formats, use get_input_formats() or get_output_formats().

Here's an example to show which output formats are not also input formats:

>>> [format.name for format in T.get_output_formats() if not format.is_input_format]
['inchikey', 'inchikeystring']

You may recall that some formats are record-based and others are, for lack of a better word, "string-based". The latter include "smistring", "inchistring", and "inchikeystring". These are not records in their own right, so can't be read or written to a file.

I really couldn't come up with a good predicate which described those formats. The closest was "is_a_record". I ended up with "supports_io". I'm not happy with the name. If true, the format can be used in file I/O.

The RDKit input formats which do not support I/O are the expected ones ... and rdbinmol.

>>> [format.name for format in T.get_input_formats() if not format.supports_io]
['canstring', 'usmstring', 'smistring', 'molfile', 'rdbinmol', 'inchistring']
(The "rdbinmol" is an experimental format. It's the byte string from calling an RDKit molecule's "ToBinary()" method, which is also the basis for its pickle support.)

get_format() and compression

You can get a specific format by name using get_format(). This can also be used to specify a compressed format:

>>> T.get_format("sdf")
Format('rdkit/sdf')
>>> T.get_format("smi.gz")
Format('rdkit/smi.gz')
>>> format = T.get_format("smi.gz")
>>> format
Format('rdkit/smi.gz')
>>> format.name
'smi'
>>> format.compression
'gz'

Default reader and writer arguments

Toolkit- and format-specific arguments were a difficult challenge. I want chemfp to support multiple toolkits, because I know people work with fingerprints from multiple toolkits. Each of the toolkits has its own way to parse and generate records. I needed some way to have a common API but with a mechanism to control the underlying toolkit options.

The result is reader_args, which I discussed in the previous essay, and its writer_args complement for turning a molecule into a record.

A Format instance can be toolkit-specific; "rdkit/smi.gz" is an RDKit format. (The toolkit name is available from the aptly named attribute 'toolkit_name'.) Each Format has a way to get the default reader_args and writer_args for the format:

>>> format = T.get_format("smi.gz")
>>> format
Format('rdkit/smi.gz')
>>> format.get_default_reader_args()
{'delimiter': None, 'has_header': False, 'sanitize': True}
>>> format.get_default_writer_args()
{'isomericSmiles': True, 'delimiter': None, 'kekuleSmiles': False, 'allBondsExplicit': False, 'canonical': True}
This is especially useful if you are on the interactive prompt and have forgotten the option names.

Convert text settings into arguments

The -R command-line option for the chemfp tools rdkit2fps, ob2fps, and oe2fps lets users set the reader_args. If your target molecules are in a space-delimited SMILES file then you can set the 'delimiter' option to 'space':

oe2fps -R delimiter=space targets.smi.gz -o targets.fps
or ask RDKit to disable sanitization using:
rdkit2fps -R sanitize=false targets.smi.gz -o targets.fps
The -R flag takes string keys and values. Reader_args, on the other hand, are a dictionary with string keys but possibly integers and booleans as values. You could write the converter yourself, but that gets old very quickly. Instead, I included it as the format's get_reader_args_from_text_settings(). (The *2fps programs don't generate structure output, but if they did, the equivalent command-line flag would be -W, and the equivalent format method is get_writer_args_from_text_settings().)

Yes, I agree that get_..._settings() is a very long name. I couldn't think of a better one. I decided that "text settings" are the reader_args and writer_args expressed as a dictionary with string names and string values.

I'll use that long named function to convert some text settings into proper reader_args:

>>> format.get_reader_args_from_text_settings({
...    "delimiter": "tab",
...    "sanitize": "false",
... })
{'delimiter': 'tab', 'sanitize': False}
You can see that the text "false" was converted into the Python False value.

Namespaces

Names like "delimiter" and "sanitize" are 'unqualified' and apply for every toolkit and every format which accept them. This makes sense for "delimiter" because it's pointless to have OEChem parse a SMILES file using a different delimiter style than RDKit. It's acceptable for "sanitize" because only RDKit knows what it means, and the other toolkits will ignore unknown names. For many cases then you could simply do something like:

reader_args = {
  "delimiter": "tab",       # for SMILES files
  "strictParsing": False,   # for RDKit SDF
  "perceive_stereo": True,  # for Open Babel SDF
  "aromaticity": "daylight, # for all OEChem readers
}

At the moment the toolkits all use different option names for the same format, so there's no conflict there. But a toolkit does use the same option name for different formats, and there can be a good reason why the value for SMILES output should be different from the value for SDF record output.

The best example is OEChem, which uses a "flavor" flag to specify the input and output options for all formats. (For chemfp I decided to split OEChem's flavor into 'flavor' and 'aromaticity' reader and writer arguments. I leave that discussion for elsewhere.) I'll start by making an OEGraphMol.

from chemfp import openeye_toolkit

phenol = "c1ccccc1[16OH]"
oemol = openeye_toolkit.parse_molecule(phenol, "smistring")
Even though "smistring" output by default generates the canonical isomorphic SMILES for the record, I can ask it to generate a different output flavor. For convience, the flavor value can be an integer, which is treated as the flavor bitmask, or it can be a string of "|" or "," separated bitmask names. Usually the bitmask names are or'ed together, but a leading "-" means to unset the corresponding bits for that flag.
>>> openeye_toolkit.create_string(oemol, "smistring")
'c1ccc(cc1)[16OH]'
>>> openeye_toolkit.create_string(oemol, "smistring",
...      writer_args={"flavor": "Default"})
'c1ccc(cc1)[16OH]'
>>> openeye_toolkit.create_string(oemol, "smistring",
...      writer_args={"flavor": "Default,-Isotopes"})
'c1ccc(cc1)O'
>>> openeye_toolkit.create_string(oemol, "smistring",
...      writer_args={"flavor": "Canonical|Kekule|Isotopes"})
'C1=CC=C(C=C1)[16OH]'
Here I'll ask for the SDF record output in V3000 format. (In the future I expect to have a special "sdf3" or "sdf3000" format, to make it easier to specify V3000 output across all toolkits.)
>>> print(openeye_toolkit.create_string(oemol, "sdf",
...        writer_args={"flavor": "Default|MV30"}))

  -OEChem-09261411132D

  0  0  0     0  0            999 V3000
M  V30 BEGIN CTAB
M  V30 COUNTS 7 7 0 0 0
M  V30 BEGIN ATOM
M  V30 1 C 0 0 0 0
M  V30 2 C 0 0 0 0
M  V30 3 C 0 0 0 0
M  V30 4 C 0 0 0 0
M  V30 5 C 0 0 0 0
M  V30 6 C 0 0 0 0
M  V30 7 O 0 0 0 0 MASS=16
M  V30 END ATOM
M  V30 BEGIN BOND
M  V30 1 2 1 6
M  V30 2 1 1 2
M  V30 3 2 2 3
M  V30 4 1 3 4
M  V30 5 2 4 5
M  V30 6 1 5 6
M  V30 7 1 6 7
M  V30 END BOND
M  V30 END CTAB
M  END
$$$$

What's the problem?

One problem comes when I want to configure chemfp so that if the output is SMILES it uses one flavor, and if the output is SDF it uses another. You could construct a table of format-specific writer_args, like this:

writer_args_by_format = {
  "smi": {"flavor": "Canonical|Kekule|Isotopes", "aromaticity": "openeye"},
  "sdf": {"flavor": "Default|MV30", "aromaticity": "openeye"},
    ...
}

record = T.create_string(mol, format,
           writer_args = writer_args_by_format[format])
but not only is that tedious, it doesn't handle toolkit-specific options. Nor is there an easy way to turn the text settings into this data structure.

Qualified names

Instead, the reader_args and writer_args accept "qualified" names, which can be format-specific like "sdf.flavor", toolkit-specific like "openeye.*.aromaticity", or both, like "openeye.sdf.aromaticity".

A cleaner way to write the previous example is:

writer_args = {
  "smi.flavor": "Canonical|Kekule|Isotopes",
  "sdf.flavor": "Default|MV30",
  "aromaticity": "openeye",   # Use the openeye aromaticity model for all formats
    ...
}

record = T.create_string(mol, format, writer_args = writer_args)
or if you want to be toolkit-specific, use "openeye.smi.flavor", "openeye.sdf.flavor" and "openeye.*.aromaticity", etc.

Precedence

You probably noticed there are many ways to specify the same setting, as in the following:

reader_args = {
  "delimiter": "tab",
  "openeye.*.delimiter": "whitespace",
  "smi.delimiter": "space",
}
The chemfp precedence goes from most-qualified name to least-qualified, so for this case the search order is:
openeye.smi.delimiter
openeye.*.delimiter
smi.delimiter
delimiter

How to convert qualified names into unqualified names

The Format object's get_unqualified_reader_args() converts a complicated reader_args dictionary which may contain qualified names into a simpler reader_args dictionary with only unqualified names, and only the names appropriate for the format. It's used internally to simplify the search for the right name, and it's part of the public API so you can check whether your qualifiers are working correctly. I'll give an example of debugging in a moment.

Here's an example which shows that the previous 'reader_args' example, with its several delimiter specifications, resolves to the 'whitespace' delimiter style.

>>> from chemfp import openeye_toolkit
>>> 
>>> reader_args = {
...   "delimiter": "tab",
...   "openeye.*.delimiter": "whitespace",
...   "smi.delimiter": "space",
... }
>>> 
>>> format = openeye_toolkit.get_format("smi")
>>> format.get_unqualified_reader_args(reader_args)
{'delimiter': 'whitespace', 'flavor': None, 'aromaticity': None}
You can see that it also fills in the default values for unspecified arguments. Note that this function does not validate values. It's only concerned with resolving the names.

The equivalent method for writer_args is get_unqualified_writer_args() - I try to be predictable in my APIs.

This function is useful for debugging because it helps you spot typos. Readers ignore unknown arguments, so if you type "opneye" instead of "openeye" then it just assumes that you were talking about some other toolkit.

If you can't figure out why your reader_args or writer_args aren't being accepted, pass them through the 'unqualified' method and see what it gives:

>>> format.get_unqualified_reader_args({"opneye.*.aromaticity": "daylight"})
{'delimiter': None, 'flavor': None, 'aromaticity': None}

Qualified names and text settings

The Format object also supports qualifiers in the reader and writer text_settings and applies the same search order to give the unqualified reader_args.

>>> format.get_reader_args_from_text_settings({
...    "sanitize": "true",
...    "rdkit.*.sanitize": "false",
... })
{'sanitize': False}

Errors in the text settings

The get_reader_args_from_text_settings() and get_writer_args_from_text_settings() methods will validate the values as much as they can, and raise a ValueError with a helpful message if that fails.

>>> from chemfp import openeye_toolkit
>>> sdf_format = openeye_toolkit.get_format("sdf")
>>> sdf_format.get_writer_args_from_text_settings({
...   "flavor": "bland",
... })
Traceback (most recent call last):
  File "", line 2, in 
  File "chemfp/base_toolkit.py", line 407, in get_writer_args_from_text_settings
    return self._get_args_from_text_settings(writer_settings, self._format_config.output)
  File "chemfp/base_toolkit.py", line 351, in _get_args_from_text_settings
    % (self.toolkit_name, name, value, err))
ValueError: Unable to parse openeye setting flavor ('bland'): OEChem sdf format does not support the 'bland' flavor option. Available flavors are: CurrentParity, MCHG, MDLParity, MISO, MRGP, MV30, NoParity

File format detection based on extension

All of the above assumes you know the file format. Sometimes you only know the filename, and want to determine (or "guess") the format based on its extension. The file "abc.smi" is a SMILES file, the file "xyz.sdf" is an SD file, and "xyz.sdf.gz" is a gzip-compressed SD file.

The toolkit function get_input_format_from_source() will try to determine the format for an input file, given the source filename:

>>> from chemfp import openbabel_toolkit as T
>>> T.get_input_format_from_source("example.smi")
Format('openbabel/smi')
>>> T.get_input_format_from_source("example.sdf.gz")
Format('openbabel/sdf.gz')
>>> format = T.get_input_format_from_source("example.sdf.gz")
>>> format.get_default_reader_args()
{'implementation': None, 'perceive_0d_stereo': False, 'perceive_stereo': False, 'options': None}
The equivalent for output files is get_output_format_from_destination().

The main difference between the two is that get_input_format_from_source() will raise an exception if the format is known but not supported as an input format, and get_output_format_from_destination() will raise an exception if the format is known but not supported as an output format.

>>> T.get_input_format_from_source("example.inchikey")
Traceback (most recent call last):
  File "", line 1, in 
  File "chemfp/openbabel_toolkit.py", line 109, in get_input_format_from_source
    return _format_registry.get_input_format_from_source(source, format)
  File "chemfp/base_toolkit.py", line 606, in get_input_format_from_source
    format_config = self.get_input_format_config(register_name)
  File "chemfp/base_toolkit.py", line 530, in get_input_format_config
    % (self.external_name, register_name))
ValueError: Open Babel does not support 'inchikey' as an input format

The format detection functions actually take two arguments, where the second is the format name.

>>> T.get_input_format_from_source("example.inchikey", "smi.gz")
Format('openbabel/smi.gz')
This is meant to simplify the logic that would otherwise lead to code like:
if format is not None:
    format = T.get_input_format(format)
else:
    format = T.get_input_format_from_source(source)

By the way, the source and destination can be None. This tells chemfp to read from stdin or write to stdout. Since stdin and stdout don't have a file extension, what format do they have? My cheminformatics roots started with Daylight, so I decided that the default format is "smi".
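
That suggests the following should hold (my own guess based on the behavior described above; I haven't shown actual output):

from chemfp import openbabel_toolkit as T

# With no filename there's no extension to look at, so the default
# "smi" format should come back (an assumption, per the text above).
format = T.get_input_format_from_source(None)
print format.name   # expected: 'smi'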



chemfp's parse_molecule() #

I used several use cases to guide chemfp-1.2 development. One, under the category of "web service friendly", was to develop a web service which takes a structure record and optional format name as input for a k=3 nearest neighbor search of a pre-loaded target fingerprint file. That doesn't sound so hard, except I'll let the fingerprint file determine the fingerprint type based on its 'type' header, which might specify Open Babel, RDKit, or OpenEye's toolkits.

Those toolkits all have different APIs, but I would prefer not to write different code for each case. Chemfp-1.1 can be hacked to handle most of what's needed, because read_structure_fingerprints() will read a structure file and compute fingerprints for each record. The hack part is to save the XML-RPC query to a file, since read_structure_fingerprints() only works from a file, not a string.

That isn't a full solution. Chemfp-1.1 doesn't support any sort of structure output, so there would need to be toolkit-specific code to set the tag data and convert the molecule to a record.

For chemfp I decided on the classic approach and made my own toolkit-independent API, with a back-end implementation for each of the supported toolkits. I think it's a great API, and I've been using it a lot for my other internal projects. My hope is that some of the ideas I developed go into other toolkits, or at least influence the design of the next generation of toolkits.

To see a more concrete example, here's that use case implemented as an XML-RPC service using the new chemfp-1.2 API.

from SimpleXMLRPCServer import SimpleXMLRPCServer

import chemfp

# Load the target fingerprints and figure out the fingerprint type and toolkit
targets = chemfp.load_fingerprints("databases/chembl_18_FP2.fps")
fptype = targets.get_fingerprint_type()  # uses the 'type' field from the FPS header
toolkit = fptype.toolkit

print("Loaded %d fingerprints of type %r" % (len(targets), fptype.get_type()))

def search(record, format="sdf"):
    # Parse the molecule and report any failures
    try:
        mol = toolkit.parse_molecule(record, format)
    except ValueError as err:
        return {"status": "error", "msg": str(err)}

    # Compute its fingerprint, search the targets, and get the nearest 3 neighbors
    fp = fptype.compute_fingerprint(mol)
    nearest = targets.knearest_tanimoto_search_fp(fp, k=3, threshold=0.0)

    # Add a tag value like "CHEMBL14060 1.00 CHEMBL8020 0.92 CHEMBL7970 0.92"
    (id1, score1), (id2, score2), (id3, score3) = nearest.get_ids_and_scores()
    toolkit.add_tag(mol, "NEAREST", "%s %.2f %s %.2f %s %.2f" %
                    (id1, score1, id2, score2, id3, score3))

    return {"status": "ok",
            "record": toolkit.create_string(mol, "sdf")}

server = SimpleXMLRPCServer(("localhost", 8000))
server.register_introspection_functions()
server.register_function(search)

if __name__ == "__main__":
    server.serve_forever()

I think this is a clean API, and a bit easier to understand and use than the native toolkit APIs. It's very similar to cinfony, though I think at this level cinfony is a bit easier to understand because it wraps its own Molecule object around the native toolkit molecules, while I leave them as native molecule objects. I have to use helper functions where cinfony can use methods. I did this because I don't want the performance overhead of wrapping and unwrapping for the normal use case of converting a structure file to fingerprints.

I also have more stand-alone objects than cinfony, like my fingerprint type object for fingerprint generation, where cinfony uses a method of the molecule.

parse_molecule()

That example, while illustrative of the new API, isn't significantly better than existing systems. For that, I need to delve into the details, starting with parse_molecule().

Chemfp has three "toolkit" APIs, one for each of the supported toolkits. The usual way to get them is through one of the following:

from chemfp import rdkit_toolkit
from chemfp import openeye_toolkit
from chemfp import openbabel_toolkit
Most of the examples are toolkit independent, so I'll use "T" rather than stress a toolkit. I'll use RDKit as the back-end for these examples:
>>> from chemfp import rdkit_toolkit as T

The type signature is:

parse_molecule(content, format, id_tag=None, reader_args=None, errors='strict')
In the simplest case I'll parse a SMILES record to get an RDKit molecule object:
>>> mol = T.parse_molecule("c1ccccc1[16OH] phenol", "smi")
>>> mol
<rdkit.Chem.rdchem.Mol object at 0x104521360>
If you use the openeye_toolkit you would get an OEGraphMol, and openbabel_toolkit returns an OBMol.

You might have noticed that the first two parameters are in reverse order from cinfony's readstring(), which takes the format in the first position instead of the second. This is because the parse_molecule() parameters parallel chemfp's read_molecules() parameters:

read_molecules(source=None, format=None, id_tag=None, reader_args=None, errors='strict', location=None)
The format there is second because the default of None means to auto-detect the format based on the source. (If the source is a filename then detection is based on the extension. I'll go into more details in another essay.)

In comparison, cinfony readfile() takes (format, filename), and doesn't auto-detect the format. (A future version could do that, but it would require a format like "auto" or None, which I thought was less elegant than being able to use the default format.) I wanted read_molecules() to support auto-detection, which meant the format had to be in the second position, which is why parse_molecule() takes the format in the second position.

parse_molecule() always returns a new molecule

This function promises to return a new molecule object each time. Compare to OEChem or Open Babel, which parse into an existing molecule object.

Those two toolkits reuse the molecule for performance reasons; clearing is faster than deallocating and reallocating a molecule object. My experience is that in practice molecule reuse is error-prone, because it's easy to forget to clear the molecule, or to save multiple results to a list and be surprised that they all end up being the same molecule.

I agree that performance is important. I chose a different route to get there. I noticed that even if the molecule were reused, there would still be overhead in calling parse_molecule() because it has to validate, or at least interpret, the function parameters. Since these parameters rarely change, that validation is unneeded overhead.

What I ended up doing was making a new function, make_id_and_molecule_parser(), which takes the same parameters as parse_molecule(), except leaving out 'content'. It returns a specialized parser function which takes only one parameter, the content, and returns the (id, molecule) pair. (In a future release I may also have a make_molecule_parser() which only returns the molecule, but that's too much work for now.)

This new function is designed for performance, and is free to reuse the molecule object. Functions which return functions are relatively unusual, and I think only more advanced programmers will use it, which makes it less likely that people will run into the reuse problems.

It's still possible, so in the future I may add an option to make_molecule_parser() to require a new molecule each time.
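To show what I have in mind, here's a minimal sketch using the RDKit back-end. The SMILES records are made up; everything else follows the description above.

from chemfp import rdkit_toolkit as T

# The parser is configured once, with the same parameters as
# parse_molecule() minus the content, ...
parse = T.make_id_and_molecule_parser("smi")

# ... then called once per record. Remember that it may reuse the same
# molecule object, so process each molecule before parsing the next record.
for record in ["c1ccccc1O phenol\n", "CCO ethanol\n"]:
    id, mol = parse(record)
    print(id)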

Format names

The second parameter is the format name or Format object. For now I'll only consider format names.

mol = T.parse_molecule("c1ccccc1[16OH] phenol", "smi")

This was surprisingly tricky to get right, because a SMILES record and a SMILES string are different things, and because toolkits differ in what a SMILES record means. When someone says "takes a SMILES as input", does that mean a record or a string?

To clarify, a SMILES file contains SMILES records. A SMILES record is a SMILES string, followed by whitespace, followed by a title/id. Some toolkits take Daylight's lead and say that the title goes to the end of the line. Others interpret the file as a whitespace-, space-, or tab-separated file, or even an arbitrary character delimited file. Some also support a header in the first line. Finally, SMILES records are newline terminated, although that's not always required for the last record. (I'll come back to this in a bit.)

I decided to use different format names for these two cases. The "smi" format refers to a SMILES record, and the "smistring" format refers to just the SMILES string.
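To make the distinction concrete, here's a small sketch using the RDKit back-end from above:

from chemfp import rdkit_toolkit as T

# "smi" expects a SMILES record: a SMILES string, whitespace, and a title.
mol_from_record = T.parse_molecule("c1ccccc1O phenol", "smi")

# "smistring" expects just the SMILES string, with no title.
mol_from_string = T.parse_molecule("c1ccccc1O", "smistring")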

Björn Grüning pointed out that a similar issue exists with InChI. There's the InChI string, but Open Babel and some other tools also support an "InChI file" with the InChI string as the first term, followed by a whitespace, followed by some title or identifier, as well as an "InChIKey file", using the InChIKey instead of the InChI.

Thus, chemfp has "inchistring" and "inchi", and "inchikeystring" and "inchikey", in parallel with the "smistring"/"smi" distinction.

The other formats, like "sdf" and "pdb", are only record-oriented and don't have the same dual meaning as SMILES and InChI.

Compressed formats

A longer term chemfp idea is to extend the binary format to store record data. My current experiments use SQLite for the records and FPB files for the fingerprints.

Uncompressed SDF records take up a lot of space, and compress well using standard zlib compression. The file-based I/O functions support format names like "smi.gz". I extended the idea to support zlib-compressed records, like "sdf.zlib".

Output format names: different types of SMILES

The SMILES output format names are also tricky. This needs a bit of history to understand fully. The Daylight toolkit introduced SMILES strings, but the original syntax did not support isotopes or chirality. These were added later, as so-called "isomeric SMILES". Daylight, and nearly every toolkit since, maintained that separation, where a "SMILES" output string (either canonical or non-canonical) was not isomeric, and something different needed to be done to get an isomeric SMILES.

This was a usability mistake. Most people expect that when the input is 238U then the output SMILES will be "[238U]". I know, because I've probably made that mistake a dozen times in my own code. On the plus side, it's usually very easy to detect and fix. On the minus side, I've only rarely needed canonical non-isomeric SMILES, so the default ends up encouraging mistakes.

OEChem 2.0 decided to break the tradition and say that "smi" refers to canonical isomeric SMILES, which is what most people expect but didn't get, that "can" refers to canonical non-isomeric SMILES (this is unchanged), and that "usm" is the new term for non-canonical, non-isomeric SMILES.

It's a brilliantly simple solution to a usability problem I hadn't really noticed before they solved it. This made so much sense to me that I changed chemfp's new I/O API to use those format names and meanings. I hope others also follow their lead.

That's why "smi" as a chemfp output format means canonical isomeric SMILES, "can" means canonical non-isomeric, and "usm" means non-canonical non-isomeric. The corresponding string formats are "smistring", "canstring", and "usmstring". Here they are in action:

>>> mol = T.parse_molecule("c1ccccc1[16OH] phenol", "smi")
>>> T.create_string(mol, "smi")
'[16OH]c1ccccc1 phenol\n'
>>> T.create_string(mol, "can")
'Oc1ccccc1 phenol\n'
>>> T.create_string(mol, "usm")
'c1ccccc1O phenol\n'
>>> T.create_string(mol, "smistring")
'[16OH]c1ccccc1'

id_tag, parse_id_and_molecule(), and titles

The parse_molecule() function only really uses the "id_tag" parameter to improve error reporting - the error message will use the id_tag's value rather than the record's default title.

The id_tag parameter exists because developers in Hinxton decided to put the record identifier in a tag of an SD file instead of placing it in the record title like the rest of the world. As a result, many cheminformatics tools stumble a bit with the ChEBI and ChEMBL datasets, which either have a blank title or the occasional non-blank but useless title like:

"tetrahydromonapterin"
(1E)-1-methylprop-1-ene-1,2,3-tricarboxylate
S([O-])([O-])(=O)=O.[Ni+2].O.O.O.O.O.O.O
Structure #1
Untitled Document-1
XTP
I've decided that this is a bug, but despite years of complaints they haven't changed it, so chemfp has to work around it.

Here's an SD record from ChEBI that you can copy and paste:


  Marvin  02030822342D          

  2  1  0  0  0  0            999 V2000
   -0.4125    0.0000    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    0.4125    0.0000    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  2  0  0  0  0
M  STY  1   1 SUP
M  SAL   1  2   1   2
M  SMT   1 O2
M  END
> <ChEBI ID>
CHEBI:15379

> <ChEBI Name>
dioxygen

> <Star>
3

$$$$
I'll assign that to the variable 'sdf_record', parse it, and show that the title is empty, though the identifier is available as the "ChEBI ID" tag.
>>> from chemfp import rdkit_toolkit as T
>>> sdf_record = """\
...   
...   Marvin  02030822342D          
... 
...   2  1  0  0  0  0            999 V2000
...    -0.4125    0.0000    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
              ... many lines deleted ...
... $$$$
... """
>>> mol = T.parse_molecule(sdf_record, "sdf")
[00:49:59]  S group SUP ignored on line 8
>>> mol
<rdkit.Chem.rdchem.Mol object at 0x1043de4b0>
>>> T.get_id(mol)
''
>>> T.get_tag(mol, "ChEBI ID")
'CHEBI:15379'
(The "get_id" and "get_tag" methods are part of the new molecule API. I'll discuss them in a future essay.)

I found that I often want to get both the identifier and the molecule. Rather than use a combination of parse_molecule() and get_id()/get_tag() to get that information, I created the parse_id_and_molecule() function, which returns the 2-element tuple of (id, molecule), whose id_tag parameter specifies where to find the id.

>>> T.parse_id_and_molecule(sdf_record, "sdf", id_tag="ChEBI ID")
[00:50:37]  S group SUP ignored on line 8
('CHEBI:15379', <rdkit.Chem.rdchem.Mol object at 0x104346a60>)
If the id_tag is None, then it uses the appropriate default for the given format - the title line for SD records. Otherwise it contains the name of the tag which holds the identifier. (If there are multiple tags with the same name then the choice of which tag to use is made arbitrarily.)

I can imagine that some people might place an identifier in a non-standard location for other formats. If that happens, then there will likely be an id_tag syntax for that format.

For example, the closest I've come up with is a SMILES variant where the id you want is in the third column. In that case the id_tag might be "#3". If the column is labeled "SMILES" then an alternate would be "@SMILES", or if you know the column name doesn't start with a '#' or '@' then simply "SMILES".

However, that's hypothetical. I haven't seen it in real life. What I have seen are CSV files, where the SMILES and id columns are arbitrary, and which may have Excel-specific delimiter and quoting rules. Chemfp doesn't currently support CSV files; when it does, the details will likely be handled through reader_args.

Handling whitespace in a SMILES record

The SMILES record format is not portable across toolkits. According to Daylight, whose lead OEChem follows, the format is the SMILES string, a single whitespace character, and then the rest of the line as the title. This title might be a simple alphanumeric id, or an IUPAC name with spaces in it, or anything else.

Other toolkits treat it as a character delimited set of columns. For example, RDKit by default uses the space character as the delimiter, though you can change that to tab, comma, or other character.

This is a problem because you might want to generate fingerprints from a SMILES file containing the record:

C1C(C)=C(C=CC(C)=CC=CC(C)=CCO)C(C)(C)C1 vitamin a
With one toolkit you'll end up with the identifier "vitamin" and with another toolkit you'll get the identifier "vitamin a".

My experience is that most people treat a SMILES file as a space or tab delimited set of columns, with a strong preference for the space character. This isn't universally true. Roger Sayle pointed out to me that the formal IUPAC name is a perfectly reasonable unique and canonical identifier, which can have a space in it. The Daylight interpretation supports IUPAC names, while the space delimited version does not.

There is no perfect automatic solution to this ambiguity. Instead, chemfp lets you specify the appropriate delimiter type using the "delimiter" reader argument. The supported delimiter type names are "whitespace", "tab", "space", and "to-eol", as well as the literal characters " " and "\t". The default delimiter is "whitespace", because most of the people I work with think of a SMILES file as more like a whitespace delimited file.

Here's an example of how the default doesn't like "vitamin a", but where the "to-eol" delimiter handles it correctly:

>>> smiles = "C1C(C)=C(C=CC(C)=CC=CC(C)=CCO)C(C)(C)C1 vitamin a\n"
>>> T.parse_id_and_molecule(smiles, "smi")
('vitamin', <rdkit.Chem.rdchem.Mol object at 0x10bbf39f0>)
>>> T.parse_id_and_molecule(smiles, "smi",
...              reader_args={"delimiter": "to-eol"})
('vitamin a', <rdkit.Chem.rdchem.Mol object at 0x10bc8c2f0>)
as well as an example of the difference between the "tab" and "to-eol" delimiter types:
>>> T.parse_id_and_molecule("C\tA B\tC\t", "smi",
...              reader_args={"delimiter": "tab"})
('A B', <rdkit.Chem.rdchem.Mol object at 0x10bc8c2f0>)
>>> T.parse_id_and_molecule("C\tA B\tC\t", "smi",
...              reader_args={"delimiter": "to-eol"})
('A B\tC\t', <rdkit.Chem.rdchem.Mol object at 0x10bbf39f0>)

It was a bit of work to make the different toolkits work the same way, and my best attempt isn't perfect. For example, if you are daft and try to interpret the record "C Q\tA" as a tab-delimited set of columns, then OEChem will see this as methane with an id of "Q" while RDKit and Open Babel will say they can't parse the SMILES "C Q".

So don't do that!

reader_args

As you saw, reader_args is a Python dictionary. All of the SMILES parsers accept the 'delimiter' argument, and the RDKit and Open Babel reader_args also support the "has_header" argument. If true, the first line of the file contains a header. (I couldn't think of a good implementation of this for OEChem.)

There are also toolkit-specific reader_args. Here I'll disable RDKit's sanity checker for SMILES, and show that it accepts a SMILES that it would otherwise reject.

>>> T.parse_molecule("c1ccccc1#N", "smistring", reader_args={"sanitize": False})
<rdkit.Chem.rdchem.Mol object at 0x104407440>
>>> T.parse_molecule("c1ccccc1#N", "smistring", reader_args={"sanitize": True})
[22:09:40] Can't kekulize mol 

Traceback (most recent call last):
  File "<stdin>", line 1, in 
  File "chemfp/rdkit_toolkit.py", line 251, in parse_molecule
    return _toolkit.parse_molecule(content, format, id_tag, reader_args, errors)
  File "chemfp/base_toolkit.py", line 986, in parse_molecule
    id_tag, reader_args, error_handler)
  File "chemfp/base_toolkit.py", line 990, in _parse_molecule_impl
    id, mol = format_config.parse_id_and_molecule(content, id_tag, reader_args, error_handler)
  File "chemfp/_rdkit_toolkit.py", line 1144, in parse_id_and_molecule
    error_handler.error("RDKit cannot parse the SMILES string %r" % (terms[0],))
  File "chemfp/io.py", line 77, in error
    raise ParseError(msg, location)
chemfp.ParseError: RDKit cannot parse the SMILES string 'c1ccccc1#N'

The SMILES parsers use different reader_args than the SDF parser. You can see the default reader_args by using the toolkit's format API:

>>> from chemfp import rdkit_toolkit
>>> rdkit_toolkit.get_format("smi").get_default_reader_args()
{'delimiter': None, 'has_header': False, 'sanitize': True}
>>> rdkit_toolkit.get_format("sdf").get_default_reader_args()
{'strictParsing': True, 'removeHs': True, 'sanitize': True}
Also, the different toolkits may use different reader_args for the same format.
>>> from chemfp import openeye_toolkit
>>> openeye_toolkit.get_format("sdf").get_default_reader_args()
{'flavor': None, 'aromaticity': None}
>>> from chemfp import openbabel_toolkit
>>> openbabel_toolkit.get_format("sdf").get_default_reader_args()
{'implementation': None, 'perceive_0d_stereo': False, 'perceive_stereo': False, 'options': None}
I'll cover more about the format API in another essay.

Namespaces

This can lead to a problem. You saw earlier how to get the correct toolkit for a given fingerprint type. Once you have the toolkit you can parse a record into a toolkit-specific molecule. But what if you want toolkit-specific and format-specific settings?

First off, parsers ignore unknown reader_arg names, and the reader_arg names for the different toolkits are different, except for "delimiter" and "has_header" where it doesn't make sense for them to be different. That means you could do:

reader_args = {
  "delimiter": "tab",       # for SMILES files
  "strictParsing": False,   # for RDKit SDF
  "perceive_stereo": True,  # for Open Babel SDF
  "aromaticity": "daylight, # for all OEChem readers
}
and have everything work.

Still, it's possible that you want OEChem to parse a SMILES using "daylight" aromaticity and an SD file using "openeye" aromaticity.

The reader_args are namespaced, so for that case you could use a format qualifier, like this:

reader_args = {
  "smi.aromaticity": "daylight",
  "can.aromaticity": "daylight",
  "usm.aromaticity": "daylight",
  "sdf.aromaticity": "openeye",
}
There's also a toolkit qualifier. In this daft example, the OpenEye reader uses the whitespace delimiter option for SMILES files, the RDKit SMILES and InChI readers use tab, and the Open Babel SMILES and InChI readers use the space character.
reader_args = {
  "openeye.*.delimiter": "whitespace",
  "rdkit.*.delimiter": "tab",
  "openbabel.*.delimiter": "space",
}
Fully qualified names, like "openeye.smi.delimiter", are also allowed.
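Here's a sketch of how a namespaced dictionary might be used. The records and settings are arbitrary; the point is that each toolkit reader only pays attention to the qualified names which apply to it:

from chemfp import rdkit_toolkit, openeye_toolkit

reader_args = {
    "rdkit.smi.delimiter": "tab",
    "openeye.smi.delimiter": "to-eol",
}

# The RDKit SMILES reader splits the record on tabs ...
rdkit_toolkit.parse_id_and_molecule("CCO\tethanol", "smi", reader_args=reader_args)

# ... while the OEChem SMILES reader takes everything after the
# first whitespace as the id.
openeye_toolkit.parse_id_and_molecule("CCO ethyl alcohol", "smi", reader_args=reader_args)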

The remaining problem is how to configure the reader_args. It's no problem to write the Python dictionary yourself, but what if you want people to pass them in as command-line arguments or in a configuration file? I'll cover that detail when I talk about the format API.

Errors

What happens if the SMILES isn't valid? I'll use my favorite invalid SMILES:

>>> T.parse_id_and_molecule("Q q-ane", "smi")
[22:01:37] SMILES Parse Error: syntax error for input: Q
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "chemfp/rdkit_toolkit.py", line 264, in parse_id_and_molecule
    return _toolkit.parse_id_and_molecule(content, format, id_tag, reader_args, errors)
  File "chemfp/base_toolkit.py", line 1014, in parse_id_and_molecule
    id_tag, reader_args, error_handler)
  File "chemfp/base_toolkit.py", line 1018, in _parse_id_and_molecule_impl
    return format_config.parse_id_and_molecule(content, id_tag, reader_args, error_handler)
  File "chemfp/_rdkit_toolkit.py", line 249, in parse_id_and_molecule
    error_handler.error("RDKit cannot parse the SMILES %r" % (smiles,))
  File "chemfp/io.py", line 77, in error
    raise ParseError(msg, location)
chemfp.ParseError: RDKit cannot parse the SMILES 'Q'
A chemfp.ParseError is also a ValueError, so you can check for this exception like this:
try:
    mol = T.parse_molecule("Q", "smistring")
except ValueError as err:
    print("Not a valid SMILES")

I think an exception is the right default action when there's an error, but various toolkits disagree. Some will return a None object in that case, and I can see the reasoning. It's sometimes easier to write things more like:

>>> mol = T.parse_molecule("Q", "smistring", errors="ignore")
[00:04:26] SMILES Parse Error: syntax error for input: Q
>>> print(mol)
None

The default errors value is "strict", which raises an exception. The "ignore" behavior is to return None when the function can't parse a SMILES.

In the above example, the "SMILES Parse Error" messages come from the underlying RDKit C++ library. It's difficult to capture this message from Python, so chemfp doesn't even try. OEChem and Open Babel have similar error messages, and chemfp doesn't try to capture those either.

Error handling control for a single record isn't so important. It's more important when parsing multiple records, where "ignore" skips a record and keeps processing and "strict" raises an exception and stops processing once a record cannot be parsed. I'll cover that in the essay where I talk about reading molecules.

Other error handling - experimental!

The errors parameter can be two other strings: "report" and "log". The "report" option writes the error information to stderr, like this:

>>> T.parse_molecule("Q", "smi", errors="report")
[03:05:52] SMILES Parse Error: syntax error for input: Q
ERROR: RDKit cannot parse the SMILES 'Q'. Skipping.
The "SMILES Parse Error", again, is generated by the RDKit C++ level going directly to stderr. The "RDKit cannot parse the SMILES" line, however, is chemfp sending information to Python's sys.stderr.

I don't think the "report" option is all that useful for this case since it's pretty much a duplicate of the underlying C++ toolkit. It does know about the id_tag, so it gives better error message for ChEBI and ChEMBL files.

The "log" option is the same as "report" except that it sends the message to the "chemfp" logger using Python's built-in logging system. I have very little experience with logging, so this is even more experimental than "report".

Finally, the "errors" parameter can take an object to use as the error handler. The idea is that the handler's error() is called when there's an error. See chemfp/io.py for how it works, but consider this to be undocumented, experiemental, and almost certain to change.

I have the idea that the error handler can have a warn() method, which would be called for warnings. The current generation of toolkits uses a global error handler or sends the message directly to stderr. In a multi-threaded system, which might parse two molecules at the same time in different threads, it's hard to know which molecule caused the error.

I haven't done this because ... it's hard using the current generation of tools to get that warning information. I'm also unsure about the error handler protocol. How does it know that it has collected all of the warnings for a given molecule? Is there a special "end_of_warnings()" method? Are the warnings accumulated and sent in one call? What if there are both warnings and errors?

That's why this subsection is titled "experimental". :)



New chemfp APIs #

In the next few essays I'll cover chemfp's new APIs. I'll have a different focus here than in the official documentation. I will be writing for people interested in chemistry toolkit API development, with a focus on the I/O subsystem in such a toolkit. I've spent about 15 years experimenting with and evaluating different API ideas across a range of use cases. This new API is a synthesis of those ideas. I hope it proves influential.

Why does chemfp have these new APIs?

Chemfp is a multi-platform Python library for cheminformatics fingerprint generation and search. It depends on OEChem/OEGraphSim, RDKit, or Open Babel to parse a structure file and generate fingerprints, and has its own highly optimized search algorithms.

Chemfp supports multiple toolkits because many of the people I know who use fingerprints use fingerprints from multiple toolkits. While this is biased by the people I know, in general I've seen that most people who use a similarity-based search as part of their science spend some time figuring out which fingerprint type, and which parameters, are most appropriate for that research.

Unfortunately, this may mean a lot of fiddling around with different toolkits, since each has a different API and a different take on the problem. My belief is that more people will use chemfp (and pay me money) if it really makes it easier to handle fingerprints across different toolkits. The hard part is that no one wants to learn yet another system unless it's significantly better. I think I've done that.

Example

Here's an example program using the new API. It reads a SMILES file, computes OpenEye's circular fingerprints (with the default settings), and saves the results to an FPS file.

import chemfp

fptype = "OpenEye-Circular"
source = "benzodiazepine.smi"
destination = "benzodiazepine.fps"

with chemfp.read_molecule_fingerprints(fptype, source) as reader:
    with chemfp.open_fingerprint_writer(destination,
                                        metadata=reader.metadata) as  writer:
        writer.write_fingerprints(reader)
I wrote that in a more formal style, using Python's "with" statement to ensure that the files are closed when no longer needed. While this gets all of the details correct, it is verbose. You might even call it excessive, since you can get the basic result using:
import chemfp

chemfp.open_fingerprint_writer("benzodiazepine.fps").write_fingerprints(
    chemfp.read_molecule_fingerprints("OpenEye-Circular", "benzodiazepine.smi"))
The only major difference between the two is that the output fingerprint file in the latter doesn't contain any metadata.

Fingerprint family and fingerprint type API

Object-oriented programming uses the words "class" and "instance". A class is the more abstract concept. It describes what's possible. An instance fills in the details. For example, "integers" is the class of integer numbers, and "38" is an instance of that class.

Chemfp-1.2 makes a similar distinction between a "fingerprint family" and a "fingerprint type". A family might be "circular fingerprints" and a type is "circular fingerprints of size 2".

Internally, chemfp has a fingerprint family registry. To get information about all of the available fingerprint families, use get_fingerprint_families(), and use get_fingerprint_family() to get a specific family by name:

>>> import chemfp
>>> chemfp.get_fingerprint_families()
[FingerprintFamily(<ChemFP-Substruct-OpenBabel/1>),
FingerprintFamily(<ChemFP-Substruct-OpenEye/1>),
FingerprintFamily(<ChemFP-Substruct-RDKit/1>),
FingerprintFamily(<OpenBabel-FP2/1>),
  ...
FingerprintFamily(<RDMACCS-OpenBabel/1>),
FingerprintFamily(<RDMACCS-OpenEye/1>),
FingerprintFamily(<RDMACCS-RDKit/1>)]
>>> family = chemfp.get_fingerprint_family("OpenEye-Circular")
>>> family
FingerprintFamily(<OpenEye-Circular/2>)
(This may take a while because the registry uses a lazy resolver which doesn't load or probe the underlying toolkits until first requested.)

You can ask a fingerprint family for its default values, and make a new fingerprint type using those defaults.

>>> fptype = family()
>>> family.get_defaults()
{'numbits': 4096, 'minradius': 0, 'atype': 4239, 'maxradius': 5, 'btype': 1}
>>> fptype
<chemfp.openeye_types.OpenEyeCircularFingerprintType_v2 object at 0x10dd0ee90>
Every fingerprint type has a type string, which is the canonical string representation of the fingerprint family name, the chemfp version number for that family, and the fingerprint type parameters.
>>> fptype.get_type()
'OpenEye-Circular/2 numbits=4096 minradius=0 maxradius=5 
 atype=Arom|AtmNum|Chiral|EqHalo|FCharge|HCount btype=Order'
(This is really one long string. I added a newline to make it easier to read.)

You aren't restricted to the defaults. I'll use 1024 bits/fingerprint and a maximum radius of 3 instead of the defaults of 4096 and 5:

>>> fptype = family(numbits=1024, maxradius=3)
>>> fptype.get_type()
'OpenEye-Circular/2 numbits=1024 minradius=0 maxradius=3
 atype=Arom|AtmNum|Chiral|EqHalo|FCharge|HCount btype=Order'

If you have a type string then you can get the fingerprint type directly using chemfp's get_fingerprint_type(). The fingerprint type string does not need to be in canonical form nor include all of the parameters.

>>> fp2type = chemfp.get_fingerprint_type("OpenBabel-FP2")
>>> fp2type.get_type()
'OpenBabel-FP2/1'
>>> 
>>> rdktype = chemfp.get_fingerprint_type("RDKit-Fingerprint/2")
>>> rdktype.get_type()
'RDKit-Fingerprint/2 minPath=1 maxPath=7 fpSize=2048 nBitsPerHash=2 useHs=1'
>>> rdktype = chemfp.get_fingerprint_type("RDKit-Fingerprint/2 fpSize=512 maxPath=5")
>>> rdktype.get_type()
'RDKit-Fingerprint/2 minPath=1 maxPath=5 fpSize=512 nBitsPerHash=2 useHs=1'

Fingerprint writer API

Chemfp has always been able to write fingerprints to an FPS file, but I thought the API was too clumsy for public use so I left it as part of the undocumented internals.

I'm much happier about the new fingerprint writer API, which is based on a writer object with a method to save a single fingerprint at a time, and another method to save multiple fingerprints. In this example I asked it to write to None, which means write to stdout, so you can see the output from each method call.

>>> writer = chemfp.open_fingerprint_writer(None)
#FPS1
>>> writer.write_fingerprint("zeros", "\0\0\0\0")
00000000        	zeros
>>> writer.write_fingerprints( [("fp1", "\0\0\0\1"), ("fp2", "\0\0\0\2")] )
00000001        	fp1
00000002        fp2
>>> writer.close()

Version 1.2 also introduces a new binary file format, the FPB format. The writer API knows how to write an FPB file, and will do so if the format ends with ".fpb" or if it is told specifically to write an FPB file.
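For example, here's a quick sketch; the filename is arbitrary:

import chemfp

# The ".fpb" extension selects the binary FPB output format.
writer = chemfp.open_fingerprint_writer("example.fpb")
writer.write_fingerprint("zeros", "\0\0\0\0")
writer.close()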

Toolkit APIs

The toolkit API is the collective name for the format API, the molecule I/O API, and the molecule API. There is one toolkit API implementation for each supported chemistry toolkit, which is basically a wrapper from the chemfp API to the underlying toolkit API. Use get_toolkit_names() to get a list of the available chemistry-based toolkits:

>>> chemfp.get_toolkit_names()
set(['openeye', 'rdkit', 'openbabel'])
>>> T = chemfp.get_toolkit("openeye")
>>> T.is_licensed()
True
Instead of get_toolkit() you can import a toolkit directly in Python:
>>> from chemfp import openeye_toolkit as T
>>> T.is_licensed()
True
>>> T.name
'openeye'

In addition to the chemistry-based toolkits, the "text" toolkit implements just enough of the toolkit API to process SMILES and SDF records. It doesn't know chemistry; it stores the contents of each record as a string. I wrote it to handle a few use cases that couldn't be handled with the molecule-centered toolkits. I'll discuss those cases in the future.

Format API

The get_formats() function lists the available formats for a given toolkit.

>>> from chemfp import rdkit_toolkit
>>> rdkit_toolkit.get_formats()
[Format('rdkit/canstring'), Format('rdkit/usmstring'),
Format('rdkit/smistring'), Format('rdkit/molfile'),
Format('rdkit/usm'), Format('rdkit/sdf'), Format('rdkit/can'),
Format('rdkit/smi'), Format('rdkit/rdbinmol')]
There are also get_input_formats() and get_output_formats() functions, or you can use attributes of the format to determine if the toolkit understands the format as an input or output, as well as a few other properties.

You can use the format API to figure out the toolkit's (or at least chemfp's) best guess for the format for a given filename. This includes compression detection.

>>> rdkit_toolkit.get_input_format_from_source("example.smi.gz")
Format('rdkit/smi.gz')

Molecule input API

This is by far the most extensive of the new APIs in chemfp-1.2. Previously there was read_molecule_fingerprints(), which took a fingerprint type name and the source and returned fingerprints for each record. Suppose though that you want to read a structure file and generate fingerprints using three RDKit fingerprint types. The older API requires reading the file three different times, which means parsing the same record three times, even though the same RDKit molecule could be reused for each.

The new API lets you separate the process of reading molecules from generating fingerprints, which can be a nice performance boost.
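Here's a sketch of that idea, reusing the benzodiazepine.smi filename from the earlier example and two of the RDKit-based fingerprint types from the registry listing above. Each record is parsed into an RDKit molecule once, and that molecule is used for both fingerprints:

import chemfp
from chemfp import rdkit_toolkit as T

path_type = chemfp.get_fingerprint_type("RDKit-Fingerprint/2")
maccs_type = chemfp.get_fingerprint_type("RDMACCS-RDKit")

with T.read_ids_and_molecules("benzodiazepine.smi") as reader:
    for id, mol in reader:
        # One parsed molecule, two fingerprints.
        path_fp = path_type.compute_fingerprint(mol)
        maccs_fp = maccs_type.compute_fingerprint(mol)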

The old API could only take input from files, or file-like sources. The new API has more extensive support for reading from a string. This is useful in a web application, where the input might be a text field from a submitted form.

In practice, there are several variations on how to read an input, depending on whether it comes from a file or a string, and whether you want to read a single record (the first one) or all records. Sometimes you really want the molecule id when reading a molecule and sometimes you don't want that overhead; I ended up implementing both of those readers. Finally, there are also cases where you want to convert a lot of records into molecules, where each record is stored in its own string. The make_id_and_molecule_parser() function creates a specialized parser for that case.

Different toolkit reader API functions

                                         ...from a file               ...from a string
  Read 0 or more molecules               read_molecules()             read_molecules_from_string()
  Read 0 or more (id, molecule) pairs    read_ids_and_molecules()     read_ids_and_molecules_from_string()
  Parse the first molecule                                            parse_molecule()
  Parse the first molecule and its id                                 parse_id_and_molecule()
  Make an optimized function to parse
  the first molecule and its id                                       make_id_and_molecule_parser()

Molecule output API

Chemfp-1.2 adds new APIs to save molecules to a structure file or convert a molecule into a record. Like the reader, there are several variations in the writer.

The open_molecule_writer() and open_molecule_writer_to_string() functions return a writer object; in the first case the writer saves to a file and in the latter it saves to memory that can be accessed as a string. The writer object has methods to save a single molecule at a time, to save multiple molecules at once, and to save molecules using alternate ids.

To demonstrate, here's a program which reads the compressed ChEBI SD file and saves it to a compressed SMILES file. I specify the id tag because ChEBI stores its identifiers in the "ChEBI ID" SD tag, and not the title:

from chemfp import openeye_toolkit as T

with T.read_ids_and_molecules("ChEBI_complete.sdf.gz", id_tag="ChEBI ID") as reader:
    with T.open_molecule_writer("chebi.smi.gz") as writer:
        writer.write_ids_and_molecules(reader)

When you only need to turn a molecule into a record, use the create_string() function. If you need to do a lot of these conversions, all with the same parameters, then use make_string_creator(). This function returns a specialized function which is better optimized to turn a molecule into a string.
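I expect it to look something like this sketch, which assumes that make_string_creator() takes the same format name (and other output settings) as create_string(), minus the molecule:

from chemfp import rdkit_toolkit as T

mol = T.parse_molecule("c1ccccc1O phenol", "smi")

# Configure the converter once (the format-name-first signature is an
# assumption), then call it for each molecule.
to_record = T.make_string_creator("smi")
record = to_record(mol)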

Molecule API

Chemfp does not try to make a common molecule API across the different toolkits. For that, see Noel O'Boyle's "cinfony". In fact, it's quite the opposite. The readers, writers, and fingerprint generation functions take native molecule types and not wrapped objects.

All chemfp offers is a small set of functions to get what the toolkit considers to be a molecule's title, to do some simple manipulation of the SD tags for a record, and to make a copy of a molecule.

Connecting fingerprint types and toolkits to make a fingerprint

If you have a fingerprint type string and a SMILES string, how do you generate the fingerprint? Each toolkit fingerprint type has a compute_fingerprint() function, which takes a toolkit molecule and returns the fingerprint string, so that's in the right direction. You still need to know which toolkit to use to convert the SMILES into the right kind of molecule.

Well, actually you don't. The fingerprint type's parse_fingerprint() will parse the structure and generate the fingerprint for you:

>>> fptype = chemfp.get_fingerprint_type("OpenBabel-FP2")
>>> fptype.parse_fingerprint("c1ccccc1O", "smi")
'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x02\x00\x00\x00\x00\x00\x00
\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00
\x00\x00\x00\x00\x00\x00\x00\x00\x00\x08\x00\x00\x00\x00\x00\x00\x02
\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x08\x00\x00\x00
\x00\x00\x00\x00\x00\x00\x00\x00\x02\x00\x00\x00\x00\x80\x00\x00\x00
\x00\x00\x00@\x08\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00
\x02\x00\x00\x00\x00\x00\x00\x00\x00\x00\x02\x00\x00\x00\x00\x00\x08
\x00\x00\x00\x00\x00\x00\x00'
This helper function works because the fingerprint type knows its toolkit, through the "toolkit" attribute. This attribute is part of the public API, so you could do the two steps yourself if you want to:
>>> fptype.toolkit
<module 'chemfp.openbabel_toolkit' from 'chemfp/openbabel_toolkit.pyc'>
>>> mol = fptype.toolkit.parse_molecule("c1ccccc1O", "smistring")
>>> fptype.compute_fingerprint(mol)
'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x02\x00\x00\x00\x00\x00\x00
\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00
\x00\x00\x00\x00\x00\x00\x00\x00\x00\x08\x00\x00\x00\x00\x00\x00\x02
\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x08\x00\x00\x00
\x00\x00\x00\x00\x00\x00\x00\x00\x02\x00\x00\x00\x00\x80\x00\x00\x00
\x00\x00\x00@\x08\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00
\x02\x00\x00\x00\x00\x00\x00\x00\x00\x00\x02\x00\x00\x00\x00\x00\x08
\x00\x00\x00\x00\x00\x00\x00'



Summary of the /etc/passwd reader API #

Next month I will be in Finland for Pycon Finland 2014. It will be my first time in Finland. I'm especially looking forward to the pre-conference sauna on Sunday evening.

My presentation, "Iterators, Generators, and Coroutines", will cover much of the same ground as my earlier essay. In that essay, I walked through the steps to make an /etc/passwd parser API which can be used like this:

from passwd_reader import read_passwd_entries

with read_passwd_entries("/etc/passwd", errors="strict") as reader:
    location = reader.location
    for entry in reader:
        print("%d:%s" % (location.lineno, entry.name))
I think the previous essay was a bit too detailed to understand the overall points, so in this essay I'll summarize what I did and why I did it. Hopefully it will also help me prepare for Finland.

The /etc/passwd parser is built around a generator, which is a pretty standard practice. Another standard approach is to build a parser object, as a class instance which implements the iterator protocol. The main difference is that the generator uses local variables in the generator's execution frame where the class approach uses instance variables.

Since the parser API can open a file, when the first parameter is a filename string instead of a file object, I want it to implement the context manager protocol and provide deterministic resource handling. If it always opened the file itself then I could use contextlib.closing() or contextlib.contextmanager() to convert an iterator into a self-closing context manager, but my read_passwd_entries reader is polymorphic in that first parameter, so I can't use a standard solution.

I instead wrapped the generator inside of a PasswdReader which implements the appropriate __enter__ and __exit__ methods.
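The wrapper looks roughly like this sketch (simplified from the essay's actual code):

class PasswdReader(object):
    def __init__(self, entries, close):
        self._entries = entries   # the underlying generator
        self._close = close       # closes the file, if the reader opened it
    def __iter__(self):
        return self._entries
    def __enter__(self):
        return self
    def __exit__(self, *exc_info):
        self._close()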

I also want the parser to track location information about the current record; in this case the line number of the current record, but in general it could include the byte position or other information about the record's provenance. I store this in a Location instance accessed via the PasswdReader's "location" attribute.

The line number is stored as a local variable in the iterator's execution frame. While this could be accessed through the generator's ".gi_frame.f_locals", the documentation says that frames are internal types whose definitions may change with future versions of the interpreter. That doesn't sound like something I want to depend upon.

Instead, I used an uncommon technique where the generator registers a callback function that the Location can use to get the line number. This function is defined inside the generator's scope, so it can access the local variables. This isn't quite as simple as I would like, because exception handling in a generator, including the StopIteration from calling a generator's close(), is a bit tricky, but it does work.
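Stripped of the /etc/passwd details and the tricky exception handling, the idea looks something like this sketch:

class Location(object):
    def __init__(self):
        self._get_lineno = None
    def register(self, get_lineno):
        self._get_lineno = get_lineno
    @property
    def lineno(self):
        return self._get_lineno()

def read_lines(infile, location):
    lineno = 0
    def get_lineno():
        # Reads the generator's own local variable on demand.
        return lineno
    location.register(get_lineno)
    for lineno, line in enumerate(infile, 1):
        yield line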

The more common technique is to rewrite the generator as a class which implements the iterator protocol, where each instance stores its state information as instance variables. It's easy to access instance variables, but it's a different sort of tricky to read and write the state information at, respectively, the start and end of each iteration step.

A good software design balances many factors, including performance and maintainability. The weights for each factor depend on the expected use cases. An unusual alternative design can be justified when it's a better match to the use cases, which I think is the case with my uncommon technique.

In most cases, API users don't want the line number of each record. For the /etc/passwd parser I think it's only useful for error reporting. More generally, it could be used to build a record index, or a syntax highlighter, but those are relatively rare needs.

The traditional class-based solution is, I think, easier to understand and implement, though it's a bit tedious to save and restore the parser state at each entry and exit point. This synchronization adds a lot of overhead to the parser, which isn't needed for the common case where that information is ignored.

By comparison, my alternative generator solution has a larger overhead - two function calls instead of an attribute lookup - to access location information, but it doesn't need the explicit save/restore for each step because those are maintained by the generator's own execution frame. I think it's a better match for my use cases.

(Hmm. I think I've gone the other way. I think few will understand this without the examples or context. So, definitely lots of code examples for Pycon Finland.)



Lausanne Cheminformatics workshop and contest #

Dietrich Rordorf from MDPI sent an announcement to the CHMINF mailing list about the upcoming 9th Workshop in Chemical Information. It will be on 12 September 2014 in Lausanne, Switzerland. It seems like it will be a nice meeting, so I thought I would pass along some information about it here. They also have a software contest, with a 2,000 CHF prize, which I think will interest some of my readers.

The workshop has been around for 10 years, so I was a bit surprised that I hadn't heard of it before. Typically between 30 and 50 people attend, which I think is a nice size. The preliminary program is structured around 20 minute presentations, including:

If you know the authors, you might recognize that one is from Strasbourg, another from London, and the rest from Switzerland. I can understand. From where I live in Sweden it will cost over US $300 to get there, and Lausanne doesn't have its own commercial airport, so I would need to fly into Geneva or Bern, while my local air hub doesn't fly to either directly.

But I live in a corner of Europe, and my constraints aren't yours.

Source code contest

I had an email conversation with Luc Patiny about an interesting aspect of the workshop. They are running a contest to identify the best open source cheminformatics tool of the year, with a prize of 2000 CHF. That's 1650 EUR, 2200 USD, 1300 GBP, or 15000 SEK, which is more than enough for someone in Europe, or even the US, to be able to travel there! They have a time slot set aside for the winner of the contest to present the work. The main requirement is that contest participants are selected from submissions (1-2 pages long) to the open access journal Challenges. (And no, there are no journal page fees for this contest, so it doesn't seem like a sneaky revenue generating trick.)

The other requirement is that the submission be "open source". I put that in quotes because much of my conversation with Luc was to understand what they mean. They want people to be able to download the (unobfuscated) software source code for no cost and be able to read and review it to gain insight.

I think this is very much in line with classical peer review thought, even though it can include software which is neither open source nor free software. For example, software submissions for this contest could be "for research purposes only" or "not for use in nuclear reactors", or "redistributions of modified versions are not permitted."

Instead, I think their definition is more in line with what Microsoft terms "shared source".

In my case, my latest version of chemfp is 'commercial open source', meaning that those who pay me money for it get a copy of it under the MIT open source license. It's both "free software" and "open source", but it's not eligible for this workshop because it costs money to download it.

But I live in a corner of open source, and my constraints aren't yours. ;) If you have a free software project, open source software project, or shared source software project, then you might be interested in submitting it to this workshop and journal. If you win, think of it as an all-expenses paid trip to Switzerland. If you don't win, think of it as a free publication.



Copyright © 2001-2013 Andrew Dalke Scientific AB