Dalke Scientific Software: More science. Less time. Products
[ previous | newer ]     /home/writings/diary/archive/2014/11/27/maccs_in_rdkit_and_open_babel

MACCS in RDKit and Open Babel

In this essay I'll compare the RDKit and Open Babel MACCS implementations at a somewhat coarse level. The main goal of this will be to show how some of the new chemfp-1.2 features make it easier to do cross-toolkit fingerprint analysis.

Fingerprint type API

It's hard to make things easy. I want to be able to generate a fingerprint given the fingerprint type string and a structure record as a string. Here are some examples of fingerprint type strings:

RDKit-Fingerprint fpSize=1024
OpenEye-Path numbits=1024 maxbonds=6
OpenBabel-FP2
You can look at the names and figure out that they use different cheminformatics toolkits, each with its own API for parsing a chemical record and generating a fingerprint. That's complicated.

Chemfp solves it in the usual way, with an extra layer of indirection. There's a common fingerprint type API, with back-end implementations for OEChem, Open Babel, and RDKit. Here's how it looks in action:

import chemfp

smiles = "c1ccccc1O"

for typestring in (
     # Iterate through fingerprints from three different toolkits
     "RDKit-Fingerprint fpSize=1024",
     "OpenEye-Path numbits=1024 maxbonds=6",
     "OpenBabel-FP2"):
    
    # Use the fingerprint type string to get a FingerprintType object
    fptype = chemfp.get_fingerprint_type(typestring)
    
    # The fingerprint type can parse a SMILES string and return a fingerprint
    fp = fptype.parse_molecule_fingerprint(smiles, "smistring")
    
    # The fingerprint type also has a way to get the canonical type string.
    print fptype.get_type()
    
    # Display the fingerprint in hex, and wrapped over several lines
    hex_fp = fp.encode("hex")
    for i in range(0, len(hex_fp), 60):
        print hex_fp[i:i+60]
    print
This generates the following output:
RDKit-Fingerprint/2 minPath=1 maxPath=7 fpSize=1024 nBitsPerHash=2 useHs=1
040000000000100000000000800000000000000002040000040041000000
000080000000400000400004080800000001000000000008000000208020
000000000000000000000800000000000100000000112000000000000000
040000000001000000010000000220040200020008000000000200004085
0000000000000000

OpenEye-Path/2 numbits=1024 minbonds=0 maxbonds=6 atype=Arom|AtmNum|Chiral|EqHalo|FCharge|HvyDeg|Hyb btype=Order|Chiral
200000000000000000000000008090000000000000000000800010000000
00800020000000100200000200008042000000000a000000000000100000
000000000400020000000001000000000000000000000000000004000010
00000000200802000000004008010000100000000000008a002184000000
2000000000000004

OpenBabel-FP2/1
000000000000000000000200000000000000000000000000000000000000
000000000000000000000000000800000000000002000000000000000000
000000000800000000000000000000000200000000800000000000004008
000000000000000000000000000200000000000000000002000000000020
0800000000000000

Use parse_molecule() to parse the record once

How does RDKit's fingerprint density change as a function of its length? It's not hard using the above example, though instead of creating a type string containing the size information this time I'll pass in the fingerprint parameters as the second argument to get_fingerprint_type().

import chemfp
from chemfp import bitops

smiles = "CC(=O)Oc1ccccc1C(=O)O"

print "fpSize popcount density"

for size in (256, 512, 1024, 2048, 4096):
    fptype = chemfp.get_fingerprint_type("RDKit-Fingerprint", {"fpSize": size})
    # The next line converts the SMILES to a molecule each time through the loop.
    fp = fptype.parse_molecule_fingerprint(smiles, "smistring")
    popcount = bitops.byte_popcount(fp)
    print "%6d   %4d    %.3f" % (size, popcount, float(popcount)/size)
For the curious, the output is:
fpSize popcount density
   256    208    0.812
   512    271    0.529
  1024    319    0.312
  2048    354    0.173
  4096    375    0.092

If you look at the code for a bit you'll see that it parses the same SMILES string 5 times. This seems a bit excessive. Instead, I would like to parse the SMILES string once, to get an RDKit molecule, and pass that molecule to the different fingerprint types. I can do this directly with the RDKit API, as in:

from rdkit import Chem
rdmol = Chem.MolFromSmiles(smiles)
I instead created a common cross-toolkit API, where "rdkit_toolkit", "openeye_toolkit", and "openbabel_toolkit", share the same toolkit API. Here I use chemfp.rdkit_toolkit to make a molecule:
from chemfp import rdkit_toolkit
rdmol = rdkit_toolkit.parse_molecule(smiles, "smistring")
and the same code would work with the other toolkits.

At this point it looks more complicated, or at least longer, to call the chemfp toolkit API than use RDKit's native API. I hope to convince you in a bit why it's useful.

I'll rewrite the density code to use the toolkit API and parse the SMILES string only once. This time I call the compute_fingerprint() method of the fingerprint type to compute the fingerprint given a toolkit molecule.

import chemfp
from chemfp import bitops
from chemfp import rdkit_toolkit

smiles = "CC(=O)Oc1ccccc1C(=O)O"

print "fpSize popcount density"

# Parse the SMILES once, into an RDKit molecule
rdmol = rdkit_toolkit.parse_molecule(smiles, "smistring")

for size in (256, 512, 1024, 2048, 4096):
    fptype = chemfp.get_fingerprint_type("RDKit-Fingerprint", {"fpSize": size})
    fp = fptype.compute_fingerprint(rdmol)
    popcount = bitops.byte_popcount(fp)
    print "%6d   %4d    %.3f" % (size, popcount, float(popcount)/size)
As expected, the output doesn't change, so I won't show it again.

How to use the right toolkit for a given fingerprint type.

The above code worked because I knew to use the rdkit_toolkit for the "RDKit-Fingerprint" type. What if I want to compare the fingerprint density for different sizes and different fingerprint types, including ones from other toolkits? I could keep track of them manually, but in chemfp each fingerprint type knows its toolkit, which makes things easier.

import chemfp
from chemfp import bitops

smiles = "CC(=O)Oc1ccccc1C(=O)O"

print "fpSize popcount density"

for typename, sizefield in (
        ("RDKit-Fingerprint", "fpSize"),
        ("RDKit-Morgan radius=1", "fpSize"),
        ("RDKit-Morgan radius=2", "fpSize"),
        ("RDKit-Morgan radius=3", "fpSize"),
        ("OpenEye-Path", "numbits"),
        ("OpenEye-Circular maxradius=1", "numbits"),
        ("OpenEye-Circular maxradius=2", "numbits"),
        ("OpenEye-Circular maxradius=3", "numbits")):

    # Each fingerprint type knows its toolkit.
    # Use that to create the toolkit-appropriate molecule
    fptype = chemfp.get_fingerprint_type(typename)
    
    toolkit = fptype.toolkit
    mol = toolkit.parse_molecule(smiles, "smistring")

    print "="*23, typename
    for size in (256, 512, 1024, 2048, 4096):
        fptype = chemfp.get_fingerprint_type(typename, {sizefield: size})
        fp = fptype.compute_fingerprint(mol)
        popcount = bitops.byte_popcount(fp)
        print "%6d   %4d    %.3f" % (size, popcount, float(popcount)/size)
The output of this is:
fpSize popcount density
======================= RDKit-Fingerprint
   256    208    0.812
   512    271    0.529
  1024    319    0.312
  2048    354    0.173
  4096    375    0.092
======================= RDKit-Morgan radius=1
   256     16    0.062
   512     16    0.031
  1024     16    0.016
  2048     16    0.008
  4096     17    0.004
======================= RDKit-Morgan radius=2
   256     24    0.094
   512     24    0.047
  1024     24    0.023
  2048     24    0.012
  4096     25    0.006
======================= RDKit-Morgan radius=3
   256     31    0.121
   512     31    0.061
  1024     31    0.030
  2048     31    0.015
  4096     32    0.008
======================= OpenEye-Path
   256    124    0.484
   512    142    0.277
  1024    153    0.149
  2048    161    0.079
  4096    164    0.040
======================= OpenEye-Circular maxradius=1
   256     15    0.059
   512     16    0.031
  1024     16    0.016
  2048     16    0.008
  4096     16    0.004
======================= OpenEye-Circular maxradius=2
   256     23    0.090
   512     24    0.047
  1024     24    0.023
  2048     24    0.012
  4096     24    0.006
======================= OpenEye-Circular maxradius=3
   256     29    0.113
   512     31    0.061
  1024     31    0.030
  2048     31    0.015
  4096     31    0.008
This new code once again parses the SMILES more often than it needs to. I only need to parse the string once for all of the OpenEye toolkit fingerprint calculations and once for RDKit's. I'll make a little cache where the key is "toolkit.name" (which is the string "openeye" or "rdkit" or "openbabel") and the value is the correct toolkit molecule. The following is a truncated of the result, since the other details aren't so important and get in the way.
# The cache key is the toolkit name
mol_cache = {}
for typename, sizefield in (
        ("RDKit-Fingerprint", "fpSize"),
              # ... removed many lines
        ("OpenEye-Circular maxradius=3", "numbits")):

    # Each fingerprint type knows its toolkit.
    # Use that to create the toolkit-appropriate molecule
    fptype = chemfp.get_fingerprint_type(typename)
    toolkit = fptype.toolkit
    if toolkit.name in mol_cache:
        mol = mol_cache[toolkit.name]
    else:
        mol = mol_cache[toolkit.name] = toolkit.parse_molecule(smiles, "smistring")

    # ... compute and display the fingerprints

Use a fingerprint type to read a structrue file

Now for something a bit different. Every wondered if the RDKit and Open Babel MACCS implementations give the same answers? They should be close, since they use the same SMARTS definitions, but there will likely be differences in chemistry. It's very easy to generate the MACCS fingerprints given a single fingerprint type. The fingerprint type has a "read_molecule_fingerprints()" method which takes the structure filename and returns the (id, fp) pairs. Here's what that looks like when I analyze a copy of ChEBI:

import chemfp
from chemfp.bitops import byte_popcount

# Choose one or the other, depending on your preference.
fptype = chemfp.get_fingerprint_type("RDKit-MACCS166")
#fptype = chemfp.get_fingerprint_type("OpenBabel-MACCS")

filename = "/Users/dalke/databases/ChEBI_lite.sdf.gz"
total_popcount = 0
with fptype.read_molecule_fingerprints(filename, errors="ignore") as reader:
    for recno, (id, fp) in enumerate(reader, 1):
        total_popcount += byte_popcount(fp)

print "#records %d  total popcount: %d  average popcount: %.1f  average density: %.3f" % (
    recno, total_popcount, total_popcount/float(recno), total_popcount/float(recno)/fptype.num_bits)
This gives the output for RDKit of:
#records 18928  total popcount: 637448  average popcount: 33.7  average density: 0.203
When I run it for Open Babel's MACCS I get:
#records 19640  total popcount: 649935  average popcount: 33.1  average density: 0.199
These are close, but not identical. It's not clear why there's a difference. Part of the reason is surely because RDKit and Open Babel aren't able to process all of the records. It's also likely that they have differences in aromaticity. But there may be other reasons.

It may just that the records that RDKit can't handle vs. those that Open Babel can handle, so the real analysis should only include the ones that both can handle.

Read SD records using the text_toolkit

I need some way to only analyze the records that both RDKit and Open Babel can understand. I can't use the fingerprint type's own parser API because it's difficult to keep the two parsers synchronized; I don't know if the RDKit skipped a record or the Open Bable one did.

Chemfp includes a "text_toolkit" module, with functions to parse individual SMILES and SDF records from a the corresponding files. I can use that to get the SDF records as a string, then use parse_molecule_fingerprint() to get the fingerprints from that string. The functions will raise a chemfp.ParseError exception (which is also a ValueError) if there was a problem. If there is an exception, I'll skip that record.

The code to read SDF records can be as simple as:

for id, sdf_record in text_toolkit.read_sdf_ids_and_records(filename):
    ... sdf_record contains the record as a string ...
though in practice I'll use a context-manager so the file is closed automatically. Here's the full code:
import chemfp
from chemfp import text_toolkit
from chemfp.bitops import byte_popcount, byte_intersect_popcount

rdkit_maccs_fptype = chemfp.get_fingerprint_type("RDKit-MACCS166")
openbabel_maccs_fptype =chemfp.get_fingerprint_type("OpenBabel-MACCS")
assert rdkit_maccs_fptype.num_bits == openbabel_maccs_fptype.num_bits == 166

filename = "/Users/dalke/databases/ChEBI_lite.sdf.gz"

total_rdkit_popcount = total_openbabel_popcount = 0
only_rdkit_popcount = only_openbabel_popcount = 0
num_skipped = 0

with text_toolkit.read_sdf_ids_and_records(filename) as sdf_reader:
    for recno, (id, sdf_record) in enumerate(sdf_reader, 1):
        # Convert the record into a fingerprint. If there is a problem,
        # keep track of the skip count and go on to the next record.
        try:
            rdkit_fp = rdkit_maccs_fptype.parse_molecule_fingerprint(sdf_record, "sdf")
            openbabel_fp = openbabel_maccs_fptype.parse_molecule_fingerprint(sdf_record, "sdf")
        except chemfp.ParseError:
            num_skipped += 1
            continue
        # Figure out the number of bits in each fingerprint and number of bits in common
        rdkit_popcount = byte_popcount(rdkit_fp)
        openbabel_popcount = byte_popcount(openbabel_fp)
        intersect_popcount = byte_intersect_popcount(rdkit_fp, openbabel_fp)

        # Keep track of overall statistics for each molecule.
        total_rdkit_popcount += rdkit_popcount
        total_openbabel_popcount += openbabel_popcount
        only_rdkit_popcount += rdkit_popcount - intersect_popcount
        only_openbabel_popcount += openbabel_popcount - intersect_popcount
        
# Report the results
print "#records %d  #skipped %d" % (recno, num_skipped)
num_valid_records = recno - num_skipped
print "RDKit popcount: %d  RDKit only: %d average popcount: %.1f  average density: %.3f" % (
    total_rdkit_popcount, only_rdkit_popcount, total_rdkit_popcount/float(num_valid_records),
    total_rdkit_popcount/float(num_valid_records)/166)
print "Open Babel popcount: %d  OB only: %d average popcount: %.1f  average density: %.3f" % (
    total_openbabel_popcount, only_openbabel_popcount, total_openbabel_popcount/float(num_valid_records),
    total_openbabel_popcount/float(num_valid_records)/166)
#records 19640  #skipped 712
RDKit popcount: 637448  RDKit only: 9034 average popcount: 33.7  average density: 0.203
Open Babel popcount: 629173  OB only: 759 average popcount: 33.2  average density: 0.200

That's interesting. RDKit has a slightly higher popcount than Open Babel, and both have unique bits set which aren't in the other. Which bits are different? For that I'll use a decoder function which returns the list "on" bit positions in a fingerprint:

>>> from chemfp import encodings
>>> encodings.to_on_bitlist("\0\0\1\xf1")
[16, 24, 28, 29, 30, 31]
>>> 
(Remember that chemfp's uses a mixed endian encoding, where bytes are little endian and bits in a byte are big endian.)

The following will find where the two MACCS implementations have different fingerprints, compare which bits are set in one implementation which aren't set in the other, and report a few of the records which are different. (ChEBI stores the record identifier in the "ChEBI ID" tag instead of the title, so I use the 'id_tag' to extract the actual record id.)

import chemfp
from chemfp import text_toolkit
from chemfp.bitops import byte_popcount, byte_intersect_popcount
from chemfp.encodings import to_on_bitlist

from collections import defaultdict

rdkit_maccs_fptype = chemfp.get_fingerprint_type("RDKit-MACCS166")
openbabel_maccs_fptype = chemfp.get_fingerprint_type("OpenBabel-MACCS")
assert rdkit_maccs_fptype.num_bits == openbabel_maccs_fptype.num_bits == 166

filename = "/Users/dalke/databases/ChEBI_lite.sdf.gz"

extra_rdkit = defaultdict(list)
extra_openbabel = defaultdict(list)

with text_toolkit.read_sdf_ids_and_records(filename, id_tag="ChEBI ID") as sdf_reader:
    for recno, (id, sdf_record) in enumerate(sdf_reader, 1):
        try:
            rdkit_fp = rdkit_maccs_fptype.parse_molecule_fingerprint(sdf_record, "sdf")
            openbabel_fp = openbabel_maccs_fptype.parse_molecule_fingerprint(sdf_record, "sdf")
        except chemfp.ParseError:
            continue
        if rdkit_fp == openbabel_fp:
            continue
        in_rdkit = set(to_on_bitlist(rdkit_fp))
        in_openbabel = set(to_on_bitlist(openbabel_fp))

        # Record the identifier for each bit in one fingerprint which is not in the other
        for bitno in (in_rdkit - in_openbabel):
            extra_rdkit[bitno].append(id)
        for bitno in (in_openbabel - in_rdkit):
            extra_openbabel[bitno].append(id)
            
        
print "Extra in RDKit:", sum(len(ids) for ids in extra_rdkit.values())
for bitno, ids in sorted(extra_rdkit.items()):
    examples = repr(ids[:5])[1:-1]
    if len(ids) > 5:
        examples += " ..."
    print "%3d: %d extra = %s" % (bitno, len(ids), examples)

print
print "Extra in Open Babel:", sum(len(ids) for ids in extra_openbabel.values())
for bitno, ids in sorted(extra_openbabel.items()):
    examples = repr(ids[:5])[1:-1]
    if len(ids) > 5:
        examples += " ..."
    print "%3d: %d extra = %s" % (bitno, len(ids), examples)

The results is a bit lengthy - good thing I don't report all of the identifiers for each one!

Extra in RDKit: 9034
  2: 31 extra = 'CHEBI:30444', 'CHEBI:33313', 'CHEBI:37340', 'CHEBI:37341', 'CHEBI:37342' ...
 20: 4 extra = 'CHEBI:27581', 'CHEBI:51461', 'CHEBI:51467', 'CHEBI:51474'
 23: 12 extra = 'CHEBI:15724', 'CHEBI:17735', 'CHEBI:3611', 'CHEBI:3612', 'CHEBI:29811' ...
 42: 13 extra = 'CHEBI:21143', 'CHEBI:33633', 'CHEBI:33722', 'CHEBI:33723', 'CHEBI:33724' ...
 43: 3036 extra = 'CHEBI:7963', 'CHEBI:11230', 'CHEBI:11714', 'CHEBI:12194', 'CHEBI:15431' ...
 44: 34 extra = 'CHEBI:27581', 'CHEBI:32713', 'CHEBI:32715', 'CHEBI:32724', 'CHEBI:32726' ...
 48: 29 extra = 'CHEBI:15378', 'CHEBI:15724', 'CHEBI:17045', 'CHEBI:17735', 'CHEBI:3611' ...
 49: 10 extra = 'CHEBI:23066', 'CHEBI:32713', 'CHEBI:32715', 'CHEBI:32724', 'CHEBI:32726' ...
 52: 12 extra = 'CHEBI:21143', 'CHEBI:31196', 'CHEBI:33722', 'CHEBI:33723', 'CHEBI:33724' ...
 53: 1 extra = 'CHEBI:33633'
 58: 12 extra = 'CHEBI:228275', 'CHEBI:51895', 'CHEBI:52273', 'CHEBI:52298', 'CHEBI:52299' ...
 64: 4 extra = 'CHEBI:57255', 'CHEBI:38600', 'CHEBI:57718', 'CHEBI:58571'
 67: 6 extra = 'CHEBI:33130', 'CHEBI:29229', 'CHEBI:31197', 'CHEBI:33633', 'CHEBI:52699' ...
 68: 33 extra = 'CHEBI:47402', 'CHEBI:49464', 'CHEBI:21143', 'CHEBI:29230', 'CHEBI:29275' ...
 73: 2 extra = 'CHEBI:51730', 'CHEBI:51736'
 75: 7 extra = 'CHEBI:27581', 'CHEBI:32713', 'CHEBI:32715', 'CHEBI:32724', 'CHEBI:32726' ...
 77: 6 extra = 'CHEBI:38010', 'CHEBI:51595', 'CHEBI:52729', 'CHEBI:52748', 'CHEBI:52749' ...
 83: 29 extra = 'CHEBI:27899', 'CHEBI:21549', 'CHEBI:29275', 'CHEBI:30027', 'CHEBI:30312' ...
 85: 2 extra = 'CHEBI:51730', 'CHEBI:51736'
 89: 5 extra = 'CHEBI:31197', 'CHEBI:32715', 'CHEBI:32726', 'CHEBI:32734', 'CHEBI:58416'
 90: 7 extra = 'CHEBI:32713', 'CHEBI:32715', 'CHEBI:32724', 'CHEBI:32726', 'CHEBI:32729' ...
 92: 2 extra = 'CHEBI:51730', 'CHEBI:51736'
 98: 8 extra = 'CHEBI:27581', 'CHEBI:32713', 'CHEBI:32715', 'CHEBI:32724', 'CHEBI:32726' ...
100: 1 extra = 'CHEBI:60153'
107: 2 extra = 'CHEBI:23066', 'CHEBI:51736'
112: 33 extra = 'CHEBI:15342', 'CHEBI:52781', 'CHEBI:57255', 'CHEBI:33826', 'CHEBI:33831' ...
113: 1 extra = 'CHEBI:51736'
114: 5 extra = 'CHEBI:23066', 'CHEBI:51736', 'CHEBI:55429', 'CHEBI:55504', 'CHEBI:55506'
115: 2 extra = 'CHEBI:51736', 'CHEBI:55504'
117: 1 extra = 'CHEBI:51736'
118: 6 extra = 'CHEBI:38010', 'CHEBI:51595', 'CHEBI:52729', 'CHEBI:52748', 'CHEBI:52749' ...
121: 1 extra = 'CHEBI:60153'
124: 4500 extra = 'CHEBI:1734', 'CHEBI:2303', 'CHEBI:2914', 'CHEBI:3013', 'CHEBI:3048' ...
125: 3 extra = 'CHEBI:41981', 'CHEBI:29374', 'CHEBI:29375'
127: 1 extra = 'CHEBI:51736'
128: 1 extra = 'CHEBI:51736'
130: 18 extra = 'CHEBI:21143', 'CHEBI:29229', 'CHEBI:32715', 'CHEBI:32726', 'CHEBI:32734' ...
134: 86 extra = 'CHEBI:15852', 'CHEBI:15982', 'CHEBI:16048', 'CHEBI:16304', 'CHEBI:16379' ...
137: 1 extra = 'CHEBI:51736'
140: 3 extra = 'CHEBI:51730', 'CHEBI:51736', 'CHEBI:55429'
143: 63 extra = 'CHEBI:15342', 'CHEBI:15955', 'CHEBI:16379', 'CHEBI:16848', 'CHEBI:17679' ...
146: 1 extra = 'CHEBI:51736'
147: 1 extra = 'CHEBI:60153'
148: 5 extra = 'CHEBI:23066', 'CHEBI:51730', 'CHEBI:51736', 'CHEBI:55429', 'CHEBI:55504'
150: 2 extra = 'CHEBI:29275', 'CHEBI:58416'
152: 1 extra = 'CHEBI:51736'
156: 17 extra = 'CHEBI:52781', 'CHEBI:51222', 'CHEBI:51224', 'CHEBI:51225', 'CHEBI:51899' ...
157: 12 extra = 'CHEBI:16531', 'CHEBI:17675', 'CHEBI:50375', 'CHEBI:2311', 'CHEBI:11573' ...
159: 7 extra = 'CHEBI:23066', 'CHEBI:36771', 'CHEBI:48079', 'CHEBI:51730', 'CHEBI:51736' ...
161: 21 extra = 'CHEBI:57255', 'CHEBI:30858', 'CHEBI:33134', 'CHEBI:33668', 'CHEBI:33826' ...
165: 934 extra = 'CHEBI:3243', 'CHEBI:3385', 'CHEBI:3395', 'CHEBI:3567', 'CHEBI:3637' ...

Extra in Open Babel: 759
  1: 1 extra = 'CHEBI:33397'
  2: 11 extra = 'CHEBI:49920', 'CHEBI:30437', 'CHEBI:30439', 'CHEBI:30440', 'CHEBI:30442' ...
 25: 58 extra = 'CHEBI:15430', 'CHEBI:15436', 'CHEBI:18023', 'CHEBI:27484', 'CHEBI:27609' ...
 42: 7 extra = 'CHEBI:29779', 'CHEBI:30163', 'CHEBI:32060', 'CHEBI:35836', 'CHEBI:36143' ...
 44: 62 extra = 'CHEBI:15430', 'CHEBI:15436', 'CHEBI:18023', 'CHEBI:27484', 'CHEBI:27609' ...
 49: 39 extra = 'CHEBI:15430', 'CHEBI:15436', 'CHEBI:18023', 'CHEBI:27484', 'CHEBI:27609' ...
 52: 5 extra = 'CHEBI:9016', 'CHEBI:22634', 'CHEBI:32060', 'CHEBI:50953', 'CHEBI:50958'
 53: 9 extra = 'CHEBI:22634', 'CHEBI:30003', 'CHEBI:30004', 'CHEBI:30598', 'CHEBI:32743' ...
 62: 11 extra = 'CHEBI:15724', 'CHEBI:17735', 'CHEBI:3611', 'CHEBI:3612', 'CHEBI:29811' ...
 64: 34 extra = 'CHEBI:15852', 'CHEBI:15982', 'CHEBI:16304', 'CHEBI:16379', 'CHEBI:17439' ...
 67: 24 extra = 'CHEBI:9016', 'CHEBI:22634', 'CHEBI:29278', 'CHEBI:30101', 'CHEBI:30104' ...
 68: 37 extra = 'CHEBI:16144', 'CHEBI:44951', 'CHEBI:9016', 'CHEBI:22634', 'CHEBI:29278' ...
 75: 62 extra = 'CHEBI:15430', 'CHEBI:15436', 'CHEBI:18023', 'CHEBI:27484', 'CHEBI:27609' ...
 77: 67 extra = 'CHEBI:15430', 'CHEBI:15436', 'CHEBI:18023', 'CHEBI:27484', 'CHEBI:27609' ...
 81: 9 extra = 'CHEBI:22634', 'CHEBI:30003', 'CHEBI:30004', 'CHEBI:32060', 'CHEBI:32743' ...
 83: 18 extra = 'CHEBI:161680', 'CHEBI:9016', 'CHEBI:29278', 'CHEBI:29422', 'CHEBI:30104' ...
 89: 1 extra = 'CHEBI:37179'
 90: 2 extra = 'CHEBI:32060', 'CHEBI:37178'
 98: 58 extra = 'CHEBI:15436', 'CHEBI:18023', 'CHEBI:27484', 'CHEBI:27609', 'CHEBI:28421' ...
103: 2 extra = 'CHEBI:36143', 'CHEBI:37179'
105: 1 extra = 'CHEBI:55504'
112: 17 extra = 'CHEBI:40071', 'CHEBI:595389', 'CHEBI:8247', 'CHEBI:32014', 'CHEBI:36603' ...
118: 76 extra = 'CHEBI:15430', 'CHEBI:15436', 'CHEBI:15724', 'CHEBI:17735', 'CHEBI:18023' ...
130: 32 extra = 'CHEBI:29278', 'CHEBI:29779', 'CHEBI:30003', 'CHEBI:30004', 'CHEBI:30101' ...
134: 4 extra = 'CHEBI:34921', 'CHEBI:51597', 'CHEBI:53328', 'CHEBI:60153'
135: 27 extra = 'CHEBI:49706', 'CHEBI:29221', 'CHEBI:29270', 'CHEBI:29271', 'CHEBI:29420' ...
143: 10 extra = 'CHEBI:5835', 'CHEBI:33089', 'CHEBI:36618', 'CHEBI:36645', 'CHEBI:51962' ...
150: 3 extra = 'CHEBI:29240', 'CHEBI:29278', 'CHEBI:30227'
156: 2 extra = 'CHEBI:51204', 'CHEBI:53327'
157: 62 extra = 'CHEBI:15430', 'CHEBI:15436', 'CHEBI:18023', 'CHEBI:27484', 'CHEBI:27609' ...
159: 4 extra = 'CHEBI:29360', 'CHEBI:29434', 'CHEBI:29437', 'CHEBI:29440'
161: 4 extra = 'CHEBI:36603', 'CHEBI:36618', 'CHEBI:36645', 'CHEBI:60153'

Why are there big differences?

MACCS key 44

The second biggest difference is bit 43, where RDKit has 3036 extra matches and Open Babel has 0 extra matches. This corresponds to MACCS key 44, which is the "OTHER". The difference is due to timing. In last month's essay I tracked down the definition for OTHER, which had been missing for Open Babel, RDKit, and CDK. I reported it to the repsective projects, and the updated definition is now part of the main code base.

RDKit had a new release between last month and now, which I'm using. Open Babel has not. I'll demonstrate by asking the chemfp toolkit API for information about the underlying toolkit version:
>>> from chemfp import rdkit_toolkit, openbabel_toolkit
>>> rdkit_toolkit.software
'RDKit/2014.09.1'
>>> openbabel_toolkit.software
'OpenBabel/2.3.2'

MACCS key 125

The biggest difference is bit 124/key 125. RDKit has 4500 matches that weren't found in Open Babel, while Open Babel has no extra matches. This key is defined as "Aromatic Ring > 1".

This definition cannot be defined by a SMARTS pattern. RDKit has special code to handle that case. Open Babel does not. The current Open Babel code assumes that a fingerprint can be defined by a set of SMARTS patterns, with one SMARTS per bit/key, and there's no provision to indicate an alternative.

As a result, Open Babel will always set a 0 for that bit.

Open Babel MACCS bit count distribution

This got me to wonder about the number of records set by each bit, so I wrote a short program for that.

import chemfp
from chemfp.bitops import byte_contains_bit
from chemfp.encodings import to_on_bitlist

fptype = chemfp.get_fingerprint_type("OpenBabel-MACCS")

filename = "/Users/dalke/databases/ChEBI_lite.sdf.gz"

# Initialize a list of bit counts to 0.
counts = [0 for i in range(fptype.num_bits)]

with fptype.read_molecule_fingerprints(filename) as fp_reader:
    for (id, fp) in fp_reader:
        # Keep track of the total bit count
        for bitno in to_on_bitlist(fp):
            counts[bitno] += 1

# Report the result
for bitno, count in enumerate(counts):
    print "%3d: %6d" % (bitno, count)
Here is the Open Babel MACCS bit use distribution:
  0:      0
  1:      2
  2:    405
  3:     31
  4:     14
  5:     44
  6:    209
  7:    407
  8:    306
  9:    125
 10:    452
 11:    160
 12:    115
 13:    144
 14:     35
 15:    232
 16:    131
 17:    204
 18:    622
 19:    102
 20:     97
 21:    438
 22:    192
 23:    666
 24:    687
 25:    870
 26:    200
 27:    242
 28:   2654
 29:    422
 30:    402
 31:    153
 32:    214
 33:    701
 34:    311
 35:    730
 36:    740
 37:   1508
 38:    706
 39:    753
 40:    275
 41:    583
 42:   2069
 43:      0
 44:    665
 45:    226
 46:    590
 47:   3519
 48:   5666
 49:   2429
 50:    621
 51:    452
 52:   5505
 53:   6337
 54:    916
 55:    248
 56:   4981
 57:   1013
 58:    551
 59:    969
 60:   1012
 61:   2398
 62:    335
 63:    565
 64:   3477
 65:   3222
 66:   1315
 67:    211
 68:   3119
 69:    490
 70:    754
 71:   6860
 72:   1111
 73:   3312
 74:   2734
 75:   3010
 76:   2988
 77:   1245
 78:   2880
 79:   3308
 80:   1801
 81:   5228
 82:   4477
 83:   3579
 84:   3668
 85:   2435
 86:   1060
 87:   2839
 88:   7561
 89:   8478
 90:   7778
 91:   4308
 92:   2861
 93:   1967
 94:   6416
 95:   5504
 96:   5064
 97:   6145
 98:   4558
 99:   4120
100:   5355
101:   4878
102:   1340
103:   7468
104:   5827
105:   5600
106:   1540
107:   4975
108:   5852
109:   4748
110:   5774
111:   7248
112:   3699
113:   2494
114:   5190
115:   5297
116:   4970
117:   7356
118:   1667
119:   5755
120:   5254
121:   4356
122:   9127
123:   6298
124:      0
125:   6236
126:   8584
127:   6237
128:   6218
129:   4972
130:  10503
131:   9283
132:   4075
133:   2230
134:   2722
135:   7841
136:   8606
137:   5036
138:  11348
139:   9332
140:   3954
141:   6373
142:   8584
143:   4266
144:   6504
145:  11568
146:   7385
147:   7141
148:   6480
149:   8560
150:   7651
151:  10272
152:   9338
153:  11775
154:  11310
155:   9060
156:  13650
157:   9125
158:  14337
159:  10031
160:  10299
161:   8148
162:  11358
163:  16302
164:  12939
165:      0
You can see that bits 0, 43, 124, and 165 (or keys 1, 44, 125, and 166) are not set. I've already discussed 44 and 125. Key 1 is "Isotope" and key 166 is "Fragment", neither of which have a SMARTS definition. (I believe the SMARTS definitions should be "[!*0]" and "(*).(*)", but I'll need to test if those are sufficiently cross-toolkit ... Nope, RDKit doesn't seem to support component-level SMARTS.)

RDMACCS - a more cross-toolkit MACCS implementation

Chemfp includes the "RDMACCS" fingerprints, which is my attempt at a cross-toolkit implementation of the MACCS keys. It starts with the same SMARTS definitions as RDKit (though I haven't yet added key 44) and includes toolkit-specific implementations for the keys which can't be expressed in SMARTS.

When I redo the same comparisons using the following fingerprint types:

rdkit_maccs_fptype = chemfp.get_fingerprint_type("RDMACCS-RDKit")
openbabel_maccs_fptype = chemfp.get_fingerprint_type("RDMACCS-OpenBabel")
I get answers which are much closer together
#records 19640  #skipped 712
RDKit popcount: 634388  RDKit only: 528 average popcount: 33.5  average density: 0.202
Open Babel popcount: 634698  OB only: 838 average popcount: 33.5  average density: 0.202
I haven't yet done the research to figure out the source of all of the differences, but from work I did in 2011 on Dealing with SSSR I know that aromaticity perception is likely an important factor.

I'll leave you with a bitwise comparison so you can get a sense of the differences for yourself:

Extra in RDKit: 528
 20: 4 extra = 'CHEBI:27581', 'CHEBI:51461', 'CHEBI:51467', 'CHEBI:51474'
 23: 12 extra = 'CHEBI:15724', 'CHEBI:17735', 'CHEBI:3611', 'CHEBI:3612', 'CHEBI:29811' ...
 42: 13 extra = 'CHEBI:21143', 'CHEBI:33633', 'CHEBI:33722', 'CHEBI:33723', 'CHEBI:33724' ...
 44: 34 extra = 'CHEBI:27581', 'CHEBI:32713', 'CHEBI:32715', 'CHEBI:32724', 'CHEBI:32726' ...
 48: 18 extra = 'CHEBI:15724', 'CHEBI:17045', 'CHEBI:17735', 'CHEBI:3611', 'CHEBI:3612' ...
 49: 10 extra = 'CHEBI:23066', 'CHEBI:32713', 'CHEBI:32715', 'CHEBI:32724', 'CHEBI:32726' ...
 52: 12 extra = 'CHEBI:21143', 'CHEBI:31196', 'CHEBI:33722', 'CHEBI:33723', 'CHEBI:33724' ...
 53: 1 extra = 'CHEBI:33633'
 58: 12 extra = 'CHEBI:228275', 'CHEBI:51895', 'CHEBI:52273', 'CHEBI:52298', 'CHEBI:52299' ...
 64: 4 extra = 'CHEBI:57255', 'CHEBI:38600', 'CHEBI:57718', 'CHEBI:58571'
 67: 6 extra = 'CHEBI:33130', 'CHEBI:29229', 'CHEBI:31197', 'CHEBI:33633', 'CHEBI:52699' ...
 68: 33 extra = 'CHEBI:47402', 'CHEBI:49464', 'CHEBI:21143', 'CHEBI:29230', 'CHEBI:29275' ...
 73: 2 extra = 'CHEBI:51730', 'CHEBI:51736'
 75: 7 extra = 'CHEBI:27581', 'CHEBI:32713', 'CHEBI:32715', 'CHEBI:32724', 'CHEBI:32726' ...
 77: 6 extra = 'CHEBI:38010', 'CHEBI:51595', 'CHEBI:52729', 'CHEBI:52748', 'CHEBI:52749' ...
 83: 29 extra = 'CHEBI:27899', 'CHEBI:21549', 'CHEBI:29275', 'CHEBI:30027', 'CHEBI:30312' ...
 85: 2 extra = 'CHEBI:51730', 'CHEBI:51736'
 89: 5 extra = 'CHEBI:31197', 'CHEBI:32715', 'CHEBI:32726', 'CHEBI:32734', 'CHEBI:58416'
 90: 7 extra = 'CHEBI:32713', 'CHEBI:32715', 'CHEBI:32724', 'CHEBI:32726', 'CHEBI:32729' ...
 92: 2 extra = 'CHEBI:51730', 'CHEBI:51736'
 98: 8 extra = 'CHEBI:27581', 'CHEBI:32713', 'CHEBI:32715', 'CHEBI:32724', 'CHEBI:32726' ...
107: 2 extra = 'CHEBI:23066', 'CHEBI:51736'
112: 33 extra = 'CHEBI:15342', 'CHEBI:52781', 'CHEBI:57255', 'CHEBI:33826', 'CHEBI:33831' ...
113: 1 extra = 'CHEBI:51736'
114: 5 extra = 'CHEBI:23066', 'CHEBI:51736', 'CHEBI:55429', 'CHEBI:55504', 'CHEBI:55506'
115: 2 extra = 'CHEBI:51736', 'CHEBI:55504'
117: 1 extra = 'CHEBI:51736'
118: 6 extra = 'CHEBI:38010', 'CHEBI:51595', 'CHEBI:52729', 'CHEBI:52748', 'CHEBI:52749' ...
124: 12 extra = 'CHEBI:18023', 'CHEBI:35895', 'CHEBI:36303', 'CHEBI:37372', 'CHEBI:52583' ...
127: 1 extra = 'CHEBI:51736'
128: 1 extra = 'CHEBI:51736'
130: 18 extra = 'CHEBI:21143', 'CHEBI:29229', 'CHEBI:32715', 'CHEBI:32726', 'CHEBI:32734' ...
134: 86 extra = 'CHEBI:15852', 'CHEBI:15982', 'CHEBI:16048', 'CHEBI:16304', 'CHEBI:16379' ...
137: 1 extra = 'CHEBI:51736'
140: 3 extra = 'CHEBI:51730', 'CHEBI:51736', 'CHEBI:55429'
143: 63 extra = 'CHEBI:15342', 'CHEBI:15955', 'CHEBI:16379', 'CHEBI:16848', 'CHEBI:17679' ...
146: 1 extra = 'CHEBI:51736'
148: 5 extra = 'CHEBI:23066', 'CHEBI:51730', 'CHEBI:51736', 'CHEBI:55429', 'CHEBI:55504'
150: 2 extra = 'CHEBI:29275', 'CHEBI:58416'
152: 1 extra = 'CHEBI:51736'
156: 17 extra = 'CHEBI:52781', 'CHEBI:51222', 'CHEBI:51224', 'CHEBI:51225', 'CHEBI:51899' ...
157: 12 extra = 'CHEBI:16531', 'CHEBI:17675', 'CHEBI:50375', 'CHEBI:2311', 'CHEBI:11573' ...
159: 7 extra = 'CHEBI:23066', 'CHEBI:36771', 'CHEBI:48079', 'CHEBI:51730', 'CHEBI:51736' ...
161: 21 extra = 'CHEBI:57255', 'CHEBI:30858', 'CHEBI:33134', 'CHEBI:33668', 'CHEBI:33826' ...

Extra in Open Babel: 838
 25: 58 extra = 'CHEBI:15430', 'CHEBI:15436', 'CHEBI:18023', 'CHEBI:27484', 'CHEBI:27609' ...
 42: 7 extra = 'CHEBI:29779', 'CHEBI:30163', 'CHEBI:32060', 'CHEBI:35836', 'CHEBI:36143' ...
 44: 62 extra = 'CHEBI:15430', 'CHEBI:15436', 'CHEBI:18023', 'CHEBI:27484', 'CHEBI:27609' ...
 48: 4 extra = 'CHEBI:29344', 'CHEBI:29940', 'CHEBI:30147', 'CHEBI:30148'
 49: 39 extra = 'CHEBI:15430', 'CHEBI:15436', 'CHEBI:18023', 'CHEBI:27484', 'CHEBI:27609' ...
 52: 5 extra = 'CHEBI:9016', 'CHEBI:22634', 'CHEBI:32060', 'CHEBI:50953', 'CHEBI:50958'
 53: 9 extra = 'CHEBI:22634', 'CHEBI:30003', 'CHEBI:30004', 'CHEBI:30598', 'CHEBI:32743' ...
 62: 11 extra = 'CHEBI:15724', 'CHEBI:17735', 'CHEBI:3611', 'CHEBI:3612', 'CHEBI:29811' ...
 64: 34 extra = 'CHEBI:15852', 'CHEBI:15982', 'CHEBI:16304', 'CHEBI:16379', 'CHEBI:17439' ...
 67: 24 extra = 'CHEBI:9016', 'CHEBI:22634', 'CHEBI:29278', 'CHEBI:30101', 'CHEBI:30104' ...
 68: 37 extra = 'CHEBI:16144', 'CHEBI:44951', 'CHEBI:9016', 'CHEBI:22634', 'CHEBI:29278' ...
 75: 62 extra = 'CHEBI:15430', 'CHEBI:15436', 'CHEBI:18023', 'CHEBI:27484', 'CHEBI:27609' ...
 77: 67 extra = 'CHEBI:15430', 'CHEBI:15436', 'CHEBI:18023', 'CHEBI:27484', 'CHEBI:27609' ...
 81: 9 extra = 'CHEBI:22634', 'CHEBI:30003', 'CHEBI:30004', 'CHEBI:32060', 'CHEBI:32743' ...
 83: 17 extra = 'CHEBI:161680', 'CHEBI:9016', 'CHEBI:29278', 'CHEBI:30104', 'CHEBI:30227' ...
 89: 1 extra = 'CHEBI:37179'
 90: 2 extra = 'CHEBI:32060', 'CHEBI:37178'
 98: 58 extra = 'CHEBI:15436', 'CHEBI:18023', 'CHEBI:27484', 'CHEBI:27609', 'CHEBI:28421' ...
103: 2 extra = 'CHEBI:36143', 'CHEBI:37179'
105: 1 extra = 'CHEBI:55504'
112: 17 extra = 'CHEBI:40071', 'CHEBI:595389', 'CHEBI:8247', 'CHEBI:32014', 'CHEBI:36603' ...
118: 76 extra = 'CHEBI:15430', 'CHEBI:15436', 'CHEBI:15724', 'CHEBI:17735', 'CHEBI:18023' ...
124: 88 extra = 'CHEBI:15852', 'CHEBI:15955', 'CHEBI:15982', 'CHEBI:16304', 'CHEBI:16379' ...
130: 32 extra = 'CHEBI:29278', 'CHEBI:29779', 'CHEBI:30003', 'CHEBI:30004', 'CHEBI:30101' ...
134: 4 extra = 'CHEBI:34921', 'CHEBI:51597', 'CHEBI:53328', 'CHEBI:60153'
135: 27 extra = 'CHEBI:49706', 'CHEBI:29221', 'CHEBI:29270', 'CHEBI:29271', 'CHEBI:29420' ...
143: 10 extra = 'CHEBI:5835', 'CHEBI:33089', 'CHEBI:36618', 'CHEBI:36645', 'CHEBI:51962' ...
150: 3 extra = 'CHEBI:29240', 'CHEBI:29278', 'CHEBI:30227'
156: 2 extra = 'CHEBI:51204', 'CHEBI:53327'
157: 62 extra = 'CHEBI:15430', 'CHEBI:15436', 'CHEBI:18023', 'CHEBI:27484', 'CHEBI:27609' ...
159: 4 extra = 'CHEBI:29360', 'CHEBI:29434', 'CHEBI:29437', 'CHEBI:29440'
161: 4 extra = 'CHEBI:36603', 'CHEBI:36618', 'CHEBI:36645', 'CHEBI:60153'
As far as I'm aware, no one has done an extensive analysis of the differences between the available MACCS implementations. Are you up for the challenge?


Andrew Dalke is an independent consultant focusing on software development for computational chemistry and biology. Need contract programming, help, or training? Contact me



Copyright © 2001-2013 Andrew Dalke Scientific AB