Dalke Scientific Software: More science. Less time. Products
[ previous | newer ]     /home/writings/diary/archive/2008/06/27/generating_fingerprints_with_openbabel

Generating molecular fingerprints with OpenBabel

This is part 2 of an on-going series dealing with molecular fingerprints:

  1. Molecular fingerprints, background
  2. Generating molecular fingerprints with OpenBabel
  3. Computing Tanimoto scores, quickly
  4. Block Tanimoto calculations
  5. Faster block Tanimoto calculations
  6. HAKMEM 169 and other popcount implementations
  7. Testing the bitslice algorithm for popcount
  8. 2011 update: Faster population counts
Comments?


I am not yet going to discuss implementation details of how to generate molecular fingerprints. I'll assume you have access to a program which generates them for you. This might be a commercial program like from Daylight or from Mesa Analytics. For those following along from home (or don't have license for those, or want to dig into the step-by-step details of how to handle one of those program), install OpenBabel. It implements three different fingerprint types:

Why are there two different file formats?

The FP3 and FP4 fingerprints are substructure fingerprints based on SMARTS patterns listed in text files. Sadly, while I think the two files should be in the same format, they aren't. The first looks like:

[+]     1
[-]     2
[#6][CX3](=O)   3 aldehyde or ketone
[CX3H1](=O)[#6] 4 aldehyde
[#6][CX3](=O)[#6]       5 ketone
while the second looks like:
# I.1.1 Alkanes:

Primary_carbon: [CX4H3][#6]

Secondary_carbon: [CX4H2]([#6])[#6]

Tertiary_carbon: [CX4H1]([#6])([#6])[#6]

Quartary_carbon: [CX4]([#6])([#6])([#6])[#6]

In looking through the source I found this comment in the method "ReadPatternFile" in the "fingerprints/finger3.cpp".

// Christian Laggner's format: SMARTS at end of line
This is the code which parses the FP3 and FP4 formats. (Yes, even though it's named 'finger3.cpp' it also parses the FP4 format.) Specially this comment comes from the section which figures out the specific file format. My most likely conjecture is that the FP3 format existed then someone decided to add the FP4 definitions, using SMARTS patterns that someone else defined.

There's a tradeoff here: should code be changed to fit data or data changed to fit code? How malleable is code, and how much should you change someone else's data? In this case the implementer decided the easiest solution was to keep the original format unchanged; perhaps because it would be easier to import updates in the future. I would have reformated the file to fit the expected format, perhaps keeping the conversion program around to handle any updates.

Is it better to implement more features but haphazardly, or fewer features but more cleanly? I'm biased much futher towards the second than most people seem to be, and I think that the combined solution of "more features and cleanly" is generally posssible. Perhaps scientific software is worse than average because most people's explicit goal is to get some scientific project done, and not to support the needs of unknown others in the future. Or because most scientists aren't trained programmers? Or perhaps I have an unrealistic expectation of how green the grass might be on the other side of the fence.

Generating fingerprints with Babel

Babel generates two types of fingerprint formats. One is an index file used to store precomputed fingerprints when doing a substructure search. It's a binary file format that's a bit too complex to talk about here. The other is, I'm told, more of a debugging aid that saves the fingerprints as hexadecimal strings. It's much easier to show and understand so I'll use it.

I happen to have a small set of pharmacologically popular compounds that I got from JJ some years back. It was for an example data set he put together while at Daylight.

N12CCC36C1CC(C(C2)=CCOC4CC5=O)C4C3N5c7ccccc76 Strychnine
c1ccccc1C(=O)OC2CC(N3C)CCC3C2C(=O)OC cocaine
COc1cc2c(ccnc2cc1)C(O)C4CC(CC3)C(C=C)CN34 quinine
OC(=O)C1CN(C)C2CC3=CCNc(ccc4)c3c4C2=C1 lyseric acid
CCN(CC)C(=O)C1CN(C)C2CC3=CNc(ccc4)c3c4C2=C1 LSD
C123C5C(O)C=CC2C(N(C)CC1)Cc(ccc4O)c3c4O5 morphine
C123C5C(OC(=O)C)C=CC2C(N(C)CC1)Cc(ccc4OC(=O)C)c3c4O5 heroin
c1ncccc1C1CCCN1C nicotine
CN1C(=O)N(C)C(=O)C(N(C)C=N2)=C12 caffeine
C1C(C)=C(C=CC(C)=CC=CC(C)=CCO)C(C)(C)C1 vitamin a
I'll convert that to the text fingerprint format ".fpt" file using FP2-style fingerprints. Here I'll specify "-xfFP2" but that's optional and if I left it out I would still get the FP2 fingerprints. The "-xh" option says to include the hex fingerprints in the output file:
% babel drugs.smi drugs.fpt -xh -xfFP2
10 molecules converted
192 audit log messages 
% cat drugs.fpt
>Strychnine   183 bits set 
00054062 81009600 00100102 81010700 200c0000 00202850 
03102000 28000404 10200882 001849e4 483c0024 50039002 
1402801a 01b90000 00020540 010c100a 22a4c000 82000300 
2ac02018 00002201 60102120 183c9630 2100a080 81500019 
00041401 80008c00 00484810 90001080 040c8202 0006081b 
24020080 a2042610 
>cocaine   Tanimoto from Strychnine = 0.353234
00010000 01000800 00100100 00010700 000c0050 00204010 
00000000 20000000 00000100 010008c0 00200000 50019000 
1c00801a 01890000 00000000 02081008 00a20000 02000000 
01c02010 00000201 00000020 18020a30 01000080 00500014 
00001401 80002000 00088000 10001000 00000200 0006000e 
c0020000 02002200 
>quinine   Tanimoto from Strychnine = 0.330317
00000002 09000808 2010410a 40010600 000c1080 0000c000 
020c0000 20000400 00400980 001808c8 08300004 50018000 
98000806 00e00000 01080441 000c000c 10000000 02000000 
00612000 00020201 40000100 18208038 23800000 81500010 
00002000 80000000 00520000 90000000 04000202 00060002 
64060000 00002620 
>lyseric acid   Tanimoto from Strychnine = 0.401487
00084212 29008208 41901b02 00010600 22010024 00002040 
830800c0 a400040b 104008c0 285828d8 08300204 71c19008 
08029102 08311800 00063446 420c034a 20248600 02080140 
0a80201a 00800203 44900504 58200a20 2102c0c0 81400010 
01241001 c3002801 00486804 d0011280 04488002 00060c1a 
04000080 22006600 
>LSD   Tanimoto from Strychnine = 0.332046
00000012 2910cc00 01900302 41910610 00041100 00000000 
800820c0 04000403 004109c0 0019096a 08300004 51019000 
98028902 00610800 00040042 400e024c 20000600 02000144 
00222412 00000201 40800500 1a308608 20808040 91400018 
01243000 80000801 0848284c d0000000 04404000 00260c13 
24040200 02012620 
>morphine   Tanimoto from Strychnine = 0.303571
04001000 eb000228 20100d0a 80310700 00181061 c2000002 
02081000 38000643 00004890 001818ca 08300308 51818040 
00824002 21a10084 112c0425 120ca018 08800200 26008060 
0e492010 0002a201 00000100 18210010 0500e040 89540410 
00201400 c0002401 04524200 90800000 04000302 0007cc0a 
64020000 00012f40 
>heroin   Tanimoto from Strychnine = 0.317406
04001000 eb000228 20100d0a 80310700 00181061 c2000802 
020c1000 38000643 00006890 001818ca 48300308 51818040 
0082c202 21a10084 112d0425 120cb018 08820200 26008060 
2e492012 0002a201 00000120 18210a10 1500e040 89540410 
02201400 c0002401 04524200 90800024 04000302 0007cc0a 
64020000 10812f50 
>nicotine   Tanimoto from Strychnine = 0.220096
10000002 0100c000 00104102 00800610 00041100 00000010 
00006000 02001000 00500800 000800c0 01300004 40010000 
00000802 00200800 00000400 00040004 00020000 02000004 
00200400 00000201 40000100 18008800 20800000 00500010 
00040000 40000000 08000000 90000000 00400100 00060002 
20000000 80000600 
>caffeine   Tanimoto from Strychnine = 0.124444
0000002a 00801800 00080000 02010700 00080040 0000c000 
00008000 00000000 06400100 800000a0 0c00800c 00000000 
00009000 00400000 00280080 000c000c 00800000 00088000 
00400000 00000000 80000100 10000040 21800041 00300004 
00040000 c0200000 60000000 00040000 00004000 00460000 
c0000000 00000280 
>vitamin a   Tanimoto from Strychnine = 0.1
00000400 01040000 00000000 00000000 40000000 00000001 
00000000 00000402 00000000 00100000 00000000 40000000 
00020000 00020000 00000400 0008000c 00080000 00000000 
00002090 02000001 00000000 00000000 00408000 00000012 
00201000 00200000 00404000 80000000 00000020 01000000 
24820000 00000000

Reverse engineering the format

The format isn't described anywhere but it's easy to reverse engineer. Each input compound has an output record. The first compound is special. I'll call it the query compound and the result the query report. The query report lists the compound name, the number of bits set in it, and the fingerprint itself. I'll call the others the target compounds and the target reports. Every target report lists the compound name, the Tanimoto score between the target fingerprint and the query fingerprint, and the target fingerprint.

If you experiment some with this some more (or read the documentation) you'll occasionally see target reports which look like:

>Compound-589   Tanimoto from 1 = 0.409639
Possible superstructure of 1
00000002 00000028 20000308 00590200 40001000 04000000 
82040080 02000402 01800800 00180842 08100000 40009000 
01028002 00000000 41000413 00000008 00000400 02000020 
0021a000 00020001 00000101 08000280 00008021 00400810 
01200000 80000000 00522000 c0000000 00400200 00020002 
04002000 00010400 
The "Possible superstructure" line means that every on bit in the query fingerprint is also on in the target fingerprint. This can be useful when doing substructure searches. A subgraph search is slower than comparing bits in a fingerprint. With the right type of fingerprint (like the CACTVS substructure keys; note that not all fingerprints work for this) then if a bit is set in the query but not in the target fingerprint then there's no way that the target can be a superset of the query, and there's no need to do the more complicated subgraph isomorphism search.

Back to the output. I was a bit surprised to see how much extra work was being done. I only expected the fingerprints and not the Tanimoto nor the possible superstructure test. These aren't hard to do, especially given the input parsing and molecular perception that had to be done, but it wasn't something I expected.

Probably there's history which explains why this happened. My theory is that the code started as a quick and easy way to do single compound tanimoto searches of a structure file, without making an index or other intermediate file. After time someone wanted to get the actual fingerprint values and figured the easiest solution was to add an option to include that in the output.

The output for each record contains a fingerprint with 5 lines of 6 groups[1], plus 2 other groups in each fingerprint. A group has 8 hex values and each hex value is 4 bits, so there are (5*6+2)*8*4 = 1024 bits in each fingerprint. The documentation (and the code, in fingerprints/finger2.cpp:CalcHash) says there are only 1021 bits set. Therefore my awsome math skill tell me there must be 3 bits in the output representation which are always 0. Which 3? How are the abstract fingerprints represented in memory? How are they represented in the output?

Endianness

Most people these days deal with Intel processors and don't have to worry about different endian machines. It's something you should at least know about in case you want to write portable software or to understand the chain of operations from abstract fingerprint bit set to actual bit set in hardware.

Endianness can refer to bit endianness and to byte endianness. Yes, it's complicated, but important enough that I'll go into some details.

First, bit endianness. How do you think about fingerprints? Most likely your model is of the bits all in a row. Is the first bit (bit 0) on the left of the row or the right? I think of fingerprints as an array, with 0 being the left-most digit. This is a little-endian view, and in a 64 bit fingerprint it looks like:

  0         1         2         3         4         5         6   
  0123456789012345678901234567890123456789012345678901234567890123
while if you think of fingerprints as a huge number in base-2, with the 1s digit on the right, then the 10s digit, then the 100s, etc. then you are a big-endian person.
  6  6         5         4         3         2         1         0
  3210987654321098765432109876543210987654321098765432109876543210
Both are equally valid, which is why historically there were such religious wars over which was best.

How does OpenBabel represent the fingerprint? There are two forms, one for in-memory use and the other when it writes to this text. (The index file probably has its own format, but I haven't looked into that yet.) Here's the relevant OpenBabel code:

  void OBFingerprint::SetBit(vector<unsigned int>& vec, const unsigned int n)
  {
    vec[n/Getbitsperint()] |= (1 << (n % Getbitsperint()));
  }
The data is stored in a "vector<unsigned int> vec". Bit 0 is in vec[0]:0, bit 1 is in vec[0]:1, bit 32 is in vec[1]:0, and so on. Arrays are laid out in little endian order, so the 8 bits of vec[0] come before the 8 bits of vec[1], which are before those of vec[2].

Here's where byte enddianness comes into play. Intel CPUs use little-endian byte order in a word, although in each byte the bits are in big endian order. (I think.) That is, in a 32 bit integer the actual bit order is:

   byte 0    byte 1    byte 2    byte 3
  00000000  11111100  22221111  33222222
  76543210  54321098  32109876  10987654
Little endian is sometimes refered to as "1234" order because the smallest byte goes first, then the the 2nd smallest, then the 3rd, then the 4th. The equivalent notation for big endian is "4321".

Does my machine do what I think it does?

I said that my machine is little-endian and that it arranges data in a certain way. As Reagan would say, "trust, but verify. How do I verify that my machine does what I think it does? The easiest way is to write a block of memory to disk then look at the bytes. I'm a Python programmer but I'm going to do this in C because then I'm more certain I know exactly what the progam is doing.

% cat endian.c
#include <stdio.h>
#include <stdlib.h>
int main() {
  FILE *f;
  unsigned int data[] = {0, 1, 1<<8, 1<<16, 1<<24, 1<<30};
  if (!(f = fopen("endian.dat", "wb"))) exit(1);
  fwrite(data, sizeof(unsigned int), 6, f);
  fclose(f);
}
% cc endian.c
% ./a.out 
% od -t xC endian.dat
0000000    00  00  00  00  01  00  00  00  00  01  00  00  00  00  01  00
0000020    00  00  00  01  00  00  00  40                                
0000030
% 

The "od" program (short for "octal dump") has various options to describe how to convert the bytes into something readable. In this case I'm displaying each character as a hex value, where "20" is the hex value for the binary value "00100000".

The "00 00 00 00" is the value 0. The "01 00 00 00" is the 1, and you can see that the 1 went into the first byte, as expected on a little endian machine. The remainder of the test cases show that all bytes are in the expected order.

How OpenBabel saves in-memory fingerprints to disk

I've figured out the in-memory representation, but that's not the same as the fingerprint format babel generates with the "-xh" option. I could reverse engineer it, because there's only a few sane ways to write a fingerprint, but since I've got the code I might as well use it. The conversion is done in formats/fingerprintformat.cpp in WriteMolecule:

    if(hexoutput)
      {
        for(i=fptvec.size()-1;i>=0;i--)
          {
            ofs << hex << setfill('0') << setw(8) << fptvec[i] << " " ;
            if((fptvec.size()-i)%6==0)
              ofs <<endl;
          }
        ofs << dec << endl;
      }
This appears to scramble the order up even more, but actually converts the entire fingerprint into big-endian format at the bit level! I said that arrays are laid out in little endian order, with vec[0] first. The for-loop here goes in reverse order, so it the output will be in big endian order at the word level.

Then because it's using C++ stream formatting to convert each unsigned int into hex words of at least size 8 (note that there will be a format change on a system where unsigned int takes 64 bits). The hex representation of an unsigned int is bitwise big endian, so the entire fingerprint is in big endian order.

Perhaps this helps a bit. Given a 64 bit fingerprint, here are some different representations:

  == fingerprint index (the way I think of them) ==

 00000000 00111111 11112222 22222233  33333333 44444444 44555555 55556666
 01234567 89012345 67890123 45678901  23456789 01234567 89012345 67890123

  == bits (in memory, Intel little-endian order) ==
                word 0                             word 1
  byte 0   byte 1   byte 2   byte 3    byte 4   byte 5   byte 6   byte 7
[00000000 11111100 22221111 33222222][33333333 44444444 55555544 66665555]
[76543210 54321098 32109876 10987654][98765432 76543210 54321098 32109876]

  == bits (on disk, big endian order in hex) ==
  byte 0   byte 1   byte 2   byte 3    byte 4   byte 5   byte 6   byte 7
 66665555 55555444 44444444 33333333  33222222 22221111 11111100 00000000
 32109876 54321098 76543210 98765432  10987654 32109876 54321098 76543210
I'm not sure that helped, but perhaps it did. Notice how the end result has the bits in reverse order from what they are in abstract space? Isn't that cool? I wonder if it was designed to be that way.

Does OpenBabel do what I think it does?

Again, how can I verify that the on-disk fingerprint representation is what I think it is? I generated a bunch of FP2 fingerprint earlier, and I know 3 bits are never set. By looking at the code, which does a "%1021" (modulo 1021), those are going to be bits 1021, 1022, and 1023, which means the first byte of the hex fingerprint can only be 0 or 1, because the first three bits will always be 0. Inspecting the generated fingerprints, nicotine is the only compound which has a 1 there, and all others have a 0. I might be right, but 10 compounds aren't enough to be comfortable that that observation is meaningful. For example, note that the last hex digit in 10 cases was 0, which is compatible with a little-endian output format.

I need more data. I've got a subset of the venerable NCI data set hanging around as an SD file so I converted that into "nci_subset.fpt" and wrote a program to extract all of the hex fingerprints from it. That was really easy. Python supports arbitrary length hex integers, in big-endian order, so when I get a fingerprint like

08000002 00000800 00000102 c0010200 00000000 00008000 
10000440 01000001 00400000 00080840 08000004 40008000 
00008000 00200000 00000040 10080000 00000200 02001000 
00102000 00000000 40000000 08020240 20084000 00400014 
00840000 a0000001 00000000 40000000 02400080 00060002 
80000000 00000400 
I can convert that into an integer with the "int()" function, where the "16" means the string is in hex/base 16.
value = int(
  "080000020000080000000102c00102000000000000008000" +
  "100004400100000100400000000808400800000440008000" +
  "000080000020000000000040100800000000020002001000" +
  "001020000000000040000000080202402008400000400014" +
  "00840000a000000100000000400000000240008000060002" +
  "8000000000000400", 16)
In case you are wondering, the decimal value of that is
56177911301563671064843091046153118492063251553909
52326950831995905374647824383919181380532552815799
66155722769584340043150756553786638472560752986690
67699947332386097831452711135062918243615986194710
86993414964749368593524593930193460639231942458691
20471870584312582697063998279693941606655598014346
1925888
I'll accumulate all of these values by doing a successive bitwise-'or' of each number. This works because all numbers are internally represented in binary. By doing a bitwise-or a given bit will be set if any of the input fingerprints was set in that position.

Before doing that, another example might help. First, some moral support. If you're jumping into this from chemistry, with no experience in binary representation or byte order, then you're having to learn a different way of thinking about numbers. I hope this level of detail helps by pointing out how things work step-by-step. While those who have a strong CS background probably skipped this whole section.

I'll bitwise-union a few numbers (74, 35, and 76), as expressed in binary, and show that that result is equal to the binary number which has 1 where at least one of the original numbers had a 1. Here I'm using the interactive prompt in Python. I call "int()" with "2" to tell it that the string is in base-2 format.

>>> union_bits = 0
>>> union_bits = union_bits | int("1001010", 2)
>>> union_bits = union_bits | int("0100011", 2)
>>> union_bits = union_bits | int("1001100", 2)
>>> union_bits
111
>>> int("1101111", 2)
111
>>> hex(union_bits)
'0x6f'
>>> 

Code to find the union of all the fingerprints

Once that's made sense, here's the Python code I used to generate the bitwise-union of all of the fingerprints in the NCI subset:

#filename = "drugs.fpt"
filename = "nci_subset.fpt"

# Accumulate the union of all the fingerprint bits
union_bits = 0

# Extract only the hex value for each compound fingerprint
# This assumes the fingerprint file is small enough to fit into memory!
# (The parser was easier to write this way.)
for block in open(filename).read().split("\n>"):
    header, remainder = block.split("\n", 1)
    if remainder.startswith("Possible superstructure"):
        _, remainder = remainder.split("\n", 1)

    # Remove spaces and newlines to get only a hex number
    fp_str = remainder.replace(" ", "").replace("\n", "")
    # Python handles 1024 bit long integers without a problem
    fp = int(fp_str, 16)

    # Merge all the one bits together
    union_bits |= fp

# This should start "0x1" if the first 3 bits are 0
print hex(union_bits)
print len(hex(union_bits)[2:-1]), "bytes"

The output I got was, with some folding of the very long line to make it fit into this text:

0x1fffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff
  ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff
  ffffffffffffffffffffffffffffffdfffffffffffffffffffffffffffffffff
  ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffL
256 bytes
Yes indeed, the first 3 bits are never 1, though there is one bit (where the 'd' is) which is also never 1. Since bits 1021, 1022 and 1023 of the fingerprint should always be 0, this is pretty conclusive evidence that the hex fingerprints in Babel's "fpt" file format are written in big-endian order. The left-most bit of the written fingerprint is number 1023 and the right-most bit is number 0.

Checking the substructure fingerprint order

Anyone who knows me knows I like sweating the details. I've been checking the FP2 fingerprints, but perhaps the FP3 and FP4 fingerprints use a different ordering. Those are feature fingerprints based on SMARTS patterns. The FP3 definition in "patterns.txt" starts:

[+]     1
[-]     2
[#6][CX3](=O)   3 aldehyde or ketone
[CX3H1](=O)[#6] 4 aldehyde
I can easily construct molecules which make those bits be set. For simplicity I'll only only force the first then the second to be set.
% cat carbon.smi
[C+] C-plus
[C-] C-minus
% babel carbon.smi carbon.fpt -xh -xfFP3
2 molecules converted
27 audit log messages 
% cat carbon.fpt
>C-plus   1 bits set 
00000000 00000001 
>C-minus   Tanimoto from C-plus = 0
00000000 00000002 
% 
Cool, it's consistent with what I expected! Always a nice thing.

Is knowing the bit position useful?

In most cases this level of detail isn't important. If you are using hash fingerprints then the bit doesn't really tell you much. It's more important if you are using a feature fingerprint because each bit can be mapped back to a chemically understandable property. Suppose you do some sort of clustering or rule based learning on the bits and find that bit 592 is the most divisive bit. A chemist might want to know what that bit "means", so being able to get from the bit back to the rule is useful.

It's also important if you want interoperability between different tools. If one tool generates bits in a different order than another then they aren't going to work well together.

OpenBabel does support a "-xs" option to explain which bits are set, and a "-xu" to explain which bits are not set. Only one of these is valid, and if used then "-xf" does not work. Grr. On top of that, the [+] and [-] definitions don't have a human readable comment, so I can't even show you what the output looks like for this example. I've filed a few bug reports against the fingerprint generation code in Babel. (Since I wrote this these bugs have been fixed in version control.)

The point of the previous expositions was to make sure you and I know where fingerprint data comes from and how it looks like at the bit level. The next step is to support a Tanimoto search of a set of fingerprints. Stay tuned to this bat channel!

Changelog

[1] Edited 2010-12-03 to fix a counting mistake pointed out by Cyrille Menye.


Andrew Dalke is an independent consultant focusing on software development for computational chemistry and biology. Need contract programming, help, or training? Contact me



Copyright © 2001-2013 Andrew Dalke Scientific AB