
Re: "Gzip for Molecular Similarity"

In catching up on the world of chemical informatics blogs, I see that Noel commented on my post last month about SMILES parsing and compression. He also links to commentary by Rajarshi on that paper.

I started to add a comment there, but it grew, and the markup support on Blogger/Blogspot is limited, so I moved it here. Imagine the rest of this essay in the context of "adding a comment to Rajarshi's post."

My response to Rajarshi's commentary on Melville et al.

I also tried out the Melville et al. algorithm, though I didn't compare the results to an existing similarity algorithm or result. I was trying to get an idea of when it would fail, and what the implementation would be like.

The paper says

It is easy to see from eq 2 that the form of the NCD makes it impossible for negative distances to occur, as file sizes are always positive and compressing two files together will never make the output smaller than the compressed size of one file.
This isn't correct. (Eq 2 is the normalized compression distance, NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C(s) is the compressed size of s.) A compressor can produce smaller output from a longer input when the added text merely extends an existing repeat. For example,
% echo -n "CCO.CCO" | gzip -f | wc -c
      27
% echo -n "CCO.CCO.CCO" | gzip -f | wc -c
      26

  - or without a "." separator -

% echo -n "CCOCCO" | gzip -f | wc -c
      26
% echo -n "CCOCCOCCO" | gzip -f | wc -c
      25
However, failures like that will be rare in real compounds.
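The same check can be made from Python without shelling out. Here's a minimal sketch using the zlib module (the byte counts differ from gzip's by a constant wrapper overhead, and the exact values depend on the zlib build and compression level, so I print rather than quote them):

import zlib

# Longer inputs can still compress smaller when the added text just
# extends an existing repeat, mirroring the shell examples above.
for s in (b"CCO.CCO", b"CCO.CCO.CCO", b"CCOCCO", b"CCOCCOCCO"):
    print("%s -> %d bytes" % (s, len(zlib.compress(s))))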

I took the first 9999 compounds in the NCI SMILES strings distributed with OpenBabel and tested all of the triangle inequalities. After computing the 49,985,001 pairwise similarities and running the 999,700,029,999 (that is, 9999^3) three-way tests, I found 29 violations. In other words, the zlib/gzip approach rarely violates the triangle inequality ... and computers are amazingly fast these days.
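The brute-force check needs nothing fancier than precomputing the pairwise distances and looping over every triple. Here's a sketch of that sort of test; it inlines the dist definition given below as ncd, takes SMILES as byte strings, and is a reconstruction of the idea rather than the exact script I ran:

import zlib

def ncd(a, b):
    # Eq 2, using raw zlib sizes; try both concatenation orders,
    # as in the dist function later in this post.
    ca, cb = len(zlib.compress(a)), len(zlib.compress(b))
    cab = min(len(zlib.compress(a + b)), len(zlib.compress(b + a)))
    return (cab - min(ca, cb)) / float(max(ca, cb))

def count_violations(smiles):
    # Precompute all pairwise distances, then test
    # d(x, z) <= d(x, y) + d(y, z) over every ordered triple.
    n = len(smiles)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d[i][j] = d[j][i] = ncd(smiles[i], smiles[j])
    return sum(1 for i in range(n) for j in range(n) for k in range(n)
               if d[i][k] > d[i][j] + d[j][k])

print(count_violations([b"CCO", b"c1ccccc1O", b"CC(=O)Nc1ccc(O)cc1"]))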

Using zlib instead of gzip

I did not do those tests with gzip on the command-line. The overhead of starting gzip would have been too much, and inelegant. You (Rajarshi) instead used Python's gzip module, which would save a lot of the time overhead. My choice was to skip gzip completely and go directly to zlib, which gzip uses to do the compression. The gzip output format always includes header and trailer fields (see the gzip spec) so the zlib output should be consistently "better."
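The difference is a fixed wrapper cost: a gzip stream is a 10-byte header plus the deflate data plus an 8-byte CRC-32/length trailer, while a zlib stream is a 2-byte header plus the same deflate data plus a 4-byte Adler-32 checksum. At the same compression level the gzip output should therefore run a constant 12 bytes larger. A quick way to see it (using Python 3.2+'s gzip.compress for brevity):

import gzip, zlib

s = b"S(Cc1ccccc1)(=O)(=O)c1ccc(C)cc1"
# Same deflate payload, different wrappers: expect a 12-byte gap.
print(len(gzip.compress(s, 9)) - len(zlib.compress(s, 9)))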

Here's Python code. Note that the second "getsize" replaces the first, which I keep for posterity's sake.

import subprocess
import zlib

# calling gzip on the command-line (slow)
def getsize(s):
  x = subprocess.Popen(["gzip", "-f"], stdin=subprocess.PIPE, stdout=subprocess.PIPE)
  x.stdin.write(s)
  x.stdin.close()
  return len(x.stdout.read())

# using zlib directly
def getsize(s):
  return len(zlib.compress(s))

def dist(a, b):
  n1 = getsize(a); n2=getsize(b)
  return (min(getsize(a+b), getsize(b+a)) - min(n1,n2))/float(max(n1,n2))
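A quick usage sketch (the exact values depend on the zlib build and level, so I won't quote numbers here):

x = "S(Cc1ccccc1)(=O)(=O)c1ccc(C)cc1"
y = "O(CCO)c1c(Cl)cc(Cl)cc1"
print(dist(x, y))  # roughly in the 0..1 range; larger means less similar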
and for fun, Ruby code that starts to do the same thing
require 'zlib'
z = Zlib::Deflate.new
s = z.deflate("c1ccccc1O", Zlib::FINISH)
puts s.size
I included this Ruby code because the authors of the paper used Ruby for their testing. The zlib module is trivially available to Ruby, Python, and Java, and likely Perl as well, as part of the normal install. I don't think there's a big need to go through the command-line version to do the calculations.

Failures

As I said, some compounds fail. Here are a few:

x = S(Sc1c([N+](=O)[O-])cc([N+](=O)[O-])cc1)c1c([N+](=O)[O-])cc([N+](=O)[O-])cc1
y = N(C(C)=O)c1ccc(C(CCl)=O)cc1
z = C(C(C)=O)c1ccc([N+](=O)[O-])cc1
             dist(x,y), dist(x,z), dist(z,y)
zlib: 0.714285714286 0.27027027027 0.432432432432
gzip: 0.531914893617 0.204081632653 0.326530612245


x = S(Cc1ccccc1)(=O)(=O)c1ccc(C)cc1
y = O(CCO)c1c(Cl)cc(Cl)cc1
z = C(CC)(=O)c1ccc(C)cc1
             dist(x,y), dist(x,z), dist(z,y)
zlib: 0.642857142857 0.214285714286 0.428571428571
gzip: 0.45 0.15 0.3


x = C([C@@H](C(OCC)=O)C(C)=O)C(OCC)=O
y = C([C@@](C(C)(C)C)(C(=O)O)C)C(C)(C)C
z = C([C@@H](C(=O)O)C)C(C)(C)C
             dist(x,y), dist(x,z), dist(z,y)
zlib: 0.58064516129 0.34375 0.21875
gzip: 0.418604651163 0.25 0.159090909091

Why use compression as a similarity score?

In your blog you (Rajarshi) said:

it's not clear as to why this would be useful as it (intuitively) seems that without a strict metric function, the resultant clustering would be unstable. This might be something interesting to look at.
The paper points out that you don't need any special chemical informatics tools, as gzip is included with most machines (well, except perhaps MS Windows). You can do a perhaps somewhat naive but reasonable analysis using tools included these days with most languages. Though since OpenBabel's fingerprint implementations are easily available, this isn't a strong reason.

As for the applicability in clustering, what I bear in mind is that converting a molecule into a bitstring is only an approximation. Generating a perfect clustering (not that one exists) from the fingerprint doesn't mean that the chemistry clusters correctly. If the compressibility value were a better estimate of chemical similarity than a binary fingerprint, then a fault-tolerant clustering algorithm might still give good results.

This paper and LINGOS suggest that syntax alone (at the SMILES level) gives some chemical understanding, without needing software with a better understanding of chemistry. It's putting some doubt into my mind that bitstring fingerprints do a good job of estimating chemical similarity. Why have all the processing overhead for just a few percent better results?

One thing I've been thinking about is that the gzip/zlib algorithm works by identifying repeated string patterns. SMILES linearizes the structure, so in a rough sense zlib is generating many of the subpaths of length roughly 1-5 and comparing the distributions of those patterns between the two compounds. That explains [*] why bzip2 (which uses the Burrows-Wheeler block transform) doesn't work as well - it loses some of that neighbor information. Plus, bzip2 works best on large blocks, and SMILES strings are likely too small. The paper does talk about "padding the string to twice its length" by repeating the basic structure. This would help minimize some of those length problems. (A toy illustration of the subpath view follows the footnote.)

[*] Explanations can be coincidence. Correlation does not imply causation. Past performance is not indicative of future results. But I think it's a reasonable interpretation.
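To make that subpath intuition concrete, here's a toy comparison (my own illustration, definitely not the paper's algorithm): collect the multiset of all length 1-5 substrings of each SMILES and score the overlap of the two spectra, Tanimoto-style. Compounds the compressor judges similar should mostly share these short substrings:

from collections import Counter

def spectrum(smiles, min_n=1, max_n=5):
    # Multiset of all substrings of length 1..5 - a crude stand-in for
    # the repeated patterns deflate's string matcher can exploit.
    counts = Counter()
    for n in range(min_n, max_n + 1):
        for i in range(len(smiles) - n + 1):
            counts[smiles[i:i+n]] += 1
    return counts

def spectrum_similarity(a, b):
    # Tanimoto-like overlap: shared substring counts over total counts.
    ca, cb = spectrum(a), spectrum(b)
    return sum((ca & cb).values()) / float(sum((ca | cb).values()))

print(spectrum_similarity("C(CC)(=O)c1ccc(C)cc1", "S(Cc1ccccc1)(=O)(=O)c1ccc(C)cc1"))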




