Before starting today's lecture, a note: my lecture yesterday was based on John Bradshaw's Introduction to Chemical Systems article. He understands chemistry and the history of chemical systems much better than I do, and the article has some neat pictures. I recommend reading it.
A chemical fingerprint is a list of binary values (0 or 1) which characterize a molecule. There are several ways to create the list. I'll describe the widely used MACCS keys and how to use them for similarity comparisons and for database filtering. I'll then switch over to John Barnard's talk titled Chemical Structure Representation and Search Systems, which is a very good and comprehensive overview of the ways people have developed to compare two molecules.
The MACCS keys are a set of questions about a chemical structure. Here are some of the questions:
- Are there fewer than 3 oxygens?
- Is there an S-S bond?
- Is there a ring of size 4?
- Is at least one F, Cl, Br, or I present?
Here's an example. If the molecule is C1CCC1 then the answers to those questions are:
- 0 oxygens < 3 oxygens → True
- no S-S bond → False
- there is a ring of size 4 → True
- there are no halogens → False
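Writing 1 for True and 0 for False, the answers for C1CCC1 form the bitstring "1010". I can repeat that for other compounds. If the input structure is C1(=C(SSC1=O)Cl)Cl, which has one oxygen, an S-S bond, a ring of size 5 rather than 4, and two chlorines, then the bitstring is "1101".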
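As a minimal sketch of the encoding step only, the following turns a list of True/False answers into a bitstring. The answers are entered by hand here; working them out from the structure would need a cheminformatics toolkit, which this sketch does not assume. The names `QUESTIONS` and `to_bitstring` are made up for illustration.

```python
# The four example questions, in the order used in the text.
QUESTIONS = (
    "fewer than 3 oxygens",
    "has an S-S bond",
    "has a ring of size 4",
    "has F, Cl, Br, or I",
)


def to_bitstring(answers):
    """Turn a sequence of True/False answers into a string of '1' and '0'."""
    if len(answers) != len(QUESTIONS):
        raise ValueError("expected one answer per question")
    return "".join("1" if answer else "0" for answer in answers)


# Answers worked out by hand, not computed from the structure.
first_example = to_bitstring([True, False, True, False])   # C1CCC1
second_example = to_bitstring([True, True, False, True])   # C1(=C(SSC1=O)Cl)Cl

print(first_example)   # 1010
print(second_example)  # 1101
```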
An interesting idea, but why do it? Comparing two molecules directly is a hard problem. In bioinformatics you're used to comparing two sequences based on the alignment. That works because the concept of a minimum string edit maps pretty well to the physical model of how evolution works on the sequence. The direct mapping into chemistry is to look for the minimum edit distance of the graphs. That doesn't work because that operation has little physical meaning.
Chemists have worked hard to understand molecules and discovered that some substructure motifs give an indication of the functionality (or lack of it) of a compound. While not a perfect description, these bitstrings have three useful properties. They are easy to compare, a chemist can understand the results, and they have some predictive power.
Here's an easy way to compare two bitstrings. Compare each bit and add 1 when they are different (one is 1 and the other 0 or vice versa). Divide the result by the total number of bits in the string. If the two strings are identical then this value is 0. If one string is the exact opposite of the other then this value is 1. The count of differing bits is known as the Hamming distance between the two bitstrings, and dividing by the length gives the normalized Hamming distance.
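The comparison just described can be sketched in a few lines of Python. The function name `normalized_hamming` is my own choice, and the fingerprints are assumed to be strings of '0' and '1' of equal length.

```python
def normalized_hamming(a, b):
    """Fraction of positions at which two equal-length bitstrings differ."""
    if len(a) != len(b):
        raise ValueError("bitstrings must have the same length")
    if not a:
        raise ValueError("bitstrings must not be empty")
    # Count the positions where the two bits disagree.
    differing = sum(x != y for x, y in zip(a, b))
    return differing / len(a)


print(normalized_hamming("1010", "1010"))  # 0.0, identical strings
print(normalized_hamming("1010", "0101"))  # 1.0, exact opposites
print(normalized_hamming("1010", "1101"))  # 0.75, three of four bits differ
```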
I can use the fingerprint bitstrings to search a chemical database. If I think the bitstrings and the comparison method are close enough to the chemistry then I can find compounds similar to the query by comparing bitstrings and keeping only those that are similar enough. Computers are very fast at comparing bits so this technique can be used even in very large databases.
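Here is a rough sketch of such a search over a toy database. The fingerprints are stored as integers so that XOR plus a bit count does the comparison, which is one common way to make it fast. The five fingerprints were worked out by hand from the four example questions above; `DATABASE`, `distance`, and `search` are illustrative names, and the 0.25 cutoff is an arbitrary choice.

```python
NUM_BITS = 4  # one bit per example question

# Toy database: SMILES -> fingerprint, worked out by hand from the
# four example questions (not produced by a real MACCS implementation).
DATABASE = {
    "C1CCC1": 0b1010,
    "ClC1CCC1": 0b1011,
    "CCO": 0b1000,
    "CSSC": 0b1100,
    "C1(=C(SSC1=O)Cl)Cl": 0b1101,
}


def distance(fp1, fp2):
    """Normalized Hamming distance between two integer fingerprints."""
    # XOR leaves a 1 wherever the two fingerprints disagree.
    return bin(fp1 ^ fp2).count("1") / NUM_BITS


def search(query_fp, max_distance):
    """Return (distance, SMILES) pairs within max_distance, closest first."""
    hits = [(distance(query_fp, fp), smiles) for smiles, fp in DATABASE.items()]
    return sorted(hit for hit in hits if hit[0] <= max_distance)


for dist, smiles in search(0b1010, max_distance=0.25):
    print(dist, smiles)
```

With only four bits, ethanol (CCO) comes out as close to cyclobutane as chlorocyclobutane does, which shows how much the result depends on the choice of questions.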
Fingerprints are also useful as filters for substructure searching. Suppose each structure has a fingerprint with fields like
- Has fewer than 5 carbons
- Has a ring of size 6
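One way such a screen could work, sketched under my own assumptions: a target can only contain the query as a substructure if its fingerprint is compatible with the query's. The two fields behave in opposite directions. A ring of size 6 in the query must also be present in the target, while a target with fewer than 5 carbons cannot contain a query with 5 or more. The `Screen` type, the `may_contain` function, and the example molecules are hypothetical, and the field values are entered by hand.

```python
from typing import NamedTuple


class Screen(NamedTuple):
    fewer_than_5_carbons: bool
    has_ring_of_size_6: bool


def may_contain(target, query):
    """Return False only when the target cannot contain the query.

    True means "maybe": the full substructure test is still needed.
    """
    # A ring of size 6 in the query must also appear in the target.
    if query.has_ring_of_size_6 and not target.has_ring_of_size_6:
        return False
    # A target with fewer than 5 carbons cannot hold a query with 5 or more.
    if target.fewer_than_5_carbons and not query.fewer_than_5_carbons:
        return False
    return True


# Field values worked out by hand for illustration.
benzene_query = Screen(fewer_than_5_carbons=False, has_ring_of_size_6=True)
ethanol = Screen(fewer_than_5_carbons=True, has_ring_of_size_6=False)
toluene = Screen(fewer_than_5_carbons=False, has_ring_of_size_6=True)

print(may_contain(ethanol, benzene_query))  # False, rejected by the screen
print(may_contain(toluene, benzene_query))  # True, passes on to the full test
```

The screen only rules structures out. Anything that passes still has to go through the real, slower substructure match.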
There are many ways to compare two compounds and many nuances. John Barnard's talk does a great job of covering the topic so I'll walk through his slides.
Copyright © 2001-2020 Andrew Dalke Scientific AB