8-bit SMILES

[ previous | newer ] /home/writings/diary/archive/2014/06/17/8-bit_SMILES

8-bit SMILES

SMILES strings are a compact way to store molecular topology. I would like to use them as the primary storage format for an in-memory compound database which supports substructure search. With a more compact format, I can fit more structures into RAM and also perhaps make better use of cache. There's some trade-off though; it takes time to convert a SMILES string into a usable data structure for subgraph isomorphism testing. I though of a way to make that conversion a bit faster. Rather than determine ring membership after creating the internal connection table, encode the ring membership in the high bit of the SMILES string, assuming the otherwise 7-bit clean SMILES strings is stored as 8-bit bytes.

I though of two forms of "8-bit SMILES." One is a purely internal format, where the high bit of byte i indicates that bond is is in a ring, and the other is an 8-bit exchange SMILES, where ring membership is portable between different implementation. The former is perhaps practical, the latter is more conceptually interesting than useful. At this point though it's mostly theoretical, as I don't have access to software where this might make a noticable difference.

The rest of this post describes what I just said, along with a lot of details.

Substructure search needs compound lookup for validation

A long-time reader of this blog will recall the many essays I've written about similarity search, and more specifically Tanimoto search. One result of that work is chemfp, a set of tools for generating and searching chemical fingerprints. Now I'm going to tackle substructure search, which is a much more complicated task.

Most computer-based substructure search systems break the problem up into two phases: a fast screening phase, which uses a pre-computed index to reject obvious mismatches, and a validation phase, which uses the slower subgraph isomorphism tests to remove false positives. The validation phase get identifiers from the screen, fetches the corresponding target molecules, and tests if the query subgraph exists in the target.

The easiest way to implement this is to load all of the molecules into memory. The "fetch" step then is simply an index or table lookup to get the corresponding molecule object, which is already in a form ready for subgraph isomorphism testings.

Loading a large database takes time

There are two main problems with that. First, the database needs to load all of the molecule objects before doing any searches. This may take some time, which is okay for a chemistry database server, where server starts are rare and where the startup cost is essentially invisible. It's more of a problem when you want to embed substructure search in other tools, like command-line programs or directly your web server.

The startup cost can be mitigated in a few ways. The input structures can be converted to a format which is faster to load; if the structures are stored in a random-access file format (I like cdb and sqlite, but of course any database would also work) instead of a streaming format like a SMILES or SD file then the records can be loaded only when needed; and once read the resulting molecule can kept in memory for later reuse.

A large database takes a lot of memory

The second main problem is that molecule objects take up space. My desktop has 6 GB of memory, and PubChem has about 50 million compounds. That means the average compound must take up less than 128 bytes for the identifier, molecule, and the screen. Most substructure screens are 1024 bits or larger, which is 128 bytes right there! (In a later post I'll point out one technique which might reduce the screen to 20 or 24 bytes.)

There are also ways to mitigate this overhead. The molecule's connection table representation can be made more compact. This can include shaving off bits for the atom and bond information. There's little need to have a 32 bit integer or even 8 bit integer for atomic charge when only 4 bits are needed to represent the charges -8 to +7. Nor is there need for an 8 byte pointer to each atom and bond object when a 2 byte table index is all that's needed for 64K atoms or bonds.

If you instead want a full connection table then you might instead store the connection table in a compressed form (doing this means you can't use pointers), and decompress it when needed. As a simple test, zlib compresses RDKit's binary molecule format by about 50%. You might also consider Google's Snappy or Blosc.

Unfortunately, it's hard to do this on your own. The chemistry toolkit developers follow the encapsulation principle, which means you as a user of the toolkit don't have direct access to the underlying data structure, but only have access through some API.

If you use OEChem then you're in luck - their API has a "DBMol", which compresses the internal representation. You need to call UnCompress before you can use it. In the other toolkits I know about, the closest you can get is to store the molecule in one of the support toolkit formats; preferably one that's compact and fast to load. For RDKit that's the binary molecule format:

>>> from rdkit import Chem
>>> mol = Chem.MolFromSmiles("C=O")
>>> s = mol.ToBinary()
>>> s
'\xef\xbe\xad  [...removed for clarity...] \x00\x00\x00\x16'
>>> len(s)
60
>>> mol2 = Chem.Mol(s)
>>> mol2.GetNumAtoms()
2

(The other toolkits have similar binary formats.)

SMILES as the internal format

As you might have noticed from the previous section, RDKit's binary format is large compared to the original SMILES string. The two byte ethane SMILES produced a 60 byte result. Using SMILES strings from ChEMBL-16, the average SMILES string was 52 bytes long while the zlib compressed molecule byte string was 215 bytes long and the uncompressed molecule byte string was 422 bytes long. (zlib compression of the SMILES string gives another ~15% compression, but if you go this route you should consider a special encoder using a prebuilt dictionary for SMILES, along the line of SMAZ.)

Assuming 55 bytes/SMILES, 24 bytes/screen, 10 bytes/id, and 32 bytes for hash table overhead means each record in the database takes only 121 bytes, which is just barely within my memory budget of 128 bytes. My goal can be achieved – if I can keep everything as SMILES, or a simliarly compact notation.

(Using a canonical SMILES is likely better than using an alternative format, because odds are you'll want to return the SMILES to the caller. Since you have the SMILES already, you don't need to do another lookup. Also, it's probably a better use of your time to optimize SMILES parsing, which has broad functionality, than to develop and optimize a new format.)

There's long precedence for using SMILES as the primary structure format. Daylight's database systems (Thor and Merlin) did this. They provided the first generation of in-memory chemical databases, and from what I heard, the result was about 100x faster than the previous disk-based systems, because memory is so much faster than disk.

You can even see some evidence of that in the literature. John Barnard's excellent review paper "Substructure Searching Methods: Old and New" in JCICS (1993), p. 535 in the section "Hardware Solutions" starts:

The modern availability of computers with very large amounts of random-access memory has provided an opportunity for futher application of the superimposition principle in the substructure sarch system marketed by Daylight Cehmical Information Systems Inc. The reduction in the length of the big string (or "fingerprint") required for each compound enables the entire bit map, even for very large databases, to be held in memory and, thus, permits very rapid screening searches.

That's not quite the verification I wanted, since it only talks about in-memory screens, but it's the best I can find so far.

SMILES parse performance

The question is, is the SMILES parser too slow for this task? That depends very much on the toolkit, or rather, the skills of the toolkit authors, the time they spent working on optimization, and how they decided to decompose the various stages of SMILES parsing.

Take for example OEChem. Its OEParseSmiles handles a blistering 60,000 SMILES strings per second. Not only did they spend a lot of time making the code go fast, but the parser does the bare minimum needed to parse the SMILES string. For example, this low-level API call will not modify the input aromaticity, so if the input benzene is in Kekule form then the molecule object will have single and double bonds instead of aromatic bonds. (You need to call another function if you want to perceive a certain aromaticity model.) I believe it also uses lazy initialization of some of the internal data structures, so timing OEParseSmiles isn't enough to get a real understanding of the overall performance.

By comparison, RDKit's MolFromSmiles does a more sedate 2,000 SMILES per second. Much of the time is spent doing in sanitization, such as aromaticity perception. If you pass in benzene in Kekule form then RDKit will convert it to aromatic form. It also builds up information ring and ring size information; see that above link for details.

Disabling RDKit sanitization

RDKit's default is good for a program which accepts arbitrary SMILES strings. But a chemistry database gets to specify its own internal format. It can assure that the SMILES strings in the database are valid, have the right aromaticity assignment and chirality, and so on. There's little need for sanitization overhead in this case.

RDKit recognizes this. MolFromSmiles has a 'sanitize' parameter which, when false, disables sanitization. This brings the performance up to 8,500 SMILES strings per second. Unfortunately, some SMARTS queries depend on information that's normally computed during aromaticity:

>>> from rdkit import Chem
>>> mol = Chem.MolFromSmiles("C1=CC=CC=C1", sanitize=False)
>>> Chem.MolToSmiles(mol)
'C1=CC=CC=C1'
>>> query = Chem.MolFromSmarts("[R]~[R]")
>>> mol.HasSubstructMatch(query)
[13:11:16] 

****
Pre-condition Violation
RingInfo not initialized
Violation occurred on line 36 in file /Users/dalke/envs/RDKit_2014_03_1_py27/Code/GraphMol/RingInfo.cpp
Failed Expression: df_init
****

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: Pre-condition Violation

There are similar problem for ring size-based queries like "[R3]" and valence-based queries like "[v4]". RDKit has alternative methods to set some of these internal data structures, like the following:

>>> Chem.FastFindRings(mol)
>>> mol.HasSubstructMatch(query)
True

Encode ring membership in the SMILES

My point is that both RDKit and OEChem must do some sort of ring-finding, if that information needed for the substructure query, because that information isn't directly expressed in the SMILES string. It's one of the tradeoffs between a compact representation and one which is directly usable for substructure search.

SMILES strings contain only printable ASCII characters. Each character only needs 7 bits, but are often stored in a byte with 8 bits. There's an extra bit available! What about encoding the ring membership information in that bit? I came up with two possibilities.

Tag bonds by byte position

The first is based on the fact that a SMILES string with N bonds has at least N+1 characters. Proof: every bond requires either an additional atom symbol (adding at least one character) or a ring closure (adding at least two characters). Set the high bit of byte i to 1 if bond i is in a ring, otherwise set it to 0. Obviously, both atoms at the end of a ring bond are also in a ring.

Very simple, yes? It also has the advantage that the correct SMILES is trivially generated by masking the high bit to 0. It's even possible to experiment with parts of this idea on your own. Use your toolkit of choice to make a SMILES string, then parse the string back into a molecule. Use the resulting bond list to set the SMILES string bits.

Unfortunately, single pass SMILES generation (that is, without reading the SMILES back into a molecule to get the bond ordering) isn't possible without toolkit support, since I don't think any toolkit exposes that information into user space. Not that it makes a difference, since there's no way to make your own reader; the toolkits don't have a way to modify their internal ring membership state.

It's also not portable, since there's nothing in the SMILES spec which requires that toolkits preserve bond ordering is preserved.

By the way, I told a lie when I said that the number of bonds was less than the number of characters in the SMILES string. Most toolkits interpret "[ReH9-2]" as single rhenium atom with an implicit hydrogen count of 9, but a few treat it as a rhenium atom with a bond to each of 9 hydrogen atoms. That's 9 bonds in SMILES with 8 characters, so it's not even possible to experiment with 8-bit SMILES using those toolkits.

OEChem and RDKit support for 'foreign' aromaticity modes

Before I get to the second possibility, a bit about aromaticity models. The Daylight toolkit supports a single aromaticity model. All input structures are reperceived according to that model. But different toolkits have different models, and some file formats are based on toolkit-specific aromaticity assignments. If you want to work with multiple toolkits then you need a way to work with 'foreign' aromaticity, and not just the toolkit's default.

As you saw earlier, OEParseSmiles keeps the original bond type assignments, and RDKit can be convinced to do the same. But there's an ambiguity in how to handle implicit bonds in SMILES if there's no aromaticity perception stage.

Consider biphenyl, "c1ccccc1c1ccccc1". It's bonds are all implicit, which mean they are either a single bond or an aromatic bond and the toolkit needs to figure out which is which. What are the rules for how to interpret a foreign aromaticity model? In this case, it's possible to do a ring search, detect that one of the implicit bonds is not in a ring, and determine that it's a single bond. For the others, it can assume that an implicit ring bond between two aromatic atoms is itself aromatic.

But ring detection isn't enough. Consider "Cn1ccc2nccc2c1", where the fusion bond between the two rings is not aromatic enough though the bond is in a ring and the end atoms are aromatic. A smart enough reader could then apply chemistry knowledge to resolve the ambiguity, but "smart enough" readers are more complex, and at some point there will be disagreements on which interpretation is correct.

Instead, good SMILES writers follow a simple rule. A single bond whose two terminal atoms are both aromatic must be represented with an explicit "-". A non-validating reader which sees an implicit bond, need only to check the two terminal atoms. If both are aromatic then it's an aromatic bond, otherwise it's a single bond.

Under this rule, the biphenyl from before is better written as 'c1ccccc1-c1ccccc1' and the structure with two fused rings is better written as "Cn1ccc-2nccc2c1" instead of "Cn1ccc2nccc2c1". In that way, another toolkit can read the SMILES and unambiguously support the foreign aromaticity model.

Tag atoms and explicit bonds; workaround implicit ring bonds

My other 8-bit SMILES idea is portable, because it doesn't depend on a specific atom or bond parser order, but it's more complex to implement. The basic idea is to mark each ring atom and bond by setting the high bit of its corresponding SMILES symbol to 1. This is trivial for the atoms, because each atom must have at least one symbol in the SMILES, and it's trivial for explicit bonds, but what about the implicit bonds - how do we know which is in a ring and which isn't?

I considered making each ring bond explicit, but that was too much overhead. Instead, I'll use a rule which mirrors the aromaticity rule: if a bond which could be implicit is not in a ring but its terminal atoms are ring atoms then the bond must be made explicit. A quick survey using ChEMBL shows that on average the SMILES strings grow by 0.5 character, to incorporate the extra "-"s needed for this scheme.

Which about SSSR?

The two proposals use the available bit to specify ring membership, but most toolkits implement the more complex SSSR - "Smallest Set of Smallest Rings." As you can see from the RDKit timings, SSSR perception takes about 15% of the sanitization overhead, while bond membership detection is about 1/5th of that time. Thus, my proposals would only save a few percent of the overall time.

But then again, I don't think we should worry about SSSR.

There are two camps about SSSR. The old papers say things like "SSSR is very important; it reflects the way that chemists really think." Daylight comes from this era, and most toolkits follow its lead. Only thing is, there are so many SSSR implementations. My SSSR analysis from a few years ago convinced me from that SSSR is not worthwhile. In short, where it was possible to have multiple choices for the SSSR, there was little toolkit consensus on which was the right set.

I'm decidedly in the OpenEye camp - don't use SSSR. This has a few downstream consequences, the most obvious being that I agree with OpenEye's decision to redefine the "R<n>" SMARTS term. According to Daylight, "R2" selects atoms which are in 2 SSSR rings. According to OpenEye, "R2" selects atoms which have at least 2 ring bonds. (As an approximation, Daylight's "R2" is OpenEyes's "R3".)

You can see how this change to the SMARTS semantics - which I agree with - gives a definition which is more portable, easier to implement, and requires less setup time to get the parsed molecule into a useable state.

Bear in mind though that it's unlikely that RDKit will change the way it does things, even if it can get a ~15% speedup by getting rid of SSSR. And yes, this lack of SSSR perception is one of the reasons why OEChem's SMILES parser is so much faster than RDKit.

Which is better?

I'll call these "bond tagged" and "atom tagged" 8-bit SMILES, respectively. I originally came up with the atom tagged format because its portability fits in with the underlying goal that SMILES is a implementation-independent line notation for valance bond chemistry. The problem is that portability comes at a cost. For example, either the 7-bit output SMILES contains an occasional extra "-" in its canonical format, or there's a decidedly non-trivial conversion function to remove the occasional "-" characters. By comparison, the high bit stripping function for bond tagged SMILES is just a few lines of code.

The thing is, I can't come up with a good reason to have a portable 8-bit SMILES. Effectively every system will re-canonicalize the SMILES to put it into the database for record lookup, because there is no standard SMILES canonicalization algorithm. There might be some very clever programs which can reason directly on the SMILES strings without making a full-blown connection table, but overall it seems like a lot of work for little gain.

One observation is that it's easy to count the number of ring bonds in bond tagged SMILES, and ring atoms in atom tagged SMILES, by just doing a scan of the bytes. I speculate this might be useful.

Is it useful?

That's really the question, isn't it? I don't know. For RDKit it won't make a difference - ring membership identification is about 3% of the overall time, so it doesn't seem like it's worth speeding up. But for something like OEChem .. well, it's proprietary code and I don't know how much of its time is spent in ring detection.

It will be many years before I can get to the stage where I can try out this idea. Until then, this page is meant as a thought experiment for what might be possible.

Andrew Dalke is an independent consultant focusing on software development for computational chemistry and biology. Need contract programming, help, or training? Contact me