
Weininger's Realization

Dave Weininger passed away recently. He was very well known in the chemical informatics community because of his contribution to the field and his personality. Dave and Yosi Taitz founded Daylight Chemical Information Systems to turn some of these ideas into a business, back in the 1980s. It was very profitable. (As a bit of trivia, the "Day" in "Daylight" comes from "Dave And Yosi".)

Some of the key ideas that Dave and Daylight introduced are SMILES, SMARTS, and fingerprints (both the name and the hash-based approach). Together these made for a new way to handle chemical information search, and do so in significantly less memory. The key realization, which I think led to the business success of the company, is that the cost of memory was decreasing faster than the creation of chemical information. This trend, combined with the memory savings of SMILES and fingerprints, made it possible to store a corporate dataset in RAM, and do chemical searches about 10,000 times faster than the previous generation of hard-disk based tools, and do it before any competition could. I call this "Weininger's Realization". As a result, the Daylight Thor and Merlin databases, along with the chemistry toolkits, became part of the core infrastructure of many pharmaceutical companies.

I don't know if there was a specific "a-ha" moment when that realization occurred. It certainly wasn't what drove Dave to work on those ideas in the first place. He was a revolutionary, a Prometheus who wanted to take chemical information from what he derisively called 'the high priests' and bring it to the people.

An interest of mine in the last few years is to understand more about the history of chemical information. The best way I know to describe the impact of Dave and Daylight is to take some of the concepts back to the roots.

You may also be interested in reading Anthony Nicholls description of some of the ways that Dave influenced him, and Derek Lowe's appreciation of SMILES.

Errors and Omissions

Before I get there, I want to emphasize that the success of Daylight cannot be attributed to just Dave, or Dave and Yosi. Dave's brother Art and his father Joseph were coauthors on the SMILES canonicalization paper. The company hired people to help with the development, both as employees and consultants. I don't know the details of who did what, so I will say "Dave and Daylight" and hopefully reduce the all too easy tendency to give all the credit to the most visible and charismatic person.

I'm unfortunately going to omit many parts of the Daylight technologies, like SMIRKS, where I don't know enough about the topic or its effect on cheminformatics. I'll also omit other important but invisible aspects of Daylight, like documentation or the work Craig James did to make the database servers more robust to system failures. Unfortunately, it's the jockeys and horses which attract the limelight, not those who muck the stables or shoe the horses.

Also, I wrote this essay mostly from what I have in my head and from presentations I've given, which means I've almost certainly made mistakes that could be fixed by going to my notes and primary sources. Over time I hope to spot and fix those mistakes in this essay. Please let me know of anything you want me to change or improve.

Dyson and Wiswesser notations

SMILES is a "line notation", that is, a molecular representation which can be described as a line of text. Many people reading this may have only a vague idea of the history of line notations. Without that history, it's hard to understand what helped make SMILES successful.

The original line notations were developed in the 1800s. By the late 1800s chemists began to systematize the language into what is now called the IUPAC nomenclature. For example, caffeine is "1,3,7-trimethylpurine-2,6-dione". The basics of this system are taught in high school chemistry classes, but it takes years of specialized training to learn how to generate the correct name for complex structures.

Chemical nomenclature helps chemists index the world's information about chemical structures. In short, if you can assign a unique name to a chemical structure (a "canonical" name), then you can use standard library science techniques to find information about the structure.

The IUPAC nomenclature was developed when books and index cards were the best way to organize data. Punched card machines brought a new way of thinking about line notations. In 1946, G. Malcolm Dyson proposed a new line notation meant for punched cards. The Dyson notation was developed as a way to mechanize the process of organizing and publishing a chemical structure index. It became a formal IUPAC notation in 1960, but was already on its last legs and dead within a few years. While it might have been useful for mechanical punched card machines, it wasn't easily repurposed for the computer needs of the 1960s. For one, it depended on superscripts and subscripts, and used characters which didn't exist on the IBM punched cards.

William J. Wiswesser in 1949 proposed the Wiswesser Line Notation, universally called WLN, which could be represented in EBCDIC and (later) ASCII in a single line of text. More importantly, unlike the Dyson notation, which follows the IUPAC nomenclature tradition of starting with the longest carbon chain, WLN focuses on functional groups, and encodes many functional groups directly as symbols.

Chemists tend to be more interested in functional groups, and want to search based on those groups. For many types of searches, WLN acts as its own screen, that is, it's possible to do some types of substructure search directly on the symbols of the WLN, without having to convert the name into a molecular structure for a full substructure search. To search for structures containing a single sulfur, look for WLNs with a single occurrence of S, but not VS or US or SU. The chemical information scientists of the 1960s and 1970s developed several hundred such clever pattern searches to make effective use of the relatively limited hardware of that era.
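The sulfur example can be sketched in a few lines of Python. This is a toy illustration of the general idea, not one of the actual historical pattern searches, and the strings in it are made up for the example; real WLN symbol rules are considerably more involved:

```python
def single_sulfur_screen(wln: str) -> bool:
    """Crude text screen over a WLN string: true if it contains exactly
    one sulfur-like 'S', skipping any 'S' that is part of the two-letter
    symbols VS, US, or SU (which mean something else in WLN)."""
    count = 0
    for i, ch in enumerate(wln):
        if ch != "S":
            continue
        if i > 0 and wln[i - 1] in "VU":            # part of VS or US
            continue
        if i + 1 < len(wln) and wln[i + 1] == "U":  # part of SU
            continue
        count += 1
    return count == 1
```

The point is that this is pure string matching: no connection table, no atom-by-atom search, just the kind of test a 1960s program (or a punched-card sorter) could do cheaply.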

WLNs started to disappear in the early 1980s, before SMILES came on the scene. Wendy Warr summarized the advantages and disadvantages of WLNs in 1982. She wrote "The principle disadvantage of WLN is that it is not user friendly. This can only be overcome by programs which will derive a canonical WLN from something else (but no one has yet produced a cost-effective program to do this for over 90% of compounds), by writing programs to generate canonical connection tables from noncanonical WLNs, or by accepting the intervention of a skilled "middle man"."

Dyson/IUPAC and WLNs were just two of dozens, if not hundreds, of proposed line notations. Nearly every proposal suffered from a fatal flaw - they could not easily be automated on a computer. Most required postgraduate-level knowledge of chemistry, and were error-prone. The more rigorous proposals evaluated the number of mistakes made during data entry.

Among the few exceptions are the "parentheses free" notations from a pair of papers from 1964, one by Hiz and the other by Eisman, in the same issue of the Journal of Chemical Documentation. To modern eyes, they look very much like SMILES but represented in a postfix notation. Indeed, the Eisman paper gives a very SMILES-like notation for a tree structure, "H(N(C(HC(CIC(N(HH)C(HN(IH)))I)H)H))" and a less SMILES-like notation for a cyclic structure, before describing how to convert them into a postfix form.

I consider the parentheses-free nomenclatures a precursor to SMILES, but they were not influential to the larger field of chemical information. I find this a bit odd, and part of my research has been to try and figure out why. It's not like it had no influence. A version of this notation was in the Chemical Information and Data System (CIDS) project in the 1960s and early 1970s. In 1965, "CIDS was the first system to demonstrate online [that is, not batch processing] searching of chemical structures", and CIDS wasn't completely unknown in the chemical information field.

But most of the field in the 1970s went for WLN for a line notation, or a canonical connection table.

SMILES

Dave did not know about the parentheses-free line notations when he started work on SMILES, but he drew from similar roots in linguistics. Dave was influenced by Chomsky's writings on linguistics. Hiz, mentioned earlier, was at the Department of Linguistics at the University of Pennsylvania, and that's also where Eugene Garfield did his PhD work on the linguistics of chemical nomenclature.

Dave's interest in chemical representations started when he was a kid. His father, Joseph Weininger, was a chemist at G.E., with several patents to his name. He would draw pictures of chemical compounds for Dave, and Dave would, at a non-chemical level, describe how they were put together. These seeds grew into what became SMILES.

SMILES as we know it started when Dave was working for the EPA in Duluth. They needed to develop a database of environmental compounds, to be entered by untrained staff. (For the full story of this, read Ed Regis's book "The Info Mesa: Science, Business, and New Age Alchemy on the Santa Fe Plateau.") As I recall, SMILES was going to be the internal language, with a GUI for data entry, but it turns out that SMILES was easy enough for untrained data entry people to write it directly.

And it's simple. I've taught the basics of SMILES to non-chemist programmers in a matter of minutes, while WLN, Dyson, and InChI, as examples of other line notations, are much more difficult to generate either by hand or by machine. Granted, those three notations have canonicalization rules built into them, which is part of the difficulty. Still, I asked Dave why something like SMILES didn't appear earlier, given that the underlying concepts existed in the literature by then.

He said he believes it's because the generation of people before him didn't grow up with a software development background. I think he's right. When I go to a non-chemist programmer, I say "it's a spanning tree with special rules to connect the cycles", and they understand. But that vocabulary was still new in the 1960s, and very specialized.

There's also some conservatism in how people work. Dyson defended the Dyson/IUPAC notation, saying that it was better than WLN because it was based on the longest carbon chain principle that chemists were already familiar with, even though the underlying reasons for that choice were becoming less important because of computer search. People know what works, and it's hard to change that mindset when new techniques become available.

Why the name SMILES? Dave Cosgrove told me it was because when Dave would tell people he had a new line notation, most would "groan, or worse. When he demonstrated it to them, that would rapidly turn to a broad smile as they realised how simple and powerful it is."

Exchange vs. Canonical SMILES

I not infrequently come across people who say that SMILES is a proprietary format. I disagree. I think the reason for the disagreement is that two different concepts go under the name "SMILES". SMILES is an exchange language for chemistry, and it's an identifier for chemical database search. Only the second is proprietary.

Dave wanted SMILES as a way for chemists from around the world and through time to communicate. SMILES describes a certain molecular valence model view of the chemistry. This does not need to be canonical, because you can always do that yourself once you have the information. I can specify hydrogen cyanide as "C#N", "N#C", or "[H][C]#[N]" and you will be able to know what I am talking about, without needing to consult some large IUPAC standard.

He wanted people to use SMILES that way, without restriction. The first SMILES paper describes the grammar. Later work at Daylight in the 1990s extended SMILES to handle isotopes and stereochemistry. (This was originally called "isomeric SMILES", but it's the SMILES that people think of when they want a SMILES.) Daylight published the updated grammar on their website. It was later included as part of Gasteiger's "Handbook of Chemoinformatics: From Data to Knowledge in 4 Volumes". Dave also helped people at other companies develop their own SMILES parsers.

To say that SMILES as an exchange format is proprietary is opposite to what Dave wanted and what Daylight did.

What is proprietary is the canonicalization algorithm. The second SMILES paper describes the CANGEN algorithm, although it is incomplete and doesn't actually work. Nor does it handle stereochemistry, which was added years later. Even internally at Daylight, it took many years to work out all of the bugs in the implementation.

There's a good financial reason to keep the algorithm proprietary. People were willing to pay a lot of money for a good, fast chemical database, and the canonicalization algorithm was a key part of Daylight's Thor and Merlin database servers. In business speak, this is part of Daylight's "secret sauce".

On the other hand, there's little compelling reason for that algorithm to be published. Abstractly speaking, publishing it would mean that different tools would generate the same canonical SMILES, so a federated data search would reduce to a text search, rather than require a re-canonicalization. This is one of the goals of the InChI project, but they discovered that Google didn't index the long InChI strings in a chemically useful way. They created the InChI key as a solution. SMILES has the same problem and would need a similar solution.

Noel O'Boyle published a paper pointing out that the InChI canonicalization assignment could be used to assign the atom ordering for a SMILES string. This would give a universal SMILES that anyone could implement. There's been very little uptake of that idea, which gives a feel for how little demand there is.

Sometimes people also point to the governance model to decide if something is proprietary or not, or to the copyright restrictions on the specification. I don't agree with these interpretations, and would gladly talk about them at a conference meeting if you're interested.

Line notations and connection tables

There are decades of debate on the advantages of line notations over connection tables, or vice versa. In short, connection tables are easy to understand and parse into an internal molecule data structure, while line notations are usually more compact and can be printed on a single line. And in either case, at some point you need to turn the text representation into a data structure and treat it as a chemical compound rather than a string.

Line notations are a sort of intellectual challenge. This alone seems to justify some of the many papers proposing a new line notation. By comparison, Open Babel alone supports over 160 connection table formats, and there are untold more in-house or internal formats. Very few of these formats have ever been published, except perhaps in an appendix in a manual.

Programmers like simple formats because they are easy to parse, often easy to parse quickly, and easy to maintain. Going back to the Warr quote earlier, it's hard to parse WLN efficiently.

On the other hand, line notations fit better with text-oriented systems. Back in the 1960s and 1970s, ISI (the Institute for Scientific Information) indexed a large subset of the chemistry literature and distributed the WLNs as paper publications, in a permuted table to help chemists search the publication by hand. ISI was a big proponent of WLN. And of course it was easy to put a WLN on a punched card and search it mechanically, without an expensive computer.

Even now, a lot of people use Excel or Spotfire to display their tabular data. It's very convenient to store the SMILES as a "word" in a text cell.

Line notations also tend to be smaller than connection tables. As an example, the connection table lines from the PubChem SD files (excluding the tag data) average about 4K per record. The PUBCHEM_OPENEYE_ISO_SMILES tag values average about 60 bytes in length.

Don't take the factor of 70 as being all that meaningful. The molfile format is not particularly compact, PubChem includes a bunch of "0" entries which could be omitted, and the molfile stores the coordinates, which the SMILES does not. The CAS search system in the late 1980s used about 256 bytes for each compact connection table, which is still 4x larger than the equivalent SMILES.

Dave is right. SMILES, unlike most earlier line notations, really is built with computer parsing in mind. Its context-free grammar is easy to parse using a simple stack, though still not as easy as a connection table. It doesn't require much in the way of lookup tables or state information. There's also a pretty natural mapping from the SMILES to the corresponding topology.
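A toy parser shows how little machinery that grammar needs: one stack for branches and one table for open ring closures. This is only a sketch, assuming single-letter atoms and single-digit ring closures, and ignoring bond orders, charges, bracket atoms, aromaticity, and two-letter elements; a real SMILES parser handles all of those.

```python
def parse_smiles_topology(smiles: str):
    """Extract the bond graph (atom symbols and bonds) from a very
    simple SMILES string, using a branch stack and a ring-closure table."""
    atoms, bonds = [], []
    stack = []   # atom indices where an open '(' branch started
    rings = {}   # ring-closure digit -> atom index awaiting its partner
    prev = None  # index of the most recently placed atom
    for ch in smiles:
        if ch.isalpha():
            atoms.append(ch)
            idx = len(atoms) - 1
            if prev is not None:
                bonds.append((prev, idx))
            prev = idx
        elif ch == "(":
            stack.append(prev)       # remember the branch point
        elif ch == ")":
            prev = stack.pop()       # return to the branch point
        elif ch.isdigit():
            if ch in rings:          # second occurrence closes the ring
                bonds.append((rings.pop(ch), prev))
            else:                    # first occurrence opens it
                rings[ch] = prev
        elif ch in "-=#:":
            pass                     # bond orders not tracked here
    return atoms, bonds
```

For example, the ring closure in "C1CC1" produces the bond (0, 2) that turns the three-carbon chain into a triangle, and "CC(C)N" branches both the third carbon and the nitrogen off atom 1.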

What would happen if you had a really fast SMILES parser? As a thought experiment which doesn't reflect real hardware, suppose you could convert 60 bytes of SMILES string to a molecule data structure faster than you could read the additional 400 bytes of connection table data. (Let's say the 10 GHz CPU is connected to the data through a 2400 baud modem.) Then clearly it's best to use a SMILES, even if it takes longer to process.

A goal for the Daylight toolkit was to make SMILES parsing so fast that there was no reason to store structures in a special internal binary format or data structure. Instead, when it's needed, parse the SMILES into a molecule object, use the molecule, and throw it away.

On the topic of muck and horseshoes, as I recall Daylight hired an outside company at one point to go through the code and optimize it for performance.

SMARTS

SMARTS is a natural recasting and extension of the SMILES grammar to define a related grammar for substructure search.

I started in chemical information in the late 1990s, with the Daylight toolkit and a background which included university courses on computer grammars and regular expressions. The analogy that SMARTS is to SMILES as regular expressions are to strings seemed obvious, and I modeled my PyDaylight API on the equivalent Python regular expression API.

Only years later did I start to get interested in the history of chemical information, though I've only gotten up to the early 1970s so there's a big gap that I'm missing. Clearly there were molecular query representations before SMARTS. What I haven't found is a query line notation, much less one implemented in multiple systems. This is a topic I need to research more.

The term "fingerprint"

Fingerprints are a core part of modern cheminformatics. I was therefore surprised to discover that Daylight introduced the term "fingerprint" to the field, around 1990 or so.

The concept existed before then. Adamson and Bush did some of the initial work in using fingerprint similarity as a proxy for molecular similarity in 1973, and Willett and Winterman's 1986 papers [1, 2] (the latter also with Bawden) reevaluated the earlier work and informed the world of the effectiveness of the Tanimoto coefficient. (We call it "Tanimoto" instead of "Jaccard" precisely because of those Sheffield papers.)

But up until the early 1990s, published papers referred to fingerprints as the "screening set" or "bit screens", terms which describe the source of the fingerprint data, and didn't reify them into an independent concept. The very first papers which used "fingerprint" were by Yvonne Martin, an early Daylight user at Abbott, and John Barnard, who used "fingerprint", in quotes, in reference specifically to Daylight technology.

I spent a while trying to figure out the etymology of the term. I asked Dave about it, but it wasn't the sort of topic which interested him, and he didn't remember. "Fingerprint" already existed in chemistry, for IR spectra, and the methods for matching spectra are somewhat similar to those of cheminformatics fingerprint similarity, but not enough for me to be happy with the connection. Early in his career Dave wrote software for a mass spectrometer, so I'm also not rejecting the possibility.

The term "fingerprint" was also used in cryptographic hash functions, like "Fingerprinting by Random Polynomial" by Rabin (1981). However, these fingerprints can only be used to test if two fingerprints are identical. They are specifically designed to make it hard to use the fingerprints to test for similarity of the source data.

I've also found many papers talking about image fingerprints or audio fingerprints which can be used for both identity and similarity testing; so-called "perceptual hashes". However, their use of "fingerprint" seems to have started a good decade after Daylight popularized it in cheminformatics.

Hash fingerprints

Daylight needed a new name for fingerprints because they used a new approach to screening.

Fingerprint-like molecular descriptions go back to at least the Frear code of the 1940s. Nearly all of the work in the intervening 45 years was focused on finding fragments, or fragment patterns, which would improve substructure screens.

Screen selection was driven almost entirely by economics. Screens are cheap, with data storage as the primary cost. Atom-by-atom matching, on the other hand, had a very expensive CPU cost. The more effective the screens, the better the screenout, the lower the overall cost for an exact substructure search.

The best screen would have around 50% selection/rejection on each bit, with no correlation between the bits. If such bits existed, then an effective screen for 1 million structures would need only 20 bits. They don't exist, because few fragments meet those criteria. The Sheffield group in the early 1970s (who often quoted Mooers as the source of the observation) looked instead at more generic fragment descriptions, rather than specific patterns. This approach was further refined by BASIC in Basel and then at CAS to become the CAS Online screens. This is likely the pinnacle of the 1970s screen development.
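The 20-bit figure follows from information theory: each independent 50/50 bit halves the candidate set, so distinguishing among N structures needs ceil(log2(N)) bits, and 2^20 is just over a million. A one-line check:

```python
import math

def ideal_screen_bits(n_structures: int) -> int:
    """Lower bound on screen size if every bit independently split
    the collection exactly in half (an ideal no real screen reaches)."""
    return math.ceil(math.log2(n_structures))
```

For 1 million structures this gives 20 bits; even for CAS's 100 million structures of 2015 it would be only 27 bits, which shows how far real screens (thousands of bits) sit from the information-theoretic ideal.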

Even then, it had about 4000 patterns assigned to 2048 bits. (Multiple rare fragments can be assigned to the same bit with little reduction in selectivity.)

A problem with a fragment dictionary is that it can't take advantage of unknown fragments. Suppose your fragment dictionary is optimized for pharmaceutical compounds, then someone does a search for plutonium. If there isn't an existing fragment definition like "transuranic element" or "unusual atom", then the screen will not be able to reject any structures. Instead, it will slowly go through the entire data set only to return no matches.

This specific problem is well known, and the reason for the "OTHER" bit of the MACCS keys. However, other types of queries may still have an excessive number of false positives during screening.

Daylight's new approach was the enumeration-based hash fingerprint. Enumerate all subgraphs of a certain size and type (traditionally all linear paths with up to 7 atoms), choose a canonical order, and use the atom and bond types in order to generate a hash value. Use this value to seed a pseudo random number generator, then generate a few values to set bits in the fingerprint; the specific number depends on the size of the subgraph. (The details of the hash function and the number of bits to set were also part of Daylight's "secret sauce.")
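Those steps can be sketched in a few dozen lines. To be clear, this is not Daylight's algorithm: the hash function (`zlib.crc32` here), the PRNG, the bits-per-path count, and the atom-symbols-only path labels are all stand-ins for the proprietary details, chosen only to make the general technique concrete.

```python
import random
import zlib

def path_fingerprint(atoms, bonds, n_bits=1024, max_len=7, bits_per_path=2):
    """Illustrative enumeration-based hash fingerprint: walk every
    linear atom path up to max_len atoms, pick a canonical direction,
    hash it, seed a PRNG with the hash, and set a few bits."""
    adj = {i: [] for i in range(len(atoms))}
    for a, b in bonds:
        adj[a].append(b)
        adj[b].append(a)

    fp = [0] * n_bits

    def walk(path):
        label = "".join(atoms[i] for i in path)
        canon = min(label, label[::-1])    # same path read either way
        seed = zlib.crc32(canon.encode())  # stand-in hash function
        rng = random.Random(seed)
        for _ in range(bits_per_path):
            fp[rng.randrange(n_bits)] = 1
        if len(path) < max_len:
            for nxt in adj[path[-1]]:
                if nxt not in path:        # linear paths: no revisits
                    walk(path + [nxt])

    for start in range(len(atoms)):
        walk([start])
    return fp
```

The screening property falls out automatically: every path in a substructure is also a path in any molecule containing it, so the substructure's bits are a subset of the molecule's bits, with no fragment dictionary needed, and unknown fragments like plutonium set their own bits for free.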

Information theory was not new in chemical information. Calvin Mooers developed superimposed coding back in the 1940s in part to improve the information density of chemical information on punched cards (and later complained about how computer scientists rediscovered it as hash tables). The Sheffield group also used information theory to guide their understanding of the screen selection problem. Feldman and Hodes in the 1970s developed screens by the full enumeration of common subgraphs of the target set and a variant of Mooers' superimposed coding.

But Daylight was able to combine information theory and computer science theory (i.e., hash tables) to develop a fingerprint generation technique which was completely new. And I do mean completely new.

Remember how I mentioned there are SMILES-like line notations in the literature, even if people never really used them? I've looked hard, and only with a large stretch of optimism can I find anything like the Daylight fingerprints before Daylight, and mostly as handwaving proposals for what might be possible. Nowadays, almost every fingerprint is based on a variation of the hash approach, rather than a fragment dictionary.

In addition, because of the higher information density, Daylight fingerprints were effective as both a substructure screen and a similarity fingerprint using only 1024 bits, instead of the 2048 bits of the CAS Online screen. This will be important in the next section.

Chemical data vs. compute power and storage size

CAS had 4 million chemical records in 1968. The only cost-effective way to store that data was on tape. Companies could and did order data tapes from CAS for use on their own corporate computers.

Software is designed for the hardware, so the early systems were built first for tape drives and then, as they became affordable, for the random-access capabilities of hard disks. A substructure search would first check against the screens to reject obvious mismatches, then for each of the remaining candidates, read the corresponding record off disk and do the full atom-by-atom match.

Apparently Derek Price came up with the "law of exponential increase" in "Science Since Babylon" (1961), which describes how science information has exponential growth. I've only heard about that second hand. Chemical data is no exception, and its exponential growth was noted, I believe, in the 1950s by Perry.

In their 1971 textbook, Lynch, et al. observed the doubling period was about 12 years. I've recalculated that number over a longer baseline, and it still holds. CAS had 4 million structures in 1968 and 100 million structures in 2015, which is a doubling every 10-11 years.

On the other hand, computers have gotten more powerful at a faster rate. For decades the effective computing power doubled every 2 years, and the amount of RAM and data storage for constant dollars has doubled even faster than that.

In retrospect it's clear that at some point it would be possible to store all of the world's chemistry data, or at least a corporate database, in memory.

Weininger's Realization

Disks are slow. Remember how the atom-by-atom search needed to pull a record off the disk? That means the computer needs to move the disk arm to the right spot and wait for the data to come by, while the CPU simply waits. If the data were in RAM then it would be 10,000x faster to fetch a randomly chosen record.

Putting it all in RAM sounds like the right solution, but in the early 1980s memory was something like $2,000 per MB while hard disk space was about $300 per MB. One million compounds with 256 bytes per connection table and 256 bytes per screen requires almost 500 MB of space. No wonder people kept the data on disk.
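A back-of-envelope check of those numbers, using the prices quoted above and 2**20 bytes per MB:

```python
N = 1_000_000
bytes_per_record = 256 + 256             # connection table + screen
total_mb = N * bytes_per_record / 2**20  # about 488 MB for the dataset

ram_cost_1980s = total_mb * 2000         # roughly $975,000 in RAM
disk_cost_1980s = total_mb * 300         # roughly $146,000 on disk

ram_cost_1990 = total_mb * 80            # roughly $39,000 in RAM
disk_cost_1990 = total_mb * 4            # roughly $2,000 on disk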

By 1990, RAM was about $80/MB while hard disk storage was $4/MB, while the amount of chemistry data had only doubled.

Dave, or at least someone at Daylight, must have realized that the two different exponential growth rates make for a game changer, and that the Daylight approach would give them a head start over the established vendors. This is explicit in the Daylight documentation for Merlin:

The basic idea behind Merlin is that data in a computer's main memory can be manipulated roughly five orders of magnitude faster than data on its disk. Throughout the history of computers, there has been a price-capacity-speed tradeoff for data storage: Large-capacity storage (tapes, drums, disks, CD-ROMS) is affordable but slow; high-speed storage ("core", RAM) is expensive but fast. Until recently, high-speed memory was so costly that even a modest amount of chemical information had to be stored on tapes or disks.

But technology has a way of overwhelming problems like this. The amount of chemical information is growing at an alarming rate, but the size of computer memories is growing even faster: at an exponential rate. In the mid-1980's it became possible for a moderately large minicomputer to fit a chemical database of several tens of thousands of structures into its memory. By the early 1990's, a desktop "workstation" could be purchased that could hold all of the known chemicals in the world (ca. 15 million structures) in its memory, along with a bit of information about each.

On the surface, in-memory operations seem like a straightforward good deal: A computer's memory is typically 105 times faster than its disk, so everything you could do on disk is 100000 times faster when you do it in memory. But these simple numbers, while impressive, don't capture the real differences between disk- and memory-based searches:

Scaling down

I pointed out earlier that SMILES and fingerprints take up less space. I estimate it was 1/3 the space of what CAS needed, which is the only comparison I've been able to figure out. That let Daylight scale up to larger data sets for a given price, but also scale down to smaller hardware.

Let's say you had 250,000 structures in the early 1990s. With the Daylight system you would need just under 128 MB of RAM, which meant you could buy a Sun 3, which maxed out at 128 MB, instead of a more expensive computer.

It still requires a lot of RAM, and that's where Yosi comes in. His background was in hardware sales, and he knew how to get a computer with a lot of RAM in it. Once the system was ready, Dave and his brother Art put it in the back of a van and went around the country to potential customers to give a demo, often to much astonishment that it could be so fast.

I think the price of RAM was the most important hardware factor to the success of Daylight, but it's not the only one. When I presented some of these ideas at Novartis in 2015, Bernhard Rohde correctly pointed out that decreasing price of hardware also meant that computers were no longer big purchase items bought and managed by IT, but something that even individual researchers could buy. That's another aspect of scaling down.

While Daylight did sell to corporate IT, their heart was in providing tools and especially toolkits to the programmer-chemists who would further develop solutions for their company.

Success and competitors

By about 1990, Daylight was a market success. I have no real idea how much profit the company made, but it was enough that Dave bought his own planes, including a fighter jet. When I was at the Daylight Summer School in 1998, the students over dinner came up with a minimum of $15 million in income, and a maximum of $2 million in expenses.

It was also a scientific success, measured by the number of people talking about SMILES and fingerprints in the literature.

I am not a market analyst so I can't give that context. I'm more of a scholar. I've been reading through the old issues of JCICS (now titled JCIM) trying to identify the breakthrough transition for Daylight. There is no bright line, but there are tantalizing hints between the lines.

In 1992 or so (I'll have to track it down), there's a review of a database vendor's product. The reviewer mentions that the vendor plans to have an in-memory database the next year. I can't help but interpret it as a competitor responding to the new Daylight system, and having to deal with customers who now understood Weininger's Realization.

Dave is a revolutionary

The decreasing price of RAM and hardware may help explain Daylight's market success, but Dave wasn't driven by trying to be the next big company. You can see that in how the company acted. Before the Oracle cartridge, they catered more towards the programmer-chemist. They sold VMS and then Unix database servers and toolkits, with somewhat primitive database clients written using the XView widget toolkit for X. I remember Dave once saying that the clients were meant as examples of what users could do, rather than as complete applications. A different sort of company would have developed Windows clients and servers, more tools for non-programmer chemists, and focused more on selling enterprise solutions to IT and middle managers.

A different sort of company would have tried to be the next MDL. Dave didn't think they were having fun at MDL, so why would he want to be like them?

Dave was driven by the idea of taking chemical information away from the "high priests" who held the secret knowledge of how to make things work. Look at SMILES - Dyson and WLN required extensive training, while SMILES could be taught to non-chemists in an hour or so. Look at fingerprints - the CAS Online screens were the result of years of research in picking out just the right fragments, based on close analysis of the types of queries people do, while the hash fingerprints can be implemented in a day. Look even at the Daylight depictions, which were well known as being ugly. But Dave liked to point out that the code, at least originally, needed only 4K. That's the sort of code a non-priest could understand, and the sort of justification a revolutionary could appreciate.

Dave is a successful revolutionary, which is rare. SMILES, SMARTS and fingerprints are part of the way we think about modern cheminformatics. Innumerable tools implement them, or variations of them.

High priests of chemical information

Revolutionary zeal is powerful. I remember hearing Dave's "high priests" talk back in 1998 and feeling empowered: yes, even as a new person in the field, cheminformatics was something I could take on, on my own.

As I've learned more about the history of the field, I've found that Dave's view is not that uncommon. In the post-war era the new field of information retrieval wanted to challenge the high priests of library science. (Unfortunately I can't find that reference now.)

Michael Lynch would have been the high priest of chemical information if there ever was one. Yet at ICCS 1988 he comments "I can recollect very little sense, in 1973, that this revolution was imminent. Georges Anderla .. noted that the future impact of very large scale integration (VLSI) was evident only to a very few at that time, so that he quite properly based his projections on the characteristics of the mainframe and minicomputer types then extant. As a result, he noted, he quite failed to see, first, that the PC would result in expertise becoming vastly more widely disseminated, with power passing out of the hands of the small priesthood of computer experts, thus tapping a huge reservoir of innovative thinking, and, second, that the workstation, rather than the dumb terminal, would become standard."

A few years ago I talked with Egon Willighagen. He is one of the CDK developers and an advocate of free and open source software for chemical information. He also used the metaphor of taking information from the high priests to the people, but in his case he meant the previous generation of closed commercial tools, like the Daylight toolkit.

Indeed, one way to think of it is that Dave the firebrand became the high priest of Daylight, and only the priests of Daylight control the secret knowledge of fingerprint generation and canonicalization.

That's why I no longer like the metaphor. Lynch and the Sheffield group published many books and papers, including multiple textbooks on how to work with chemical information. Dave and Daylight did a lot of work to disseminate the Daylight way of thinking about cheminformatics. These are not high priests hoarding occult knowledge, but humans trying to do the right thing in an imperfect world.

There's also danger in the metaphor. Firebrand revolutionaries don't tend to know history. Perhaps some of the temple should be saved? At the very least there might be bad feelings if you declare your ideas revolutionary only to find out that they are not new, and that you are talking to the previous revolutionary who proposed them.

John Barnard told me a story of Dave and Lynch meeting at ICCS in 1988, I believe. Dave explained how his fingerprint algorithm worked. Lynch commented something like "so it's like Calvin Mooers' superimposed coding"? Lynch knew his history, and he was correct - fingerprints and superimposed coding are related, though not the same. Dave did not know the history or how to answer the question.

My view has its own danger. With 75 years of chemical information history, one might feel a paralysis of not doing something out of worry that it's been done before and you just don't know about it.

Leaving Daylight

In the early 2000s Dave became less interested in chemical information. He had grown increasingly disenchanted with the ethics of how pharmaceutical companies work and do research. Those who swear allegiance to money would have no problem just making a sale and a profit, but he was more interested in ideas and people. He had plenty of money.

He was concerned about the ethics of how people used his work, though I don't know how big a role that played overall. Early on, when Daylight was still located at Pomona, they got a purchase request from the US military. He didn't want the Daylight tools to be used to make chemical weapons, and put off fulfilling the sale. The military group that wanted the software contacted him and pointed out that they were actually going to use the tools to develop defenses against chemical weapons, which helped him change his mind.

I think the Daylight Oracle cartridge marked the big transition for Daylight. The 1990s was the end of custom domain-specific databases like Thor and Merlin and the rise of object-relational databases with domain-specific extensions. Norah MacCuish (then Shemetulskis) helped show how the Daylight functionality could work as an Informix DataBlade. Oracle then developed their data cartridge, and most of the industry, including the corporate IT that used the Daylight servers, followed suit.

Most of Daylight's income came from the databases, not the toolkit. Companies would rather use widely-used technology because it's easier to find people with the skills to use it, and there can be better integration with other tools. If Daylight didn't switch, then companies would turn to competing products which, while perhaps not as elegant, were a better fit for IT needs. Daylight did switch, and the Daylight user group meetings (known as "MUG" because it started off as the Medchem User Group) started being filled by DBAs and IT support, not the geek programmer-chemists who were interested in linguistics and information theory and the other topics that excited Dave.

It didn't help that the Daylight tools were somewhat insular and static. The toolkit didn't have an officially supported way to import SD files, though there was a user-contributed package for that, which Daylight staff did maintain. Ragu Bharadwaj developed Daylight's JavaGrins chemical sketcher, which was a first step towards a more GUI-centered Daylight system, but also the only step. The RDKit, as an example of a modern chemical informatics toolkit, includes algorithms for Murcko scaffolds, matched molecular pairs, and maximum common substructure, and continues to get new ones. But Daylight didn't go that route either.

What I think it comes down to is that Dave was really interested in databases and compound collections. Daylight was also a reseller of commercial chemical databases, and Dave enjoyed looking through the different sets. He told me there was a different feel to the chemistry done in different countries, and he could spot that by looking at the data. He was less interested in the other parts of cheminformatics, and as the importance of old-school "chemical information" diminished in the newly renamed field of "chem(o)informatics", so too did his interest, as did the interest from Daylight customers.

Post-Daylight

Dave left chemical information and switched to other topics. He tried to make theobromine-free chocolate, for reasons I don't really understand, though I see now that many people buy carob as a chocolate alternative because it's theobromine-free and thus stimulant-free. He was also interested in binaural recording and hearing in general. He bought the house next door to turn it into a binaural recording studio.

He became very big into solar power, as a way to help improve the world. He bought 12 solar panels, which he called The Twelve Muses, from a German company and installed them for his houses. These were sun trackers, to maximize the power. Now, Santa Fe is at 2,300 m/7,000 ft elevation, in a dry, near-desert environment. The panels produced so much more power there than the German manufacturer expected that he managed to overload the transformers. Once that was fixed, both houses were solar powered, plus he had several electric cars and motorcycles, and could feed power back into the grid for profit. He provided a recharging station for the homebrew electric car community who wanted to drive from, say, Albuquerque to Los Alamos. (This was years before Teslas were on the market.) Because he had his own DC power source, he was also able to provide a balanced power system to his recording studio and minimize power hum.

He tried to persuade the state of New Mexico to invest more in solar power. It makes sense, and he showed that it was possible. But while he was still a revolutionary, he was not a politician, and wasn't able to make the change he wanted.

When I last saw him in late 2015, he was making a cookbook of New Mexico desserts.

Dave will always have a place in my heart.

Andrew Dalke
Trollhättan, Sweden
2 December 2016

Changes



Fragment by copy and trim #

This is part of a series of essays on how to fragment a molecular graph using RDKit. These are meant to describe the low-level steps that go into fragmentation, and the ways to think about testing. To fragment a molecule in RDKit, use FragmentOnBonds(). The essays in this series are:

Why another fragmentation method?

How do you tell if an algorithm is correct? Sometimes you can inspect the code. More often you have test cases where you know the correct answer. For complex algorithms, especially when using real-world data, testing is even harder. It's not always clear where the edge cases are, and the real world is often more complicated than you think.

Another validation solution is to write two different algorithms which accomplish the same thing, and compare them with each other. I did that in my essay about different ways to evaluate the parity of a permutation order. That cross-validation testing was good enough, because it's easy to compute all possible input orders.
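To make that style of cross-validation concrete, here's a tiny self-contained illustration (a sketch of the idea, not the code from that essay): two genuinely different parity algorithms, one counting inversions and one following cycles, checked against each other over every possible input order of a small list.

```python
from itertools import permutations

def parity_by_inversions(p):
    # Parity = number of out-of-order pairs, mod 2.
    n = len(p)
    inversions = sum(1 for i in range(n)
                       for j in range(i + 1, n) if p[i] > p[j])
    return inversions % 2

def parity_by_cycles(p):
    # Each cycle of length k takes k-1 transpositions to build,
    # so the parity is the total number of transpositions, mod 2.
    seen = set()
    swaps = 0
    for start in range(len(p)):
        if start in seen:
            continue
        length = 0
        j = start
        while j not in seen:
            seen.add(j)
            j = p[j]
            length += 1
        swaps += length - 1
    return swaps % 2

# Cross-validate: two independent implementations must always agree.
assert all(parity_by_inversions(p) == parity_by_cycles(p)
           for p in permutations(range(6)))
```

Because the input space is tiny, the cross-validation can be exhaustive, which is exactly what made that earlier test "good enough".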

In the previous essay in this series, I cross-validated that fragment_chiral() and fragment_on_bonds() give the same results. They did. That isn't surprising because they implement essentially the same algorithm. Not only that, but I looked at the FragmentOnBonds() implementation when I found out my first fragment_chiral() didn't give the same answers. (I didn't know I needed to ClearComputedProps() after I edit the structure.)

Cross-validation works better when the two algorithms are less similar. The way you think about a problem and implement a solution are connected. Two similar solutions may be the result of thinking about things the same way, and come with the same blind spots. (This could also be a design goal, if you want the new implementation to be "bug compatible" with the old.)

I've made enough hints that you likely know what I'm leading up to. RDKit's FragmentOnBonds() code currently (in mid-2016) converts directional bonds (which express the E-Z stereochemistry around double bonds) into non-directional single bonds.

>>> from rdkit import Chem
>>> mol = Chem.MolFromSmiles("F/C=C/F")
>>> fragmented_mol = Chem.FragmentOnBonds(mol, [0], dummyLabels=[(0,0)])
>>> Chem.MolToSmiles(fragmented_mol, isomericSmiles=True)
'[*]C=CF.[*]F'
when I expected the last line to be
'[*]/C=C/F.[*]F'

This may or may not be what you want. Just in case, I've submitted a patch to discuss changing it in RDKit, so your experience in the future might be different than mine in the past.

I bring it up now to show how cross-validation of similar algorithms isn't always enough to figure out if the algorithms do as you expect.

Validate through re-assembly

As a reminder, in an earlier essay I did a much stronger validation test. I took the fragments, reassembled them via SMILES syntax manipulation, and compared the canonicalized result to the canonicalized input structure. These should always match, to within limits of the underlying chemistry support.

They didn't, because my original code didn't support directional bonds. Instead, in that essay I pre-processed the input SMILES string to replace the "/" and "\" characters with a "-". That gives the equivalent chemistry without directional bonds.

I could use that validation technique again, but I want to explore more of what it means to cross-validate by using a different fragmentation implementation.

Fragment by 'copy and trim'

Dave Cosgrove, on the RDKit mailing list, described the solution he uses to cut a bond and preserve chirality in toolkits which don't provide a high-level equivalent to FragmentOnBonds(). This method only works on non-ring bonds, which isn't a problem for me as I want to fragment R-groups. (As I discovered while doing the final testing, my implementation of the method also assumes there is only one molecule in the structure.)

To make a fragment from a molecule, trim away all the atoms which aren't part of the fragment, except for the atom connected to the fragment. Convert that atom into the wildcard atom "*" by setting its atomic number, charge, and number of hydrogens to 0, and setting the chirality and aromaticity flags correctly. The image below shows the steps applied to cutting the ester off of aspirin:

steps to trim a fragment of aspirin

This method never touches the bonds connected to a chiral atom of a fragment, so chirality or other bond properties (like bond direction!) aren't accidentally removed by breaking the bond and making a new one in its place.

A downside is that I need to copy and trim twice in order to get both fragments after cutting a chain bond. I can predict now the performance won't be as fast as FragmentOnBonds().

Let's try it out.

Trim using the interactive shell

I'll use aspirin as my reference structure and cut the bond to the ester (the R-OCOCH3), which is the bond between atoms 2 and 3.

aspirin with atoms labeled by index
I want the result to be the same as using FragmentOnBonds() to cut that bond:
>>> from rdkit import Chem
>>> aspirin = Chem.MolFromSmiles("O=C(Oc1ccccc1C(=O)O)C")
>>> bond_idx = aspirin.GetBondBetweenAtoms(2, 3).GetIdx()
>>> bond_idx
2
>>> fragmented_mol = Chem.FragmentOnBonds(aspirin, [bond_idx], dummyLabels=[(0,0)])
>>> Chem.MolToSmiles(fragmented_mol, isomericSmiles=True)
'[*]OC(C)=O.[*]c1ccccc1C(=O)O'

I'll need a way to find all of the atoms which are on the ester side of the bond. I don't think there's a toolkit function to get all the atoms on one side of a bond, so I'll write one myself.

Atom 2 is the connection point on the ester to the rest of the aspirin structure. I'll start with that atom, show that it's an oxygen, and get information about its bonds:

>>> atom = aspirin.GetAtomWithIdx(2)
>>> atom.GetSymbol()
'O'
>>> [bond.GetBondType() for bond in atom.GetBonds()]
[rdkit.Chem.rdchem.BondType.SINGLE, rdkit.Chem.rdchem.BondType.SINGLE]
I'll use the bond object to get to the atom on the other side of the bond from atom 2, and report more detailed information about the atom at the other end of each bond:
>>> for bond in atom.GetBonds():
...   other_atom = bond.GetOtherAtom(atom)
...   print(other_atom.GetIdx(), bond.GetBondType(), other_atom.GetSymbol(), other_atom.GetIsAromatic())
... 
1 SINGLE C False
3 SINGLE C True
This says that atom 1 is an aliphatic carbon, and atom 3 is an aromatic carbon. This matches the structure diagram I used earlier.

I know that atom 3 is the other end of the bond which was cut, so I can stop there. What about atom 1? What is it connected to?

>>> next_atom = aspirin.GetAtomWithIdx(1)
>>> [b.GetOtherAtom(next_atom).GetIdx() for b in next_atom.GetBonds()]
[0, 2, 12]
I've already processed atom 2, so only atoms 0 and 12 are new.

What are those atoms connected to?

>>> next_atom = aspirin.GetAtomWithIdx(0)
>>> [b.GetOtherAtom(next_atom).GetIdx() for b in next_atom.GetBonds()]
[1]
>>> next_atom = aspirin.GetAtomWithIdx(12)
>>> [b.GetOtherAtom(next_atom).GetIdx() for b in next_atom.GetBonds()]
[1]
Only atom 1, which I've already seen, so I've found all of the atoms on the ester side of the bond.

Semi-automated depth-first search

There are more atoms on the other side of the bond, so I'll have to automate it somewhat. This sort of graph search is often implemented as a depth-first search (DFS) or a breadth-first search (BFS). Python lists easily work as stacks, which makes DFS slightly more natural to implement.
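Since the traversal itself is independent of RDKit, here's a minimal sketch of the same list-as-stack DFS over a plain adjacency list. The adjacency list is hand-transcribed from the labeled aspirin diagram above, with a neighbor ordering I chose to match the SMILES bond order; unlike the RDKit version developed below, this sketch includes the start atom in its result.

```python
# Aspirin ("O=C(Oc1ccccc1C(=O)O)C") as a plain adjacency list,
# using the atom indices from the labeled diagram above.
aspirin_adjacency = {
    0: [1], 1: [0, 2, 12], 2: [1, 3], 3: [2, 4, 8], 4: [3, 5],
    5: [4, 6], 6: [5, 7], 7: [6, 8], 8: [7, 3, 9], 9: [8, 10, 11],
    10: [9], 11: [9], 12: [1],
}

def connected_from(adjacency, start_id, ignore_id):
    # Find all nodes reachable from start_id without crossing ignore_id.
    seen_ids = {start_id, ignore_id}
    found = [start_id]
    stack = [start_id]          # a Python list used as a stack
    while stack:
        node = stack.pop()      # take the most recently added node
        for neighbor in adjacency[node]:
            if neighbor not in seen_ids:
                seen_ids.add(neighbor)
                found.append(neighbor)
                stack.append(neighbor)
    return found

print(connected_from(aspirin_adjacency, 3, 2))  # the ring side of bond 2-3
print(connected_from(aspirin_adjacency, 2, 3))  # the ester side of bond 2-3
```

For this graph and neighbor ordering the first call returns [3, 4, 8, 7, 9, 10, 11, 6, 5], the same atoms found in the walkthrough that follows.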

To give you an idea of what I'm about to explain, here's an animation of the aspirin depth-first search:

depth-first search of an apsirin fragment

If there is the chance of a ring (called a "cycle" in graph theory), then there are two ways to get to the same atom. To prevent processing the same atom multiple times, I'll set up a set named "seen_ids", which contains the atom indices I've seen before. Since it's the start, I've only seen the two atoms which are at the ends of the bond to cut.

>>> seen_ids = set([2, 3])
I also store a list of the atoms which are on this side of the bond (at this point, only atom 3), and a stack (a Python list) of the atoms I need to process further:
>>> atom_ids = [3]
>>> stack = [aspirin.GetAtomWithIdx(3)]
I'll start by getting the top of the stack (the last element of the list) and looking at its neighbors:
>>> atom = stack.pop()
>>> neighbor_ids = [b.GetOtherAtom(atom).GetIdx() for b in atom.GetBonds()]
>>> neighbor_ids
[2, 4, 8]
I'll need to filter out the neighbors I've already seen. I'll write a helper function for that. I'll need both the atom objects and the atom ids, so this will return two lists, one for each:
def get_atoms_to_visit(atom, seen_ids):
    neighbor_ids = []
    neighbor_atoms = []
    for bond in atom.GetBonds():
        neighbor_atom = bond.GetOtherAtom(atom)
        neighbor_id = neighbor_atom.GetIdx()
        if neighbor_id not in seen_ids:
            neighbor_ids.append(neighbor_id) 
            neighbor_atoms.append(neighbor_atom)
    
    return neighbor_ids, neighbor_atoms
and use it:
>>> atom_ids_to_visit, atoms_to_visit = get_atoms_to_visit(atom, seen_ids)
>>> atom_ids_to_visit
[4, 8]
>>> [a.GetSymbol() for a in atoms_to_visit]
['C', 'C']
These atom ids are connected to the original atom, so add them all to atom_ids.
>>> atom_ids.extend(atom_ids_to_visit) 
>>> atom_ids
[3, 4, 8]
These haven't been seen before, so add the atoms (not the atom ids) to the stack of items to process:
>>> stack.extend(atoms_to_visit)
>>> stack
[<rdkit.Chem.rdchem.Atom object at 0x111bf37b0>, <rdkit.Chem.rdchem.Atom object at 0x111bf3760>]
>>> [a.GetIdx() for a in stack]
[4, 8]
Now that they've been seen, I don't need to process them again, so add the new ids to the set of seen ids:
>>> seen_ids.update(atom_ids_to_visit)
>>> seen_ids
{8, 2, 3, 4}

That's the basic loop. The stack isn't empty, so pop the top of the stack to get a new atom object, and repeat the earlier steps:

>>> atom = stack.pop()
>>> atom.GetIdx()
8
>>> atom_ids_to_visit, atoms_to_visit = get_atoms_to_visit(atom, seen_ids)
>>> atom_ids_to_visit
[7, 9]
>>> atom_ids.extend(atom_ids_to_visit) 
>>> atom_ids 
[3, 4, 8, 7, 9]
>>> stack.extend(atoms_to_visit)
>>> [a.GetIdx() for a in stack]
[4, 7, 9]
>>> seen_ids.update(atom_ids_to_visit)
>>> seen_ids
{2, 3, 4, 7, 8, 9}
Then another loop. I'll stick it in a while loop to process the stack until it's empty, and I'll only have it print the new atom ids added to the list:
>>> while stack:
...     atom = stack.pop()
...     atom_ids_to_visit, atoms_to_visit = get_atoms_to_visit(atom, seen_ids)
...     atom_ids.extend(atom_ids_to_visit)
...     stack.extend(atoms_to_visit)
...     seen_ids.update(atom_ids_to_visit)
...     print("added atoms", atom_ids_to_visit)
... 
added atoms [10, 11]
added atoms []
added atoms []
added atoms [6]
added atoms [5]
added atoms []
added atoms []
The final set of atoms is:
>>> atom_ids
[3, 4, 8, 7, 9, 10, 11, 6, 5]

Fully automated - fragment_trim()

In this section I'll put all the pieces together to make fragment_trim(), a function which implements the trim algorithm.

Find atoms to delete

The trim algorithm needs to know which atoms to delete. That will be all of the atoms on one side of the bond, except for the atom which is at the end of the bond. (I'll turn that atom into a "*" atom.) I'll use the above code to get the atom ids to delete:

# Look for neighbor atoms of 'atom' which haven't been seen before  
def get_atoms_to_visit(atom, seen_ids):
    neighbor_ids = []
    neighbor_atoms = []
    for bond in atom.GetBonds():
        neighbor_atom = bond.GetOtherAtom(atom)
        neighbor_id = neighbor_atom.GetIdx()
        if neighbor_id not in seen_ids:
            neighbor_ids.append(neighbor_id) 
            neighbor_atoms.append(neighbor_atom)
    
    return neighbor_ids, neighbor_atoms

# Find all of the atoms connected to 'start_atom_id', except for 'ignore_atom_id'
def find_atom_ids_to_delete(mol, start_atom_id, ignore_atom_id):
    seen_ids = {start_atom_id, ignore_atom_id}
    atom_ids = [] # Will only delete the newly found atoms, not the start atom
    stack = [mol.GetAtomWithIdx(start_atom_id)]
    
    # Use a depth-first search to find the connected atoms
    while stack:
        atom = stack.pop()
        atom_ids_to_visit, atoms_to_visit = get_atoms_to_visit(atom, seen_ids)
        atom_ids.extend(atom_ids_to_visit)
        stack.extend(atoms_to_visit)
        seen_ids.update(atom_ids_to_visit)
    
    return atom_ids
and try it out:
>>> from rdkit import Chem
>>> aspirin = Chem.MolFromSmiles("O=C(Oc1ccccc1C(=O)O)C")
>>> find_atom_ids_to_delete(aspirin, 3, 2)
[1, 0, 12]
>>> find_atom_ids_to_delete(aspirin, 2, 3)
[4, 8, 7, 9, 10, 11, 6, 5]
That looks right. (I left out the other testing I did to get this right.)

Trim atoms

The trim function has two parts. The first is to turn the attachment point into the wildcard atom "*", which I'll do by removing any charges, hydrogens, chirality, or other atom properties. This atom will not be deleted. The second is to remove the other atoms on that side of the bond.

Removing atoms from a molecule is slightly tricky. The atom indices are reset after each atom is removed. (While I often use a variable name like "atom_id", the atom id is not a permanent id, but simply the current atom index.) When atom "i" is deleted, all of the atoms with a larger id "j" get the new id "j-1". That is, the larger ids are all shifted down one to fill in the gap.

This becomes a problem because I have a list of ids to delete, but as I delete them the real ids change. For example, if I delete atoms 0 and 1 from a two atom molecule, in that order, I will get an exception:

>>> rwmol = Chem.RWMol(Chem.MolFromSmiles("C#N"))
>>> rwmol.RemoveAtom(0)
>>> rwmol.RemoveAtom(1)
[13:54:02] 

****
Range Error
idx
Violation occurred on line 155 in file /Users/dalke/ftps/rdkit-Release_2016_03_1/Code/GraphMol/ROMol.cpp
Failed Expression: 1 <= 0
****

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: Range Error
	idx
	Violation occurred on line 155 in file Code/GraphMol/ROMol.cpp
	Failed Expression: 1 <= 0
	RDKIT: 2016.03.1
	BOOST: 1_60
This is because after RemoveAtom(0) the old atom with id 1 gets the id 0. As another example, if I remove atoms 0 and 1 from a three atom molecule, I'll end up with what was originally the second atom:
>>> rwmol = Chem.RWMol(Chem.MolFromSmiles("CON"))
>>> rwmol.RemoveAtom(0)
>>> rwmol.RemoveAtom(1)
>>> Chem.MolToSmiles(rwmol)
'O'
Originally the ids were [C=0, O=1, N=2]. The RemoveAtom(0) removed the C and renumbered the remaining atoms, giving [O=0, N=1]. The RemoveAtom(1) then removed the N, leaving [O=0] as the remaining atom.

While you could pay attention to the renumbering and adjust the index of the atom to delete, the far easier solution is to sort the atom ids, and delete starting from the largest id.
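The renumbering behavior is easy to simulate with a plain Python list, since pop(i) shifts every later element down by one just as RemoveAtom() does (a toy illustration, not RDKit code):

```python
atoms = ["C", "O", "N", "F"]
to_delete = [0, 2]              # want to remove the C and the N

# Deleting in ascending order removes the wrong second atom:
# after pop(0), the F has shifted down into index 2.
wrong = list(atoms)
for i in to_delete:
    wrong.pop(i)
print(wrong)                    # ['O', 'N'] -- the N survived!

# Deleting from highest index to lowest leaves the smaller
# indices untouched, so each pop(i) removes the intended atom.
right = list(atoms)
for i in sorted(to_delete, reverse=True):
    right.pop(i)
print(right)                    # ['O', 'F']
```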

Here is the function to turn the attachment atom into a wildcard and to delete the other atoms:

def trim_atoms(mol, wildcard_atom_id, atom_ids_to_delete):
    rwmol = Chem.RWMol(mol)
    # Change the attachment atom into a wildcard ("*") atom
    atom = rwmol.GetAtomWithIdx(wildcard_atom_id)
    atom.SetAtomicNum(0)
    atom.SetIsotope(0)
    atom.SetFormalCharge(0)
    atom.SetIsAromatic(False)
    atom.SetNumExplicitHs(0)
    
    # Remove the other atoms.
    # RDKit will renumber after every RemoveAtom() so
    # remove from highest to lowest atom index.
    # If not sorted, may get a "RuntimeError: Range Error"
    for atom_id in sorted(atom_ids_to_delete, reverse=True):
        rwmol.RemoveAtom(atom_id)
    
    # Don't need to ClearComputedProps() because that will
    # be done by CombineMols()
    return rwmol.GetMol()
(If this were a more general-purpose function I would need to call ClearComputedProps(), which is needed after any molecular editing in RDKit. I don't need to do this here because it will be done during CombineMols(), which is coming up.)

How did I figure out which properties needed to be changed to make a wildcard atom? I tracked them down during cross-validation tests of an earlier version of this algorithm.

I'll try out the new code:

>>> fragment1 = trim_atoms(aspirin, 2, [1, 0, 12])
>>> Chem.MolToSmiles(fragment1, isomericSmiles=True)
'[*]c1ccccc1C(=O)O'
>>> fragment2 = trim_atoms(aspirin, 3, [4, 8, 7, 9, 10, 11, 6, 5])
>>> Chem.MolToSmiles(fragment2, isomericSmiles=True)
'[*]OC(C)=O'

fragment_trim()

The high-level fragment code is straightforward, and I don't think it requires explanation:

def fragment_trim(mol, atom1, atom2):
    # Find the fragments on either side of a bond
    delete_atom_ids1 = find_atom_ids_to_delete(mol, atom1, atom2)
    fragment1 = trim_atoms(mol, atom1, delete_atom_ids1)
    
    delete_atom_ids2 = find_atom_ids_to_delete(mol, atom2, atom1)
    fragment2 = trim_atoms(mol, atom2, delete_atom_ids2)
    
    # Merge the two fragments.
    # (CombineMols() does the required ClearComputedProps())
    return Chem.CombineMols(fragment1, fragment2)

I'll check that it does what I think it should do:

>>> new_mol = fragment_trim(aspirin, 2, 3)
>>> Chem.MolToSmiles(new_mol)
'[*]OC(C)=O.[*]c1ccccc1C(=O)O'
This matches the FragmentOnBonds() output at the start of this essay.

Testing

I used the same cross-validation method I did in my previous essay. I parsed structures from ChEMBL and checked that fragment_trim() produces the same results as FragmentOnBonds(). It doesn't. The first failure mode surprised me:

fragment_trim() only works with a single connected structure

Here's one of the failure reports:

Mismatch: record: 61 id: CHEMBL1203109 smiles: Cl.CNC(=O)c1cc(C(O)CNC(C)CCc2ccc3c(c2)OCO3)ccc1O cut: 1 2
   smiles_trim: Cl.Cl.[*]C.[*]NC(=O)c1cc(C(O)CNC(C)CCc2ccc3c(c2)OCO3)ccc1O
smiles_on_bond: Cl.[*]C.[*]NC(=O)c1cc(C(O)CNC(C)CCc2ccc3c(c2)OCO3)ccc1O
Somehow the "Cl" appears twice. After thinking about it I realized that my implementation of the copy and trim algorithm assumes that the input structure is connected, that is, that it contains only a single molecule. Each copy contains a chlorine atom, so when I CombineMols() I end up with two chlorine atoms.

I could fix my implementation to handle this, but as I'm only interested in cross-validation, the easier solution is to never process a structure with multiple molecules. I decided that if the input SMILES contains multiple molecules (that is, if the SMILES contains the dot disconnection symbol ".") then I would choose the component which has the most characters, using the following:

if "." in smiles:
    # The fragment_trim() method only works on a connected molecule.
    # Arbitrarily pick the component with the longest SMILES string.
    smiles = max(smiles.split("."), key=len)

Directional bonds

After 1000 records and 31450 tests there were 362 mismatches. All of them appeared to be related to directional bonds, like this mismatch report:

Mismatch: record: 124 id: CHEMBL445177 smiles: CCCCCC/C=C\CCCCCCCc1cccc(O)c1C(=O)O cut: 7 8
   smiles_trim: [*]/C=C\CCCCCC.[*]CCCCCCCc1cccc(O)c1C(=O)O
smiles_on_bond: [*]C=CCCCCCC.[*]CCCCCCCc1cccc(O)c1C(=O)O
I knew this would happen coming into this essay, but it's still nice to see. It demonstrates that fragment_trim() is a better way to cross-validate FragmentOnBonds() than my earlier fragment_chiral() algorithm.

It's possible that other mismatches are hidden in the hundreds of reports, so I removed directional bonds from the SMILES and processed again:

if "/" in smiles:
    smiles = smiles.replace("/", "")
if "\\" in smiles:
    smiles = smiles.replace("\\", "")
That testing showed I forgot to include "atom.SetIsotope(0)" when I converted the attachment atom into a wildcard atom. Without it, the algorithm turned an "[18F]" into an "[18*]". It surprised me because I copied it from working code I made for an earlier version of the algorithm. Looking into it, I realized I left that line out during the copy&paste. This is why regression tests using diverse real-world data are important.

I stopped the testing after 100,000 records. Here is the final status line:
Processed 100000 records, 100000 molecules, 1120618 tests, 0 mismatches.  T_trim: 2846.79 T_on_bond 258.18
While the trim code is slower than the built-in function, the point of this exercise isn't to find the fastest implementation but to have a better way to cross-validate the original code, which it has done.

The final code

Here is the final code:

from __future__ import print_function

import datetime
import time

from rdkit import Chem

# Look for neighbor atoms of 'atom' which haven't been seen before  
def get_atoms_to_visit(atom, seen_ids):
    neighbor_ids = []
    neighbor_atoms = []
    for bond in atom.GetBonds():
        neighbor_atom = bond.GetOtherAtom(atom)
        neighbor_id = neighbor_atom.GetIdx()
        if neighbor_id not in seen_ids:
            neighbor_ids.append(neighbor_id) 
            neighbor_atoms.append(neighbor_atom)
    
    return neighbor_ids, neighbor_atoms

# Find all of the atoms connected to 'start_atom_id', except for 'ignore_atom_id'
def find_atom_ids_to_delete(mol, start_atom_id, ignore_atom_id):
    seen_ids = {ignore_atom_id, start_atom_id}
    atom_ids = [] # Will only delete the newly found atoms, not the start atom
    stack = [mol.GetAtomWithIdx(start_atom_id)]

    # Use a depth-first search to find the connected atoms
    while stack:
        atom = stack.pop()
        atom_ids_to_visit, atoms_to_visit = get_atoms_to_visit(atom, seen_ids)
        atom_ids.extend(atom_ids_to_visit)
        stack.extend(atoms_to_visit)
        seen_ids.update(atom_ids_to_visit)
    
    return atom_ids

def trim_atoms(mol, wildcard_atom_id, atom_ids_to_delete):
    rwmol = Chem.RWMol(mol)
    # Change the attachment atom into a wildcard ("*") atom
    atom = rwmol.GetAtomWithIdx(wildcard_atom_id)
    atom.SetAtomicNum(0)
    atom.SetIsotope(0)
    atom.SetFormalCharge(0)
    atom.SetIsAromatic(False)
    atom.SetNumExplicitHs(0)
    
    # Remove the other atoms.
    # RDKit will renumber after every RemoveAtom() so
    # remove from highest to lowest atom index.
    # If not sorted, may get a "RuntimeError: Range Error"
    for atom_id in sorted(atom_ids_to_delete, reverse=True):
        rwmol.RemoveAtom(atom_id)
    
    # Don't need to ClearComputedProps() because that will
    # be done by CombineMols()
    return rwmol.GetMol()
    
def fragment_trim(mol, atom1, atom2):
    # Find the fragments on either side of a bond
    delete_atom_ids1 = find_atom_ids_to_delete(mol, atom1, atom2)
    fragment1 = trim_atoms(mol, atom1, delete_atom_ids1)
    
    delete_atom_ids2 = find_atom_ids_to_delete(mol, atom2, atom1)
    fragment2 = trim_atoms(mol, atom2, delete_atom_ids2)

    # Merge the two fragments.
    # (CombineMols() does the required ClearComputedProps())
    return Chem.CombineMols(fragment1, fragment2)

####### Cross-validation code

def fragment_on_bond(mol, atom1, atom2):
    bond = mol.GetBondBetweenAtoms(atom1, atom2)
    return Chem.FragmentOnBonds(mol, [bond.GetIdx()], dummyLabels=[(0, 0)])


def read_records(filename):
    with open(filename) as infile:
        for recno, line in enumerate(infile):
            smiles, id = line.split()
            if "*" in smiles:
                continue
            yield recno, id, smiles
            
def cross_validate():
    # Match a single bond not in a ring
    BOND_SMARTS = "[!#0;!#1]-!@[!#0;!#1]"
    single_bond_pat = Chem.MolFromSmarts(BOND_SMARTS)
    
    num_mols = num_tests = num_mismatches = 0
    time_trim = time_on_bond = 0.0

    # Helper function to print status information. Since
    # the function is defined inside of another function,
    # it has access to variables in the outer function.
    def print_status():
        print("Processed {} records, {} molecules, {} tests, {} mismatches."
              "  T_trim: {:.2f} T_on_bond {:.2f}"
              .format(recno, num_mols, num_tests, num_mismatches,
                      time_trim, time_on_bond))
    
    filename = "/Users/dalke/databases/chembl_20_rdkit.smi"
    start_time = datetime.datetime.now()
    for recno, id, smiles in read_records(filename):
        if recno % 100 == 0:
            print_status()

        if "." in smiles:
            # The fragment_trim() method only works on a connected molecule.
            # Arbitrarily pick the component with the longest SMILES string.
            smiles = max(smiles.split("."), key=len)
            
        ## Remove directional bonds to see if there are mismatches
        ## which are not caused by directional bonds.
        #if "/" in smiles:
        #    smiles = smiles.replace("/", "")
        #if "\\" in smiles:
        #    smiles = smiles.replace("\\", "")
        
        mol = Chem.MolFromSmiles(smiles)
        if mol is None:
            continue
        num_mols += 1

        # Find all of the single bonds
        matches = mol.GetSubstructMatches(single_bond_pat)

        for a1, a2 in matches:
            # For each pair of atom indices, split on that bond
            # and determine the canonicalized fragmented molecule.
            # Do it for both fragment_trim() ...
            t1 = time.time()
            mol_trim = fragment_trim(mol, a1, a2)
            t2 = time.time()
            time_trim += t2-t1
            smiles_trim = Chem.MolToSmiles(mol_trim, isomericSmiles=True)

            # ... and for fragment_on_bond()
            t1 = time.time()
            mol_on_bond = fragment_on_bond(mol, a1, a2)
            t2 = time.time()
            time_on_bond += t2-t1
            smiles_on_bond = Chem.MolToSmiles(mol_on_bond, isomericSmiles=True)

            # Report any mismatches and keep on going.
            num_tests += 1
            if smiles_trim != smiles_on_bond:
                print("Mismatch: record: {} id: {} smiles: {} cut: {} {}".format(
                      recno, id, smiles, a1, a2))
                print("   smiles_trim:", smiles_trim)
                print("smiles_on_bond:", smiles_on_bond)
                num_mismatches += 1
    
    end_time = datetime.datetime.now()
    elapsed_time = str(end_time - start_time)
    elapsed_time = elapsed_time.partition(".")[0]  # don't display subseconds
    
    print("Done.")
    print_status()
    print("elapsed time:", elapsed_time)

if __name__ == "__main__":
    cross_validate()
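A closing note on find_atom_ids_to_delete(): nothing in its traversal is RDKit-specific. Here is the same depth-first search sketched over a plain adjacency dictionary (a hypothetical five-atom graph of my own invention), which may make the bookkeeping easier to follow:

```python
# Stand-alone sketch of the same depth-first search, using a plain
# adjacency dictionary instead of an RDKit molecule. The graph is
# hypothetical: bonds 0-1, 1-2, 1-3, 3-4.
def find_ids_to_delete(adjacency, start_id, ignore_id):
    seen_ids = {ignore_id, start_id}
    found_ids = []        # newly found atoms only, not the start atom
    stack = [start_id]
    while stack:
        node = stack.pop()
        for neighbor_id in adjacency[node]:
            if neighbor_id not in seen_ids:
                seen_ids.add(neighbor_id)
                found_ids.append(neighbor_id)
                stack.append(neighbor_id)
    return found_ids

adjacency = {0: [1], 1: [0, 2, 3], 2: [1], 3: [1, 4], 4: [3]}
# Everything reachable from atom 1 without crossing the 1-0 bond:
print(sorted(find_ids_to_delete(adjacency, 1, 0)))  # [2, 3, 4]
```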


Software mentioned in ChEMBL SD records #

When I work with real-world data sets, I usually start with PubChem before going on to ChEMBL. Why? The PubChem data is all generated by one chemistry toolkit - I'm pretty sure it's OEChem - while ChEMBL data comes from many sources.

To get a sense of the diversity, I processed the ChEMBL 21 release to extract the second line of each record. The ctfile format specification says that if this line is non-blank, the 8 characters starting at the third position should contain the program name. The documentation includes a description of the fields:

IIPPPPPPPPMMDDYYHHmmddSS…
where "II" holds the user's initials, "PPPPPPPP" holds the program name, and the rest is the date/time and other optional fields. Here are a few examples:
  SciTegic12231509382D
  Mrv0541 05211314572D          
          08180812482D 1   1.00000     0.00000     0
  Symyx   11300911332D 1   1.00000     0.00000     0
The third program wishes to remain anonymous.
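In Python terms, the program name is just the 8-character slice line[2:10] of that second line. A quick check, treating the example lines above as verbatim strings:

```python
# Slice the program-name field (8 characters starting at zero-based
# position 2) out of each example header line.
lines = [
    "  SciTegic12231509382D",
    "  Mrv0541 05211314572D          ",
    "          08180812482D 1   1.00000     0.00000     0",
    "  Symyx   11300911332D 1   1.00000     0.00000     0",
]
for line in lines:
    print(repr(line[2:10]))
# 'SciTegic'
# 'Mrv0541 '
# '        '
# 'Symyx   '
```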

Extract program names from the ChEMBL 21 SDF

I'll write a program to extract those program names and count how many times each one occurs. I don't need a general-purpose SD file reader because ChEMBL uses a subset of the SD format. For example, there are only two lines in each record which start with "CHEMBL": the title line (the first line of the record) and the data line after the "chembl_id" tag.

My code reads the file line by line. The first time it sees a "CHEMBL" line it's the title line, so the following line (the second line of the record) contains the data I want. Then, when it sees "> <chembl_id>", it knows to skip the following line, which also starts with "CHEMBL".

There are two oddities here. First, the gzip reader returns byte strings. I decided to do the pattern matching on the byte strings to avoid the overhead of converting everything to Unicode when I only need a few characters from the file. Second, a Python file object is its own iterator, so I can use "infile" both in the for-loop and in the loop body itself.

import gzip
from collections import Counter

counter = Counter()
with gzip.open("/Users/dalke/databases/chembl_21.sdf.gz", "rb") as infile:
    for line in infile:
        # The line after a title line (which starts with 'CHEMBL')
        # contains the program name in bytes 2-9.
        if line[:6] == b"CHEMBL":
            next_line = next(infile)
            program = next_line[2:10]
            # Print each program name the first time it's seen
            if program not in counter:
                print("New:", repr(str(program, "ascii")))
            counter[program] += 1

        # The data line after the chembl_id tag also starts
        # with "CHEMBL", so skip it.
        if line[:13] == b"> <chembl_id>":
            ignore = next(infile)

print("Done. Here are the counts for each program seen.")
for name, count in counter.most_common():
    print(repr(str(name, "ascii")), count)
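To see the parsing logic without the full ChEMBL download, here is a condensed version of the same loop run over a tiny in-memory example. The records are fabricated and trimmed to only the lines the parser examines:

```python
import io
from collections import Counter

# Two fabricated, heavily trimmed records containing only the lines
# the parser cares about: the title line, the line after it, and
# the chembl_id tag/data pair.
fake_sdf = (
    b"CHEMBL25\n"
    b"  SciTegic12231509382D\n"
    b"> <chembl_id>\n"
    b"CHEMBL25\n"
    b"CHEMBL1\n"
    b"  Mrv0541 05211314572D\n"
    b"> <chembl_id>\n"
    b"CHEMBL1\n"
)

counter = Counter()
infile = io.BytesIO(fake_sdf)
for line in infile:
    if line[:6] == b"CHEMBL":
        counter[next(infile)[2:10]] += 1   # program-name field
    if line[:13] == b"> <chembl_id>":
        next(infile)                       # skip the CHEMBL data line
print(counter)
```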

Program counts

Here is the table of counts:
Program name    count
'SciTegic'      1105617
' '              326445
'CDK 9'           69145
'Mrv0541 '        30962
''                24531
'CDK 6'           10805
'CDK 1'            8642
'CDK 5'            4771
'CDK 8'            2209
'-ISIS- '           281
'Symyx '            171
'CDK 7'             144
'-OEChem-'           96
'Marvin '            61
'Mrv0540 '           13
' RDKit'              3
'CDK 3'               1
The several different CDK entries are not version numbers; they appear because CDK didn't format the line correctly. The specification states that the first few fields of a non-blank line are supposed to be:

IIPPPPPPPPMMDDYYHHmmddSS…
while the CDK lines use:
  CDK    7/28/10,10:58
  CDK    8/10/10,12:22
  CDK    9/16/09,9:40
  CDK    10/7/09,10:42
  CDK    11/9/10,11:20
where the date starts one byte too early. The date also isn't in the specified format.
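Assuming the CDK lines above are verbatim, it's easy to see where the many different CDK entries in the count table come from: the program-name field is line[2:10], so the too-early date pushes the first digit of the month into the field:

```python
# The program-name field is the slice line[2:10]. Because CDK wrote
# the date one byte too early, the month's first digit leaks in.
for line in ["  CDK    7/28/10,10:58",
             "  CDK    9/16/09,9:40",
             "  CDK    10/7/09,10:42"]:
    print(repr(line[2:10]))
# 'CDK    7'
# 'CDK    9'
# 'CDK    1'
```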

Further examination shows these were only generated in 2009 and 2010. The current CDK implementation is correct, and a 'git annotate' suggests it was fixed in 2010.

I don't think anyone uses that line for anything, and I don't see the point of changing it, so it's not worthwhile to even ask ChEMBL to fix these old records.



Use FragmentOnBonds to fragment a molecule in RDKit #

This is part of a continuing series on how to fragment a molecular graph. The earlier essays were mostly pedagogical: they describe the low-level details of how to cut a bond to fragment a molecule into two parts while still maintaining correct chirality.

If you are writing production code, you should almost certainly use the built-in RDKit function FragmentOnBonds() to fragment a molecule, and not use any of the code from those essays.

FragmentOnBonds

FragmentOnBonds() will fragment a molecule given a list of bond identifiers. By default it attaches a new "dummy" atom (what I've been calling a wildcard atom) to the atoms at the ends of each broken bond. It also sets the isotope label of each newly created dummy atom to the index of the atom it replaces, that is, the atom at the other end of the broken bond.

>>> from rdkit import Chem
>>> vanillin = Chem.MolFromSmiles("c1(C=O)cc(OC)c(O)cc1")
>>> bond = vanillin.GetBondBetweenAtoms(4, 5)
>>> bond
<rdkit.Chem.rdchem.Bond object at 0x102bec8a0>
>>> bond.GetIdx()
4
>>> new_mol = Chem.FragmentOnBonds(vanillin, [bond.GetIdx()])
>>> Chem.MolToSmiles(new_mol, isomericSmiles=True)
'[4*]OC.[5*]c1cc(C=O)ccc1O'
I don't want a label, so I'll specify 0 for the dummy labels. (An unlabeled atom has an isotope of zero.)
>>> new_mol = Chem.FragmentOnBonds(vanillin, [bond.GetIdx()], dummyLabels=[(0, 0)])
>>> Chem.MolToSmiles(new_mol, isomericSmiles=True)
'[*]OC.[*]c1cc(C=O)ccc1O'

I'll turn this into a function that I can use for testing:

def fragment_on_bond(mol, a1, a2):
    bond = mol.GetBondBetweenAtoms(a1, a2)
    return Chem.FragmentOnBonds(mol, [bond.GetIdx()], addDummies=True, dummyLabels=[(0, 0)])
That's certainly much easier than fragment_chiral()!

FragmentOnBonds implementation

The RDKit source code is available, so we can look at how this built-in function is implemented. A search for "FragmentOnBonds" finds:

Code/GraphMol/Wrap/MolOps.cpp
1631:      python::def("FragmentOnBonds", fragmentOnBondsHelper,
plus some investigation shows that the actual C++ function has the name "fragmentOnBonds". The RDKit convention is to use an initial uppercase letter for Python functions and an initial lowercase letter for C++ functions.

A second search, this time for "fragmentOnBonds", has a few more hits, including this very suggestive one:

Code/GraphMol/ChemTransforms/MolFragmenter.cpp
320:        ROMol *nm=fragmentOnBonds(mol,fragmentHere,addDummies,dummyLabelsHere,bondTypesHere,lCutsPerAtom);
329:    ROMol *fragmentOnBonds(const ROMol &mol,const std::vector<unsigned int> &bondIndices,
That second result is the start of the fragmentOnBonds() function definition. Here's the key part, with commentary.

The function can fragment multiple bonds. It iterates through the bonds in order. For each one it gets the bond type and the begin and end atoms, and if requested it stores information about the number of cuts per atom.

  for (unsigned int i = 0; i < bondsToRemove.size(); ++i) {
    const Bond *bond = bondsToRemove[i];
    unsigned int bidx = bond->getBeginAtomIdx();
    unsigned int eidx = bond->getEndAtomIdx();
    Bond::BondType bT = bond->getBondType();
    res->removeBond(bidx, eidx);
    if (nCutsPerAtom) {
      (*nCutsPerAtom)[bidx] += 1;
      (*nCutsPerAtom)[eidx] += 1;
    }
FragmentOnBonds() by default will add "dummy" atoms, though this can be disabled. If enabled, it creates two new atoms with atomic number 0. By default each dummy atom's isotope is set to the index of the atom it replaces, but the values can instead be specified with the dummyLabels parameter:
    if (addDummies) {
      Atom *at1, *at2;
      at1 = new Atom(0);
      at2 = new Atom(0);
      if (dummyLabels) {
        at1->setIsotope((*dummyLabels)[i].first);
        at2->setIsotope((*dummyLabels)[i].second);
      } else {
        at1->setIsotope(bidx);
        at2->setIsotope(eidx);
      }
Next, make the bonds from the old end atoms to the new wildcard atoms. By default the bond type for each new bond is the same as that of the broken bond (determined earlier), though there's an option to specify an alternate bond type.
      unsigned int idx1 = res->addAtom(at1, false, true);
      if (bondTypes) bT = (*bondTypes)[i];
      res->addBond(eidx, at1->getIdx(), bT);
      unsigned int idx2 = res->addAtom(at2, false, true);
      res->addBond(bidx, at2->getIdx(), bT);

The last part does what the comment says it does; if the atom has a tetrahedral chirality then check if the permutation order has changed and invert the chirality if needed:
      // figure out if we need to change the stereo tags on the atoms:
      if (mol.getAtomWithIdx(bidx)->getChiralTag() ==
              Atom::CHI_TETRAHEDRAL_CCW ||
          mol.getAtomWithIdx(bidx)->getChiralTag() ==
              Atom::CHI_TETRAHEDRAL_CW) {
        checkChiralityPostMove(mol, mol.getAtomWithIdx(bidx),
                               res->getAtomWithIdx(bidx),
                               mol.getBondBetweenAtoms(bidx, eidx));
      }
      if (mol.getAtomWithIdx(eidx)->getChiralTag() ==
              Atom::CHI_TETRAHEDRAL_CCW ||
          mol.getAtomWithIdx(eidx)->getChiralTag() ==
              Atom::CHI_TETRAHEDRAL_CW) {
        checkChiralityPostMove(mol, mol.getAtomWithIdx(eidx),
                               res->getAtomWithIdx(eidx),
                               mol.getBondBetweenAtoms(bidx, eidx));
}
The code to check if the chirality has changed iterates through all of the bonds of the old atom ("oAt") to make a list ("newOrder") containing the identifiers of those bonds. (The "OEDGE_ITER" is part of the adapter for the Boost Graph Library; it's an "out_edge_iterator".)
void checkChiralityPostMove(const ROMol &mol, const Atom *oAt, Atom *nAt,
                            const Bond *bond) {
  INT_LIST newOrder;
  ROMol::OEDGE_ITER beg, end;
  boost::tie(beg, end) = mol.getAtomBonds(oAt);
  while (beg != end) {
    const BOND_SPTR obond = mol[*beg];
    ++beg;
The code knows that the RemoveBond()/AddBond() calls caused the new bond to be placed at the end of the bond list, so when it sees the old bond it simply skips it. Once it has gone through the old bonds, it appends the new bond's id to the end of the list. Interestingly, this isn't knowledge I can depend on in my own code, because it's an implementation detail which might change in the future. That's why I needed to substitute the new bond id at this point in my code.
    if (obond.get() == bond) {
      continue;
    }
    newOrder.push_back(obond->getIdx());
  }
  newOrder.push_back(bond->getIdx());
The last bit computes the permutation order, or as the code calls it, the "perturbation order"; I believe that name is a typo. Digging down a bit, it uses a selection sort to count the number of swaps needed to make the "oAt" atom's bond list and the "newOrder" list match. The result, modulo two, gives the parity. If the parity is odd, invert the chirality:
  unsigned int nSwaps = oAt->getPerturbationOrder(newOrder);
  if (nSwaps % 2) nAt->invertChirality();
}
This was a bit different than my approach, which compared the parity before and after, but it gives the same results.

Jumping back to the fragmentOnBonds() function, the last bit of the function is:

  res->clearComputedProps();
  return static_cast<ROMol *>(res);
This is needed because some of the computed properties may depend on bond patterns which are no longer present. Note that it does not use SanitizeMol(), which is something I investigated in the previous essay.

While the details are a bit different between my fragment_chiral() and FragmentOnBonds(), the general approach is the same.

A possible optimization

I mentioned that checkChiralityPostMove() uses insider knowledge: it knows that the deleted bond is removed from the bond list and the new bond added at the end. It uses this information to build the "newOrder" list with bond indices in the correct final order, then determines the parity difference between that list and the atom's original bond ordering.

There's an easier way. Count the number of bonds between where the deleted bond was and the end of the list, and take the result modulo 2. For example, if the transformation were:

initial neighbors = [1, 6, 5, 4]
   - remove bond to atom 5
neighbors after delete = [1, 6, 4]
   - add bond to new atom 9
final neighbors = [1, 6, 4, 9]
then the bond to atom 5 was in the third position, with one bond (to atom 4) between it and the end of the list. The parity is therefore odd. I added this suggestion to the RDKit issue tracker as #1033.

The proof is simple. Assume there are n elements after the bond to delete. Then n swaps are needed to bring it to the end, where it can be replaced with the connection to the new atom. Hence the parity change is n % 2.
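The claim is easy to spot-check in Python. The sketch below (my own helper names, not RDKit's) compares the swap-count parity against the count-to-the-end rule for every possible removal position in the example neighbor list:

```python
def swap_parity(a, b):
    # Parity of the permutation that rearranges list a into list b,
    # counted with cycle-following swaps.
    perm = [a.index(x) for x in b]
    swaps = 0
    for i in range(len(perm)):
        while perm[i] != i:
            j = perm[i]
            perm[i], perm[j] = perm[j], perm[i]
            swaps += 1
    return swaps % 2

neighbors = [1, 6, 5, 4]
new_id = 9
for i, removed in enumerate(neighbors):
    # Substitute the new atom in the removed bond's old position...
    in_place = [new_id if x == removed else x for x in neighbors]
    # ...versus deleting the old bond and appending the new one.
    at_end = [x for x in neighbors if x != removed] + [new_id]
    n_after = len(neighbors) - 1 - i   # bonds between old position and end
    assert swap_parity(in_place, at_end) == n_after % 2
print("count-to-the-end rule matches swap parity")
```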

Not using SanitizeMol for this investigation

In the earlier essay in this series, "Fragment chiral molecules in RDKit", I investigated some failure cases where the chirality wasn't preserved. Greg Landrum pointed out: "after you break one or more bonds, you really, really should re-sanitize the molecule (or at least call ClearComputedProps())". I found that if I added SanitizeMol() then some of the error cases seen with only ClearComputedProps() were no longer errors. This may be because of a bug in FastFindRings(). Greg, in the second link, writes: "There is a workaround until this is fixed; explicitly sanitize your molecules or just call GetSymmSSSR() before you generate SMILES."

So that's what I did. However, even without SanitizeMol() there were only 232 failures out of 1.4M+ input structures, and adding SanitizeMol() didn't fix all of the remaining errors. On the other hand, the performance overhead of SanitizeMol() is (as expected) larger than that of changing a few bonds around. In one timing test, processing 10,000 structures with fragment_chiral() took 107 seconds without SanitizeMol() and 255 seconds with it, so the overall processing takes about twice as long. (There is an additional constant overhead to parse the SMILES and canonicalize the fragmented molecule, which I did not include.)

For this pedagogical investigation, I don't think the performance impact is worth the slightly lower error rate, so I will not use SanitizeMol() in this essay. That means I can use my new fragment_on_bond() function as-is, and I'll modify fragment_chiral() so it calls ClearComputedProps() instead of SanitizeMol().

In any case, the explicit call to SanitizeMol() is a workaround. I expect ClearComputedProps() will suffice in the RDKit of the future world that you, the reader, inhabit.

Cross-validation

In the earlier essay I tested the fragment_chiral() function by attempting to re-connect the fragments. If the canonicalized result doesn't match the canonicalized input structure then there's a problem. (And there were a small number of problems, at the edge of RDKit's ability to handle chemistry.)

That test ignored what SMILES calls "directional bonds", that is, the "/" and "\" in "F/C=C\F", which is how SMILES denotes double-bond stereochemistry (what a chemist would know as E-Z notation). The tests ignored those bonds for the simple reason that FragmentOnBonds() doesn't preserve that information when it cuts the bond. In a future essay I'll show how to implement that ability.

Instead of fully testing fragment_on_bond(), I'll do a weaker test where I compare its results to fragment_chiral(). I say it's weaker because what it shows is that they are likely bug-compatible, which is different from being correct. It's also weaker because the two implementations are very similar; a stronger test would use a different fragmentation method altogether.

I think of this as a cross-validation test. As Wikipedia helpfully points out, that term is overloaded: statistics uses a different definition than chemistry. The chemistry usage is very similar to the one in verification and validation, so I think I'm right to use the term.

The test code is not interesting, so I'll put it at the end and not discuss it in detail. The results weren't interesting either. There were no mismatches. I just had to wait for a few hours while I let it process the 1.4M+ structures from ChEMBL 20.

Here are the last few lines of output:

Processed 1455762 records, 1455763 molecules, 16017700 tests, 0 mismatches.  T_chiral: 5912.26 T_on_bond 3554.22
elapsed time: 8:30:15
The two implementations gave identical results, and time needed for fragment_chiral() is about 1.7x that needed for fragment_on_bond(). The fragment_on_bond() code is shorter, faster, and easier to understand, so you should always use it instead of fragment_chiral().

The code

The fragment_chiral() function here is a copy of the one from the previous essay, except I replaced the SanitizeMol(new_mol) with new_mol.ClearComputedProps() for 2x the performance, at the expense of a few incorrect structures. (The SanitizeMol() is a workaround for a current bug report; it shouldn't be needed in the long term.) I put the code here so the test code is self-contained, except of course you'll need your own dataset.

from __future__ import print_function
import time
import datetime
from rdkit import Chem

# Two different methods to cut a bond and fragment an RDKit molecule.
#   - fragment_on_bond() uses RDKit's FragmentOnBonds()
#   - fragment_chiral() uses lower-level API calls

# This also includes a cross-validation function which checks that the
# two methods produce the same fragments as output.
# The final two lines when I evaluated ChEMBL20 were:
#
# Processed 1455762 records, 1455763 molecules, 16017700 tests, 0 mismatches.  T_chiral: 5912.26 T_on_bond 3554.22
# elapsed time: 8:30:15
#
# The timings show that fragment_on_bond() is about 1.7x the performance of fragment_chiral().


#### Fragment using a high-level RDKit API function

def fragment_on_bond(mol, atom1, atom2):
    bond = mol.GetBondBetweenAtoms(atom1, atom2)
    new_mol = Chem.FragmentOnBonds(mol, [bond.GetIdx()], dummyLabels=[(0, 0)])
    # FragmentOnBonds() calls ClearComputedProps() at the end.  There
    # is a current bug report where, as a downstream effect, that may
    # cause some chiralities to change, most notably on some
    # bridgeheads. A workaround for now is to call SanitizeMol(),
    # though that ends up tripling the time. I'll stay compatible
    # with FragmentOnBonds() and not call it.
    #Chem.SanitizeMol(new_mol)

    return new_mol


#### Fragment using low-level RDKit API functions

# See http://www.dalkescientific.com/writings/diary/archive/2016/08/14/fragment_chiral_molecules.html
# for implementation discussion

CHI_TETRAHEDRAL_CW = Chem.ChiralType.CHI_TETRAHEDRAL_CW
CHI_TETRAHEDRAL_CCW = Chem.ChiralType.CHI_TETRAHEDRAL_CCW

def parity_shell(values):
    # Simple Shell sort; while O(N^2), we only deal with at most 4 values
    # See http://www.dalkescientific.com/writings/diary/archive/2016/08/15/fragment_parity_calculation.html
    # for faster versions for fixed-size lists.
    values = list(values)
    N = len(values)
    num_swaps = 0
    for i in range(N-1):
        for j in range(i+1, N):
            if values[i] > values[j]:
                values[i], values[j] = values[j], values[i]
                num_swaps += 1
    return num_swaps % 2


def get_bond_parity(mol, atom_id):
    """Compute the parity of the atom's bond permutation

    Return None if it does not have tetrahedral chirality,
    0 for even parity, or 1 for odd parity.
    """
    atom_obj = mol.GetAtomWithIdx(atom_id)
    
    # Return None unless it has tetrahedral chirality
    chiral_tag = atom_obj.GetChiralTag()
    if chiral_tag not in (CHI_TETRAHEDRAL_CW, CHI_TETRAHEDRAL_CCW):
        return None
    
    # Get the ids of the atoms it's bonded to.
    other_atom_ids = [bond.GetOtherAtomIdx(atom_id) for bond in atom_obj.GetBonds()]
    
    # Use those ids to determine the parity
    return parity_shell(other_atom_ids)


def set_bond_parity(mol, atom_id, old_parity, old_other_atom_id, new_other_atom_id):
    """Compute the new bond parity and flip chirality if needed to match the old parity"""
    
    atom_obj = mol.GetAtomWithIdx(atom_id)
    # Get the ids of the atoms it's bonded to.
    other_atom_ids = [bond.GetOtherAtomIdx(atom_id) for bond in atom_obj.GetBonds()]
    
    # Replace id from the new wildcard atom with the id of the original atom
    i = other_atom_ids.index(new_other_atom_id)
    other_atom_ids[i] = old_other_atom_id
    
    # Use those ids to determine the parity
    new_parity = parity_shell(other_atom_ids)
    if old_parity != new_parity:
        # If the parity has changed, invert the chirality
        atom_obj.InvertChirality()

def fragment_chiral(mol, atom1, atom2):
    """Cut the bond between atom1 and atom2 and replace with connections to wildcard atoms

    Return the fragmented structure as a new molecule.
    """
    rwmol = Chem.RWMol(mol)
    
    atom1_parity = get_bond_parity(mol, atom1)
    atom2_parity = get_bond_parity(mol, atom2)
    
    rwmol.RemoveBond(atom1, atom2)
    wildcard1 = rwmol.AddAtom(Chem.Atom(0))
    wildcard2 = rwmol.AddAtom(Chem.Atom(0))
    new_bond1 = rwmol.AddBond(atom1, wildcard1, Chem.BondType.SINGLE)
    new_bond2 = rwmol.AddBond(atom2, wildcard2, Chem.BondType.SINGLE)
    
    if atom1_parity is not None:
        set_bond_parity(rwmol, atom1, atom1_parity, atom2, wildcard1)
    if atom2_parity is not None:
        set_bond_parity(rwmol, atom2, atom2_parity, atom1, wildcard2)
    
    # After breaking bonds, I really should re-sanitize. See
    # https://github.com/rdkit/rdkit/issues/1022#issuecomment-239355482
    # or at least call ClearComputedProps().  I found that
    # SanitizeMol() improves chiral carbon bridgeheads handling,
    # though using it doubles the execution time. I'll stick with
    # ClearComputedProps(), which matches what FragmentOnBonds() does
    new_mol = rwmol.GetMol()
    # I must ClearComputedProps() after editing the structure.
    new_mol.ClearComputedProps()
    #Chem.SanitizeMol(new_mol)
    return new_mol

####### Cross-validation code

def read_records(filename):
    with open(filename) as infile:
        for recno, line in enumerate(infile):
            smiles, id = line.split()
            if "*" in smiles:
                continue
            yield recno, id, smiles

def cross_validate():
    # Match a single bond not in a ring
    BOND_SMARTS = "[!#0;!#1]-!@[!#0;!#1]"
    single_bond_pat = Chem.MolFromSmarts(BOND_SMARTS)
    
    num_mols = num_tests = num_mismatches = 0
    time_chiral = time_on_bond = 0.0

    # Helper function to print status information. Since
    # the function is defined inside of another function,
    # it has access to variables in the outer function.
    def print_status():
        print("Processed {} records, {} molecules, {} tests, {} mismatches."
              "  T_chiral: {:.2f} T_on_bond {:.2f}"
              .format(recno, num_mols, num_tests, num_mismatches,
                      time_chiral, time_on_bond))
    
    filename = "/Users/dalke/databases/chembl_20_rdkit.smi"
    start_time = datetime.datetime.now()
    for recno, id, smiles in read_records(filename):
        if recno % 1000 == 0:
            print_status()

        mol = Chem.MolFromSmiles(smiles)
        if mol is None:
            continue
        num_mols += 1

        # Find all of the single bonds
        matches = mol.GetSubstructMatches(single_bond_pat)

        for a1, a2 in matches:
            # For each pair of atom indices, split on that bond
            # and determine the canonicalized fragmented molecule.
            # Do it for both fragment_chiral() ...
            t1 = time.time()
            mol_chiral = fragment_chiral(mol, a1, a2)
            t2 = time.time()
            time_chiral += t2-t1
            smiles_chiral = Chem.MolToSmiles(mol_chiral, isomericSmiles=True)

            # ... and for fragment_on_bond()
            t1 = time.time()
            mol_on_bond = fragment_on_bond(mol, a1, a2)
            t2 = time.time()
            time_on_bond += t2-t1
            smiles_on_bond = Chem.MolToSmiles(mol_on_bond, isomericSmiles=True)

            # Report any mismatches and keep on going.
            num_tests += 1
            if smiles_chiral != smiles_on_bond:
                print("Mismatch: record: {} id: {} smiles: {} cut: {} {}".format(
                      recno, id, smiles, a1, a2))
                print(" smiles_chiral:", smiles_chiral)
                print("smiles_on_bond:", smiles_on_bond)
                num_mismatches += 1
    
    end_time = datetime.datetime.now()
    elapsed_time = str(end_time - start_time)
    elapsed_time = elapsed_time.partition(".")[0]  # don't display subseconds
    
    print("Done.")
    print_status()
    print("elapsed time:", elapsed_time)
                
if __name__ == "__main__":
    cross_validate()



Faster parity calculation #

In the previous essay I needed to determine the parity of a permutation. I used a Shell sort and counted the number of swaps needed to order the list. The parity is even (or "0") if the number of swaps is even, otherwise it's odd (or "1"). The final code was:

def parity_shell(values):
    # Simple Shell sort; while O(N^2), we only deal with at most 4 values 
    values = list(values)
    N = len(values)
    num_swaps = 0
    for i in range(N-1):
        for j in range(i+1, N):
            if values[i] > values[j]:
                values[i], values[j] = values[j], values[i]
                num_swaps += 1
    return num_swaps % 2

I chose this implementation because it's easy to understand, and any failure case is easily found. However, it's not fast.

It's tempting to use a better sort method. The Shell sort takes quadratic time in the number of elements, while others take O(N*ln(N)) time in the asymptotic case.

However, an asymptotic analysis is pointless for this case. The code will only ever receive 3 terms (if there is a chiral hydrogen) or 4 terms because the code will only ever be called for tetrahedral chirality.

Sorting networks

The first time I worked on this problem, I used a sorting network. A sorting network works on a fixed number of elements. It uses a pre-determined sequence of pairwise comparisons, each followed by a swap if needed. These are often used where code branches are expensive, like in hardware or on a GPU. A sorting network also takes constant time, so it can help minimize timing side-channel attacks, where the time to sort may give some insight into what is being sorted.

A general algorithm to find an optimal sorting network for a given number of elements N isn't known, though there are non-optimal constructions like Bose-Nelson and Batcher's odd-even mergesort, and optimal solutions are known for N up to 10.

John M. Gamble has a CGI script which will generate a sorting network for a given number of elements and choice of algorithm. For N=4 it generates:

N=4 elements: SWAP(0, 1); SWAP(2, 3); SWAP(0, 2); SWAP(1, 3); SWAP(1, 2);
where the SWAP would modify the elements of an array in-place. Here's one way to turn those instructions into a 4-element sort in Python:
def sort4(data):
  if data[1] < data[0]:  # SWAP(0, 1)
    data[0], data[1] = data[1], data[0]
  
  if data[3] < data[2]:  # SWAP(2, 3)
    data[2], data[3] = data[3], data[2]
  
  if data[2] < data[0]:  # SWAP(0, 2)
    data[0], data[2] = data[2], data[0]
  
  if data[3] < data[1]:  # SWAP(1, 3)
    data[1], data[3] = data[3], data[1]
  
  if data[2] < data[1]:  # SWAP(1, 2)
    data[1], data[2] = data[2], data[1]

As a test, I'll sort every permutation of four values and make sure the result is sorted. I could write the test cases out manually, but it's easier to use the "permutations()" function from Python's itertools module, as in this example with 3 values:

>>> list(itertools.permutations([1, 2, 3]))
[(1, 2, 3), (1, 3, 2), (2, 1, 3), (2, 3, 1), (3, 1, 2), (3, 2, 1)]
Here's the test, which confirms that the function sorts correctly:
>>> for permutation in itertools.permutations([0, 1, 2, 3]):
...   permutation = list(permutation)  # Convert the tuple to a list so sort4() can swap elements
...   sort4(permutation)
...   if permutation != [0, 1, 2, 3]:
...     print("ERROR:", permutation)
... 
>>>

I think it's obvious how to turn this into a parity function by adding a swap counter. If the input array cannot be modified then the parity function needs to make a copy of the array first. That's what parity_shell() does. (NOTE: the day after writing this I realized I was thinking like a C programmer. In Python the better solution is to copy the items into local variables.)
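Here's what that local-variable version might look like (my own sketch, not code from the earlier essays): the same five-comparison network, with a swap counter and no mutation of the caller's list. Since each executed swap is a transposition, the parity of the swap count is the parity of the permutation:

```python
def parity4(data):
    # Unrolled 4-element parity using the same sorting network as
    # sort4(), but counting swaps in local variables instead of
    # mutating the input.
    a, b, c, d = data
    n = 0
    if b < a: a, b, n = b, a, n + 1   # SWAP(0, 1)
    if d < c: c, d, n = d, c, n + 1   # SWAP(2, 3)
    if c < a: a, c, n = c, a, n + 1   # SWAP(0, 2)
    if d < b: b, d, n = d, b, n + 1   # SWAP(1, 3)
    if c < b: b, c, n = c, b, n + 1   # SWAP(1, 2)
    return n % 2

print(parity4([0, 1, 2, 3]), parity4([1, 0, 2, 3]))  # 0 1
```

As before, it can be checked against parity_shell() over all permutations of [0, 1, 2, 3].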

No need to sort

A sorting network with D comparisons always does all D of them, but not every comparison is needed for every input. The reason is simple - if you think of the network as a decision tree, where each comparison is a branch, then D comparisons give 2^D leaves. This must be at least as large as N!, where N is the number of elements in the list. But N! for N>2 is never a power of 2, so there will be some unused leaves.
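That bound is easy to tabulate (a quick check, not from the original essay): the smallest D with 2^D >= N! is ceil(log2(N!)).

```python
import math

# The smallest D with 2**D >= N! is ceil(log2(N!)): a lower bound on
# the number of comparisons any correct decision tree must make.
for n in range(2, 8):
    print(n, math.ceil(math.log2(math.factorial(n))))
```

For N=4 this gives 5, which is the size of the sorting network above, but a decision tree can finish in only 4 comparisons on some paths.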

I would like to minimize the number of comparisons. I would also like to avoid modifying the array in-place by actually sorting it.

The key realization is that there's no need to sort in order to determine the parity. For example, if there are only two elements in the list, then the parity is as simple as testing

def two_element_parity(x):
  assert len(x) == 2
  return int(x[0] > x[1])  # 0 if already in order, 1 if a swap is needed

The three element parity is a bit harder to do by hand:

def three_element_parity(x):
  assert len(x) == 3
  if x[0] < x[1]:
    if x[1] < x[2]:
      return 0      # 1, 2, 3
    elif x[0] < x[2]:
      return 1      # 1, 3, 2
    else:
      return 0      # 2, 3, 1
  elif x[0] < x[2]:
    return 1        # 2, 1, 3
  elif x[1] < x[2]:
    return 0        # 3, 1, 2
  else:
    return 1        # 3, 2, 1
It's complicated enough that it took several attempts before it was correct. I had to fix it using the following test code, which uses parity_shell() as a reference, because I'm confident that it gives the correct values. (A useful development technique is to write something that you know works, even if it's slow, so you can use it to test more complicated code which better fits your needs.)

The test code is:

def test_three_element_parity():
  for x in itertools.permutations([1,2,3]):
    p1 = parity_shell(x)
    p2 = three_element_parity(x)
    if p1 != p2:
      print("MISMATCH", x, p1, p2)
    else:
      print("Match", x, p1, p2)
which gives the output:
>>> test_three_element_parity()
Match (1, 2, 3) 0 0
Match (1, 3, 2) 1 1
Match (2, 1, 3) 1 1
Match (2, 3, 1) 0 0
Match (3, 1, 2) 0 0
Match (3, 2, 1) 1 1

A debugging technique

As I said, it took a couple of iterations to get correct code. I wasn't sure sometimes which branch was used to get a 0 or 1. During development I added a second field to each return value, to serve as a tag. The code looked like:

def three_element_parity(x):
  assert len(x) == 3
  if x[0] < x[1]:
    if x[1] < x[2]:
      return 0,1      # 1, 2, 3
    elif x[0] < x[2]:
      return 1,2      # 1, 3, 2
    else:
      return 0,3      # 2, 3, 1
  …
which meant I could see which path was in error. Here's one of the debug outputs using that technique:
>>> test_three_element_parity()
MISMATCH (1, 2, 3) 0 (0, 0)
MISMATCH (1, 3, 2) 1 (1, 1)
MISMATCH (2, 1, 3) 1 (1, 3)
MISMATCH (2, 3, 1) 0 (0, 2)
MISMATCH (3, 1, 2) 0 (0, 4)
MISMATCH (3, 2, 1) 1 (1, 5)
Of course the "MISMATCH" now is misleading and I need to compare things by eye, but with this few number of elements that's fine. For more complicated code I would modify the test code as well.

Brute force solution

The last time I worked on this problem I turned the sorting network for N=4 into a decision tree. With 5 swaps there are 2^5 = 32 terminal nodes, but only N! = 4! = 24 of them will be used. I pruned them by hand, which is manageable with 32 leaves.

I thought this time I would come up with some clever way to handle this, and pulled out Knuth's "The Art of Computer Programming" for a pointer; it has a lot about optimal sorts and sorting networks. Oddly, "parity" wasn't in the index.

There's probably some interesting pattern I could use to figure out which code paths to use, but N is small, so I decided to brute force it.

I'll start with all of the possible permutations:

>>> permutations = list(itertools.permutations(range(3)))
>>> permutations
[(0, 1, 2), (0, 2, 1), (1, 0, 2), (1, 2, 0), (2, 0, 1), (2, 1, 0)]
I want to build a decision tree where each leaf contains only one permutation. Each decision will be made by choosing two indices to use for the comparison test. I'll go through the permutations: if a permutation's values at those indices are in order then I'll put it into the "lt_permutations" list ("lt" is short for "less than"), otherwise it goes into the "gt_permutations" list.

For now, I'll assume the first pair of indices to swap is (0, 1):

>>> lt_permutations = []
>>> gt_permutations = []
>>> for permutation in permutations:
...   if permutation[0] < permutation[1]:
...     lt_permutations.append(permutation)
...   else:
...     gt_permutations.append(permutation)
... 
>>> lt_permutations
[(0, 1, 2), (0, 2, 1), (1, 2, 0)]
>>> gt_permutations
[(1, 0, 2), (2, 0, 1), (2, 1, 0)]

I'll turn the above into a utility function:

def partition_permutations(i, j, permutations):
    lt_permutations = []
    gt_permutations = []
    for permutation in permutations:
        if permutation[i] < permutation[j]:
            lt_permutations.append(permutation)
        else:
            gt_permutations.append(permutation)
    return lt_permutations, gt_permutations
and use it to further partition the 'lt_permutations' on the pair (1, 2):
>>> lt_permutations2, gt_permutations2 = partition_permutations(1, 2, lt_permutations)
>>> lt_permutations2
[(0, 1, 2)]
>>> gt_permutations2
[(0, 2, 1), (1, 2, 0)]
The lt_permutations2 list contains one element, so this time I'll partition gt_permutations2 using the swap index pair (0, 2):
>>> lt_permutations3, gt_permutations3 = partition_permutations(0, 2, gt_permutations2)
>>> lt_permutations3
[(0, 2, 1)]
>>> gt_permutations3
[(1, 2, 0)]
Each partitioning corresponds to an additional if-statement, continuing until only one permutation is left in a branch. I want to use the above information to make a decision tree which looks like:
def parity3(data):
  if data[0] < data[1]:
    if data[1] < data[2]:
      return 0 # parity of (0, 1, 2)
    else:
      if data[0] < data[2]:
        return 1 # parity of (0, 2, 1)
      else:
        return 0 # parity of (1, 2, 0)
  ...

Partition scoring

In the previous section I partitioned using the successive pairs (0, 1), (1, 2) and (0, 2). These are pretty obvious. What should I use for N=4 or higher? In truth, I could likely use the same swap pairs as the sorting network, but I decided to continue with brute force.

For N items there are N*(N-1)/2 possible swap pairs.

>>> n = 4
>>> swap_pairs = [(i, j) for i in range(n-1) for j in range(i+1, n)]
>>> swap_pairs
[(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
I decided to pick the pair which is most likely to partition the set of permutations in half. For each pair, I partition the given permutations and score the pair by the absolute value of the difference in size between the "less than" and the "greater than" subsets; the lower the score, the more balanced the partition.
def get_partition_score(swap_pair, permutations):
    i, j = swap_pair
    num_lt = num_gt = 0
    for permutation in permutations:
        if permutation[i] < permutation[j]:
            num_lt += 1
        else:
            num_gt += 1
    return abs(num_lt - num_gt)
I'll create all permutations for 4 terms and score each of the pairs on the results:
>>> permutations = list(itertools.permutations(range(n)))
>>> len(permutations)
24
>>> permutations[:3]
[(0, 1, 2, 3), (0, 1, 3, 2), (0, 2, 1, 3)]
>>> for swap_pair in swap_pairs:
...   print(swap_pair, get_partition_score(swap_pair, permutations))
... 
(0, 1) 0
(0, 2) 0
(0, 3) 0
(1, 2) 0
(1, 3) 0
(2, 3) 0
Every pair has a score of 0, which means all partitions are equally good. I'll use the first, which is (0, 1), and partition into two sets of 12 each:
>>> lt_permutations, gt_permutations = partition_permutations(0, 1, permutations)
>>> lt_permutations
[(0, 1, 2, 3), (0, 1, 3, 2), (0, 2, 1, 3), (0, 2, 3, 1), (0, 3, 1, 2),
(0, 3, 2, 1), (1, 2, 0, 3), (1, 2, 3, 0), (1, 3, 0, 2), (1, 3, 2, 0),
(2, 3, 0, 1), (2, 3, 1, 0)]
>>> len(lt_permutations), len(gt_permutations)
(12, 12)
I'll redo the scoring with the "less than" permutations:
>>> permutations = lt_permutations
>>> for swap_pair in swap_pairs:
...   print(swap_pair, get_partition_score(swap_pair, permutations))
... 
(0, 1) 12
(0, 2) 4
(0, 3) 4
(1, 2) 4
(1, 3) 4
(2, 3) 0
Obviously it does no good to use (0, 1) again because all of the remaining permutations are already ordered on that pair. Most of the other pairs are partially ordered as well, so using them leads to an imbalanced 4-8 partitioning, but (2, 3) gives a perfect partitioning, so I'll use it for the next partitioning, and again select the "less-than" subset and re-score:
>>> lt_permutations, gt_permutations = partition_permutations(2, 3, permutations)
>>> lt_permutations
[(0, 1, 2, 3), (0, 2, 1, 3), (0, 3, 1, 2), (1, 2, 0, 3), (1, 3, 0, 2), (2, 3, 0, 1)]
>>> gt_permutations
[(0, 1, 3, 2), (0, 2, 3, 1), (0, 3, 2, 1), (1, 2, 3, 0), (1, 3, 2, 0), (2, 3, 1, 0)]
>>> permutations = lt_permutations
>>> for swap_pair in swap_pairs:
...   print(swap_pair, get_partition_score(swap_pair, permutations))
... 
(0, 1) 6
(0, 2) 0
(0, 3) 4
(1, 2) 4
(1, 3) 0
(2, 3) 6

Repeat this process until only one permutation is left, and use the parity of that permutation as the return value.
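The loop just described can also be sketched as a run-time function, without any code generation (a sketch under my own naming, parity_by_partition; it assumes the values are distinct):

```python
import itertools

def parity_by_partition(data):
    # Narrow the set of candidate permutations with comparisons,
    # as described in the text, until one remains, then return
    # that permutation's parity. Assumes distinct values in 'data'.
    n = len(data)
    candidates = list(itertools.permutations(range(n)))
    swap_pairs = [(i, j) for i in range(n - 1) for j in range(i + 1, n)]
    while len(candidates) > 1:
        # Pick the most balanced partition (lowest score).
        def score(pair):
            i, j = pair
            num_lt = sum(1 for p in candidates if p[i] < p[j])
            return abs(2 * num_lt - len(candidates))
        i, j = min(swap_pairs, key=score)
        swap_pairs.remove((i, j))
        # Keep only the permutations consistent with this comparison.
        if data[i] < data[j]:
            candidates = [p for p in candidates if p[i] < p[j]]
        else:
            candidates = [p for p in candidates if p[i] > p[j]]
    # Parity of the surviving permutation, by counting inversions.
    p = candidates[0]
    return sum(p[i] > p[j] for i in range(n) for j in range(i + 1, n)) % 2
```

This does the partitioning work on every call, which is slow; generating the decision tree as code, as below, pays that cost only once.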

Code generation

I'll combine the above code together and put it into a program which generates Python code to compute the parity of a list of N distinct items. It uses recursion. The main entry point is "generate_parity_function()", which sets up the data for the recursive function "_generate_comparison()". That identifies the best pair of indices to use for the swap then calls itself to process each side.

On the other hand, if there's only one permutation in the list, then there's nothing more to do but compute the parity of that permutation and use it as the return value for that case.

import itertools

def parity_shell(values):
    # Simple Shell sort; while O(N^2), we only deal with at most 4 values 
    values = list(values)
    N = len(values)
    num_swaps = 0
    for i in range(N-1):
        for j in range(i+1, N):
            if values[i] > values[j]:
                values[i], values[j] = values[j], values[i]
                num_swaps += 1
    return num_swaps % 2

def get_partition_score(swap_pair, permutations):
    i, j = swap_pair
    num_lt = num_gt = 0
    for permutation in permutations:
        if permutation[i] < permutation[j]:
            num_lt += 1
        else:
            num_gt += 1
    return abs(num_lt - num_gt)

def partition_permutations(i, j, permutations):
    lt_permutations = []
    gt_permutations = []
    for permutation in permutations:
        if permutation[i] < permutation[j]:
            lt_permutations.append(permutation)
        else:
            gt_permutations.append(permutation)
    return lt_permutations, gt_permutations

def generate_parity_function(n):
    print("def parity{}(data):".format(n))
    permutations = list(itertools.permutations(range(n)))
    swap_pairs = [(i, j) for i in range(n-1) for j in range(i+1, n)]
    _generate_comparison(permutations, swap_pairs, "  ")

def _generate_comparison(permutations, swap_pairs, indent):
    if len(permutations) == 1:
        parity = parity_shell(permutations[0])
        print(indent + "return {} # {} ".format(parity, permutations[0]))
        return
    
    swap_pair = min(swap_pairs, key=lambda x: get_partition_score(x, permutations))
    # Delete the swap pair because it can't be used again.
    # (Not strictly needed, as it will always have the worst score.)
    del swap_pairs[swap_pairs.index(swap_pair)]

    # I could have a case where the lt subset has 0 elements while the
    # gt subset has 1 element. Rather than have the 'if' block do nothing,
    # I'll swap the comparison indices and swap branches.
    i, j = swap_pair
    lt_permutations, gt_permutations = partition_permutations(i, j, permutations)
    if not lt_permutations:
        lt_permutations, gt_permutations = gt_permutations, lt_permutations
        i, j = j, i
    
    print(indent + "if data[{i}] < data[{j}]:".format(i=i, j=j))
    # Need to copy the swap_pairs because the 'else' case may reuse a pair.
    _generate_comparison(lt_permutations, swap_pairs[:], indent+"  ")
    if gt_permutations:
        print(indent + "else:")
        _generate_comparison(gt_permutations, swap_pairs, indent+"  ")
    

if __name__ == "__main__":
    import sys
    n = 4
    if sys.argv[1:]:
        n = int(sys.argv[1])
    generate_parity_function(n)

The output for n=2 elements is the expected trivial case:

def parity2(data):
  if data[0] < data[1]:
    return 0 # (0, 1) 
  else:
    return 1 # (1, 0) 
For n=3 it's a bit more complicated.
def parity3(data):
  if data[0] < data[1]:
    if data[0] < data[2]:
      if data[1] < data[2]:
        return 0 # (0, 1, 2) 
      else:
        return 1 # (0, 2, 1) 
    else:
      return 0 # (1, 2, 0) 
  else:
    if data[0] < data[2]:
      return 1 # (1, 0, 2) 
    else:
      if data[1] < data[2]:
        return 0 # (2, 0, 1) 
      else:
        return 1 # (2, 1, 0) 
and for n=4 elements, well, you can see why I wrote a program to help generate the function:
def parity4(data):
  if data[0] < data[1]:
    if data[2] < data[3]:
      if data[0] < data[2]:
        if data[1] < data[2]:
          return 0 # (0, 1, 2, 3) 
        else:
          if data[1] < data[3]:
            return 1 # (0, 2, 1, 3) 
          else:
            return 0 # (0, 3, 1, 2) 
      else:
        if data[0] < data[3]:
          if data[1] < data[3]:
            return 0 # (1, 2, 0, 3) 
          else:
            return 1 # (1, 3, 0, 2) 
        else:
          return 0 # (2, 3, 0, 1) 
    else:
      if data[0] < data[3]:
        if data[1] < data[2]:
          if data[1] < data[3]:
            return 1 # (0, 1, 3, 2) 
          else:
            return 0 # (0, 2, 3, 1) 
        else:
          return 1 # (0, 3, 2, 1) 
      else:
        if data[0] < data[2]:
          if data[1] < data[2]:
            return 1 # (1, 2, 3, 0) 
          else:
            return 0 # (1, 3, 2, 0) 
        else:
          return 1 # (2, 3, 1, 0) 
  else:
    if data[2] < data[3]:
      if data[0] < data[3]:
        if data[0] < data[2]:
          return 1 # (1, 0, 2, 3) 
        else:
          if data[1] < data[2]:
            return 0 # (2, 0, 1, 3) 
          else:
            return 1 # (2, 1, 0, 3) 
      else:
        if data[1] < data[2]:
          return 1 # (3, 0, 1, 2) 
        else:
          if data[1] < data[3]:
            return 0 # (3, 1, 0, 2) 
          else:
            return 1 # (3, 2, 0, 1) 
    else:
      if data[0] < data[2]:
        if data[0] < data[3]:
          return 0 # (1, 0, 3, 2) 
        else:
          if data[1] < data[3]:
            return 1 # (2, 0, 3, 1) 
          else:
            return 0 # (2, 1, 3, 0) 
      else:
        if data[1] < data[2]:
          if data[1] < data[3]:
            return 0 # (3, 0, 2, 1) 
          else:
            return 1 # (3, 1, 2, 0) 
        else:
          return 0 # (3, 2, 1, 0) 

The test code is essentially the same as "test_three_element_parity()", so I won't include it here.

Evaluation

I don't think it makes much sense to use this function beyond n=5 because there's so much code. Here's a table of the number of lines of code it generates for different values of n:

# elements   # lines
----------   -------
    2              5
    3             17
    4             71
    5            349
    6          2,159
    7         15,119
    8        120,959
    9      1,088,662
This appears to be roughly factorial growth, which is what it should be. For my case, n=4, so 71 lines is not a problem.
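A quick sanity check on the factorial-growth claim, using the line counts from the table above: each count divided by the previous one should approach n.

```python
# Line counts from the table, for n = 2 through 9.
line_counts = [5, 17, 71, 349, 2159, 15119, 120959, 1088662]

# Factorial growth predicts count(n) / count(n-1) -> n.
for n, prev, cur in zip(range(3, 10), line_counts, line_counts[1:]):
    print(n, round(cur / prev, 2))
```

The last few ratios come out at about 7, 8, and 9, which is the factorial pattern.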

I wrote some timing code which does 100,000 random selections from the possible permutations and compares the performance of the parityN() function with parity_shell(). To put them on a more even basis, I changed the parity_shell() implementation so it mutates the input values rather than making a temporary list. The timing code for parity5() looks like:

import itertools

def parity_shell(values):
    # Simple Shell sort; while O(N^2), we only deal with a handful of values
    #values = list(values)
    N = len(values)
    num_swaps = 0
    for i in range(N-1):
        for j in range(i+1, N):
            if values[i] > values[j]:
                values[i], values[j] = values[j], values[i]
                num_swaps += 1
    return num_swaps % 2


if __name__ == "__main__":
    import random
    import time
    permutations = list(itertools.permutations(range(5)))
    perms = [list(random.choice(permutations)) for i in range(100000)]
    t1 = time.time()
    p1 = [parity5(perm) for perm in perms]
    t2 = time.time()
    p2 = [parity_shell(perm) for perm in perms]
    t3 = time.time()
    if p1 != p2:
      print("oops")
    print("parity5:", t2-t1, "parity_shell:", t3-t2)
    print("ratio:", (t2-t1)/(t3-t2))
The decision tree version is consistently 5-6x faster than the Shell sort version across all the sizes I tested.

A performance improvement

By the way, I was able to raise the performance to 9x faster by using local variables rather than indexing into the array each time. Here's the start of parity4() with that change:

def parity4(data):
  data0,data1,data2,data3 = data
  if data0 < data1:
    if data2 < data3:
      if data0 < data2:
        if data1 < data2:
          return 0 # (0, 1, 2, 3) 
        else:
          if data1 < data3:
            return 1 # (0, 2, 1, 3) 
          else:
            return 0 # (0, 3, 1, 2) 
      else:
        if data0 < data3:
          if data1 < data3:
            return 0 # (1, 2, 0, 3) 
          else:
            return 1 # (1, 3, 0, 2) 
        else:
          return 0 # (2, 3, 0, 1) 
    else:
      if data0 < data3:
        if data1 < data2:
          if data1 < data3:
            return 1 # (0, 1, 3, 2) 
          else:
            return 0 # (0, 2, 3, 1) 
        else:
          return 1 # (0, 3, 2, 1) 
      else:
        if data0 < data2:
          if data1 < data2:
            return 1 # (1, 2, 3, 0) 
          else:
            return 0 # (1, 3, 2, 0) 
        else:
          return 1 # (2, 3, 1, 0) 
  else:
    if data2 < data3:
      if data0 < data3:
        if data0 < data2:
          return 1 # (1, 0, 2, 3) 
        else:
          if data1 < data2:
            return 0 # (2, 0, 1, 3) 
          else:
            return 1 # (2, 1, 0, 3) 
      else:
        if data1 < data2:
          return 1 # (3, 0, 1, 2) 
        else:
          if data1 < data3:
            return 0 # (3, 1, 0, 2) 
          else:
            return 1 # (3, 2, 0, 1) 
    else:
      if data0 < data2:
        if data0 < data3:
          return 0 # (1, 0, 3, 2) 
        else:
          if data1 < data3:
            return 1 # (2, 0, 3, 1) 
          else:
            return 0 # (2, 1, 3, 0) 
      else:
        if data1 < data2:
          if data1 < data3:
            return 0 # (3, 0, 2, 1) 
          else:
            return 1 # (3, 1, 2, 0) 
        else:
          return 0 # (3, 2, 1, 0) 
It's easy to change the code so it generates this version instead, so I won't show it.

Are the if/else:s worthwhile?

This is written the day after I wrote the previous text. In the above, I minimized the number of comparisons, at the expense of a lot of code generation. But the sorting network doesn't add that much overhead, and it's clearly a lot easier to implement. The real test shouldn't be how much faster the if/else decision tree is over the Shell sort solution, but how much faster it is than the optimal sorting network.

So I did that. Here's the N=4 and N=5 code, which is clearly simpler than the 71 and 349 lines of code from the decision tree:

def parity4_network(data):
    # N=4 Bose-Nelson sorting network from http://pages.ripco.net/~jgamble/nw.html 
    num_swaps = 0
    x0, x1, x2, x3 = data
    if x1 < x0:
        num_swaps += 1
        x0, x1 = x1, x0
  
    if x3 < x2:
        num_swaps += 1
        x2, x3 = x3, x2
  
    if x2 < x0:
        num_swaps += 1
        x0, x2 = x2, x0
  
    if x3 < x1:
        num_swaps += 1
        x1, x3 = x3, x1
  
    if x2 < x1:
        num_swaps += 1
        x1, x2 = x2, x1
  
    return num_swaps % 2
    
def parity5_network(data):
    # N=5 Bose-Nelson sorting network from http://pages.ripco.net/~jgamble/nw.html 
    num_swaps = 0
    x0, x1, x2, x3, x4 = data
    if x1 < x0:
        num_swaps += 1
        x0, x1 = x1, x0
    
    if x4 < x3:
        num_swaps += 1
        x3, x4 = x4, x3
    
    if x4 < x2:
        num_swaps += 1
        x2, x4 = x4, x2
    
    if x3 < x2:
        num_swaps += 1
        x2, x3 = x3, x2
    
    if x3 < x0:
        num_swaps += 1
        x0, x3 = x3, x0
    
    if x2 < x0:
        num_swaps += 1
        x0, x2 = x2, x0
    
    if x4 < x1:
        num_swaps += 1
        x1, x4 = x4, x1
    
    if x3 < x1:
        num_swaps += 1
        x1, x3 = x3, x1
    
    if x2 < x1:
        num_swaps += 1
        x1, x2 = x2, x1
    
    return num_swaps % 2
My timing tests say that the N=4 sorting network takes 1.4x the time of the decision tree with local variables, and the N=5 sorting network takes 1.7x the time. The decision tree is clearly faster.

If you are really interested in performance then you could push this into C or switch to PyPy. But sometimes you just need the code to be fast enough, which is why it can be worthwhile to explore different solutions and know their tradeoffs.



Fragment chiral molecules in RDKit using low-level functions #

In the previous essay, I showed that the simple fragmentation function doesn't preserve chirality after making a single cut. Here's the function definition:

from rdkit import Chem

# Only works correctly for achiral molecules
def fragment_simple(mol, atom1, atom2):
    rwmol = Chem.RWMol(mol)
    rwmol.RemoveBond(atom1, atom2)
    wildcard1 = rwmol.AddAtom(Chem.Atom(0))
    wildcard2 = rwmol.AddAtom(Chem.Atom(0))
    rwmol.AddBond(atom1, wildcard1, Chem.BondType.SINGLE) 
    rwmol.AddBond(atom2, wildcard2, Chem.BondType.SINGLE) 
    return rwmol.GetMol()
The reason is that the RemoveBond()/AddBond() combination can change the permutation order of the bonds around an atom, which inverts the chirality. Here's the relevant part of the connection table from the end of that essay:
  connections from atom 1 (as bond type + other atom index)
1 C -0 -2 -3 -5   original structure
1 C -0 -2 -5 -11  modified; bond to atom 3 is now a bond to atom 11
             ^^^--- was bond to atom 3
         ^^^^^^^--- these two bonds swapped position = inverted chirality

I'll now show how to improve the code to handle chirality. (Note: this essay is pedagogical. To fragment in RDKit use FragmentOnBonds().)

Parity of a permutation

There's no way from Python to go in and change the permutation order of RDKit's bond list for an atom. Instead, I need to detect if the permutation order has changed, and if so, un-invert the atom's chirality.

While I say "un-invert", that's because we only need to deal with tetrahedral chirality, which has only two chirality types. SMILES supports more complicated chiralities, like octahedral (for example, "@OH19") which can't be written simply as "@" or "@@". However, I've never seen them in use.

With only two possibilities, this reduces to determining the "parity" of the permutation. There are only two possible parities. I'll call one "even" and the other "odd", though in code I'll use 0 for even and 1 for odd.

A list of values in increasing order, like (1, 2, 9), has an even parity. If I swap two values then it has odd parity. Both (2, 1, 9) and (9, 2, 1) have odd parity, because each needs only one swap to put it in sorted order. With another swap, such as (2, 9, 1), the permutation order is back to even parity. The parity of a permutation is the number of pairwise swaps needed to order the list, modulo 2. If the result is 0 then it has even parity, if the result is 1 then it has odd parity.

One way to compute the parity is to sort the list and count the number of swaps needed. Since there will only be a handful of bonds, I can use a simple sort like the Shell sort:

def parity_shell(values):
    # Simple Shell sort; while O(N^2), we only deal with at most 4 values 
    values = list(values)
    N = len(values)
    num_swaps = 0
    for i in range(N-1):
        for j in range(i+1, N):
            if values[i] > values[j]:
                values[i], values[j] = values[j], values[i]
                num_swaps += 1
    return num_swaps % 2
(I have another essay where I show how to write a faster parity function for a fixed number of values.)

I'll test it with a few different cases to see if it gives the expected results:

>>> parity_shell( (1, 2, 9) )
0
>>> parity_shell( (2, 1, 9) )
1
>>> parity_shell( (2, 9, 1) )
0
>>> parity_shell( (2, 1, 9) )
1
>>> parity_shell( (1, 3, 9, 8) )
1
There are faster and better ways to determine the parity. I find it best to start with the most obviously correct solution first.
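One of those alternatives is to count inversions directly, with no sorting at all (a sketch; parity_inversions is my name for it):

```python
def parity_inversions(values):
    # The parity is the number of out-of-order pairs ("inversions")
    # modulo 2, so there's no need to reorder anything.
    n = len(values)
    num_inversions = sum(1 for i in range(n) for j in range(i + 1, n)
                         if values[i] > values[j])
    return num_inversions % 2
```

It examines the same pairs as parity_shell() but never moves any values, and the two agree on all the examples above.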

Determine an atom's parity

The next step is to determine the configuration order before and after attaching the dummy atom. I'll use the fragment_simple() and parity_shell() functions I defined earlier, and define a couple of helper functions to create an isomeric canonical SMILES from a molecule or SMILES string.

from rdkit import Chem

def C(mol): # Create a canonical isomeric SMILES from a molecule
  return Chem.MolToSmiles(mol, isomericSmiles=True)

def Canon(smiles): # Create a canonical isomeric SMILES from a SMILES string
  return C(Chem.MolFromSmiles(smiles))

The permutation order is based on the atoms bonded to a given atom. I'll parse a simple chiral structure (which is already in canonical form) and get the ids for the atoms bonded to the second atom. (The second atom has an index of 1.)

>>> mol = Chem.MolFromSmiles("O[C@](F)(Cl)Br")
>>> C(mol)
'O[C@](F)(Cl)Br'
>>> 
>>> atom_id = 1
>>> atom_obj = mol.GetAtomWithIdx(atom_id)
>>> other_atoms = [bond.GetOtherAtomIdx(atom_id) for bond in atom_obj.GetBonds()]
>>> other_atoms
[0, 2, 3, 4]
The list values are in order, so you won't be surprised it has a parity of 0 ("even"):
>>> parity_shell(other_atoms)
0

I'll use the fragment_simple() function to fragment between the oxygen and the chiral carbon:

>>> fragmented_mol = fragment_simple(mol, 0, 1)
>>> fragmented_smiles = C(fragmented_mol)
>>> fragmented_smiles
'[*]O.[*][C@@](F)(Cl)Br'
then use the convert_wildcards_to_closures() function from the previous essay to re-connect the fragments and produce a canonical SMILES from it:
>>> from smiles_syntax import convert_wildcards_to_closures
>>> 
>>> closure_smiles = convert_wildcards_to_closures(fragmented_smiles, (0, 0))
>>> closure_smiles
'O%90.[C@@]%90(F)(Cl)Br'
>>> 
>>> Canon(closure_smiles)
'O[C@@](F)(Cl)Br'

If you compare this to the canonicalized input SMILES you'll see the chirality is inverted from what it should be. I'll see if I can detect that from the list of neighbor atoms to the new atom 1 of the fragmented molecule:

>>> atom_id = 1
>>> atom_obj = fragmented_mol.GetAtomWithIdx(atom_id)
>>> other_atoms = [bond.GetOtherAtomIdx(atom_id) for bond in atom_obj.GetBonds()]
>>> other_atoms
[2, 3, 4, 6]
These values are ordered. It's tempting to conclude that this list also has an even parity. But recall that the original list was [0, 2, 3, 4]. The id 0 (the connection to the oxygen) has been replaced with the id 6 (the connection to the wildcard atom).

The permutation must use the same values, so I'll replace the 6 with a 0 and determine the parity of the resulting list:

>>> i = other_atoms.index(6)
>>> i
3
>>> other_atoms[i] = 0
>>> other_atoms
[2, 3, 4, 0]
>>> parity_shell(other_atoms)
1
This returned a 1 where the earlier parity call returned a 0, which means the parity is inverted, which means I need to invert the chirality of the second atom:
>>> atom_obj.InvertChirality()

Now to check the re-assembled structure:

>>> fragmented_smiles = C(fragmented_mol)
>>> fragmented_smiles
'[*]O.[*][C@](F)(Cl)Br'
>>> 
>>> closure_smiles = convert_wildcards_to_closures(fragmented_smiles, (0, 0))
>>> closure_smiles
'O%90.[C@]%90(F)(Cl)Br'
>>> 
>>> Canon(closure_smiles)
'O[C@](F)(Cl)Br'
This matches the canonicalized input SMILES, so we're done.

An improved fragment function

I'll use a top-down process to describe the changes to fragment_simple() to make it work. What this doesn't show you is the several iterations I went through to make it look this nice.

At the top level, I need some code to figure out if an atom is chiral before the cut, and, if it is, some way to restore the correct chirality once I've connected it to the new wildcard atom.

def fragment_chiral(mol, atom1, atom2):
    rwmol = Chem.RWMol(mol)

    # Store the old parity as 0 = even, 1 = odd, or None for no parity 
    atom1_parity = get_bond_parity(mol, atom1)
    atom2_parity = get_bond_parity(mol, atom2)
    
    rwmol.RemoveBond(atom1, atom2)
    wildcard1 = rwmol.AddAtom(Chem.Atom(0))
    wildcard2 = rwmol.AddAtom(Chem.Atom(0))
    new_bond1 = rwmol.AddBond(atom1, wildcard1, Chem.BondType.SINGLE)
    new_bond2 = rwmol.AddBond(atom2, wildcard2, Chem.BondType.SINGLE)

    # Restore the correct parity if there is a parity
    if atom1_parity is not None:
        set_bond_parity(rwmol, atom1, atom1_parity, atom2, wildcard1)
    if atom2_parity is not None:
        set_bond_parity(rwmol, atom2, atom2_parity, atom1, wildcard2)

    # (Later I'll find I also need to call ClearComputedProps().)
    return rwmol.GetMol()

To get an atom's bond permutation parity, check if it has tetrahedral chirality (it will either be clockwise or counter-clockwise). If it doesn't have tetrahedral chirality, return None. Otherwise use the neighboring atom ids to determine the parity:

from rdkit import Chem

CHI_TETRAHEDRAL_CW = Chem.ChiralType.CHI_TETRAHEDRAL_CW
CHI_TETRAHEDRAL_CCW = Chem.ChiralType.CHI_TETRAHEDRAL_CCW

def get_bond_parity(mol, atom_id):
    atom_obj = mol.GetAtomWithIdx(atom_id)
    
    # Return None unless it has tetrahedral chirality
    chiral_tag = atom_obj.GetChiralTag()
    if chiral_tag not in (CHI_TETRAHEDRAL_CW, CHI_TETRAHEDRAL_CCW):
        return None
    
    # Get the list of ids of the atoms it's bonded to.
    other_atom_ids = [bond.GetOtherAtomIdx(atom_id) for bond in atom_obj.GetBonds()]
    
    # Use those ids to determine the parity
    return parity_shell(other_atom_ids)

To restore the parity, again get the list of neighboring atom ids, but this time from the fragmented molecule. The atom will now be bonded to one of the new wildcard atoms, so I need to map the wildcard's id back to the original atom's id before I can compute the parity and, if it's changed, invert the chirality:

def set_bond_parity(mol, atom_id, old_parity, old_other_atom_id, new_other_atom_id):
    atom_obj = mol.GetAtomWithIdx(atom_id)
    # Get the list of ids of the atoms it's bonded to.
    other_atom_ids = [bond.GetOtherAtomIdx(atom_id) for bond in atom_obj.GetBonds()]

    # Replace id from the new wildcard atom with the id of the original atom
    i = other_atom_ids.index(new_other_atom_id)
    other_atom_ids[i] = old_other_atom_id

    # Use those ids to determine the parity
    new_parity = parity_shell(other_atom_ids)
    if old_parity != new_parity:
        # If the parity has changed, invert the chirality
        atom_obj.InvertChirality()

Testing

I used a simple set of tests during the initial development where I split the bond between the first and second atom of a few structures and compared the result to a reference structure that I fragmented manually (and re-canonicalized):

# Create a canonical isomeric SMILES from a SMILES string.
# Used to put the manually-developed reference structures into canonical form.
def Canon(smiles):
    mol = Chem.MolFromSmiles(smiles)
    assert mol is not None, smiles
    return Chem.MolToSmiles(mol, isomericSmiles=True)

def simple_test():
    for smiles, expected in (
            ("CC", Canon("*C.*C")),
            ("F[C@](Cl)(Br)O", Canon("*F.*[C@](Cl)(Br)O")),
            ("F[C@@](Cl)(Br)O", Canon("*F.*[C@@](Cl)(Br)O")),
            ("F[C@@H](Br)O", Canon("*F.*[C@@H](Br)O")),
            ):
        mol = Chem.MolFromSmiles(smiles)
        fragmented_mol = fragment_chiral(mol, 0, 1)
        fragmented_smiles = Chem.MolToSmiles(fragmented_mol, isomericSmiles=True)
        if fragmented_smiles != expected:
            print("smiles:", smiles)
            print("fragmented:", fragmented_smiles)
            print("  expected:", expected)

These tests passed, so I developed some more extensive tests. My experience is that real-world chemistry is far more complex and interesting than the manual test cases I develop. After the basic tests are done, I do more extensive testing by processing a large number of structures from PubChem and then from ChEMBL.

(Incidentally, I process PubChem first because those files are generated by one tool - OEChem I believe - while ChEMBL files are generated by several different tools, each with a different way to handle chemistry. This makes the chemistry in ChEMBL more diverse.)

Need ClearComputedProps()?

The more extensive test processes every structure which contains a chiral atom (where the SMILES contains a '@'), cuts every bond between heavy atoms, so long as it's a single bond not in a ring, and puts the results back together to see if it matches the canonicalized input structure. The code isn't interesting enough to make specific comments about it. You can get the code at the end of this essay.

The first error occurred quickly, and many more followed. Here's the first one:

FAILURE in record 94
     input_smiles: C[C@H]1CC[C@@H](NCc2ccc3c(c2)Cc2c(-c4ccc(CC(=O)O)cc4)n[nH]c2-3)CC1
  begin/end atoms: 4 5
fragmented smiles: [*]NCc1ccc2c(c1)Cc1c(n[nH]c1-2)-c1ccc(cc1)CC(=O)O.[*][C@H]1CC[C@H](C)CC1
   closure smiles: N%90Cc1ccc2c(c1)Cc1c(n[nH]c1-2)-c1ccc(cc1)CC(=O)O.[C@@H]%901CC[C@H](C)CC1
     final smiles: C[C@H]1CC[C@H](NCc2ccc3c(c2)Cc2c(-c4ccc(CC(=O)O)cc4)n[nH]c2-3)CC1
  expected smiles: C[C@H]1CC[C@@H](NCc2ccc3c(c2)Cc2c(-c4ccc(CC(=O)O)cc4)n[nH]c2-3)CC1

Greg Landrum, the main RDKit developer, points to the solution. He writes: "after you break one or more bonds, you really, really should re-sanitize the molecule (or at least call ClearComputedProps())".

The modified code is:

def fragment_chiral(mol, atom1, atom2):
    rwmol = Chem.RWMol(mol)
    
    atom1_parity = get_bond_parity(mol, atom1)
    atom2_parity = get_bond_parity(mol, atom2)
    
    rwmol.RemoveBond(atom1, atom2)
    wildcard1 = rwmol.AddAtom(Chem.Atom(0))
    wildcard2 = rwmol.AddAtom(Chem.Atom(0))
    new_bond1 = rwmol.AddBond(atom1, wildcard1, Chem.BondType.SINGLE)
    new_bond2 = rwmol.AddBond(atom2, wildcard2, Chem.BondType.SINGLE)
    
    if atom1_parity is not None:
        set_bond_parity(rwmol, atom1, atom1_parity, atom2, wildcard1)
    if atom2_parity is not None:
        set_bond_parity(rwmol, atom2, atom2_parity, atom1, wildcard2)

    # After breaking bonds, should re-sanitize, or at least use ClearComputedProps()
    # See https://github.com/rdkit/rdkit/issues/1022#issuecomment-239355482
    new_mol = rwmol.GetMol()
    # (I will find that I need to call SanitizeMol().)
    new_mol.ClearComputedProps()
    return new_mol

Ring chirality failures

There were 200,748 chiral structures in my selected subset from PubChem. Of those, 232 unique structures had problems when I tried to cut a bond. Here's an example of one of the reported failures:

FAILURE in record 12906
     input_smiles: [O-][S@+](CC1(O)CCN(CCc2c[nH]c3ccc(F)cc23)CC1)c1ccccc1
  begin/end atoms: 0 1
fragmented smiles: [*][O-].[*][S@+](CC1(O)CCN(CCc2c[nH]c3ccc(F)cc23)CC1)c1ccccc1
   closure smiles: [O-]%90.[S@+]%90(CC1(O)CCN(CCc2c[nH]c3ccc(F)cc23)CC1)c1ccccc1
     final smiles: [O-][S@@+](CC1(O)CCN(CCc2c[nH]c3ccc(F)cc23)CC1)c1ccccc1
  expected smiles: [O-][S@+](CC1(O)CCN(CCc2c[nH]c3ccc(F)cc23)CC1)c1ccccc1

Here's the complete list of failing structures, which might make a good test set for other programs:

C-N=C(-S)[C@]1(c2cccnc2)CCCC[S@+]1[O-] 1
C=C(c1ccc([S@+]([O-])c2ccc(OC)cc2)cc1)C1CCN(C2CCCCC2)CC1 2
C=C(c1ccc([S@@+]([O-])c2ccc(OC)cc2)cc1)C1CCN(C2CCCCC2)CC1 3
C=CCCCC[N@+]1(C)C2CCC1C[C@H](CC(C#N)(c1ccccc1)c1ccccc1)C2 4
C=CCCCC[N@@+]1(C)C2CCC1C[C@H](CC(C#N)(c1ccccc1)c1ccccc1)C2 5
C=CC[S@+]([O-])C[C@H](N)C(=O)O 6
CC#CCOc1ccc([S@@+]([O-])[C@H](C(=O)NO)C(C)C)cc1 7
CC(=O)NC[C@H]1CN(c2ccc([S@+](C)[O-])cc2)C(=O)O1 8
CC(=O)Nc1cc(-c2[nH]c([S@+](C)[O-])nc2-c2ccc(F)cc2)ccn1 9
CC(=O)Nc1cc(-c2c(-c3ccc(F)cc3)nc([S@+](C)[O-])n2C)ccn1 10
CC(C(=O)O[C@@H]1CC2CCC(C1)[N@]2C)(c1ccccc1)c1ccccc1 11
CC(C)(C(=O)N[C@H]1C2CCC1C[C@@H](C(=O)O)C2)N1CCN(c2ccc(C(F)(F)F)cn2)CC1 12
CC(C)(C(=O)N[C@H]1C2CCCC1C[C@@H](C(=O)O)C2)N1CCN(c2ccc(C(F)(F)F)cn2)CC1 13
CC(C)(C(=O)N[C@H]1C2CCCC1C[C@@H](C(N)=O)C2)N1CCN(c2ccc(C(F)(F)F)cn2)CC1 14
CC(C)(C(=O)N[C@H]1C2CCCC1C[C@H](C(N)=O)C2)N1CCN(c2ccc(C(F)(F)F)cn2)CC1 15
CC(C)(Oc1ccc(C#N)cn1)C(=O)N[C@H]1C2CCCC1C[C@H](C(N)=O)C2 16
CC(C)(Oc1ccc(C#N)cn1)C(=O)N[C@H]1C2COCC1C[C@H](C(N)=O)C2 17
CC(C)(Oc1ccc(Cl)cc1)C(=O)N[C@H]1C2CCCC1C[C@@H](C(=O)O)C2 18
CC(C)(Oc1ccc(Cl)cc1)C(=O)N[C@H]1C2CCCC1C[C@@H](C(N)=O)C2 19
CC(C)(Oc1ccc(Cl)cc1)C(=O)N[C@H]1C2CCCC1C[C@H](C(=O)O)C2 20
CC(C)(Oc1ccc(Cl)cc1)C(=O)N[C@H]1C2CCCC1C[C@H](C(N)=O)C2 21
CC(C)(Oc1ccc(Cl)cc1)C(=O)N[C@H]1C2COCC1C[C@@H](C(N)=O)C2 22
CC(C)Cc1c(C#N)cc2ccccc2c1C(=O)N(C)C[C@@H](CCN1CCC(c2ccccc2[S@+](C)[O-])CC1)c1ccc(Cl)c(Cl)c1 23
CC(C)Nc1cc(-c2[nH]c([S@+](C)[O-])nc2-c2ccc(F)cc2)ccn1 24
CC(C)Nc1cc(-c2[nH]c([S@@+]([O-])C(C)C)nc2-c2ccc(F)cc2)ccn1 25
CC(C)OC(=O)N1CCC(Oc2ncnc3c2CCN3c2ccc([S@+](C)[O-])c(F)c2)CC1 26
CC(C)OC(=O)N1CCC(Oc2ncnc3c2CCN3c2ccc([S@@+](C)[O-])c(F)c2)CC1 27
CC(C)[C@@H](C)Nc1cc(-c2[nH]c([S@+](C)[O-])nc2-c2ccc(F)cc2)ccn1 28
CC(C)[C@H](C)Nc1cc(-c2[nH]c([S@+](C)[O-])nc2-c2ccc(F)cc2)ccn1 29
CC(C)[S@+]([O-])c1nc(-c2ccc(F)cc2)c(-c2ccnc(NC3CCCCC3)c2)[nH]1 30
CC1(C)C(=O)N([C@H]2C3CCCC2C[C@@H](C(N)=O)C3)CC1COc1ccc(C#N)cn1 31
CC1(C)C(=O)N([C@H]2C3CCCC2C[C@H](C(N)=O)C3)CC1COc1ccc(C#N)cn1 32
CC1(C)N=C(c2ccccc2)C(=O)N1[C@@H]1C(=O)N2[C@@H]1[S@+]([O-])C(C)(C)[C@@H]2C(=O)O 33
CC1(C)NC(c2ccccc2)C(=O)N1[C@@H]1C(=O)N2[C@@H]1[S@+]([O-])C(C)(C)[C@@H]2C(=O)O 34
CC1(C)NC(c2ccccc2)C(=O)N1[C@@H]1C(=O)N2[C@@H]1[S@@+]([O-])C(C)(C)[C@@H]2C(=O)O 35
CCCCCCCCCCCCCO[C@@H](CO)[C@H](O)C[S@@+]1C[C@@H](O)[C@H](O)[C@H]1CO 36
CCCCCCCCC[S@@+]([O-])CC(=O)OC 37
CCCCCCCCC[S@@+]([O-])CC(N)=O 38
CCCCCCCC[N@+]1(C)C2CCC1C[C@H](CC(C#N)(c1ccccc1)c1ccccc1)C2 39
CCCCCC[C@@H](C(=O)NO)[S@+]([O-])c1ccc(OC)cc1 40
CCCCCC[C@@H](C(=O)NO)[S@@+]([O-])c1ccc(OC)cc1 41
CCCCCC[N@+]1(C)C2CCC1C[C@H](CC(C#N)(c1ccccc1)c1ccccc1)C2 42
CCCCOCCOc1ccc(-c2ccc3c(c2)C=C(C(=O)Nc2ccc([S@@+]([O-])Cc4cncn4CCC)cc2)CCCCCN3CC(C)C)cc1 43
CCCCOCCOc1ccc(-c2ccc3c(c2)C=C(C(=O)Nc2ccc([S@@+]([O-])Cc4cncn4CCC)cc2)CCCCCN3CCC)cc1 44
CCCCOCCOc1ccc(-c2ccc3c(c2)C=C(C(=O)Nc2ccc([S@@+]([O-])Cc4cncn4CCC)cc2)CCCCN3CC(C)C)cc1 45
CCCCOCCOc1ccc(-c2ccc3c(c2)C=C(C(=O)Nc2ccc([S@@+]([O-])Cc4cncn4CCC)cc2)CCCCN3CCC)cc1 46
CCCCOCCOc1ccc(-c2ccc3c(c2)C=C(C(=O)Nc2ccc([S@@+]([O-])Cc4cncn4CCC)cc2)CCCN3CC(C)C)cc1 47
CCCCOCCOc1ccc(-c2ccc3c(c2)C=C(C(=O)Nc2ccc([S@@+]([O-])Cc4cncn4CCC)cc2)CCCN3CC(C)C)cc1.CS(=O)(=O)O 48
CCCCOCCOc1ccc(-c2ccc3c(c2)C=C(C(=O)Nc2ccc([S@@+]([O-])Cc4cncn4CCC)cc2)CCCN3CCC)cc1 49
CCCCOCCOc1ccc(-c2ccc3c(c2)C=C(C(=O)Nc2ccc([S@@+]([O-])Cc4cncn4CCC)cc2)CCCN3Cc2cnn(C)c2)cc1 50
CCCCOCCOc1ccc(-c2ccc3c(c2)C=C(C(=O)Nc2ccc([S@@+]([O-])Cc4nncn4CCC)cc2)CCCN3CC(C)C)cc1 51
CCCCOCCOc1ccc(-c2ccc3c(c2)C=C(C(=O)Nc2ccc([S@@+]([O-])Cc4nncn4CCC)cc2)CCCN3CCC)cc1 52
CCCC[N@+]1(C)C2CCC1C[C@H](CC(C#N)(c1ccccc1)c1ccccc1)C2 53
CCC[C@H]1CC[C@H]([C@H]2CC[C@@H](OC(=O)[C@H]3[C@H](c4ccc(O)cc4)[C@H](C(=O)O[C@H]4CC[C@@H]([C@H]5CC[C@H](CCC)CC5)CC4)[C@H]3c3ccc(O)cc3)CC2)CC1 54
CCCc1c(C#N)cc2ccccc2c1C(=O)N(C)C[C@@H](CCN1CCC(c2ccccc2[S@+](C)[O-])CC1)c1ccc(Cl)c(Cl)c1 55
CCN1C(=O)c2ccccc2[C@H]1[C@@H](CCN1CCC(c2ccccc2[S@+](C)[O-])CC1)c1ccc(Cl)c(Cl)c1 56
CCN1c2cc(C(=O)NCc3ccc(C#N)cc3)ccc2[S@@+]([O-])c2ccccc2C1=O 57
CCN1c2cc(C(=O)NCc3ccc(F)cc3)ccc2[S@@+]([O-])c2ccccc2C1=O 58
CCN1c2cc(C(=O)NCc3ccc(OC)cc3)ccc2[S@+]([O-])c2ccccc2C1=O 59
CCN1c2cc(C(=O)NCc3ccc(OC)cc3)ccc2[S@@+]([O-])c2ccccc2C1=O 60
CCN1c2cc(C(=O)N[C@H](C)c3ccc(Br)cc3)ccc2[S@@+]([O-])c2ccccc2C1=O 61
CCN1c2cc(C(=O)N[C@H](C)c3ccc4ccccc4c3)ccc2[S@@+]([O-])c2ccccc2C1=O 62
CCO[C@@H](CO)[C@H](O)C[S@@+]1C[C@@H](O)[C@H](O)[C@H]1CO 63
CC[C@H](NC(=O)c1c2ccccc2nc(-c2ccccc2)c1C[S@+](C)[O-])c1ccccc1 64
CC[C@H](NC(=O)c1c2ccccc2nc(-c2ccccc2)c1[S@+](C)[O-])c1ccccc1 65
CC[S@+]([O-])c1ccc(-c2coc3ccc(-c4ccc(C)o4)cc32)cc1 66
CCc1c(C#N)cc2ccccc2c1C(=O)N(C)C[C@@H](CCN1CCC(c2ccccc2[S@+](C)[O-])CC1)c1ccc(Cl)c(Cl)c1 67
CN(C[C@@H](CCN1CCC(c2ccc(Br)cc2[S@+](C)[O-])CC1)c1ccc(Cl)c(Cl)c1)C(=O)c1cc(C#N)cc2ccccc21 68
CN(C[C@@H](CCN1CCC(c2ccc(F)cc2[S@+](C)[O-])CC1)c1ccc(Cl)c(Cl)c1)C(=O)c1cc(C#N)cc2ccccc21 69
CN(C[C@@H](CCN1CCC(c2ccccc2[S@+](C)[O-])CC1)c1ccc(Cl)c(Cl)c1)C(=O)c1c2ccccc2cc(C#N)c1-c1ccccc1 70
CN(C[C@@H](CCN1CCC(c2ccccc2[S@+](C)[O-])CC1)c1ccc(Cl)c(Cl)c1)C(=O)c1cc(C#N)c(Br)c2ccccc12 71
CN(C[C@@H](CCN1CCC(c2ccccc2[S@+](C)[O-])CC1)c1ccc(Cl)c(Cl)c1)C(=O)c1cc(C#N)c(C#N)c2ccccc12 72
CN(C[C@@H](CCN1CCC(c2ccccc2[S@+](C)[O-])CC1)c1ccc(Cl)c(Cl)c1)C(=O)c1cc(C#N)c(F)c2ccccc12 73
CN(C[C@@H](CCN1CCC(c2ccccc2[S@+](C)[O-])CC1)c1ccc(Cl)c(Cl)c1)C(=O)c1cc(C#N)cc2cc(C#N)ccc21 74
CN(C[C@@H](CCN1CCC(c2ccccc2[S@+](C)[O-])CC1)c1ccc(Cl)c(Cl)c1)C(=O)c1cc(C#N)cc2cc(O)ccc21 75
CN(C[C@@H](CCN1CCC(c2ccccc2[S@+](C)[O-])CC1)c1ccc(Cl)c(Cl)c1)C(=O)c1cc(C#N)cc2ccccc21 76
CN1c2cc(C(=O)NCCN3CCCC3)ccc2[S@@+]([O-])c2ccccc2C1=O 77
CN1c2cc(C(=O)NCCc3cccs3)ccc2[S@+]([O-])c2ccccc2C1=O 78
CN1c2cc(C(=O)NCCc3cccs3)ccc2[S@@+]([O-])c2ccccc2C1=O 79
CN1c2cc(C(=O)NCc3ccc(Br)cc3)ccc2[S@@+]([O-])c2ccccc2C1=O 80
CN1c2cc(C(=O)NCc3ccc(Cl)cc3)ccc2[S@@+]([O-])c2ccccc2C1=O 81
CN1c2cc(C(=O)NCc3cccnc3)ccc2[S@@+]([O-])c2ccccc2C1=O 82
COC(=O)[C@H]1CC2COCC(C1)[C@H]2NC(=O)[C@@]1(C)CCCN1S(=O)(=O)c1cccc(Cl)c1C 83
COC(=O)c1ccc2[nH]c([S@+]([O-])Cc3nccc(OC)c3OC)nc2c1 84
COC(=O)c1ccc2[nH]c([S@@+]([O-])Cc3nccc(OC)c3OC)nc2c1 85
COC1(C[S@@+]([O-])c2ccccc2)CCN(CCc2c[nH]c3ccc(F)cc23)CC1 86
COCCCOc1ccnc(C[S@+]([O-])c2nc3ccccc3[nH]2)c1C 87
CO[C@@H](CO)[C@H](O)C[S@@+]1C[C@@H](O)[C@H](O)[C@H]1CO 88
CO[C@@H]([C@H](O)[C@@H](O)CO)[C@H](O)C[S@@+]1C[C@@H](O)[C@H](O)[C@H]1CO 89
COc1c(C#N)cc2ccccc2c1C(=O)N(C)C[C@@H](CCN1CCC(c2ccc(Br)cc2[S@+](C)[O-])CC1)c1ccc(Cl)c(Cl)c1 90
COc1c(C#N)cc2ccccc2c1C(=O)N(C)C[C@@H](CCN1CCC(c2ccccc2[S@+](C)[O-])CC1)c1ccc(Cl)c(Cl)c1 91
COc1cc(-C=C-OC(=O)[C@H]2CC[C@@H](N(C)[C@H]3CC[C@H](C(=O)O-C=C-c4cc(OC)c(OC)c(OC)c4)CC3)CC2)cc(OC)c1OC 92
COc1cc(C(=O)N2CCO[C@@](CCN3CCC4(CC3)c3ccccc3C[S@+]4[O-])(c3ccc(Cl)c(Cl)c3)C2)cc(OC)c1OC 93
COc1cc(C(=O)N2CCO[C@@](CCN3CCC4(CC3)c3ccccc3C[S@@+]4[O-])(c3ccc(Cl)c(Cl)c3)C2)cc(OC)c1OC 94
COc1cc(C(=O)N2CCO[C@](CCN3CCC4(CC3)c3ccccc3C[S@+]4[O-])(c3ccc(Cl)c(Cl)c3)C2)cc(OC)c1OC 95
COc1cc(C(=O)N2CCO[C@](CCN3CCC4(CC3)c3ccccc3C[S@@+]4[O-])(c3ccc(Cl)c(Cl)c3)C2)cc(OC)c1OC 96
COc1cc(OC(=O)[C@@H]2CC[C@H](N(C)[C@@H]3CC[C@@H](C(=O)Oc4cc(OC)c(OC)c(OC)c4)CC3)CC2)cc(OC)c1OC 97
COc1ccc(C2CCN(CC[C@H](CN(C)C(=O)c3c4ccccc4cc(C#N)c3OC)c3ccc(Cl)c(Cl)c3)CC2)c([S@+](C)[O-])c1 98
COc1ccc(C2CCN(CC[C@H](CN(C)C(=O)c3cc(C#N)cc4ccccc43)c3ccc(Cl)c(Cl)c3)CC2)c([S@+](C)[O-])c1 99
COc1ccc(N(C(=O)OC(C)(C)C)[C@@H]2C(C)=C[C@H]([S@@+]([O-])c3ccc(C)cc3)[C@@H]3C(=O)N(C)C(=O)[C@@H]32)cc1 100
COc1ccc([C@@H]2C3(CO)C4[N@](C)C5C6(CO)C([N@@](C)C3C4(CO)[C@H]6c3ccc(OC)cc3)C52CO)cc1 101
COc1ccc([S@+]([O-])c2ccc(C(=O)C3CCN(C4CCCCC4)CC3)cc2)cc1 102
COc1ccc([S@+]([O-])c2ccc(C(C#N)C3CCN(C4CCCCC4)CC3)cc2)cc1 103
COc1ccc([S@+]([O-])c2ccc(C(C#N)N3CCN(C4CCCCC4)CC3)cc2)cc1 104
COc1ccc([S@@+]([O-])c2ccc(C(=O)C3CCN(C4CCCCC4)CC3)cc2)cc1 105
COc1ccc([S@@+]([O-])c2ccc(C(C#N)C3CCN(C4CCCCC4)CC3)cc2)cc1 106
COc1ccc([S@@+]([O-])c2ccc(C(C#N)N3CCN(C4CCCCC4)CC3)cc2)cc1 107
COc1ccc2c(cc(C#N)cc2C(=O)N(C)C[C@@H](CCN2CCC(c3ccccc3[S@+](C)[O-])CC2)c2ccc(Cl)c(Cl)c2)c1 108
COc1ccc2cc(-c3nc(-c4ccc([S@+](C)[O-])cc4C)[nH]c3-c3ccncc3)ccc2c1 109
COc1ccc2cc(-c3nc(-c4ccc([S@@+](C)[O-])cc4C)[nH]c3-c3ccncc3)ccc2c1 110
COc1ccc2nc([S@@+]([O-])Cc3ncc(C)c(OC)c3C)[nH]c2c1 111
COc1ccnc(C[S@@+]([O-])c2nc3ccc(OC(F)F)cc3[nH]2)c1OC 112
CSC[S@+]([O-])C[C@H](CO)NC(=O)-C=C-c1c(C)[nH]c(=O)[nH]c1=O 113
CSC[S@@+]([O-])CC(CO)NC(=O)-C=C-c1c(C)nc(O)nc1O 114
C[C@@H](Nc1cc(-c2[nH]c([S@+](C)[O-])nc2-c2ccc(F)cc2)ccn1)c1ccccc1 115
C[C@@H]1CC[C@H]2[C@@H](C)[C@@H](CCC(=O)Nc3cccc([S@+](C)[O-])c3)O[C@@H]3O[C@@]4(C)CC[C@@H]1[C@]32OO4 116
C[C@@H]1O[C@H]([C@@H]2CCC[N+]2(C)C)C[S@+]1[O-] 117
C[C@@H]1O[C@H]([C@@H]2CCC[N+]2(C)C)C[S@@+]1[O-] 118
C[C@@H]1O[C@H]([C@H]2CCC[N+]2(C)C)C[S@+]1[O-] 119
C[C@@H]1O[C@H]([C@H]2CCC[N+]2(C)C)C[S@@+]1[O-] 120
C[C@H](Nc1cc(-c2[nH]c([S@+](C)[O-])nc2-c2ccc(F)cc2)ccn1)c1ccccc1 121
C[C@H]1O[C@@H]([C@@H]2CCCN2C)C[S@+]1[O-] 122
C[C@H]1O[C@@H]([C@@H]2CCCN2C)C[S@@+]1[O-] 123
C[C@H]1O[C@@H]([C@@H]2CCC[N+]2(C)C)C[S@+]1[O-] 124
C[C@H]1O[C@@H]([C@@H]2CCC[N+]2(C)C)C[S@@+]1[O-] 125
C[N+](C)(C)C[C@@H]1C[S@+]([O-])C(C2CCCCC2)(C2CCCCC2)O1 126
C[N+](C)(C)C[C@@H]1C[S@@+]([O-])C(C2CCCCC2)(C2CCCCC2)O1 127
C[N@+]1([O-])[C@H]2CC[C@H]1C[C@H](OC(=O)[C@@H](CO)c1ccccc1)C2 128
C[N@@+]1(CC2CC2)C2CCC1C[C@@H](CC(C#N)(c1ccccc1)c1ccccc1)C2 129
C[N@@+]1(CC2CC2)C2CCC1C[C@H](CC(C#N)(c1ccccc1)c1ccccc1)C2 130
C[N@@+]1(CCCOCc2ccccc2)C2CCC1C[C@H](CC(C#N)(c1ccccc1)c1ccccc1)C2 131
C[N@@+]1(CCCOc2ccccc2)C2CCC1C[C@H](CC(C#N)(c1ccccc1)c1ccccc1)C2 132
C[N@@+]1(CCCc2ccccc2)C2CCC1C[C@H](CC(C#N)(c1ccccc1)c1ccccc1)C2 133
C[N@@+]1(CCOCc2ccccc2)C2CCC1C[C@H](CC(C#N)(c1ccccc1)c1ccccc1)C2 134
C[N@@+]1(CCOc2ccccc2)C2CCC1C[C@H](CC(C#N)(c1ccccc1)c1ccccc1)C2 135
C[S@+]([O-])CCCCN=C=S 136
C[S@+]([O-])CC[C@@](CO)(C(=O)O[C@H]1CN2CCC1CC2)c1ccccc1 137
C[S@+]([O-])C[C@@H]1[C@@H](O)[C@]23CC[C@H]1C[C@H]2[C@]1(C)CCC[C@](C)(C(=O)O)[C@H]1CC3 138
C[S@+]([O-])C[C@](C)(O)[C@H]1OC(=O)C=C2[C@@]13O[C@@H]3[C@H]1OC(=O)[C@@]3(C)C=CC[C@@]2(C)[C@@H]13 139
C[S@+]([O-])c1ccc(C2=C(c3ccccc3)C(=O)OC2)cc1 140
C[S@+]([O-])c1cccc2c3c(n(Cc4ccc(Cl)cc4)c21)[C@@H](CC(=O)O)CCC3 141
C[S@+]([O-])c1nc(-c2ccc(F)cc2)c(-c2ccnc(NC(=O)Cc3ccc(F)cc3)c2)[nH]1 142
C[S@+]([O-])c1nc(-c2ccc(F)cc2)c(-c2ccnc(NC3CCCCC3)c2)[nH]1 143
C[S@+]([O-])c1nc(-c2ccc(F)cc2)c(-c2ccnc(NC3CCOCC3)c2)[nH]1 144
C[S@@+]([O-])CCCC-C(=N-OS(=O)(=O)[O-])S[C@@H]1O[C@H](CO)[C@@H](O)[C@H](O)[C@H]1O 145
C[S@@+]([O-])CCCCN=C=S 146
C[S@@+]([O-])C[C@@H]1[C@@H](O)[C@]23CC[C@H]1C[C@H]2[C@]1(C)CCC[C@](C)(C(=O)O)[C@H]1CC3 147
C[S@@+]([O-])Cc1ccc(C(=O)Nc2cccnc2C(=O)NCC2CCOCC2)c2ccccc12 148
Cc1c(C#N)cc(C(=O)N(C)C[C@@H](CCN2CCC(c3ccccc3[S@+](C)[O-])CC2)c2ccc(Cl)c(Cl)c2)c2ccccc12 149
Cc1c(C#N)cc2ccccc2c1C(=O)N(C)C[C@@H](CCN1CCC(c2ccccc2[S@+](C)[O-])CC1)c1ccc(Cl)c(Cl)c1 150
Cc1c(OCC(F)(F)F)ccnc1C[S@@+]([O-])c1nc2ccccc2[nH]1 151
Cc1cc(=O)c(Oc2ccc(Br)cc2F)c(-c2ccc([S@@+](C)[O-])cc2)o1 152
Cc1cc(=O)c(Oc2ccc(Cl)cc2F)c(-c2ccc([S@@+](C)[O-])cc2)o1 153
Cc1cc(=O)c(Oc2ccc(F)cc2F)c(-c2ccc([S@+](C)[O-])cc2)o1 154
Cc1ccc(-c2ccc3occ(-c4ccc([S@+](C)[O-])cc4)c3c2)o1 155
Cc1ccc(-c2ncc(Cl)cc2-c2ccc([S@+](C)[O-])cc2)cn1 156
Cc1ccc([S@+]([O-])-C(F)=C-c2ccccn2)cc1 157
Cc1ccc([S@+]([O-])c2occc2C=O)cc1 158
Cc1nc(O)nc(O)c1-C=C-C(=O)N[C@@H](CO)C[S@+]([O-])CCl 159
Cc1nc(O)nc(O)c1-C=C-C(=O)N[C@@H](CO)C[S@@+]([O-])CCl 160
NC(=O)C[S@+]([O-])C(c1ccccc1)c1ccccc1 161
NC(=O)C[S@@+]([O-])C(c1ccccc1)c1ccccc1 162
O.O.O.O.[Sr+2].COc1ccc2[n-]c([S@@+]([O-])Cc3ncc(C)c(OC)c3C)nc2c1.COc1ccc2[n-]c([S@@+]([O-])Cc3ncc(C)c(OC)c3C)nc2c1 163
O=C(O)CC-C=C-CC[C@H]1[C@H](OCc2ccc(-c3ccccc3)cc2)C[S@+]([O-])[C@@H]1c1cccnc1 164
O=C(O)CC-C=C-CC[C@H]1[C@H](OCc2ccc(-c3ccccc3)cc2)C[S@@+]([O-])[C@@H]1c1cccnc1 165
O=S(=O)([O-])OCCC[S@@+]1C[C@@H](O)[C@H](O)[C@H]1CO 166
O=S(=O)([O-])OC[C@@H](O)C[S@@+]1C[C@@H](O)[C@H](O)[C@H]1CO 167
O=S(=O)([O-])OC[C@H](O)C[S@@+]1C[C@@H](O)[C@H](O)[C@H]1CO 168
O=S(=O)([O-])O[C@@H](CO)CC[S@@+]1C[C@@H](O)[C@H](O)[C@H]1CO 169
O=S(=O)([O-])O[C@@H](CO)[C@H](O)C[S@@+]1C[C@@H](O)[C@H](O)[C@H]1CO 170
O=S(=O)([O-])O[C@@H]([C@H](O)C[S@@+]1C[C@@H](O)[C@H](O)[C@H]1CO)[C@@H](O)[C@@H](O)[C@@H](O)CO 171
O=S(=O)([O-])O[C@@H]([C@H](O)C[S@@+]1C[C@@H](O)[C@H](O)[C@H]1CO)[C@@H](O)[C@@H](O)[C@H](O)CO 172
O=S(=O)([O-])O[C@@H]([C@H](O)C[S@@+]1C[C@@H](O)[C@H](O)[C@H]1CO)[C@@H](O)[C@H](O)[C@@H](O)CO 173
O=S(=O)([O-])O[C@@H]([C@H](O)C[S@@+]1C[C@@H](O)[C@H](O)[C@H]1CO)[C@@H](O)[C@H](O)[C@H](O)CO 174
O=S(=O)([O-])O[C@@H]([C@H](O)[C@@H](O)CO)[C@H](O)C[S@@+]1C[C@@H](O)[C@H](O)[C@H]1CO 175
O=S(=O)([O-])O[C@H](CO)[C@@H](O)C[S@@+]1C[C@@H](O)[C@H](O)[C@H]1CO 176
O=S(=O)([O-])O[C@H]([C@H](O)C[S@@+]1C[C@@H](O)[C@H](O)[C@H]1CO)[C@@H](O)[C@@H](O)[C@@H](O)CO 177
O=S(=O)([O-])O[C@H]([C@H](O)C[S@@+]1C[C@@H](O)[C@H](O)[C@H]1CO)[C@@H](O)[C@@H](O)[C@H](O)CO 178
O=S(=O)([O-])O[C@H]([C@H](O)C[S@@+]1C[C@@H](O)[C@H](O)[C@H]1CO)[C@@H](O)[C@H](O)[C@@H](O)CO 179
O=S(=O)([O-])O[C@H]([C@H](O)C[S@@+]1C[C@@H](O)[C@H](O)[C@H]1CO)[C@@H](O)[C@H](O)[C@H](O)CO 180
OCC12C3[N@](Cc4ccc(O)cc4)C4C5(CO)C([N@@](Cc6ccc(O)cc6)C1C3(CO)[C@H]5c1ccccc1)C4(CO)[C@H]2c1ccccc1 181
OCC12C3[N@](Cc4ccc(OCc5ccccc5)cc4)C4C5(CO)C([N@@](Cc6ccc(OCc7ccccc7)cc6)C1C3(CO)[C@@H]5c1ccccc1)C4(CO)[C@@H]2c1ccccc1 182
OCC12C3[N@](Cc4cccc(OCc5ccccc5)c4)C4C5(CO)C([N@@](Cc6cccc(OCc7ccccc7)c6)C1C3(CO)[C@H]5c1ccccc1)C4(CO)[C@H]2c1ccccc1 183
OCC12C3[N@](Cc4ccccc4)C4C5(CO)C([N@@](Cc6ccccc6)C1C3(CO)[C@H]5c1ccc(O)cc1)C4(CO)[C@H]2c1ccc(O)cc1 184
OCC12C3[N@](Cc4ccccc4)C4C5(CO)C([N@@](Cc6ccccc6)C1C3(CO)[C@H]5c1ccc(OCc3ccccc3)cc1)C4(CO)[C@H]2c1ccc(OCc2ccccc2)cc1 185
OCC12C3[N@](Cc4ccccc4)C4C5(CO)C([N@@](Cc6ccccc6)C1C3(CO)[C@H]5c1cccc(O)c1)C4(CO)[C@H]2c1cccc(O)c1 186
OCC12C3[N@](Cc4ccccc4)C4C5(CO)C([N@@](Cc6ccccc6)C1C3(CO)[C@H]5c1ccccc1)C4(CO)[C@H]2c1ccccc1 187
OCC12C3[N@](Cc4cccnc4)C4C5(CO)C([N@@](Cc6cccnc6)C1C3(CO)[C@H]5c1ccccc1)C4(CO)[C@H]2c1ccccc1 188
OCC12C3[N@](Cc4ccncc4)C4C5(CO)C([N@@](Cc6ccncc6)C1C3(CO)[C@H]5c1ccccc1)C4(CO)[C@H]2c1ccccc1 189
OC[C@@H](O)C[S@@+]1C[C@@H](O)[C@H](O)[C@H]1CO 190
OC[C@@H](O)[C@@H](O)[C@H](O)[C@H](O)C[S@@+]1C[C@@H](O)[C@H](O)[C@H]1CO 191
OC[C@H](O)C[S@@+]1C[C@@H](O)[C@H](O)[C@H]1CO 192
OC[C@H](O)[C@@H](O)[C@@H](O)[C@H](O)C[S@@+]1C[C@@H](O)[C@H](O)[C@H]1CO 193
OC[C@H](O)[C@@H](O)[C@@H](O)[C@H](O)[C@H](O)C[S@@+]1C[C@@H](O)[C@H](O)[C@H]1CO 194
OC[C@H](O)[C@@H](O)[C@H](O)[C@H](O)C[S@@+]1C[C@@H](O)[C@H](O)[C@H]1CO 195
OC[C@H](OCc1ccccc1)[C@H](O)C[S@@+]1C[C@@H](O)[C@H](O)[C@H]1CO.F[B-](F)(F)F 196
[Br-].C=CCCCC[N@+]1(C)C2CCC1C[C@H](CC(C#N)(c1ccccc1)c1ccccc1)C2 197
[Br-].CCCCCCCC[N@+]1(C)C2CCC1C[C@H](CC(C#N)(c1ccccc1)c1ccccc1)C2 198
[Br-].CCCCCC[N@+]1(C)C2CCC1C[C@H](CC(C#N)(c1ccccc1)c1ccccc1)C2 199
[Br-].CCCC[N@+]1(C)C2CCC1C[C@H](CC(C#N)(c1ccccc1)c1ccccc1)C2 200
[Br-].C[N@@+]1(CC2CC2)C2CCC1C[C@H](CC(C#N)(c1ccccc1)c1ccccc1)C2 201
[Br-].C[N@@+]1(CCCOCc2ccccc2)C2CCC1C[C@H](CC(C#N)(c1ccccc1)c1ccccc1)C2 202
[Br-].C[N@@+]1(CCCOc2ccccc2)C2CCC1C[C@H](CC(C#N)(c1ccccc1)c1ccccc1)C2 203
[Br-].C[N@@+]1(CCCc2ccccc2)C2CCC1C[C@H](CC(C#N)(c1ccccc1)c1ccccc1)C2 204
[Br-].C[N@@+]1(CCOc2ccccc2)C2CCC1C[C@H](CC(C#N)(c1ccccc1)c1ccccc1)C2 205
[Cl-].CCCCCCCCCCCCCO[C@@H](CO)[C@H](O)C[S@@+]1C[C@@H](O)[C@H](O)[C@H]1CO 206
[Cl-].CCO[C@@H](CO)[C@H](O)C[S@@+]1C[C@@H](O)[C@H](O)[C@H]1CO 207
[Cl-].CO[C@@H](CO)[C@H](O)C[S@@+]1C[C@@H](O)[C@H](O)[C@H]1CO 208
[Cl-].CO[C@@H]([C@H](O)[C@@H](O)CO)[C@H](O)C[S@@+]1C[C@@H](O)[C@H](O)[C@H]1CO 209
[Cl-].OC[C@@H](O)C[S@@+]1C[C@@H](O)[C@H](O)[C@H]1CO 210
[Cl-].OC[C@@H](O)[C@@H](O)[C@H](O)[C@H](O)C[S@@+]1C[C@@H](O)[C@H](O)[C@H]1CO 211
[Cl-].OC[C@H](O)C[S@@+]1C[C@@H](O)[C@H](O)[C@H]1CO 212
[Cl-].OC[C@H](O)[C@@H](O)[C@@H](O)[C@H](O)C[S@@+]1C[C@@H](O)[C@H](O)[C@H]1CO 213
[Cl-].OC[C@H](O)[C@@H](O)[C@H](O)[C@H](O)C[S@@+]1C[C@@H](O)[C@H](O)[C@H]1CO 214
[I-].C[C@@H]1O[C@H]([C@@H]2CCC[N+]2(C)C)C[S@+]1[O-] 215
[I-].C[C@@H]1O[C@H]([C@@H]2CCC[N+]2(C)C)C[S@@+]1[O-] 216
[I-].C[C@@H]1O[C@H]([C@H]2CCC[N+]2(C)C)C[S@+]1[O-] 217
[I-].C[C@@H]1O[C@H]([C@H]2CCC[N+]2(C)C)C[S@@+]1[O-] 218
[I-].C[C@H]1O[C@@H]([C@@H]2CCC[N+]2(C)C)C[S@+]1[O-] 219
[I-].C[C@H]1O[C@@H]([C@@H]2CCC[N+]2(C)C)C[S@@+]1[O-] 220
[I-].C[N+](C)(C)C[C@@H]1C[S@+]([O-])C(C2CCCCC2)(C2CCCCC2)O1 221
[I-].C[N+](C)(C)C[C@@H]1C[S@@+]([O-])C(C2CCCCC2)(C2CCCCC2)O1 222
[I-].C[N@@+]1(CCOCc2ccccc2)C2CCC1C[C@H](CC(C#N)(c1ccccc1)c1ccccc1)C2 223
[K+].COc1ccc2[n-]c([S@@+]([O-])Cc3ncc(C)c(OC)c3C)nc2c1 224
[Mg+2].COc1ccc2[n-]c([S@@+]([O-])Cc3ncc(C)c(OC)c3C)nc2c1.COc1ccc2[n-]c([S@@+]([O-])Cc3ncc(C)c(OC)c3C)nc2c1 225
[Na+].COc1ccc2[n-]c([S@@+]([O-])Cc3ncc(C)c(OC)c3C)nc2c1 226
[O-]CC[C@H](O)C[S@@+]1C[C@@H](O)[C@H](O)[C@H]1CO 227
[O-]C[C@@H](O)[C@@H](O)C[S@@+]1C[C@@H](O)[C@H](O)[C@H]1CO 228
[O-][C@@H](CO)[C@H](O)C[S@@+]1C[C@@H](O)[C@H](O)[C@H]1CO 229
[O-][S@+](CC1(O)CCN(CCc2c[nH]c3ccc(F)cc23)CC1)c1ccccc1 230
[O-][S@+](Cc1cc(OCC2CC2)ccn1)c1nc2cc(F)ccc2[nH]1 231
[O-][S@@+](Cc1cc(OCC2CC2)ccn1)c1nc2cc(F)ccc2[nH]1 232

I've been trying to make sense of the 232 failures. The following sections describe what I found.

RDKit bug reports

The results of my investigations lead to three RDKit bug reports:

In the first, Greg identified that FastFindRings() isn't putting the two chiral atoms into the same primitive ring, so AssignStereochemistry() isn't seeing that this is an instance of ring stereochemistry.

In the second, Greg points to the August 2015 thread titled "Stereochemistry - Differences between RDKit & Indigo" on the RDKit mailing list. Greg comments about nitrogen chirality:

There are two things going on here in the RDKit: 1) Ring stereochemistry 2) stereochemistry about nitrogen centers. Let's start with the second, because it's easier: RDKit does not generally "believe in" stereochemistry around three coordinate nitrogens. ... Back to the first: ring stereochemistry. ... The way the RDKit handles this is something of a hack: it doesn't identify those atoms as chiral centers, but it does preserve the chiral tags when generating a canonical SMILES:

Need SanitizeMol(), not ClearComputedProps()

He proposes that as a workaround I sanitize the newly created molecule, so I replaced the call to ClearComputedProps() with one to SanitizeMol(), near the end of fragment_chiral(), as shown here:

    # After breaking bonds, should re-sanitize or at least call
    # ClearComputedProps().
    # See https://github.com/rdkit/rdkit/issues/1022#issuecomment-239355482
    new_mol = rwmol.GetMol()
    #new_mol.ClearComputedProps()
    Chem.SanitizeMol(new_mol)
    return new_mol

With that in place, the number of records which fail my test dropped from 232 to 195. All 181 of the chiral sulfurs still fail, 11 of the 34 chiral nitrogens still fail, all of the chiral carbon bridgeheads now pass, and the 3 remaining chiral carbons still fail.

(I also tested with both ClearComputedProps() and SanitizeMol(), but using both made no difference.)

While better, it's not substantially better. What's going on?

RDKit can produce non-canonical SMILES

At this point we're pushing the edge of what RDKit can handle. A few paragraphs ago I quoted Greg as saying that ring chirality is "something of a hack". I think that's the reason why, of the 232 records that cause a problem, 67 of them don't produce a stable SMILES string. That is, if I parse what should be a canonicalized SMILES string and recanonicalize it, I get a different result. The canonicalization is bi-stable, in that recanonicalization swaps between two possibilities, with a different chirality assignment each time.

Here's a reproducible example if you want to try it out yourself. (By the time you read this it's likely been fixed!)

from rdkit import Chem

def Canon(smiles):
    mol = Chem.MolFromSmiles(smiles)
    return Chem.MolToSmiles(mol, isomericSmiles=True)

def check_if_canonical(smiles):
    s1 = Canon(smiles)
    s2 = Canon(s1)
    if s1 != s2:
        print("Failed to canonicalize", smiles)
        print(" input:", smiles)
        print("canon1:", s1)
        print("canon2:", s2)
        print("canon3:", Canon(s2)) 
    else:
        print("Passed", smiles)
    
for smiles in (
    "O[C@H]1CC2CCC(C1)[N@@]2C",
    "C[C@]1(c2cccnc2)CCCC[S@+]1O",
    "[C@]1C[S@+]1O"):
    check_if_canonical(smiles)

The output from this is:
Failed to canonicalize O[C@H]1CC2CCC(C1)[N@@]2C
 input: O[C@H]1CC2CCC(C1)[N@@]2C
canon1: C[N@]1C2CCC1C[C@@H](O)C2
canon2: C[N@]1C2CCC1C[C@H](O)C2
canon3: C[N@]1C2CCC1C[C@@H](O)C2
Failed to canonicalize C[C@]1(c2cccnc2)CCCC[S@+]1O
 input: C[C@]1(c2cccnc2)CCCC[S@+]1O
canon1: C[C@]1(c2cccnc2)CCCC[S@@+]1O
canon2: C[C@]1(c2cccnc2)CCCC[S@+]1O
canon3: C[C@]1(c2cccnc2)CCCC[S@@+]1O
Failed to canonicalize [C@]1C[S@+]1O
 input: [C@]1C[S@+]1O
canon1: O[S@@+]1[C]C1
canon2: O[S@+]1[C]C1
canon3: O[S@@+]1[C]C1

SanitizeMol adds some overhead

The SanitizeMol() call is a work-around for a problem which is under investigation. I'll use SanitizeMol() for the rest of this essay, but bear in mind that it adds overhead. In one timing test I did, the version with ClearComputedProps() took 107 seconds while the version with SanitizeMol() took 255 seconds.
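The timing numbers above came from running the full test. A minimal harness along these lines (time.perf_counter() from the standard library; the workload shown is a stand-in, not the essay's test script) is enough to compare the two variants yourself:

```python
import time

def time_call(func, *args, **kwargs):
    # Run func once and return (result, elapsed seconds).
    t0 = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - t0

# Stand-in workload; swap in the ClearComputedProps() and
# SanitizeMol() versions of the fragmentation loop to compare them.
result, elapsed = time_call(sum, range(100000))
print(result, elapsed)
```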

You'll have to decide for yourself if the small number of additional correct structures is worth the extra time. And, hopefully it's been fixed by the time you read this essay so you don't need a workaround.

Bridgeheads

Many of the failures were due to chiral bridgehead atoms. I used the following two SMARTS to detect bridgeheads:

depiction of two bridgehead topologies
*~1~*~*(~*~*~2)~*~*~2~*~1
*~1~*~*(~*~*~*~2)~*~*~2~*~1

Before I added the SanitizeMol() call, there were 34 chiral nitrogen structures which failed. Of those 34, only 11 are still failures after adding the call. Of those 11, one is a normal-looking bridgehead:
a nitrogen bridgehead that has a bistable SMILES

CC(C(=O)O[C@@H]1CC2CCC(C1)[N@]2C)(c1ccccc1)c1ccccc1
It's the only one of the simple nitrogen bridgehead structures which doesn't have a stable canonicalization. (I used the core bridgehead from this structure as the first test case in the previous section, where I showed a few bi-stable SMILES strings.)

The other 10 of the 11 nitrogen bridgehead failures have a more complex ring system, like:

a complex chiral nitrogen ring system
OCC12C3[N@](Cc4cccnc4)C4C5(CO)C([N@@](Cc6cccnc6)C1C3(CO)[C@H]5c1ccccc1)C4(CO)[C@H]2c1ccccc1
All of these have a bi-stable canonicalization.

I also looked at the chiral carbon bridgeheads which failed. Of the original 14, all 14 of them pass after I added the SanitizeMol() call.

The remaining structures

There are three chiral structures which fail even after sanitization, which do not contain a chiral nitrogen or chiral sulfur, and which do not contain a bridgehead. These are:

CCC[C@H]1CC[C@H]([C@H]2CC[C@@H](OC(=O)[C@H]3[C@H](c4ccc(O)cc4)[C@H](C(=O)O[C@H]4CC[C@@H]([C@H]5CC[C@H](CCC)CC5)CC4)[C@H]3c3ccc(O)cc3)CC2)CC1
COc1cc(-C=C-OC(=O)[C@H]2CC[C@@H](N(C)[C@H]3CC[C@H](C(=O)O-C=C-c4cc(OC)c(OC)c(OC)c4)CC3)CC2)cc(OC)c1OC
COc1cc(OC(=O)[C@@H]2CC[C@H](N(C)[C@@H]3CC[C@@H](C(=O)Oc4cc(OC)c(OC)c(OC)c4)CC3)CC2)cc(OC)c1OC
Upon investigation, all three seem to involve the ring chirality solution that Greg called a "hack". I did not investigate further.

The final code

That was a lot of text. And a lot of work. If you made it this far, congratulations. Oddly, I still have more to write about on the topic.

I'll leave you with the final version of the code, with various tweaks and comments that I didn't discuss in the essay. As a bonus, it includes an implementation of fragment_chiral() which uses RDKit's FragmentOnBonds() function, which is the function you should be using to fragment bonds.

# Cut an RDKit molecule on a specified bond, and replace the old terminals with wildcard atoms ("*").
# The code includes a test suite which depends on an external SMILES file.
#
# This code is meant as a study of the low-level operations. For production use,
# see the commented out function which uses RDKit's built-in FragmentOnBonds().
#
# Written by Andrew Dalke <dalke@dalkescientific.com>.

from __future__ import print_function

from rdkit import Chem

# You can get a copy of this library from:
# http://www.dalkescientific.com/writings/diary/archive/2016/08/09/fragment_achiral_molecules.html#smiles_syntax.py
from smiles_syntax import convert_wildcards_to_closures


CHI_TETRAHEDRAL_CW = Chem.ChiralType.CHI_TETRAHEDRAL_CW
CHI_TETRAHEDRAL_CCW = Chem.ChiralType.CHI_TETRAHEDRAL_CCW

def parity_shell(values):
    # Simple O(N^2) exchange sort; fine because we only deal with at most 4 values
    values = list(values)
    N = len(values)
    num_swaps = 0
    for i in range(N-1):
        for j in range(i+1, N):
            if values[i] > values[j]:
                values[i], values[j] = values[j], values[i]
                num_swaps += 1
    return num_swaps % 2


def get_bond_parity(mol, atom_id):
    """Compute the parity of the atom's bond permutation

    Return None if it does not have tetrahedral chirality,
    0 for even parity, or 1 for odd parity.
    """
    atom_obj = mol.GetAtomWithIdx(atom_id)
    
    # Return None unless it has tetrahedral chirality
    chiral_tag = atom_obj.GetChiralTag()
    if chiral_tag not in (CHI_TETRAHEDRAL_CW, CHI_TETRAHEDRAL_CCW):
        return None
    
    # Get the list of atom ids for each atom it's bonded to.
    other_atom_ids = [bond.GetOtherAtomIdx(atom_id) for bond in atom_obj.GetBonds()]
    
    # Use those ids to determine the parity
    return parity_shell(other_atom_ids)


def set_bond_parity(mol, atom_id, old_parity, old_other_atom_id, new_other_atom_id):
    """Compute the new bond parity and flip chirality if needed to match the old parity"""
    
    atom_obj = mol.GetAtomWithIdx(atom_id)
    # Get the list of atom ids for each atom it's bonded to.
    other_atom_ids = [bond.GetOtherAtomIdx(atom_id) for bond in atom_obj.GetBonds()]
    
    # Replace id from the new wildcard atom with the id of the original atom
    i = other_atom_ids.index(new_other_atom_id)
    other_atom_ids[i] = old_other_atom_id
    
    # Use those ids to determine the parity
    new_parity = parity_shell(other_atom_ids)
    if old_parity != new_parity:
        # If the parity has changed, invert the chirality
        atom_obj.InvertChirality()

# You should really use the commented-out function below, which uses
# RDKit's own fragmentation code. Both do the same thing.

def fragment_chiral(mol, atom1, atom2):
    """Cut the bond between atom1 and atom2 and replace with connections to wildcard atoms

    Return the fragmented structure as a new molecule.
    """
    rwmol = Chem.RWMol(mol)
    
    atom1_parity = get_bond_parity(mol, atom1)
    atom2_parity = get_bond_parity(mol, atom2)
    
    rwmol.RemoveBond(atom1, atom2)
    wildcard1 = rwmol.AddAtom(Chem.Atom(0))
    wildcard2 = rwmol.AddAtom(Chem.Atom(0))
    new_bond1 = rwmol.AddBond(atom1, wildcard1, Chem.BondType.SINGLE)
    new_bond2 = rwmol.AddBond(atom2, wildcard2, Chem.BondType.SINGLE)
    
    if atom1_parity is not None:
        set_bond_parity(rwmol, atom1, atom1_parity, atom2, wildcard1)
    if atom2_parity is not None:
        set_bond_parity(rwmol, atom2, atom2_parity, atom1, wildcard2)
    
    # After breaking bonds, should re-sanitize, or at least call
    # ClearComputedProps(). See
    # https://github.com/rdkit/rdkit/issues/1022#issuecomment-239355482
    # I found that SanitizeMol() improves the handling of chiral
    # carbon bridgeheads, though using it doubles the execution time.
    new_mol = rwmol.GetMol()
    # If you don't sanitize then you must call ClearComputedProps()
    #new_mol.ClearComputedProps()
    Chem.SanitizeMol(new_mol)
    return new_mol

#### Use this code for production
## def fragment_chiral(mol, atom1, atom2):
##     bond = mol.GetBondBetweenAtoms(atom1, atom2)
##     new_mol = Chem.FragmentOnBonds(mol, [bond.GetIdx()], dummyLabels=[(0, 0)])
##     # FragmentOnBonds() calls ClearComputedProps() at the end.  There
##     # is a current bug report where, as a downstream effect, that may
##     # cause some chiralities to change, most notably on some
##     bridgeheads. A workaround for now is to call SanitizeMol(),
##     # though that ends up tripling the time. I'll stay compatible
##     # with FragmentOnBonds() and not call it.
##     #Chem.SanitizeMol(new_mol)
##     return new_mol


##### ##### ##### ##### Test code ##### ##### ##### ##### #####

# Create a canonical isomeric SMILES from a SMILES string
# Used to put the manually-developed reference structures into canonical form.
def Canon(smiles):
    mol = Chem.MolFromSmiles(smiles)
    assert mol is not None, smiles
    return Chem.MolToSmiles(mol, isomericSmiles=True)
        
def simple_test():
    for smiles, expected in (
            ("CC", Canon("*C.*C")),
            ("F[C@](Cl)(Br)O", Canon("*F.*[C@](Cl)(Br)O")),
            ("F[C@@](Cl)(Br)O", Canon("*F.*[C@@](Cl)(Br)O")),
            ("F[C@@H](Br)O", Canon("*F.*[C@@H](Br)O")),
            ):
        mol = Chem.MolFromSmiles(smiles)
        fragmented_mol = fragment_chiral(mol, 0, 1)
        fragmented_smiles = Chem.MolToSmiles(fragmented_mol, isomericSmiles=True)
        if fragmented_smiles != expected:
            print("smiles:", smiles)
            print("fragmented:", fragmented_smiles)
            print("  expected:", expected)

# Match a single bond not in a ring
BOND_SMARTS = "[!#0;!#1]-!@[!#0;!#1]"
single_bond_pat = Chem.MolFromSmarts(BOND_SMARTS)


_bridgehead1_pat = Chem.MolFromSmarts("*~1~*~*(~*~*~*~2)~*~*~2~*~1")
_bridgehead2_pat = Chem.MolFromSmarts("*~1~*~*(~*~*~2)~*~*~2~*~1")

def is_bridgehead(mol):
    """Test if the molecule contains one of the bridgehead patterns"""
    return (mol.HasSubstructMatch(_bridgehead1_pat) or
            mol.HasSubstructMatch(_bridgehead2_pat))

def file_test():
    # Point this to a SMILES file to test
    filename = "/Users/dalke/databases/chembl_20_rdkit.smi"
    
    with open(filename) as infile:
        num_records = num_successes = num_failures = 0
        for lineno, line in enumerate(infile):
            # Give some progress feedback
            if lineno % 100 == 0:
                print("Processed", lineno, "lines and", num_records,
                      "records. Successes:", num_successes,
                      "Failures:", num_failures)
            
            # Only test structures with a chiral atom
            input_smiles = line.split()[0]
            if "@" not in input_smiles:
                continue
            
            # The code doesn't handle directional bonds. Convert them
            # to single bonds
            if "/" in input_smiles:
                input_smiles = input_smiles.replace("/", "-")
            if "\\" in input_smiles:
                input_smiles = input_smiles.replace("\\", "-")
            
            mol = Chem.MolFromSmiles(input_smiles)
            if mol is None:
                continue
            
            ### Toggle the 'continue' lines to choose which subset to test
            if is_bridgehead(mol):
                pass
                #continue
            else:
                pass
                continue
            num_records += 1
            
            # I expect the reassembled structure to match this canonical SMILES
            expected_smiles = Chem.MolToSmiles(mol, isomericSmiles=True)
            
            # Cut each of the non-ring single bonds between two heavy atoms
            matches = mol.GetSubstructMatches(single_bond_pat)
            has_failure = False
            for begin_atom, end_atom in matches:
                # Fragment
                fragmented_mol = fragment_chiral(mol, begin_atom, end_atom)
                fragmented_smiles = Chem.MolToSmiles(fragmented_mol, isomericSmiles=True)
                assert "." in fragmented_smiles, fragmented_smiles # safety check
                
                # Convert the "*"s to the correct "%90" closures
                closure_smiles = convert_wildcards_to_closures(fragmented_smiles, (0, 0))
                assert "%90" in closure_smiles, closure_smiles # safety check
                closure_mol = Chem.MolFromSmiles(closure_smiles)
                
                # canonicalize and compare; report any mismatches
                final_smiles = Chem.MolToSmiles(closure_mol, isomericSmiles=True)
                if final_smiles != expected_smiles:
                    print("FAILURE in record", num_records)
                    print("     input_smiles:", input_smiles)
                    print("  begin/end atoms:", begin_atom, end_atom)
                    print("fragmented smiles:", fragmented_smiles)
                    print("   closure smiles:", closure_smiles)
                    print("     final smiles:", final_smiles)
                    print("  expected smiles:", expected_smiles)
                    has_failure = True
            if has_failure:
                num_failures += 1
            else:
                num_successes += 1
                #print("SUCCESS", input_smiles)
        
        print("Done. Records:", num_records, "Successes:", num_successes,
              "Failures:", num_failures)
            
if __name__ == "__main__":
    simple_test()
    file_test()

Thanks!

Thanks to Greg Landrum both for RDKit and for help in tracking down some of the stubborn cases. Thanks also to the University of Hamburg for SMARTSViewer, which I use as a SMILES structure viewer so I don't have to worry about bond type, aromaticity, or chiral re-interpretations.

Edits

This essay has had a few post-publication edits. I added more emphasis that this essay exists for pedagogical reasons, described further that SanitizeMol() is meant as a workaround for a problem currently under investigation, and moved the ChEMBL SDF analysis code to its own post.



Fragment achiral molecules in RDKit using low-level functions #

This essay shows a simple way to fragment a non-chiral molecule (okay, a "molecular graph") using low-level RDKit calls, and then a more complex way to re-connect the fragments by manipulating the SMILES string at the syntax level and recanonicalizing. (The correct way to fragment a bond in RDKit is to use FragmentOnBonds(), not the code I cover here, which is meant as the basis for understanding how something like FragmentOnBonds() works.)

It is part of a series of essays on how to fragment a molecular graph.

I'll start by fragmenting the methoxy group of vanillin. (I wrote an essay on the vanillin functional groups.) I want to turn the SMILES of "c1(C=O)cc(OC)c(O)cc1" into two parts, '[*]OC' (the methoxy group) and 'O=Cc1cc[*]c(O)cc1' (the rest of the structure). The "[*]" terms are a way to represent an R-group in SMILES.

The method I'll use is to identify the bond and its two end atoms, delete the bond, and for each of the end atoms add a new "*" atom, which is an atom with atomic number 0.

I'll identify the bond manually, by counting the atom indices and identifying the substring which contains the methoxy:

                              1   ] atom indices, from 0 to 10.
            0  1 2 34 56 7 8 90   ]   (The top line is the tens.)
 vanillin = c1(C=O)cc(OC)c(O)cc1
                      ^^-- methoxy group
                    ^-^ break this bond between atom[4] and atom[5]
I want to break the bond between atoms at 4 and 5. I'll use RDKit for this. The default RDKit molecule does not support topological modifications, so I'll have to turn the molecule into a "RWMol" (Read/Write Molecule) to edit it, then convert the result back to a normal molecule:
>>> mol = Chem.MolFromSmiles("c1(C=O)cc(OC)c(O)cc1")
>>> rwmol = Chem.RWMol(mol)
>>> rwmol.RemoveBond(4, 5)
>>> new_mol = rwmol.GetMol()
>>> Chem.MolToSmiles(new_mol)
'CO.O=Cc1ccc(O)cc1'
Visual inspection shows that this broke the correct bond, but it's missing the new "[*]" atoms. I need to go back to the RWMol and add two atoms with atomic number 0 (which Daylight SMILES terms a 'wildcard' atom):
>>> rwmol.AddAtom(Chem.Atom(0))
11
>>> rwmol.AddAtom(Chem.Atom(0))
12
>>> new_mol = rwmol.GetMol()
>>> Chem.MolToSmiles(new_mol)
'CO.O=Cc1ccc(O)cc1.[*].[*]'
That's getting closer! Now I need to connect each new atom to the correct atom which was at the end of the deleted bond. Remember, the old atom ids were 4 and 5, and you can see from the above code that the new atom ids are 11 and 12. I'll connect atom 4 to 11 and atom 5 to 12, both using a single bond:
>>> rwmol.AddBond(4, 11, Chem.BondType.SINGLE)
11
>>> rwmol.AddBond(5, 12, Chem.BondType.SINGLE)
12
>>> new_mol = rwmol.GetMol()
>>> Chem.MolToSmiles(new_mol)
'[*]OC.[*]c1cc(C=O)ccc1O'
That's the output I'm looking for. Now to wrap it up in a function:
from rdkit import Chem

def fragment_simple(mol, atom1, atom2):
    rwmol = Chem.RWMol(mol)
    rwmol.RemoveBond(atom1, atom2)
    wildcard1 = rwmol.AddAtom(Chem.Atom(0))
    wildcard2 = rwmol.AddAtom(Chem.Atom(0))
    rwmol.AddBond(atom1, wildcard1, Chem.BondType.SINGLE) 
    rwmol.AddBond(atom2, wildcard2, Chem.BondType.SINGLE) 
    return rwmol.GetMol()
and test it out on the vanillin structure:
>>> vanillin = Chem.MolFromSmiles("c1(C=O)cc(OC)c(O)cc1")
>>> fragmented_vanillin = fragment_simple(vanillin, 4, 5)
>>> Chem.MolToSmiles(fragmented_vanillin)
'[*]OC.[*]c1cc(C=O)ccc1O'

Reassemble two fragments

The above code looks right, but the real test is to reassemble the fragments and see if it gives the same result. Rather than work with the molecule at the molecular graph level, I'll manipulate the SMILES string directly, using a ring closure method I described in 2004.

The term "ring closure" sounds like it's only used for rings, but it's actually simply an alternate way to specify a bond. As Daylight's SMILES theory documentation points out, "C1.C1 specifies the same molecule as CC (ethane)".
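A quick sanity check of that equivalence in RDKit (my own check, not part of the original code):

```python
from rdkit import Chem

# A ring-closure digit is just another way to write a bond, so
# "C1.C1" describes the same molecule as "CC" (ethane).
assert Chem.CanonSmiles("C1.C1") == Chem.CanonSmiles("CC")
print(Chem.CanonSmiles("C1.C1"))  # 'CC'
```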

I'll replace the "[*]" symbols with the ring closure "%90", which is high enough that it won't occur in the structures I deal with. I can't do a simple string substitution, since that would turn '[*]OC' into the invalid '%90OC' (a ring closure must follow an atom). If the '[*]' is the first atom, or comes after a '.', then I need to place the %90 after the second atom symbol, giving "O%90C.c%901cc(C=O)ccc1O". I'll canonicalize that string, and compare it to the canonicalized vanillin structure:

>>> test_smiles = "O%90C.c%901cc(C=O)ccc1O"
>>> Chem.CanonSmiles(test_smiles)
'COc1cc(C=O)ccc1O'
>>> Chem.MolToSmiles(vanillin)
'COc1cc(C=O)ccc1O'
As hoped for, they are the same.

Convert wildcards to closures

I apologize. I'm about to show you a big block of code with very little commentary besides the comments. Feel free to skip to the next section.

In the previous section, I manually transformed a SMILES string containing wildcards to one with ring closures so I could test if the fragmentation code was correct. I can automate that step, but it's a bit complicated and I won't go into the details. The key observation is that the wildcard atom is only connected by a single bond to the rest of the fragment. If the wildcard is the first atom term in the SMILES then the ring closure goes on the second atom term. Otherwise, the ring closure goes on the atom immediately before the wildcard atom. (The last is RDKit-specific because RDKit's canonical SMILES ends up placing branches with wildcard atoms before any other branches.)

The following code automates that process. I call the file "smiles_syntax.py" and it defines the new function "convert_wildcards_to_closures(smiles, offsets=None)". Here's an example:

>>> import smiles_syntax
>>> fragmented_smiles = "[*]OC.[*]c1cc(C=O)ccc1O"
>>> smiles_syntax.convert_wildcards_to_closures(fragmented_smiles)
'O%90C.c%911cc(C=O)ccc1O'
>>> smiles_syntax.convert_wildcards_to_closures(fragmented_smiles, (0, 0))
'O%90C.c%901cc(C=O)ccc1O'
>>> closure_smiles = smiles_syntax.convert_wildcards_to_closures(fragmented_smiles, (0, 0))
>>> from rdkit import Chem
>>> Chem.CanonSmiles(closure_smiles)
'COc1cc(C=O)ccc1O'
By default each wildcard atom gets a new closure number, starting from 90, which is why you see the "%90" and "%91". The offsets parameter specifies which offsets from 90 to use. With "(0, 0)", both wildcards get the '%90' ring closure.

This will be used later to work with multiple cuts of the same structure.

Here's the content of "smiles_syntax.py".

# I call this "smiles_syntax.py"
from __future__ import print_function

import re

# Match a '*' in the different forms that might occur,
# including with directional single bonds inside of ()s.
_wildcard_regex = " |\n".join(re.escape(term) for term in
  ("*", "[*]", "(*)", "([*])", "(/*)", "(/[*])", "(\\*)", "(\\[*])"))
_wildcard_pattern = re.compile(_wildcard_regex, re.X)

# Match the SMILES for an atom, followed by its closures
_atom_pattern = re.compile(r"""
(
 Cl? |             # Cl and Br are part of the organic subset
 Br? |
 [NOSPFIbcnosp] |  # as are these single-letter elements
 \[[^]]*\]         # everything else must be in []s
)
""", re.X)

def convert_wildcards_to_closures(smiles, offsets=None):
    # This is designed for RDKit's canonical SMILES output. It does
    # not handle all possible SMILES inputs.
    if offsets is None:
        # Use 0, 1, 2, ... up to the number of '*'s
        offsets = range(smiles.count("*"))
    closure_terms = []
    for offset in offsets:
        if not (0 <= offset <= 9):
            raise ValueError("offset %d out of range (must be from 0 to 9)"
                             % (offset,))
        closure_terms.append("%%%02d" % (90 + offset))
    
    new_smiles = smiles
    while 1:
        # Find the first '*'. If none are left, stop.
        wildcard_match = _wildcard_pattern.search(new_smiles)
        if wildcard_match is None:
            break

        closure_term = closure_terms.pop(0)
        
        wildcard_start = wildcard_match.start()
        if wildcard_start == 0 or new_smiles[wildcard_start-1] == ".":
            # At the start of the molecule or after a ".". Need to
            # put the closure after the second atom. Find the second
            # atom. Since we only ever break on single non-ring bonds,
            # and since the first atom is a terminal atom, the second
            # atom must either be immediately after the first atom, or
            # there is a directional bond between them.
            wildcard_end = wildcard_match.end()
            second_atom_match = _atom_pattern.match(new_smiles, wildcard_end)
            if second_atom_match is None:
                # There was no atom. Is it a "/" or "\"? If so,
                # we'll need to swap the direction when we move
                # to a closure after the second atom.
                bond_dir = new_smiles[wildcard_end:wildcard_end+1]
                if bond_dir == "/":
                    direction = "\\"
                elif bond_dir == "\\":
                    direction = "/"
                else:
                    raise AssertionError(new_smiles)
                # Look for the second atom, which must exist
                second_atom_match = _atom_pattern.match(new_smiles, wildcard_end+1)
                if second_atom_match is None:
                    raise AssertionError((new_smiles, new_smiles[wildcard_end:]))
            else:
                direction = ""

            second_atom_term = second_atom_match.group(1)
            # I changed the bond configuration, so I may need to
            # invert chirality of implicit chiral hydrogens.
            if "@@H" in second_atom_term:
                second_atom_term = second_atom_term.replace("@@H", "@H")
            elif "@H" in second_atom_term:
                second_atom_term = second_atom_term.replace("@H", "@@H")

            # Reassemble the string with the wildcard term deleted and
            # the new closure inserted directly after the second atom
            # (and before any of its closures).
            new_smiles = (new_smiles[:wildcard_start] + second_atom_term
                          + direction + closure_term
                          + new_smiles[second_atom_match.end():])
                    
        else:
            # The match is somewhere inside of a molecule, so assign
            # the closure to the atom it's bonded to on the left.
            c = new_smiles[wildcard_start-1]
            if c == "(" or c == ")":
                # In principle, this could be something like "CCC(F)(Cl)[*]", 
                # where I would need to count the number of groups back to
                # the main atom, and flip chirality accordingly. Thankfully,
                # RDKit always puts the "[*]" terms immediately after the
                # preceding atom, so I don't need to worry.
                raise NotImplementedError("intermediate groups not supported",
                                          new_smiles, new_smiles[wildcard_start-1:])
            
            elif c in "CNcnOS]Pos0123456789ABDEFGHIJKLMQRTUVWXYZabdefghijklmpqrtuvwxyz":
                # Double-check that the previous character looks like part of an atom.
                wildcard_term = wildcard_match.group()
                # Preserve the direction, if present
                if "/" in wildcard_term:
                    direction = "/"
                elif "\\" in wildcard_term:
                    direction = "\\"
                else:
                    direction = ""
                new_smiles = (new_smiles[:wildcard_start] + direction + closure_term
                              + new_smiles[wildcard_match.end():])

            else:
                raise AssertionError((new_smiles, c, new_smiles[wildcard_start-1:]))

    return new_smiles

if __name__ == "__main__":
    for smiles in ("*C", "*/CO.*CN", "C*.C(*)N"):
        print(smiles, convert_wildcards_to_closures(smiles, (0,)*smiles.count("*")))

Test driver for fragment_simple()

While the above code looks right, it got that way only as the result of a lot of testing. There are a lot of corner cases, like deuterium and directional bonds, which caused the first versions of this function to fail.

I've found it best to pass a lot of real-world SMILES through code like this, to see if there are any problems. I usually draw from PubChem or ChEMBL. I'll show you what that looks like.

NOTE: the file I test against, "pubchem.smi", excludes structures with chiral terms. This is deliberate, for reasons I'll get to soon.

The "verify()" function, below, takes a molecule and a list of atom pairs. For each pair of atoms describing a bond, it fragments the molecule on that bond, then reassembles it and tests that the result matches the input.

from __future__ import print_function  
 
from rdkit import Chem
from smiles_syntax import convert_wildcards_to_closures

def fragment_simple(mol, atom1, atom2):
    rwmol = Chem.RWMol(mol)
    rwmol.RemoveBond(atom1, atom2)
    wildcard1 = rwmol.AddAtom(Chem.Atom(0))
    wildcard2 = rwmol.AddAtom(Chem.Atom(0))
    rwmol.AddBond(atom1, wildcard1, Chem.BondType.SINGLE) 
    rwmol.AddBond(atom2, wildcard2, Chem.BondType.SINGLE) 
    return rwmol.GetMol()

def verify(mol, bond_atoms):
    # Even though I'm not checking for stereochemistry, some of
    # my input files contain isotopes. If I don't use isomeric
    # SMILES then the input '[2H]' gets represented as '[H]',
    # which during recanonicalization is placed on the main atom.
    # (RDKit 2016 added an option to disable that.)
    expected_smiles = Chem.MolToSmiles(mol, isomericSmiles=True)
    
    for begin_atom, end_atom in bond_atoms:
        fragmented_mol = fragment_simple(mol, begin_atom, end_atom)
        fragmented_smiles = Chem.MolToSmiles(fragmented_mol, isomericSmiles=True)
        assert "." in fragmented_smiles, fragmented_smiles # safety check
        
        closure_smiles = convert_wildcards_to_closures(fragmented_smiles, (0, 0))
        assert "%90" in closure_smiles, closure_smiles # safety check
        closure_mol = Chem.MolFromSmiles(closure_smiles)
	
        final_smiles = Chem.MolToSmiles(closure_mol, isomericSmiles=True)
        if final_smiles != expected_smiles:
            raise ValueError( (expected_smiles, begin_atom, end_atom, fragmented_smiles,
                               closure_smiles, final_smiles) )
I can run it manually, and also check what it does when I break one of the ring bonds:
>>> vanillin = Chem.MolFromSmiles("c1(C=O)cc(OC)c(O)cc1")
>>> verify(vanillin, [(4, 5)])
>>> verify(vanillin, [(3, 4)])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 12, in verify
AssertionError: [*]COc1([*])cc(C=O)ccc1O

I can have it break all single bonds which aren't in a ring and which aren't connected to a wildcard atom (#0) or a hydrogen (#1):

>>> single_bond_pat = Chem.MolFromSmarts("[!#0;!#1]-!@[!#0;!#1]")
>>> matches = vanillin.GetSubstructMatches(single_bond_pat)
>>> matches
((0, 1), (4, 5), (5, 6), (7, 8))
>>> verify(vanillin, matches)

Once I'm happy that the verification function is correct, I can put it in a test driver:

def main():
    # Match a single bond not in a ring
    single_bond_pat = Chem.MolFromSmarts("[!#0;!#1]-!@[!#0;!#1]")

    # NOTE! This file contains no chiral structures (nothing with a '@')
    with open("pubchem.smi") as infile:
        ## You might try processing every 100th record
        # import itertools
        # infile = itertools.islice(infile, None, None, 100)

        ## Or process the lines in random order
        # import random
        # lines = infile.readlines()
        # random.shuffle(lines)
        # infile = lines
        
        recno = -1
        num_tests = 0
        for recno, line in enumerate(infile):
            if recno % 100 == 0:
                print("Processing record:", recno, "#tests:", num_tests)
                
            smiles, id = line.split()
            if "*" in smiles:
                print("Skipping record with a '*':", repr(line))
                continue
            if "@" in smiles:
                print("Skipping record with a chiral '@':", repr(line))
                continue
            
            mol = Chem.MolFromSmiles(smiles)
            if mol is None:
                continue

            matches = mol.GetSubstructMatches(single_bond_pat)
            verify(mol, matches)
            num_tests += len(matches)
            
        recno += 1
        print("Processed", recno, "records and", num_tests, "tests")
            
if __name__ == "__main__":
    main()
The output from this looks like:
% python fragment2.py
Processing record: 0 #tests: 0
Processing record: 100 #tests: 1102
Processing record: 200 #tests: 2078
Processing record: 300 #tests: 3111
 ...
Processing record: 6800 #tests: 72322
[00:24:27] Explicit valence for atom # 2 Cl, 3, is greater than permitted
Processing record: 6900 #tests: 73275
  ...

fragment_simple() does not preserve chirality

That seems like a lot of test code just to show that the function, "fragment_simple()", is correct. It's only seven lines long, without any branches. It's the sort of function that should be easy to test by eye or with a few test cases.

Which is what I did the first ten or so times I wrote that function. But then I found out that the function doesn't preserve chirality. Here's an example of the verify() function failing on what should be a simple test case:

>>> mol = Chem.MolFromSmiles("C[C@](N)(C=O)c1ccccc1")
>>> verify(mol, [(1, 3)])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 21, in verify
ValueError: ('C[C@](N)(C=O)c1ccccc1', 1, 3, '[*]C=O.[*][C@@](C)(N)c1ccccc1',
'C%90=O.[C@@]%90(C)(N)c1ccccc1', 'C[C@@](N)(C=O)c1ccccc1')

You can see the problem in the fragmentation step, where the chirality somehow inverted:

>>> mol = Chem.MolFromSmiles("C[C@](N)(C=O)c1ccccc1")
>>> new_mol = fragment_simple(mol, 1, 3)
>>> Chem.MolToSmiles(new_mol, isomericSmiles=True)
'[*]C=O.[*][C@@](C)(N)c1ccccc1'
The second fragment should have "[C@]" instead of "[C@@]". Why did it flip?

The problem comes back to the "RemoveBond()/AddBond()" calls in fragment_simple(). RDKit stores the chirality as an atom property, but the meaning of that property depends on the permutation order of the atom's bonds. If the permutation order changes, then the perceived chirality changes even though the chirality property itself hasn't.

To demonstrate, I need a helper function to show the connection table:

_bond_symbol = {
    Chem.BondType.SINGLE: "-",
    Chem.BondType.DOUBLE: "=",
    Chem.BondType.TRIPLE: "#",
    Chem.BondType.AROMATIC: ":",
    }
def get_bond_info(from_atom_idx, bond):
    to_atom_idx = bond.GetBeginAtomIdx()
    if from_atom_idx == to_atom_idx:
        to_atom_idx = bond.GetEndAtomIdx()
    return _bond_symbol[bond.GetBondType()] + str(to_atom_idx)

# Example output line:
#   5 C -1 :6 :10 
# This says that atom[5] is a "C"arbon with 3 bonds.
# The first bond is a single bond to atom[1].
# The second bond is an aromatic bond to atom[6]. 
# The third bond is an aromatic bond to atom[10].

def show_connection_table(mol):
    for atom in mol.GetAtoms():
        bonds = " ".join(get_bond_info(atom.GetIdx(), bond) for bond in atom.GetBonds())
        print(atom.GetIdx(), atom.GetSymbol(), bonds)
then use it to show the connection table for the two molecules.
>>> show_connection_table(mol)
0 C -1
1 C -0 -2 -3 -5
2 N -1
3 C -1 =4
4 O =3
5 C -1 :6 :10  
6 C :5 :7
7 C :6 :8
8 C :7 :9
9 C :8 :10
10 C :9 :5
>>> show_connection_table(new_mol)
0 C -1
1 C -0 -2 -5 -11
2 N -1
3 C =4 -12
4 O =3
5 C -1 :6 :10
6 C :5 :7
7 C :6 :8
8 C :7 :9
9 C :8 :10
10 C :9 :5
11 * -1
12 * -3
I deleted the bond from atom[1] to atom[3]. It's atom[1] which has the chiral property, so look at the second line of each of the outputs.

The original bond permutation order was (0, 2, 3, 5), while now it's (0, 2, 5, 11): the third bond, which went to the old atom[3], was removed, and a new fourth bond to the new wildcard atom[11] was appended. The permutation order changed. The third and fourth bonds must be swapped to get back to the original permutation order. Each swap in the permutation order corresponds to a chirality inversion, which is why the output chirality is inverted.
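That swap-counting argument can be sketched in a few lines of Python. This is a hypothetical helper of my own for illustration, not an RDKit function: it treats the added wildcard as standing in for the removed neighbor, then counts the transpositions needed to reach the new bond order; an odd count means the stored chirality now reads as inverted.

```python
def chirality_flips(old_neighbors, new_neighbors, removed, added):
    # Hypothetical helper, not part of RDKit. Treat the new wildcard
    # atom as standing in for the removed neighbor, then count the
    # swaps needed to turn the new neighbor order into that reference.
    reference = [added if n == removed else n for n in old_neighbors]
    perm = list(new_neighbors)
    swaps = 0
    for i, want in enumerate(reference):
        j = perm.index(want)
        if i != j:
            perm[i], perm[j] = perm[j], perm[i]
            swaps += 1
    # An odd number of swaps is an odd permutation: chirality inverts.
    return swaps % 2 == 1

# Atom[1]'s bonds went from (0, 2, 3, 5) to (0, 2, 5, 11) when the
# bond to atom[3] was replaced by a bond to the wildcard atom[11]:
print(chirality_flips((0, 2, 3, 5), (0, 2, 5, 11), removed=3, added=11))  # True
```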

Stay tuned to this channel for the continuing story of how to fragment a molecule and preserve chirality after reassembly. (Remember, RDKit has a built-in function for this, FragmentOnBonds(). Use that if you want to fragment.)



Molecular fragments, R-groups, and functional groups #

For a change of pace, I figured I would do a basic chemistry lesson about molecular structures, instead of a more computer oriented blog post.

Chemists often think about a molecule as a core structure (usually a ring system) and a set of R-groups. Each R-group is attached to an atom in the core structure by a bond. Typically that bond is a single bond, and often "rotatable".

Here's an example of what I mean. The first image below shows the structure of vanillin, which is the primary taste behind vanilla. In the second image, I've circled the three R-groups in the structure.
vanillin structure vanillin structure with three R-groups identified
Vanillin structure  
(the primary taste of vanilla)  
Vanillin with three R-groups identified

The R-groups in this case are R1=a carbonyl group (*-CH=O), R2=a methoxy group (*-O-CH3), and R3=a hydroxyl group (*-OH), where the "*" indicates where the R-group attaches to the core structure.

The R-group concept is flexible. Really it just means that you have a fixed group of connected atoms, which are connected along some bond to a variable group of atoms, and where the variable group is denoted R. Instead of looking at the core structure and a set of R-groups, I can invert the thinking and think of an R-group, like the carbonyl group, as "the core structure", and the rest of the vanillin as its R-group.

With that in mind, I'll replace the "*" with the "R" to get the groups "R-CH=O", "R-O-CH3", and "R-OH". (The "*" means that the fragment is connected to an atom at this point, but it's really just an alternative naming scheme for "R".)

All three of these group are also functional groups. Quoting Wikipedia, "functional groups are specific groups (moieties) of atoms or bonds within molecules that are responsible for the characteristic chemical reactions of those molecules. The same functional group will undergo the same or similar chemical reaction(s) regardless of the size of the molecule it is a part of."

These three corresponding functional groups are R1 = aldehyde, R2 = ether, and R3 = hydroxyl.

As the Wikipedia quote pointed out, if you have a reaction which acts on an aldehyde, you can likely use it on the aldehyde group of vanillin.

Vanillyl group and capsaicin

A functional group can also contain functional groups. I pointed to the three functional groups attached to the central ring of vanillin, but most of the vanillin structure is itself another functional group, a vanillyl:
vanillyl functional group

Structures which contain a vanillyl group are called vanilloids. Vanilla is of course a vanilloid, but surprisingly so is capsaicin, the source of the "heat" to many a spicy food. Here's the capsaicin structure, with the vanillyl group circled:
capsaicin with vanillyl group circled

The feeling of heat comes because the capsaicin binds to TRPV1 (the transient receptor potential cation channel subfamily V member 1), also known as the "capsaicin receptor". It's a nonselective receptor, which means that many things can cause it to activate. Quoting that Wikipedia page: "The best-known activators of TRPV1 are: temperature greater than 43 °C (109 °F); acidic conditions; capsaicin, the irritating compound in hot chili peppers; and allyl isothiocyanate, the pungent compound in mustard and wasabi." The same receptor detects temperature, capsaicin, and a compound in hot mustard and wasabi, which is why your body interprets them all as "hot."

Capsaicin is a member of the capsaicinoid family, and all capsaicinoids are vanilloids. This sort of is-a family membership relationship in chemistry has led to many taxonomies and ontologies, including ChEBI.

But don't let my example or the existence of nomenclature lead you to the wrong conclusion that all R-groups are functional groups! An R-group, at least with the people I usually work with, is a more generic term used to describe a way of thinking about molecular structures.

QSAR modeling

QSAR (pronounced "QUE-SAR") is short for "quantitative structure-activity relationship", which is a mouthful. (I once travelled to the UK for a UK-QSAR meeting. The border inspector asked me where I was going, and I said "the UK-QSAR meeting; QSAR is .." and I blanked on the expansion of that term! I was allowed across the border, so it couldn't have been that big of a mistake.)

QSAR deals with the development of models which relate a chemical structure to its activity in a biological or chemical system. Looking at that, I realize I just moved the words around a bit, so I'll give a simple example.

Consider an activity, which I'll call "molecular weight". (This is more of a physical property than a chemical one, but I am trying to make it simple.) My model for molecular weight assumes that each atom has its own weight, and the total molecular weight is the sum of the individual atom weights. I can create a training set of molecules, and for each molecule determine its structure and molecular weight. With a bit of least-squares fitting, I can determine the individual atom weight contribution. Once I have that model, I can use it to predict the molecular weight of any molecule which contains atoms which the model knows about.
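That fitting step can be sketched with a toy training set. The formulas and weights below are my own illustration (counting only C, H, and O atoms), not data from the essay:

```python
import numpy as np

# Toy atom-contribution model: molecular weight = sum of per-element weights.
# Each row counts the (C, H, O) atoms in one training molecule.
training = [
    (1, 4, 0),   # methane, CH4
    (2, 6, 1),   # ethanol, C2H5OH
    (0, 2, 1),   # water, H2O
    (1, 4, 1),   # methanol, CH3OH
]
X = np.array(training, dtype=float)

# "Measured" molecular weights, generated here from standard atomic weights.
y = X @ np.array([12.011, 1.008, 15.999])

# Least-squares fit recovers the per-element contributions (C, H, O).
weights, *_ = np.linalg.lstsq(X, y, rcond=None)
print(weights)  # approximately [12.011, 1.008, 15.999]
```

With real measured activities the fit would not be exact, but the model could still predict the property for any molecule built from elements it was trained on.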

Obviously this model will be pretty accurate. It won't be perfect, because isotopic ratios can vary. (A chemical synthesized from fossil oil is slightly lighter and less radioactive than the same chemical derived from environmental sources, because the heavier, radioactive 14C in fossil oil has decayed.) But for most uses it will be good enough.

A more chemically oriented property is the partition coefficient, measured in log units as "log P", which is a measure of the solubility in water compared to a type of oil. This gives a rough idea of whether the molecule will tend to end up in hydrophobic regions like a cell membrane, or in aqueous regions like blood. One way to predict log P is with the atom-based approach I sketched for the molecular weight, where each atom type has a contribution to the overall measured log P. (This is sometimes called AlogP.)

In practice, atom-based solutions are not as accurate as fragment-based solutions. The molecular weight can be atom-centered because nearly all of the mass is in the atom's nucleus, which is well localized to the atom. But chemistry isn't really about atoms so much as the electron density around atoms, and electrons are much less localized than nucleons. The density around an atom depends on the neighboring atoms and the configuration of the atoms in space.

As a way to improve on that, some methods look at the extended local environment (this is sometimes called XlogP) or at larger fragment contributions (for example, BioByte's ClogP). The more complex it is, the more compounds you need for the training and the slower the model. But hopefully the result is more accurate, so long as you don't overfit the model.

If you're really interested in the topic, Paul Beswick of the Sussex Drug Discovery Centre wrote a nice summary on the different nuances in log P prediction.

Matched molecular pairs

Every major method from data mining, and most of the minor ones, has been applied to QSAR models. The history is also quite long: there are cheminformatics papers from the 1970s looking at supervised and unsupervised learning, building on even earlier work on clustering applied to biological systems.

A problem with most of these is their black-box nature. The data is noisy, and the quantum nature of chemistry isn't that good a match to data mining tools, so these predictions are used more often to guide a pharmaceutical chemist than to make solid predictions. This means the conclusions should be interpretable by the chemist. Try getting your neural net to give a chemically reasonable explanation of why it predicted what it did!

Matched molecular pair (MMP) analysis is a more chemist-oriented QSAR method, with relatively little mathematics beyond simple statistics. Chemists have long looked at activities in simple series, like replacing a methyl (*-CH3) with an ethyl (*-CH2-CH3) or propyl (*-CH2-CH2-CH3), or replacing a fluorine with a heavier halogen like a chlorine or bromine. These can form consistent trends across a wide range of structures, and chemists have used these observations to develop techniques for how to, say, improve the solubility of a drug candidate.

MMP analysis systematizes this over all considered fragments, including not just R-groups (which are connected to the rest of the structure by one bond) but also so-called "core" structures with two or three R-groups attached. For example, if the known structures can be described as "A-B-C", "A-D-C", "E-B-F" and "E-D-F", with activities of 1.2, 1.5, 2.3, and 2.6 respectively, then we can do the following analysis:

  A-B-C transforms to A-D-C with an activity shift of 0.3.
  E-B-F transforms to E-D-F with an activity shift of 0.3.
  Both transforms can be described as R1-B-R2 to R1-D-R2.
  Perhaps R1-B-R2 to R1-D-R2 in general causes a shift of 0.3?

It's not quite as easy as this, because the molecular fragments aren't so easily identified. A molecule might be described as "A-B-C", as well as "E-Q-F" and "E-H" and "C-T(-P)-A", where "T" has three R-groups connected to it.
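To make the bookkeeping concrete, here is a toy sketch of the matched-pair analysis over the four example structures above. The fragment labels and the pairing-by-one-difference rule are purely illustrative; as just noted, identifying the fragments in real molecules is the hard part.

```python
# Group pairs of "molecules" (fragment tuples) that differ in exactly
# one fragment position, and average the activity shift per transform.
from collections import defaultdict
from itertools import combinations

# Fragment tuple -> measured activity, from the example in the text.
activities = {
    ("A", "B", "C"): 1.2,
    ("A", "D", "C"): 1.5,
    ("E", "B", "F"): 2.3,
    ("E", "D", "F"): 2.6,
}

transforms = defaultdict(list)
for (m1, a1), (m2, a2) in combinations(activities.items(), 2):
    diffs = [i for i in range(len(m1)) if m1[i] != m2[i]]
    if len(diffs) == 1:
        # Describe the transform by the changed fragment only,
        # ignoring the unchanged context.
        i = diffs[0]
        transforms[(m1[i], m2[i])].append(a2 - a1)

for (frag_from, frag_to), shifts in sorted(transforms.items()):
    avg = sum(shifts) / len(shifts)
    print(f"R1-{frag_from}-R2 -> R1-{frag_to}-R2: "
          f"{len(shifts)} pairs, average shift {avg:+.2f}")
```

Running this finds the single B-to-D transform, supported by two pairs, each with a shift of about 0.3.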

Thanks

Thanks to EPAM Life Sciences for their Ketcher tool, which I used for the structure depictions that weren't public domain on Wikipedia.



Reading ASCII file in Python3.5 is 2-3x faster as bytes than string #

I'm porting chemfp from Python2 to Python3. I read a lot of ASCII files. I'm trying to figure out if it's better to read them as binary bytes or as text strings.

No matter how I tweak Python3's open() parameters, I can't get the string read performance to within a factor of 2 of the bytes read performance. As I haven't seen much discussion of this, I figured I would document it here.

chemfp reads chemistry file formats which are specified as ASCII. They contain user-specified fields which are 8-bit clean, so sometimes people use them to encode non-ASCII data. For example, the SD tag field "price" might include the price in £GBP or €EUR, and include the currency symbol either as Latin-1 or UTF-8. (I haven't come across other encodings, but I've also never worked with SD files used internally in, say, a Japanese pharmaceutical company.)

These are text files, so it makes sense to read them as text, right? The main problem is that reading in "r" mode is a lot slower than reading in "rb" mode. Here's my benchmark, which uses Python 3.5.2 on a Mac OS X 10.10.5 machine to read the first 10MiB from a 3.1GiB file:

% python -V
Python 3.5.2
% python -m timeit 'open("chembl_21.sdf", "r").read(10*1024*1024)'
100 loops, best of 3: 10.3 msec per loop
% python -m timeit 'open("chembl_21.sdf", "rb").read(10*1024*1024)'
100 loops, best of 3: 3.74 msec per loop

The Unicode string read() is much slower than the byte string read(), with a performance ratio of 2.75. (I'll give all numbers as ratios.)

Python2 had a similar problem. I originally used "U"niversal newline mode in chemfp to read the text files in FPS format, but found that if I switched from "rU" to "rb", and wrote my code to support both '\n' and '\r\n' conventions, I could double my overall system read performance - the "U" option gives a 10x slowdown!

% python2.7 -m timeit 'open("chembl_21.sdf", "rb").read(10*1024*1024)'
100 loops, best of 3: 3.7 msec per loop
% python2.7 -m timeit 'open("chembl_21.sdf", "rU").read(10*1024*1024)'
10 loops, best of 3: 36.7 msec per loop
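The bytes-mode newline handling itself is straightforward: iterating over a binary file still splits on b"\n", so a Windows-style line simply keeps a trailing b"\r", which rstrip() removes. A minimal sketch (the function name is mine, not chemfp's):

```python
def read_lines(path):
    # Read in binary mode; iteration splits on b"\n", and rstrip()
    # drops the b"\r" left over from "\r\n" line endings.
    with open(path, "rb") as f:
        for line in f:
            yield line.rstrip(b"\r\n")
```

This handles both conventions in one pass, without the decoding and translation overhead of universal newline mode.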

This observation is not new. A quick Duck Duck Go search found a 2015 blog post by Nelson Minar which reached a similar conclusion.

The Python3 open() function takes more parameters than Python2, including 'newline', which affects how the text mode reader identifies newlines, and 'encoding':

open(file, mode='r', buffering=-1, encoding=None, errors=None, newline=None, closefd=True)
  ...
    newline controls how universal newlines works (it only applies to text
    mode). It can be None, '', '\n', '\r', and '\r\n'.  It works as
    follows:
    
    * On input, if newline is None, universal newlines mode is
      enabled. Lines in the input can end in '\n', '\r', or '\r\n', and
      these are translated into '\n' before being returned to the
      caller. If it is '', universal newline mode is enabled, but line
      endings are returned to the caller untranslated. If it has any of
      the other legal values, input lines are only terminated by the given
      string, and the line ending is returned to the caller untranslated.

I'll 'deuniversalize' the text reader and benchmark newline="\n" and newline="":

% python -m timeit 'open("chembl_21.sdf", "rb").read(10*1024*1024)'
100 loops, best of 3: 3.81 msec per loop
% python -m timeit 'open("chembl_21.sdf", "r", newline="\n").read(10*1024*1024)'
100 loops, best of 3: 8.8 msec per loop
% python -m timeit 'open("chembl_21.sdf", "r", newline="").read(10*1024*1024)'
100 loops, best of 3: 10.2 msec per loop
% python -m timeit 'open("chembl_21.sdf", "r", newline=None).read(10*1024*1024)'
100 loops, best of 3: 10.2 msec per loop

The ratio of 2.3 for the newline="\n" slowdown is better than the 2.75 for universal newlines and for the newline="" case that Nelson Minar tested, but still less than half the performance of the byte reader.

I also wondered if the encoding made a difference:

% python -m timeit 'open("chembl_21.sdf", "r", newline="\n", encoding="ascii").read(10*1024*1024)'
100 loops, best of 3: 8.8 msec per loop
% python -m timeit 'open("chembl_21.sdf", "r", newline="\n", encoding="utf8").read(10*1024*1024)'
100 loops, best of 3: 8.8 msec per loop
% python -m timeit 'open("chembl_21.sdf", "r", newline="\n", encoding="latin-1").read(10*1024*1024)'
100 loops, best of 3: 10.1 msec per loop

My benchmark shows that the ASCII and UTF-8 encodings are equally fast, and Latin-1 is 14% slower, even though my data set contains only ASCII. I did not expect any difference. I assume a lot of time has been spent making the UTF-8 code go fast, but I don't know why the Latin-1 reader is noticeably slower on ASCII data.

Nelson Minar also tested the codecs.open() performance, so I'll repeat it:

% python -m timeit -s 'import codecs' 'codecs.open("chembl_21.sdf", "r").read(10*1024*1024)'
100 loops, best of 3: 10.2 msec per loop

I noticed no performance difference between codecs.open() and the builtin open() for this test case.

I'm left with a bit of a quandary. I work with ASCII text data, with only the occasional non-ASCII field. For example, chemfp has specialized code to read an id tag and encoded fingerprint field from an SD file. In rare and non-standard cases, the handful of characters in the id/title line might be non-ASCII, but the hex-encoded fingerprint is never anything other than ASCII. It makes sense to use the text reader. But if I use the text reader, it will decode everything in each record (typically 2K-8K bytes), when I only need to decode at most 100 bytes of the record.

In chemfp, I used to have a two-pass solution to find records in an SD file. The first pass found the fields of interest, and the second counted newlines for better error reporting. I found that even that level of data re-scanning caused an observable slowdown, so I shouldn't be surprised that an extra pass to check for non-ASCII characters might also be a problem. But, two-fold slowdown?

This performance overhead leads me to conclude that I need to process my performance critical files as bytes, rather than strings, and delay the byte-to-string decoding as much as possible.
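A minimal sketch of what that delayed decoding might look like for SD record titles. The record splitting is deliberately naive (it reads the whole file into memory and assumes "$$$$\n" appears only as a record terminator), and the Latin-1 fallback is one possible policy, not chemfp's actual implementation:

```python
def iter_titles(path):
    # Process the file as bytes; decode only the title line of each
    # record, leaving the remaining few KB of each record undecoded.
    with open(path, "rb") as f:
        data = f.read()
    for record in data.split(b"$$$$\n"):
        if not record.strip():
            continue
        title = record.split(b"\n", 1)[0]
        try:
            yield title.decode("utf-8")
        except UnicodeDecodeError:
            # Assume Latin-1 for bytes which aren't valid UTF-8.
            yield title.decode("latin-1")
```

The decoding cost is now proportional to the handful of bytes actually needed as text, not to the record size.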

RDKit and (non-)Unicode

I checked with the RDKit, which is a cheminformatics toolkit. The core is in C++, with Python extensions through Boost.Python. It treats the files as bytes, and lazily exposes the data to Python as Unicode. For example, if I place a byte string which is not valid UTF-8 in the title or tag field, then RDKit will read and write the data without problems, because the data is stored in C++ data structures based on the byte string. But if I try to get the properties from Python, I get a UnicodeDecodeError.

Here's an example. First, I'll get a valid record, which is all ASCII:

>>> block = open("chembl_21.sdf", "rb").read(10000)
>>> i = block.find(b"$$$$\n")
>>> i
973
>>> record = block[:i+5]
>>> chr(max(record))
'm'

I'll use the RDKit to parse the record, then show that I can read the molecule id, which comes from the first line (the title line) of the file:

>>> from rdkit import Chem
>>> mol = Chem.MolFromMolBlock(record)
>>> mol.GetProp("_Name")
'CHEMBL153534'

If I then create a 'bad_record' by prefixing the byte 0xC2 to the record, I can still process the record, but I cannot get the title:

>>> bad_record = b"\xC2" + record
>>> bad_mol = Chem.MolFromMolBlock(bad_record)
>>> bad_mol.GetNumAtoms()
16
>>> bad_mol.GetProp("_Name")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc2 in position 0: invalid continuation byte

This is because the byte 0xC2 starts a two-byte sequence in UTF-8, and the byte which follows it here is not a valid continuation byte, so the RDKit's strict UTF-8 decoding of the property bytes fails.

I can even use C++ to write the string to a file, since the C++ code treats everything as bytes:

>>> outfile = Chem.SDWriter("/dev/tty")
>>> outfile.write(mol); outfile.flush()
CHEMBL153534
     RDKit          2D

 16 17  0  0  0  0  0  0  0  0999 V2000
   ...
>>> outfile.write(bad_mol); outfile.flush()
?CHEMBL153534
     RDKit          2D

 16 17  0  0  0  0  0  0  0  0999 V2000
 ...
(The symbol could not be represented in my UTF-8 terminal, so it uses a "?".)

On the other hand, I get a UnicodeDecodeError if I use a Python file object:

>>> from io import BytesIO
>>> f = BytesIO()
>>> outfile = Chem.SDWriter(f)
>>> outfile.write(bad_mol)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc2 in position 0: invalid continuation byte
(It doesn't matter if I replace the BytesIO with a StringIO; the failure occurs before that point. When the SDWriter() is initialized with a Python file handle and then told to write a molecule, it writes the molecule to a C++ byte string, converts that to a Python string, and passes that Python string to the file handle. The failure here is in the C++-to-Python translation.)

The simple conclusion from this is the same as the punchline of the old joke: "Doctor, doctor, it hurts when I do this." "Then don't do that." But that's too simple. SD files come from all sorts of places, including sources which may use '\xC2' as the Latin-1 encoding of Â. You don't want your seemingly well-tested system to crash because of something like this.

I'm not sure what the right solution is for the RDKit, but I can conclude that I need a test case something like this for any of my SD readers, and that correct support for the bytes/string translation is not easy.
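Here is the kind of regression test I have in mind, written against a deliberately naive bytes-based reader. The reader is a stand-in for the real thing; the point is the test data: a title which is valid Latin-1 but invalid UTF-8, followed by a record which must survive intact.

```python
import io

def read_ids(f):
    # Stand-in reader: yield the title line of each SD record, as bytes.
    # Real readers (chemfp's, RDKit's) are far more involved.
    for record in f.read().split(b"$$$$\n"):
        if record.strip():
            yield record.split(b"\n", 1)[0]

def check_non_utf8_title(reader):
    # 0xC2 is valid Latin-1 but, followed by 'b', invalid UTF-8.
    sdf = b"\xc2badtitle\nbody\n$$$$\nGOODTITLE\nbody\n$$$$\n"
    ids = list(reader(io.BytesIO(sdf)))
    assert len(ids) == 2, "both records must survive"
    assert ids[1] == b"GOODTITLE", "later records must not be corrupted"

check_non_utf8_title(read_ids)
```

Any reader which decodes eagerly with a strict UTF-8 policy will fail this test by raising a UnicodeDecodeError rather than by producing a wrong answer, which is exactly the crash this test is meant to catch.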

Want to leave a comment?

If you have a better suggestion than using bytes, like a faster way to read ASCII text as Unicode strings, or have some insight into why reading an ASCII file as a string is so relatively slow in Python3, let me know.

But don't get me wrong. I do scientific programming, which with rare exceptions is centered around the 7-bit ASCII world of the 1960s. Python3 has made great efforts to make Unicode parsing and decoding fast, which is important for most real-world data outside of my niche.



Copyright © 2001-2013 Andrew Dalke Scientific AB