Dalke Scientific Software: More science. Less time. Products
[ previous | newer ]     /home/writings/diary/archive/2016/12/02/Weiningers_realization

Weininger's Realization

Dave Weininger passed away recently. He was very well known in the chemical informatics community because of his contribution to the field and his personality. Dave and Yosi Taitz founded Daylight Chemical Information Systems to turn some of these ideas into a business, back in the 1980s. It was very profitable. (As a bit of trivia, the "Day" in "Daylight" comes from "Dave And Yosi".)

Some of the key ideas that Dave and Daylight introduced are SMILES, SMARTS, and fingerprints (both the name and the hash-based approach). Together these made for a new way to handle chemical information search, and do so in significantly less memory. The key realization which I think lead to the business success of the company, is that the cost of memory was decreasing faster than the creation of chemical information. This trend, combined with the memory savings of SMILES and fingerprints, made it possible to store a corporate dataset in RAM, and do chemical searches about 10,000 times faster than the previous generation of hard-disk based tools, and do it before any competition could. I call this "Weininger's Realization". As a result, the Daylight Thor and Merlin databases, along with the chemistry toolkits, became part of the core infrastructure of many pharmaceutical companies.

I don't know if there was a specific "a-ha" moment when that realization occurred. It certainly wasn't what drove Dave to work on those ideas in the first place. He was a revolutionary, a Prometheus who wanted to take chemical information from what he derisively called 'the high priests' and bring it to the people.

An interest of mine in the last few years is to understand more about the history of chemical information. The best way I know to describe the impact of Dave and Daylight is to take some of the concepts back to the roots.

You may also be interested in reading Anthony Nicholls description of some of the ways that Dave influenced him, and Derek Lowe's appreciation of SMILES.

Errors and Omissions

Before I get there, I want to emphasize that the success of Daylight cannot be attributed to just Dave, or Dave and Yosi. Dave's brother Art and his father Joseph were coauthors on the SMILES canonicalization paper. The company hired people to help with the development, both as employees and consultants. I don't know the details of who did what, so I will say "Dave and Daylight" and hopefully reduce the all too easy tendency to give all the credit on the most visible and charismatic person.

I'm unfortunately going to omit many parts of the Daylight technologies, like SMIRKS, where I don't know enough about the topic or its effect on cheminformatics. I'll also omit other important but invisible aspects of Daylight, like documentation or the work Craig James did to make the database servers more robust to system failures. Unfortunately, it's the jockeys and horses which attract the limelight, not those who muck the stables or shoe the horses.

Also, I wrote this essay mostly from what I have in my head and from presentations I've given, which means I've almost certainly made mistakes that could be fixed by going to my notes and primary sources. Over time I hope to spot and fix those mistakes in this essay. Please let me know of anything you want me to change or improve.

Dyson and Wiswesser notations

SMILES is a "line notation", that is, a molecular representation which can be described as a line of text. Many people reading this may have only a vague idea of the history of line notations. Without that history, it's hard to understand what helped make SMILES successful.

The original line notations were developed in the 1800s. By the late 1800s chemists began to systematize the language into what is now called the IUPAC nomenclature. For example, caffeine is "1,3,7-trimethylpurine-2,6-dione". The basics of this system are taught in high school chemistry class. It takes years of specialized training to learn how to generate the correct name for complex structures.

Chemical nomenclature helps chemists index the world's information about chemical structures. In short, if you can assign a unique name to a chemical structure (a "canonical" name), then it you can use standard library science techniques to find information about the structure.

The IUPAC nomenclature was developed when books and index cards were the best way to organize data. Punched card machines brought a new way of thinking about line notations. In 1946, G. Malcolm Dyson proposed a new line notation meant for punched cards. The Dyson notion was developed as a way to mechanize the process of organizing and publishing a chemical structure index. It became a formal IUPAC notation in 1960, but was already on its last legs and dead within a few years. While it might have been useful for mechanical punched card machines, it wasn't easily repurposed for the computer needs of the 1960s. For one, it depended on superscripts and subscripts, and used characters which didn't exist on the IBM punched cards.

William J. Wiswesser in 1949 proposed the Wiswesser Line Notation, universally called WLN, which could be represented in EBCIDIC and (later) ASCII in a single line of text. More importantly, unlike the Dyson notation, which follows the IUPAC nomenclature tradition of starting with the longest carbon chain, WLN focuses on functional groups, and encodes many functional groups directly as symbols.

Chemists tend to be more interested in functional groups, and want to search based on those groups. For many types of searches, WLN acts as its own screen, that is, it's possible to do some types of substructure search directly on the symbols of the WLN, without having to convert the name into a molecular structure for a full substructure search. To search for structures containing a single sulfur, look for WLNs with a single occurrence of S, but not VS or US or SU. The chemical information scientists of the 1960s and 1970s developed several hundred such clever pattern searches to make effective use of the relatively limited hardware of that era.

WLNs started to disappear in the early 1980s, before SMILES came on the scene. Wendy Warr summarized the advantages and disadvantages of WLNs in 1982. She wrote "The principle disadvantage of WLN is that it is not user friendly. This can only be overcome by programs which will derive a canonical WLN from something else (but no one has yet produced a cost-effective program to do this for over 90% of compounds), by writing programs to generate canonical connection tables from noncanonical WLNs, or by accepting the intervention of a skilled "middle man"."

Dyson/IUPAC and WLNs were just two of dozens, if not hundreds, of proposed line notations. Nearly every proposal suffered from a fatal flaw - they could not easily be automated on a computer. Most required postgraduate-level knowledge of chemistry, and were error-prone. The more rigorous proposals evaluated the number of mistakes made during data entry.

One of the few exceptions are the "parentheses free" notations from a pair of papers from 1964, one by Hiz and the other by Eisman, in the same issue of the Journal of Chemical Documentation. In modern eyes, they very much like SMILES but represented in a postfix notation. Indeed, the Eisman paper gives a very SMILES-like notation for a tree structure, "H(N(C(HC(CIC(N(HH)C(HN(IH)))I)H)H))" and a less SMILES-like notation for a cyclic structure, before describing how to convert them into a postfix form.

I consider the parentheses-free nomenclatures a precursor to SMILES, but they were not influential to the larger field of chemical information. I find this a bit odd, and part of my research has been to try and figure out why. It's not like it had no influence. A version of this notation was in the Chemical Information and Data System (CIDS) project in the 1960s and early 1970s. In 1965, "CIDS was the first system to demonstrate online [that is, not batch processing] searching of chemical structures", and CIDS wasn't completely unknown in the chemical information field.

But most of the field in the 1970s went for WLN for a line notation, or a canonical connection table.


Dave did not know about the parentheses free line notations when he started work on SMILES, but he drew from similar roots in linguistics. Dave was influenced by Chomsky's writings on linguistics. Hiz, mentioned earlier, was at the Department of Linguistics at the University of Pennsylvania, and that's also where Eugene Garfield did his PhD work on the linguistics of chemical nomenclature.

Dave's interest in chemical representations started when he was a kid. His father, Joseph Weininger, was a chemist at G.E., with several patents to his hame. He would draw pictures of chemical compounds for Dave, and Dave would, at a non-chemical level, describe how they were put together. These seeds grew into what became SMILES.

SMILES as we know it started when Dave was working for the EPA in Duluth. They needed to develop a database of environmental compounds, to be entered by untrained staff. (For the full story of this, read Ed Regis's book "The Info Mesa: Science, Business, and New Age Alchemy on the Santa Fe Plateau.") As I recall, SMILES was going to the the internal language, with a GUI for data entry, but it turns out that SMILES was easy enough for untrained data entry people to write it directly.

And it's simple. I've taught the basics of SMILES to non-chemist programmers in a matter of minutes, while WLN, Dyson, and InChI, as example of other line notations, are much more difficult to generate either by hand or by machine. Granted, those three notations have canonicalization rules built into them, which is part of the difficulty. Still, I asked Dave why something like SMILES didn't appear earlier, given that the underlying concepts existed in the literature by then.

He said he believes it's because the generation of people before him didn't grow up with the software development background. I think he's right. When I go to a non-chemist programer, I say "it's a spanning tree with special rules to connect the cycles", and they understand. But that vocabulary was still new in the 1960s, and very specialized.

There's also some conservatism in how people work. Dyson defended the Dyson/IUPAC notation, saying that it was better than WLN because it was based on the longest carbon chain principle that chemists were already familiar with, even though the underlying reasons for that choice were becoming less important because of computer search. People know what works, and it's hard to change that mindset when new techniques become available.

Why the name SMILES? Dave Cosgrove told me it was because when Dave woud tell people he had a new line notations, most would "groan, or worse. When he demonstrated it to them, that would rapidly turn to a broad smile as they realised how simple and powerful it is."

Exchange vs. Canonical SMILES

I not infrequently come across people who say that SMILES is a proprietary format. I disagree. I think the reason for the disagreement is that two different concepts go under the name "SMILES". SMILES is an exchange language for chemistry, and it's an identifier for chemical database search. Only the second is proprietary.

Dave wanted SMILES as a way for chemists from around the world and through time to communicate. SMILES describes a certain molecular valence model view of the chemistry. This does not need to be canonical, because you can always do that yourself once you have the information. I can specify hydrogen cyanide as "C#N", "N#C", or "[H][C]#[N]" and you will be able to know what I am talking about, without needing to consult some large IUPAC standard.

He wanted people to use SMILES that way, without restriction. The first SMILES paper describes the grammar. Later work at Daylight in the 1990s extended SMILES to handle isotopes and stereochemistry. (This was originally called "isomeric SMILES", but it's the SMILES that people think of when they want a SMILES.) Daylight published the updated grammar on their website. It was later included as part of Gasteiger's "Handbook of Chemoinformatics: From Data to Knowledge in 4 Volumes". Dave also helped people at other companies develop their own SMILES parsers.

To say that SMILES as an exchange format is proprietary is opposite to what Dave wanted and what Daylight did.

What is proprietary is the canonicalization algorithm. The second SMILES paper describes the CANGEN algorithm, although it is incomplete and doesn't actually work. Nor does it handle stereochemistry, which was added years later. Even internally at Daylight, it took many years to work out all of the bugs in the implementation.

There's a good financial reason to keep the algorithm proprietary. People were willing to pay a lot of money for a good, fast chemical database, and the canonicalization algorithm was a key part of Daylight's Thor and Merlin database servers. In business speak, this is part of Daylight's "secret sauce".

On the other hand, there's little reason for why that algorithm should be published. Abstractly speaking, it would mean that different tools would generate the same canonical SMILES, so a federated data search would reduce to a text search, rather than require a re-canonicalization. This is one of the goals of the InChI project, but they discovered that Google didn't index the long InChI strings in a chemically useful way. They created the InChI key as a solution. SMILES has the same problem and would need a similar solution.

Noel O'Boyle published a paper pointed out that the InChI canonicalization assignment could be used to assign the atom ordering for a SMILES string. This would give a universal SMILES that anyone could implement. There's been very little uptake of that idea, which gives a feel of how little demand there is. [Edited 20 March 2017: Noel points out that CDK uses Universal SMILES for canonical isomeric SMILES generation, so there is some uptake.]

Sometimes people also include about the governance model to decide if something is proprietary or not, or point to the copyright restrictions on the specification. I don't agree with these interpretations, and would gladly talk about them at a conference meeting if you're interested.

Line notations and connection tables

There are decades of debate on the advantages of line notations over connection tables, or vice versa. In short, connection tables are easy to understand and parse into an internal molecule data structure, while line notations are usually more compact and can be printed on a single line. And in either case, at some point you need to turn the text representation into a data structure and treat it as a chemical compound rather than a string.

Line notations are a sort of intellectual challenge. This alone seems to justify some of the many papers proposing a new line notation. By comparison, Open Babel alone supports over 160 connection table formats, and there are untold more in-house or internal formats. Very few of these formats have ever been published, except perhaps in an appendix in a manual.

Programmers like simple formats because they are easy to parse, often easy to parse quickly, and easy to maintain. Going back the Warr quote earlier, it's hard to parse WLN efficiently.

On the other hand, line notations fit better with text-oriented systems. Back in the 1960s and 1970s, ISI (the Institute for Scientific Information) indexed a large subset of the chemistry literature and distributed the WLNs as paper publications, in a permuted table to help chemists search the publication by hand. ISI was a big proponent of WLN. And of course it was easy to put a WLN on a punched card and search it mechanically, without an expensive computer,

Even now, a lot of people use Excel or Spotfire to display their tabular data. It's very convenient to store the SMILES as a "word" in a text cell.

Line notations also tend to be smaller than connection tables. As an example, the connection table lines from the PubChem SD files (excluding the tag data) average about 4K per record. The PUBCHEM_OPENEYE_ISO_SMILES tag values average about 60 bytes in length.

Don't take the factor of 70 as being all that meaningful. The molfile format is not particularly compact, PubChem includes a bunch of "0" entries which could be omitted, and the molfile stores the coordinates, which the SMILES does not. The CAS search system in the late 1980s used about 256 bytes for each compact connection table, which is still 4x larger than the equivalent SMILES.

Dave is right. SMILES, unlike most earlier line notations, really is built with computer parsing in mind. Its context-free grammar is easy to parse using simple stack, thought still not as easy as a connection table. It doesn't require much in the way of lookup tables or state information. There's also a pretty natural mapping from the SMILES to the corresponding topology.

What happens if you had a really fast SMILES parser? As a thought experiment which doesn't reflect real hardware, suppose you could convert 60 bytes of SMILES string to a molecule data structure faster than you could read the additional 400 bytes of connection table data. (Let's say the 10 GHz CPU is connected to the data through a 2400 baud modem.) Then clearly it's best to use a SMILES, even if it takes longer to process.

A goal for the Daylight toolkit was to make SMILES parsing so fast that there was no reason to store structures in a special internal binary format or data structure. Instead, when it's needed, parse the SMILES into a molecule object, use the molecule, and throw it away.

On the topic of muck and horseshoes, as I recall Daylight hired an outside company at one point to go through the code and optimize it for performance.


SMARTS is a natural recasting and extension of the SMILES grammar to define a related grammar for substructure search.

I started in chemical information in the late 1990s, with the Daylight toolkit and a background which included university courses on computer grammars like regular expressions. The analogy of SMARTS to SMILES is like regular expressions to strings seems obvious, and I modeled my PyDaylight API on the equivalent Python regular expression API.

Only years later did I start to get interested in the history of chemical information, though I've only gotten up to the early 1970s so there's a big gap that I'm missing. Clearly there were molecular query representations before SMARTS. What I haven't found is a query line notation, much less one implemented in multiple systems. This is a topic I need to research more.

The term "fingerprint"

Fingerprints are a core part of modern cheminformatics. I was therefore surprised to discover that Daylight introduced the term "fingerprint" to the field, around 1990 or so.

The concept existed before then. Adamson and Bush did some of the initial work in using fingerprint similarity as a proxy for molecular similarity in 1973, and Willett and Winterman's 1986 papers [1, 2] (the latter also with Bawden) reevaluated the earlier work and informed the world of the effectiveness of the Tanimoto. (We call it "Tanimoto" instead of "Jaccard" precisely because of those Sheffield papers.)

But up until the early 1990s, published papers referred to fingerprints as the "screening set" or "bit screens", which describes the source of the fingerprint data, and didn't reifying them into an independent concept. The very first papers which used "fingerprint" were by Yvonne Martin, an early Daylight user at Abbott, and John Barnard, who uses "fingerprint", in quotes, in reference specifically to Daylight technology.

I spent a while trying to figure out the etymology of the term. I asked Dave about it, but it isn't the sort of topic which interests him, and he didn't remember. "Fingerprint" already existed in chemistry, for IR spectra, and the methods for matching spectra are somewhat similar to those of cheminformatics fingerprint similarity, but not enough for me to be happy with the connection. Early in his career Dave wrote software for a mass spectrometer, so I'm also not rejecting the possibility.

The term "fingerprint" was also used in cryptographic hash functions, like "Fingerprinting by Random Polynomial" by Rabin (1981). However, these fingerprints can only be used to test if two fingerprints are identical. They are specifically designed to make it hard to use the fingerprints to test for similarity of the source data.

I've also found many papers talking about image fingerprints or audio fingerprints which can be used for both identity and similarity testing; so-called "perceptual hashes". However, their use of "fingerprint" seems to have started a good decade after Daylight popularized it in cheminformatics.

Hash fingerprints

Daylight needed a new name for fingerprints because they used a new approach to screening.

Fingerprint-like molecular descriptions go back to at least the Frear code of the 1940s. Nearly all of the work in the intervening 45 years was focused on finding fragments, or fragment patterns, which would improve substructure screens.

Screen selection was driven almost entirely by economics. Screens are cheap, with data storage as the primary cost. Atom-by-atom matching, on the other hand, had a very expensive CPU cost. The more effective the screens, the better the screenout, the lower the overall cost for an exact substructure search.

The best screen would have around 50% selection/rejection on each bit, with no correlation between the bits. If that could exist, then an effective screen for 1 million structures would need only 20 bits. This doesn't exist, because few fragments meet that criteria. The Sheffield group in the early 1970s (who often quoted Mooers as the source of the observation) looked instead at more generic fragment descriptions, rather than specific patterns. This approach was further refined by BASIC in Basel and then at CAS to become the CAS Online screens. This is likely the pinnacle of the 1970s screen development.

Even then, it had about 4000 patterns assigned to 2048 bits. (Multiple rare fragments can be assigned to the same bit with little reduction in selectivity.)

A problem with a fragment dictionary is that it can't take advantage of unknown fragments. Suppose your fragment dictionary is optimized for pharmaceutical compounds, then someone does a search for plutonium. If there isn't an existing fragment definition like "transuranic element" or "unusual atom", then the screen will not be able to reject any structures. Instead, it will slowly go through the entire data set only to return no matches.

This specific problem is well known, and the reason for the "OTHER" bit of the MACCS keys. However, other types of queries may still have an excessive number of false positives during screening.

Daylight's new approach was the enumeration-based hash fingerprint. Enumerate all subgraphs of a certain size and type (traditionally all linear paths with up to 7 atoms), choose a canonical order, and use the atom and bond types in order t generate a hash value. Use this value to seed a pseudo random number generator, then generate a few values to set bits in the fingerprint; the specific number depends on the size of the subgraph. (The details of the hash function and the number of bits to set were also part of Daylight's "secret sauce.")

Information theory was not new in chemical information. Calvin Mooers developed superimposed coding back in the 1940s in part to improve the information density of chemical information on punched cards (and later complained about how computer scientists rediscovered it as hash tables). The Sheffield group also used information theory to guide their understanding of the screen selection problem. Feldman and Hodes in the 1970s developed screens by the full enumeration of common subgraphs of the target set and a variant of Mooers' superimposed coding.

But Daylight was able to combine information theory and computer science theory (i.e., hash tables) to develop a fingerprint generation technique which was completely new. And I do mean completely new.

Remember how I mentioned there are SMILES-like line notations in the literature, even if people never really used them? I've looked hard, and only with a large stretch of optimism can I find anything like the Daylight fingerprints before Daylight, and mostly as handwaving proposal for what might be possible. Nowadays, almost every fingerprint is based on a variation of the hash approach, rather than a fragment dictionary.

In addition, because of the higher information density, Daylight fingerprints were effective as both a substructure screen and a similarity fingerprint using only 1024 bits, instead of the 2048 bits of the CAS Online screen. This will be important in the next section.

Chemical data vs. compute power and storage size

CAS had 4 million chemical records in 1968. The only cost-effective way to store that data was on tape. Companies could and did order data tapes from CAS for use on their own corporate computers.

Software is designed for the hardware, so the early systems were built first for tape drives and then, as they became affordable, for the random-access capabilities of hard disks. A substructure search would first check against the screens to reject obvious mismatches, then for each of the remaining candidates, read the corresponding record off disk and do the full atom-by-atom match.

Apparently Derek Price came up with the "law of exponential increase" in "Science Since Babylon" (1961), which describes how science information has exponential growth. I've only heard about that second hand. Chemical data is no exception, and its exponential growth was noted, I believe, in the 1950s by Perry.

In their 1971 text book, Lynch, et al. observed the doubling period was about 12 years. I've recalculated that number over a longer baseline, and it still holds. CAS has 4 million structures in 1968 and 100 million structure in 2015, which is a doubling every 10-11 years.

On the other hand, computers have gotten more powerful at a faster rate. For decades the effective computing power doubled every 2 years, and the amount of RAM and data storage for constant dollars has doubled even faster than that.

In retrospect it's clear that at some point it would be possible to store all of the world's chemistry data, or at least a corporate database, in memory.

Weininger's Realization

Disks are slow. Remember how the atom-by-atom search needed to pull a record off the disk? That means the computer needs to move the disk arm to the right spot and wait for the data to come by, while the CPU simply waits. If the data were in RAM then it would be 10,000x faster to fetch a randomly chosen record.

Putting it all in RAM sounds like the right solution, but in in the early 1980s memory was something like $2,000 per MB while hard disk space was about $300 per MB. One million compounds with 256 bytes per connection table and 256 bytes per screen requires almost 500MB of space. No wonder people kept the data on disk.

By 1990, RAM was about $80/MB while hard disk storage was $4/MB, while the amount of chemistry data had only doubled.

Dave, or at least someone at Daylight, must have realized that the two different exponential growth rates make for a game changer, and that the Daylight approach would give them a head start over the established vendors. This is explicit in the Daylight documentation for Merlin:

The basic idea behind Merlin is that data in a computer's main memory can be manipulated roughly five orders of magnitude faster than data on its disk. Throughout the history of computers, there has been a price-capacity-speed tradeoff for data storage: Large-capacity storage (tapes, drums, disks, CD-ROMS) is affordable but slow; high-speed storage ("core", RAM) is expensive but fast. Until recently, high-speed memory was so costly that even a modest amount of chemical information had to be stored on tapes or disks.

But technology has a way of overwhelming problems like this. The amount of chemical information is growing at an alarming rate, but the size of computer memories is growing even faster: at an exponential rate. In the mid-1980's it became possible for a moderately large minicomputer to fit a chemical database of several tens of thousands of structures into its memory. By the early 1990's, a desktop "workstation" could be purchased that could hold all of the known chemicals in the world (ca. 15 million structures) in its memory, along with a bit of information about each.

On the surface, in-memory operations seem like a straightforward good deal: A computer's memory is typically 105 times faster than its disk, so everything you could do on disk is 100000 times faster when you do it in memory. But these simple numbers, while impressive, don't capture the real differences between disk- and memory-based searches:

Scaling down

I pointed out earlier that SMILES and fingerprints take up less space. I estimate it was 1/3 the space of what CAS needed, which is the only comparison I've been able to figure out. That let Daylight scale up to larger data set for a given price, but also scale down to smaller hardware.

Let's say you had 250,000 structures in the early 1990s. With the Daylight system you would need just under 128 MB of RAM, which meant you could buy a Sun 3, which maxed out at 128 MB, instead of a more expensive computer.

It still requires a lot of RAM, and that's where Yosi comes in. His background was in hardware sales, and he know how to get a computer with a lot of RAM in it. Once the system was ready, Dave and his brother Art put it in the back of a van and went around the country to potential customers to give a demo, often to much astonishment that it could be so fast.

I think the price of RAM was the most important hardware factor to the success of Daylight, but it's not the only one. When I presented some of these ideas at Novartis in 2015, Bernhard Rohde correctly pointed out that decreasing price of hardware also meant that computers were no longer big purchase items bought and managed by IT, but something that even individual researchers could buy. That's another aspect of scaling down.

While Daylight did sell to corporate IT, their heart was in providing tools and especially toolkits to the programmer-chemists who would further develop solutions for their company.

Success and competitors

By about 1990, Daylight was a market success. I have no real idea how much profit the company made, but it was enough that Dave bought his own planes, including a fighter jet. When I was at the Daylight Summer School in 1998, the students over dinner came up with a minimum of $15 million in income, and maximum of $2 million in expenses.

It was also a scientific success, measured by the number of people talking about SMILES and fingerprints in the literature.

I am not a market analyst so I can't give that context. I'm more of a scholar. I've been reading through the old issues of JCICS (now titled JCIM) trying to identify the breakthrough transition for Daylight. There is no bright line, but there is tantalizing between-the-lines.

In 1992 or so (I'll have to track it down), there's a review of a database vendors' product. The reviewer mentions that the vendor plans to have an in-memory database the next year. I can't help but interpret it as a competitor responding to the the new Daylight system, and having to deal with customers who now understood Weininger's Realization.

Dave is a revolutionary

The decreasing price of RAM and hardware may help explain the Daylight's market success, but Dave wasn't driven by trying to be the next big company. You can see that in how the company acted. Before the Oracle cartridge, they catered more towards the programmer-chemist. They sold VMS and then Unix database servers and toolkits, with somewhat primitive database clients written using the XView widget toolkit for X. I remember Dave once saying that the clients were meant as examples of what users could do, rather than as complete applications. A different sort of company would have developed Windows clients and servers, more tools for non-programmer chemists, and focused more on selling enterprise solutions to IT and middle managers.

A different sort of company would have tried to be the next MDL. Dave didn't think they were having fun at MDL, so why would he want to be like them?

Dave was driven by the idea of bringing chemical information away from the "high priests" who held the secret knowledge of how to make things work. Look at SMILES - Dyson and WLN required extensive training, while SMILES could be taught to non-chemists in an hour or so. Look at fingerprints - the CAS Online screens were the result of years of research in picking out just the right fragments, based on close analysis of the types of queries people do, while the hash fingerprints can be implemented in a day. Look even at the Daylight depictions, which were well known as being ugly. But Dave like to point out that the code, at least originally, needed only 4K. That's the sort of code a non-priest could understand, and the sort of justification a revolutionary could appreciate.

Dave is a successful revolutionary, which is rare. SMILES, SMARTS and fingerprints are part of the way we think about modern cheminformatics. Innumerable tools implement them, or variations of them.

High priests of chemical information

Revolutionary zeal is powerful. I remember hearing Dave's "high priests" talk back in 1998 and the feeling empowered, that yes, even as a new person in the field, cheminformatics is something I can take on on my own.

As I learn more about the history of the field, I've also learned that Dave's view is not that uncommon. In the post-war era the new field of information retrieval wanted to challenge the high priests of library science. (Unfortunately I can't find that reference now.)

Michael Lynch would have been the high priest of chemical information if there ever was one. Yet at ICCS 1988 he comments "I can recollect very little sense, in 1973, that this revolution was imminent. Georges Anderla .. noted that the future impact of very large scale integration (VLSI) was evident only to a very few at that time, so that he quite properly based his projections on the characteristics of the mainframe and minicomputer types then extant. As a result, he noted, he quite failed to see, first, that the PC would result in expertise becoming vastly more widely disseminated, with power passing out of the hands of the small priesthood of computer experts, thus tapping a huge reservoir of innovative thinking, and, second, that the workstation, rather than the dumb terminal, would become standard."

A few years I talked with Egon Willighagen. He is one of the CDK developers and an advocate of free and open source software for chemical information. He also used the metaphor of talking information from the high priests to the people, but in his case he meant the previous generation of closed commercial tools, like the Daylight toolkit.

Indeed, one way to think of it is that Dave the firebrand of Daylight became it high priest of Daylight, and only the priests of Daylight control the secret knowledge of fingerprint generation and canonicalization.

That's why I no longer like the metaphor. Lynch and the Sheffield group published many books and papers, including multiple textbooks on how to work with chemical information. Dave and Daylight did a lot of work to disseminate the Daylight way of thinking about cheminformatics. These are not high priests hoarding occult knowledge, but humans trying to do the right thing in an imperfect world.

There's also danger in the metaphor. Firebrand revolutionaries doesn't tend to know history. Perhaps some of the temple should be saved? At the very least there might be bad feelings if you declare your ideas revolutionary only to find out that not only they are not new, and you are talking to the previous revolutionary who proposed them.

John Barnard told me a story of Dave and Lynch meeting at ICCS in 1988, I believe. Dave explained how his fingerprint algorithm worked. Lynch commented something like "so it's like Calvin Mooers' superimposed coding"? Lynch knew his history, and he was correct - fingerprints and superimposed coding are related, though not the same. Dave did not know the history or how to answer the question.

My view has its own danger. With 75 years of chemical information history, one might feel a paralysis of not doing something out of worry that it's been done before and you just don't know about it.

Leaving Daylight

In the early 2000s Dave became less interested in chemical information. He had grown increasingly disenchanted with the ethics of how pharmaceutical companies work and do research. Those who swear allegience to money would have no problems just making a sale and profits, but he was more interested in ideas and people. He had plenty of money.

He was concerned about the ethics of how people used his work, though I don't know how big a role that was overall. Early on, when Daylight was still located at Pomona, they got purchase request from the US military. He didn't want the Daylight tools to be used to make chemical weapons, and put off fulfilling the sale. The military group that wanted the software contacted him and pointed out that they were actually going to use the tools to develop defenses against chemical weapons, which helped him change his mind.

I think the Daylight Oracle cartridge marked the big transition for Daylight. The 1990s was the end of custom domain-specific databases like Thor and Merlin and the rise of object-relational databases with domain-specific extensions. Norah MacCuish (then Shemetulskis) helped show how the Daylight functionality could work as an Informix DataBlade. Oracle then developed their data cartridge, and most of the industry, including the corporate IT that used the Daylight servers, followed suit.

Most of Daylight's income came from the databases, not the toolkit. Companies would rather use widely-used technology because it's easier to find people with the skills to use it, and there can be better integration with other tools. If Daylight didn't switch, then companies would turn to competitive products which, while perhaps not as elegant, were a better fit for IT needs. Daylight did switch, and the Daylight user group meetings (known as "MUG" because it started off as the Medchem User Group) started being filled by DBAs and IT support, not the geek programmer-chemist that were interested in linguistics and information theory and the other topics that excited Dave.

It didn't help that the Daylight tools were somewhat insular and static. The toolkit didn't have an officially supported way to import SD file, though there was a user-contributed package for that, which Daylight staff did maintain. Ragu Bharadwaj developed Daylight's JavaGrins chemical sketcher, which was a first step towards are more GUI-centered Daylight system, but also the only step. The RDKit, as an example of a modern chemical informatics toolkit, includes algorithms Murko scaffolds, matched molecular pairs, and maximum common substructure, and continues to get new ones. But Daylight didn't go that route either.

What I think it comes down to is that Dave was really interested in databases and compound collections. Daylight was also a reseller of commercial chemical databases, and Dave enjoyed looking through the different sets. He told me there was a different feel to the chemistry done in different countries, and he could spot that by looking at the data. He was less interested in the other parts of cheminformatics, and as the importance of old-school "chemical information" diminished in the newly renamed field of "chem(o)informatics", so too did his interest, as did the interest from Daylight customers.


Dave left chemical information and switched to other topics. He tried to make theobromine-free chocolate, for reasons I don't really understand, though I see now that many people buy carob as a chocolate alternative because it's theobromine free and thus stimulant free. He was also interested in binaural recording and hearing in general. He bought the house next door to turn it into a binaural recoding studio.

He became very big into solar power, as a way to help improve the world. He bought 12 power panels, which he called The Twelve Muses, from a German company and installed them for his houses. These were sun trackers, to maximize the power. Now, Santa Fe is at 2,300 m/7,000 ft. elevation, in a dry, near-desert environment. He managed to overload the transformers because they produced a lot more power than the German-made manufacturer expected. Once that was fixed, both houses were solar powered, plus he had several electric cars and motorcycles, and could feed power back into the grid for profit. He provided a recharging station for the homebrew electric car community who wanted to drive from, say, Albuquerque to Los Alamos. (This was years before Teslas were on the market). Because he had his own DC power source, he was also able to provide a balanced power system to his recording studio and minimize the power hum noise.

He tried to persuade the state of New Mexico to invest more in solar power. It makes sense, and he showed that it was possible. But while he was still a revolutionary, he was not a politician, and wasn't able to make the change he wanted.

When I last saw him in late 2015, he was making a cookbook of New Mexico desserts.

Dave will always have a place in my heart.

Andrew Dalke
Trollhättan, Sweden
2 December 2016


Andrew Dalke is an independent consultant focusing on software development for computational chemistry and biology. Need contract programming, help, or training? Contact me

Copyright © 2001-2013 Andrew Dalke Scientific AB