Dalke Scientific Software: More science. Less time.
/home/writings/diary/archive/2020/09/14/similarity_principle_variations

Similarity principle variations

Maggiora, Vogt, Stumpfe, and Bajorath, in their 2014 J. Med. Chem. miniperspective "Molecular Similarity in Medicinal Chemistry", write:

"In the context of a seminal book publication [8] that appeared in the early 1990s when molecular similarity analysis first became popular, the similarity property principle (SPP) emerged, which stated that similar compounds should have similar properties, the most frequently studied property being biological activity."

That citation "8" is "Concepts and Applications of Molecular Similarity", edited by Mark A. Johnson and Gerald M. Maggiora (1990), published by Wiley. This is an often-cited reference in the cheminformatics literature. Google Scholar knows about 1411 citations to it.

There's a subtle nuance to the Maggiora et al. miniperspective quote which I think has been overlooked by many of the people who cite it - the 1990 book doesn't actually define a similarity principle! That's why the miniperspective uses the phrases "In the context of" and "emerged".

As it turns out, there isn't even a widely accepted name for this principle. (By "widely accepted" I mean a name that accounts for more than 50% of the Google Scholar matches.)

I want to be clear - in almost all cases and for most people it's still the correct reference to use. But I think many people aren't aware of the context. At least, I wasn't a few years ago when I first looked at the book.

Some citations from 2020 to Johnson and Maggiora (eds.) (1990)

I used Google Scholar to find papers published since 2020 which cite the book. Here are some of the relevant quotes:

Now for Wikipedia's entry on Chemical similarity:

Even from this small selection you can see a diversity of names: "similarity-property principle" (in various spellings), "similarity principle", and "Molecular Similarity Principle".

(You can also see that a couple of citations omit the important qualifier "should".)

Side note: chemical or molecular similarity?

Wikipedia considers these terms the same. Quoting the Maggiora et al. miniperspective:

"Chemical or Molecular Similarity? Although the terms chemical and molecular similarity are often used synonymously, this may not be entirely accurate. Chemical similarity is based primarily on the physicochemical characteristics of compounds (e.g., solubility, boiling point, log P, molecular weight, electron densities, dipole moments, etc.) while molecular similarity focuses primarily on the structural features (e.g., shared substructures, ring systems, topologies, etc.) of compounds and their representation."

I am unable to judge if there is widespread agreement with this interpretation. A Google Scholar search for "chemical similarity" Maggiora finds about 1,070 matches while "molecular similarity" Maggiora finds about 2,520 matches. I tried reading a few, but quickly gave up on trying to figure out the nuance of each one and how it applies.

Similarity principles in Johnson and Maggiora

Given the diversity of names, what does the original book use?

The book is out of print. Used copies go for over US$150. Happily for me, the Chalmers library has an excellent chemistry collection, including that book. I was able to scan and OCR it to help me search for phrases related to "similarity principle" or "similar compounds have similar properties". The closest I found, citing the author(s) of the relevant contributed chapter, are:

Without doubt, the underlying premise of the book is that similar molecules often have similar properties, that measures of similarity can be automated, and that these measures can be used for property prediction and optimization. This is the reason why so many people cite the book.

But the only use of the term "similarity principle" is Rouvray's "molecules undergo transitions or reactions they always do so in a way that minimizes changes in the positions of the nuclei"; there is no use of "similarity property principle"; and the closest definition to "similar compounds should have similar properties" is the structure-macroscopic-property concept.

Hardly the standard modern formulation!

Earlier references to the similarity property principle

The general idea that similar compounds have similar properties is very old, and definitely not new with the book. In the preface, Johnson and Maggiora are very clear they are not trying to claim any new observation:

"Applications that make use, either explicitly or implicitly, of the concept of molecular similarity in chemistry are numerous, and indeed lie at the heart of a significant body of chemical research. Recently, attempts have been made to place molecular similarity on a more rigorous mathematical and conceptual footing. The fact remains, however, that the principal results lie scattered and isolated in unrelated journals and proceedings from diverse symposia. Moreover, the unifying concept of molecular similarity remains unstated and largely unrecognized. Currently, there is no single source from which one might obtain a reasonable introduction to the broad notion of molecular similarity or to an overview of current developments in the field. Thus, the time appears right for an edited volume of definitive overviews of the topics related to the definition, computation, and application of molecular similarity that emphasizes current research trends and highlights molecular similarity as the unifying concept."

Which means people clearly don't cite Johnson and Maggiora (1990) because it is the first to state the similarity principle, nor because it's the first to describe the underlying concept. Let's look for some earlier uses.

My go-to tool to find earlier citations is Google Scholar (because it doesn't cost me anything). I searched for "similar compounds" "similar properties" (413 results) and "similar molecules" "similar properties" (98 results). The large majority are of the form "we made compound X and measured property Y. We also tested compounds similar to X and found they had similar properties." But I'm looking for broader characterizations which deserve the term "principle".

First off, there are publications concerning patent law. I covered these in two previous essays, but in short the influential decision In re Hoch (Application of Paul E. Hoch, 428 F.2d 1341 (C.C.P.A. 1970)) from 1970 includes:

"Such actual differences in properties are required to overcome a prima facie case of obviousness because the prima facie case, at least to a major extent, is based on the expectation that compounds which are very similar in structure will have similar properties."

Second, there are previous publications with Johnson and/or Maggiora as (co-)authors:

The first of these cites Randić, who wrote chapter 5 of the 1990 book ("Design of Molecules with Desired Properties"). Papers (co-)authored by Randić make up a third set of earlier uses of the similarity principle:

There are a couple of relevant papers by Herndon and Bertz, in 1987, which appear to be along the lines of Randić, and were likely directly influenced by that earlier work. They give one rigorous definition of similarity, but do not give the comprehensive treatment that the later 1990 book does.

There's a 1985 paper by Carhart, Smith, and Venkataraghavan which looks at using atom pairs as part of the then-new QSAR field:

Finally, there are a few late 1980s references which appear to be part of the same zeitgeist concerning molecular similarity and QSAR:


Randić? Or Johnson and Maggiora? … Or Hoch?

As you see, the observation "similar compounds should have similar properties" is not original to Johnson and Maggiora, nor do they claim it is. If you really need the first use of that sort of phrase, see In re Hoch (1970) or Randić (1979 or 1984).

And yes, some people prefer to cite the first use of a concept, rather than the use which popularized it. Which is fine!

Otherwise, in cheminformatics we don't cite Hoch because it's a patent case with no applicability to an underlying point of the 1990 book, which is that we can automate definitions of molecular similarity and apply them to property prediction and optimization.

Randić's work used automated definitions of similarity, with a focus on correlating graph invariants with molecular properties. This is much more aligned with the 1990 book, and indeed Randić wrote one of the chapters of the book. But his earlier work - which includes the phrase "Principle of Similarity" - doesn't tie the concept together with other approaches to similarity, which is likely why most people don't cite Randić - even though in cheminformatics he appears to be the first to use what is essentially the modern phrase.

Carhart et al., and Herndon and Bertz, are other possible candidates as precursors to Johnson and Maggiora (1990), but their treatment of the topic is, like Randić's, more focused on specific approaches than on the larger context, and it came after Randić.

From second- and third-hand accounts, what I've heard is that in the 1980s Johnson and Maggiora were key figures in a movement to consider similarity more rigorously, and make it more prominent.

They succeeded.

And that's why I think their names, as editors of the book, are so often cited, even though others earlier made the same observation.

What is the correct name of the principle?

All that, and oddly, I still don't know what to call it. Different people use different phrases. If I had to pick a name, I would follow the lead of Maggiora et al.'s miniperspective and call it the "similarity property principle". But that's a clear minority term.

I used Google Scholar to give me citation counts for different 5-year periods, all of the form maggiora "$PHRASE", resulting in the following:

similar property
similarity property
similarity principle
molecular similarity
1995-1999 21 2 4 0 14
2000-2004 39 9 1 6 20
2005-2009 82 31 3 18 73
2010-2014 68 47 4 16 98
2015-2019 87 54 8 9 109
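
The searches behind a table like this can be sketched in a few lines of Python. This is only an illustration of the maggiora "$PHRASE" query form and the 5-year periods described above; it does not fetch or count anything. The phrase list is an assumption drawn from names mentioned in this essay, not necessarily the exact phrases searched, and the function names are hypothetical. The `q`, `as_ylo` and `as_yhi` URL parameters are what Google Scholar's web search form appears to use for the query text and the year range; treat them as an assumption rather than a documented API.

```python
from urllib.parse import urlencode

# Illustrative phrases taken from names mentioned in the essay.
# Assumption: these are not necessarily the exact phrases behind the table.
PHRASES = [
    "similar property principle",
    "similarity property principle",
    "similarity principle",
    "molecular similarity principle",
]


def five_year_periods(first_year=1995, last_year=2019):
    """Yield inclusive (start, end) year pairs for consecutive 5-year periods."""
    for start in range(first_year, last_year + 1, 5):
        yield start, start + 4


def scholar_query(phrase, author="maggiora"):
    """Build a query of the form: maggiora "$PHRASE"."""
    return f'{author} "{phrase}"'


def scholar_url(phrase, start, end):
    """Build a search URL for one phrase and one period.

    Assumption: q / as_ylo / as_yhi are the query and year-range
    parameters of the Google Scholar web form.
    """
    params = {"q": scholar_query(phrase), "as_ylo": start, "as_yhi": end}
    return "https://scholar.google.com/scholar?" + urlencode(params)


if __name__ == "__main__":
    # Print one URL per (period, phrase) pair; the "about N results"
    # figure still has to be read off each results page by hand.
    for start, end in five_year_periods():
        for phrase in PHRASES:
            print(f"{start}-{end}", scholar_url(phrase, start, end))
```

Each printed URL can be opened in a browser and the result count noted by hand, one table cell per (period, phrase) pair.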

We need to take those numbers with a big grain of salt (cum grano salis, quoting Ugi et al.) because I didn't inspect each one. Some are low-quality papers, and might have copied from Wikipedia. Some use "similarity principle" but in a context where it's clear that they mean something else, like the following two:

Here's another alternative formulation:

(My guess is that "similar property principle" is more often used by people who went to Sheffield, but I haven't looked at the distribution of authors.)

And still others cite Johnson and Maggiora in general for the book's impact (citation 1 in the below quote), and Randić (citation 15 in the below quote) for the earlier scientific publication with the specific phrasing - which I think is perfectly reasonable:

Take home message?

I don't have one.

I started this essay 10 days ago to point out the oddity that Johnson and Maggiora's 1990 book didn't quite contain the succinct name and phrasing now associated with it. It took a long time for me to get there because the basic similarity principle has been around since the 1800s. I had to show that while the principle appears in earlier contexts (especially patent law), those uses weren't really the same, as those earlier contexts depend on human judgment for the whole process, while the similarity movement in the 1980s was based on using automated methods to help with property prediction and optimization.

Even then, it seems Randić deserves some credit that is overshadowed by the ease of saying Johnson and Maggiora (eds.) (1990) and the comprehensiveness of that book. But enough to get people to start citing him instead? I don't know. Probably not. And probably they shouldn't?

The Journal of Cheminformatics recently adopted the Citation Typing Ontology. I'm not even sure what to use as the ontology for a citation to Johnson and Maggiora. Perhaps "cites as authority"? That's "The citing entity cites the cited entity as one that provides an authoritative description or definition of the subject under discussion." But it's not authoritative about the name of the principle. Perhaps "cites as recommended reading"? Or "credits", which is "The citing entity acknowledges contributions made by the cited entity"?

I'm going to go back to talking about chemfp for a while. ;)

Andrew Dalke is an independent consultant focusing on software development for computational chemistry and biology. Need contract programming, help, or training? Contact me

Copyright © 2001-2020 Andrew Dalke Scientific AB