/home/writings/diary/archive/2014/06/19/Calvin_Mooers

Calvin Mooers

Mooers became an information scientist at the time when information science was just getting started. I came across his name in "An efficient design for chemical structure searching" by Feldman and Hodes, JCICS 1975 15 (3) pp 147-152. The paper is basically built on work Mooers had done a few decades earlier, and includes a magic value of 0.69 as the "Mooers limit." I made a mental note to follow up on that later. This essay is the result of that follow-up, and is a small biography of Mooers' involvement with chemical documentation.

Influence of Mooers: connection tables, screens, and canonicalization

As I read more of the literature, I realized that Mooers had a big influence on the early decades of cheminformatics. He seems to have been the first person to use connection tables for coding molecular information on a computer, the first to describe substructure enumeration-based screens, and the first to consider a canonical representation of a molecule.

Here are some quotes to show you what I mean.

connection table and screens

Perhaps the founding paper of cheminformatics is Ray and Kirsch's "Finding Chemical Records by Digital Computers" in Science 25 October 1957, pp 814-819. It describes a project to use computers for substructure search of the patent database. It has the earliest use I know for "screen" (or "screening device"), as used for substructure search.

It contains two references to Mooers, the first concerning a connection table:

An example of a code suitable for machine searching was described by Mooers in the "Zatopleg" (5) system of ciphering structural formulas. Mooers' method of representing compounds provided the basis for representing the input data in the SEAC structure search routine described below. Methods for actually searching such data had to be developed.
where (5) is C. N. Mooers, Ciphering Structural Formulas – the Zatopleg System (Zator Co., Cambridge, Mass., 1951).

The second concerns a substructure screen, which Ray and Kirsch call "Mooers' N-tuple descriptors":

It has been suggested by Mooers (7) that, for purposes of retrieval, complex structures such as chemical diagrams can be represented in terms of a list of, say, all of the triples of atoms and bonds occurring within the structure. Thus, chloral (Fig. 3) would be described as consisting of combinations of the triples in the following list: Cl-C; C-C; -C-; -C=; C-H; C=O.
where (7) is C. N. Mooers, "Information retrieval on structural content," in Information Theory (Academic Press, New York, 1956), pp. 121-134. (See below for the full citation; there are many books titled "Information Theory".)
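To make the screening idea concrete, here is a minimal Python sketch. It is my own simplification, not Mooers' or Ray and Kirsch's actual scheme: it only enumerates atom-bond-atom pairs, and the names pair_fragments and passes_screen and the tiny molecule representation are hypothetical. The point is that a query can only be a substructure of a record if every fragment in the query also occurs in the record, so cheap set comparisons can reject most records before the expensive atom-by-atom search.

```python
# A molecule is represented (as an assumption for this sketch) by a list of
# element symbols and a list of (atom_index, atom_index, bond_symbol) bonds.

def pair_fragments(atoms, bonds):
    """Return the set of atom-bond-atom fragments found in a structure."""
    fragments = set()
    for i, j, bond in bonds:
        # Sort the two symbols so that "Cl-C" and "C-Cl" give the same fragment.
        a, b = sorted((atoms[i], atoms[j]))
        fragments.add(f"{a}{bond}{b}")
    return fragments


def passes_screen(query_fragments, target_fragments):
    """A target can contain the query only if it has all of the query's fragments.

    Passing the screen is necessary but not sufficient; a full substructure
    match is still needed for the records that survive.
    """
    return query_fragments <= target_fragments


# Chloral, CCl3-CHO, with hydrogens written out explicitly.
CHLORAL_ATOMS = ["C", "Cl", "Cl", "Cl", "C", "H", "O"]
CHLORAL_BONDS = [(0, 1, "-"), (0, 2, "-"), (0, 3, "-"),
                 (0, 4, "-"), (4, 5, "-"), (4, 6, "=")]

chloral = pair_fragments(CHLORAL_ATOMS, CHLORAL_BONDS)

# An aldehyde-like query (H-C=O) and a query containing a C-N bond.
aldehyde = pair_fragments(["C", "H", "O"], [(0, 1, "-"), (0, 2, "=")])
amine = pair_fragments(["C", "N"], [(0, 1, "-")])

print(sorted(chloral))
print(passes_screen(aldehyde, chloral))
print(passes_screen(amine, chloral))
```

The aldehyde query passes the screen and the C-N query is rejected without ever looking at how the atoms are connected to each other.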

Cossum, Krakiwsky, and Lynch in "Advances in Automatic Chemical Substructure Searching Techniques", J. Chem. Doc., 1965, 5 (1), pp 33-35 reaffirm that Mooers' "code suitable for machine searching" is actually a connection table.

The connection table which we use is a development of that first suggested by Mooers, and tested in search by Ray and Kirsch.

canonical connection table

People usually refer to Morgan's "The Generation of a Unique Machine Description for Chemical Structures", J. Chem. Doc., 1965, 5 (2), pp 107-113, as the first paper to describe a molecular canonicalization algorithm. In the paper, Morgan writes:

Since it can be shown that the set is finite for any graph composed of a finite number of nodes, it is possible to select the unique table by generating all members of the set, lexicographically ordering the members of the set based on the characters involved in the description, and then selecting the first member of the resulting list as the unique table. This concept is a restatement of a technique proposed by C. N. Mooers for generating a unique cipher based on a process of making all possible "cuts" and comparing the resulting ciphers. [10, 11].
In this case, [10] is the same "Ciphering Structural Formulas – The Zatopleg System" as before, and [11] is C. N. Mooers, "Generation of Unique Ciphers for a Finite Network," Zator Technical Bulletin No. 49, Zator Co., 79 Milk St., Boston 9, Mass.
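The "generate everything, sort, take the first" idea in that quote is easy to sketch in Python. This is only an illustration of the exhaustive approach as Morgan restates it. It is not Mooers' "cuts" procedure, which I haven't been able to read, and it is not Morgan's own algorithm, which exists precisely to avoid this factorial-time enumeration. The function names and the molecule representation are my own assumptions.

```python
from itertools import permutations


def table_for_order(atoms, bonds, order):
    """Describe the molecule with its atoms renumbered according to `order`.

    order[k] is the original index of the atom placed at new position k.
    """
    new_index = {old: new for new, old in enumerate(order)}
    atom_part = tuple(atoms[old] for old in order)
    bond_part = tuple(sorted(
        (min(new_index[i], new_index[j]), max(new_index[i], new_index[j]), bond)
        for i, j, bond in bonds
    ))
    return (atom_part, bond_part)


def canonical_table(atoms, bonds):
    """Try every atom numbering and keep the lexicographically smallest table.

    The set of numberings is finite (n! for n atoms), so the minimum exists
    and is the same no matter how the input happened to be numbered.
    """
    return min(
        table_for_order(atoms, bonds, order)
        for order in permutations(range(len(atoms)))
    )


# Two different numberings of the heavy atoms of ethanol (C-C-O) ...
ethanol_a = canonical_table(["C", "C", "O"], [(0, 1, "-"), (1, 2, "-")])
ethanol_b = canonical_table(["O", "C", "C"], [(0, 1, "-"), (1, 2, "-")])
# ... and dimethyl ether (C-O-C), which has the same atoms but different bonds.
ether = canonical_table(["C", "O", "C"], [(0, 1, "-"), (1, 2, "-")])

print(ethanol_a == ethanol_b)
print(ethanol_a == ether)
```

Both ethanol numberings give the same table, while dimethyl ether gives a different one, which is all a unique description needs to do.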

I can't help but conclude that Mooers' ideas had a big influence on the early days of cheminformatics.

Speaking of Morgan, I came across a reference to the Morgan canonicalization method in an essay by Charles Davis titled "Indexing and Index Editing at Chemical Abstracts before the Registry System":

Dyson, however, is remembered for having succeeded in lobbying for his system of linear notation, which won the approval of the International Union of Pure and Applied Chemistry (1961). This triumph over the more popular Wiswesser notation was something of a pyrrhic victory since linear notation ultimately would never be important to CAS. Other organizations, especially the Institute for Scientific Information, would go on to use Wiswesser notation, and it became an industrial standard for those who needed linear notation in their work (Davis & Rush, 1974a). However, the principal reason that CAS did not use either system was that during the early 1960s a young mathematician named Harry Morgan developed the famous algorithm that lead to the Registry System (Morgan, 1965; Davis & Rush, 1974b).

... Lynch's paper (M. F. Lynch, personal communication, 15 November 2002) expands on events during this era; moreover, he makes it clear that the Morgan algorithm was actually a revised version of the Gluck algorithm developed at DuPont.
When I read Gluck's paper about the Du Pont system, I wondered what algorithm they used in 1964 in order to canonicalize their molecules, since it was a year before Morgan's paper. Now I know! (For what it's worth, the priority rule is based on publication, not use, so we're still correct in citing Morgan, not Gluck.)

Who was Calvin Mooers?

Calvin Mooers and his wife gave an oral history of their - mostly Calvin's - lives. I used that as the scaffolding for this mini-biography.

Mooers worked for the Naval Ordnance Laboratory (NOL) during World War II. At the time, von Neumann was an advisor for the Navy. Von Neumann wanted a computer, and convinced the Navy to build one. The Navy decided to do so at NOL. Atanasoff, who invented the first electronic digital computer, was put in charge of the project. (Though apparently Atanasoff never talked about his earlier computer work while at NOL, nor was he much of a decision-making manager.)

At about 26, he decided to go to graduate school at MIT. He wanted to apply some of the skills he had from working with computers, considered a few possibilities, and decided to work on library science. His wartime experience had shown him that library systems couldn't cope with the enormous number of new publications that didn't fit into the existing classification system.

In his history he recounts: "Not long after (at MIT), I went to a lecture by Claude Shannon that he gave about 'information theory.' One of the conclusions of the lecture was that a random process had the statistics required for passing the highest quantity of information." Shannon and Weaver's The Mathematical Theory of Communication (1949) and Norbert Wiener's Cybernetics (1948) mark the dawn of the information theory age. Wiener was also at MIT, and Mooers in one of his papers notes that both books are very interesting.

His Master's thesis, Application of random codes to the gathering of statistical information, describes his "zatocoding" system of superimposed codes, which I'll get back to in a future essay. Appendix Z of the thesis was originally presented at an American Chemical Society meeting.

Indeed, he had a long association with the ACS. At MIT he met the chemist James Perry, who was interested in "chemical literature" and punched card-based information systems for chemical literature. (Punched cards were the new hotness, in part because of the article by Cox, Bailey, and Casey in C&EN 1945, "Punched Cards for a Chemical Bibliography".) Perry arranged things so that Mooers could present his early ideas at the ACS in 1947. This interaction helped lead to Zatocoding. (See The State Of The Library Art, Volume 4 for more about punched card systems from that era, including a summary of the Cox et al. method, mentions of Mooers' influence, and disagreements over the correct mathematical treatment of superimposed codes.)

He continued to be associated with the ACS. For example, he presented "Making Information Retrieval Pay" before the Chemical Literature Division of the ACS in September 1950, which described his audacious plan to index the Library of Congress catalog on a set of Zatocoded punch cards, and build a mechanical search engine that could conduct a search within a few minutes.

(I'll add that he is widely acknowledged for coining the term "information retrieval" and presenting it earlier that year. He also coined the term "descriptor":

Before 1948 the word did not exist, of course, and was not in the dictionary. It's now in the dictionary and most people don't know that it was my neologism. I made it up because I wanted the new word to mean exactly what I described and, unfortunately, that never happened. That is, the word descriptor now means almost anything.
According to the Wikipedia entry for index term, Mooers' definition is used in particular for a preferred term from a thesaurus.)

He started a company to commercialize his ideas. The first sale was to Merck, Sharp and Dohme (the US Merck, known still as "MSD" outside of the US). I assume it was a punchcard-based indexing system for chemical record lookup. In any case, it was in this period, around 1950, that he did most of his thinking about applying computers to chemistry records, which resulted in his Zatopleg system.

I think he didn't continue with chemistry in large part because he wanted to work on larger problems of library and information management, and of thinking machines. In the previously mentioned "Making Information Retrieval Pay", he proposed a DOKEN ("documentary engine"), a mechanical retrieval engine capable of searching a 100 million record catalog in 2 minutes, or the 10 million catalogued items of the Library of Congress in 10 seconds. I don't think chemistry documentation was as interesting to him.

In any case, history shows he was right. It took another 10 years before cheminformatics as its own field really started, and there were and are a lot more library systems than molecular database systems.

For various reasons, his business didn't do so well. It seems like librarians didn't like his ideas much. Not only did he want to replace a lot of manual indexing systems with random numbers, but the random numbers didn't make much sense to a non-mathematician. I've read some of his papers, and his style is an unfortunate combination of abstruse and opinionated that could put people on edge. It is also the combination that can energize people. The trick is to energize the people who will pay you money.

Other pioneers

My view is that while his specific background put him ahead of most others in how to think about information theory and computing devices, there were many others very close behind him who were less prickly to deal with.

Hans Peter Luhn

For example, at the suggestion of the Hollerith Company, Dr. Dyson presented to IBM his ideas for using punched cards [containing molecules in Dyson notation]. Accordingly, he and Mr. Peter Luhn (of IBM) built, in 1949, a machine which would sort free field code cards. In this way, the Luhn scanner came into operation. (From the book "Survey of Chemical Notation Systems.") This led to [Luhn's] interest in literary data processing.

"Mr. Peter Luhn" here is Hans Peter Luhn, another pioneer of information science. Among other things, he invented the checksum algorithm used in every credit card. He kept in touch with Dyson. Luhn developed the KWIC permuted index in 1958 or 1960 (sources differ, and this is too much of a tangent to track down). Dyson, who by then was in charge of research and development at Chemical Abstracts, invited Luhn to visit and to present this work. That's when CAS realized they should be looking towards computers, and then acquired IBM hardware. One result of this was the KWIC index for Chemical Titles, which served as a product to compete with the success of Eugene Garfield's ISI.

Mortimer Taube

Mortimer Taube, who was a librarian before becoming the chief of general reference and bibliography of the Library of Congress in 1945, started Documentation Inc in 1952. He had much closer ties to the world of library science, and to government. The company's first customer was the US military, and it provided library services to NASA when NASA was created in 1958. The underlying technology was based on "uniterms", which were presented in a paper titled "Coordinate Indexing of Scientific Fields", delivered at the Symposium on Mechanical Aids to Chemical Documentation, Division of Chemical Literature, American Chemical Society, New York, Sept. 4, 1951.

According to Heting Chu in the book "Information Representation and Retrieval in the Digital Age", Taube introduced Boolean search systems to information retrieval, in the form of coordinate indexing. (On the other hand, other sources say that Mooers was the first to propose using Boolean operations. I think the difference here is between "propose" and "convince customers to use.")

According to Mooers, in his autobiography (which means he has every right to tell it the way he wants to):

What [Taube] did was that he cooked up a simplified variety of my descriptors which he called Uniterms. He was a great salesman and a smooth talker and he charmed the librarians. He had worked as a librarian. So he set up Documentation, Inc. which made quite a commercial splash. Taube's message was that you don't have to worry about the fact that you can't understand Mooers, you do it the Uniterm way, you can understand it, and it's easy. So they flocked in his direction. Well, his methodology can be cynically characterized as follows: How do you index documents? You take a collection of documents in a certain field and you give them to somebody that is not really in that field. You sit him down with a colored pencil and ask him to go through the documents and to underline every term that he doesn't understand [laughs], and to use those underlined terms for index terms. You've heard of key terms, key words? Well, key words are the direct descendants of Mortimer Taube's Uniterms and have the same sort of loose-jointed semi-applicability to the field at hand.
If Mooers thinks that's "loose-jointed" then imagine what despair he might have had with folksonomy.

Chemistry and documentation

Did you notice how Mooers, Luhn, and Taube all had ties with the ACS?

I had always wondered why the "Journal of Chemical Documentation" had that name. I have a better understanding now. The American Documentation Institute started in 1937 with money from the Science Service of the National Academy of Sciences. In the post-war era, the number and rate of scientific publications grew enormously, and especially the number of technical reports. I recall from Eugene Garfield's essays that Chemical Abstracts at the time was years behind indexing the literature. This presented a market opportunity for his ISI (Institute for Scientific Information), which used computers to index the most popular chemical journals.

With funding from the Carnegie Corporation, the ADI started the journal American Documentation. This seemed to be the focal point for a lot of papers in the new field.

The ADI at this time consisted mostly of people with scientific and technical backgrounds. This caused some animosity, as many of the newcomers believed that specific library training "was outdated or unnecessary", while others believed, as the ADI link quotes from elsewhere, "documentation was librarianship performed by amateurs." Also around this time, the American Chemical Society Division of Chemical Literature group started, as one of many more specialist groups. There was a large overlap in the readership between these various organizations.

Then the Soviets launched Sputnik in 1957. The US and other western governments started pouring money into science and technology. Quoting Mike Lynch:

There were great stirrings in science information at that time because of Sputnik, the challenge to the United States from the Soviet Union in October 1957. Sputnik's beep-beep tones took the world totally by surprise. When the dust had settled, it became apparent that the Soviets had published their intentions in the open literature, but the science information system in the West was in disarray. The system had not been considered sufficiently important nor was it well enough funded to keep up with the vast increases in the numbers of scientists employed and publishing in the postwar period. There was said to be a cocktail called Sputnik, one part vodka and three parts sour grapes.

The Journal of Chemical Documentation started shortly after Sputnik, in 1961. My guess, though I've not read any of the American Documentation articles, is that the topics in each field became too specialized to have a single, wide-ranging journal. Quoting the ADI link further, by the end of the 1950s, the 75 papers at the ICSI conference were a prominent display of the multi-faceted nature of the world of documentation and the rich research potential of the field, but they also showed the field lacked a clear synthesis and direction. That's a good reason to start a specialist journal.

(As an example of continued work in superimposed coding, published in that journal, Ronald Kline comments that:

[I]n the mid-1960s, academic researchers applied the mathematics of information theory as heavily as Mooers had done. In 1967 Pranas Zunde and Vladimir Slamecka ... used Shannon's entropy equations to calculate an optimal frequency distribution of descriptors by the number of postings that maximized the use of the index.
where the reference is Zunde, P., & Slamecka, V. (1967). Distribution of indexing terms for maximum efficiency of information transmission. American Documentation, 18, 104-108. Yet another paper for me to follow up on!)

I also guess that the money going into science research meant more money going towards developing computers, which in turn meant more drug companies could have a computer, so there were enough people interested in a narrow topic to make the journal viable. Also, the ADI link points out that [b]y the early 1960s the term documentation was beginning to sound old-fashioned and inadequate in this new world of computers ... The ADI Council first considered a name change in 1963 but the official decision to change the name to American Society for Information Science (ASIS) was not made until 1968. J. Chem. Doc. followed the trend and became the Journal of Chemical Information and Computer Sciences in 1975, only to change its name again in January 2005 to the Journal of Chemical Information and Modeling.

That said, I still don't understand how the ACS, as compared to any other field, was so tightly coupled to these three key figures of information science. Anyone know?

Copyright, patents, and trademarks

I get the feeling that Mooers wanted to follow the ideal of an American inventor: a thinker who comes up with ideas, patents them, and makes money by licensing the right to use the patents.

For example, he tried patenting his Zatocoding system. According to his oral history, it took 23 years for the USPTO to grant the patent, which was past the time it was commercially viable.

I double-checked. The granted patent is "Battery controlled machine", US 3521034 A. (I linked to Google instead of the authoritative USPTO because the latter only has images in a hard-to-use interface, while Google has an OCR'ed copy on a single page.) Note the text: This is a continuation-in-part of application Ser. No. 392,444, filed Nov. 16, 1953, which was a continuation-in-part of application Ser. No. 774,620, filed Sept. 17, 1947.

In his oral history he comments that he:

... was becoming more and more critical of what I could do in the library field. That is, by 1960 there were now computers and "operators" like Herb[ert R. J.] Grosch at General Electric (GE) were moving in, and being the big boss of the computer at a company and were going around looking for business. And the library field was beginning to wake up to the fact [that] there might be something here. You don't take your business to a little hole in the wall like Mooers was operating. You take it to GE or you take it to MIT. There was an "operator" – Overhage at MIT – who set up a big project, INTREX, to solve all the problems for all time of libraries with computers at MIT. Herb Grosch was taking contracts at GE. This was the situation. The result of all of this was that in the mid 1960s, I more or less turned off my public interest to the information and library field, although I kept following it to some extent in private, and turned on my interests in programming languages and TRAC.
It's indeed hard for a lone inventor to compete against someone with close library and government connections (Taube) or big business connections (Luhn). That's where the limited monopolies of copyright and patent might help, but at the time no one, including lawyers, thought they could be used for software. Instead, he filed for a trademark on the name "TRAC", to the disgust of many. Quoting from Mooers:
The first issue of Dr. Dobb's Journal, one of the early publications in the personal computer field, has a vitriolic editorial against Mooers and his rapacity in trying to charge people for his computing language.
Many people still have a fondness for TRAC, because Ted Nelson talks about TRAC in his widely-read Computer Lib. Here's one of the RESISTORS describing the fondness, along with a re-implementation of TRAC. (Which would have annoyed Mooers to no end, had he still been alive.)

Show me the documentation!

For someone so interested in document retrieval, and whose ideas seem to have inspired much of core cheminformatics, it's surprisingly hard to read his important papers. The most critical ones (for my interests) were published basically as white papers from his company. They aren't in WorldCat, or accessible through the various Google search engines. These are:

as well as Information retrieval on structural content, pp. 121-134 from Information theory; papers read at a symposium on information theory held at the Royal Institution, London, September 12th to 16th, 1955 (Academic Press, New York, 1956). (It appears that the National Library of Sweden has a copy of this book, so I didn't put it on the list.)

I'll even go so far as to say that any paper published in the last 20 years, which referenced one of those first three citations, is likely making the reference second-hand through other papers, and not from actually reading the original paper. The only possibility I've found to get copies of the papers is to contact the curators at the Charles Babbage Institute. They have 39 boxes of his papers, including those three. I'm waiting for a reply to my email asking how to get a copy.

Zatopleg in the patent literature

Meanwhile, I have some sideways views through the patent records. These are:

US 4118788: Associative information retrieval (1978):

One prior art technique, originated in the 1940's, that is designed to permit associative retrieval in mechanical type systems rather than in conjunction with computers, is sometimes referred to as "Zatocoding". A complete description of a Zatocoding system, including some of the background mathematics, is contained in British Pat. No. 681,902 issued to Calvin Mooers on Oct. 29, 1952.

... While many of the features of the Zatocoding system, including the theory of superimposed coding, may be quite valuable in enabling associative retrieval, it nevertheless remains that the technique was generally oriented toward manual type storage systems and was never expanded so as to be useful in the environment of modern day computers.
(Thanks to an anonymous commenter for the link to the British patent. Follow the link to "Original Document" on the left column to see the text.)

More importantly for my interests, the patent literature is the only Internet-accessible source I've found which describes the Zatopleg system, in US3476311: Two-dimensional structure encoding (1969):

According to the Zatopleg system, a random number is attached to each atom, which number cannot be assigned more than once within a molecule. This would be followed by lists showing which other atoms each atom is linked to. Therefore a Zatopleg code for one atom of a molecule would consist of: first, an arbitrary number assigned to the atom within the molecule; second, an identification number for the kind of atom (e.g. its atomic number); and, third, the numbers of the atoms to which it is connected.
So close, but frustratingly incomplete.
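Still, that description is enough to sketch the shape of the records in Python. This follows only the three parts listed in the patent's summary, and everything else here is my own assumption: the names AtomRecord and encode, the range the arbitrary numbers are drawn from, and the input format. The summary says nothing about bond orders, so the sketch leaves them out rather than guess.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class AtomRecord:
    label: int            # arbitrary number, used at most once per molecule
    atomic_number: int    # identifies the kind of atom
    neighbors: tuple      # labels of the atoms this atom is connected to


def encode(atomic_numbers, bonds, rng=None):
    """Build one record per atom from atomic numbers and (i, j) index pairs."""
    if rng is None:
        rng = random.Random()
    # sample() draws without replacement, so no label repeats in a molecule.
    labels = rng.sample(range(1, 1000), len(atomic_numbers))
    neighbors = {i: [] for i in range(len(atomic_numbers))}
    for i, j in bonds:
        neighbors[i].append(labels[j])
        neighbors[j].append(labels[i])
    return [
        AtomRecord(labels[i], atomic_numbers[i], tuple(sorted(neighbors[i])))
        for i in range(len(atomic_numbers))
    ]


# Formaldehyde: a carbon (6) connected to an oxygen (8) and two hydrogens (1).
records = encode([6, 8, 1, 1], [(0, 1), (0, 2), (0, 3)], rng=random.Random(0))
for record in records:
    print(record)
```

Because the labels are arbitrary, two people encoding the same molecule will almost certainly produce different records, which is why a unique cipher was the natural next problem.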

Eugene Garfield's tribute

Eugene Garfield wrote a tribute to Mooers in The Scientist, Vol. 11(4), March 17, 1997. If you've made it this far through my text, you'll be able to understand nearly all of the context of that tribute; something I couldn't have done two weeks ago. (He uses the term "hashcoding" for zatocoding; I'll need to follow up on that as well.)

Garfield pointed out one last difficulty that Mooers had as an independent, for-profit researcher:

I remember resenting the fact that he was "selling" us on a commercial, for-profit product, which I inherently mistrusted. I hadn't yet overcome the idealistic notion that all good things were nonprofit, which was probably a reflection of youthful naïveté. However, my later experiences with large government agencies and nonprofit institutions changed that view. When I had to survive as a private consultant and for-profit entrepreneur, as Calvin did, I appreciated how difficult and frustrating it was to compete with the arrogance and market advantages of these nonprofit establishments.

That's something that I, an independent, for-profit researcher, should bear in mind.

Comments?

This is incomplete research, and I don't know where I'll go with it. There are a lot of books about the history of information science, and I know so very little about the topic. I know there are still many fans of Mooers out there; to them, I hope I did a good job.

My interest is in understanding the evolution of cheminformatics systems, especially machine-based systems. If you have any details or comments to add, please do so, or send me email to dalke at dalkescientific dot com.


Andrew Dalke is an independent consultant focusing on software development for computational chemistry and biology.



Copyright © 2001-2013 Andrew Dalke Scientific AB