/home/writings/diary/archive/2003/10/15/WLN

WLN -- History of Chemical Nomenclature

The first commercial general purpose computer was the UNIVAC, in 1951. There were other computers in the 1940s but they had very little role in the history of chemical nomenclature. What was important in the first half of the 1900s were typewriters and punch card machines.

Note: I am not a chemist nor a historian, nor have I researched this thoroughly. Don't stake your business plan on what I've written. Please let me know of any mistakes I've made or if you want to hire me in regards to that business plan.

I'm writing this essay in 2003 from my home office. I have several computers, a phone line, DSL, fax machine, printer, email, and the web. Based on the good will of strangers, I am able to search computers around the world to research more about the history of nomenclature, download some of the relevant papers and print them out, and even find photographs of the relevant people and objects (like example pages from Beilstein). It's hard for me to understand what it was like 100 years ago when telegraph, typewriters, card catalogs, and the vertical filing system were the core of the world's information system. Once I get into that mindset, it becomes much easier to understand how big an impact punch cards -- no computers yet! -- had on things.

I mentioned some of the diverging pulls behind nomenclature. A chemist describing a new compound wants a good way to name it. The name needs to be pronounceable, easy to write on paper, and based on the chemist's model of how the compound works. (E.g., using the most important functional group as the parent, or using a domain-specific name like androstane for a steroid.) A chemist reading that paper wants to understand the writer without spending a lot of time learning yet another nomenclature (unless it's useful). A chemist looking for information about a compound wants to find it in several ways: based on its structure, based on its name (systematic, trivial, commercial, or other), or based on its similarity to other compounds (structural and/or functional). The indexer wants to help chemists do those searches on paper, knowing that more paper means higher prices and more required shelf space, and so fewer customers. The inventory manager wants to make sure that when a chemist wants compound X it can be pulled from stock, even if it's named compound Y, or bought from a supplier.

The biggest tension is between the bench chemist, who wants an expressive way to describe the compound of interest, and the indexer, who wants a stable nomenclature across a wide range of domains. By comparison, the supplier and the inventory manager can bypass most of the mess by just agreeing on an arbitrary compound identifier, like the CAS#.

Machines bring in a new constraint. Typewriters are fast, produce consistent output, and by the 1920s could even be used as teleprinters. The Linotype printer ("line o' type"; *groan*) revolutionized the printing industry in the late 1800s. A skilled operator could produce a couple of lines of type in a minute, compared to the older movable type systems where a human compositor moved letters from various bins to the plate. However, these new machines had a limited character set. A human typesetter could easily pull out a β but the machine couldn't. While I expect the machines used for typesetting chemistry did have some appropriate keys, the limitations did affect a few things, like using numbers for position indicators instead of Greek letters (except that protein researchers still say α carbon).

Another way to see the effect of machine typesetting is to compare math journals over time. The old ones, where each page was laid out and even drawn in by hand, were gorgeous. They required a lot of specialized human work. As more and more math journals appeared, focusing on smaller niches, there weren't enough people with the skills to typeset everything and few could afford the cost. Instead, once phototypesetting was available, the papers and even textbooks were typed up on a regular typewriter, with the Greek and math symbols drawn in by hand. Ugly. I had to use a few of those books in college, because the author was the teacher.

It got a bit better once typewriters with replaceable print heads came out, which meant different fonts and alphabets could be used, but it still looked ugly. Donald Knuth, frustrated with the ugly quality of print available for his series The Art of Computer Programming, wrote TeX in the 1970s as a way to use computers to restore that long-lost beauty.

Sorting, selecting, and collating machines, based on punch cards, had a big impact on the fundamental idea of what could be done with a line notation. For example, suppose each card encodes the number of atoms of each element in a compound. Then this gives a machine-based way to search for all compounds with a given molecular formula. Searches aren't limited to that. Beilstein is sorted by structural characteristics (acyclic vs. cyclic, oxo- or hydroxyl- functional groups, contains selenium, etc). This is a decision tree, but there's only one way to go through the tree; what if you want everything which contains selenium and a hydroxyl functional group but don't care if it's cyclic or acyclic?

Punch cards gave another way to handle that. Suppose each of those characteristics is encoded in a card. Simply set the machine up to pick out all cards with those characteristics, feed the cards through the machine, and you've got the answer. This sort of search would have been almost impossible, or at least very expensive, using the traditional book-based techniques. Granted, the machines cost a lot, but they are general business machines and are getting cheaper, faster, and more powerful. (Where have I heard that before?)

Nowadays that sort of description is called a fingerprint, and more specifically a MACCS key-style fingerprint. I'll talk about a couple of the different fingerprint styles in the future.
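To make the card-selection idea concrete, here is a minimal sketch in Python. It is not how any real punch card machine or any real key-based fingerprint is defined; the feature names, the compound identifiers, and the helper names `punch_card` and `select` are all made up for illustration. Each "card" is an integer with one bit (one hole) per structural characteristic, and a search keeps every card where all of the requested holes are punched.

```python
# Hypothetical feature list; real key sets define their own (much longer) lists.
FEATURES = ["cyclic", "oxo", "hydroxyl", "selenium"]
BIT = {name: 1 << i for i, name in enumerate(FEATURES)}


def punch_card(*features):
    """Encode a compound's characteristics as a bitmask, one bit per feature."""
    card = 0
    for name in features:
        card |= BIT[name]
    return card


# A made-up deck of cards, keyed by arbitrary compound identifiers.
DECK = {
    "compound-1": punch_card("cyclic", "hydroxyl", "selenium"),
    "compound-2": punch_card("hydroxyl", "selenium"),
    "compound-3": punch_card("cyclic", "oxo"),
    "compound-4": punch_card("selenium"),
}


def select(deck, *required):
    """Return the ids of all cards that have every required feature punched."""
    mask = punch_card(*required)
    return sorted(cid for cid, card in deck.items() if card & mask == mask)


# Selenium and hydroxyl, without caring whether the compound is cyclic.
print(select(DECK, "selenium", "hydroxyl"))
```

Note that the query never mentions "cyclic" at all, which is exactly what the fixed decision tree of a printed index can't offer.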

During World War II, Wiswesser worked on the problem of coming up with a chemical nomenclature which could be expressed on the information processing machines of the time. That meant limitations like 80 characters per punch card and a character set of only the upper case letters, the digits 0-9 (remember using "l" for 1?), and a few special characters like "&". The result was called the Wiswesser Line Notation, or WLN. Here's an example of WLN, which I got from John Bradshaw's talk at MUG 2001. As you can see, there's no requirement that it be pronounceable.

compound with WLN of 'L66J BMR& DSWQ IN1&1'

WLN is a fragment-oriented description of a molecule, which is similar to how a chemist thinks of the molecule. It has a long list of predefined fragments, like L66J for the naphthalene moiety (the two fused rings in the center), and descriptions of how to attach fragments to other fragments. In this example, the "B" stands for the second atom of the naphthalene and "MR" is the phenylamino (M for the NH, R for the benzene ring). The & is a terminator, so the D for the next group means attach the SWQ (the sulfur with the three oxygens bound to it) to the fourth atom of the naphthalene. I don't know why that group doesn't have a terminator; my guess is the next character, an I, means the following group (the N) is attached to the 9th atom and the SWQ group doesn't have 9 atoms so it gets terminated automatically. The N is the nitrogen of the dimethylamino. The 1 is the methyl, which is closed with the & so the next 1 is another methyl attached to the nitrogen.
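The reading above can be sketched in a few lines of Python. This is only a toy based on my description of this one example, not a WLN parser: it assumes the string is a ring system followed by space-separated groups, each starting with a single locant letter, and the function name `split_wln` is made up. Real WLN has many more rules than this.

```python
def split_wln(wln):
    """Split a WLN string shaped like the example into (ring, substituents).

    Assumes "RING Xfragment Yfragment ..." where X and Y are one-letter
    locants. That shape is an assumption taken from the single example in
    the text, not from the WLN rules themselves.
    """
    ring, *groups = wln.split(" ")
    substituents = []
    for group in groups:
        locant, fragment = group[0], group[1:]
        # Letter position in the alphabet: B -> 2, D -> 4, I -> 9.
        position = ord(locant) - ord("A") + 1
        substituents.append((locant, position, fragment))
    return ring, substituents


ring, substituents = split_wln("L66J BMR& DSWQ IN1&1")
print(ring)
for locant, position, fragment in substituents:
    print(locant, position, fragment)
```

For the example this gives the ring L66J with the fragments MR&, SWQ, and N1&1 hanging off locants B, D, and I.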

The systematic name for this compound is 6-dimethylamino-4-phenylamino-naphthalene-2-sulphonic acid which is interesting because the numbers are different. WLN has the dimethylamino off position 9 instead of position 6. The count must have gone in an S shape, or there's a typo. Hmmmm...

WLN is a very terse way to describe a compound and to search for compounds based on chemist-defined criteria (instead of indexer-defined criteria). Combined with some cleverness, it provided new ways to deal with chemical data. For example, sorting a list of compounds by WLN gives a way to do similarity searches, since the parent fragment (which is the most important part for most chemists) is given first.
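Here is a small Python sketch of why sorting helps. Only the first string comes from the example above; the others are invented for illustration and have not been checked against the WLN rules. The function name `group_by_parent` is also made up, and it assumes the parent fragment is everything before the first space.

```python
from itertools import groupby

# Only the first entry is the example from the text; the rest are
# illustrative strings, not vetted WLN.
RECORDS = [
    "L66J BMR& DSWQ IN1&1",
    "T56 BMJ",
    "QR",
    "L66J BQ",
    "T56 BMJ D1",
    "L66J CZ",
]


def parent(wln):
    """Assume the parent fragment is the text before the first space."""
    return wln.split(" ")[0]


def group_by_parent(records):
    """Sort the notations, then collect neighbours that share a parent."""
    ordered = sorted(records)
    return {key: list(group) for key, group in groupby(ordered, key=parent)}


for key, members in group_by_parent(RECORDS).items():
    print(key, members)
```

A plain alphabetical sort is all a card sorter needs to do: because the parent comes first in the string, compounds sharing it end up next to each other in the deck.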

WLN provided a way to describe a molecule, but it didn't produce a canonical name. That is, it didn't include a set of rules which could be applied to a molecule to get the same name every time. It, like the IUPAC nomenclature, required a chemist to identify the parent, and chemists have different opinions on how that's done. These were resolved over time but Dave Weininger (who filled me in on the background of WLN and how the above example works) pointed out that the definitive reference for WLN

Smith, E. G., Baker, P. A. (eds.), The Wiswesser Line-Formula Chemical Notation (WLN), 3rd ed., Cherry Hill: Chemical Information Management Inc., 1975.
uses rules for branches and bridges like "[take] whichever way leads to the shortest final nomenclature....". This proves to require exponential lookahead, making it intractable for many classes of compounds. (Quoting Dave again: "neither a human nor a computer (nor a billion computers starting at the Big Bang) can write a unique WLN for a protein.")

Despite these problems, the new ability to do chemistry searches by machine instead of by hand proved quite useful and several pharmaceutical companies began in-house database projects based on WLN. At the same time, companies like ISI (Institute for Scientific Information) began publishing indices of the current literature, using WLN and machines to automate many of the tasks. This helped make them competitive with, and more up-to-date than, CAS (Chemical Abstracts Service), which at that time was running three years behind publications.

If I read things correctly, ISI also offered a notification service when new information came in about a given compound. In essence, a clipping service for chemistry ... or one part of the path which also contains RSS. The more things change ...


Andrew Dalke is an independent consultant focusing on software development for computational chemistry and biology. Need contract programming, help, or training? Contact me



Copyright © 2001-2013 Andrew Dalke Scientific AB