Dalke Scientific Software: More science. Less time.
/home/writings/diary/archive/2003/10/07/naming_molecules

Naming molecules

Suppose you are a physicist. After some analysis in your home-built NMR machine you've figured out the active ingredient in your vodka has the following chemical structure:

If you really had no chemistry training at all you probably wouldn't even include the bonds. A bond is a way of representing electron density, which can be computed knowing the atoms' positions and applying some quantum mechanics and computer power. And if you want to show off, toss in the phrase "Born-Oppenheimer approximation" so you don't have to worry about treating each nucleus as anything other than a fixed point.

Odds are you probably had a chemistry class in high school so you know about drawing the structure with bonds. Now you want to find out more about it. But how? Image search is still very immature and there are many ways to depict the graph, so that's not going to work.

One way is to look for the molecular formula, which is the count of each type of atom. This structure is C₂H₆O but it's hard to input subscripts into a web form so try "C2H6O". The first hit is for the "c2h6o -- happy hour" mailing list at Georgia Tech, which suggests people already know about this compound. But it doesn't give you much clue as to what it is.

The next hit is for dimethyl ether which has the same molecular formula but looks like

There you see the problem. The molecular formula isn't unique. You really would like a compound to have one and only one name, and for a name to refer to only one molecule. After thinking about it some more you realize that the molecular formula itself could be written several ways, like H6C2O (lightest element first) or OC2H6 (heaviest element first). There are six possible permutations for three element symbols.
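To see where the six comes from, here is a minimal sketch that writes the same atom counts in every possible element order (3! = 6). The `write_formula` helper and the `counts` dictionary are names made up for this illustration, not part of any chemistry toolkit.

```python
from itertools import permutations

# Atom counts for the structure in the text: 2 carbons, 6 hydrogens, 1 oxygen.
counts = {"C": 2, "H": 6, "O": 1}

def write_formula(order, counts):
    """Join element symbols in the given order, omitting a count of 1."""
    parts = []
    for element in order:
        n = counts[element]
        parts.append(element if n == 1 else f"{element}{n}")
    return "".join(parts)

# One string per ordering of the three element symbols.
orderings = [write_formula(order, counts) for order in permutations(counts)]
print(orderings)
# ['C2H6O', 'C2OH6', 'H6C2O', 'H6OC2', 'OC2H6', 'OH6C2']
```

All six strings describe exactly the same atom counts, so a text search for one of them will miss pages that use any of the other five.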

Searching for the first alternative you come across a lecture slide which says "H6C2O could correspond to both Ethanol (H3CH2COH) and dimethyl ether (H3COCH3)". Ahh! A clue! Maybe this is called ethanol. But it's kinda worrying to see the formula written as H3CH2COH, which is different than the six permutations listed above.

Further searching finds links to sites promoting the commercial use of ethanol, but not until the sixth link do you find some useful chemical information and verification that you've got the right structure. But it is still disconcerting that they use the formula CH3CH2OH which is yet another possibility.

What are you going to do the next time you want to find information about a molecule? It seems these things have names, so you look into that some more and find out that the International Union of Pure and Applied Chemistry (IUPAC to its friends and enemies alike) has a huge amount of documentation related to nomenclature. Using their rules gives a way to assign a unique name to a molecule.

And look, that page says ethanol is written C2H5OH. *sigh*.
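All of those spellings do at least agree on the atom counts, which a few lines of code can check. The sketch below is only an illustration under some assumptions: it handles flat formulas with no parentheses or charges, the helper names `atom_counts` and `hill_formula` are invented here, and the output order follows the Hill convention (carbon first, hydrogen second, everything else alphabetical), which is one common choice and not something the pages above necessarily use.

```python
import re
from collections import Counter

# An element symbol (one capital, optional lowercase letter) and an optional count.
TOKEN = re.compile(r"([A-Z][a-z]?)(\d*)")

def atom_counts(formula):
    """Total the atoms in a flat formula string such as 'CH3CH2OH'."""
    counts = Counter()
    for element, digits in TOKEN.findall(formula):
        counts[element] += int(digits) if digits else 1
    return counts

def hill_formula(counts):
    """Write the counts in Hill order: C, then H, then the rest alphabetically.

    Without carbon, every element (hydrogen included) is sorted alphabetically.
    """
    if "C" in counts:
        order = ["C"] + (["H"] if "H" in counts else [])
        order += sorted(e for e in counts if e not in ("C", "H"))
    else:
        order = sorted(counts)
    return "".join(e if counts[e] == 1 else f"{e}{counts[e]}" for e in order)

# Every spelling met so far, plus the dimethyl ether one from the lecture slide.
spellings = ["C2H6O", "H6C2O", "OC2H6", "H3CH2COH", "CH3CH2OH", "C2H5OH", "H3COCH3"]
normalized = {s: hill_formula(atom_counts(s)) for s in spellings}

for spelling, formula in normalized.items():
    print(f"{spelling:10s} -> {formula}")
```

Every line comes out as C2H6O, including H3COCH3. Normalizing the spelling fixes the ordering problem, but it can't tell ethanol from dimethyl ether, because the molecular formula throws away how the atoms are connected.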

The documentation is overwhelming so in growing frustration you find an introduction to the naming of compounds, which conveniently uses ethanol as its example.

At its simplest, the IUPAC name for an organic compound contains these two parts:

The longest carbon chain is two carbons so it has the prefix "eth". There is a single bond between them (that's "single bond" as in a bond with bond order of 1, not that there's only one bond between them) so it's an "ethane". There's an OH on the end which uses the suffix "ol". Drop the "e" and join them to make "ethanol". Ta-da!
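That recipe is mechanical enough to write down as code. This is a toy sketch of just the steps described above, not a real IUPAC name generator: the `CHAIN_PREFIX` table and `simple_alcohol_name` function are hypothetical names, and it assumes a straight, singly bonded chain with one OH group. It deliberately stops at two carbons, since longer chains need a number to say which carbon carries the OH.

```python
# Prefix for the length of the longest carbon chain.
CHAIN_PREFIX = {1: "meth", 2: "eth"}

def simple_alcohol_name(n_carbons):
    """Name a one- or two-carbon chain that carries a single OH group."""
    if n_carbons not in CHAIN_PREFIX:
        raise ValueError("this sketch only handles chains of one or two carbons")
    # Only single bonds along the chain, so the parent name ends in "ane".
    alkane = CHAIN_PREFIX[n_carbons] + "ane"
    # Drop the final "e" and add the "ol" suffix for the OH group.
    return alkane[:-1] + "ol"

print(simple_alcohol_name(2))  # ethanol
print(simple_alcohol_name(1))  # methanol
```

Even this tiny case needs a lookup table of memorized prefixes, and every new functional group or branching pattern adds more rules on top.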

Upon reading that tutorial you realize there's a lot of memorization of names, and you went into physics because you preferred formulas and math over names. And because you would rather be electrocuted or irradiated instead of being around chemical containers with big warning stickers like "Danger: Bone Seeker" or "The toxicity of this substance has not yet been determined."

After digging around a bit you realize that even trained chemists have problems with names. Chemistry librarians were worth their weight in platinum in their knowledge of the arcane magic of finding the right literature references.

Good thing you've got a computer. There is software to help generate an IUPAC name. But my, the results sure look complicated, the process is opaque (to non-experts and even non-specialists in a domain) and there's the fine print that "from time-to-time" some compounds can't be named because "some classes of compounds may not yet have systematic nomenclature definitions available."

The names look complicated in part because they derive from a system originally designed to be pronounceable and to reflect the way that a chemist understands the system. The result is a name like (from the ACD/Name example on that ACD/Labs link -- it's got cool mouseovers!):

(2S,3R,6R,7S)-7-amino-3-[(1Z)-2-methylbut-1-en-1-yl]-8-oxo-5-thia-1-azabicyclo[4.2.0]octane-2-carboxylic acid
There's a mouthful for you.

That just doesn't seem elegant. Surely there must be a cleaner way to name a molecule.


Andrew Dalke is an independent consultant focusing on software development for computational chemistry and biology. Need contract programming, help, or training? Contact me



Copyright © 2001-2013 Andrew Dalke Scientific AB