/home/writings/diary/archive/2003/10/08/naming_novel_molecules

Naming novel molecules

There you are, having fun with your NMR machine, determining structures left and right, using CAS to get a compound id, and seeing what the world knows about it. You've set up a local database around the CAS# which tracks how you made a compound and any interesting information about it.

One day though, you're playing around with your plasma vapor deposition equipment (remember, you're a physicist; chemical reagents are icky and dangerous) and your mass spec gets a new signal. You isolate it, get the structure, and, ... well look at that, CAS doesn't know about it. You've created a previously unknown molecule! (Or previously unpublished. For all you know, some big pharma could apply for a patent on it tomorrow.)

Now you've got a problem. How do you stick it in your database? It uses the CAS# as the primary key, and that's the flaw. If you had taken the database class back in school instead of field theory, you would have learned that your primary keys shouldn't depend on someone else. (Well, unless you have some very good certainty about its appropriateness, and even then you should be wary.)

A solution is to provide your own naming service. You could use CAS for the compounds they've identified and your own unique identifiers for those not in CAS, but someday CAS might register one of your compounds and then it would have two names. So the best approach is to give each unique compound your own identifier, and include an optional CAS# for those which are in CAS.
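Here is a minimal sketch of that layout, assuming an SQLite database and made-up table, column, and function names (`compound`, `local_id`, `cas_number`, `register`, `assign_cas`); none of these come from a real registration system. The point is only that the primary key is yours and the CAS# is a nullable extra.

```python
import sqlite3

def make_registry(conn):
    # Our own identifier is the primary key. The CAS number is optional:
    # it stays NULL for compounds CAS does not know about (yet).
    conn.execute(
        """
        CREATE TABLE compound (
            local_id   INTEGER PRIMARY KEY AUTOINCREMENT,
            cas_number TEXT UNIQUE,
            notes      TEXT
        )
        """
    )

def register(conn, notes, cas_number=None):
    """Add a compound and return the locally assigned identifier."""
    cur = conn.execute(
        "INSERT INTO compound (cas_number, notes) VALUES (?, ?)",
        (cas_number, notes),
    )
    return cur.lastrowid

def assign_cas(conn, local_id, cas_number):
    """Record a CAS number later, without touching the primary key."""
    conn.execute(
        "UPDATE compound SET cas_number = ? WHERE local_id = ?",
        (cas_number, local_id),
    )

conn = sqlite3.connect(":memory:")
make_registry(conn)
known = register(conn, "benzene, off the shelf", cas_number="71-43-2")
novel = register(conn, "new peak from the deposition run")
```

If CAS registers the novel compound someday, `assign_cas(conn, novel, ...)` fills in the number and nothing that refers to `local_id` has to change.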

(There are well-known problems you should be aware of in your quest to put molecular information in a database. The structure could have been misidentified, so there needs to be a way to handle corrections. The compound may be in one of several tautomeric forms, or described in one of a couple of different ways for handling stereochemistry. You can rediscover them from scratch if you want, or pay experienced people to help you out. The process of getting information about a compound into a standard form for database entry is called registration.)

(You might think you could just publish the compound and get a CAS# for it, but I think you need to characterize more about the compound than just the structure. Even if that's not the case, the combinatorial chemists add another complication. They start with a core template and have ways to stick almost any side group off one or more of the atoms in that core. They can easily make any one of an essentially unlimited number of compounds. CAS won't give them an infinite number of identifiers, so how do you ask them to make a specific compound for you? And of course if you're trying to make the next blockbuster drug, you don't want to publish it until after you've applied for a patent.)

You fixed that problem in your database. All your compounds have a unique name you assigned to them. You had to implement your own graph isomorphism search program to ensure that the new compounds were unique, but that was a fun bit of programming. Then one evening you're out salsa dancing and meet a physical chemist who is studying the chemistry of plasma vapor deposition. You're curious to know if she knows about the novel compound you made, so she asks you to email her the structure. (Score! You now know her email address. BTW, since you're a physicist you're almost certainly a male.)
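For the curious, the uniqueness check mentioned above can be sketched as a brute-force graph isomorphism test. This is not a real registration system: it assumes a toy representation (a list of element symbols plus a dict mapping atom-index pairs to bond orders), leaves hydrogens implicit, ignores charges and stereochemistry, and tries every atom relabeling, so it is only usable for very small molecules. The name `is_isomorphic` is made up for the example.

```python
from itertools import permutations

def is_isomorphic(atoms_a, bonds_a, atoms_b, bonds_b):
    """Return True if some relabeling of A's atoms turns A into B."""
    # Cheap rejections first: same elements, same number of bonds.
    if sorted(atoms_a) != sorted(atoms_b) or len(bonds_a) != len(bonds_b):
        return False
    n = len(atoms_a)
    # Bonds are undirected, so compare them as unordered pairs.
    target = {frozenset(pair): order for pair, order in bonds_b.items()}
    for perm in permutations(range(n)):
        # perm[i] is the atom of B that atom i of A is mapped onto.
        if any(atoms_a[i] != atoms_b[perm[i]] for i in range(n)):
            continue
        mapped = {
            frozenset((perm[i], perm[j])): order
            for (i, j), order in bonds_a.items()
        }
        if mapped == target:
            return True
    return False

# Heavy atoms only. Two atom orderings of ethanol, plus dimethyl ether.
ethanol_1 = (["C", "C", "O"], {(0, 1): 1, (1, 2): 1})
ethanol_2 = (["O", "C", "C"], {(0, 1): 1, (1, 2): 1})
ether = (["C", "O", "C"], {(0, 1): 1, (1, 2): 1})

same = is_isomorphic(*ethanol_1, *ethanol_2)   # same compound, different order
different = is_isomorphic(*ethanol_1, *ether)  # same formula, different graph
```

Checking a new compound against the database then means running this test against every stored structure with the same formula, which is exactly the kind of cost that makes a single canonical name attractive.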

What do you send her? There's no CAS#, and she doesn't know about your identifiers. You could send the IUPAC name, unless it's one of those compounds which can't be named under IUPAC rules. Or you could send the chemical graph, either as an image (using the visual depiction language of chemists) or as some standard graph data structure (listing all the atoms and bonds and their types, charges, chirality, etc.). The last of these is the most common because it always works and it means the receiver doesn't need to sketch the structure back into the computer. The most popular of the connection table ("CT") formats is the SD file or molfile, from MDL. You register with their free download service, get the file format definition, and send her your compound's structure.

This works, but something feels wrong. The molecular graph unambiguously describes a compound. Why can't it be used as a unique identifier in its own right? Well, besides that the order of atoms and bonds is arbitrary. And besides that MDL's connection table is verbose and takes about 60 bytes per atom and 15 bytes per bond. (That can easily be shrunk; their CT stores coordinates, which aren't needed for a graph.)
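Both complaints are easy to see in a small sketch. The serialization below is a made-up toy format, not MDL's connection table, and the function names are invented for the example; the byte counts are just the rough per-atom and per-bond figures quoted above.

```python
def naive_word(atoms, bonds):
    """Write the graph out in whatever atom order it happens to be stored in."""
    atom_part = ".".join(atoms)
    bond_part = ";".join(
        f"{i}-{j}-{order}" for (i, j), order in sorted(bonds.items())
    )
    return atom_part + "/" + bond_part

def approx_ct_bytes(n_atoms, n_bonds):
    """Rough connection table size: ~60 bytes per atom, ~15 per bond."""
    return 60 * n_atoms + 15 * n_bonds

# The same compound (ethanol, heavy atoms only) entered in two atom orders.
word_1 = naive_word(["C", "C", "O"], {(0, 1): 1, (1, 2): 1})
word_2 = naive_word(["O", "C", "C"], {(0, 1): 1, (1, 2): 1})

# word_1 != word_2 even though both describe the same molecule, so the
# raw graph dump cannot serve as a database key.
size = approx_ct_bytes(3, 2)
```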

If only there were some way to represent any chemical graph as a single "word" such that 1) it could be stored on a line in a file and easily imported into a cell of Excel, 2) all isomorphic graphs are mapped to the same word, 3) the word is unique, so that no non-isomorphic graphs are mapped to the same word. Which takes you back to what you wanted originally -- a unique, unambiguous name for every compound.
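To show that the three requirements are at least satisfiable, here is a deliberately naive sketch: write the graph out under every possible atom ordering and keep the alphabetically smallest result. The toy format and the name `canonical_word` are assumptions for illustration only, hydrogens are implicit, and charges and stereochemistry are ignored. The search is factorial in the number of atoms, so this is a proof of concept and not something you would register real compounds with.

```python
from itertools import permutations

def canonical_word(atoms, bonds):
    """Smallest serialization over all atom orderings of the graph."""
    n = len(atoms)
    best = None
    for perm in permutations(range(n)):
        # perm[old] is the new index given to atom `old`.
        new_atoms = [None] * n
        for old, new in enumerate(perm):
            new_atoms[new] = atoms[old]
        # Renumber the bonds and put each one in (low, high) order.
        new_bonds = sorted(
            (min(perm[i], perm[j]), max(perm[i], perm[j]), order)
            for (i, j), order in bonds.items()
        )
        word = ".".join(new_atoms) + "/" + ";".join(
            f"{a}-{b}-{order}" for a, b, order in new_bonds
        )
        if best is None or word < best:
            best = word
    return best

# Two atom orderings of ethanol, plus dimethyl ether (heavy atoms only).
w_ethanol_1 = canonical_word(["C", "C", "O"], {(0, 1): 1, (1, 2): 1})
w_ethanol_2 = canonical_word(["O", "C", "C"], {(0, 1): 1, (1, 2): 1})
w_ether = canonical_word(["C", "O", "C"], {(0, 1): 1, (1, 2): 1})
```

The word contains no whitespace, so it fits on one line or in one spreadsheet cell. Isomorphic graphs produce the same set of candidate strings and therefore the same minimum, and because the word spells out every atom and bond, two graphs with the same word must be the same graph. Both ethanol inputs give one word, and dimethyl ether gives a different one.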


Andrew Dalke is an independent consultant focusing on software development for computational chemistry and biology. Need contract programming, help, or training? Contact me



Copyright © 2001-2020 Andrew Dalke Scientific AB