At the CRBM conference I showed the basic graph-like API that I and
Brian Kelley (and others) found useful for molecular data structures.
It says that atoms have bonds and bonds have atoms, and the molecule
has a list of all atoms and all bonds. Iddo Friedberg asked why we
chose that over a connection matrix. I don't think we answered it
fully, so here's part of my follow-up....
connection matrix vs. node with edges for molecules
I wanted to follow up on your comments about using a connection matrix for bond information. Both Brian and I were against it, but weren't that clear in explaining why. My comment was that the matrix itself would take up too much space, which isn't correct since it can be stored as a sparse matrix, as you pointed out.
Brian gave a different reason - that an atom with a list of ordered bonds - helps with chirality, but didn't full explain why. I thought at first I understood his reason, but now I'm less certain. Reading the SMILES docs at http://www.daylight.com/dayhtml/doc/theory/theory.smiles.html#RTFToC24
SMILES uses a very general type of chirality specification based on local chirality. Instead of using a rule-based numbering scheme to order neighbor atoms of a chiral center, orientations are based on the order in which neighbors occur in the SMILES string.This means Daylight doesn't believe the actual order of bonds is important, only the order of atoms in the SMILES string (or more generally, the insertion order of the atom into the molecule).
If that's the case, and Brian has more experience in dealing with chirality so he'll make the final call on that, then that isn't a strong reason against using a sparse matrix.
However, the data model we use is
- a molecule has a list of atoms
- an atom has a list of bonds / "other" atoms
API-wise, all that's missing in what we propose compared to a matrix API is the [i][j] lookup, that is, "given an atom, return the bond to another atom, or None/KeyError if there isn't one." Easy enough to add, but something I haven't needed often.
In the cases where I do need a connection matrix, I just create one from the graph. In that way I don't have to worry about changing the matrix when I add/delete atoms as it only reflects the connectivity which existed when I made the matrix. I do think maintaining a proper, generic sparse distance matrix in the face of edits is the major reason to use a representation which more closely fits the graph-like data model.
The chirality argument says that you may want to change the order of the bonds for a given atom. With a connection matrix, the bond order is invariant and it depends on the order of atoms in the matrix. If I want the bonds for a given atom to be in a different order (which is effectively never, and so why I consider this a weak argument) then I'll need a rearrangement vector which tells me how to get the atoms in the correct order. A performance hit and cumbersome code.
(If you do change the bond ordering then the library may need to automatically reflect those changes so the chirality stays correct. The toolkits I know, Daylight and OEChem, don't let users change the bond ordering - and it's not a feature I've needed - so I don't know the expected behaviour for this case.)
Andrew Dalke is an independent consultant focusing on software development for computational chemistry and biology. Need contract programming, help, or training? Contact me
Copyright © 2001-2010 Dalke Scientific Software, LLC.