Finding the MCSes for the ChEBI ontology
This is part 3 of a series on MCS.
The industrious folks at EBI have been developing ChEBI, which expands to "Chemical Entities of Biological Interest." Quoting Wikipedia, "[ChEBI] is a database and ontology of molecular entities focused on 'small' chemical compounds, that is part of the Open Biomedical Ontologies effort."
They define several distinct ontologies. One is a chemical structure ontology. For example, the identifier CHEBI:33567 contains catecholamine, and a few examples of catecholamines are hexoprenaline (CHEBI:37950), arbutamine (CHEBI:50580), and L-isoprenaline (CHEBI:6257). In addition, catecholamine is a catechol (CHEBI:33566), which in turn is a benzenediol (CHEBI:33570), and so on. A group can have more than one parent; catecholamine is also a monoamine molecular messenger (CHEBI:25375).
The end result is a hierarchical structure. The bottom of the hierarchy consists of structures, and each intermediate node groups children that all share some common property.
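Here's a minimal sketch of that kind of hierarchy, using only the identifiers mentioned above. The function names (`invert`, `leaves_under`) are made up for illustration, and it assumes that only leaf nodes are structures, which is a simplification of the real ontology. Because a node can have more than one parent, this is a directed acyclic graph rather than a tree, so the walk tracks which nodes it has already visited.

```python
from collections import defaultdict

# child -> list of parents, taken from the examples in the text.
IS_A = {
    "CHEBI:37950": ["CHEBI:33567"],  # hexoprenaline is a catecholamine
    "CHEBI:50580": ["CHEBI:33567"],  # arbutamine is a catecholamine
    "CHEBI:6257": ["CHEBI:33567"],   # L-isoprenaline is a catecholamine
    # catecholamine has two parents: catechol and
    # monoamine molecular messenger
    "CHEBI:33567": ["CHEBI:33566", "CHEBI:25375"],
    "CHEBI:33566": ["CHEBI:33570"],  # catechol is a benzenediol
}


def invert(is_a):
    """Turn child -> parents links into parent -> children links."""
    children = defaultdict(set)
    for child, parents in is_a.items():
        for parent in parents:
            children[parent].add(child)
    return children


def leaves_under(node, children):
    """Return the IDs of every leaf (assumed: structure) below node."""
    seen = set()
    leaves = set()
    stack = [node]
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # already reached through another parent
        seen.add(current)
        kids = children.get(current, ())
        if not kids and current != node:
            leaves.add(current)
        stack.extend(kids)
    return leaves


CHILDREN = invert(IS_A)
print(sorted(leaves_under("CHEBI:33570", CHILDREN)))
```

Asking for the leaves under benzenediol (CHEBI:33570) or under monoamine molecular messenger (CHEBI:25375) reaches the same three catecholamines, each by a different route.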
Some of these common properties map directly to a common substructure. For example, CHEBI:33853 contains phenols, so every compound under that node has "one or more hydroxy groups attached to a benzene or other arene ring."
However, not all of them do. As Chepelev, Hastings, Ennis, Steinbeck, and Dumontier pointed out in "Self-organizing ontology of biochemically relevant small molecules", BMC Bioinformatics 2012, 13:3, the term "'ester' includes compounds that conform to C(=O)OC (i.e. carboxylic esters) and C(=S)OC patterns, among others."
Other cases can't even be represented as SMARTS. They give "bicyclic" as one such example.
Can I find the MCS of all structures in a node in the ontology?
I was curious to see if I could use their data set as a test of fmcs. If their intermediate nodes have a machine-readable way to tell if it's a purely substructure-based node, and if I could get the size information, then I could get all the structures underneath it, find the MCS, and compare my answer to theirs.
Alas, they don't have that annotation information. It's something they are working on, but I didn't get the impression that it's a high priority. (I don't see why it should be, either.)
Still, it's an interesting thought - what if I were to generate the MCS for all nodes, and visualize the results somehow?
It took a bit longer than I thought, but I finally downloaded their ontology (in OBO format), parsed it, extracted the hierarchy, figured out the compounds in each node, tossed out the structures that RDKit couldn't parse, and the nodes which didn't have at least two remaining structures in them.
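The parsing step can be sketched roughly as follows. This is not the code used for the analysis. The sample text is a hand-written fragment in OBO 1.2 stanza syntax, not a copy of the real ChEBI file, and `parse_terms` is a hypothetical helper name. It only pulls out the `id`, `name`, and `is_a` tags; extracting the structures and filtering them through RDKit are not shown.

```python
# Hand-written fragment in OBO stanza syntax (an assumption, not the
# actual ChEBI file contents).
SAMPLE_OBO = """\
format-version: 1.2

[Term]
id: CHEBI:33566
name: catechol
is_a: CHEBI:33570 ! benzenediol

[Term]
id: CHEBI:33567
name: catecholamine
is_a: CHEBI:33566 ! catechol
is_a: CHEBI:25375 ! monoamine molecular messenger

[Typedef]
id: has_part
name: has part
"""


def parse_terms(text):
    """Yield one dict per [Term] stanza with 'id', 'name', 'is_a'."""
    term = None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("["):
            # A new stanza starts; emit the previous term, if any.
            if term is not None:
                yield term
            if line == "[Term]":
                term = {"id": None, "name": None, "is_a": []}
            else:
                term = None  # skip [Typedef] and other stanza types
        elif term is not None and ": " in line:
            tag, value = line.split(": ", 1)
            # Drop the trailing "! human-readable name" comment.
            value = value.split(" ! ", 1)[0].strip()
            if tag == "is_a":
                term["is_a"].append(value)
            elif tag in ("id", "name"):
                term[tag] = value
    if term is not None:
        yield term


for term in parse_terms(SAMPLE_OBO):
    print(term["id"], term["name"], term["is_a"])
```

Each term can list several `is_a` lines, which is how the multiple-parent cases show up in the file.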
Once that was done, I let my MCS algorithm loose on it. It took about 50 minutes to process. (Well, I had a 15 second timeout on the MCS. I've found that 15 seconds is usually good enough.)
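The timeout idea can be sketched generically. This is not the fmcs implementation; `best_within` and the toy scoring are made up for illustration, and it assumes the search is able to hand back the best answer found so far when time runs out.

```python
import time


def best_within(candidates, score, timeout):
    """Return (best_candidate, completed) within a time budget.

    'completed' is False if the deadline passed before every
    candidate was scored; 'best_candidate' is then only the best
    one seen so far (or None if nothing was scored).
    """
    deadline = time.monotonic() + timeout
    best = None
    best_score = float("-inf")
    for candidate in candidates:
        if time.monotonic() > deadline:
            return best, False  # out of time: report a partial answer
        current = score(candidate)
        if current > best_score:
            best, best_score = candidate, current
    return best, True


# Toy use: pick the longest string, with a 15 second budget.
result, completed = best_within(["C", "CC", "CCO"], len, timeout=15.0)
print(result, completed)
```

The returned flag matters when reading the results: an answer from a search that timed out is a lower bound on the common substructure, not necessarily the maximum.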
Oooh! More pictures!
Here's a snapshot of one of the successful cases, CHEBI:16648, which is dialkyl phosphate:
Most of the results aren't as clear-cut. For example, CHEBI:16389 contains the ubiquinones. I found the MCS:
which is nearly right, but the Wikipedia page for Coenzyme_Q10 ("Coenzyme Q10, also known as ubiquinone, ...") shows a methyl attached to the top-most oxygen in this SMARTS depiction. This is because CHEBI:18238 is a structure in the set which does not have that methyl attached!
Is this methyl important? I don't know. I'm not a chemist, and this requires expertise I simply don't have.
An oopsie in the oxolanes?
What I do know is that there's a mistake in the oxolanes, CHEBI:26912. Wikipedia calls this tetrahydrofuran and says it's a 5-membered ring with the formula (CH2)4O. I would write it as the SMILES/SMARTS "O1CCCC1".
However, my search finds only "OCCCC"; it doesn't find the cycle. There shouldn't be a problem with this one, so I investigated further, wondering if it was a bug. It turned out that acetylblasticidin S (CHEBI:2413) is considered an oxolane. A quick look at the structure, though, shows that it has no 5-membered ring.
I think that's an annotation error. BTW, I do not envy the job of annotator. There's a lot of data to review, and people like me end up pointing out the mistakes rather than acknowledging the huge amount of work that goes into getting all the other parts right.
Even more pictures... most of the ChEBI ontology!
Do you want to see the output of my full analysis? Do you have a lot of memory on your computer? If so, download fmcs_chebi.html.bz2. It's only 7.5 MB but it bzip2 uncompresses to 166 MB. Open fmcs_chebi.html in your browser, and have fun! (Note: I'll probably delete it after a month or so.)
BTW: the images are computed on-demand using servers from the University of Hamburg and from Daylight. I didn't want to show everything at once since that would put a huge demand on those servers. Instead, you'll need to press the "Toggle images" button in order to see the SMARTS and the graphical depiction of the matches.
If you have comments or questions, leave them here.
Andrew Dalke is an independent consultant focusing on software development for computational chemistry and biology. Need contract programming, help, or training? Contact me
Copyright © 2001-2010 Dalke Scientific Software, LLC.