Dalke Scientific Software: More science. Less time.
/home/writings/diary/archive/2010/03/16/knime_and_beginners

KNIME and beginners

I gave a presentation at OpenEye's CUP last week. More precisely, I was assigned a talk with the title "Evils of KNIME." I don't choose that sort of name, but the CUP organizers like to be a bit confrontational with presentation titles. I used my speaking slot as a platform for expressing my views on dataflow/visual languages. I don't like them, and I think they are less effective than a text language, so I explained why. Other people do like them and enjoy using them. I've asked them why, and they have some good reasons. My presentation outlined those responses along with some observations of my own, including suggestions for ways to improve the text-based toolkits so they are more accessible to "non-programmers."

The next few posts will be based on parts of that talk. Feel free to leave comments.

Upcoming training classes (pre-announcement)

I ended by pointing out that these are technological solutions. Why not spend some time training computational chemists to be more effective at writing software? I provide that sort of training. If you are interested, email me. I'm pinning down the dates for a course in Leipzig in mid-May (likely 18-20 May), and another in Boston in late July. I'll announce them when the dates are determined. If you want to influence those dates or schedule a course at your site, let me know.

Sample test case for KNIME

I haven't used KNIME for about two years. That experience was with KNIME 1.x. People told me that it's gotten better, so I decided it was well past time to take a fresh look. Last time I couldn't get it to work on my Mac. I'm happy to report that things have changed, although there are still some difficulties with updates.

My test case was the first example from the Chemistry Toolkit Rosetta, specifically, to compute the heavy atom counts from an SD file. The pybel solution is:

import pybel
 
for mol in pybel.readfile("sdf", "benzodiazepine.sdf.gz"):
    print mol.OBMol.NumHvyAtoms()

It's not as short as I would like because I had to specify "sdf" twice and because it had to reach down into the underlying OpenBabel molecule object. Still, it's a lot more succinct than using any of the base toolkits directly, and a good reference for what a text-based programming language is capable of when designed for ease of use.
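For a sense of what the toolkit is doing on my behalf, here is a rough sketch of the same count with no chemistry toolkit at all. It is only an illustration under several assumptions: every record is in V2000 molfile format, hydrogens are identified purely by the element symbol in the atom block (H, D or T), and the function names are ones I made up for this sketch. It is written for Python 3 and is not how pybel or OpenBabel compute the value.

```python
import gzip

# Assumption: D and T (deuterium, tritium) count as hydrogens, not heavy atoms.
HYDROGEN_SYMBOLS = {"H", "D", "T"}


def count_heavy_atoms(record_lines):
    """Count non-hydrogen atoms in one V2000 record (a list of text lines)."""
    # Line 4 is the counts line; its first three columns hold the atom count.
    num_atoms = int(record_lines[3][0:3])
    atom_lines = record_lines[4:4 + num_atoms]
    # In a V2000 atom line the element symbol sits in columns 32-34.
    return sum(1 for line in atom_lines
               if line[31:34].strip() not in HYDROGEN_SYMBOLS)


def heavy_atom_counts(lines):
    """Yield one heavy atom count per record; records end with a '$$$$' line."""
    record = []
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith("$$$$"):
            if len(record) >= 4:
                yield count_heavy_atoms(record)
            record = []
        else:
            record.append(line)


def print_heavy_atom_counts(filename):
    """Print the heavy atom count for each record in a gzipped SD file."""
    with gzip.open(filename, "rt") as infile:
        for count in heavy_atom_counts(infile):
            print(count)
```

Calling print_heavy_atom_counts("benzodiazepine.sdf.gz") would then play the role of the four-line pybel program, at several times the length and with none of the toolkit's error handling.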

What molecular properties can I compute? And how do I do it?

The first step was to find out if KNIME could compute the number of heavy atoms. When I say "KNIME" I mean "the CDK nodes which come with KNIME" since KNIME is a dataflow-based visual programming language with support for a number of extension packages, including chemistry nodes based on the CDK. Schrodinger, Tripos, ChemAxon and likely other companies provide nodes based on their respective toolkits, but I don't have a license to those tools. In any case the Mac version of KNIME doesn't yet support adding new nodes.

The most likely candidate was "Molecular Properties." The help says:

Create new columns holding molecular properties, computed for each structure. The computations are based on the CDK toolkit and include logP, molecular weight, number of aromatic bonds, and many others.

What other properties does it compute? I put the node on the workspace and double-clicked on it to bring up the dialog box. The result is:

The dialog cannot be opened for the following reason:
No column in spec compatible to "CDKValue".

Huh? What does that mean?

A Google search for that error message found the same question from 9 September 2009, although it concerned a different node. Bernd Wiswedel answered:

We obviously need to improve on the error messages. You need to process the output of the SD reader with the "Molecule to CDK" node, which will parse the structures into an appropriate format for the Lipinski node. Reason is that the Lipinski node is contributed from the CDK plugin, so it needs its desired input format.

What this means is the inputs need to be set up correctly before I can see more details. However, it's more complicated than that. If I set up the nodes as shown (an "SDF Reader" feeding "Molecule to CDK", which in turn feeds "Molecular Properties"), I still get the same error message when I click on the "Molecular Properties" box. Double-clicking on the "Molecule to CDK" box gives me

The dialog cannot be opened for the following reason:
No column in spec compatible to "SdfValue" "SmilesValue" "MolValue" "Mol2Value" or "CMLValue".

Turns out I need to put a valid SD filename in the "SDF Reader" box (the one with the exclamation point under it) in order to get the right inputs to "Molecule to CDK", which in turn is needed before I can open the "Molecular Properties" dialog.
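My mental model of what is going on can be written down as a toy program. To be clear, this is a guess at the behavior I observed, not KNIME's actual implementation: the Node class and its methods are hypothetical, and the only things taken from KNIME are the node names and the value type names that appear in the error messages.

```python
class Node(object):
    """Toy dataflow node: it may need a column type and it produces one."""

    def __init__(self, name, accepts, produces, upstream=None, configured=True):
        self.name = name
        self.accepts = accepts        # column types this node can read
        self.produces = produces      # column type this node outputs
        self.upstream = upstream      # the node feeding this one, if any
        self.configured = configured  # e.g. has the reader been given a file?

    def output_type(self):
        """Return the output column type, or None if it is not yet known."""
        if not self.configured:
            return None
        if self.accepts and self.input_type() not in self.accepts:
            return None
        return self.produces

    def input_type(self):
        return self.upstream.output_type() if self.upstream else None

    def dialog_error(self):
        """Return None if the dialog can open, else an error message."""
        if not self.accepts or self.input_type() in self.accepts:
            return None
        wanted = " ".join('"%s"' % t for t in self.accepts)
        return "No column in spec compatible to %s." % wanted


reader = Node("SDF Reader", (), "SdfValue", configured=False)
to_cdk = Node("Molecule to CDK",
              ("SdfValue", "SmilesValue", "MolValue", "Mol2Value", "CMLValue"),
              "CDKValue", upstream=reader)
props = Node("Molecular Properties", ("CDKValue",), "CDKValue", upstream=to_cdk)

print(props.dialog_error())   # blocked: the reader has no file yet
reader.configured = True      # stands in for entering a valid SD filename
print(props.dialog_error())   # None: the whole chain now has a known type
```

In this model nothing downstream can say what it offers until the very first node is fully configured, which matches what I saw and is exactly what makes exploring unfamiliar nodes so awkward.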

How accessible is KNIME to first-time users?

Is that really friendly for first-time users? That is, how is a first-time user supposed to: 1) know which options are available if they can't open the dialog of an unconnected node, 2) know which inputs a node requires, or for that matter see which outputs it provides, and 3) know that the output of the "SDF Reader" needs to be converted by "Molecule to CDK" before the CDK nodes can use it?

Of course all those can be explained in the documentation, and perhaps they are explained. I admit I haven't read it, but then again the knime.org documentation doesn't show how to use the CDK nodes. And should someone have to read the documentation in order to do something basic like this task? If so, are dataflow systems really any easier than working with a text-based programming language?

Can't compute the number of heavy atoms?

I looked through the list of properties which could be computed:

(BTW, it really does have mixed capitalization. Why yes, I am a nitpicker. How did you guess? ;) )

No "heavy atom count." Next option is to see if there's a way to specify the counts based on a SMARTS pattern. Nope, didn't find anything.

As far as I can tell, there's no way with the default nodes to do much of anything with KNIME. I assume there are additional packages which I can install, but why aren't there more useful CDK nodes as part of the standard installation? An obvious one to me would be a SMARTS count pattern matcher, where I could specify the SMARTS pattern, the option for unique or non-unique match counts, and the output column name.
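Here is a small sketch of the three settings I have in mind for such a node. Everything in it is hypothetical: the names are mine, the rows are plain dictionaries rather than anything from KNIME, and the actual SMARTS matching is left to a find_matches function which a real node would get from its toolkit. I assume that "unique" means two matches are the same when they cover the same set of atoms, which is the usual toolkit convention.

```python
from collections import namedtuple

# The three settings the node's dialog would ask for.
SmartsCountConfig = namedtuple("SmartsCountConfig",
                               ["smarts", "unique", "output_column"])


def count_matches(matches, unique):
    """Count matches, where each match is a tuple of atom indices."""
    if unique:
        # Matches covering the same set of atoms are counted once.
        return len(set(frozenset(match) for match in matches))
    return len(matches)


def add_count_column(rows, config, find_matches):
    """Yield a copy of each row with the match count in a new column."""
    for row in rows:
        matches = find_matches(config.smarts, row["mol"])
        new_row = dict(row)
        new_row[config.output_column] = count_matches(matches, config.unique)
        yield new_row


def fake_find_matches(smarts, mol):
    """Stand-in for a toolkit call; hard-coded for 'CC' matching ethane."""
    # The single C-C bond can be matched in either direction.
    return [(0, 1), (1, 0)]


rows = [{"name": "ethane", "mol": "CC"}]
for unique in (True, False):
    config = SmartsCountConfig("CC", unique, "cc_count")
    for row in add_count_column(rows, config, fake_find_matches):
        print(unique, row["cc_count"])
```

With a pattern like "[!#1]" the same node would give the heavy atom count directly, which is why I expected to find something like it among the default nodes.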

Is my problem because I'm on a Mac? Do Linux users get more nodes? Or is there something else I'm missing? How would you find the number of heavy atoms using KNIME? Is there a solution using the default CDK nodes or do I have to use one of the commercial toolkits?

Leave answers and comments here.


Andrew Dalke is an independent consultant focusing on software development for computational chemistry and biology. Need contract programming, help, or training? Contact me



Copyright © 2001-2013 Andrew Dalke Scientific AB