Visual dataflow programming

/home/writings/diary/archive/2003/09/22/VisualProgramming

I've been thinking about Michel Sanner's ViPEr system. It's a visual dataflow system for structural biology, akin in some respects to general-purpose tools like AVS and IBM's Data Explorer, or domain-specific tools like Pipeline Pilot for chemistry. I started from a point of skepticism and it seems I'm still there.

I first got interested in dataflow systems back in college. I remember talking to one of my professors about it, and I tried sketching out on paper possible ways to make it work, but I never managed to get good control flow in the system. It turns out there are ways to do it, but they end up looking rather complicated.

I played around with IBM's Data Explorer a bit in the mid-90s. Cornell Theory Center had an add-in package for doing structure visualization. I went there for a two-day workshop with Dorina, another grad student in the group. She's not a programmer but she can write scripts. She was able to make some very nice depictions that I couldn't touch with VMD. The main reason was the support for constructive solid geometry, which is rare for structure programs but pretty common in other tools, but part of it was the ease of changing the dataflow.

(There was an interesting demo at the CRBM workshop showing that CSG really should be more common in structure visualization codes.)

Afterwards I looked for free codes for data-flow coding. The only one at the time was Khoros, then at UNM and now distributed through a company. It was designed for visualizations which can be done on an array, like 2D images and 3D fluid flows, and couldn't easily support data structures appropriate to molecules. (Michel says it's still that way.)

Data Explorer went open source around 1998/1999 and I looked at it a bit; got it to compile under IRIX and looked at the docs. I didn't have anything to test it out on so that's where I stopped.

I also read some reports of people who used dataflow systems. One striking critique was the difficulty in scaling. A small system is easy to understand. When there are only a few nodes - like those built during demos - it's easy to see what's going on. When a system gets large there are a few problems: it's hard to distinguish the transformation nodes, the connection lines dominate the canvas, and the graph simply gets too large to display on a single canvas.

There are solutions, or perhaps just workarounds, for these. Each node could be given a distinct shape, making it easier to see. Pipeline Pilot does this - it looks like they paid a decent graphical designer for their nodes. When the graph gets too large, a subgraph can be collapsed into a single node, much like a function. (But what shape is the new node? How can it be made distinctive?) This also reduces the number of lines on the canvas, though not by much: if a subgraph can be collapsed into a node then there weren't many overlaps between its internal connections and the external ones to begin with.
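Here's a minimal sketch of the "collapse a subgraph into a node" idea, written in Python. The Node and Source classes, the make_composite helper, and the example transformations are all made up for illustration; they are not taken from ViPEr, Data Explorer, or Pipeline Pilot.

```python
class Node:
    """A transformation node fed by zero or more upstream nodes."""

    def __init__(self, name, func, *inputs):
        self.name = name
        self.func = func
        self.inputs = inputs

    def evaluate(self):
        # Pull values from the upstream nodes, then apply this node's function.
        args = [node.evaluate() for node in self.inputs]
        return self.func(*args)


class Source(Node):
    """A node with no inputs which just supplies a fixed value."""

    def __init__(self, name, value):
        super().__init__(name, lambda: value)


def make_composite(name, build, *inputs):
    """Hide a whole subgraph behind one node, much like a function call.

    `build` takes placeholder nodes and returns the subgraph's output node.
    """
    def run(*values):
        placeholders = [Source(f"in{i}", v) for i, v in enumerate(values)]
        return build(*placeholders).evaluate()
    return Node(name, run, *inputs)


def normalize(data):
    # Two internal nodes which the outer canvas never has to draw.
    ordered = Node("sort", sorted, data)
    return Node("scale", lambda xs: [x * 10 for x in xs], ordered)


raw = Source("raw", [3, 1, 2])
clean = make_composite("normalize", normalize, raw)
total = Node("sum", sum, clean)
result = total.evaluate()
```

The top-level graph now shows three nodes (raw, normalize, sum) instead of four, and the sort-to-scale connection is hidden. Only one line was saved, though, and the new node still needs a name and a shape of its own.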

When the graph gets larger the layout algorithms for the connections become more complex. As I recall, Data Explorer reused algorithms from circuit layouts to try and reduce overlaps, but I still needed to tweak the results to make them easier to understand. Another solution is to use transmitter/receiver nodes to connect regions, even on different canvases, without laying out a line between them. This makes the system somewhat module-like, but only partly. Also, if the canvas is scaled out far enough to see everything at once then it's hard to make anything out, and if you've zoomed in close enough to read it then you lose track of the connections which come in from off screen.
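A rough sketch of how transmitter/receiver nodes might work, in Python. I'm assuming a single shared table of named channels; the function names and the "radii" example are hypothetical and don't describe how Data Explorer or any other product implements this.

```python
# Shared table mapping a channel name to the node that feeds it.
# A node here is just a zero-argument callable which returns a value.
channels = {}


def transmitter(channel, node):
    """Publish a node's output under a name instead of drawing a line."""
    if channel in channels:
        raise ValueError(f"channel {channel!r} already has a transmitter")
    channels[channel] = node


def receiver(channel):
    """Return a node which fetches the channel's value when evaluated."""
    def node():
        return channels[channel]()
    return node


# "Canvas A": produce some data and publish it.
def load_radii():
    return [1.2, 1.5, 1.7]

transmitter("radii", load_radii)

# "Canvas B": consume it, with no connection line back to canvas A.
radii_in = receiver("radii")

def doubled_radii():
    return [r * 2 for r in radii_in()]

result = doubled_radii()
```

In this sketch the channel names act a lot like global variables: the line is gone from the canvas, but the dependency is still there and now you have to find it by name. That may be one reason the result feels only partly modular.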

There are other problems too, like standardizing how certain things are arranged. You can think of this as making the transition from spaghetti code to structured code. Indeed, many of these complaints above have parallels to textual programming.

You could also argue that there are large-scale dataflow-based systems: circuits. My Dad did video broadcast engineering and could look at a circuit board and tell what the different parts did, just because of the standard way things were arranged and how they looked. He designed studio layouts by drawing nodes and connecting them with lines; originally on a sheet of paper and later using AutoCAD.

These are both suggestive, but I think that's as far as it goes. Even in the first few years of computer languages, people were able to write interesting programs. I know there's been decades of work on visual programming languages, and believe that if they were good for programming then people would be using them for some tasks; if only from sheer obstinacy.

I take that back. In looking up research in visual programming and workflows, I came across an undergrad review paper which concludes:

Despite the move toward graphical displays and interactions embodied by VPLs, a survey of the field quickly shows that it is not worthwhile to eschew text entirely. While many VPLs could represent all aspects of a program visually, such programs are generally harder to read and work with than those that use text for labels and some atomic operations. For example, although an operation like addition can be represented graphically in VIPR, doing so results in a rather dense, cluttered display. On the other hand, using text to represent such an atomic operation produces a less complicated display without losing the overall visual metaphor.

You might also be interested in some comments on the Joel on Software site.

Michel's point, though, is that people don't need to or even want to learn to program, so the inability to "represent all aspects of a program visually" isn't a problem. Why introduce something with the complexities needed for general programming?

I think that's an interesting viewpoint, but I don't know how realistic it is. There are ways to introduce people to simple programming, as with people who enter things in Excel, then write formulas, then write simple VB functions. This has the advantage that it doesn't dead-end like dataflow systems seem to do.

The dataflow user interface is also quite different from most apps people use. In a normal UI you see the controls and some indicators of how they are related, perhaps from grouping or some sort of high-level schematic. The innards are one big black box. With a dataflow system, you change parameters by editing nodes in the network. You can directly see the way those fields interact. The black box has transparent walls. Is seeing that detail really helpful? Perhaps, but I'm not convinced. (I can think of a system where the nodes can be dragged into a GUI builder, which might help.)

I also believe most people who can't program do know someone who can help configure systems, write macros, etc. This might be another graduate student, a coworker, or technical support staff. (I've been all of these. :) This doesn't mean a programmer; there are plenty of places which have a house expert who makes Excel macros but isn't a software developer. So I don't think the lack of programming skills in a given user is necessarily a problem.

But visual programming does appear tantalizing for some domains. It seems to work best when it's strongly data-flow oriented, with only a few different data types and many possible transformations, yet where only a few (a dozen or two) transformations are used at a time. Image analysis is the most obvious one, and most of the standard dataflow applications have libraries for that.
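To make that concrete, here's a toy Python sketch of the image case: one data type (a grayscale image as a list of rows), a library of image-to-image transformations, and a pipeline which uses only a few of them. The function names and the tiny 2x3 image are invented for the example.

```python
# The single data type: a grayscale image as a list of rows of 0-255 values.

def invert(img):
    return [[255 - p for p in row] for row in img]

def threshold(img, cutoff=128):
    return [[255 if p >= cutoff else 0 for p in row] for row in img]

def flip_horizontal(img):
    return [row[::-1] for row in img]

def flip_vertical(img):
    return img[::-1]


def pipeline(*steps):
    """Chain image-to-image steps; each output feeds the next input."""
    def run(img):
        for step in steps:
            img = step(img)
        return img
    return run


image = [[0, 100, 200],
         [50, 150, 250]]

# Many transformations are available, but this pipeline uses only three.
process = pipeline(invert, threshold, flip_horizontal)
result = process(image)
```

Because every step takes an image and returns an image, any output can be wired to any input, which is what makes the boxes-and-lines picture work so well here. Molecules, with atoms, bonds, residues and selections, don't reduce to a single type so neatly.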

The question, though, is the appropriateness of this paradigm for chemistry or biology. I've seen several data-flow/pipeline projects for bioinformatics, which has many similarities to the style of analyses in chemical informatics. One is Piper, from bioinformatics.org, which started in 1998 but is now discontinued.

When I was at ISMB in Edmonton a couple years ago, I saw several different companies with dataflow products for bioinformatics. I picked up the literature, and I'm leafing through them now. Let's see:

Übertool from science-factory.com - web site now says "... discontinue its operations as of August 31, 2003 due to insufficient funding and the inability to attract new investors". Too bad. I rather liked this one. It looked pretty and implemented quite a few algorithms.
Hyperthesis in gRNA from Helixense - doesn't seem to have done much since ISMB 2002, according to their web site. At the conference they said they had 23 employees then, so I'm suspecting they aren't doing well. Oh, and they used Jython underneath.

There was another, but I don't see it in my notes.

When I think about the topic some more, I realize that the style of doing analyses in chemical and bio-informatics is really not all that different from other fields, at least when expressed in a visual programming style. This suggests that if it's useful then there should be more generic libraries for this, both open-source and proprietary. I've been looking around and I can find very few. Freshmeat told me about Taverna and xFloWS, but after an hour or two all told I couldn't find much more. I really did expect to find a commercial library for Windows using COM. I also looked for academic work, but it seems to have been done in the 1980s and early 1990s, as there's very little available from the last 10 years. The newsgroups are also remarkably empty - a couple of posts a year!

I'm left with the conclusion that dataflow visual programming isn't really that effective, despite what Pipeline Pilot and Michel argue.

But I can be wrong about that. I haven't used either system directly nor seen people use Pipeline Pilot so this whole argument is based just on my general knowledge. Suppose I wanted to develop code like Pipeline Pilot, that is, a visual dataflow system for chemical informatics. It would need some way to read and manipulate chemical structures. OpenEye's OEChem is a good choice for that, as is Daylight, if some format conversion tools are available. It needs the GUI framework. Michel's code should do that, but I must say it isn't as aesthetically pleasing as PP's is. (PP looks very pretty!) It needs ways to talk to different servers, but Python code can handle that just fine. I don't think it would be a hard thing; perhaps a couple of months, depending on what you want. However, the result would not, IMO, be commercially viable.

It may be appropriate as an in-house or open source project. The first really depends on a company's needs and the second, well, I just don't have time for the second so it would require volunteers, and most programmers like programming using text, not pictures.

BTW, there are some other alternatives to highly visual programming languages. One is to use simple text. Start with the example graphic, shown at http://www.scripps.edu/~sanner/images/work/ViperIntro.jpg. I'll write code for a hypothetical Python API which produces the same result.


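# Hypothetical API: none of these functions exist as written.
# Load the structure and pick the atoms to display.
mol = load("1crn.pdb")
assign_radii(mol)
atoms = select_atoms(mol, "selection")

# Build two depictions: a molecular surface and scaled-down CPK spheres.
surface = msms(atoms, atoms["radius"])
coloring = ColorMap(atoms, atoms["x"])
cpk = CPK(atoms, atoms["radius"] * 0.66, coloring)

# Show both in a 3D viewer.
viewer = Viewer3D()
viewer.add(surface)
viewer.add(cpk)
viewer.show()

# Grab the rendered image, scale it, and apply a contour filter.
img = Filter(Scale(viewer.grabImage()), "contour")
img.show()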

Personally, I find the text easier to understand, but that's in part because I've been doing that for a couple decades. What my text version doesn't do is provide a GUI. Something along the lines of PythonCard might be a useful way to add that.

I've not used PythonCard, but the idea behind it is to make it easy to develop the sorts of GUIs people expect, and something that's easy for beginning programmers to use. And unlike ViPEr, the resulting UI looks normal.

In any case, either of these is an interesting project and something I would like to work on. If you're also interested in these ideas and can fund us to help out, contact me. We are available for both consulting and custom software development.

Andrew Dalke is an independent consultant focusing on software development for computational chemistry and biology. Need contract programming, help, or training? Contact me