An interesting thing happened recently while I was doing work for a client. I needed to make an interface to Dragon, a program that can compute a large number of chemical descriptors. It was originally a GUI program for MS Windows, written in Borland Delphi, but there is now a command-line version for Linux using Borland's Kylix.
It's a batch oriented program. It reads configuration data from a file, including the filenames used for structure input and descriptor output. The structure input file is actually a list of filenames to the actual structure files, one filename per line, so there are two levels of indirection here. The filename '-' for stdin/stdout is not supported. I wanted to turn this into a stream oriented program so I could give it one structure at a time. Using multiple "batches" of one structure at a time caused too much overhead.
This looked like a good chance to use Unix named pipes, also known as FIFOs for "First In First Out." The mkfifo function (in the os module) creates a named pipe in the file system. This acts like a normal file to the standard open/fopen functions. One program opens the named pipe for reading and the other opens it for writing. If a read occurs when there is no data, the reading process hangs until there is data or the writing process closes the pipe. Similarly, opening the pipe for writing blocks until a reader has opened it, and a write blocks once the pipe's buffer is full.
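Here is a minimal sketch of that behavior. It uses a thread as the writer instead of a second program so that it fits in one file, and the pipe name demo.fifo is made up for the example.

```python
import os
import tempfile
import threading


def fifo_round_trip(message):
    """Send `message` through a named pipe and return what the reader saw."""
    with tempfile.TemporaryDirectory() as tmpdir:
        path = os.path.join(tmpdir, "demo.fifo")  # hypothetical name
        os.mkfifo(path)  # creates the named pipe in the file system

        def writer():
            # This open() blocks until the reader has opened the other end.
            with open(path, "w") as out:
                out.write(message)
            # Leaving the "with" block closes the pipe, so the reader sees EOF.

        thread = threading.Thread(target=writer)
        thread.start()
        # A plain open() works on the pipe just as it does on a regular file.
        with open(path) as infile:
            received = infile.read()  # returns once the writer has closed
        thread.join()
        return received
```

Calling fifo_round_trip("hello\n") hands the text from the writer to the reader without it ever being stored in a regular file.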
Turning a simple batch program into a stream program is conceptually easy. Make two named pipes, one for the structure input and the other for the descriptor output. Tell Dragon to use those pipes instead of normal files. To compute the Dragon descriptors, write the structure to a temporary file and write the filename to Dragon's input pipe then read the descriptors from the output pipe.
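The round trip can be sketched as follows. This is not the wrapper from the client project: fake_batch_program is a hypothetical stand-in that runs in a thread and reports the size of each structure file, and all the file names are invented. Dragon's real configuration file and output format are not shown.

```python
import os
import tempfile
import threading


def fake_batch_program(in_fifo, out_fifo):
    """Hypothetical stand-in: writes one output line per input filename."""
    with open(in_fifo) as names, open(out_fifo, "w") as results:
        for line in names:
            filename = line.strip()
            if not filename:
                continue  # ignore blank lines
            with open(filename) as structure:
                size = len(structure.read())
            results.write(f"{os.path.basename(filename)}\t{size}\n")
            results.flush()  # hand the result to the reader right away


def stream_structures(structures):
    """Feed structures one at a time and collect one result line for each."""
    results = []
    with tempfile.TemporaryDirectory() as tmpdir:
        in_fifo = os.path.join(tmpdir, "structures.fifo")
        out_fifo = os.path.join(tmpdir, "descriptors.fifo")
        os.mkfifo(in_fifo)
        os.mkfifo(out_fifo)
        worker = threading.Thread(target=fake_batch_program,
                                  args=(in_fifo, out_fifo))
        worker.start()
        # Both sides open the input pipe first and the output pipe second.
        # Opening them in a different order on each side would deadlock.
        with open(in_fifo, "w") as to_program, open(out_fifo) as from_program:
            for i, text in enumerate(structures):
                path = os.path.join(tmpdir, f"mol{i}.txt")
                with open(path, "w") as f:
                    f.write(text)  # the structure goes in a temporary file
                to_program.write(path + "\n")  # only its name goes in the pipe
                to_program.flush()
                results.append(from_program.readline().rstrip("\n"))
        worker.join()
    return results
```

Each call to the loop body is one "request": the caller gets the result for a structure before it sends the next one, which is the stream-oriented behavior the batch program lacks on its own.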
Tricking programs like this can be, err, tricky. Dragon (or more likely the Kylix I/O library) reads a block of text from the input and extracts lines from the block. This is identical to how Python's for line in open("filename"): works. If there isn't enough data for a block then Dragon hangs waiting for more data. As it happens, Dragon ignores blank lines so I padded the filename with about 1000 extra newlines.
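The padding trick is only a few lines. In this sketch the default block size of 1024 bytes is the size of the read Dragon was seen to make, and rounding up to an exact multiple of the block size is an assumption of the sketch. The original wrapper simply added about 1000 newlines.

```python
def pad_filename_line(filename, block_size=1024):
    """Return the filename line followed by enough blank lines to fill a block."""
    line = filename + "\n"
    used = len(line.encode("utf-8"))
    # Round up to the next multiple of block_size so the reader's
    # block-sized read is always satisfied.
    padding = -used % block_size
    return line + "\n" * padding
```

Because the reader skips blank lines, the extra newlines change nothing except how soon the block is complete.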
I figured that out using the strace command. It's a debugging tool that lets you see all of the system calls made by a program. In this case I used strace to see that Dragon was hanging trying to read 1024 bytes from the structure input file handle.
Problems caused by processes blocking for input are pretty common. Dragon, though, had one condition that was more unusual. Its output file looks like this:
    Dragon version 1
    2 2
    Name MW AMW SV
    Mol1 18.2 18.2 1.4
    Mol2 348.4 349.1 65.3
That is, the first line is a version string, the second lists the number of compounds processed and the number successfully processed. The remaining lines are tab separated columns with the third line listing the property names. I'm describing this from memory and I'm pretty sure I've made a mistake because I think the header names are really on the 4th line. Still, close enough for this essay.
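A parser for the layout as described above might look like the sketch below. It follows the from-memory description (version line, counts line, tab-separated header, then data rows), so it is only as accurate as that description. The SAMPLE text and the function name are made up for the example.

```python
SAMPLE = (
    "Dragon version 1\n"
    "2 2\n"
    "Name\tMW\tAMW\tSV\n"
    "Mol1\t18.2\t18.2\t1.4\n"
    "Mol2\t348.4\t349.1\t65.3\n"
)


def parse_descriptor_output(text):
    """Split the output into version, the two counts, and one dict per row."""
    lines = text.splitlines()
    version = lines[0]
    num_input, num_processed = (int(n) for n in lines[1].split())
    names = lines[2].split("\t")  # property names, assumed on the third line
    rows = [dict(zip(names, line.split("\t")))
            for line in lines[3:] if line.strip()]
    return version, num_input, num_processed, rows
```

The values are left as strings. If the header names really are on the fourth line, the indices 2 and 3 would each move up by one.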
For some strange reason Dragon opens and writes the header several times. That is, it opens the file, seeks to the beginning, writes the header, and closes the file, then repeats this process several times before it starts writing the data. I think it does this once per descriptor group. My program, reading from the named pipe hooked into Dragon's output, needed to ignore the multiple closes and wait until it got actual data.
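One way to ride out repeated closes is for the reader to hold its own write descriptor on the pipe, so the other side's closes never turn into an end-of-file. That is a common FIFO idiom and not necessarily what the original wrapper did. In this sketch flaky_writer is a hypothetical stand-in for the repeated header writes, and the header text is invented.

```python
import os
import tempfile
import threading

HEADER = "Name\tMW\n"  # hypothetical header line


def flaky_writer(path, rows, repeats=3):
    """Stand-in writer: reopens the pipe to write the header several times."""
    for _ in range(repeats):
        with open(path, "w") as out:
            out.write(HEADER)
    with open(path, "w") as out:
        out.writelines(rows)


def read_rows(path, expected):
    """Read `expected` data rows, skipping every repeated header."""
    read_fd = os.open(path, os.O_RDONLY)  # blocks until the writer opens
    # With our own write descriptor open, the writer's closes never
    # produce an end-of-file on the read side.
    keepalive_fd = os.open(path, os.O_WRONLY)
    rows = []
    try:
        with os.fdopen(read_fd) as infile:
            for line in infile:
                if line == HEADER:
                    continue  # one of the repeated headers
                rows.append(line)
                if len(rows) == expected:
                    break
    finally:
        os.close(keepalive_fd)
    return rows


def demo():
    """Run the writer in a thread and return the rows the reader collected."""
    with tempfile.TemporaryDirectory() as tmpdir:
        path = os.path.join(tmpdir, "descriptors.fifo")
        os.mkfifo(path)
        rows = ["Mol1\t18.2\n", "Mol2\t348.4\n"]
        thread = threading.Thread(target=flaky_writer, args=(path, rows))
        thread.start()
        result = read_rows(path, len(rows))
        thread.join()
        return result
```

The cost of this design is that the reader never sees an end-of-file at all, so it has to know from the data itself when to stop. Here it stops after a known number of rows.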
Also, if you'll look back at the example output you'll see the second line reports how many compounds were computed. Dragon can't write this line until it knows how many structures are in the input and how many can be processed. What Dragon does instead is write the output with that line omitted. After all the input has been processed it renames the output file to a temporary file then copies the temporary file back into the original filename, inserting the proper second line into the copy.
This meant Dragon renamed the named pipe then opened it for input. It was blocking on the read because there was no data written to that pipe. My wrapper didn't need the counts so when it was done with Dragon I had it rename the named pipe, replace it with a normal file containing only a few lines, and only then close the output to Dragon's input pipe. Dragon then moved the short file and inserted the count information in the copy. It's a bit of a dance but it works and is sufficiently robust.
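The swap itself can be sketched in a few lines. The function name, the ".parked" suffix, and the contents of the stub file are all hypothetical. The essay only says the replacement was a normal file with a few lines in it.

```python
import os
import stat
import tempfile


def swap_pipe_for_small_file(pipe_path, stub_lines):
    """Move the named pipe aside and leave a short regular file in its place."""
    parked = pipe_path + ".parked"  # hypothetical name for the moved pipe
    os.rename(pipe_path, parked)    # the pipe keeps working under its new name
    # The old name no longer exists, so this open() creates a regular file
    # and does not block the way opening a pipe would.
    with open(pipe_path, "w") as stub:
        stub.writelines(stub_lines)
    return parked
```

In the sequence described above, this runs after the last descriptors have been read and before the wrapper closes its end of the input pipe, so the program's final rename-and-copy step finds a tiny ordinary file to work on.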
I was worried about the wrapper's performance. My first version was very simple because it restarted Dragon once per structure; a batch size of one. Simple but very slow because the setup and startup costs are much larger than the time needed to compute the descriptors for a given compound.
I tried my wrapper code with a moderately sized data set. It was faster than the normal command-line version. That normally doesn't happen! With more research and staring at the strace output I found out the performance bottleneck was disk I/O. There was a small speedup because the output was read by another program instead of going to a file. The big part was from the dance to use a named pipe and also allow Dragon to insert the second line. It takes a good chunk of time to copy a large file (I suspect the internal Dragon code for the copy isn't that fast) so when I replace a large output file with a small file that overhead goes away.
Andrew Dalke is an independent consultant focusing on software development for computational chemistry and biology.
Copyright © 2001-2010 Dalke Scientific Software, LLC.