New Cheminformatics Projects
I've started two new open projects for cheminformatics and I'm looking for help in both of them.
Chemistry Toolkit Rosetta
The Chemistry Toolkit Rosetta (CTR) is a set of common cheminformatics tasks implemented using a variety of different toolkits and approaches. It is meant primarily as a way for people to understand and compare how the different APIs work.
Currently there are 16 tasks, 14 of which are well-defined and have at least one solution (in OpenEye/Python since that's what I know best). Several also have solutions in Pybel, and there are a couple of RDKit and CDK solutions as well.
Some of the CTR tasks are:
- Heavy atom counts from an SD file
- Working with SD tag data
- Find the 10 nearest neighbors in a data set
- Calculate TPSA
CTR needs your help. The project started in part because I don't know RDKit, CDK, or Indigo that well, to say nothing of the commercial tools available from Symyx, Accelrys, Schrodinger, and others. I know them a bit better now, but not well enough.
Feel free to contribute a solution in your toolkit of choice! You can also provide commentary or feedback, or improve an existing solution. You can even contribute a new task, if it's characteristic of a frequently encountered cheminformatics problem which several toolkits can handle.
By the way, a big thanks to Noel O'Boyle for his feedback on the project direction and for his Pybel and Cinfony contributions, which helped flesh out CTR before this public announcement.
chem-fingerprints
The other project I started is called "chem-fingerprints" or "chemfp" for short. Its goal is to develop a couple of file formats for cheminformatics fingerprints as well as tools and libraries which work with those formats.
The main problem it addresses is that there is no widely used fingerprint format, so each research group or even individual researcher ends up making a new one, as well as the tools to work with it. See the use cases for some more detailed examples.
So far I've written a proposal for a line-oriented text format called "FPS", meant to be easy to generate and parse, and have sketched out a binary format called "FPB", meant for fast loading at the expense of some preprocessing.
The FPS format is simple enough that you can likely figure out most of it from this example, taken from the specification:
    #FPS1
    #num_bits=256
    #software=RDKit/2009Q3_1
    #params=RDKit-Fingerprint/1 minPath=1 maxPath=7 fpSize=256 nBitsPerHash=4 useHs=True
    #source=/Users/dalke/databases/Compound_00000001_00025000.sdf.gz
    #date=2010-01-27T02:22:26
    fffeffbfb7fffedff7beefdbddf7ffffabff76cf6df7fcf6f7fffebf7d7ffd6f 1
    fffeffbfb7fffedff7beefdbddf7ffffabff76cf6df7fcf6f7fffebf7d7ffd6f 2
    ffffbfdfffffffffbfeffffffffffffffffffffffff77efffffffebfffffffef 3
    00c02010002610000080800041100002084000440d100000c055048801224400 4
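To give a feel for how little code a reader needs, here is a minimal Python sketch that parses an excerpt of the example above. It is not part of chemfp, and it assumes only what the example shows: header lines start with "#" and are mostly "key=value" pairs, and each fingerprint line is a hex string, whitespace, then an identifier. The function name `parse_fps` and the "format" key used for the first line are made up for this illustration; the specification remains the authority on the details.

```python
def parse_fps(lines):
    """Split FPS text into (header dict, list of (identifier, fingerprint bytes)).

    Hypothetical helper for illustration; not the chemfp API.
    """
    header = {}
    records = []
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue
        if line.startswith("#"):
            # Header lines look like "#key=value". The first line ("#FPS1")
            # has no "=", so it is stored under a made-up "format" key.
            key, sep, value = line[1:].partition("=")
            if sep:
                header[key] = value
            else:
                header["format"] = key
            continue
        # Fingerprint line: hex-encoded bits, whitespace, then the identifier.
        hex_fp, identifier = line.split(None, 1)
        records.append((identifier, bytes.fromhex(hex_fp)))
    return header, records


# Excerpt of the example from the specification (records 1 and 4 only).
sample_fps = """\
#FPS1
#num_bits=256
#software=RDKit/2009Q3_1
#params=RDKit-Fingerprint/1 minPath=1 maxPath=7 fpSize=256 nBitsPerHash=4 useHs=True
fffeffbfb7fffedff7beefdbddf7ffffabff76cf6df7fcf6f7fffebf7d7ffd6f 1
00c02010002610000080800041100002084000440d100000c055048801224400 4
""".splitlines()

header, records = parse_fps(sample_fps)
for identifier, fp in records:
    print(identifier, len(fp) * 8, "bits")
```

Only the first "=" on a header line is treated as the separator, so the "params" value keeps its own "minPath=1 ..." settings intact.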
I've developed a set of tools to generate FPS fingerprints from Open Babel, OEChem, and RDKit, as well as to extract fingerprints from SD tags, specifically the CACTVS substructure keys in PubChem. These are available from the Mercurial repository.
These tools are still in development, and for now are meant primarily as a way to get concrete feedback on the specification.
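The SD tag extraction step mentioned above can be illustrated without any toolkit. The following is a rough sketch, not the chemfp implementation: it assumes records end with a "$$$$" line, that the first line of a record is its title, and that a data item is a "> <TAG>" line followed by its value. The helper names are made up, the tag name `PUBCHEM_CACTVS_SUBSKEYS` is what PubChem uses as far as I know, and the values in the toy records are placeholders rather than real fingerprints. Decoding the tag value into bits is a separate step that is not shown.

```python
def iter_sd_records(lines):
    """Yield each SD record as a list of lines, split on the "$$$$" marker."""
    record = []
    for line in lines:
        line = line.rstrip("\r\n")
        if line.strip() == "$$$$":
            yield record
            record = []
        else:
            record.append(line)


def get_tag(record, tag):
    """Return the first value line of data tag `tag`, or None if absent."""
    marker = "<" + tag + ">"
    for i, line in enumerate(record):
        # A data header line starts with ">" and names the tag in angle brackets.
        if line.startswith(">") and marker in line and i + 1 < len(record):
            return record[i + 1]
    return None


# Toy records: the molfile body is omitted and the tag values are placeholders.
sample_sdf = """\
1
  (molfile lines omitted)

> <PUBCHEM_CACTVS_SUBSKEYS>
PLACEHOLDER_VALUE_1

$$$$
2
  (molfile lines omitted)

> <PUBCHEM_CACTVS_SUBSKEYS>
PLACEHOLDER_VALUE_2

$$$$
""".splitlines()

tag_values = [
    (record[0], get_tag(record, "PUBCHEM_CACTVS_SUBSKEYS"))
    for record in iter_sd_records(sample_sdf)
]
print(tag_values)
```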
Other tools I would like to develop, perhaps with your help, are command-line programs for similarity search and substructure filters.
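As a sketch of what the core of a similarity search tool might look like, here is a small Python example. The choice of Tanimoto similarity is my assumption (it is a common measure for bit fingerprints, but nothing above fixes it), and the function names and the 0.5 threshold are made up for illustration. Fingerprints are plain byte strings, such as those decoded from the hex field of an FPS line.

```python
def popcount(value):
    """Count the 1 bits in a non-negative integer."""
    return bin(value).count("1")


def tanimoto(fp1, fp2):
    """Tanimoto similarity of two equal-length fingerprints given as bytes."""
    a = int.from_bytes(fp1, "big")
    b = int.from_bytes(fp2, "big")
    union = popcount(a | b)
    if union == 0:
        # Two all-zero fingerprints: report 0.0 rather than divide by zero.
        return 0.0
    return popcount(a & b) / union


def search(query, targets, threshold):
    """Return (score, identifier) pairs scoring at least `threshold`,
    best score first. `targets` is an iterable of (identifier, bytes)."""
    hits = [(tanimoto(query, fp), identifier) for identifier, fp in targets]
    hits = [hit for hit in hits if hit[0] >= threshold]
    hits.sort(key=lambda hit: -hit[0])
    return hits


# Tiny made-up 16-bit fingerprints, just to exercise the functions.
targets = [
    ("a", bytes.fromhex("ff00")),
    ("b", bytes.fromhex("0f00")),
    ("c", bytes.fromhex("00ff")),
]
hits = search(bytes.fromhex("ff00"), targets, threshold=0.5)
print(hits)
```

A real tool would read the targets from an FPS file and would want something faster than Python integers for large data sets; the point here is only the shape of the computation.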
I'm also looking for input and feedback on the format definitions, and for people who want to add support for these formats in their tools.
If you are interested in chemfp, then sign up on the chemfp mailing list.
Andrew Dalke is an independent consultant focusing on software development for computational chemistry and biology. Need contract programming, help, or training? Contact me
Copyright © 2001-2010 Dalke Scientific Software, LLC.