Diary RSS | All of Andrew's writings | Diary archive
Writings from the software side of bioinformatics and chemical informatics, with a heaping of Python thrown in for good measure. Code to taste. Best served at room temperature.Is TDD better than Apps Hungarian? #
Greg Wilson and Jorge Aranda's essay Two Solitudes describes the different viewpoints and additudes between software engineering researchers and practicioners. Here is one of the complaints:
At industry-oriented gatherings, it seems that a strong opinion, a loud voice, and a couple of pints of beer constitute "proof" of almost any claim. Take test-driven development, for example. If you ask its advocates for evidence, they'll tell you why it has to be true; if you press them, you'll be given anecdotes, and if you press harder, people will be either puzzled or hostile.Greg worked with Andy Oram to collect and edit the essays in Making Software: What Really Works, and Why We Believe It, as a way to present some of that research evidence to software practicioners. The Two Solitudes essay refers to that book, saying its "topics range from the impact of test-driven development on productivity (probably small, if there is one at all) to whether machine learning techniques can predict fault rates in software modules (yes, with the right training)."
Most working programmers simply don't know that scientists have been doing empirical studies of TDD, and of software development in general, for almost forty years. It's as if family doctors didn't know that the medical research community existed, much less what they had discovered. Once again, the first step toward bridging this gulf seemed to be to get each group to tell the other what they knew.
Why is there a debate over TDD?
Did you catch the "probably small, if there is one at all" part? Why then is there a continued debate over the effectiveness of TDD when the conclusion from reseach is that there is at best only a small improvement in productivity? The Two Solitudes essay goes into some of the possible reasons. You should read it.
That got me to thinking about other well-accepted or well-rejected practices: version control software and Apps Hungarian notation. We as practicioners generally accept that version control software is a Good Thing and generally view Apps Hungarian as a negative. If we "know" that version control systems are useful, then can software engineering researchers find experimental evidence to support or reject the claim? If they have, then it might be an example of what a successful research experiment might look like. If it can't, then perhaps it says something about the weakness of the current experimental approaches compared to personal experience.
We can ask the same about other development practices. Can software testing show if Apps Hungarian improves productivity over other variable naming systems?
With those experiments in place we could start asking questions like "are the gains from introducing TDD to the process more like the gains in introducing Apps Hungarian, or more like the gains in introducing Mercurial?" Or more crudely, "is TDD more effective than Apps Hungarian?"
(For "TDD" put pair programming, unit tests, coverage analysis, version control systems, or most any other practice.)
So let's ask the software engineering researchers if they can provide evidence for those beliefs of the practicioners. Perhaps they've done it already?
Do version control systems make people more productive?
As far as I can see (which isn't very far), no one has done experimental research to tell if a version control system makes people more productive. I asked Greg, and he was a bit chagrined to reply that he didn't know of any published research either. Yes, he promotes that everyone use a version control system, despite not knowing of research evidence to back that belief.
Curious, isn't it?
My challenge to researchers is, can you carry out experiments to test the effectiveness of version control software for different types of projects?
This research, like TDD, is complicated by many real-world factors. There are alternatives to using version control software:
- Don't use any version control software, just one directory,
- Depend on system backups (including Time Machine for the Mac),
- Make manual backups of the working code from time, as a tar/zip file, or saved to another directory,
- Use a versioning filesystem, which might be a distributed file system;
- Single developer,
- Pair-programming using the same account,
- Multiple developers all using the same file system,
- Multiple developers across multiple sites;
- Trained as software developers,
- ... electrical engineers,
- ... computational scientists,
- Undergraduates in a first semester programming course;
- Research spikes where the code is thrown away at the end,
- Scientific research where one person works on a new approach for several months, publishes a paper, and never works on it again,
- Traditional main-branch-development-only coding,
- Multi-branch development and multiple supported releases.
It sounds like a lot of work to test these possibilities when we already know the answer, but that's my point. If research methods can't address this question then how do we trust those methods on other topics, like TDD or pair-programming?
Perhaps the research will point out something interesting. Perhaps it will find that research projects which take less than three months and where there is only one developer are 10% more effective using automatic versioning file system over a manual version control system. If that were true, perhaps we should recommend that most scientists use a versioning file system instead of version control.
Or perhaps we should develop better connections between the various file system monitoring tools (like inotify) and the version control systems, to give an automatic/no-touch interface to the version control system.
For that matter, some people start continuous integration tests based on checkins. Why not do CI based on filesystem changes? Obviously some will commit less often, but perhaps productivity will improve.
I can continue to conjecture, but of course, without evidence it's little better than blather. The world is full of blather. That's why we need research.
Does Apps Hungarian make people more productive?
Hungarian Notation is a classic topic. Quoting from Michael Feathers on the c2.com wiki page:
Hungarian notation inspires some of the most bitter religious wars among programmers. Detractors claim that it is indecipherable and that no real standards exist for its usage. Adherents claim that it becomes more likeable with use, and the instant mnemonic recognition of type that one gains with familiarity aids program comprehension immensely.
Hungarian notation seems to be in decline. Microsoft's .Net "General Naming Conventions" document says outright "Do not use Hungarian notation." Nor is Hungarian the de facto style in any of the few dozen languages that I'm familiar with, though some libraries, especially those with a Microsoft heritage, do use it.
This despite the efforts of people like Joel Spolsky to promote a distinction between Apps Hungarian and System Hungarian. The former was is "extremely valuable" while the latter is from "the dark side.")
The Wikipedia page lists other "notable opinions" both for and against.
Some people say it's useful in untyped languages. Others say it Hungarian Notation just gets in the way. Is there any research to back up any of these opinions, or are variable naming schemes yet another unexplored field?
Does TDD make people more productive?
I'm picking on TDD, but this section can be applied to many other topics.
In any TDD debate you'll hear from people who find it useful, and those who don't. You'll hear about people who tried TDD but didn't find it useful, and you'll hear from others who want to help the first group do TDD correctly. And you'll hear from people who tried git and found it confusing ... though they often move to other version control systems instead.
Look at the old Hungarian discussions and you'll see the same patterns, including people that tried it, liked it, won't move away, and believe that others just don't understand Hungarian correctly.
It looks like the effectiveness of TDD has had many more studies then the effectiveness of either version control systems or Hungarian Notation, so let's use that as the reference. Is the difference in effectiveness between using PEP-8 style names and Apps Hungarian more or less than the difference between TDD and coverage-based testing? Is the difference in effectiveness between using version control instead of manual backups more or less than the difference between TDD and test-last?
And more importantly, are the answers to those questions opinions based on experience or are they backed by evidence? Can we use these experiments to calibrate our understanding of what software engineering researchers can identify?
Feedback
Do you know of any research which tests the effectiveness of version control systems over other approaches? What about of App Hungarian over other approaches? Perhaps you have a "strong opinion,", "loud voice," or have had a "couple of pints of beer" and want to leave a comment?
I sent the seed of this essay to Greg, and he posted it on "It will never work in theory." I think that's the best place to continue this dicussion, so I hope to see you there.
I'll end with a quote from Peter Medawar's "Advice to a Young Scientist":
I cannot give any scientist of any age better advice than this: the intensity of a conviction that a hypothesis is true has no bearing over whether it is true or not. The importance of the strength of our conviction is only to provide a proportionately strong incentive to find out of the hypothesis will stand up to critical evaluation.
Postscript
The above essay places the entire burden on software engineering researchers to do new experiments. I don't mean for things to be so one sided. This was also meant for software practicioners.
Software practioners claim a lot of things. Some, like version control systems make sense. Hungarian notation, not so much. I don't think TDD makes much sense, and the research seems to back me up.
If you are software practioner then take that final quote from Medawar to heart. You can make claims about your method X or approach Y, but in the back of your head, remember that I'll be asking you "how do you know that your approach is any better at improving productivity than switching to Apps Hungarian?" You won't be able to answer that question ... unless X or Y is "Apps Hungarian," of course.
At best you'll point to success cases, and then I'll point to the success of Microsoft Word and Excel in the 1990s, and to shops like Fog Creek.
Of course, your proper response will be "but you use version control, right, and where's your evidence for that.". Sigh. Researchers? Help!
Interview with Igor Filippov about OSRA #
A new "Molecular Coding" podcast is out! I interviewed Igor Filippov about OSRA. We recorded it at ICCS in summer of 2011 but I didn't do the editing and transcript until the last couple of days. Good news on the personal side. After about two years of temporary housing, we have finally moved into some place more permanant, in Trollhättan, Sweden. (Yes, these sentences are related.)
Tomorrow I leave for the 2012 German Conference on Chemoinformatics, in Goslar, Germany. I present the fmcs work with Janna Hastings on Monday at 12:05pm, and I have a poster about chemfp at poster P38.
I hope to see some of you there. Let me know if you listen to the podcast!
PyRSS2Gen for Python3 #
Nine years ago I wrote and released PyRSS2Gen. It's a Python library for creating RSS2 feeds. You give it a Python data structure as input, and out pops XML on the other side.
It does one thing, and it does it reasonably well. People still use the package. A cursory search finds that packages for it in Debian and Ubuntu, examples of using it for various projects, and 16 questions about it in StackOverflow.
Not bad for something which only took a few days to develop.
I don't get much mail about it. People really only send email when something's broken, and PyRSS2Gen is a small enough that the few bugs it did have were quickly fixed for the 1.0.0 release.
Times change. PyRSS2Gen-1.0.0 wasn't Python3 compatible. The end of the world was neigh. But then I got a patch from Graham Bell with the two things which needed to change. The world was saved!
For your enjoyment, update to PyRSS2Gen-1.1. The module is a single file which is compatible with Python2.3 and later; and where "later" includes Python3.
The Structure Query Collection #
Finally! Long time readers might recall my long-term goal of improving small molecule search performance. There's my chemfp project for high-performance Tanimoto searches, and I have ideas on how to optimize substructure keys. But they suffer from a lack of a good performance benchmark.
Well, I still don't have a benchmark, since that requires targets. What I do have are several sets of structure queries, in a form that you can easily use to make your own benchmarks. I call it the Structure Query Collection, or SQC for short.
So you want to know if search is faster
There are dozens, if not hundreds of papers on ways to improve search performance. Let's suppose you have a faster way to do a substructure search. (By this I mean that that someone sketches a compound, which your search program interprets as a substructure query.) To show improvement, you need a baseline. The easiest is to select a random set of compounds from the database itself, to use a queries. A problem is that all of the queries will have at least one match - the query compound itself. Real-world queries will sometimes not find matches. Perhaps you can partition your data so that some of the compounds aren't in the and can be used as queries. What's a good ratio for that?
There's even a more fundamental problem: I think people who do a substructure search tend to use smaller structures than what's in the database. If the benchmark queries and targets come from the same set of compounds, then the queries are probably not characteristic of how the dataset will be used. This is only a suspicion: am I right?
The search for real-world data
The only way to find out is to look at real-world query sets. I know people at pharmas who of course have logs from internal searches, but it's considered confidential information since it may contain proprietary compound structure data. Even if the dataset was filtered so that only the public structures are released, there still may be enough for a competitor to get an idea of the types of structures the company is interested in. So that's a no.
PubChem by law can't reveal any information about the user-submited queries. I contacted ChemSpider, but at the time they didn't have enough queries to be worthwhile. In general, companies which provide search don't want to reveal their queries since they get paid by people who want that information to be secret. And so on.
(BTW, if you have a chemical query data set that you can make public, email me and let me know so that I can add it to the collection.)
BindingDB queries
But all is not lost. Last year I talked with Michael Gilson of BindingDB. They have no promises or expectation of privacy of user-submitted data, and he had no problem with getting me that information. More specifically, he asked Tiqing Liu to do the data extraction for me.
That meant some munging of the log files, and a bit of back and forth while I figured out what Tiqing did and he could figure out what information I wanted. The result was a set of SMILES, sketched in Marvin, for each of three classes of search: 3.3 million for exact search, 550,000 for similarity search, and 14,000 for substructure search.
(While I first asked for this data last year, this topic is very much a background task and I didn't have the time to follow up on it until a couple of weeks ago.)
SMARTS queries
That covers searches where the input is a sketched structure. What about SMARTS searches? Rather fewer people sketch or write SMARTS queries directly. Instead, SMARTS are more often part of the automated workflow for feature counts and structure alerts. While I would love to get my hand on real-world user-created SMARTS query sets, it's also useful to have workflow query sets too. Luckily, I know of a few.
A few weeks ago, Ehrlich and Rarey published Systematic benchmark of substructure search in molecular graphs - From Ullmann to VF2. This has a set of SMARTS patterns extracted from various sources. They used it to benchmark two different subgraph isomorphism implementations. The SMARTS patterns are available in the supplementary information, although it took me a while to figure out how to get the patterns.
Greg Landrum had developed a similar benchmark for RDKit, but it was somewhat hidden away in its source distribution and not many people knew about it.
For a classical touch, RDKit defines 162 of the 166 MACCS keys as SMARTS patterns. The MACCS keys are widely used as a cheminformatics fingerprint, so this could be used to benchmark fingerprint generation. SQC actually includes the tweaked version of these SMARTS which I put into chemfp. I changed a few of the SMARTS to be more cross-toolkit friendly.
The "Structure Query Collection"
I've collected all of these together into one spot, called the Structure Query Collection. Not a very creative name, I know, but it is descriptive. There are currently the three BindingDB structure sets and the three SMARTS query sets.
Each set contains the raw data or relevant subset of the data as I got it from the respective sources, as well as source details. The raw data needs processing - at the very least to put the SMILES or SMARTS into a easily used form - so I also included the details, including software, for how I extracted and cleaned up the data so it can be used in a benchmark.
First exploration: substructure query sizes
Earlier I conjectured that structures used as a substructure query will tend to be smaller than the molecules in the target set. Now I have the data to test that hypothesis.
I'll use my OpenSMILES-ragel project, which has a command-line program to count the number of atoms and bonds in a SMILES string. It does very little SMILES processing, and mostly assumes that the SMILES are correct beyond the basic syntax level.
Here I get the number of atoms in each substructure query:
% smiles_counts < sqc/BindingDB_substructure/BindingDB_substructure.dat | head atoms: 20 bonds: 22 atoms: 21 bonds: 23 atoms: 24 bonds: 27 atoms: 17 bonds: 19 atoms: 21 bonds: 24 atoms: 23 bonds: 26 atoms: 23 bonds: 26 atoms: 20 bonds: 23 atoms: 10 bonds: 11 atoms: 16 bonds: 18Or to get the average and standard deviation:
% smiles_counts < sqc/BindingDB_substructure/BindingDB_substructure.dat |
awk '{sum+=$2; sumsq+=$2*$2} END {print sum/NR, "+/-", sqrt(sumsq/NR - (sum/NR)**2)}'
17.8607 +/- 11.1145
For representative targets, I'll use the same ZINC subset from Ehrlich and Rarey's supplemental file "supp3.zip", in datasets/zinc_lead-like_2011-02-12_first100k.smi. This contains some SMILES which the OpenSMILES grammar doesn't accept (it uses an odd SMILES dialect I've not seen before), but I can filter out the failures by removing lines where the number of atoms is "-1".
% SMILESFILE=~/databases/Ehrlich_and_Rarey_VF2_Ullmann/datasets/zinc_lead-like_2011-02-12_first100k.smi
% smiles_counts < $SMILESFILE | head
atoms: 20 bonds: 21
atoms: 17 bonds: 17
atoms: 19 bonds: 20
atoms: -1 bonds: -1
atoms: 22 bonds: 23
atoms: -1 bonds: -1
atoms: 17 bonds: 17
atoms: 14 bonds: 16
atoms: 21 bonds: 23
atoms: 22 bonds: 24
% smiles_counts < $SMILESFILE | fgrep -v -- -1 |
awk '{sum+=$2; sumsq+=$2*$2} END {print sum/NR, "+/-", sqrt(sumsq/NR - (sum/NR)**2)}'
20.238 +/- 2.92336
Since I wasn't sure if there was a bias in the '-1' reports, I
double-checked this by changing the failure cases to something which
was no longer a valid SMILES but where smiles_counts would still
report the same number of atoms. First, I check that the
transformation works, and then I redo the count:
% perl -pe 's/\)(\d|%\d\d)+/)/g' $SMILESFILE | fgrep -- -1
% perl -pe 's/\)(\d|%\d\d)+/)/g' $SMILESFILE | smiles_counts |
awk '{sum+=$2; sumsq+=$2*$2} END {print sum/NR, "+/-", sqrt(sumsq/NR - (sum/NR)**2)}'
20.4499 +/- 2.90956
So, BindingDB substructure queries average 18 atoms while the ZINC targets average about 20 atoms. This agrees with my hypothesis, but I would need to get a copy of BindingDB for better assurance.
Optimizing substructure keys #
Advertisement: This essay describes a research project I've slowly been working on during the last 18 months. I believe I can do very fast substructure searches of large compound data sets. I would like to make it more real, but that requires funding. Contact me if you want to collaborate and can pay for it.
There are two related goals in my quest for an inverted index library, which is unfortunate because what I want for one is not necessarily the best for the other. One goal is to see if I can do PubChem-scale substructure searches with "interactive" performance (under 0.1 seconds). I think I can do that with a lot of precomputation based on substructure enumeration and optimized key development. Based on the feedback from various people, I think this is best handled within an existing database, and most probably PostgreSQL.
The first goal requires developing new database substructure search methods. My other goal is to see if I can optimize substructure keys within the framework used in most existing substructure search programs.
The existing approaches use either a explicit substructure-based fingerprint like the MACCS or CACTVS/PubChem keys, or an exhaustive hash-based fingerprint like the Daylight fingerprints. The substructure-based ones are hand-optimized and small (under 1000 bits), but don't do well with unusual chemistry. The exhaustive ones can handle diverse chemistries, but are larger (typically 1024-4096 bits).
I want to see if I can use my subgraph enumeration approach to do target- and query-based key optimization. That is, given a set of structures in the database and a characteristic set of queries, I would like to develop short, dense fingerprints which can optimize those queries.
Problem: I lack a query set
I'm not there yet. The biggest problem is that I lack any sort of representative sample of the types of queries people do. People at pharmas don't want to contribute their data sets because it's hard to remove all of the proprietary queries from the dataset. By law, PubChem is not allowed to release this data.
Do you have a good (1,000+) query set that I can use for a benchmark? If yes, email me!
So until I get a good query set, I'll have to assume that the targets themselves are representative queries. This of course is incorrect. By definition, a substructure search will likely be smaller than what's in the database. Plus, if someone uses a sketch program to make the query, then they are going to build it up an atom or ring at a time, so the first dozen queries or so will likely be small and have lots of matches.
Optimizing the query keys
At its heart, a substructure query is a subgraph isomorphism test. This is slow. A substructure filter is a quick test which tries to reduce the number of required isomorphism tests. For example, if an "N" (nitrogen) exists in the query but does not exist in the target then there's no need to do the full isomorpism test.
However, if "N" does not exist in the query then it might still match a target molecule which has an "N" in it. For example, the query "O" should still match the target "N#[N+][O-]" (nitrous oxide).
So if the query feature is "1" then the target feature must be a "1", otherwise the isomorphism test is required. This is a simple decision tree; all compounds with featurei take one branch, and all which don't take the other. Since I'm assuming that the queries are sampled from the targets, then what I want is to end up with the set of features which minimize the size of the number of nodes at the end of the decision tree.
Last year I tried a genetic algorithm to find this, but it didn't converge well, and the results weren't pleasing. For example, if I decide that I want to add 32 bits to a fingerprint, then I have to redo the search from scratch and I end up with a completely different set of fingerprints.A greedy optimization algorithm
I came up with a greedy optimization instead. The base of the decision tree has all of the records in a single node. I find the "most divisive" feature, defined as the feature which best separates the tree into two branches, with about half of the records in one branch and half in the other. Each iteration finds the features which most evenly tries to split all of the nodes. I use the scoring function:
def score(current_leaves, feature):
score = 0
for node in current_leaves:
num_on = len(node & feature)
num_off = len(node) - num_on
score += (num_on-num_off)**2
return score
This does a lot of set intersection tests; about 30 million per
iteration in my benchmark. Actually, since all I care about is the
number of elements in the intersection, and not the set itself, I wish
that Python had a "set.intersection_size()" method.
I use the scoring function to find the most divisive feature. This is the one which generates the smallest score. I use that feature to split all of the current leaves into "haves" and "have nots." As an optimization, I've decided that if there are fewer than 10 elements in a node then I don't care about further splitting that node. That's small enough that I can easily do a subgraph isomorphism test.
There are a few other optimizations. I ignore duplicate fingerprints, and I ignore features with fewer than 10 records in them. Set intersections are expensive, so I also do a short-circuit evaluation while computing the scoring function and abort the calculation if the current score is ever worse than the current best score. I also sort the nodes so the ones with the largest size go first. This increases chance of early short-circuiting.
I do some other optimizations, but you'll need to see the code for all the details. Yesterday I found that pypy 1.9's set intersection was about twice as fast as the one in cpython, so that's what I used to run the code.
Preliminary results
I used the same benchmark set (37 MB 7zip compressed) from yesterday's essay, which is a small subset of my full data set. From previous analysis, I know it has high local fingerprint diversity, so I figure it's a good enough start. Here's the subset of output containing the feature keys and the amount of time it took to find each one:
0 Cccc 5.5 seconds 1 CCO 3.6 seconds 2 CN 3.8 seconds 3 C=O 3.8 seconds 4 CCCC 5.1 seconds 5 cN 4.7 seconds 6 Cl 9.0 seconds 7 CC(C)C 10.2 seconds 8 n 8.2 seconds 9 C=C 10.5 seconds 10 cO 10.1 seconds 11 S 11.3 seconds 12 COC 11.7 seconds 13 cc(c)c 24.3 seconds 14 CC(C)O 24.1 seconds 15 O 24.5 seconds 16 CN(C)C 27.4 seconds 17 C1CCCCC1 41.4 seconds 18 CC 36.8 seconds 19 CCC(=O)O 44.1 seconds 20 F 55.2 seconds 21 CCCCCC 52.1 seconds 22 Br 48.2 seconds 23 CCC 52.3 seconds 24 C 51.1 seconds 25 C(CO)O 53.5 seconds 26 CccC 64.2 seconds 27 CCNCC 66.1 seconds 28 P 60.2 seconds 29 CCC(C)C=O 59.6 seconds 30 c1ccccc1 68.4 seconds 31 N 74.9 seconds 32 C1CCCC1 80.9 seconds 33 cCO 85.1 seconds 34 NO 96.5 seconds 35 C(CCO)CO 97.8 seconds 36 C(CN)N 96.4 seconds 37 cc(c)cc(c)c 99.7 seconds 38 CC(C)(C)C 99.2 seconds 39 I 102.6 seconds 40 CcccC 208.8 seconds 41 CCCCCCC 100.7 seconds 42 cc(c)nc 98.0 seconds 43 [Si] 93.7 seconds 44 CNC 95.3 seconds 45 CC(=O)C 94.5 seconds 46 c(cO)O 91.7 seconds 47 CCCCC 89.6 seconds 48 cCc 90.5 seconds 49 CC(C)(C)O 82.6 seconds 50 [Na] 82.6 seconds 51 o 84.3 seconds 52 [As] 85.6 seconds 53 NN 84.6 seconds 54 C(O)O 81.6 seconds 55 c(n)n 83.3 seconds 56 [Se] 80.4 seconds 57 CC=CC 80.1 seconds 58 COCCCOC 79.6 seconds 59 c(cCl)cCl 78.1 seconds 60 C=N 105.0 seconds 61 C#C 59.6 seconds 62 [Cr] 58.5 seconds 63 c(ccN)cN 58.1 seconds 64 CccccC 58.0 seconds 65 CC(C)C(C)C 54.9 seconds 66 ccc 52.7 seconds 67 B 53.3 seconds 68 CCCOCCC 55.1 seconds 69 CccO 51.0 seconds 70 [K] 48.7 seconds 71 CCN(CC)CC 48.6 seconds 72 CCCC=C(C)C 47.4 seconds 73 [Te] 47.8 seconds 74 cc=O 47.4 seconds 75 [Ti] 47.4 seconds 76 O=S(=O)O 47.7 seconds 77 CCC(CC)CO 46.7 seconds 78 [Fe] 46.8 seconds 79 c1ccncc1 46.7 seconds 80 CCOCCOC 52.5 seconds 81 c(ccN)ccN 35.2 seconds 82 [Al] 34.5 seconds 83 CC(C)N 34.2 seconds 84 [Sn] 32.4 seconds 85 c(cCl)Cl 32.4 seconds 86 CSC 31.9 seconds 87 C=CC=C 30.9 seconds 88 c(ccO)ccO 30.3 seconds 89 [Mo] 30.0 seconds 90 s 30.0 seconds 91 C(Cl)Cl 29.3 seconds 92 [Ca] 28.7 seconds 93 CccN 28.3 seconds 94 C(CO)CO 27.3 seconds 95 [W] 26.7 seconds 96 O(P)P 26.8 seconds 97 [Cu] 27.0 seconds 98 CCC(=O)CC 27.5 seconds 99 CCCNCCC 26.8 seconds 100 CO 28.0 seconds 101 [Pb] 19.6 seconds 102 CCOC(=O)C 19.6 seconds 103 [Zn] 18.8 seconds 104 N=N 18.6 seconds 105 [Mg] 18.2 seconds 106 CC(C)CC(C)C 18.0 seconds 107 [Hg] 17.7 seconds 108 C=CCCO 17.2 seconds 109 [V] 17.1 seconds 110 [H] 17.0 seconds 111 cc(c)c(c)c 17.1 seconds 112 c(ccO)cO 16.8 seconds 113 [Zr] 16.5 seconds 114 CC(=O)N 16.5 seconds 115 c(ccCl)cCl 15.9 seconds 116 cc(c)cN 15.8 seconds 117 [Mn] 15.6 seconds 118 COP(=O)OC 15.5 seconds 119 Ccnc 15.4 seconds 120 [Ni] 15.7 seconds 121 CCCCCC=O 12.4 seconds 122 C1CC1 12.1 seconds 123 cOc 11.9 seconds 124 [Co] 11.7 seconds 125 c1cccc1 11.7 seconds 126 [Sb] 11.5 seconds 127 [Pt] 11.5 seconds 128 C(N)N 11.4 seconds 129 O=S 11.0 seconds 130 [Nb] 10.8 seconds 131 C(CCO)CCO 10.8 seconds 132 [U] 10.6 seconds 133 c(cBr)cBr 10.7 seconds 134 CCc 10.6 seconds 135 [Ba] 10.2 seconds 136 CC(C)OC(C)C 10.0 seconds 137 C[Si](C)C 10.1 seconds 138 ccNcc 9.8 seconds 139 [Ru] 9.5 seconds 140 C=CCC=C 9.7 seconds 141 CN(C)(C)C 7.5 seconds 142 SS 7.3 seconds 143 C(F)(F)F 7.2 seconds 144 [Re] 7.0 seconds 145 ccnccn 6.9 seconds 146 CC(C)(C)CCO 6.8 seconds 147 [Cd] 6.7 seconds 148 CCC(C)CC 6.7 seconds 149 CCSCC 6.5 seconds 150 [Ge] 6.5 seconds 151 Cc(c)c(c)c 6.4 seconds 152 [Sr] 6.3 seconds 153 C(COCCO)O 6.2 seconds 154 Ccc(C)cC 6.0 seconds 155 c(c(cCl)Cl)Cl 6.0 seconds 156 OO 6.0 seconds 157 ccnc=O 5.9 seconds 158 C(CCl)Cl 5.8 seconds 159 [Pd] 5.7 seconds 160 c(cO)N 5.8 seconds 161 [Ag] 4.7 seconds 162 C1CCCCCC1 4.7 seconds 163 [Ta] 4.6 seconds 164 [Ce] 4.6 seconds 165 C(CC=O)C=O 4.5 seconds 166 cccCl 4.4 seconds 167 CCC(CC)O 4.3 seconds 168 [Gd] 4.2 seconds 169 [Li] 4.2 seconds 170 cc(c)c(c)O 4.2 seconds 171 C=CCO 4.1 seconds 172 CP 4.1 seconds 173 C#N 4.0 seconds 174 COcc(c)OC 3.9 seconds 175 [Bi] 3.8 seconds 176 c(cN)cS 3.8 seconds 177 [Y] 3.8 seconds 178 CC(C)(C)C=O 3.7 seconds 179 c(ccBr)cBr 3.7 seconds 180 [Yb] 3.8 seconds 181 [In] 3.1 seconds 182 CC(C)CO 3.1 seconds 183 Cc(c)cccO 3.0 seconds 184 COCC(O)O 3.0 seconds 185 [La] 2.9 seconds 186 c(cN)N 2.9 seconds 187 [Eu] 2.9 seconds 188 CCC(CC)CC 2.8 seconds 189 [Ga] 2.8 seconds 190 cCCc 2.8 seconds 191 COC(CO)CO 2.7 seconds 192 [Th] 2.7 seconds 193 C=C(C)C 2.7 seconds 194 c(cF)cF 2.6 seconds 195 C[Si]C 2.6 seconds 196 CncccnC 2.6 seconds 197 cc(c)nc(c)c 2.6 seconds 198 C(CNCCN)N 2.5 seconds 199 [Tl] 2.5 seconds 200 C=C(C)C(C)(C)C 2.5 seconds 201 [Ho] 2.2 seconds 202 [Au] 2.2 seconds 203 CC(C)S 2.2 seconds 204 [Nd] 2.1 seconds 205 cc(cCl)c(c)Cl 2.1 seconds 206 CCC(CC)OC 2.1 seconds 207 c(ccS)ccS 2.1 seconds 208 C(=O)OC=O 2.0 seconds 209 [Be] 2.0 seconds 210 CC(C)c 2.0 seconds 211 [Tb] 2.0 seconds 212 COc 1.9 seconds 213 CCCCO 1.9 seconds 214 c(cBr)Br 1.9 seconds 215 CcccccC 1.8 seconds 216 C(COP)COP 1.8 seconds 217 C(CBr)Br 1.8 seconds 218 [Rh] 1.7 seconds 219 CCCCNC=O 1.7 seconds 220 C(CO)N 1.7 seconds 221 O[Si]O[Si]O 1.5 seconds 222 COcccOC 1.5 seconds 223 CC(C)CC=O 1.4 seconds 224 Cn(c)c(c)c 1.4 seconds 225 [Sm] 1.4 seconds 226 CCC(CO)CO 1.4 seconds 227 C(Cl)(Cl)Cl 1.4 seconds 228 CS 1.3 seconds 229 [Cs] 1.3 seconds 230 cc(c)c(cCl)Cl 1.3 seconds 231 CC1CCC(C)C1 1.3 seconds 232 cc(c)c(c)F 1.2 seconds 233 C(S)S 1.2 seconds 234 cnn 1.2 seconds 235 O=P(O)O 1.1 seconds 236 CCOCOCC 1.1 seconds 237 O=[As]O 1.1 seconds 238 CccccN 1.1 seconds 239 [Er] 1.0 seconds 240 C1CCNCC1 1.0 seconds 241 C(=O)CO 0.7 seconds 242 CC(=C(C)C)C 0.6 seconds 243 [Tm] 0.6 seconds 244 cc1ccncc1 0.6 seconds 245 N=O 0.6 seconds 246 C[Si](C)(C)[Si] 0.6 seconds 247 O=POPOP=O 0.6 seconds 248 CCCOC(C)C 0.6 seconds 249 Ccc(c(C)c)O 0.6 seconds 250 C(CC=O)CC=O 0.6 seconds 251 c(cN=O)cN=O 0.6 seconds 252 c(c(cBr)Br)Br 0.6 seconds 253 C(=CCO)CO 0.6 seconds 254 CCCCCCN 0.5 seconds 255 [Dy] 0.5 seconds 256 c(cCl)OccCl 0.5 seconds 257 CC(C)(F)F 0.5 seconds 258 c(cI)cI 0.5 seconds 259 Cc(c)ccCO 0.5 seconds 260 CCCCC(=O)O 0.5 seconds 261 [Ir] 0.4 seconds 262 c(ccCl)cN 0.4 seconds 263 cc(c)cccs 0.4 seconds 264 cc(c)oc(c)c 0.4 seconds 265 cc(c)c(c)Cl 0.4 seconds 266 COCCC(O)O 0.4 seconds 267 [Pu] 0.4 seconds 268 CCCCOC=O 0.4 seconds 269 cc(c)c(c)S 0.3 seconds 270 c(cn)cncN 0.3 seconds 271 C=CCCC=C 0.3 seconds 272 [Rb] 0.3 seconds 273 CNCCCNC 0.3 seconds 274 C1CC2CCC1C2 0.3 seconds 275 C(=O)ccC(=O)O 0.3 seconds 276 CCCSCCC 0.3 seconds 277 C(C=O)C=O 0.3 seconds 278 CCcccCC 0.3 seconds 279 p 0.3 seconds 280 COc(c)cc=O 0.3 seconds 281 cc1cc(c)sc1 0.2 seconds 282 [Pr] 0.2 seconds 283 CNCO 0.2 seconds 284 CCCC(C)C 0.2 seconds 285 CC[Sn](CC)CC 0.2 seconds 286 C(=O)NC=O 0.2 seconds 287 CC(cc(c)c)O 0.2 seconds 288 c(ccSN)cN 0.2 seconds 289 Ccccc(c)c 0.2 seconds 290 C(C=O)CO 0.2 seconds 291 CCP(CC)CC 0.2 seconds 292 [Hf] 0.2 seconds 293 CN(C)CccO 0.1 seconds 294 O([Si])[Si]O[Si]O[Si] 0.1 seconds 295 CCCCCOC 0.1 seconds 296 CCcccOC 0.1 seconds 297 C1CCC1 0.1 seconds 298 CC(C(C)(C)C)O 0.1 seconds 299 cc1cncc1c 0.1 seconds 300 c1cccccc1 0.1 seconds 301 CCCNcc=O 0.1 seconds 302 CSccc(c)c 0.1 seconds 303 [Os] 0.1 seconds 304 CC(C[Se]C)N 0.1 seconds 305 CCC(CC)Cl 0.1 seconds 306 CCC=CCC 0.1 seconds 307 C(=O)C=O 0.1 seconds 308 CCccCC 0.1 seconds 309 c(cc=O)c=O 0.1 seconds 310 CC(c)c 0.1 seconds 311 CCCOC1CC1 0.1 seconds 312 O=Cl(=O)O 0.1 seconds 313 c(cN(=O)O)cO 0.1 seconds 314 C(CCl)O 0.1 seconds 315 C(C(CO)(O)O)O 0.1 seconds 316 CCC(CC)Br 0.1 seconds 317 cc(cBr)cBr 0.1 seconds 318 C(ccccO)O 0.1 seconds 319 CCCCCCO 0.1 seconds 320 CCCN(C)P 0.1 seconds 321 COCcc(c)c 0.0 seconds 322 C[Si](C)([Si])[Si] 0.0 seconds 323 Cc(c)cc(c)S 0.0 seconds 324 C=C(C)CC(C)c 0.0 seconds 325 CBr 0.0 seconds 326 C(CC(Cl)Cl)CCl 0.0 seconds 327 C(S(=O)=O)S(=O)=O 0.0 seconds 328 C1=CCCCC1 0.0 seconds 329 C1CNCCN1 0.0 seconds 330 C=Cc(c)c(c)c 0.0 seconds 331 CC(c)N(C)Cc 0.0 seconds 332 CCC(=O)OCC 0.0 seconds 333 CcncccF 0.0 seconds 334 O(P)POPOP 0.0 seconds 335 c(c(cF)F)N 0.0 seconds 336 cc(c)cccI 0.0 seconds 337 C(C=O)COC=O 0.0 seconds 338 CCC(CC)N 0.0 seconds 339 CCCC(=O)S 0.0 seconds 340 CC[Pb](CC)CC 0.0 seconds 341 COP(OC)OC 0.0 seconds 342 C[Si](C)(C)C[Si] 0.0 seconds 343 Ccc(C=O)cC 0.0 seconds 344 cc1ccocc1 0.0 seconds 345 ccscc(c)c 0.0 seconds 346 CC(C)(C)COC 0.0 seconds 347 cc(c)cccN 0.0 seconds 348 cc(c)cccO 0.0 seconds 349 C(CO)COC=O 0.0 seconds 350 COc(c)cOc 0.0 seconds 351 [as] 0.0 seconds 352 c(ccCl)ccCl 0.0 secondsThis says that 353 bits are all that's needed in order to put 202,632 unique fingerprints into bins with at most about 10 unique fingerprints in a given bin.
I think that's quite amazing given that most people use fingerprints which are at least twice and usually at least four times that length.
Also, it took 1.75 hours to generate this, and most of the time is spent doing len(set1 & set).
Is it useful?
I really wish I had a good query set to use for tuning - and another dataset to use for validation. I don't, but Greg Landrum made a synthetic one for testing the RDKit/PostgreSQL cartridge. Last December we worked on optimizing the fingerprint filters. I used an earlier version of this program to generate 54 query bits, which he fused with his hash-based fingerprint. Greg reported:
Tacking your most recent 54 bits, interpreted as SMILES, onto the end of a shorter version of my new fingerprint gives: [06:10:40] INFO: FINISHED 50001 (41150823 total, 1325699 searched, 1118055 found) in 80.30 [06:10:40] INFO: screenout: 0.03, accuracy: 0.84 The original performance for the shorter fingerprint was: [05:04:23] INFO: FINISHED 50001 (41150823 total, 1693315 searched, 1118055 found) in 90.90 [05:04:23] INFO: screenout: 0.04, accuracy: 0.66 so those bits definitely help with the screenout. Putting the bits at the beginning of the fingerprint gets a bit more speed: [06:14:48] INFO: FINISHED 50001 (41150823 total, 1325699 searched, 1118055 found) in 71.09 [06:14:48] INFO: screenout: 0.03, accuracy: 0.84I think a 20% speedup is nothing to sneeze at.
I would still like an inverted index library
The greedy algorithm does a lot of set intersections, partitions sets into subsets, and removes old sets. I feel like this is not what databases are designed for, That's why I'm looking for an inverted index library which lets me manage boolean set operations myself. Then again I'm not a database person and I don't know if my belief is true.
I ran my greedy feature selection algorithm until it ran out of data, at only 353 bits. I want to have 1024 substructure bits. This means I need to feed my algorithm a lot more data. That's fine - my test set is only 1/200th of the data I have. However, my implementation running the test set takes up 2 GB of memory on my 12 GB desktop machine, and took 6300 seconds (1h45m) to run.
My concern is that a distributed memory implementation of my algorithm is more complex to write than a shared memory one. The Amazon high-memory instances have 68 GB of RAM. I estimate I need 400 GB with Python sets, and under 40 GB with a more compact bitset implemented in C. This seems eminently doable, which is why I'm looking to see if someone has already done the work.
Based on my searching, and the lack of pointers by others, the answer is "no."
In search of an inverted index library #
I'm looking for a Python library for managing inverted indices. I've been using Python sets, which are nice and fast, but they take up too much memory. Lists use 1/10th the space but are a lot slower and still not compact enough. The most likely solution so far is Xapian. Are there other options I should consider? If so, email me or comment on the Reddit thread I started on this topic.
Let explain what I want and what I've done so far.
Sparse feature fingerprints
I'm working with chemical fingerprints. That's not unusual for me. What's new is that these fingerprints are not fixed length but are of indefinite length. For each molecule I find all fragments with no more than 7 atoms, and put the fragment into a canonical representation, here a canonical fragment SMILES. Each unique representation gets its own unique integer identifier. The end result is a fingerprint which looks like:
0:4,1:3,5:2,18:2,28:2,124,47298,53119 254846The original SMILES for this is CCO.CCO.O=[V].Cl. I picked this structure because it has a very small fingerprint.
This fingerprint line says that PubChem record 254846 has 8 substructures ("features"). The features are stored along with a count of the number of times it's found. Here, features '5', '18', and '28' are found in two difference places in the molecule, while features '124', '47298', and '53119' are found only once. Also, the feature field and the identifier field are separated by a tab.
I kept track of the original canonical description. Here's the breakdown:
canonical
bit# description
---- -----------
0 C
1 O
5 CC
18 CO
28 CCO
124 Cl
47298 O=[V]
53119 [V]
With a bit of back and forth you can verify that it's correct.
Overall I have about 25 million records and 2 million canonical values, using a subset of PubChem's 35 million compounds. The average fingerprint contains about 300-400 features, though value ranges from 1 to about 2,000 features. These are "sparse" fingerprints because because the number of features per record divided by the number of features total is very small (350/2,000,000) - small enough that it's more efficient to store the list of feature bits (0, 1, 5, etc.) than it is to store the bit pattern of 0s and 1s.
The features are very long-tail. Carbon ("C") exists in nearly every record, in both aromatic and non-aromatic forms. "O", "cc", "ccc", and "CC" are also very popular. But about 1/2 of the features only exist in one record. Here's more on the fragment distribution pattern.
Why do I want an inverted index?
Feature fingerprints are old news in cheminformatics. One application is to improve substructure search performance. Suppose you are searching for a phenol (structure "c1ccccc1O"). You could do a substructure isomorphism test with every single record, but that's slow. Instead, there are some obvious improvements.
You know that if a record matches then it must have at least 7 atoms and 7 bonds, at least 6 carbons, and at least 1 oxygen. It's easy to precompute this information, so the set labeled "at least 6 carbons" contains the set of all records with at least 6 carbons, and so on. When a query comes in, analyze the structure based on those predetermined features. This ends up with a list of sets, one for each feature. Find the intersection of those sets, and you've reduced your search space, potentially by a lot.
This is exactly when one should use an inverted index.
(There are other optimizations. If the query exists already as one of the precomputed sets then there's no need to any additional substructure search.)
Feature "folding"
Most cheminformatics systems "fold" their sparse bits into a fixed bit length, like 1024 bits. For example, if a feature is assigned the canonical id of "458793484" then some programs will apply modulo 1024 to that to get the folded bit id "524". This saves a lot of space, at the cost of extra subgraph isomorphism search time.
I've been wondering, do I really need to fold the bits? Why can't I keep all of the information around? My thoughts are informed by Managing Gigabytes, which points out that there are efficient ways to encode the inverted index. Can I apply those techniques to this problem?
Working code
The first step is to establish that there is a problem. I'll use a subset of 10 files, which represents about 0.5% of my actual data set, and write some code to handle it. Here's the final code and here's the code with the benchmark data set (37 MB, compressed with 7zip).
The simplest inverted index implementation uses Python sets.
from collections import defaultdict
class InvertedIndex(object):
def __init__(self):
# Map from a feature string like "ccO" to an integer feature id
self.feature_to_id = {}
# Map from an integer feature id to a set of record ids
self.inverted_indices = defaultdict(set)
def get_feature_id(self, feature):
try:
return self.feature_to_id[feature]
except KeyError:
n = len(self.feature_to_id)
self.feature_to_id[feature] = n
return n
def add_record(self, id, feature_ids):
for feature_id in feature_ids:
self.inverted_indices[feature_id].add(id)
def search(self, features):
# These are *features*, not *feature_ids*.
# If the feature wasn't seen before then there are no entries
# and this will raise a KeyError, meaning nothing found.
try:
terms = [self.inverted_indices[self.feature_to_id[feature]] for feature in features]
except KeyError:
return set()
terms.sort(key=len)
return set.intersection(*terms)
Very simple, and it works pretty well. There's a bit of API trickiness
to get the feature id for a given feature. This simplifies the loader
and handles integer interning. (Interning is outside of the discussion
of this essay.)
Second is the code parse the data files and load the index. The file format looks like:
#FPC1 #bit-0=13120 C #bit-1=12986 c #bit-2=12978 cc ... #bit-46924=1 COCNCP=O 0:6,1:21,2:21,3:22,4:23,5:2,6,...,36858,36869,37247,37429:4,37431,39942 26350001 0:6,1:21,2:21,3:22,4:23,5:2,6,...,36858,36869,37247,37429:4,37431,39942 26350002 ... 0:5,1:12,2:12,3:12,4:12,5:3,6:3,...,25998,32031,32086:2,33182,33998,38357 26374982There's a version line, some header lines starting "#bit-", and the fingerprint lines. The fingerprint line uses the sparse fingerprint I mentioned earlier, followed by a tab, followed by the record identifier. (Here it's a PubChem id, which is always an integer.)
Note: the bit definitions are not the same across different files. The header lines show that bit 0 corresponds to the feature "C" for this file (and it exists in 13120 records), and bit 1 is "c" (in 12986 records), but in another file I see that bit 0 is the feature "cc" and in a third it's "ccc(c)CCN". So the parser needs to be smart and map the per-file bit numbers to its own internal fragment identifier.
The details of the parser aren't important to this essay, and left to the implementation.
Benchmark
I need a benchmark. I'll load 10 data files (links to the 37 MB benchmark file) to make the inverted indices. I also need some query data. I reused the parser to extract a subset (1%) of the records in a fingerprint file. Actually, two fingerprint files. One timing set reuses one of the data files used to build the invertex indices, which means that everything query will match, while the other uses a new data file, which should have many fewer matches.
Again, see the code for full details.
When I run the benchmark I get:
== Sets == Load 0 Compound_000000001_000025000.counts.fpc Load 1 Compound_000025001_000050000.counts.fpc Load 2 Compound_000050001_000075000.counts.fpc Load 3 Compound_000075001_000100000.counts.fpc Load 4 Compound_000100001_000125000.counts.fpc Load 5 Compound_000125001_000150000.counts.fpc Load 6 Compound_000150001_000175000.counts.fpc Load 7 Compound_000175001_000200000.counts.fpc Load 8 Compound_000200001_000225000.counts.fpc Load 9 Compound_000225001_000250000.counts.fpc Loading took 63.0 seconds and 2092896256 memory 100 66575 315.0 per second 200 100909 337.5 per second all exist: 227 searches took 0.7 seconds (counts 114990) 100 2323 1161.5 per second 200 3865 1360.9 per second some exist: 230 searches took 0.2 seconds (counts 4048)Some of this is progress information. The important parts are the load time (63.0) seconds, the total amount of memory used in the index (2.1 GB), and the search times of 0.7 and 0.2 seconds for the two test sets. (I use 'psutil' to get the memory usage; note that the exact meaning is OS dependent.) Also, the match counts of "114990" and "4048" are useful checks that each inverted index implementation is correct.
This output shows the problem. I'm using 2.1 GB of memory for 10 data files. I only have 12 GB on this desktop machine, and remember, the full data set is 200 times bigger. Do you have a 500 GB machine I can use?
Use an array instead of a set
Python objects have overhead. For example, the Python integer 300 on my desktop requires 24 bytes, while I know I have about 2**25 records so could easily get away with 4 bytes.
>>> import sys >>> sys.getsizeof(3) 24The other 20 bytes handle things like a reference to the Python integer class object and the reference count.
The Python set data type uses a lot of space because it's implemented as a hash, and many of the internal slots are empty in order to improve lookup performance. Quick testing shows that a set uses about 30-40 bytes for each element in the set.
It's not quite as bad as what this because the set elements contain Python references, which are 8 byte pointers on my 64-bit Python. But that's still about 42 bytes of space used to store at most 4 bytes of data. I should be able to do better.
The usual way "to do better" in Python, when storing a list of C-like integers is to use an 'array.array.' This is similar to a Python list, except that the contents are all of the same type, so there's no per-item overhead to store the type information. (Note: pje described this approach on the Reddit thread.
The downside is that there's no equivalent to a set.intersection. For now, I'll convert an array into a set, and do the normal set intersection. Here's the code:
# Make sure I can handle 4 byte integers
import array
_test = array.array("I", (0,))
try:
_test[0] = 2**31+100
array_code = "I"
except OverflowError:
array_code = "L"
def make_unsigned_array():
return array.array(array_code, ())
class InvertedIndexArray(object):
def __init__(self):
# Map from a feature string like "ccO" to an integer feature id
self.feature_to_id = {}
# Map from an integer feature id to an set of record ids
self.inverted_indices = defaultdict(make_unsigned_array)
def get_feature_id(self, feature):
try:
return self.feature_to_id[feature]
except KeyError:
n = len(self.feature_to_id)
self.feature_to_id[feature] = n
return n
def add_record(self, id, feature_ids):
for feature_id in feature_ids:
self.inverted_indices[feature_id].append(id)
def search(self, features):
assert features
try:
terms = [self.inverted_indices[self.feature_to_id[feature]] for feature in features]
except KeyError:
return set()
terms.sort(key=len)
# Set conversion is expensive; it's faster to pass an iteratable to intersection_update.
result = set(terms[0])
for term in terms[1:]:
result.intersection_update(term)
# Short-circuit the evalution when the result set is empty.
# Gives a 3x speedup for the "some exists" benchmark, with no
# performance hit for the "all exist" benchmark.
if not result:
break
return result
How well does it work? Here's the benchmark output using the array-based inverted index implementation, trimmed somewhat to make it easier to read:
== Sets == Loading took 63.0 seconds and 2092896256 memory all exist: 227 searches took 0.7 seconds (counts 114990) some exist: 230 searches took 0.2 seconds (counts 4048) == Arrays == Loading took 51.2 seconds and 165552128 memory all exist: 227 searches took 41.7 seconds (counts 114990) some exist: 230 searches took 16.1 seconds (counts 4048)The new version takes less 10% of the memory as the old, which is about what I predicted. It's nice to see the confirmation. It's also a bit faster to build the data structure. But on the other hand, search is over 60x slower. At 0.1 seconds per query, this performance is not acceptable the types of highly interactive searches I want to do.
There are of course hybrid solutions. This version spends a lot of time to convert an array to a set, which is then discarded. I convert the 'C', 'c' and a few other inverted indices a lot, so what I could do is use an LRU cache for the top 1,000 or so sets.
But let's get back to my problem. My data set is 200x bigger than the test set, so simple scaling of 166MB says I'll need 33 GB of RAM in the best of cases. I really would like this to work on my desktop, instead of renting some Amazon compute node every time I want to try things out.
Stop the presses! pypy 1.9 just came out
The pypy-1.9 release notice says:
This is good because in previous benchmarking I found that pypy sets were rather slower than cpython's own sets.
- Sets now have strategies just like dictionaries. This means for example that a set containing only ints will be more compact (and faster).
I re-ran the benchmark using pypy 1.7 and 1.9. Here are the results, first for 1.7:
== Sets == Loading took 278.6s seconds and (psutil not installed) memory all exist: 227 searches took 1.0 seconds (counts 114990) some exist: 230 searches took 0.2 seconds (counts 4048) == Arrays == Loading took 32.7s seconds and (psutil not installed) memory all exist: 227 searches took 299.6 seconds (counts 114990) some exist: 230 searches took 335.6 seconds (counts 4048)and then for 1.9:
== Sets == Loading took 47.9s seconds and (psutil not installed) memory all exist: 227 searches took 0.5 seconds (counts 114990) some exist: 230 searches took 0.1 seconds (counts 4048) == Arrays == Loading took 35.4s seconds and (psutil not installed) memory all exist: 227 searches took 186.2 seconds (counts 114990) some exist: 230 searches took 72.9 seconds (counts 4048)The good news is that 1.9 set operations are faster — 5x faster to load, and 2x faster to search. In fact, the search time is faster than cpython's! The bad news is it looks like pypy's intersection_update isn't yet as well optimized. Also, manual inspection shows that pypy 1.9 uses 2.1 GB of memory, which is about the same as cpython's. pypy on its own isn't the right solution.
Is there an inverted index library for Python?
Inverted indices are used in all sort of search engines, and there are many well-known ways to improve intersection performance and decrease memory use. Some of the ways I know of storing bit sets are:
- FastBit: An Efficient Compressed Bitmap Index Technology
- sparse and dense hash sets from the Google sparsehash library
- Judy1 arrays from Judy
Another possibility is to store the 'on' bits as a sorted list. Sorted lists are nice because the intesection corresponds to a merge sort, in strictly linear time. Other methods are described in Experimental Analysis of a Fast Intersection Algorithm for Sorted Sequences. There was also a recent posting on how to take advantage of SSE instructions for fast sorted list intersection. Since the fields are sorted, it's also possible to compress the data by, for example, storing deltas using something like Elias gamma coding or perhaps an encoder tuned to the data set.
Here too I've not been able to find an existing library which handled this for me, much less one with Python bindings.
Search engine components
Actually, I'm wrong. These are available, as part of a larger search engine package like Lucene and Xapian. Lucene is very well known, but I would have to mix the Java runtime and the chemistry code that I'm using from C++. Doable, but it feels like too much weight.
Instead, I downloaded, compiled, and installed Xapian and its Python bindings and wrote an inverted index based on its API. (Justinsaccount was the first to suggest this in the Reddit thread.)
Xapian uses the filesystem to store the database, so this version uses a lot less local memory than the previous two inverted indices. I wonder how much of the data set is stored in memory vs. how much is read from disk. I imagine there are tuning variables, and it looks like there's an in-memory option I could also do. I am quite the newbie with this library.
Here's my implementation:
try:
import xapian
except ImportError:
xapian = None
class InvertedIndexXapian(object):
def __init__(self, dbname):
self.db = xapian.WritableDatabase(dbname, xapian.DB_CREATE_OR_OPEN)
def get_feature_id(self, feature):
return feature
def add_record(self, id, feature_ids):
try:
doc = self.db.get_document(id)
except xapian.DocNotFoundError:
doc = xapian.Document()
for feature_id in feature_ids:
doc.add_boolean_term(feature_id)
self.db.replace_document(id, doc)
def search(self, features):
enquire = xapian.Enquire(self.db)
query = xapian.Query(xapian.Query.OP_AND, features)
enquire.set_query(query)
matches = enquire.get_mset(0, 200000)
return matches
and when I run the benchmark I get this timing information:
Loading took 349.9 seconds and 46465024 memory all exist: 227 searches took 15.7 seconds (counts 114990) some exist: 230 searches took 20.7 seconds (counts 4048)Queries are about 3-12x slower than the cpython implementation, and 5-20x slower than pypy. But on the other hand, this uses only 90 MB of RAM, although it does build a 340 MB data file. Simple scaling suggests that I'll need 65 GB of disk space for the entire data set. That's quite doable.
The Xapian load time is a lot slower, but as the data is persistent, reload time from a built database is trivially fast. The documentation says I can set XAPIAN_FLUSH_THRESHOLD to a larger value for better indexing performance. Are there other tuning variables I should consider?
Xapian might be the way to go. Value ranges might be useful when I think about how to handle fragment counts. I'll need to do more evaluation. For example, I don't like that it's an order of magnitude slower than Python sets. After all, Python sets aren't optimized for this task, so I expected faster search times.
Sadly, there isn't much documentation about Xapian - and nowhere near like what there is for Lucene. (For example, there appears to be no page which describes the tuning environment variables. There's XAPIAN_MAX_CHANGESETS and XAPIAN_FLUSH_THRESHOLD .. are these or something else relevant for what I want?)
I still want a good inverted index library for Python
Xapian might work well for this project, but there are other projects I work on which use inverted indicies. In one project I try to optimize the ~1024 fixed patterns used in a traditional feature-based substructure screen. Both my greedy algorithm and my genetic algorithm for this used a lot of temporary intersections to evaluate a given selection. I'm not sure how well it would work on top of a persistent database.
If you know of a good library, please email me or comment on the Reddit thread I started on the topic.
Testing hard algorithms #
I developed software to find a maximum common subgraph (MCS) given a set of molecules represented as a chemical graph. It's called fmcs. My previous three essays were about the background of the MCS problem, introducing fmcs, and an example of when MCS is used.
What I didn't describe was the mental effort it took to develop this program. This is the second time I've written code to find the multiple structure MCS, and both times it took a couple of months and put my head in a very strange place. You would think the second time is easier, but it means that I spent more time adding features and doing things that my first version couldn't begin to handle. (Did someone just mutter "second system effect"? Pshaw!)
This time too I had a better understanding of the development process. I think I know why it's so much harder than most of the software I develop.
In this essay, I reflect on some of the reasons why this is a hard problem to test and I consider how unit tests, or any other incremental-based testing approach, are not well-suited to a certain class of complex algorithm development. While unit tests can provide a basic sanity check during development, they fail to provide the test coverage which one might expect from applying them to algorithmically simpler modules. Further, development methodologies built primarily on unit tests, like test-first style test-driven development, aren't that applicable. Instead, other methods, like system testing and a deep understanding of the algorithm and possible problems, are required. This strengthens my view, explored in Problems with TDD that "TDD is a weak method for developing those tests of confidence."
How would you implement an MCS algorithm?
Think about the problem for a moment. How would you write an algorithm to find the largest common subgraph between two graphs?
Take your time. This really is a hard problem. I dug up some of the early papers from the Journal of Computer Documentation. People ended up simplifying the problem by, for example, not supporting rings.
Finished thinking?
You probably came up with a graph walking algorithm which tries different match pairings and uses backtracking or lazy evaluation to search all of graph space. The more mathemtically inlined might have converted this into a maximum clique algorithm.
In both cases there are a lot of arbitrary decisions to make. Which pairings should you investigate first? Are there times when you cann prune the search space? Did you make any assumptions about the nature of the chemical graph (eg, that it's topologically a planar graph)?
How would you test your MCS algorithm?
Back in the late 1990s, I almost never wrote automated tests, and never wrote an extensive test suite. Nowadays I'm pretty thorough, and use coverage-guided techniques to get good, extensive tests through some semi-permanent API layer that will allow me to refactor the internals without changing the tests.
I couldn't do that here.
There are a lot of heuristics, and some of them are only triggered under unusual circumstances, after a bunch of combinitorial possibilities. Outside of a few minor components, I couldn't figure out how to write unit tests for the code.
I ended up pushing most of the testing into validation testing of the complete system, which meant I wrote some 1,000+ lines of code without strong testing. Moreover, I used a new approach to the MCS problem, so the algorithm I was working on doesn't have the track record of the standard clique-based or backtracking approaches.
So stress factors included not knowing if the algorithm would work, and not being able to develop enough test cases to provide good validation during development.
As a rule of thumb, it's easiest to fix bugs which are caught early. Unit tests and evolutionary prototyping are two of the techniques that people use to tighten the feedback loop between specification, implementation, and valiation. I think another stress factor is propotional to the size of the feedback loop.
What testing did I do?
I mean, I did have tests during development. I came up with a few examples by hand, I did a substructure search of a large data set and I verified that the MCS code found that substructure, but I know that's not enough tests. I know this because after six weeks of development and over 1000 lines of code, I spent another three weeks doing a large amount of post-development testing, and found several bugs. In the process of writing this essay I also found that four days of that development work ended up making things slower, so I'll have to remove it.
I did most of my tests based on the ChEMBL data: 10,000 random pairs of structures, the k=2, k=10, and k=100 nearest neighbors, and the k<=100 neighbors with similarities of at least 0.95, 0.9, and 0.8. I also did, at the end, tests based on the ChEBI structure ontology. There were easily 20,000 different machine-generated test cases, although in most cases I didn't know the expected results beforehand.
What bugs did I find?
What bugs did I find? I think it's educational to characterize a few of them.
Typo caught by an assertion check
One of the simplest bugs was a poorly formatted string. I used "%d" when I should have used ""%%%d". A bit of jargon for those in the know; I generate SMARTS strings for the intermedate subgraphs. If there are more than 9 open rings then the syntax goes from a single digit closure number to a closure number like "%10". I forgot to include the '%' for that case, and probably because the '%' was already there for the format string.
This wasn't triggered by my random-pairs test nor my various similarity-search based tests. Only when I ran through the ChEBI data, did I get an assertion failure when RDKit refused to accept my SMARTS string. That was the first time where I had a SMARTS with 10 unclosed rings.
As it happens, this error could have been caught by a unit testing, as some people practice unit tests. It's a four line function which takes a string and a number. Testing it is trivial. I didn't test it because I feel that testing directly against internal functions inhibits refactoring. I prefer to test against higher-level, more stable APIs.
My view is that it's usually easy to come up with high-level test cases which trigger a certain function call. But not in this case. The MCS search algorithm, while deterministic, uses so many arbitrarily defined values that I couldn't on my own come up with a test case with at least 10 open ring closures. And even if I did, a new search heuristic might change things so that only, say, 7 open ring closures were needed.
I felt that my system testing and the assertion check would be enough to identify if there was a problem, and it did. A low-level unit test might have helped, especially as I still don't have a specific test for that failure.
I think the right thing to do is add that failure as a specific test case, and use monkey-patching to insert wrong code for a repeat of that test case. The first one tests that the code is correct, and the second tests that the test case is still exercising the code under test.
Cross-validation testing
The first MCS papers are about as old as I am. Many people have written implementions, although relatively few are are available to me both at no cost and with no prohibitions on using it to develop a new MCS algorithm. (Some commercial companies don't like people using their software to write new software which is competitive to it, or even to use their software for benchmark comparisons.)
I tested pairs of structures against SMSD and I did more extenstive tests against Indigo's MCS implementation.
This is cross-validation testing. It's a relatively rare technique because the cost of producing multiple identical implementations usually isn't worth the benefits. Even here the results aren't exactly identical because of differences in how the toolkits perceive chemistry, and more specifically, aromaticity. I ended up spending a lot of investigation time staring at cases with different answers and trying to figure out if it was a chemistry problem or an MCS algorithm implementation problem.
I found the SMSD had a bug in one of its options, which I reported back to the author. The code had been fixed internally but not pushed to the outside world. Its default mode and fmcs matched quite well, except for a couple of chemistry differences. The new version is out now - I need to test it again.
The only problem I found in the Indigo code was a part of their setup code which didn't check the timeout. That's also been fixed after I reported it.
The cross-validation with Indigo found problems in my code. For example, I was often getting smaller MCSes then Indigo. After looking at them, I figured out that my code didn't correctly handle the case when a molecule was fragmented after a bond was removed because its bondtype wasn't in all of the structures.
Why didn't my hand-written test cases find it? None of them had a case where there was a bond in the "middle" of a structure chosen as the reference structure, and where the MCS was not in the first fragment.
My code usually got the right answer when using highly similar structures, for obvious reasons. It was only the random pair testing where the problem really stood out.
Could I create a simple unit test for this error? Perhaps, but it's not easy. I don't know which of the two fragments will be first - it depends on so many arbitrary decisions which could change as the algorithm is improved. The only test I can think of for this was to generate a diverse set of tests, make sure some fail if the code isn't implemented correctly, record the results, and make sure that future tests never find a worse (or better?) test case.
Bad heuristic to determine the maximum possible subgraph growth
Most of the MCS implementation is heuristics. There's a branch and bound search, there's subgraph canonicalization to reduce the number of substructure tests, and so on. Each of these is supposed to help make the code faster.
One of the tests takes the current subgraph and the list of "outgoing" bonds to see how much of the remainder of the graph is accessible for growth from the current subgraph. The rest of the molecule might not be accessible because it's on another fragment, but it also might not be accessible due to an earlier decision to exclude certain parts of the molecule from future growth. (My algorithm tests all subgraphs which include a given bond, then all subgraph which don't include the given bond.)
It took a couple of days to think of the algorithm, write the code, and get it working. I then did the timing tests to find out it was 1% faster, to get the same answers.
Does that mean my code worked? I only had a suspicion that the new algorithm should be faster. Perhaps the overhead of searching for the accessible atoms was too costly?
After looking at my implementation - a lot - I finally realized that I told the algorithm to exclude the subgraph from consideration for growth but had forgotten to also exclude the set of previously excluded bonds from consideration. With two changed lines, the overall performance doubled for the random-pairs case. Most of the time it's faster and sometimes its slower, so it takes an aggregate of tests to measure this correctly.
I don't think this heuristic could have been written as a unit test. No, I take that back. It could have, but it would have required some careful thought to set up. Not only was this part of the code not "built for test", but setting up the right conditions requires a mental understanding of the problem which I know I didn't have when I was doing the development.
BTW, I am not saying that unit tests can't be used to measure performance. Some algorithms have simple timing characteristics where it's easy to say that the code must complete by a given time, or that it must be at least twice as fast as a reference implementation. Occasionally these tests will hiccup due to unusual loads on the test machine, but not usually enough to be a problem.
Indeed, the fmcs code has some unit tests for the timeout code. I found a pair of structures which takes over 10 seconds to find the MCS, I set the timeout to 0.1 second, and assert that no more than 0.5 seconds elapsed during the function call. (I don't require a high precision for this code.) I do worry though that future changes to the MCS search code might speed things up by a faster of 20. On the other hand, the effect of broken timeout code is very obvious when running the validation suite, so I'm not going to worry about it.
As is widely acknowledged, this stretches the idea of a "unit test." Unit tests are supposed to be fast; preferably several hundred or more per second. The goal is that these tests run often, perhaps even after every save. Stick in a few 0.1 second timeouts into the system and it bogs everything down, which discourages people from running the tests so often.
But let's go back to this new code. On average, over a large number of tests, the performance is 50% faster. What's a good test case? Can I pick one structure or a small set of structures? Does the speedup occur only when there are large molecules? Only for molecules with lots of internal symmetry? Only for those which are easily fragmented?
Only now, when writing this essay, did I find good test cases. When I use a 10 second timeout using the k=10 nearest neighbors tests, I get 8 timeouts in the first 32 records using the old algorithm, and only 1 timeout using the new algorithm. The time for the very first test goes from over a minute (I finally gave up) using the old algorithm to 0.08 seconds using the new one.
Obviously (in retrospect) the right solution is to pull out a couple of those from my validation suite and turn them into test cases.
I never would have come up with these test cases before implementing the new code - up until ten minutes ago I thought that the new code only added a factor of two the code, not possible factors of 1,000!
Canonicalization: an in-depth analysis
I'm not finished with testing. I haven't fully characterized all the parts of the implementation. For that matter, there are heuristics I can still add. Here I'll describe what I did to analyze subgraph SMARTS canonicalization. I conclude by finding that while it works, it slows down the code and several days of development work ought to be removed.
My MCS algorithm enumerates subgraph of one structure, converts that into a SMARTS string, and tests if the pattern exists in other structures.
There are many alternate ways to make a SMARTS string from a subgraph. Given the linear chain of C O N, there's the reasonable CON or NOC variations, the more confusing O(C)N and O(N)C, and crazy variations like C1.N1O and O%987.N7.C%98. It's easy to make the non-insane versions be generating a spanning tree. The question is where to start the search and how to handle branches.
I believe that there are many duplicate patterns in a query. For example, a benzene ring will generate many "ccc" queries. I can minimize the number of substructure matches by caching earlier SMARTS match results. But caching doesn't work if sometimes I get a CON and sometimes I get a NOC. The solution is to generate a "canonical" SMARTS for the subgraph - a unique representation for all of the variations.
I implemented the CANGEN algorithm from SMILES. 2. Algorithm for generation of unique SMILES notation by Weininger, Weininger, and Weininger J. Chem. Inf. Comput. Sci., 1989, 29 (2), pp 97-101 in order to choose a canonical SMARTS. (But see Assigning Unique Keys to Chemical Compounds for Data Integration: Some Interesting Counter Examples, Neglur, Grossman, and Liu, 2nd International Workshop on Data Integration in the Life Sciences (DILS 2005), La Jolla.)
Canonicalization is notoriously hard to get right. It too has a lot of tricky corner cases. For example, in the late 1990s Daylight tracked down a reported bug in their canonicalization implementation. They algorithm requires a stable sort algorithm, but they used an unstable sort. Their own internal testing never found the problem - it was a customer who found the Solaris and IRIX boxes occasionaly returned different values.
There could be similar bugs in my canonicalization algorithm. I don't know. The usual solution is to generate large numbers of test cases, randomize the order of the atoms and bonds, and verify that all of them are the same. Here too it may be that only certain rare topologies, or rare bond patterns, triggers an error. I'm not convinced that I could come up with the right set of test cases for it.
What makes it harder is that the underlying search algorithm doesn't need a canonical SMARTS, so a canonicalization failure doesn't ever show up in the output. I could have an infrequent bug and not notice it! Canonicalization is only there for performance reasons, but my implementation is in Python, while the substructure matching is in C++, so it might even be that the canonicalization time might be more than the time saved.
I can make that a bit more nuanced. Canonical SMARTS generation has three phases: 1) initial ranking, 2) canonicalization/tie-breaking, and 3) SMARTS generation. I can disable the canonicalization step and get a "semi-canonical" SMARTS (my initial ranking is more like a circular fingerprint of diameter 3 than the atom characteristic from Weininger et al.). I can also disable caching. Both of these are doable with only a couple of changes to the code.
This leads to three cases: canonical SMARTS with caching, semi-canonical SMARTS with caching, and semi-canonical SMARTS without caching. (There should be a fourth step, which is arbitrary SMARTS, but that requires more extensive changes.)
What I want to understand most is the effect of canonicalization on my code. Again, I have to resort to aggregate timings and statistics across a set of benchmarks. I used the ChEMBL-13 data set and generated various benchmarks. One is based on computing the MCS between 10,000 pairs selected at random, another contains the k=2, k=10, and k=100 nearest neighbors of randomly chosen fingerprints (timing 500 data sets each time), and the last is the total of 500 tests of the k<=100 compounds with Tanimoto score of at least 0.95 to randomly selected fingerprints. My results (times are reproducible to within a few percent) include the number of unique SMARTS (canonical or non-canonical) generated and the total number of substructure tests ("SS") which were carried out.
| Random pair timings | # unique SMARTS | # SS tests | |
|---|---|---|---|
| no cache | Total: 10000/467.3s (21.4/s) Complete: 9997/437.3s (22.9/s) Incomplete: 3/30.0s | 921666 | 1339292 |
| semi-canonical | Total: 10000/421.3s (23.7/s) Complete: 9997/391.3s (25.6/s) Incomplete: 3/30.0s | 921666 | 921666 |
| canonical | Total: 10000/442.1s (22.6/s) Complete: 9997/412.0s (24.3/s) Incomplete: 3/30.0s | 709636 | 709636 |
| k=2 nearest neigbhors | # unique SMARTS | # SS tests | |
| no cache | Total: 500/287.3s (1.7/s) Complete: 484/127.1s (3.8/s) Incomplete: 16/160.2s | 264057 | 320238 |
| semi-canonical | Total: 500/276.7s (1.8/s) Complete: 484/116.6s (4.2/s) Incomplete: 16/160.2s | 264057 | 264057 |
| canonical | Total: 500/298.6s (1.7/s) Complete: 483/128.4s (3.8/s) Incomplete: 17/170.2s | 186682 | 186682 |
| k=10 nearest neigbhors | # unique SMARTS | # SS tests | |
| no cache | Total: 500/520.5s (1.0/s) Complete: 471/230.0s (2.0/s) Incomplete: 29/290.5s | 447648 | 3525563 |
| semi-canonical | Total: 500/490.1s (1.0/s) Complete: 472/209.5s (2.3/s) Incomplete: 28/280.6s | 447648 | 2872040 |
| canonical | Total: 500/509.3s (1.0/s) Complete: 471/218.7s (2.2/s) Incomplete: 29/290.6s | 330010 | 2109332 |
| k=100 nearest neigbhors | # unique SMARTS | # SS tests | |
| no cache | Total: 500/414.4s (1.2/s) Complete: 486/271.4s (1.8/s) Incomplete: 14/143.0s | 128781 | 9932263 |
| semi-canonical | Total: 500/363.5s (1.4/s) Complete: 487/230.8s (2.1/s) Incomplete: 13/132.7s | 128781 | 7456877 |
| canonical | Total: 500/361.8s (1.4/s) Complete: 488/239.5s (2.0/s) Incomplete: 12/122.4s | 107648 | 6196764 |
| k<=100 at or above Tanimoto threshold of 0.95 | # unique SMARTS | # SS tests | |
| no cache | Total: 500/642.1s (0.8/s) Complete: 458/220.3s (2.1/s) Incomplete: 42/421.9s | 468661 | 4388074 |
| semi-canonical | Total: 500/624.5s (0.8/s) Complete: 461/232.8s (2.0/s) Incomplete: 39/391.7s | 468661 | 3828813 |
| canonical | Total: 500/640.9s (0.8/s) Complete: 460/239.2s (1.9/s) Incomplete: 40/401.8s | 364802 | 3222366 |
Whew! That's a lot of data to throw at you. Please believe me when I say that it took two days to generate correctly. BTW, "complete" means that the MCS search algorithm went to completion, the "incomplete" means that it timed out - here after 10 seconds - and gave a partial solution.
What does this tell me? Obviously, caching is good. As expected, the "semi-canonical" solution is always better than the "no cache" case, and the canonicalization always reduces the number of substructures and substructure tests. This means the canonicalization code is doing something right.
Unfortunately, it seems like canonicalization has a big performance impact. In the pairwise tests I do 922,000 canonicalizations in Python to save 212,000 substructure tests, and I lose 21 seconds in doing so. For the k=100 nearest neighbor benchmark, which is the only one where the canonicalization code saves time, I do 129,000 canonicalizations to save 1,260,000 comparisons, and gain about 2 seconds. This suggests that each canonicalization takes as long as 10 substructure matches, and that RDKit can do 60,000 SMARTS matches per second. A quick checks using one SMARTS gives 100,000 SMARTS matches per second, which helps validate this estimate.
This means I should take out the canonicalization until it can be implemented in C++. Granted, there may be a bug in the code, like there was with an earlier hueristic. But there would have to be an order of magnitude performance increase for this to be effective, and I don't think that's likely.
Unit tests aren't enough to drive development
Which leads me back to my thesis. It would have been possible to develop a few more unit tests for the canonicalization code. There would have been some extra scaffolding in order to do that, but that's a minor cost. I suspect that many of the tests would be very implementation-dependent, and tied to specific internal function calls. I don't like this because it means that a re-implementation of this internal component in C++, which submerges many of the internal helper functions and places them out of reach of Python-based unit tests, would cause the unit tests to be un-runnable. And I am certain that those unit tests would not be rigorous enough to be confident that the code was working correctly.
Think about the question "should I write this code in Python or C++?". It's a development-time question. Test Driven Development (TDD) is a development methodology which uses unit tests to help make development-time decisions. I think TDD is most helpful when used to establish the minimum requirements for a project.
I don't see how TDD is helpful here. I can't think of any unit tests which would guide this development decision. Sometimes you can write a "spike solution" (also called a "prototype"):
A spike solution is a very simple program to explore potential solutions. Build the spike to only addresses the problem under examination and ignore all other concerns.This sounds good, but what development style should you use to develop the spike solution? Spikes are supposed to be throw-away code, but I can't figure out how something other than a complete, tested implementation of the canonical algorithm or the remaining growth heuristics would have been useful. After all, a two line change in the latter went from "working but 1% faster" to "working and 50% faster", and I didn't even know if there was going to give a speedup in the first place.
If not TDD, what methodology do I use?
Read Peter Seibel's Unit testing in Coders at Work. The author interviewed various famous programmers and learned more about how they tested their software. You may also want to read the related comments on Hacker News.
Seibel recounts that Joshua Bloch "wrote a monstrous 'basher'" which found "that occasionally, just occasionally, the basher would fail its consistency check." Bloch then wrote tests to pin down the failure, eventually finding the fault in a system mutex. Note that under normal unit test development you assume that your underlying components are correct, so this isn't something you would normally code for.
Seibel also reports that others assemble a large amount of test cases and uses that to check their code. This is of course the technique I used for the MCS problem.
It feels as trite as saying that the secret to losing weight is diet and exercise, but my method for programming is to understand the problem, implement it, and test it until you are confident that it's good enough. That knowledge of how to do that comes from experience, practice, reflection, and discussion.
For those who scoff and call this "big design up front," I merely point you to earlier parts of this essay. When I started this code I had a rough idea that it would work. I had previously implemented substructure enumeration, and the Weininger et al. canonicalization algorithm and SMARTS generation, so I knew that the components were possible, but I didn't know how they went together. Instead, I let the code development itself guide me. I thought some, I implemented code, I "ran the code in my head", I reflected on what the tricky parts were, I thought about how to make them more clear, and I tried various was to improve the code.
That was a lot to keep in my head, I wasn't sure if the result would be fast enough, and I had no way good way to test its usefulness until most of the code was in place. No wonder it was mentally taxing!
I believe tests - including unit tests implemented during development - are important. I don't believe that unit tests are good enough to guide complex algorithm development. However, most programming (over 95%?) is not complex algorithm development; complicated? yes, but not complex.
I once jokingly said that TDD is not useful if there's two or more embedded for loops. That's a bit of a simplification, but not far from my feelings. Some classes of problems, like complex algorithm development and security analysis, require a different attitude towards programming, emphasizing "what can go wrong?" and "how can I improve my confidence that the code is working correctly?" It's my view that this doubt-based philosophical attitude is missing from most discussions of software development practices.
And perhaps as fundamental, when I have to be that critical of my own work and probe it for failures and be open to the possibility of nasty gotchas lurking in the deep dark corners, then some of that doubt passes over into my personal life. I empathize with the idea of not living in that "strange place" all the time, and I can see why this isn't a common attitude.
Comments
This essay is meant to be a thoughtful reflection on the difficulties of programming, using the MCS problem to provide specific structure. If you want to leave comments about the MCS portion then use the MCS comment site. Otherwise, leave a comment on testing hard algorithms.
Topologically non-planar compounds #
John MacCuish, from Mesa Analytics, pointed out that the MCS problem takes polynomial time if the graphs are planar. He writes: "Are there graphs in your likely sets, that are not planar? I have never seen a non-planar small molecule drug for example, but they may be out there. Tests for planarity are also in P. Of course it doesn't mean that solutions in P will be faster than the usual methods since N is small an the overhead for planar MCS may be large, such that N may need to be large for the planar method to beat the non-planar heuristics."
This leads to a couple of questions. One is, what is the shape of the run-time of my MCS algorithm? I don't actually know. Some tests take a very long time, but that's not enough to establish that it's polynomial for real-world compounds or exponential. I'm not going to research this now.
Another is, are there real-world small molecules which are non-planar, in the topological sense, and not the chemistry sense where all of the non-hydrogen atoms line on or near a plane?
Previous work in topologically non-planar compounds
A quick literature seach finds Synthesis of the first topologically non-planar molecule, which says:
We report here the synthesis and characterization of the tris-ether 2,5,14-trioxahexacyclo-[5.5.2.1.24,10.O4,17.O10,17]-heptadecane 3; this topologically unique (graph theory) molecule is prepared via a novel intramolecular rearrangement of either of two isomeric propellane spiro-epoxides, 1 and 2.
There's also Topological stereochemistry. 9. Synthesis and cutting "in-half" of a molecular Möbius strip by Walba, Homan, Richards, and Haltiwanger in New. J. Chem., 1993, 17, 661-681.
Quoting from the above link to Modern Physical Organic Chemistry by Ansyln and Dougherty:
We mention briefly here another topological issue that has fascinated chemists. For the overwhelming majority of organic molecules, we can draw a two-dimensional representation with no bonds crossing each other. ... It may seem surprising, but most molecules have planar graphs.
Recent efforts have produced chemical structures that successfully realize many interesting and novel topologies. A landmark was certainly the synthesis of a trefoil knot using Sauvage's Cu+/phenanthroline templating strategy.... Vögtle and co-workers have described an "all organic" approach to amide-containing trefoil knots, and have been able to separate the two enantiomeric knots using chiral chromatography. Another seminal advance in the field was the synthesis and characterization of a 'Möbius strip' molecule..."
So it's well-established that there are topologically non-planar structures. But are they in compound databases which I can access?
Searching PubChem for topologially non-planar compounds
Take a look at Scaffold Topologies II: Analysis of Chemical Databases by Wester, Pollock, Coutsias, Allu, Muresan and Oprea in PMC 2010 January 15., Published in final edited form as: J Chem Inf Model 2008 July; 48(7): 1311-1324.
Only 12 nonplanar and 2,099 spiro node topologies (all of which are planar) are present in the merged database. 9 of the nonplanar topologies are found only in PubChem and the total number of molecules represented by such topologies in the merged database is a mere 44, agreeing with Walba's assessment concerning the rarity of chemicals with nonplanar graphs.
This establishes that in 2008 there were no more than 44 topologically non-planar structures in PubChem. Can I find them?
I don't think any of the database search engines support this capability, so I need to write a program.
For that I need a method for planarity testing. Various sources say that the linear time algorithm is "widely regarded as being quite complex", but that one of the linear time planarity algorithms is part of Sage
Sage is a comprehensive mathematical software system, which uses Python. There's a pre-built binary distribution for my OS, which I downloaded. It includes Python, the IPython shell, and everything else it needs, so it really is a system and not a set of Python modules.
With Sage it's easy to make a graph and test for planarity.
% sage
----------------------------------------------------------------------
| Sage Version 5.0, Release Date: 2012-05-14 |
| Type notebook() for the GUI, and license() for information. |
----------------------------------------------------------------------
sage: d = {1: [2,3], 2: [3], 3: [4, 6], 4: [5]}
sage: g = Graph(d)
sage: g.is_planar()
True
sage:
sage: k3_3 = {1: [2,3,4,5,6], 2: [3,4,5,6], 3: [4, 5, 6]}
sage: Graph(k3_3).is_planar()
False
sage:
The dictionary I use as input to "Graph" contains an upper-triangle connection matrix. So to test if a molecule is topologically planar, I just need to convert its connectivity information into a dictionary of the right form, turn the dictionary into a Graph, and test the graph for is_planar().
Roadblock! My Python 2.6 modules and Sage's 2.7 Python don't mix
The prebuilt binaries include its own Python distribution, and when you use "sage", it replaces the PYTHONPATH with its own settings. This means when I run sage I don't have access to the cheminformatics tools I've already set up for my system:
% sage ---------------------------------------------------------------------- | Sage Version 5.0, Release Date: 2012-05-14 | | Type notebook() for the GUI, and license() for information. | ---------------------------------------------------------------------- sage: import rdkit --------------------------------------------------------------------------- ImportError Traceback (most recent call last) /Users/dalke/<ipython console> in <module>() ImportError: No module named rdkit sage: from openeye.oechem import * --------------------------------------------------------------------------- ImportError Traceback (most recent call last) /Users/dalke/<ipython console> in <module>() ImportError: No module named openeye.oechem sage: os.environ["PYTHONPATH"] '/Users/dalke/ftps/sage/local/lib/python'I tried to force it to include the right path, and that failed.
% printenv PYTHONPATH /Users/dalke/ftps/openeye//wrappers/v2011.Oct.1/python/:/Users/dalke/envs/RDKit_2011_12-svn % sage ... sage: import sys sage: sys.path.insert(0, "/Users/dalke/envs/RDKit_2011_12-svn") sage: import os sage: os.environ["DYLD_LIBRARY_PATH"] '/Users/dalke/ftps/sage/local/lib:/Users/dalke/ftps/sage/local/lib/R/lib::/Users/dalke/ftps/openeye//wrappers/libs:/Users/dalke/envs/RDKit_2011_12-svn/lib:/Users/dalke/ftps/sage/local/lib/R/lib' sage: from rdkit import Chem Fatal Python error: Interpreter not initialized (version mismatch?) ------------------------------------------------------------------------ Unhandled SIGABRT: An abort() occurred in Sage. This probably occurred because a *compiled* component of Sage has a bug in it and is not properly wrapped with sig_on(), sig_off(). You might want to run Sage under gdb with 'sage -gdb' to debug this. Sage will now terminate. ------------------------------------------------------------------------ /Users/dalke/ftps/sage/spkg/bin/sage: line 312: 56741 Abort trap sage-ipython "$@" -iWhat happened here is that Sage ships with Python 2.7, while the locally installed cheminformatics toolkits use Python 2.6.
I decided to use another technique. I would have sage call out to another program to handle the cheminformatics. For each structure it will output a line containing the identifier, the SMILES, and the needed upper-triangle dictionary data structure. One line of output will look like:
('15', 'OC1C2(C(C3C(C4(C(CC3)CC(=O)CC4)C)CC2)CC1)C', {0: [1], 1: [2, 19],
2: [3, 20, 17], 3: [4, 18], 4: [5, 9], 5: [6, 16], 6: [7, 15, 14],
7: [8, 10], 8: [9], 10: [11], 11: [12, 13], 13: [14], 16: [17], 18: [19]})
All the sage code needs to do is read those lines, extract the data,
pass the graph into Sage for analysis, and print out those which are
non-planar.
Even this wasn't as easy as I thought. The PYTHONPATH environment variable persists in spawned processes, which the python subprocess I started uses the wrong PYTHONPATH. I ended up setting the environment variables myself - including PYTHONHOME - before it would work correctly:
import sys
import os
env = os.environ.copy()
env["DYLD_LIBRARY_PATH"] = "/Users/dalke/ftps/openeye//wrappers/libs:/Users/dalke/envs/RDKit_2011_12-svn/lib"
env["PYTHONPATH"] = "/Users/dalke/envs/RDKit_2011_12-svn"
env["RDBASE"] = "/Users/dalke/envs/RDKit_2011_12-svn"
env["PYTHONHOME"] = "/System/Library/Frameworks/Python.framework/Versions/2.6"
import subprocess
# Import the "Graph" constructor
from sage.all import *
# Use the version of Python for which RDKit was built to use
p = subprocess.Popen(["/usr/bin/python2.6", "pubchem_connectivity.py"],
env = env,
stdout = subprocess.PIPE)
for i, line in enumerate(p.stdout):
id, smiles, d = eval(line)
G = Graph(d)
if not G.is_planar():
print "======= Found one!!!"
print id, smiles
if i % 100 == 0:
sys.stderr.write("%d ...\r" % (i,))
sys.stderr.flush()
Now I just need the "pubchem_connectivity.py" program to generate the connectivity information.
Call another program to generate the connectivity information
I've been using RDKit these days, because the funding I got for the MCS project said I needed to use RDKit. I usually use OEChem, which (among other things) has much faster structure parsers than RDKit. Since I'm going to read tens of millions of structures, I wanted a way to reduce the number of structures to process.
Now, if a graph has no cycles then it's always planar. Even if it has one, two, or three cycles, it's still impossible for it to be non-planar. The number of cycles in a single component graph is E-V+1, which is the number of edges (bonds), minus the number of vertices (atoms), plus 1. So a simple test is to exclude molecules where the number of bonds is less than three greater than the number of atoms.
Mind you, this is an exclusion test which rejects graphs that cannot be planar. There are plenty of molecules with dozens or rings which are topologically planar. Also, PubChem has plenty of records with multiple components, which means I may miss a few cases. But mind you, my goal is to find some non-planar structures in PubChem, not find all of them.
I had previously converted my local copy of PubChem to a set of compressed SMILES files. A few months ago I wrote a set of Ragel definitions for SMILES, which includes a demonstration of how to use count the number of atoms and bonds in a SMILES file. I can use that unmodified as a co-process to do high-speed counting for me.
import glob
import subprocess
import gzip
import sys
from collections import defaultdict
from rdkit import Chem
# Start a co-process to count the number of atoms and bonds in a SMILES string
p = subprocess.Popen(["/Users/dalke/opensmiles-ragel/smiles_counts"],
stdin = subprocess.PIPE,
stdout = subprocess.PIPE)
filenames = glob.glob("/Users/dalke/databases/pubchem/*.smi.gz")
for i, filename in enumerate(filenames):
msg = "Processing %d/%d\n" % (i+1, len(filenames))
sys.stderr.write(msg)
sys.stderr.flush()
for line in gzip.open(filename):
# Read a line from the data file
smiles, id = line.split()
# Send it to the co-process
p.stdin.write(smiles + "\n")
p.stdin.flush()
# Get the counts (looks like "atoms: 34 bonds: 38")
line = p.stdout.readline()
_, atoms, _, bonds = line.split()
num_atoms = int(atoms)
num_bonds = int(bonds)
# Skip records which cannot be non-planar
# (Note: assumes single component structures!)
if num_bonds < num_atoms + 3:
continue
# Extract the topology into upper-triangle dictionary form
mol = Chem.MolFromSmiles(smiles)
if mol is None:
continue
d = defaultdict(list)
for bond in mol.GetBonds():
b1 = bond.GetBeginAtomIdx()
b2 = bond.GetEndAtomIdx()
if b1 < b2:
d[b1].append(b2)
else:
d[b2].append(b1)
# print to stdout so the Sage program can read it
data = (id, smiles, dict(d))
print repr(data)
BTW, if you haven't been paying attention, I have one process which
tests if a SMILES string has enough bonds in it, another to convert
structure information into a simple topology, and a third program to
do the graph planarity test. "Bailing-wire and chewing gum!" to repeat
an old phrase.
The structures!
It took many hours for my computer to chug along (while I slept). I ended up with 224 of the non-planar SMILES in the subset of 28.5 million PubChem structures I have on my machine. (In other words, bear in mind that this is not a complete search!)
Here are a few structures to show you what they look like:
That first structure, silicon nitride, is a ceramic, and which cannot be expressed in SMILES. (That is "[C]" is an equally poor SMILES representation of graphite as it is for diamond.) The others look like progressively more realistic chemical representations.
The non-planar structures were a bit too verbose to show as a SMILES file. Instead, here's the full list of identifiers, which you can easily use to get the structures yourself (or follow the hyperlink to look at the non-planar depictions).
390566, 498002, 636755, 3084099, 4868274, 5104674, 6712449, 10019039, 10882690, 10895017, 10898366, 10994362, 11027681, 11126005, 11131973, 11187418, 11340053, 11350583, 11360285, 11384163, 11407796, 11672382, 14381365, 14381430, 14381432, 14381435, 16132679, 16132681, 16132995, 16132999, 16133150, 16133259, 16133262, 16133397, 16133412, 16133413, ee16133414, 16133878, 16145580, 16146229, 16146230, 16148442, 16148609, 16148632, 16148888, 16148900, 16149007, 16149114, 16149222, 16149238, 16149361, 16149362, 16149482, 16149579, 16149602, 16149827, 16149958, 16150054, 16150193, 16150399, 16150419, 16150654, 16150658, 16150833, 16150994, 16151169, 16151337, 16151360, 16151567, 16151729, 16151960, 16152362, 16152566, 16152608, 16152699, 16152729, 16153098, 16153207, 16153649, 16154275, 16154327, 16154453, 16154584, 16154971, 16155013, 16155014, 16155015, 16155069, 16155075, 16155076, 16155130, 16155140, 16155152, 16155193, 16155194, 16155197, 16155399, 16155442, 16155443, 16155456, 16155607, 16155626, 16155630, 16155631, 16155632, 16155633, 16155652, 16155784, 16155884, 16155916, 16155917, 16155920, 16155972, 16156042, 16156080, 16156082, 16156145, 16156198, 16156260, 16156261, 16156264, 16156265, 16156312, 16156411, 16156413, 16156417, 16156432, 16156495, 16156539, 16156544, 16156545, 16156609, 16156610, 16156613, 16157557, 16214951, 17749011, 21597602, 21597607, 21597610, 21597611, 21770498, 22294696, 22835058, 22835161, 22835262, 22835624, 22835636, 22835637, 23327291, 23584643, 23726086, 23727886, 23955822, 24764125, 24770227, 24770228, 24770229, 24770290, 24770291, 24871221, 24940071, 25200029, 44239114, 44303783, 44303799, 44303821, 44382489, 44397933, 44397934, 44566282, 44566284, 44575206, 44575207, 44575208, 44592641, 44592642, 44592645, 44592646, 44592647, 44592648, 44592945, 44593582, 44606373, 44606374, 46882313, 46882314, 46882315, 46882316, 46882317, 46882318, 46882319, 46882320, 46882321, 46882322, 46882323, 46882324, 46882325, 46882326, 46882327, 46882328, 46891923, 46895835, 49799159, 49799160, 49873810, 50897242, 50900298, 50900299, 50900300, 50919058, 51004304, 51026319, 52945815, 52952313, 52952314, 52952315, 52952316, 52952317, 53468167, 53468168, 53468169, 53468170, 53468171
Still, having some SMILES on-hand is nice so here are some of the shorter SMILES strings. At the least, you can use these as test-cases for an MCS search engine, and perhaps force a linear-time planar-graph-only method to fail :)
O1C23C4C5(C16OCC7C4(CC(C2(OC(=O)C3(CC5C8C6CC=C9C8(C(=O)C=CC9)C)O)C)OC7=O)C)O 390566 O=C1N2C34N5CCC(C3CCC4(C1)C=CC5)c6c2cccc6 498002 O=C1N2C34N5CCC(C3CCC4(C1)C=CC5)c6c2cc(cc6)OC 636755 [Si]123N4[Si]56N1[Si]4(N25)N36 3084099 C123C45C16C24C6C7C(C35)CC(C(C7)C)C 4868274 O1C23C4C5(C16OCC7C4(CC(C2(OC(=O)C3(CC5C8C6CC=C9C8(C(=O)C=CC9)C)O)C)OC7=O)C)O 5104674 O1C23C4C5(C16OCC7C4(CC(C2(OC(=O)C3(CC5C8C6CC=C9C8(C(=O)C=CC9)C)O)C)OC7=O)C)O 6712449 O=C1N2C34N5CCC(C3CCC4(C1)C=CC5)c6c2cc(cc6)OC 10019039 O=C1N2C34N5CCC(C3CCC4(C1)C=CC5)c6c2cc(c(c6)OC)OC 10882690 O1C2C34C5C(CN(C3C(C5OC)c6c(cc(c(c6)OC)O)C4C1)CC)(C(C2)O)CO 10895017 O1c2c3cccc2Cc4c5c(ccc4)Cc6c7c(ccc6)Cc8c(c(ccc8)C3)OCCOCCN(CCOCCO5)CCOCCOc9c(cccc9)OCCOCCN(CCOCC1)CCOCCO7 10898366 O1C2C34C5C(C(C2)OC(=O)C)(CN(C3C(C5OC)c6c(cc(c(c6)OC)O)C4C1)CC)COC 10994362 C123C45c6c(cccc6)C17c8c(cccc8)C2(c9c(cccc9)C3(c1c4cccc1)c1c7cccc1)c1c5cccc1 11027681 OC1N2C34N5CCC(C3CCC4(C1)C=CC5)c6c2cc(c(c6)OC)OC 11131973 O1c2c3c(ccc2)OB4Oc5c6c(ccc5)OB1Oc7c(c(ccc7)O4)N63 11187418 C123C45C6C7C8C19C1C(C4C4C%10C22C%11C(C5C5C%12C33C(C6C5)C8CC5C9C(C2C(C35)CC%12%11)CC1%10)C4)C7 11340053 O1C2C34C5C(CN(C3C(C5OC)c6c(cc(c(c6)O)O)C4C1)CC)(C(C2)O)COC 11350583 C123C4C5C1C6C5(C4C26)C=CC#CC=CC78C9C1C7C2C1(C9C82)C=CC#CC=C3 11360285 C123C4C5C1C6C5(C4C26)C=CC#CC#CC=CC78C9C1C7C2C1(C9C82)C=CC#CC#CC=C3 11384163 O1C2C34C5C(CN(C3C(C5OC)c6c(cc(c(c6)OC)O)C4C1)CC)(C(C2)O)COC 11407796 O(c1cc2c(cc1OC)C34C56C27c8c(cc(c(c8)OC)OC)C5(c9c3cc(c(c9)OC)OC)c1c(cc(c(c1)OC)OC)C6(c1c7cc(c(c1)OC)OC)c1c4cc(c(c1)OC)OC)C 11672382 O(C12NC(=O)C3C4C56C1C7C8C59C15C6(C3C3C1C(C9C43)C#N)C2C7C5C8=O)C(=O)C 14381365 O1C2C3C45C67C2C8C3C9C42C6(C8C9=O)C3C4C7C(C5C4C2C3C#N)C1=O 14381430 O1C2C3C45C67C2C8C3C9C42C63C8C9OC(=O)C4C2C2C5C(C7C2C34)C1=O 14381432 BrC12C34C56C7(C8C9C5C5C3C9C1C8C(=O)OC1C2C2C4C(C6C2C71)OC5=O)Br 14381435 O1c2c3c4c5c6c7c3c(cc8c7c(cc6Oc9cc(ccc9)OCCOCCOc3cc1ccc3)C(=O)N(C8=O)C(CCCCCC)C)Oc1cc(ccc1)OCCOCCOc1cc(ccc1)Oc5cc1c4c(c2)C(=O)N(C1=O)C(CCCCCC)C 16214951 BrC12C34C56C7(C8C9C5C5C3C9C1C8C(=O)OC1(C2C2C4C(C6C2C71)OC5=O)Br)Br 21597602 BrC12C34C56C7C8C9C3C7C(=O)OC3C4C4C1C(=O)C(C5(C8C(C29)C(=O)OCC)Br)C4C63 21597607 BrC12C34C56C7C8C9C5C5C3C9C1C8C(=O)OC1C2C2C4C(C6C2C71)OC5=O 21597610 O1C2C3C45C67C2C8C3C9C42C6(C8C9O)C3C4C7C(C5C4C2C3C(=O)O)C1 21597611 O1C23OC(OC4C25C6C7(C1=O)C(C3OC(OC5C(=C)C4CC6)(C)C)C(CCC7)(C)C)(C)C 21770498 ClC1=CC2OC3C1C4OC5C2C4C3C5 22294696 O=C1N2C34N5CCC(C3CCC4(C1)C=CC5)c6c2cc(cc6)OC 23327291 O1c2cc3c4cc2OCCOCCOCCOc5c(cc6c(c5)c(c7c(c6C)cc8c(c7)OCCOCCOCCOc9cc2c(cc9OCCOCCOCCO8)C4(c4c(cccc4)C32C)C)C)OCCOCCOCC1 23584643 O1C2(OCC34C5C2(CCC3OC(=O)C67C4C(OC51)CC(C6)C(=C)C7=O)C)C 23727886 O=C1N2C34N5CCC(C3CCC4(C1)C=CC5)c6c2cc(c(c6)OC)OC 23955822 O1C2C3C45C6C1OC(C6(CCC4OC(=O)C37CC(C2)C(=C)C7=O)C)OC5 24764125 O=C1N2C34N5CCC(C3CCC4(C1)C=CC5)C6=C2CC(=C(C6)OC)OC 24871221 O1C2C34C56C(C(C2(C=C5)OC)C(O)(CCCC)C)CCN(C6Cc7c3c1c(cc7)O)CC4 44303783 O1C2C34C56C(C(C2(C=C5)OC)C(O)(CCCC)C)CN(C6Cc7c3c1c(cc7)O)CC4 44303799 O1C2C34C56C(C(C2(C=C5)OC)C(O)(CCCC)C)CCN(C6Cc7c3c1c(cc7)O)CC4 44303821 S(=O)(=O)(OCC12C3C4C(C1)CC(C3)CC4C2)N 44382489 O=C1N2C34N5CCC(C3CCC4(C1)C=CC5)c6c2cc(c(c6)OC)OC 44592945 O1C23C4C5(C16OCC7C4(CC(C2(OC(=O)C3(CC5C8C6CC=C9C8(C(=O)C=CC9)C)O)C)OC7=O)C)O 49873810 O=C1N2C34N5CCC(C3CCC4(C1)C=CC5)c6c2cccc6 50897242 O=C1N2C34N5CCC(C3CCC4(C1)C=CC5)c6c2cc(cc6)OC 50919058 O1C23C4C5(C16OCC7C4(CC(C2(OC(=O)C3(CC5C8C6CC=C9C8(C(=O)C=CC9)C)O)C)OC7=O)C)O 51004304 O=C1N2C34N5CCC(C3CCC4(C1)C=CC5)c6c2cc(cc6)OC 51026319
Feel free to leave a comment.
Finding the MCSes for the ChEBI ontology #
This is part 3 of a series on MCS:
- MCS background
- fmcs - find the MCS of a set of compounds
- Finding the MCSes for the ChEBI ontology
The industrious folks at EBI have been developing ChEBI, which expands to "Chemical Entities of Biological Interest." Quoting Wikipedia, "[ChEBI] is a database and ontology of molecular entities focused on 'small' chemical compounds, that is part of the Open Biomedical Ontologies effort."
They define several distinct ontologies. One is a chemical structure ontology. For example, the identifier CHEBI:33567 contains catecholamine, and a few examples of catecholamines are hexoprenaline (CHEBI:37950), arbutamine (CHEBI:50580), L-isoprenaline (CHEBI:6257). In addition, catecholamine is a catechol (CHEBI:33566), which in turn is a benzenediol (CHEBI:33570), and so on. A group can have more than one parent; catecholamine is also a monoamine molecular messenger (CHEBI:25375).
The end result is a hierarchial structure. The bottom of the hierarchy are structures, and intermediate nodes are such that all children of the node have some common property.
Some of these common properties map directly to a common substructure. For example, CHEBI:33853 contains phenols, so every compound under that node has "one or more hydroxy groups attached to a benzene or other arene ring."
However, not all of them do. As Chepelev, Hastings, Ennis, Steinbeck, and Dumontier pointed out in "Self-organizing ontology of biochemically relevant small molecules", BMC Bioinformatics 2012, 13:3, the term "'ester' includes compounds that conform to C(=O)OC (i.e. carboxylic esters) and C(=S)OC patterns, among others."
Other cases can't even be represented as SMARTS. They give "bicyclic" as one such example.
Can I find the MCS of all structures in a node in the ontology?
I was curious to see if I could use their data set as a test of fmcs. If their intermediate nodes have a machine-readable way to tell if it's a purely substructure-based node, and if I could get the size information, then I could get all the structures underneath it, find the MCS, and compare my answer to theirs.
Alas, they don't have that annotation information. It's something they are working on, but I didn't get the impression that it's a high priority. (I don't see why it should, either.)
Still, it's an interesting thought - what if I were to generate the MCS for all nodes, and visualize the results somehow?
It took a bit longer than I thought, but I finally downloaded their ontology (in OBO format), parsed it, extracted the hierarchy, figured out the compounds in each node, tossed out the structures that RDKit couldn't parse, and the nodes which didn't have at least two remaining structures in them.
One that was done, I let my MCS algorithm at it. It took about 50 minutes to process. (Well, I had a 15 second timeout on the MCS. I've found that 15 seconds is usually good enough.)
I also developed a visualizer for the result, using Karen Schomburg's SMARTSviewer and Daylight's depictmatch.cgi
Oooh! More pictures!
Here's a snapshot of one of the successful cases, CHEBI:16648, which is dialkyl phosphate:

Most of the results aren't as clear-cut. For example, CHEBI:16389 contains the ubiquinones. I found the MCS:

which is nearly right, but the Wikipedia page for Coenzyme_Q10 ("Coenzyme Q10, also known as ubiquinone, ...") shows a methyl attached to the top-most oxygen this SMARTS depiction. This is because CHEBI:18238 is a structure in the set which does not have that methyl attached!
It this methyl important? I don't know. I'm not a chemist, and this requires expertise I simply don't have.
An oopsie in the oxolanes?
What I do know is that there's a mistake in the oxolanes, CHEBI:26912. Wikipedia calls this tetrahydrofuran and says it's an 5-membered ring with the formula (CH2)4O. I would write it as the SMILES/SMARTS "O1CCCC1".
However, my search finds only "OCCCC"; it doesn't find the cycle. There shouldn't be a problem with this one so I investigated further, wondering if it was a bug. It ended up that acetylblasticidin S (CHEBI:2413) is considered an oxolane. A quick look at the structure though shows that it has no 5-membered ring.
I think that's an annotation error. BTW, I do not envy the job of annotator. There's a lot of data to review, and people like me end up pointing out the mistakes, not the huge amount of work to get all the other parts right.
Even more pictures... most of the ChEBI ontology!
Do you want to see the output of my full analysis? Do you have a lot of memory on your computer? If so, download fmcs_chebi.html.bz2. It's only 7.5 MB but it bzip2 uncompresses to 166 MB. Open fmcs_chebi.html in your browser, and have fun! (Note: I'll probably delete it after a month or so.)
BTW: the images are computed on-demand using servers from the University of Hamburg and from Daylight. I didn't want to show everything at once since that would put a huge demand on those servers. Instead, you'll need to press the "Toggle images" button in order to see the SMARTS and the graphical depiction of the matches.
Comments
If you have comments or questions, leave them here.
Copyright © 2001-2010 Dalke Scientific Software, LLC.


