<?xml version="1.0" encoding="iso-8859-1"?>
<rss version="2.0"><channel><title>Andrew Dalke's writings</title><link>http://www.dalkescientific.com/writings/diary/index.html</link><description>Writings from the software side of bioinformatics and
  chemical informatics, with a heaping of Python thrown in for good
  measure.  Code to taste.  Best served at room temperature.</description><lastBuildDate>Sat, 15 Dec 2012 10:16:13 GMT</lastBuildDate><generator>PyRSS2Gen-1.0.0</generator><docs>http://blogs.law.harvard.edu/tech/rss</docs><item><title>Is TDD better than Apps Hungarian?</title><link>http://www.dalkescientific.com/writings/diary/archive/2012/12/13/version_control_and_user_testing.html</link><description>&lt;P&gt;
Greg Wilson and Jorge Aranda's essay &lt;a
href="http://third-bit.com/blog/archives/4552.html"&gt;Two Solitudes&lt;/a&gt;
describes the different viewpoints and additudes between software
engineering researchers and practicioners. Here is one of the
complaints:

&lt;blockquote&gt;
At industry-oriented gatherings, it seems that a strong opinion, a
loud voice, and a couple of pints of beer constitute "proof" of almost
any claim. Take test-driven development, for example. If you ask its
advocates for evidence, they'll tell you why it has to be true; if you
press them, you'll be given anecdotes, and if you press harder, people
will be either puzzled or hostile.
&lt;br /&gt;&lt;br /&gt;
Most working programmers simply don't know that scientists have been
doing empirical studies of TDD, and of software development in
general, for almost forty years. It's as if family doctors didn't know
that the medical research community existed, much less what they had
discovered. Once again, the first step toward bridging this gulf
seemed to be to get each group to tell the other what they knew.
&lt;/blockquote&gt;


Greg worked with Andy Oram to collect and edit the essays in &lt;a
href="http://shop.oreilly.com/product/9780596808303.do"&gt;Making
Software: What Really Works, and Why We Believe It&lt;/a&gt;, as a way to
present some of that research evidence to software practicioners. The
Two Solitudes essay refers to that book, saying its "topics range from
the impact of test-driven development on productivity (probably small,
if there is one at all) to whether machine learning techniques can
predict fault rates in software modules (yes, with the right
training)."

&lt;/P&gt;
&lt;h3&gt;Why is there a debate over TDD?&lt;/h3&gt;
&lt;P&gt;

Did you catch the "&lt;i&gt;probably small, if there is one at all&lt;/i&gt;"
part?  Why then is there a continued debate over the effectiveness of
TDD when the conclusion from reseach is that there is at best only a
small improvement in productivity? The &lt;a
href="http://third-bit.com/blog/archives/4552.html"&gt;Two Solitudes&lt;/a&gt;
essay goes into some of the possible reasons. You should read it.

&lt;/P&gt;&lt;P&gt;

That got me to thinking about other well-accepted or well-rejected
practices: version control software and Apps Hungarian notation. We as
practicioners generally accept that version control software is a Good
Thing and generally view Apps Hungarian as a negative. If we "know"
that version control systems are useful, then can software engineering
researchers find experimental evidence to support or reject the claim?
If they have, then it might be an example of what a successful
research experiment might look like. If it can't, then perhaps it says
something about the weakness of the current experimental approaches
compared to personal experience.


&lt;/P&gt;&lt;P&gt;

We can ask the same about other development practices. Can software
testing show if Apps Hungarian improves productivity over other
variable naming systems?

&lt;/P&gt;&lt;P&gt;

With those experiments in place we could start asking questions like
"are the gains from introducing TDD to the process more like the gains
in introducing Apps Hungarian, or more like the gains in introducing
Mercurial?" Or more crudely, "is TDD more effective than Apps
Hungarian?"

&lt;/P&gt;&lt;P&gt;

(For "TDD" put pair programming, unit tests, coverage analysis,
version control systems, or most any other practice.)

&lt;/P&gt;&lt;P&gt;

So let's ask the software engineering researchers if they can provide
evidence for those beliefs of the practicioners. Perhaps they've done
it already?

&lt;/P&gt;
&lt;h3&gt;Do version control systems make people more productive?&lt;/h3&gt;
&lt;P&gt;

As far as I can see (which isn't very far), no one has done
experimental research to tell if a version control system makes people
more productive. I asked Greg, and he was a bit chagrined to reply
that he didn't know of any published research either. Yes, he promotes
that everyone use a version control system, despite not knowing of
research evidence to back that belief.

&lt;/P&gt;&lt;P&gt;

Curious, isn't it?

&lt;/P&gt;&lt;P&gt;

My challenge to researchers is, can you carry out experiments to test
the effectiveness of version control software for different types of
projects?

&lt;/P&gt;&lt;P&gt;

This research, like TDD, is complicated by many real-world
factors. There are alternatives to using version control software:

&lt;ul&gt;
  &lt;li&gt;Don't use any version control software, just one directory,
  &lt;li&gt;Depend on system backups (including Time Machine for the Mac),
  &lt;li&gt;Make manual backups of the working code from time, as a
        tar/zip file, or saved to another directory,
  &lt;li&gt;Use a versioning filesystem, which might be a distributed
        file system;
&lt;/ul&gt;

different development scenarios:

&lt;ul&gt;
  &lt;li&gt;Single developer,
  &lt;li&gt;Pair-programming using the same account,
  &lt;li&gt;Multiple developers all using the same file system,
  &lt;li&gt;Multiple developers across multiple sites;
&lt;/ul&gt;

different types of developers:

&lt;ul&gt;
  &lt;li&gt;Trained as software developers,
  &lt;li&gt;... electrical engineers,
  &lt;li&gt;... computational scientists,
  &lt;li&gt;Undergraduates in a first semester programming course;
&lt;/ul&gt;

and even different needs:

&lt;ul&gt;
  &lt;li&gt;Research spikes where the code is thrown away at the end,
  &lt;li&gt;Scientific research where one person works on a new approach
      for several months, publishes a paper, and never works on it again,
  &lt;li&gt;Traditional main-branch-development-only coding,
  &lt;li&gt;Multi-branch development and multiple supported releases.
&lt;/ul&gt;


&lt;/P&gt;&lt;P&gt;

It sounds like a lot of work to test these possibilities when we
already know the answer, but that's my point. If research methods
can't address this question then how do we trust those methods on
other topics, like TDD or pair-programming?

&lt;/P&gt;&lt;P&gt;

Perhaps the research will point out something interesting. Perhaps it
will find that research projects which take less than three months and
where there is only one developer are 10% more effective using
automatic versioning file system over a manual version control
system. If that were true, perhaps we should recommend that most
scientists use a versioning file system instead of version control.

&lt;/P&gt;&lt;P&gt;

Or perhaps we should develop better connections between the various
file system monitoring tools (like &lt;a
href="http://en.wikipedia.org/wiki/Inotify"&gt;inotify&lt;/a&gt;) and the
version control systems, to give an automatic/no-touch interface to
the version control system.

&lt;/P&gt;&lt;P&gt;

For that matter, some people start continuous integration tests based
on checkins. Why not do CI based on filesystem changes? Obviously some
will commit less often, but perhaps productivity will improve.

&lt;/P&gt;&lt;P&gt;

I can continue to conjecture, but of course, without evidence it's
little better than blather. The world is full of blather. That's why
we need research.

&lt;/P&gt;
&lt;h3&gt;Does Apps Hungarian make people more productive?&lt;/h3&gt;
&lt;P&gt;

Hungarian Notation is a classic topic. Quoting from Michael Feathers
on the &lt;a href="http://c2.com/cgi/wiki?HungarianNotation"&gt;c2.com wiki
page&lt;/a&gt;:

&lt;blockquote&gt;
Hungarian notation inspires some of the most bitter religious wars
among programmers. Detractors claim that it is indecipherable and that
no real standards exist for its usage. Adherents claim that it becomes
more likeable with use, and the instant mnemonic recognition of type
that one gains with familiarity aids program comprehension
immensely.
&lt;/blockquote&gt;

&lt;/P&gt;&lt;P&gt;

Hungarian notation seems to be in decline. Microsoft's .Net "&lt;a
href="http://msdn.microsoft.com/en-us/library/ms229045.aspx"&gt;General
Naming Conventions&lt;/a&gt;" document says outright "&lt;b&gt;Do not use
Hungarian notation&lt;/b&gt;." Nor is Hungarian the de facto style in any of
the few dozen languages that I'm familiar with, though some libraries,
especially those with a Microsoft heritage, do use it.

&lt;/P&gt;&lt;P&gt;

This despite the efforts of people like Joel Spolsky to promote a
distinction between &lt;a
href="http://www.joelonsoftware.com/articles/Wrong.html"&gt;Apps
Hungarian and System Hungarian&lt;/a&gt;. The former was is "extremely
valuable" while the latter is from "the dark side.")

&lt;/P&gt;&lt;P&gt;

The Wikipedia page lists other "&lt;a href="http://en.wikipedia.org/wiki/Hungarian_notation#Notable_opinions"&gt;notable opinions&lt;/a&gt;" both for and against.

&lt;/P&gt;&lt;P&gt;

Some people say it's useful in untyped languages. Others say it
Hungarian Notation just gets in the way. Is there any research to back
up any of these opinions, or are variable naming schemes yet another
unexplored field?

&lt;/P&gt;
&lt;h3&gt;Does TDD make people more productive?&lt;/h3&gt;
&lt;P&gt;

I'm picking on TDD, but this section can be applied to many other
topics.

&lt;/P&gt;&lt;P&gt;

In any TDD debate you'll hear from people who find it useful, and
those who don't. You'll hear about people who tried TDD but didn't
find it useful, and you'll hear from others who want to help the first
group do TDD correctly. And you'll hear from people who tried git and
found it confusing ... though they often move to other version
control systems instead.

&lt;/P&gt;&lt;P&gt;

Look at the old Hungarian discussions and you'll see the same
patterns, including people that tried it, liked it, won't move away,
and believe that others just don't understand Hungarian correctly.

&lt;/P&gt;&lt;P&gt;

It looks like the effectiveness of TDD has had many more studies then
the effectiveness of either version control systems or Hungarian
Notation, so let's use that as the reference. Is the difference in
effectiveness between using PEP-8 style names and Apps Hungarian more
or less than the difference between TDD and coverage-based testing?
Is the difference in effectiveness between using version control
instead of manual backups more or less than the difference between TDD
and test-last?

&lt;/P&gt;&lt;P&gt;

And more importantly, are the answers to those questions opinions
based on experience or are they backed by evidence? Can we use these
experiments to calibrate our understanding of what software
engineering researchers can identify?

&lt;/P&gt;
&lt;h3&gt;Feedback&lt;/h3&gt;
&lt;P&gt;

Do you know of any research which tests the effectiveness of version
control systems over other approaches? What about of App Hungarian
over other approaches? Perhaps you have a "strong opinion,", "loud
voice," or have had a "couple of pints of beer" and want to &lt;a
href="http://www.neverworkintheory.org/?p=451"&gt;leave a comment&lt;/a&gt;?

&lt;/P&gt;&lt;P&gt;

I sent the seed of this essay to Greg, and he posted it on "&lt;a
href="http://www.neverworkintheory.org/"&gt;It will never work in
theory&lt;/a&gt;." I think that's the best place to continue this dicussion,
so I hope to &lt;a href="http://www.neverworkintheory.org/?p=451"&gt;see you
there&lt;/a&gt;.

&lt;/P&gt;&lt;P&gt;

I'll end with a quote from Peter Medawar's "Advice to a Young Scientist":

&lt;blockquote&gt;
I cannot give any scientist of any age better advice than this: the
intensity of a conviction that a hypothesis is true has no bearing
over whether it is true or not. The importance of the strength of our
conviction is only to provide a proportionately strong incentive to
find out of the hypothesis will stand up to critical evaluation.
&lt;/blockquote&gt;

&lt;/P&gt;
&lt;h3&gt;Postscript&lt;/h3&gt;
&lt;P&gt;

The above essay places the entire burden on software engineering
researchers to do new experiments. I don't mean for things to be so
one sided. This was also meant for software practicioners.

&lt;/P&gt;&lt;P&gt;

Software practioners claim a lot of things. Some, like version control
systems make sense. Hungarian notation, not so much. I don't think TDD
makes much sense, and the research seems to back me up.

&lt;/P&gt;&lt;P&gt;

If you are software practioner then take that final quote from Medawar
to heart. You can make claims about your method X or approach Y, but
in the back of your head, remember that I'll be asking you "how do you
know that your approach is any better at improving productivity than
switching to Apps Hungarian?" You won't be able to answer that
question ... unless X or Y is "Apps Hungarian," of course.

&lt;/P&gt;&lt;P&gt;

At best you'll point to success cases, and then I'll point to the
success of Microsoft Word and Excel in the 1990s, and to shops like
Fog Creek.

&lt;/P&gt;&lt;P&gt;

Of course, your proper response will be "but you use version control,
right, and where's &lt;i&gt;your&lt;/i&gt; evidence for that.". Sigh. Researchers?
Help!

&lt;/P&gt;
</description><guid isPermaLink="true">http://www.dalkescientific.com/writings/diary/archive/2012/12/13/version_control_and_user_testing.html</guid><pubDate>Thu, 13 Dec 2012 12:00:00 GMT</pubDate></item><item><title>Interview with Igor Filippov about OSRA</title><link>http://www.dalkescientific.com/writings/diary/archive/2012/11/10/Igor_Filippov_and_OSRA.html</link><description>&lt;P&gt;

A new "Molecular Coding" podcast is out! &lt;a
href="http://dalkescientific.com/podcast/Molecular_Coding/Podcast/Entries/2012/11/10_Igor_Filippov_and_OSRA.html"&gt;I
interviewed Igor Filippov&lt;/a&gt; about &lt;a
href="http://cactus.nci.nih.gov/osra/"&gt;OSRA&lt;/a&gt;. We recorded it at
ICCS in summer of 2011 but I didn't do the editing and transcript
until the last couple of days. Good news on the personal side. After
about two years of temporary housing, we have finally moved into some
place more permanant, in Trollh&amp;auml;ttan, Sweden. (Yes, these
sentences are related.)

&lt;/P&gt;&lt;P&gt;

Tomorrow I leave for the &lt;a href="https://www.gdch.de/index.php?id=1400"&gt;2012 German Conference on Chemoinformatics&lt;/a&gt;,
in Goslar, Germany. I present the &lt;a href="https://bitbucket.org/dalke/fmcs"&gt;fmcs&lt;/a&gt; work with Janna Hastings on
Monday at 12:05pm, and I have a poster about &lt;a href="http://code.google.com/p/chem-fingerprints/"&gt;chemfp&lt;/a&gt; at poster P38.

&lt;/P&gt;&lt;P&gt;

I hope to see some of you there. Let me know if you listen to the
podcast!

&lt;/P&gt;

</description><guid isPermaLink="true">http://www.dalkescientific.com/writings/diary/archive/2012/11/10/Igor_Filippov_and_OSRA.html</guid><pubDate>Sat, 10 Nov 2012 12:00:00 GMT</pubDate></item><item><title>PyRSS2Gen for Python3</title><link>http://www.dalkescientific.com/writings/diary/archive/2012/08/27/pyrss2gen_for_python3.html</link><description>&lt;P&gt;

Nine years ago I wrote and released &lt;a
href="http://dalkescientific.com/Python/PyRSS2Gen.html"&gt;PyRSS2Gen&lt;/a&gt;. It's
a Python library for creating RSS2 feeds. You give it a Python data
structure as input, and out pops XML on the other side.

&lt;/P&gt;&lt;P&gt;

It does one thing, and it does it reasonably well. People still use
the package. A cursory search finds that packages for it in Debian and
Ubuntu, examples of using it for various projects, and &lt;a
href="http://stackoverflow.com/search?q=pyrss2gen"&gt;16 questions about
it&lt;/a&gt; in StackOverflow.

&lt;/P&gt;&lt;P&gt;

Not bad for something which only took a few days to develop. 

&lt;/P&gt;&lt;P&gt;

I don't get much mail about it. People really only send email when
something's broken, and PyRSS2Gen is a small enough that the few bugs
it did have were quickly fixed for the 1.0.0 release.

&lt;/P&gt;&lt;P&gt;

Times change. PyRSS2Gen-1.0.0 wasn't Python3 compatible. The end of
the world was neigh. But then I got a patch from Graham Bell with the
two things which needed to change. The world was saved!

&lt;/P&gt;&lt;P&gt;


For your enjoyment, update to &lt;a
href="http://dalkescientific.com/Python/PyRSS2Gen-1.1.tar.gz"&gt;PyRSS2Gen-1.1&lt;/a&gt;.
The module is a single file which is compatible with Python2.3 and
later; and where "later" includes Python3.

&lt;/P&gt;

</description><guid isPermaLink="true">http://www.dalkescientific.com/writings/diary/archive/2012/08/27/pyrss2gen_for_python3.html</guid><pubDate>Mon, 27 Aug 2012 12:00:00 GMT</pubDate></item><item><title>The Structure Query Collection</title><link>http://www.dalkescientific.com/writings/diary/archive/2012/08/21/structure_query_collection.html</link><description>&lt;P&gt;

Finally! Long time readers might recall my long-term goal of improving
small molecule search performance. There's my &lt;a
href="http://code.google.com/p/chem-fingerprints/"&gt;chemfp&lt;/a&gt; project
for high-performance Tanimoto searches, and I have ideas on how to &lt;a
href="http://dalkescientific.com/writings/diary/archive/2012/06/11/optimizing_substructure_keys.html"&gt;optimize
substructure keys&lt;/a&gt;. But they suffer from a lack of a good
performance benchmark.

&lt;/P&gt;&lt;P&gt;

Well, I still don't have a benchmark, since that requires
targets. What I do have are several sets of structure queries, in a
form that you can easily use to make your own benchmarks. I call it the
&lt;a href="https://bitbucket.org/dalke/sqc"&gt;Structure Query
Collection&lt;/a&gt;, or SQC for short.


&lt;/P&gt;
&lt;h2&gt;So you want to know if search is faster&lt;/h2&gt;
&lt;P&gt;

There are dozens, if not hundreds of papers on ways to improve search
performance. Let's suppose you have a faster way to do a substructure
search. (By this I mean that that someone sketches a compound, which
your search program interprets as a substructure query.) To show
improvement, you need a baseline. The easiest is to select a random
set of compounds from the database itself, to use a queries. A problem
is that all of the queries will have at least one match - the query
compound itself. Real-world queries will sometimes not find
matches. Perhaps you can partition your data so that some of the
compounds aren't in the and can be used as queries. What's a good
ratio for that?

&lt;/P&gt;&lt;P

What about duplicate queries? Different people will submit the same
query, so if there are lots of repeats and little diversity then
perhaps it makes good sense to cache previous results. But what is the
diversity and duplication rate of real-world queries?

&lt;/P&gt;&lt;P&gt;

There's even a more fundamental problem: I think people who do a
substructure search tend to use smaller structures than what's in the
database. If the benchmark queries and targets come from the same set
of compounds, then the queries are probably not characteristic of how
the dataset will be used. This is only a suspicion: am I right?

&lt;/P&gt;
&lt;h2&gt;The search for real-world data&lt;/h2&gt;
&lt;P&gt;

The only way to find out is to look at real-world query sets. I know
people at pharmas who of course have logs from internal searches, but
it's considered confidential information since it may contain
proprietary compound structure data. Even if the dataset was filtered
so that only the public structures are released, there still may be
enough for a competitor to get an idea of the types of structures the
company is interested in. So that's a no.

&lt;/P&gt;&lt;P&gt;

PubChem by law can't reveal any information about the user-submited
queries. I contacted ChemSpider, but at the time they didn't have
enough queries to be worthwhile. In general, companies which provide
search don't want to reveal their queries since they get paid by
people who want that information to be secret. And so on.

&lt;P&gt;&lt;P&gt;

(BTW, if &lt;i&gt;you&lt;/i&gt; have a chemical query data set that you can make
public, &lt;a href="mailto:dalke@dalkescientific.com"&gt;email me&lt;/a&gt; and
let me know so that I can add it to the collection.)

&lt;/P&gt;
&lt;h2&gt;BindingDB queries&lt;/h2&gt;
&lt;P&gt;

But all is not lost. Last year I talked with Michael Gilson of &lt;a
href="http://www.bindingdb.org/"&gt;BindingDB&lt;/a&gt;. They have no promises
or expectation of privacy of user-submitted data, and he had no
problem with getting me that information. More specifically, he asked
Tiqing Liu to do the data extraction for me.

&lt;/P&gt;&lt;P&gt;

That meant some munging of the log files, and a bit of back and forth
while I figured out what Tiqing did and he could figure out what
information I wanted. The result was a set of SMILES, sketched in
Marvin, for each of three classes of search: 3.3 million for exact
search, 550,000 for similarity search, and 14,000 for substructure
search.

&lt;/P&gt;&lt;P&gt;

(While I first asked for this data last year, this topic is very much
a background task and I didn't have the time to follow up on it until
a couple of weeks ago.)


&lt;/P&gt;
&lt;h2&gt;SMARTS queries&lt;/h2&gt;
&lt;P&gt;

That covers searches where the input is a sketched structure. What
about SMARTS searches? Rather fewer people sketch or write SMARTS
queries directly. Instead, SMARTS are more often part of the automated
workflow for feature counts and structure alerts. While I would love
to get my hand on real-world user-created SMARTS query sets, it's also
useful to have workflow query sets too. Luckily, I know of a few.

&lt;/P&gt;
&lt;P&gt;

A few weeks ago, Ehrlich and Rarey published &lt;a
href="http://www.jcheminf.com/content/4/1/13/abstract"&gt;Systematic
benchmark of substructure search in molecular graphs - From Ullmann to
VF2&lt;/a&gt;. This has a set of SMARTS patterns extracted from various
sources. They used it to benchmark two different subgraph isomorphism
implementations. The SMARTS patterns are available in the
supplementary information, although it took me a while to figure out
how to get the patterns.

&lt;/P&gt;&lt;P&gt;

Greg Landrum had developed a similar benchmark for RDKit, but it was
somewhat hidden away in its source distribution and not many people
knew about it.

&lt;/P&gt;&lt;P&gt;

For a classical touch, RDKit defines 162 of the 166 MACCS keys as
SMARTS patterns. The MACCS keys are widely used as a cheminformatics
fingerprint, so this could be used to benchmark fingerprint
generation. SQC actually includes the tweaked version of these SMARTS
which I put into chemfp. I changed a few of the SMARTS to be more
cross-toolkit friendly.

&lt;/P&gt;
&lt;h2&gt;The "Structure Query Collection"&lt;/h2&gt;
&lt;P&gt;

I've collected all of these together into one spot, called the &lt;a
href="https://bitbucket.org/dalke/sqc"&gt;Structure Query
Collection&lt;/a&gt;. Not a very creative name, I know, but it is
descriptive. There are currently the three BindingDB structure sets
and the three SMARTS query sets.

&lt;/P&gt;&lt;P&gt;

Each set contains the raw data or relevant subset of the data as I got
it from the respective sources, as well as source details. The raw
data needs processing - at the very least to put the SMILES or SMARTS
into a easily used form - so I also included the details, including
software, for how I extracted and cleaned up the data so it can be
used in a benchmark.

&lt;/P&gt;
&lt;h2&gt;First exploration: substructure query sizes&lt;/h2&gt;
&lt;P&gt;

Earlier I conjectured that structures used as a substructure query
will tend to be smaller than the molecules in the target set. Now I
have the data to test that hypothesis.

&lt;/P&gt;&lt;P&gt;

I'll use my &lt;a
href="https://bitbucket.org/dalke/opensmiles-ragel"&gt;OpenSMILES-ragel&lt;/a&gt;
project, which has a command-line program to count the number of atoms
and bonds in a SMILES string. It does very little SMILES processing,
and mostly assumes that the SMILES are correct beyond the basic syntax
level.

&lt;/P&gt;&lt;P&gt;

Here I get the number of atoms in each substructure query:

&lt;pre class="code"&gt;
% smiles_counts &amp;lt; sqc/BindingDB_substructure/BindingDB_substructure.dat  | head
atoms: 20 bonds: 22
atoms: 21 bonds: 23
atoms: 24 bonds: 27
atoms: 17 bonds: 19
atoms: 21 bonds: 24
atoms: 23 bonds: 26
atoms: 23 bonds: 26
atoms: 20 bonds: 23
atoms: 10 bonds: 11
atoms: 16 bonds: 18
&lt;/pre&gt;

Or to get the average and standard deviation:

&lt;pre class="code"&gt;
% smiles_counts &amp;lt; sqc/BindingDB_substructure/BindingDB_substructure.dat | 
   awk '{sum+=$2; sumsq+=$2*$2} END {print sum/NR, "+/-", sqrt(sumsq/NR - (sum/NR)**2)}'
17.8607 +/- 11.1145
&lt;/pre&gt;

&lt;/P&gt;&lt;P&gt;

For representative targets, I'll use the same ZINC subset from Ehrlich
and Rarey's supplemental file "supp3.zip", in
datasets/zinc_lead-like_2011-02-12_first100k.smi. This contains some
SMILES which the OpenSMILES grammar doesn't accept (it uses an odd
SMILES dialect I've not seen before), but I can filter out the
failures by removing lines where the number of atoms is "-1".

&lt;pre class="code"&gt;
% SMILESFILE=~/databases/Ehrlich_and_Rarey_VF2_Ullmann/datasets/zinc_lead-like_2011-02-12_first100k.smi
% smiles_counts &amp;lt; $SMILESFILE | head
atoms: 20 bonds: 21
atoms: 17 bonds: 17
atoms: 19 bonds: 20
atoms: -1 bonds: -1
atoms: 22 bonds: 23
atoms: -1 bonds: -1
atoms: 17 bonds: 17
atoms: 14 bonds: 16
atoms: 21 bonds: 23
atoms: 22 bonds: 24
% smiles_counts &amp;lt; $SMILESFILE | fgrep -v -- -1 |
   awk '{sum+=$2; sumsq+=$2*$2} END {print sum/NR, "+/-", sqrt(sumsq/NR - (sum/NR)**2)}'
20.238 +/- 2.92336
&lt;/pre&gt;

Since I wasn't sure if there was a bias in the '-1' reports, I
double-checked this by changing the failure cases to something which
was no longer a valid SMILES but where smiles_counts would still
report the same number of atoms. First, I check that the
transformation works, and then I redo the count:

&lt;pre class="code"&gt;
% perl -pe 's/\)(\d|%\d\d)+/)/g' $SMILESFILE | fgrep -- -1
% perl -pe 's/\)(\d|%\d\d)+/)/g' $SMILESFILE | smiles_counts |
   awk '{sum+=$2; sumsq+=$2*$2} END {print sum/NR, "+/-", sqrt(sumsq/NR - (sum/NR)**2)}'
20.4499 +/- 2.90956
&lt;/pre&gt;

&lt;/P&gt;&lt;P&gt;

So, BindingDB substructure queries average 18 atoms while the ZINC
targets average about 20 atoms. This agrees with my hypothesis, but I
would need to get a copy of BindingDB for better assurance.

&lt;/P&gt;

</description><guid isPermaLink="true">http://www.dalkescientific.com/writings/diary/archive/2012/08/21/structure_query_collection.html</guid><pubDate>Tue, 21 Aug 2012 12:00:00 GMT</pubDate></item><item><title>Optimizing substructure keys</title><link>http://www.dalkescientific.com/writings/diary/archive/2012/06/11/optimizing_substructure_keys.html</link><description>&lt;P&gt;

Advertisement: This essay describes a research project I've slowly
been working on during the last 18 months. I believe I can do very
fast substructure searches of large compound data sets. I would like
to make it more real, but that requires funding.
&lt;a href="mailto:dalke@dalkescientific.com"&gt;Contact me&lt;/a&gt; if you want
to collaborate and can pay for it.

&lt;/P&gt;&lt;P&gt;

There are two related goals in my quest for an inverted index library,
which is unfortunate because what I want for one is not necessarily
the best for the other. One goal is to see if I can do PubChem-scale
substructure searches with "interactive" performance (under 0.1
seconds). I think I can do that with a lot of precomputation based on
substructure enumeration and optimized key development. Based on the
feedback from various people, I think this is best handled within an
existing database, and most probably PostgreSQL.

&lt;/P&gt;&lt;P&gt;

The first goal requires developing new database substructure search
methods. My other goal is to see if I can optimize substructure keys
within the framework used in most existing substructure search
programs.

&lt;/P&gt;&lt;P&gt;

The existing approaches use either a explicit substructure-based
fingerprint like the 
&lt;a href="http://dalkescientific.com/writings/diary/archive/2011/01/20/implementing_cactvs_keys.html"&gt;MACCS or CACTVS/PubChem keys&lt;/a&gt;,
or an exhaustive hash-based fingerprint like the
&lt;a href="http://www.daylight.com/dayhtml/doc/theory/theory.finger.html"&gt;Daylight fingerprints&lt;/a&gt;.
The substructure-based ones are hand-optimized and small (under 1000
bits), but don't do well with unusual chemistry. The exhaustive ones
can handle diverse chemistries, but are larger (typically 1024-4096
bits).

&lt;/P&gt;&lt;P&gt;

I want to see if I can use my
&lt;a href="http://dalkescientific.com/writings/diary/archive/2011/01/13/faster_subgraph_enumeration.html"&gt;subgraph enumeration approach&lt;/a&gt;
to do target- and query-based key optimization. That is, given a set of
structures in the database and a characteristic set of queries, I
would like to develop short, dense fingerprints which can optimize
those queries.

&lt;/P&gt;
&lt;h2&gt;Problem: I lack a query set&lt;/h2&gt;
&lt;P&gt;

I'm not there yet. The biggest problem is that I lack any sort of
representative sample of the types of queries people do. People at
pharmas don't want to contribute their data sets because it's hard to
remove all of the proprietary queries from the dataset. By law,
PubChem is not allowed to release this data.

&lt;/P&gt;&lt;P&gt;

Do you have a good (1,000+) query set that I can use for a benchmark?
If yes, &lt;a href="mailto:dalke@dalkescientific.com"&gt;email me&lt;/a&gt;!

&lt;/P&gt;&lt;P&gt;

So until I get a good query set, I'll have to assume that the targets
themselves are representative queries. This of course is incorrect. By
definition, a substructure search will likely be smaller than what's
in the database. Plus, if someone uses a sketch program to make the
query, then they are going to build it up an atom or ring at a time,
so the first dozen queries or so will likely be small and have lots of
matches.

&lt;/P&gt;
&lt;h2&gt;Optimizing the query keys&lt;/h2&gt;
&lt;P&gt;

At its heart, a substructure query is a subgraph isomorphism
test. This is slow. A substructure filter is a quick test which tries
to reduce the number of required isomorphism tests. For example, if an
"N" (nitrogen) exists in the query but does not exist in the target
then there's no need to do the full isomorpism test.

&lt;/P&gt;&lt;P&gt;

However, if "N" does not exist in the query then it might still match
a target molecule which has an "N" in it. For example, the query "O"
should still match the target "N#[N+][O-]" (nitrous oxide).

&lt;/P&gt;&lt;P&gt;

So if the query feature is "1" then the target feature must be a "1",
otherwise the isomorphism test is required. This is a simple decision
tree; all compounds with feature&lt;sub&gt;i&lt;/sub&gt; take one branch, and all
which don't take the other. Since I'm assuming that the queries are
sampled from the targets, then what I want is to end up with the set
of features which minimize the size of the number of nodes at the end
of the decision tree.

&lt;P&gt;&lt;/P&gt;

Last year I tried a
&lt;a href="http://dalkescientific.com/writings/diary/archive/2011/01/16/screening_teaser.html"&gt;genetic algorithm&lt;/a&gt;
to find this, but it didn't converge well, and the results weren't
pleasing. For example, if I decide that I want to add 32 bits to a
fingerprint, then I have to redo the search from scratch and I end up
with a completely different set of fingerprints.

&lt;/P&gt;
&lt;h2&gt;A greedy optimization algorithm&lt;/h2&gt;
&lt;P&gt;

I came up with a &lt;a href="http://dalkescientific.com/writings/diary/most_divisive_features.py"&gt;greedy optimization&lt;/a&gt; instead. The base of the decision
tree has all of the records in a single node. I find the "most
divisive" feature, defined as the feature which best separates the
tree into two branches, with about half of the records in one branch
and half in the other. Each iteration finds the features which most
evenly tries to split all of the nodes. I use the scoring function:

&lt;pre class="code"&gt;
def score(current_leaves, feature):
    score = 0
    for node in current_leaves:
        num_on = len(node &amp;amp; feature)
        num_off = len(node) - num_on
        score += (num_on-num_off)**2
    return score
&lt;/pre&gt;

This does a &lt;i&gt;lot&lt;/i&gt; of set intersection tests; about 30 million per
iteration in my benchmark. Actually, since all I care about is the
number of elements in the intersection, and not the set itself, I wish
that Python had a "set.intersection_size()" method.

&lt;/P&gt;&lt;P&gt;

I use the scoring function to find the most divisive feature. This is
the one which generates the smallest score. I use that feature to
split all of the current leaves into "haves" and "have nots." As an
optimization, I've decided that if there are fewer than 10 elements in
a node then I don't care about further splitting that node. That's
small enough that I can easily do a subgraph isomorphism test.

&lt;/P&gt;&lt;P&gt;

There are a few other optimizations. I ignore duplicate fingerprints,
and I ignore features with fewer than 10 records in them. Set
intersections are expensive, so I also do a short-circuit evaluation
while computing the scoring function and abort the calculation if the
current score is ever worse than the current best score. I also sort
the nodes so the ones with the largest size go first. This increases
chance of early short-circuiting.

&lt;/P&gt;&lt;P&gt;

I do some other optimizations, but you'll need to
&lt;a href="http://dalkescientific.com/writings/diary/most_divisive_features.py"&gt;see the code&lt;/a&gt; for all
the details. Yesterday I found that pypy 1.9's set intersection was about
twice as fast as the one in cpython, so that's what I used to run the code.

&lt;/P&gt;
&lt;h2&gt;Preliminary results&lt;/h2&gt;
&lt;P&gt;

I used the same &lt;a href="http://dalkescientific.com/inverted_index_benchmark.7z"&gt;benchmark set&lt;/a&gt;
(37 MB 7zip compressed) from
&lt;a href="http://dalkescientific.com/writings/diary/archive/2012/06/10/inverted_index_library.html"&gt;yesterday's essay&lt;/a&gt;,
which is a small subset of my full data set. From
&lt;a href="http://dalkescientific.com/writings/diary/archive/2011/12/25/unique_fragments_in_pubchem.html"&gt;previous analysis&lt;/a&gt;,
I know it has high local fingerprint diversity, so I figure it's a
good enough start. Here's the subset of output containing the feature
keys and the amount of time it took to find each one:

&lt;pre class="code"&gt;
0 Cccc 5.5 seconds
1 CCO 3.6 seconds
2 CN 3.8 seconds
3 C=O 3.8 seconds
4 CCCC 5.1 seconds
5 cN 4.7 seconds
6 Cl 9.0 seconds
7 CC(C)C 10.2 seconds
8 n 8.2 seconds
9 C=C 10.5 seconds
10 cO 10.1 seconds
11 S 11.3 seconds
12 COC 11.7 seconds
13 cc(c)c 24.3 seconds
14 CC(C)O 24.1 seconds
15 O 24.5 seconds
16 CN(C)C 27.4 seconds
17 C1CCCCC1 41.4 seconds
18 CC 36.8 seconds
19 CCC(=O)O 44.1 seconds
20 F 55.2 seconds
21 CCCCCC 52.1 seconds
22 Br 48.2 seconds
23 CCC 52.3 seconds
24 C 51.1 seconds
25 C(CO)O 53.5 seconds
26 CccC 64.2 seconds
27 CCNCC 66.1 seconds
28 P 60.2 seconds
29 CCC(C)C=O 59.6 seconds
30 c1ccccc1 68.4 seconds
31 N 74.9 seconds
32 C1CCCC1 80.9 seconds
33 cCO 85.1 seconds
34 NO 96.5 seconds
35 C(CCO)CO 97.8 seconds
36 C(CN)N 96.4 seconds
37 cc(c)cc(c)c 99.7 seconds
38 CC(C)(C)C 99.2 seconds
39 I 102.6 seconds
40 CcccC 208.8 seconds
41 CCCCCCC 100.7 seconds
42 cc(c)nc 98.0 seconds
43 [Si] 93.7 seconds
44 CNC 95.3 seconds
45 CC(=O)C 94.5 seconds
46 c(cO)O 91.7 seconds
47 CCCCC 89.6 seconds
48 cCc 90.5 seconds
49 CC(C)(C)O 82.6 seconds
50 [Na] 82.6 seconds
51 o 84.3 seconds
52 [As] 85.6 seconds
53 NN 84.6 seconds
54 C(O)O 81.6 seconds
55 c(n)n 83.3 seconds
56 [Se] 80.4 seconds
57 CC=CC 80.1 seconds
58 COCCCOC 79.6 seconds
59 c(cCl)cCl 78.1 seconds
60 C=N 105.0 seconds
61 C#C 59.6 seconds
62 [Cr] 58.5 seconds
63 c(ccN)cN 58.1 seconds
64 CccccC 58.0 seconds
65 CC(C)C(C)C 54.9 seconds
66 ccc 52.7 seconds
67 B 53.3 seconds
68 CCCOCCC 55.1 seconds
69 CccO 51.0 seconds
70 [K] 48.7 seconds
71 CCN(CC)CC 48.6 seconds
72 CCCC=C(C)C 47.4 seconds
73 [Te] 47.8 seconds
74 cc=O 47.4 seconds
75 [Ti] 47.4 seconds
76 O=S(=O)O 47.7 seconds
77 CCC(CC)CO 46.7 seconds
78 [Fe] 46.8 seconds
79 c1ccncc1 46.7 seconds
80 CCOCCOC 52.5 seconds
81 c(ccN)ccN 35.2 seconds
82 [Al] 34.5 seconds
83 CC(C)N 34.2 seconds
84 [Sn] 32.4 seconds
85 c(cCl)Cl 32.4 seconds
86 CSC 31.9 seconds
87 C=CC=C 30.9 seconds
88 c(ccO)ccO 30.3 seconds
89 [Mo] 30.0 seconds
90 s 30.0 seconds
91 C(Cl)Cl 29.3 seconds
92 [Ca] 28.7 seconds
93 CccN 28.3 seconds
94 C(CO)CO 27.3 seconds
95 [W] 26.7 seconds
96 O(P)P 26.8 seconds
97 [Cu] 27.0 seconds
98 CCC(=O)CC 27.5 seconds
99 CCCNCCC 26.8 seconds
100 CO 28.0 seconds
101 [Pb] 19.6 seconds
102 CCOC(=O)C 19.6 seconds
103 [Zn] 18.8 seconds
104 N=N 18.6 seconds
105 [Mg] 18.2 seconds
106 CC(C)CC(C)C 18.0 seconds
107 [Hg] 17.7 seconds
108 C=CCCO 17.2 seconds
109 [V] 17.1 seconds
110 [H] 17.0 seconds
111 cc(c)c(c)c 17.1 seconds
112 c(ccO)cO 16.8 seconds
113 [Zr] 16.5 seconds
114 CC(=O)N 16.5 seconds
115 c(ccCl)cCl 15.9 seconds
116 cc(c)cN 15.8 seconds
117 [Mn] 15.6 seconds
118 COP(=O)OC 15.5 seconds
119 Ccnc 15.4 seconds
120 [Ni] 15.7 seconds
121 CCCCCC=O 12.4 seconds
122 C1CC1 12.1 seconds
123 cOc 11.9 seconds
124 [Co] 11.7 seconds
125 c1cccc1 11.7 seconds
126 [Sb] 11.5 seconds
127 [Pt] 11.5 seconds
128 C(N)N 11.4 seconds
129 O=S 11.0 seconds
130 [Nb] 10.8 seconds
131 C(CCO)CCO 10.8 seconds
132 [U] 10.6 seconds
133 c(cBr)cBr 10.7 seconds
134 CCc 10.6 seconds
135 [Ba] 10.2 seconds
136 CC(C)OC(C)C 10.0 seconds
137 C[Si](C)C 10.1 seconds
138 ccNcc 9.8 seconds
139 [Ru] 9.5 seconds
140 C=CCC=C 9.7 seconds
141 CN(C)(C)C 7.5 seconds
142 SS 7.3 seconds
143 C(F)(F)F 7.2 seconds
144 [Re] 7.0 seconds
145 ccnccn 6.9 seconds
146 CC(C)(C)CCO 6.8 seconds
147 [Cd] 6.7 seconds
148 CCC(C)CC 6.7 seconds
149 CCSCC 6.5 seconds
150 [Ge] 6.5 seconds
151 Cc(c)c(c)c 6.4 seconds
152 [Sr] 6.3 seconds
153 C(COCCO)O 6.2 seconds
154 Ccc(C)cC 6.0 seconds
155 c(c(cCl)Cl)Cl 6.0 seconds
156 OO 6.0 seconds
157 ccnc=O 5.9 seconds
158 C(CCl)Cl 5.8 seconds
159 [Pd] 5.7 seconds
160 c(cO)N 5.8 seconds
161 [Ag] 4.7 seconds
162 C1CCCCCC1 4.7 seconds
163 [Ta] 4.6 seconds
164 [Ce] 4.6 seconds
165 C(CC=O)C=O 4.5 seconds
166 cccCl 4.4 seconds
167 CCC(CC)O 4.3 seconds
168 [Gd] 4.2 seconds
169 [Li] 4.2 seconds
170 cc(c)c(c)O 4.2 seconds
171 C=CCO 4.1 seconds
172 CP 4.1 seconds
173 C#N 4.0 seconds
174 COcc(c)OC 3.9 seconds
175 [Bi] 3.8 seconds
176 c(cN)cS 3.8 seconds
177 [Y] 3.8 seconds
178 CC(C)(C)C=O 3.7 seconds
179 c(ccBr)cBr 3.7 seconds
180 [Yb] 3.8 seconds
181 [In] 3.1 seconds
182 CC(C)CO 3.1 seconds
183 Cc(c)cccO 3.0 seconds
184 COCC(O)O 3.0 seconds
185 [La] 2.9 seconds
186 c(cN)N 2.9 seconds
187 [Eu] 2.9 seconds
188 CCC(CC)CC 2.8 seconds
189 [Ga] 2.8 seconds
190 cCCc 2.8 seconds
191 COC(CO)CO 2.7 seconds
192 [Th] 2.7 seconds
193 C=C(C)C 2.7 seconds
194 c(cF)cF 2.6 seconds
195 C[Si]C 2.6 seconds
196 CncccnC 2.6 seconds
197 cc(c)nc(c)c 2.6 seconds
198 C(CNCCN)N 2.5 seconds
199 [Tl] 2.5 seconds
200 C=C(C)C(C)(C)C 2.5 seconds
201 [Ho] 2.2 seconds
202 [Au] 2.2 seconds
203 CC(C)S 2.2 seconds
204 [Nd] 2.1 seconds
205 cc(cCl)c(c)Cl 2.1 seconds
206 CCC(CC)OC 2.1 seconds
207 c(ccS)ccS 2.1 seconds
208 C(=O)OC=O 2.0 seconds
209 [Be] 2.0 seconds
210 CC(C)c 2.0 seconds
211 [Tb] 2.0 seconds
212 COc 1.9 seconds
213 CCCCO 1.9 seconds
214 c(cBr)Br 1.9 seconds
215 CcccccC 1.8 seconds
216 C(COP)COP 1.8 seconds
217 C(CBr)Br 1.8 seconds
218 [Rh] 1.7 seconds
219 CCCCNC=O 1.7 seconds
220 C(CO)N 1.7 seconds
221 O[Si]O[Si]O 1.5 seconds
222 COcccOC 1.5 seconds
223 CC(C)CC=O 1.4 seconds
224 Cn(c)c(c)c 1.4 seconds
225 [Sm] 1.4 seconds
226 CCC(CO)CO 1.4 seconds
227 C(Cl)(Cl)Cl 1.4 seconds
228 CS 1.3 seconds
229 [Cs] 1.3 seconds
230 cc(c)c(cCl)Cl 1.3 seconds
231 CC1CCC(C)C1 1.3 seconds
232 cc(c)c(c)F 1.2 seconds
233 C(S)S 1.2 seconds
234 cnn 1.2 seconds
235 O=P(O)O 1.1 seconds
236 CCOCOCC 1.1 seconds
237 O=[As]O 1.1 seconds
238 CccccN 1.1 seconds
239 [Er] 1.0 seconds
240 C1CCNCC1 1.0 seconds
241 C(=O)CO 0.7 seconds
242 CC(=C(C)C)C 0.6 seconds
243 [Tm] 0.6 seconds
244 cc1ccncc1 0.6 seconds
245 N=O 0.6 seconds
246 C[Si](C)(C)[Si] 0.6 seconds
247 O=POPOP=O 0.6 seconds
248 CCCOC(C)C 0.6 seconds
249 Ccc(c(C)c)O 0.6 seconds
250 C(CC=O)CC=O 0.6 seconds
251 c(cN=O)cN=O 0.6 seconds
252 c(c(cBr)Br)Br 0.6 seconds
253 C(=CCO)CO 0.6 seconds
254 CCCCCCN 0.5 seconds
255 [Dy] 0.5 seconds
256 c(cCl)OccCl 0.5 seconds
257 CC(C)(F)F 0.5 seconds
258 c(cI)cI 0.5 seconds
259 Cc(c)ccCO 0.5 seconds
260 CCCCC(=O)O 0.5 seconds
261 [Ir] 0.4 seconds
262 c(ccCl)cN 0.4 seconds
263 cc(c)cccs 0.4 seconds
264 cc(c)oc(c)c 0.4 seconds
265 cc(c)c(c)Cl 0.4 seconds
266 COCCC(O)O 0.4 seconds
267 [Pu] 0.4 seconds
268 CCCCOC=O 0.4 seconds
269 cc(c)c(c)S 0.3 seconds
270 c(cn)cncN 0.3 seconds
271 C=CCCC=C 0.3 seconds
272 [Rb] 0.3 seconds
273 CNCCCNC 0.3 seconds
274 C1CC2CCC1C2 0.3 seconds
275 C(=O)ccC(=O)O 0.3 seconds
276 CCCSCCC 0.3 seconds
277 C(C=O)C=O 0.3 seconds
278 CCcccCC 0.3 seconds
279 p 0.3 seconds
280 COc(c)cc=O 0.3 seconds
281 cc1cc(c)sc1 0.2 seconds
282 [Pr] 0.2 seconds
283 CNCO 0.2 seconds
284 CCCC(C)C 0.2 seconds
285 CC[Sn](CC)CC 0.2 seconds
286 C(=O)NC=O 0.2 seconds
287 CC(cc(c)c)O 0.2 seconds
288 c(ccSN)cN 0.2 seconds
289 Ccccc(c)c 0.2 seconds
290 C(C=O)CO 0.2 seconds
291 CCP(CC)CC 0.2 seconds
292 [Hf] 0.2 seconds
293 CN(C)CccO 0.1 seconds
294 O([Si])[Si]O[Si]O[Si] 0.1 seconds
295 CCCCCOC 0.1 seconds
296 CCcccOC 0.1 seconds
297 C1CCC1 0.1 seconds
298 CC(C(C)(C)C)O 0.1 seconds
299 cc1cncc1c 0.1 seconds
300 c1cccccc1 0.1 seconds
301 CCCNcc=O 0.1 seconds
302 CSccc(c)c 0.1 seconds
303 [Os] 0.1 seconds
304 CC(C[Se]C)N 0.1 seconds
305 CCC(CC)Cl 0.1 seconds
306 CCC=CCC 0.1 seconds
307 C(=O)C=O 0.1 seconds
308 CCccCC 0.1 seconds
309 c(cc=O)c=O 0.1 seconds
310 CC(c)c 0.1 seconds
311 CCCOC1CC1 0.1 seconds
312 O=Cl(=O)O 0.1 seconds
313 c(cN(=O)O)cO 0.1 seconds
314 C(CCl)O 0.1 seconds
315 C(C(CO)(O)O)O 0.1 seconds
316 CCC(CC)Br 0.1 seconds
317 cc(cBr)cBr 0.1 seconds
318 C(ccccO)O 0.1 seconds
319 CCCCCCO 0.1 seconds
320 CCCN(C)P 0.1 seconds
321 COCcc(c)c 0.0 seconds
322 C[Si](C)([Si])[Si] 0.0 seconds
323 Cc(c)cc(c)S 0.0 seconds
324 C=C(C)CC(C)c 0.0 seconds
325 CBr 0.0 seconds
326 C(CC(Cl)Cl)CCl 0.0 seconds
327 C(S(=O)=O)S(=O)=O 0.0 seconds
328 C1=CCCCC1 0.0 seconds
329 C1CNCCN1 0.0 seconds
330 C=Cc(c)c(c)c 0.0 seconds
331 CC(c)N(C)Cc 0.0 seconds
332 CCC(=O)OCC 0.0 seconds
333 CcncccF 0.0 seconds
334 O(P)POPOP 0.0 seconds
335 c(c(cF)F)N 0.0 seconds
336 cc(c)cccI 0.0 seconds
337 C(C=O)COC=O 0.0 seconds
338 CCC(CC)N 0.0 seconds
339 CCCC(=O)S 0.0 seconds
340 CC[Pb](CC)CC 0.0 seconds
341 COP(OC)OC 0.0 seconds
342 C[Si](C)(C)C[Si] 0.0 seconds
343 Ccc(C=O)cC 0.0 seconds
344 cc1ccocc1 0.0 seconds
345 ccscc(c)c 0.0 seconds
346 CC(C)(C)COC 0.0 seconds
347 cc(c)cccN 0.0 seconds
348 cc(c)cccO 0.0 seconds
349 C(CO)COC=O 0.0 seconds
350 COc(c)cOc 0.0 seconds
351 [as] 0.0 seconds
352 c(ccCl)ccCl 0.0 seconds
&lt;/pre&gt;

This says that 353 bits are all that's needed in order to put 202,632
unique fingerprints into bins with at most about 10 unique
fingerprints in a given bin.

&lt;/P&gt;&lt;P&gt;

I think that's quite amazing given that most people use fingerprints
which are at least twice and usually at least four times that
length.

&lt;/P&gt;&lt;P&gt;

Also, it took 1.75 hours to generate this, and most of the time is
spent doing &lt;tt&gt;len(set1 &amp;amp; set)&lt;/tt&gt;.

&lt;/P&gt;
&lt;h2&gt;Is it useful?&lt;/h2&gt;
&lt;P&gt;

I really wish I had a good query set to use for tuning - and another
dataset to use for validation. I don't, but Greg Landrum made a
synthetic one for testing the RDKit/PostgreSQL cartridge. Last
December we worked on optimizing the fingerprint filters. I used an
earlier version of this program to generate 54 query bits, which he
fused with his hash-based fingerprint. &lt;a
href="http://www.mail-archive.com/rdkit-discuss@lists.sourceforge.net/msg02078.html"&gt;Greg
reported&lt;/a&gt;:


&lt;pre&gt;
Tacking your most recent 54 bits, interpreted as SMILES, onto the end
of a shorter version of my new fingerprint gives:
[06:10:40] INFO: FINISHED 50001 (41150823 total, 1325699 searched, 1118055 found) in 80.30
[06:10:40] INFO:   screenout: 0.03, accuracy: 0.84

The original performance for the shorter fingerprint was:
[05:04:23] INFO: FINISHED 50001 (41150823 total, 1693315 searched, 1118055 found) in 90.90
[05:04:23] INFO:   screenout: 0.04, accuracy: 0.66

so those bits definitely help with the screenout.

Putting the bits at the beginning of the fingerprint gets a bit more speed:
[06:14:48] INFO: FINISHED 50001 (41150823 total, 1325699 searched, 1118055 found) in 71.09
[06:14:48] INFO:   screenout: 0.03, accuracy: 0.84
&lt;/pre&gt;

I think a 20% speedup is nothing to sneeze at.

&lt;/P&gt;
&lt;h2&gt;I would still like an inverted index library&lt;/h2&gt;
&lt;P&gt;

The greedy algorithm does a lot of set intersections, partitions sets
into subsets, and removes old sets. I feel like this is not what
databases are designed for, That's why I'm looking for an inverted
index library which lets me manage boolean set operations myself. Then
again I'm not a database person and I don't know if my belief is true.

&lt;/P&gt;&lt;P&gt;

I ran my greedy feature selection algorithm until it ran out of data,
at only 353 bits. I want to have 1024 substructure bits. This means I
need to feed my algorithm a lot more data. That's fine - my test set
is only 1/200th of the data I have. However, my implementation running
the test set takes up 2 GB of memory on my 12 GB desktop machine, and
took 6300 seconds (1h45m) to run.

&lt;/P&gt;&lt;P&gt;

My concern is that a distributed memory implementation of my algorithm
is more complex to write than a shared memory one. The Amazon
high-memory instances have 68 GB of RAM. I estimate I need 400 GB with
Python sets, and under 40 GB with a more compact bitset implemented in
C. This seems eminently doable, which is why I'm looking to see if
someone has already done the work.

&lt;/P&gt;&lt;P&gt;

Based on my searching, and the lack of pointers by others, the answer
is "no."

&lt;/P&gt;

</description><guid isPermaLink="true">http://www.dalkescientific.com/writings/diary/archive/2012/06/11/optimizing_substructure_keys.html</guid><pubDate>Mon, 11 Jun 2012 12:00:00 GMT</pubDate></item><item><title>In search of an inverted index library</title><link>http://www.dalkescientific.com/writings/diary/archive/2012/06/10/inverted_index_library.html</link><description>&lt;P&gt;

I'm looking for a Python library for managing inverted indices. I've
been using Python sets, which are nice and fast, but they take up too
much memory. Lists use 1/10th the space but are a lot slower and
still not compact enough. The most likely solution so far is
Xapian. Are there other options I should consider? If so, 
&lt;a
href="mailto:dalke@dalkescientific.com"&gt;email me&lt;/a&gt; or comment on the
&lt;a
href="http://www.reddit.com/r/Python/comments/umkhr/inverted_index_library_for_python/"&gt;Reddit
thread&lt;/a&gt; I started on this topic.

&lt;/P&gt;&lt;P&gt;

Let explain what I want and what I've done so far.

&lt;/P&gt;
&lt;h2&gt;Sparse feature fingerprints&lt;/h2&gt;
&lt;P&gt;

I'm working with chemical fingerprints. That's not unusual for
me. What's new is that these fingerprints are not fixed length but are
of indefinite length. For each molecule I find all fragments with no
more than 7 atoms, and put the fragment into a canonical
representation, here a canonical fragment SMILES. Each unique
representation gets its own unique integer identifier. The end result
is a fingerprint which looks like:

&lt;pre class="code"&gt;
0:4,1:3,5:2,18:2,28:2,124,47298,53119 254846
&lt;/pre&gt;

The original SMILES for this is &lt;tt&gt;CCO.CCO.O=[V].Cl&lt;/tt&gt;. I picked
this structure because it has a very small fingerprint.

&lt;/P&gt;&lt;P&gt;

This fingerprint line says that PubChem record 254846 has 8
substructures ("features"). The features are stored along with a count
of the number of times it's found. Here, features '5', '18', and '28'
are found in two difference places in the molecule, while features
'124', '47298', and '53119' are found only once. Also, the feature
field and the identifier field are separated by a tab.

&lt;/P&gt;&lt;P&gt;

I kept track of the original canonical description. Here's the
breakdown:

&lt;pre class="code"&gt;
        canonical
 bit#  description
 ----  -----------
    0    C
    1    O
    5    CC
   18    CO
   28    CCO
  124    Cl
47298    O=[V]
53119    [V]
&lt;/pre&gt;

With a bit of back and forth you can verify that it's correct.

&lt;/P&gt;&lt;P&gt;

Overall I have about 25 million records and 2 million canonical
values, using a subset of PubChem's 35 million compounds. The average
fingerprint contains about 300-400 features, though value ranges from
1 to about 2,000 features. These are "sparse" fingerprints because
because the number of features per record divided by the number of
features total is very small (350/2,000,000) - small enough that it's
more efficient to store the list of feature bits (0, 1, 5, etc.) than
it is to store the bit pattern of 0s and 1s.

&lt;/P&gt;&lt;P&gt;

The features are very long-tail. Carbon ("C") exists in nearly every
record, in both aromatic and non-aromatic forms. "O", "cc", "ccc", and
"CC" are also very popular. But about 1/2 of the features only exist
in one record. Here's &lt;a href="http://dalkescientific.com/writings/diary/archive/2011/12/25/unique_fragments_in_pubchem.html"&gt;more on the fragment distribution pattern&lt;/a&gt;.

&lt;/P&gt;
&lt;h2&gt;Why do I want an inverted index?&lt;/h2&gt;
&lt;P&gt;

Feature fingerprints are old news in cheminformatics. One application
is to improve substructure search performance. Suppose you are
searching for a phenol (structure "c1ccccc1O"). You could do a
substructure isomorphism test with every single record, but that's
slow. Instead, there are some obvious improvements.

&lt;/P&gt;&lt;P&gt;

You know that if a record matches then it must have at least 7 atoms
and 7 bonds, at least 6 carbons, and at least 1 oxygen. It's easy to
precompute this information, so the set labeled "at least 6 carbons"
contains the set of all records with at least 6 carbons, and so
on. When a query comes in, analyze the structure based on those
predetermined features. This ends up with a list of sets, one for each
feature. Find the intersection of those sets, and you've reduced your
search space, potentially by a lot.

&lt;/P&gt;&lt;P&gt;

This is exactly when one should use an inverted index.

&lt;/P&gt;&lt;P&gt;

(There are other optimizations. If the query exists already as one of
the precomputed sets then there's no need to any additional
substructure search.)

&lt;/P&gt;
&lt;h2&gt;Feature "folding"&lt;/h2&gt;
&lt;P&gt;

Most cheminformatics systems "fold" their sparse bits into a fixed bit
length, like 1024 bits. For example, if a feature is assigned the
canonical id of "458793484" then some programs will apply modulo 1024
to that to get the folded bit id "524". This saves a lot of space, at
the cost of extra subgraph isomorphism search time.

&lt;/P&gt;&lt;P&gt;

I've been wondering, do I really need to fold the bits? Why can't I
keep all of the information around? My thoughts are informed by &lt;a
href="http://ww2.cs.mu.oz.au/mg/"&gt;Managing Gigabytes&lt;/a&gt;, which points
out that there are efficient ways to encode the inverted index. Can I
apply those techniques to this problem?

&lt;/P&gt;
&lt;h2&gt;Working code&lt;/h2&gt;
&lt;P&gt;

The first step is to establish that there is a problem. I'll use a
subset of 10 files, which represents about 0.5% of my actual data set,
and write some code to handle it. &lt;a href="http://dalkescientific.com/writings/diary/inverted_index.py"&gt;Here's the final code&lt;/a&gt; and here's the code with &lt;a href="http://dalkescientific.com/inverted_index_benchmark.7z"&gt;the benchmark data set&lt;/a&gt; (37 MB, compressed with 7zip).

&lt;/P&gt;&lt;P&gt;

The simplest inverted index implementation uses Python sets.

&lt;pre class="code"&gt;
from collections import defaultdict

class InvertedIndex(object):
    def __init__(self):
        # Map from a feature string like "ccO" to an integer feature id
        self.feature_to_id = {}
        # Map from an integer feature id to a set of record ids
        self.inverted_indices = defaultdict(set)

    def get_feature_id(self, feature):
        try:
            return self.feature_to_id[feature]
        except KeyError:
            n = len(self.feature_to_id)
            self.feature_to_id[feature] = n
            return n

    def add_record(self, id, feature_ids):
        for feature_id in feature_ids:
            self.inverted_indices[feature_id].add(id)

    def search(self, features):
        # These are *features*, not *feature_ids*.
        # If the feature wasn't seen before then there are no entries
        # and this will raise a KeyError, meaning nothing found.
        try:
            terms = [self.inverted_indices[self.feature_to_id[feature]] for feature in features]
        except KeyError:
            return set()
        terms.sort(key=len)
        return set.intersection(*terms)
&lt;/pre&gt;

Very simple, and it works pretty well. There's a bit of API trickiness
to get the feature id for a given feature. This simplifies the loader
and handles integer interning. (Interning is outside of the discussion
of this essay.)

&lt;/P&gt;&lt;P&gt;

Second is the code parse the data files and load the index. The file
format looks like:

&lt;pre class="code"&gt;
#FPC1
#bit-0=13120 C
#bit-1=12986 c
#bit-2=12978 cc
 ...
#bit-46924=1 COCNCP=O
0:6,1:21,2:21,3:22,4:23,5:2,6,...,36858,36869,37247,37429:4,37431,39942   26350001
0:6,1:21,2:21,3:22,4:23,5:2,6,...,36858,36869,37247,37429:4,37431,39942   26350002
 ...
0:5,1:12,2:12,3:12,4:12,5:3,6:3,...,25998,32031,32086:2,33182,33998,38357    26374982
&lt;/pre&gt;

There's a version line, some header lines starting "#bit-", and the
fingerprint lines.  The fingerprint line uses the sparse fingerprint I
mentioned earlier, followed by a tab, followed by the record
identifier. (Here it's a PubChem id, which is always an integer.)

&lt;/P&gt;&lt;P&gt;

Note: the bit definitions are not the same across different files. The
header lines show that bit 0 corresponds to the feature "C" for this
file (and it exists in 13120 records), and bit 1 is "c" (in 12986
records), but in another file I see that bit 0 is the feature "cc" and
in a third it's "ccc(c)CCN". So the parser needs to be smart and map
the per-file bit numbers to its own internal fragment identifier.

&lt;/P&gt;&lt;P&gt;

The details of the parser aren't important to this essay, and left to
&lt;a href="http://dalkescientific.com/writings/diary/inverted_index.py"&gt;the implementation&lt;/a&gt;.

&lt;/P&gt;
&lt;h3&gt;Benchmark&lt;/h3&gt;
&lt;P&gt;

I need a benchmark. I'll load
&lt;a href="http://dalkescientific.com/inverted_index_benchmark.7z"&gt;10 data files&lt;/a&gt;
(links to the 37 MB benchmark file) to make the inverted indices. I
also need some query data. I reused the parser to extract a subset
(1%) of the records in a fingerprint file. Actually, two fingerprint
files. One timing set reuses one of the data files used to build the
invertex indices, which means that everything query will match, while
the other uses a new data file, which should have many fewer matches.

&lt;P&gt;&lt;P&gt;

Again, see &lt;a
href="http://dalkescientific.com/writings/diary/inverted_index.py"&gt;the
code&lt;/a&gt; for full details.

&lt;/P&gt;&lt;P&gt;

When I run the benchmark I get:

&lt;pre class="code"&gt;
== Sets ==
Load 0 Compound_000000001_000025000.counts.fpc
Load 1 Compound_000025001_000050000.counts.fpc
Load 2 Compound_000050001_000075000.counts.fpc
Load 3 Compound_000075001_000100000.counts.fpc
Load 4 Compound_000100001_000125000.counts.fpc
Load 5 Compound_000125001_000150000.counts.fpc
Load 6 Compound_000150001_000175000.counts.fpc
Load 7 Compound_000175001_000200000.counts.fpc
Load 8 Compound_000200001_000225000.counts.fpc
Load 9 Compound_000225001_000250000.counts.fpc
Loading took 63.0 seconds and 2092896256 memory
100 66575 315.0 per second
200 100909 337.5 per second
all exist: 227 searches took 0.7 seconds (counts 114990) 
100 2323 1161.5 per second
200 3865 1360.9 per second
some exist: 230 searches took 0.2 seconds (counts 4048) 
&lt;/pre&gt;

Some of this is progress information. The important parts are the load
time (63.0) seconds, the total amount of memory used in the index (2.1
GB), and the search times of 0.7 and 0.2 seconds for the two test
sets. (I use '&lt;a href="http://code.google.com/p/psutil/"&gt;psutil&lt;/a&gt;'
to get the memory usage; note that the exact meaning is OS dependent.)
Also, the match counts of "114990" and "4048" are useful checks that
each inverted index implementation is correct.

&lt;/P&gt;&lt;P&gt;

This output shows the problem. I'm using 2.1 GB of memory for 10 data
files. I only have 12 GB on this desktop machine, and remember, the
full data set is 200 times bigger. Do you have a 500 GB machine I can
use?

&lt;/P&gt;
&lt;h2&gt;Use an array instead of a set&lt;/h2&gt;
&lt;P&gt;

Python objects have overhead. For example, the Python integer 300 on
my desktop requires 24 bytes, while I know I have about 2**25 records
so could easily get away with 4 bytes.

&lt;pre class="code"&gt;
&amp;gt;&amp;gt;&amp;gt; import sys
&amp;gt;&amp;gt;&amp;gt; sys.getsizeof(3)
24
&lt;/pre&gt;

The other 20 bytes handle things like a reference to the Python
integer class object and the reference count.

&lt;/P&gt;&lt;P&gt;

The Python set data type uses a lot of space because it's implemented
as a hash, and many of the internal slots are empty in order to
improve lookup performance. Quick testing shows that a set uses about
30-40 bytes for each element in the set.

&lt;P&gt;&lt;P&gt;

It's not quite as bad as what this because the set elements contain
Python references, which are 8 byte pointers on my 64-bit Python. But
that's still about 42 bytes of space used to store at most 4 bytes of
data. I should be able to do better.

&lt;/P&gt;&lt;P&gt;

The usual way "to do better" in Python, when storing a list of C-like
integers is to use an
'&lt;a href="http://docs.python.org/library/array.html"&gt;array.array&lt;/a&gt;.'
This is similar to a Python list, except that the contents are all of
the same type, so there's no per-item overhead to store the type
information. (Note: pje described this approach on
&lt;a href="http://www.reddit.com/r/Python/comments/umkhr/inverted_index_library_for_python/c4wqy7e"&gt;the Reddit thread&lt;/a&gt;.

&lt;/P&gt;&lt;P&gt;

The downside is that there's no equivalent to a set.intersection. For
now, I'll convert an array into a set, and do the normal set
intersection. Here's the code:

&lt;pre class="code"&gt;
# Make sure I can handle 4 byte integers
import array
_test = array.array("I", (0,))
try:
    _test[0] = 2**31+100
    array_code = "I"
except OverflowError:
    array_code = "L"

def make_unsigned_array():
    return array.array(array_code, ())

class InvertedIndexArray(object):
    def __init__(self):
        # Map from a feature string like "ccO" to an integer feature id
        self.feature_to_id = {}
        # Map from an integer feature id to an set of record ids
        self.inverted_indices = defaultdict(make_unsigned_array)


    def get_feature_id(self, feature):
        try:
            return self.feature_to_id[feature]
        except KeyError:
            n = len(self.feature_to_id)
            self.feature_to_id[feature] = n
            return n

    def add_record(self, id, feature_ids):
        for feature_id in feature_ids:
            self.inverted_indices[feature_id].append(id)

    def search(self, features):
        assert features
        try:
            terms = [self.inverted_indices[self.feature_to_id[feature]] for feature in features]
        except KeyError:
            return set()
        terms.sort(key=len)
        # Set conversion is expensive; it's faster to pass an iteratable to intersection_update.
        result = set(terms[0])
        for term in terms[1:]:
            result.intersection_update(term)
            # Short-circuit the evalution when the result set is empty.
            # Gives a 3x speedup for the "some exists" benchmark, with no
            # performance hit for the "all exist" benchmark.
            if not result:
                break
        return result
&lt;/pre&gt;

&lt;/P&gt;&lt;P&gt;

How well does it work? Here's the benchmark output using the
array-based inverted index implementation, trimmed somewhat to make it
easier to read:

&lt;pre class="code"&gt;
== Sets ==
Loading took 63.0 seconds and 2092896256 memory
all exist: 227 searches took 0.7 seconds (counts 114990) 
some exist: 230 searches took 0.2 seconds (counts 4048) 
== Arrays ==
Loading took 51.2 seconds and 165552128 memory
all exist: 227 searches took 41.7 seconds (counts 114990) 
some exist: 230 searches took 16.1 seconds (counts 4048) 
&lt;/pre&gt;

The new version takes less 10% of the memory as the old, which is
about what I predicted. It's nice to see the confirmation. It's also a
bit faster to build the data structure. But on the other hand, search
is over 60x slower. At 0.1 seconds per query, this performance is not
acceptable the types of highly interactive searches I want to do.

&lt;/P&gt;&lt;P&gt;

There are of course hybrid solutions. This version spends a lot of
time to convert an array to a set, which is then discarded. I convert
the 'C', 'c' and a few other inverted indices a lot, so what I could
do is use an LRU cache for the top 1,000 or so sets.

&lt;/P&gt;&lt;P&gt;

But let's get back to my problem. My data set is 200x bigger than the
test set, so simple scaling of 166MB says I'll need 33 GB of RAM in
the best of cases. I really would like this to work on my desktop,
instead of renting some Amazon compute node every time I want to try
things out.

&lt;/P&gt;
&lt;h2&gt;Stop the presses! pypy 1.9 just came out&lt;/h2&gt;
&lt;P&gt;

The &lt;a href="http://morepypy.blogspot.se/2012/06/pypy-19-yard-wolf.html"&gt;pypy-1.9 release notice&lt;/a&gt; says:

&lt;blockquote&gt;
&lt;ul&gt;
 &lt;li&gt;Sets now have strategies just like dictionaries. This means for example
    that a set containing only ints will be more compact (and faster).&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;

This is good because in previous benchmarking I found that pypy sets
were rather slower than cpython's own sets.

&lt;/P&gt;&lt;P&gt;

I re-ran the benchmark using pypy 1.7 and 1.9. Here are the results,
first for 1.7:

&lt;pre class="code"&gt;
== Sets ==
Loading took 278.6s seconds and (psutil not installed) memory
all exist: 227 searches took 1.0 seconds (counts 114990)
some exist: 230 searches took 0.2 seconds (counts 4048)
== Arrays ==
Loading took 32.7s seconds and (psutil not installed) memory
all exist: 227 searches took 299.6 seconds (counts 114990)
some exist: 230 searches took 335.6 seconds (counts 4048)
&lt;/pre&gt;

and then for 1.9:

&lt;pre class="code"&gt;
== Sets ==
Loading took 47.9s seconds and (psutil not installed) memory
all exist: 227 searches took 0.5 seconds (counts 114990)
some exist: 230 searches took 0.1 seconds (counts 4048)
== Arrays ==
Loading took 35.4s seconds and (psutil not installed) memory
all exist: 227 searches took 186.2 seconds (counts 114990)
some exist: 230 searches took 72.9 seconds (counts 4048)
&lt;/pre&gt;

The good news is that 1.9 set operations are faster &amp;mdash; 5x faster
to load, and 2x faster to search. In fact, the search time is faster
than cpython's!  The bad news is it looks like pypy's
intersection_update isn't yet as well optimized. Also, manual
inspection shows that pypy 1.9 uses 2.1 GB of memory, which is about
the same as cpython's. pypy on its own isn't the right solution.

&lt;/P&gt;
&lt;h2&gt;Is there an inverted index library for Python?&lt;/h2&gt;
&lt;P&gt;

Inverted indices are used in all sort of search engines, and there are
many well-known ways to improve intersection performance and decrease
memory use. Some of the ways I know of storing bit sets are:

&lt;ul&gt;
 &lt;li&gt;&lt;a href="https://sdm.lbl.gov/fastbit/"&gt;FastBit: An Efficient Compressed Bitmap Index Technology&lt;/a&gt;&lt;/li&gt;
 &lt;li&gt;&lt;a href="http://sparsehash.googlecode.com/svn/trunk/doc/index.html"&gt;sparse and dense hash sets&lt;/a&gt; from
the &lt;a href="http://code.google.com/p/sparsehash/"&gt;Google sparsehash&lt;/a&gt; library&lt;/li&gt;
 &lt;li&gt;&lt;a href="http://judy.sourceforge.net/doc/Judy1_3x.htm"&gt;Judy1&lt;/a&gt; arrays from &lt;a href="http://judy.sourceforge.net/"&gt;Judy&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

However, I haven't found good Python bindings for these. (I wrote some
Python/Judy bindings years ago. Unfortunately, it's buggy, and I would
have to start from scratch to get it working again.)

&lt;/P&gt;&lt;P&gt;

Another possibility is to store the 'on' bits as a sorted list. Sorted
lists are nice because the intesection corresponds to a merge sort, in
strictly linear time. Other methods are described in
&lt;a href="http://www.bradblock.com/Experimental_Analysis_of_a_Fast_Intersection_Algorithm_for_Sorted_Sequences.pdf"&gt;Experimental Analysis of a Fast Intersection Algorithm for Sorted Sequences&lt;/a&gt;.
There was also a recent posting on how to take advantage of
&lt;a href="http://highlyscalable.wordpress.com/2012/06/05/fast-intersection-sorted-lists-sse/"&gt;SSE instructions for fast sorted list intersection&lt;/a&gt;.
Since the fields are sorted, it's also possible to compress the data by, for example,
storing deltas using something like
&lt;a href="http://en.wikipedia.org/wiki/Elias_gamma_coding"&gt;Elias gamma coding&lt;/a&gt;
or perhaps an encoder tuned to the data set.

&lt;/P&gt;&lt;P&gt;

Here too I've not been able to find an existing library which handled
this for me, much less one with Python bindings.

&lt;/P&gt;
&lt;h2&gt;Search engine components&lt;/h2&gt;
&lt;P&gt;

Actually, I'm wrong. These are available, as part of a larger search
engine package like Lucene and &lt;a
href="http://xapian.org/"&gt;Xapian&lt;/a&gt;.  Lucene is very well known, but
I would have to mix the Java runtime and the chemistry code that I'm
using from C++. Doable, but it feels like too much weight.

&lt;/P&gt;&lt;P&gt;

Instead, I downloaded, compiled, and installed Xapian and its Python
bindings and wrote an inverted index based on its API. (Justinsaccount was the 
first to suggest this in
&lt;a href="http://www.reddit.com/r/Python/comments/umkhr/inverted_index_library_for_python/c4womws"&gt;the Reddit thread&lt;/a&gt;.)

&lt;/P&gt;&lt;P&gt;

Xapian uses the
filesystem to store the database, so this version uses a lot less
local memory than the previous two inverted indices. I wonder how much
of the data set is stored in memory vs. how much is read from disk. I
imagine there are tuning variables, and it looks like there's an
in-memory option I could also do. I am quite the newbie with this
library.

&lt;/P&gt;&lt;P&gt;

Here's my implementation:


&lt;pre class="code"&gt;
try:
    import xapian
except ImportError:
    xapian = None

class InvertedIndexXapian(object):
    def __init__(self, dbname):
        self.db = xapian.WritableDatabase(dbname, xapian.DB_CREATE_OR_OPEN)
        
    def get_feature_id(self, feature):
        return feature
    
    def add_record(self, id, feature_ids):
        try:
            doc = self.db.get_document(id)
        except xapian.DocNotFoundError:
            doc = xapian.Document()
        for feature_id in feature_ids:
            doc.add_boolean_term(feature_id)
        self.db.replace_document(id, doc)

    def search(self, features):
        enquire = xapian.Enquire(self.db)
        query = xapian.Query(xapian.Query.OP_AND, features)
        enquire.set_query(query)
        matches = enquire.get_mset(0, 200000)
        return matches
&lt;/pre&gt;

and when I run the benchmark I get this timing information:


&lt;pre class="code"&gt;
Loading took 349.9 seconds and 46465024 memory
all exist: 227 searches took 15.7 seconds (counts 114990) 
some exist: 230 searches took 20.7 seconds (counts 4048) 
&lt;/pre&gt;

Queries are about 3-12x slower than the cpython implementation, and
5-20x slower than pypy. But on the other hand, this uses only 90 MB of
RAM, although it does build a 340 MB data file. Simple scaling
suggests that I'll need 65 GB of disk space for the entire data
set. That's quite doable.

&lt;/P&gt;&lt;P&gt;

The Xapian load time is a lot slower, but as the data is persistent,
reload time from a built database is trivially fast. The documentation
says I can set XAPIAN_FLUSH_THRESHOLD to a larger value for better
indexing performance. Are there other tuning variables I should
consider?

&lt;/P&gt;&lt;P&gt;

Xapian might be the way to go. &lt;a
href="http://xapian.org/docs/valueranges.html"&gt;Value ranges&lt;/a&gt; might
be useful when I think about how to handle fragment counts. I'll need
to do more evaluation. For example, I don't like that it's an order of
magnitude slower than Python sets. After all, Python sets aren't
optimized for this task, so I expected faster search times.

&lt;/P&gt;&lt;P&gt;

Sadly, there isn't much documentation about Xapian - and nowhere near
like what there is for Lucene. (For example, there appears to be no
page which describes the tuning environment variables. There's
XAPIAN_MAX_CHANGESETS and XAPIAN_FLUSH_THRESHOLD .. are these or
something else relevant for what I want?)

&lt;/P&gt;
&lt;h2&gt;I still want a good inverted index library for Python&lt;/h2&gt;
&lt;P&gt;

Xapian might work well for this project, but there are other projects
I work on which use inverted indicies.
In one project I try to
&lt;a href="http://dalkescientific.com/writings/diary/archive/2012/06/11/optimizing_substructure_keys.html"&gt;optimize the ~1024 fixed patterns used in a traditional feature-based
substructure screen&lt;/a&gt;. Both my greedy algorithm and my genetic algorithm
for this used a lot of temporary intersections to evaluate a given
selection. I'm not sure how well it would work on top of a persistent
database.

&lt;/P&gt;&lt;P&gt;

If you know of a good library, please &lt;a
href="mailto:dalke@dalkescientific.com"&gt;email me&lt;/a&gt; or comment on the
&lt;a
href="http://www.reddit.com/r/Python/comments/umkhr/inverted_index_library_for_python/"&gt;Reddit
thread&lt;/a&gt; I started on the topic.

&lt;/P&gt;

</description><guid isPermaLink="true">http://www.dalkescientific.com/writings/diary/archive/2012/06/10/inverted_index_library.html</guid><pubDate>Sun, 10 Jun 2012 12:00:00 GMT</pubDate></item><item><title>Testing hard algorithms</title><link>http://www.dalkescientific.com/writings/diary/archive/2012/05/21/testing_hard_algorithms.html</link><description>&lt;P&gt;

I developed software to find a maximum common subgraph (MCS) given a
set of molecules represented as a chemical graph. It's called &lt;a
href="https://bitbucket.org/dalke/fmcs"&gt;fmcs&lt;/a&gt;. My previous three
essays were about the &lt;a
href="http://dalkescientific.com/writings/diary/archive/2012/05/12/mcs_background.html"&gt;background&lt;/a&gt;
of the MCS problem, &lt;a href="http://dalkescientific.com/writings/diary/archive/2012/05/13/fmcs.html"&gt;introducing fmcs&lt;/a&gt;, and &lt;a href="http://dalkescientific.com/writings/diary/archive/2012/05/13/mcs_chebi.html"&gt;an example of when MCS is used&lt;/a&gt;.

&lt;/P&gt;&lt;P&gt;

What I didn't describe was the mental effort it took to develop this
program. This is the second time I've written code to find the
multiple structure MCS, and both times it took a couple of months and
put my head in a very strange place. You would think the second time
is easier, but it means that I spent more time adding features and
doing things that my first version couldn't begin to handle. (Did
someone just mutter "second system effect"? Pshaw!)

&lt;/P&gt;&lt;P&gt;

This time too I had a better understanding of the development process.
I think I know why it's so much harder than most of the software I
develop.

&lt;/P&gt;&lt;P&gt;

In this essay, I reflect on some of the reasons why this is a hard
problem to test and I consider how unit tests, or any other
incremental-based testing approach, are not well-suited to a certain
class of complex algorithm development. While unit tests can provide a
basic sanity check during development, they fail to provide the test
coverage which one might expect from applying them to algorithmically
simpler modules. Further, development methodologies built primarily on
unit tests, like test-first style test-driven development, aren't that
applicable. Instead, other methods, like system testing and a deep
understanding of the algorithm and possible problems, are
required. This strengthens my view, explored in 
&lt;a href="http://dalkescientific.com/writings/diary/archive/2009/12/29/problems_with_tdd.html"&gt;Problems with TDD&lt;/a&gt;
that "TDD is a weak method for developing those tests of confidence."


&lt;/P&gt;
&lt;h3&gt;How would you implement an MCS algorithm?&lt;/h3&gt;
&lt;P&gt;

Think about the problem for a moment. How would you write an algorithm
to find the largest common subgraph between two graphs?

&lt;/P&gt;&lt;P&gt;

Take your time. This really is a hard problem. I dug up some of the
early papers from the Journal of Computer Documentation. People ended
up simplifying the problem by, for example, not supporting rings.

&lt;/P&gt;&lt;P&gt;

Finished thinking?

&lt;/P&gt;&lt;P&gt;

You probably came up with a graph walking algorithm which tries
different match pairings and uses backtracking or lazy evaluation to
search all of graph space. The more mathemtically inlined might have
converted this into a maximum clique algorithm.

&lt;/P&gt;&lt;P&gt;

In both cases there are a lot of arbitrary decisions to make. Which
pairings should you investigate first? Are there times when you cann
prune the search space? Did you make any assumptions about the nature
of the chemical graph (eg, that it's &lt;a href="http://dalkescientific.com/writings/diary/archive/2012/05/18/nonplanar_compounds.html"&gt;topologically a planar graph&lt;/a&gt;)?

&lt;/P&gt;
&lt;h3&gt;How would you test your MCS algorithm?&lt;/h3&gt;
&lt;P&gt;

Back in the late 1990s, I almost never wrote automated tests,
and never wrote an extensive test suite. Nowadays I'm pretty thorough,
and use coverage-guided techniques to get good, extensive tests
through some semi-permanent API layer that will allow me to refactor
the internals without changing the tests.

&lt;/P&gt;&lt;P&gt;

I couldn't do that here.

&lt;/P&gt;&lt;P&gt;

There are a lot of heuristics, and some of them are only triggered
under unusual circumstances, after a bunch of combinitorial
possibilities. Outside of a few minor components, I couldn't figure
out how to write unit tests for the code.

&lt;/P&gt;&lt;P&gt;

I ended up pushing most of the testing into validation testing of the
complete system, which meant I wrote some 1,000+ lines of code
without strong testing. Moreover, I used a new approach to the MCS
problem, so the algorithm I was working on doesn't have the track
record of the standard clique-based or backtracking approaches.


&lt;/P&gt;&lt;P&gt;

So stress factors included not knowing if the algorithm would work,
and not being able to develop enough test cases to provide good
validation during development.

&lt;/P&gt;&lt;P&gt;

As a rule of thumb, it's easiest to fix bugs which are caught
early. Unit tests and evolutionary prototyping are two of the
techniques that people use to tighten the feedback loop between
specification, implementation, and valiation. I think another stress
factor is propotional to the size of the feedback loop.

&lt;/P&gt;
&lt;h2&gt;What testing did I do?&lt;/h2&gt;
&lt;P&gt;

I mean, I did have tests during development. I came up with a few
examples by hand, I did a substructure search of a large data set and
I verified that the MCS code found that substructure, but I know
that's not enough tests. I know this because after six weeks of
development and over 1000 lines of code, I spent another three weeks
doing a large amount of post-development testing, and found several
bugs. In the process of writing this essay I also found that four days
of that development work ended up making things slower, so I'll have
to remove it.

&lt;/P&gt;&lt;P&gt;

I did most of my tests based on the &lt;a
href="https://www.ebi.ac.uk/chembldb/"&gt;ChEMBL&lt;/a&gt; data: 10,000 random
pairs of structures, the k=2, k=10, and k=100 nearest neighbors, and
the k&amp;lt;=100 neighbors with similarities of at least 0.95, 0.9, and
0.8. I also did, at the end, tests based on the ChEBI structure
ontology. There were easily 20,000 different machine-generated test
cases, although in most cases I didn't know the expected results
beforehand.

&lt;/P&gt;
&lt;h2&gt;What bugs did I find?&lt;/h2&gt;
&lt;P&gt;

What bugs did I find? I think it's educational to characterize a few of them.

&lt;/P&gt;
&lt;h3&gt;Typo caught by an assertion check&lt;/h3&gt;
&lt;P&gt;

One of the simplest bugs was a poorly formatted string. I used "%d"
when I should have used ""%%%d". A bit of jargon for those in the
know; I generate SMARTS strings for the intermedate subgraphs. If
there are more than 9 open rings then the syntax goes from a single
digit closure number to a closure number like "%10". I forgot to
include the '%' for that case, and probably because the '%' was
already there for the format string.

&lt;/P&gt;&lt;P&gt;

This wasn't triggered by my random-pairs test nor my various
similarity-search based tests. Only when I ran through the ChEBI data,
did I get an assertion failure when RDKit refused to accept my SMARTS
string. That was the first time where I had a SMARTS with 10 unclosed
rings.

&lt;/P&gt;&lt;P&gt;

As it happens, this error could have been caught by a unit testing, as
some people practice unit tests. It's a four line function which takes
a string and a number. Testing it is trivial. I didn't test it because
I feel that testing directly against internal functions inhibits
refactoring. I prefer to test against higher-level, more stable APIs.

&lt;/P&gt;&lt;P&gt;

My view is that it's usually easy to come up with high-level test
cases which trigger a certain function call. But not in this case. The
MCS search algorithm, while deterministic, uses so many arbitrarily
defined values that I couldn't on my own come up with a test case with
at least 10 open ring closures. And even if I did, a new search
heuristic might change things so that only, say, 7 open ring closures
were needed.

&lt;/P&gt;&lt;P&gt;

I felt that my system testing and the assertion check would be
enough to identify if there was a problem, and it did. A low-level
unit test might have helped, especially as I still don't have a
specific test for that failure.

&lt;/P&gt;&lt;P&gt;

I think the right thing to do is add that failure as a specific test
case, and use monkey-patching to insert wrong code for a repeat of
that test case. The first one tests that the code is correct, and the
second tests that the test case is still exercising the code under
test.

&lt;/P&gt;
&lt;h3&gt;Cross-validation testing&lt;/h3&gt;
&lt;P&gt;

The first MCS papers are about as old as I am. Many people have
written implementions, although relatively few are are available to me
both at no cost and with no prohibitions on using it to develop a new
MCS algorithm. (Some commercial companies don't like people using
their software to write new software which is competitive to it, or
even to use their software for benchmark comparisons.)

&lt;/P&gt;&lt;P&gt;

I tested pairs of structures against &lt;a
href="http://www.ebi.ac.uk/thornton-srv/software/SMSD/"&gt;SMSD&lt;/a&gt; and I
did more extenstive tests against &lt;a
href="http://ggasoftware.com/opensource/indigo"&gt;Indigo's&lt;/a&gt; MCS
implementation.

&lt;/P&gt;&lt;P&gt;

This is cross-validation testing. It's a relatively rare technique
because the cost of producing multiple identical implementations
usually isn't worth the benefits. Even here the results aren't exactly
identical because of differences in how the toolkits perceive
chemistry, and more specifically, aromaticity. I ended up spending a
lot of investigation time staring at cases with different answers and
trying to figure out if it was a chemistry problem or an MCS algorithm
implementation problem.

&lt;/P&gt;&lt;P&gt;

I found the SMSD had a bug in one of its options, which I reported
back to the author. The code had been fixed internally but not pushed
to the outside world. Its default mode and fmcs matched quite well,
except for a couple of chemistry differences. The new version is out
now - I need to test it again.

&lt;/P&gt;&lt;P&gt;

The only problem I found in the Indigo code was a part of their setup
code which didn't check the timeout. That's also been fixed after I
reported it.

&lt;/P&gt;&lt;P&gt;

The cross-validation with Indigo found problems in my code. For
example, I was often getting smaller MCSes then Indigo. After looking
at them, I figured out that my code didn't correctly handle the case
when a molecule was fragmented after a bond was removed because its
bondtype wasn't in all of the structures.

&lt;/P&gt;&lt;P&gt;

Why didn't my hand-written test cases find it? None of them had a case
where there was a bond in the "middle" of a structure chosen as the
reference structure, and where the MCS was not in the first fragment.

&lt;/P&gt;&lt;P&gt;

My code usually got the right answer when using highly similar
structures, for obvious reasons. It was only the random pair testing
where the problem really stood out.

&lt;/P&gt;&lt;P&gt;

Could I create a simple unit test for this error? Perhaps, but it's
not easy. I don't know which of the two fragments will be first - it
depends on so many arbitrary decisions which could change as the
algorithm is improved. The only test I can think of for this was to
generate a diverse set of tests, make sure some fail if the code isn't
implemented correctly, record the results, and make sure that future
tests never find a worse (or better?) test case.

&lt;/P&gt;
&lt;h3&gt;Bad heuristic to determine the maximum possible subgraph growth&lt;/h3&gt;
&lt;P&gt;

Most of the MCS implementation is heuristics. There's a branch and
bound search, there's subgraph canonicalization to reduce the number
of substructure tests, and so on. Each of these is supposed to help
make the code faster.

&lt;/P&gt;&lt;P&gt;

One of the tests takes the current subgraph and the list of "outgoing"
bonds to see how much of the remainder of the graph is accessible for
growth from the current subgraph. The rest of the molecule might not
be accessible because it's on another fragment, but it also might not
be accessible due to an earlier decision to exclude certain parts of
the molecule from future growth. (My algorithm tests all subgraphs
which include a given bond, then all subgraph which don't include the
given bond.)

&lt;/P&gt;&lt;P&gt;

It took a couple of days to think of the algorithm, write the code,
and get it working. I then did the timing tests to find out it was 1%
faster, to get the same answers.

&lt;/P&gt;&lt;P&gt;

Does that mean my code worked? I only had a suspicion that the new
algorithm should be faster. Perhaps the overhead of searching for the
accessible atoms was too costly?

&lt;/P&gt;&lt;P&gt;

After looking at my implementation - a lot - I finally realized that I
told the algorithm to exclude the subgraph from consideration for
growth but had forgotten to also exclude the set of previously
excluded bonds from consideration. With two changed lines, the overall
performance doubled for the random-pairs case. Most of the time it's
faster and sometimes its slower, so it takes an aggregate of tests to
measure this correctly.

&lt;/P&gt;&lt;P&gt;

I don't think this heuristic could have been written as a unit
test. No, I take that back. It could have, but it would have required
some careful thought to set up. Not only was this part of the code not
"built for test", but setting up the right conditions requires a
mental understanding of the problem which I know I didn't have when I
was doing the development.

&lt;/P&gt;&lt;P&gt;

BTW, I am not saying that unit tests can't be used to measure
performance. Some algorithms have simple timing characteristics where
it's easy to say that the code must complete by a given time, or that
it must be at least twice as fast as a reference
implementation. Occasionally these tests will hiccup due to unusual
loads on the test machine, but not usually enough to be a problem.

&lt;/P&gt;&lt;P&gt;

Indeed, the fmcs code has some unit tests for the timeout code. I
found a pair of structures which takes over 10 seconds to find the
MCS, I set the timeout to 0.1 second, and assert that no more than 0.5
seconds elapsed during the function call. (I don't require a high
precision for this code.) I do worry though that future changes to the
MCS search code might speed things up by a faster of 20. On the other
hand, the effect of broken timeout code is very obvious when running
the validation suite, so I'm not going to worry about it.

&lt;/P&gt;&lt;P&gt;

As is widely acknowledged, this stretches the idea of a "unit test."
Unit tests are supposed to be fast; preferably several hundred or more
per second. The goal is that these tests run often, perhaps even after
every save. Stick in a few 0.1 second timeouts into the system and it
bogs everything down, which discourages people from running the tests
so often.

&lt;/P&gt;&lt;P&gt;

But let's go back to this new code. On average, over a large number of
tests, the performance is 50% faster. What's a good test case? Can I
pick one structure or a small set of structures? Does the speedup
occur only when there are large molecules? Only for molecules with
lots of internal symmetry?  Only for those which are easily
fragmented?

&lt;/P&gt;&lt;P&gt;

Only now, when writing this essay, did I find good test cases. When I
use a 10 second timeout using the k=10 nearest neighbors tests, I get
8 timeouts in the first 32 records using the old algorithm, and only 1
timeout using the new algorithm. The time for the very first test goes
from over a minute (I finally gave up) using the old algorithm to 0.08
seconds using the new one.

&lt;/P&gt;&lt;P&gt;

Obviously (in retrospect) the right solution is to pull out a couple
of those from my validation suite and turn them into test cases.

&lt;/P&gt;&lt;P&gt;

I never would have come up with these test cases before implementing
the new code - up until ten minutes ago I thought that the new code
only added a factor of two the code, not possible factors of 1,000!

&lt;/P&gt;
&lt;h2&gt;Canonicalization: an in-depth analysis&lt;/h2&gt;
&lt;P&gt;

I'm not finished with testing. I haven't fully characterized all the
parts of the implementation. For that matter, there are heuristics I
can still add. Here I'll describe what I did to analyze subgraph
SMARTS canonicalization. I conclude by finding that while it works, it
slows down the code and several days of development work ought to be
removed.

&lt;/P&gt;&lt;P&gt;

My MCS algorithm enumerates subgraph of one structure, converts that
into a SMARTS string, and tests if the pattern exists in other
structures.

&lt;/P&gt;&lt;P&gt;

There are many alternate ways to make a SMARTS string from a
subgraph. Given the linear chain of C O N, there's the reasonable CON
or NOC variations, the more confusing O(C)N and O(N)C, and crazy
variations like C1.N1O and O%987.N7.C%98. It's easy to make the
non-insane versions be generating a spanning tree. The question is
where to start the search and how to handle branches.

&lt;/P&gt;&lt;P&gt;

I believe that there are many duplicate patterns in a query. For
example, a benzene ring will generate many "ccc" queries. I can
minimize the number of substructure matches by caching earlier SMARTS
match results. But caching doesn't work if sometimes I get a CON and
sometimes I get a NOC. The solution is to generate a "canonical"
SMARTS for the subgraph - a unique representation for all of the
variations.

&lt;/P&gt;&lt;P&gt;

I implemented the CANGEN algorithm from
&lt;a href="http://pubs.acs.org/doi/abs/10.1021/ci00062a008"&gt;SMILES. 2. Algorithm
for generation of unique SMILES notation&lt;/a&gt; by Weininger, Weininger, and Weininger
J. Chem. Inf. Comput. Sci., 1989, 29 (2), pp 97-101 in order to choose
a canonical SMARTS. (But see 
&lt;a href="http://organica1.org/seminario/cangen.pdf"&gt;Assigning Unique Keys to Chemical Compounds for Data Integration: Some Interesting Counter Examples&lt;/a&gt;, Neglur, Grossman, and Liu, 2nd International Workshop on Data Integration in the Life Sciences (DILS 2005), La Jolla.)

&lt;/P&gt;&lt;P&gt;

Canonicalization is notoriously hard to get right. It too has a lot of
tricky corner cases. For example, in the late 1990s Daylight tracked
down a reported bug in their canonicalization implementation. They
algorithm requires a stable sort algorithm, but they used an unstable
sort. Their own internal testing never found the problem - it was a
customer who found the Solaris and IRIX boxes occasionaly returned
different values.

&lt;/P&gt;&lt;P&gt;

There could be similar bugs in my canonicalization algorithm. I don't
know. The usual solution is to generate large numbers of test cases,
randomize the order of the atoms and bonds, and verify that all of
them are the same. Here too it may be that only certain rare
topologies, or rare bond patterns, triggers an error. I'm not
convinced that I could come up with the right set of test cases for
it.

&lt;/P&gt;&lt;P&gt;

What makes it harder is that the underlying search algorithm doesn't
need a canonical SMARTS, so a canonicalization failure doesn't ever
show up in the output. I could have an infrequent bug and not notice
it!  Canonicalization is only there for performance reasons, but my
implementation is in Python, while the substructure matching is in
C++, so it might even be that the canonicalization time might be more
than the time saved.

&lt;/P&gt;&lt;P&gt;

I can make that a bit more nuanced. Canonical SMARTS generation has
three phases: 1) initial ranking, 2) canonicalization/tie-breaking,
and 3) SMARTS generation. I can disable the canonicalization step and
get a "semi-canonical" SMARTS (my initial ranking is more like a
circular fingerprint of diameter 3 than the atom characteristic from
Weininger et al.). I can also disable caching. Both of these are
doable with only a couple of changes to the code.

&lt;/P&gt;&lt;P&gt;

This leads to three cases: canonical SMARTS with caching,
semi-canonical SMARTS with caching, and semi-canonical SMARTS without
caching. (There should be a fourth step, which is arbitrary SMARTS,
but that requires more extensive changes.)

&lt;/P&gt;&lt;P&gt;

What I want to understand most is the effect of canonicalization on my
code. Again, I have to resort to aggregate timings and statistics
across a set of benchmarks. I used the ChEMBL-13 data set and
generated various benchmarks. One is based on computing the MCS
between 10,000 pairs selected at random, another contains the k=2,
k=10, and k=100 nearest neighbors of randomly chosen fingerprints
(timing 500 data sets each time), and the last is the total of 500
tests of the k&amp;lt;=100 compounds with Tanimoto score of at least 0.95
to randomly selected fingerprints. My results (times are reproducible
to within a few percent) include the number of unique SMARTS
(canonical or non-canonical) generated and the total number of
substructure tests ("SS") which were carried out.

&lt;/P&gt;&lt;P&gt;
&lt;table&gt;

&lt;tr&gt;&lt;th colspan="2"&gt;Random pair timings&lt;/th&gt;&lt;th&gt;# unique SMARTS&lt;/th&gt;&lt;th&gt;# SS tests&lt;/th&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;th align="right"&gt;no cache&lt;/th&gt;&lt;td&gt;Total: 10000/467.3s (21.4/s) Complete: 9997/437.3s (22.9/s) Incomplete: 3/30.0s&lt;/td&gt;&lt;td align="center"&gt;921666&lt;/td&gt;&lt;td align="right"&gt;1339292&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;th align="right"&gt;semi-canonical&lt;/th&gt;&lt;td&gt;Total: 10000/421.3s (23.7/s) Complete: 9997/391.3s (25.6/s) Incomplete: 3/30.0s&lt;/td&gt;&lt;td align="center"&gt;921666&lt;/td&gt;&lt;td align="right"&gt;921666&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;th align="right"&gt;canonical&lt;/th&gt;&lt;td&gt;Total: 10000/442.1s (22.6/s) Complete: 9997/412.0s (24.3/s) Incomplete: 3/30.0s&lt;/td&gt;&lt;td align="center"&gt;709636&lt;/td&gt;&lt;td align="right"&gt;709636&lt;/td&gt;&lt;/tr&gt;

&lt;tr&gt;&lt;th colspan="2"&gt;k=2 nearest neigbhors&lt;/th&gt;&lt;th&gt;# unique SMARTS&lt;/th&gt;&lt;th&gt;# SS tests&lt;/th&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;th align="right"&gt;no cache&lt;/th&gt;&lt;td&gt;Total: 500/287.3s (1.7/s) Complete: 484/127.1s (3.8/s) Incomplete: 16/160.2s&lt;/td&gt;&lt;td align="center"&gt;264057&lt;/td&gt;&lt;td align="right"&gt;320238&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;th align="right"&gt;semi-canonical&lt;/th&gt;&lt;td&gt;Total: 500/276.7s (1.8/s) Complete: 484/116.6s (4.2/s) Incomplete: 16/160.2s&lt;/td&gt;&lt;td align="center"&gt;264057&lt;/td&gt;&lt;td align="right"&gt;264057&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;th align="right"&gt;canonical&lt;/th&gt;&lt;td&gt;Total: 500/298.6s (1.7/s) Complete: 483/128.4s (3.8/s) Incomplete: 17/170.2s&lt;/td&gt;&lt;td align="center"&gt;186682&lt;/td&gt;&lt;td align="right"&gt;186682&lt;/td&gt;&lt;/tr&gt;

&lt;tr&gt;&lt;th colspan="2"&gt;k=10 nearest neigbhors&lt;/th&gt;&lt;th&gt;# unique SMARTS&lt;/th&gt;&lt;th&gt;# SS tests&lt;/th&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;th align="right"&gt;no cache&lt;/th&gt;&lt;td&gt;Total: 500/520.5s (1.0/s) Complete: 471/230.0s (2.0/s) Incomplete: 29/290.5s&lt;/td&gt;&lt;td align="center"&gt;447648&lt;/td&gt;&lt;td align="right"&gt;3525563&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;th align="right"&gt;semi-canonical&lt;/th&gt;&lt;td&gt;Total: 500/490.1s (1.0/s) Complete: 472/209.5s (2.3/s) Incomplete: 28/280.6s&lt;/td&gt;&lt;td align="center"&gt;447648&lt;/td&gt;&lt;td align="right"&gt;2872040&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;th align="right"&gt;canonical&lt;/th&gt;&lt;td&gt;Total: 500/509.3s (1.0/s) Complete: 471/218.7s (2.2/s) Incomplete: 29/290.6s&lt;/td&gt;&lt;td align="center"&gt;330010&lt;/td&gt;&lt;td align="right"&gt;2109332&lt;/td&gt;&lt;/tr&gt;

&lt;tr&gt;&lt;th colspan="2"&gt;k=100 nearest neigbhors&lt;/th&gt;&lt;th&gt;# unique SMARTS&lt;/th&gt;&lt;th&gt;# SS tests&lt;/th&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;th align="right"&gt;no cache&lt;/th&gt;&lt;td&gt;Total: 500/414.4s (1.2/s) Complete: 486/271.4s (1.8/s) Incomplete: 14/143.0s&lt;/td&gt;&lt;td align="center"&gt;128781&lt;/td&gt;&lt;td align="right"&gt;9932263&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;th align="right"&gt;semi-canonical&lt;/th&gt;&lt;td&gt;Total: 500/363.5s (1.4/s) Complete: 487/230.8s (2.1/s) Incomplete: 13/132.7s&lt;/td&gt;&lt;td align="center"&gt;128781&lt;/td&gt;&lt;td align="right"&gt;7456877&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;th align="right"&gt;canonical&lt;/th&gt;&lt;td&gt;Total: 500/361.8s (1.4/s) Complete: 488/239.5s (2.0/s) Incomplete: 12/122.4s&lt;/td&gt;&lt;td align="center"&gt;107648&lt;/td&gt;&lt;td align="right"&gt;6196764&lt;/td&gt;&lt;/tr&gt;

&lt;tr&gt;&lt;th colspan="2"&gt;k&amp;lt;=100 at or above Tanimoto threshold of 0.95&lt;/th&gt;&lt;th&gt;# unique SMARTS&lt;/th&gt;&lt;th&gt;# SS tests&lt;/th&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;th align="right"&gt;no cache&lt;/th&gt;&lt;td&gt;Total: 500/642.1s (0.8/s) Complete: 458/220.3s (2.1/s) Incomplete: 42/421.9s&lt;/td&gt;&lt;td align="center"&gt;468661&lt;/td&gt;&lt;td align="right"&gt;4388074&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;th align="right"&gt;semi-canonical&lt;/th&gt;&lt;td&gt;Total: 500/624.5s (0.8/s) Complete: 461/232.8s (2.0/s) Incomplete: 39/391.7s&lt;/td&gt;&lt;td align="center"&gt;468661&lt;/td&gt;&lt;td align="right"&gt;3828813&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;th align="right"&gt;canonical&lt;/th&gt;&lt;td&gt;Total: 500/640.9s (0.8/s) Complete: 460/239.2s (1.9/s) Incomplete: 40/401.8s&lt;/td&gt;&lt;td align="center"&gt;364802&lt;/td&gt;&lt;td align="right"&gt;3222366&lt;/td&gt;&lt;/tr&gt;

&lt;/table&gt;

&lt;/P&gt;&lt;P&gt;

Whew! That's a lot of data to throw at you. Please believe me when I
say that it took two days to generate correctly. BTW, "complete" means
that the MCS search algorithm went to completion, the "incomplete"
means that it timed out - here after 10 seconds - and gave a partial
solution.

&lt;/P&gt;&lt;P&gt;

What does this tell me? Obviously, caching is good. As expected, the
"semi-canonical" solution is always better than the "no cache" case,
and the canonicalization always reduces the number of substructures
and substructure tests. This means the canonicalization code is doing
&lt;i&gt;something&lt;/i&gt; right.

&lt;/P&gt;&lt;P&gt;

Unfortunately, it seems like canonicalization has a big performance
impact. In the pairwise tests I do 922,000 canonicalizations in Python
to save 212,000 substructure tests, and I lose 21 seconds in doing
so. For the k=100 nearest neighbor benchmark, which is the only one
where the canonicalization code saves time, I do 129,000
canonicalizations to save 1,260,000 comparisons, and gain about 2
seconds. This suggests that each canonicalization takes as long as 10
substructure matches, and that RDKit can do 60,000 SMARTS matches per
second. A quick checks using one SMARTS gives 100,000 SMARTS matches
per second, which helps validate this estimate.

&lt;/P&gt;&lt;P&gt;

This means I should take out the canonicalization until it can be
implemented in C++. Granted, there may be a bug in the code, like
there was with an earlier hueristic. But there would have to be an
order of magnitude performance increase for this to be effective, and
I don't think that's likely.


&lt;/P&gt;
&lt;h2&gt;Unit tests aren't enough to drive development&lt;/h2&gt;
&lt;P&gt;

Which leads me back to my thesis. It would have been possible to
develop a few more unit tests for the canonicalization code. There
would have been some extra scaffolding in order to do that, but that's
a minor cost. I suspect that many of the tests would be very
implementation-dependent, and tied to specific internal function
calls. I don't like this because it means that a re-implementation of
this internal component in C++, which submerges many of the internal
helper functions and places them out of reach of Python-based unit
tests, would cause the unit tests to be un-runnable. And I am certain
that those unit tests would not be rigorous enough to be confident
that the code was working correctly.

&lt;/P&gt;&lt;P&gt;

Think about the question "should I write this code in Python or C++?".
It's a development-time question. Test Driven Development (TDD) is a
development methodology which uses unit tests to help make
development-time decisions. I think TDD is most helpful when
used to establish the minimum requirements for a project.

&lt;/P&gt;&lt;P&gt;

I don't see how TDD is helpful here. I can't think of any unit tests
which would guide this development decision. Sometimes you can write a
"spike solution" (also called a "prototype"):

&lt;blockquote&gt;

&lt;a href="http://www.extremeprogramming.org/rules/spike.html"&gt;A spike
solution&lt;/a&gt; is a very simple program to explore potential
solutions. Build the spike to only addresses the problem under
examination and ignore all other concerns.

&lt;/blockquote&gt;

This sounds good, but what development style should you use to develop
the spike solution? Spikes are supposed to be throw-away code, but I
can't figure out how something other than a complete, tested
implementation of the canonical algorithm or the remaining growth
heuristics would have been useful. After all, a two line change in the
latter went from "working but 1% faster" to "working and 50% faster",
and I didn't even know if there was going to give a speedup in the
first place.

&lt;/P&gt;
&lt;h2&gt;If not TDD, what methodology do I use?&lt;/h2&gt;
&lt;P&gt;

Read Peter Seibel's &lt;a
href="http://gigamonkeys.wordpress.com/2009/10/05/coders-unit-testing/"&gt;Unit
testing in Coders at Work&lt;/a&gt;. The author interviewed various famous
programmers and learned more about how they tested their software. You
may also want to read the related &lt;a
href="http://news.ycombinator.com/item?id=863219"&gt;comments on Hacker
News&lt;/a&gt;.


&lt;/P&gt;&lt;P&gt;

Seibel recounts that Joshua Bloch "wrote a monstrous 'basher'" which
found "that occasionally, just occasionally, the basher would fail its
consistency check." Bloch then wrote tests to pin down the failure,
eventually finding the fault in a system mutex. Note that under normal
unit test development you assume that your underlying components are
correct, so this isn't something you would normally code for.

&lt;/P&gt;&lt;P&gt;

Seibel also reports that others assemble a large amount of test cases
and uses that to check their code. This is of course the technique I
used for the MCS problem.

&lt;/P&gt;&lt;P&gt;

It feels as trite as saying that the secret to losing weight is diet
and exercise, but my method for programming is to understand the
problem, implement it, and test it until you are confident that it's
good enough. That knowledge of how to do that comes from &lt;a
href="http://dalkescientific.com/writings/diary/archive/2007/08/17/kata_and_practice.html"&gt;experience,
practice, reflection, and discussion&lt;/a&gt;.

&lt;/P&gt;&lt;P&gt;

For those who scoff and call this "big design up front," I merely
point you to earlier parts of this essay. When I started this code I
had a rough idea that it would work. I had previously implemented &lt;a
href="http://dalkescientific.com/writings/diary/archive/2011/01/10/subgraph_enumeration.html"&gt;substructure
enumeration&lt;/a&gt;, and the Weininger et al. canonicalization algorithm
and SMARTS generation, so I knew that the components were possible,
but I didn't know how they went together. Instead, I let the code
development itself guide me. I thought some, I implemented code, I
"ran the code in my head", I reflected on what the tricky parts were,
I thought about how to make them more clear, and I tried various
was to improve the code.

&lt;/P&gt;&lt;P&gt;

That was a lot to keep in my head, I wasn't sure if the result would
be fast enough, and I had no way good way to test its usefulness until
most of the code was in place. No wonder it was mentally taxing!

&lt;/P&gt;&lt;P&gt;

I believe tests - including unit tests implemented during development
- are important. I don't believe that unit tests are good enough to
guide complex algorithm development. However, most programming (over
95%?) is not complex algorithm development; complicated? yes, but not
complex.

&lt;/P&gt;&lt;P&gt;

I once jokingly said that TDD is not useful if there's two or more
embedded for loops. That's a bit of a simplification, but not far from
my feelings. Some classes of problems, like complex algorithm
development and security analysis, require a different attitude
towards programming, emphasizing "what can go wrong?" and "how can I
improve my confidence that the code is working correctly?" It's my
view that this doubt-based philosophical attitude is missing from most
discussions of software development practices.

&lt;/P&gt;&lt;P&gt;

And perhaps as fundamental, when I have to be that critical of my own
work and probe it for failures and be open to the possibility of nasty
gotchas lurking in the deep dark corners, then some of that doubt
passes over into my personal life. I empathize with the idea of not
living in that "strange place" all the time, and I can see why this
isn't a common attitude.

&lt;/P&gt;
&lt;h2&gt;Comments&lt;/h2&gt;
&lt;P&gt;

This essay is meant to be a thoughtful reflection on the difficulties
of programming, using the MCS problem to provide specific
structure. If you want to &lt;a href="http://dalkescientific.blogspot.com/2012/05/maximum-common-substructures-and-fmcs.html"&gt;leave
comments about the MCS portion&lt;/a&gt; then use the MCS comment site.
Otherwise, &lt;a href="http://dalkescientific.blogspot.se/2012/05/testing-hard-algorithms.html"&gt;leave
a comment on testing hard algorithms&lt;/a&gt;.

&lt;/P&gt;

</description><guid isPermaLink="true">http://www.dalkescientific.com/writings/diary/archive/2012/05/21/testing_hard_algorithms.html</guid><pubDate>Mon, 21 May 2012 12:00:00 GMT</pubDate></item><item><title>Topologically non-planar compounds</title><link>http://www.dalkescientific.com/writings/diary/archive/2012/05/18/nonplanar_compounds.html</link><description>&lt;P&gt;

John MacCuish, from &lt;a href="http://mesaac.com/"&gt;Mesa Analytics&lt;/a&gt;,
pointed out that the MCS problem takes polynomial time if the graphs
are planar. He writes: "Are there graphs in your likely sets, that are
not planar? I have never seen a non-planar small molecule drug for
example, but they may be out there. Tests for planarity are also in
P. Of course it doesn't mean that solutions in P will be faster than
the usual methods since N is small an the overhead for planar MCS may
be large, such that N may need to be large for the planar method to
beat the non-planar heuristics."

&lt;/P&gt;&lt;P&gt;

This leads to a couple of questions. One is, what is the shape of the
run-time of &lt;a
href="http://dalkescientific.com/writings/diary/archive/2012/05/13/fmcs.html"&gt;my
MCS algorithm&lt;/a&gt;? I don't actually know. Some tests take a very long
time, but that's not enough to establish that it's polynomial for
real-world compounds or exponential. I'm not going to research this
now.

&lt;/P&gt;&lt;P&gt;

Another is, are there real-world small molecules which are non-planar,
in the &lt;a
href="http://mathworld.wolfram.com/PlanarGraph.html"&gt;topological
sense&lt;/a&gt;, and not the 
&lt;a href="http://www.uscibooks.com/Anslyn/AD06rt.pdf"&gt;chemistry sense&lt;/a&gt;
where all of the non-hydrogen atoms line on or near a plane?

&lt;/P&gt;
&lt;h2&gt;Previous work in topologically non-planar compounds&lt;/h2&gt;
&lt;P&gt;

A quick literature seach finds &lt;a href="http://www.sciencedirect.com/science/article/pii/0040403981800779"&gt;Synthesis of the first topologically non-planar molecule&lt;/a&gt;, which says:

&lt;blockquote&gt;
We report here the synthesis and characterization of the tris-ether
2,5,14-trioxahexacyclo-[5.5.2.1.2&lt;sup&gt;4,10&lt;/sup&gt;.O&lt;sup&gt;4,17&lt;/sup&gt;.O&lt;sup&gt;10,17&lt;/sup&gt;]-heptadecane
3; this topologically unique (graph theory) molecule is prepared via a
novel intramolecular rearrangement of either of two isomeric
propellane spiro-epoxides, 1 and 2.
&lt;/blockquote&gt;

&lt;/P&gt;&lt;P&gt;

There's also &lt;a
href="http://walba.colorado.edu/Reprints/Walba%20New%20J%20Chem%201993.pdf"&gt;Topological
stereochemistry. 9. Synthesis and cutting "in-half" of a molecular
M&amp;ouml;bius strip&lt;/a&gt; by Walba, Homan, Richards, and Haltiwanger in
New. J. Chem., 1993, 17, 661-681.

&lt;/P&gt;&lt;P&gt;

Quoting from the above link to
&lt;a href="http://www.uscibooks.com/Anslyn/AD06rt.pdf"&gt;Modern Physical
Organic Chemistry&lt;/a&gt; by Ansyln and Dougherty:

&lt;blockquote&gt;
We mention briefly here another topological issue that has fascinated
chemists. For the overwhelming majority of organic molecules, we can
draw a two-dimensional representation with no bonds crossing each
other. ... It may seem surprising, but most molecules have planar graphs.
&lt;br /&gt;&lt;br /&gt;
Recent efforts have produced chemical structures that successfully
realize many interesting and novel topologies. A landmark was
certainly the synthesis of a trefoil knot using Sauvage's
Cu+/phenanthroline templating strategy.... V&amp;ouml;gtle and co-workers
have described an "all organic" approach to amide-containing trefoil
knots, and have been able to separate the two enantiomeric knots using
chiral chromatography.  Another seminal advance in the field was the
synthesis and characterization of a 'M&amp;ouml;bius strip' molecule..."
&lt;/blockquote&gt;

&lt;/P&gt;&lt;P&gt;

So it's well-established that there are topologically non-planar
structures. But are they in compound databases which I can access?

&lt;/P&gt;
&lt;h2&gt;Searching PubChem for topologially non-planar compounds&lt;/h2&gt;
&lt;P&gt;

Take a look at
&lt;a href="http://ukpmc.ac.uk/articles/PMC2807378"&gt;Scaffold Topologies II:
Analysis of Chemical Databases&lt;/a&gt; by Wester, Pollock, Coutsias, Allu,
Muresan and Oprea in PMC 2010 January 15., Published in final edited
form as: J Chem Inf Model 2008 July; 48(7): 1311-1324.

&lt;blockquote&gt;

Only 12 nonplanar and 2,099 spiro node topologies (all of which are
planar) are present in the merged database. 9 of the nonplanar
topologies are found only in PubChem and the total number of molecules
represented by such topologies in the merged database is a mere 44,
agreeing with Walba's assessment concerning the rarity of chemicals
with nonplanar graphs.

&lt;/blockquote&gt;

&lt;/P&gt;&lt;P&gt;

This establishes that in 2008 there were no more than 44 topologically
non-planar structures in PubChem. Can I find them?

&lt;/P&gt;&lt;P&gt;

I don't think any of the database search engines support this
capability, so I need to write a program.

&lt;/P&gt;&lt;P&gt;


For that I need a method for &lt;a
href="http://en.wikipedia.org/wiki/Planarity_testing"&gt;planarity
testing&lt;/a&gt;.  Various sources say that the linear time algorithm is
"&lt;a href="http://code.google.com/p/planarity/wiki/History"&gt;widely
regarded as being quite complex&lt;/a&gt;", but that one of the
&lt;a href="http://code.google.com/p/planarity/"&gt;linear time planarity
algorithms&lt;/a&gt; is part of &lt;a href="http://www.sagemath.org/"&gt;Sage&lt;/a&gt;

&lt;/P&gt;&lt;P&gt;

Sage is a comprehensive mathematical software system, which uses
Python. There's a pre-built binary distribution for my OS, which I
downloaded. It includes Python, the IPython shell, and everything else
it needs, so it really is a &lt;i&gt;system&lt;/i&gt; and not a set of Python
modules.

&lt;/P&gt;&lt;P&gt;

With Sage it's easy to make a graph and test for planarity.

&lt;pre class="code"&gt;
% sage
----------------------------------------------------------------------
| Sage Version 5.0, Release Date: 2012-05-14                         |
| Type notebook() for the GUI, and license() for information.        |
----------------------------------------------------------------------
sage: d = {1: [2,3], 2: [3], 3: [4, 6], 4: [5]}
sage: g = Graph(d)
sage: g.is_planar()
True
sage: 
sage: &lt;a href="http://en.wikipedia.org/wiki/File:Biclique_K_3_3.svg"&gt;k3_3&lt;/a&gt; = {1: [2,3,4,5,6], 2: [3,4,5,6], 3: [4, 5, 6]}
sage: Graph(k3_3).is_planar()
False
sage: 
&lt;/pre&gt;


&lt;/P&gt;&lt;P&gt;

The dictionary I use as input to "Graph" contains an upper-triangle
connection matrix. So to test if a molecule is topologically planar, I
just need to convert its connectivity information into a dictionary of
the right form, turn the dictionary into a Graph, and test the graph
for is_planar().

&lt;/P&gt;&lt;P&gt;

&lt;h3&gt;Roadblock! My Python 2.6 modules and Sage's 2.7 Python don't mix&lt;/h3&gt;

&lt;/P&gt;&lt;P&gt;

The prebuilt binaries include its own Python distribution, and when
you use "sage", it replaces the PYTHONPATH with its own settings. This
means when I run sage I don't have access to the cheminformatics tools
I've already set up for my system:

&lt;pre class="code"&gt;
% sage
----------------------------------------------------------------------
| Sage Version 5.0, Release Date: 2012-05-14                         |
| Type notebook() for the GUI, and license() for information.        |
----------------------------------------------------------------------
sage: import rdkit
---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)

/Users/dalke/&amp;lt;ipython console&amp;gt; in &amp;lt;module&amp;gt;()

ImportError: No module named rdkit
sage: from openeye.oechem import *
---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)

/Users/dalke/&amp;lt;ipython console&amp;gt; in &amp;lt;module&amp;gt;()

ImportError: No module named openeye.oechem
sage: os.environ["PYTHONPATH"]
'/Users/dalke/ftps/sage/local/lib/python'
&lt;/pre&gt;

I tried to force it to include the right path, and that failed.

&lt;pre class="code"&gt;
% printenv PYTHONPATH
/Users/dalke/ftps/openeye//wrappers/v2011.Oct.1/python/:/Users/dalke/envs/RDKit_2011_12-svn
% sage
   ...
sage: import sys
sage: sys.path.insert(0, "/Users/dalke/envs/RDKit_2011_12-svn")
sage: import os
sage: os.environ["DYLD_LIBRARY_PATH"]
'/Users/dalke/ftps/sage/local/lib:/Users/dalke/ftps/sage/local/lib/R/lib::/Users/dalke/ftps/openeye//wrappers/libs:/Users/dalke/envs/RDKit_2011_12-svn/lib:/Users/dalke/ftps/sage/local/lib/R/lib'
sage: from rdkit import Chem
Fatal Python error: Interpreter not initialized (version mismatch?)

------------------------------------------------------------------------
Unhandled SIGABRT: An abort() occurred in Sage.
This probably occurred because a *compiled* component of Sage has a bug
in it and is not properly wrapped with sig_on(), sig_off(). You might
want to run Sage under gdb with 'sage -gdb' to debug this.
Sage will now terminate.
------------------------------------------------------------------------
/Users/dalke/ftps/sage/spkg/bin/sage: line 312: 56741 Abort trap              sage-ipython "$@" -i
&lt;/pre&gt;

What happened here is that Sage ships with Python 2.7, while the
locally installed cheminformatics toolkits use Python 2.6.

&lt;/P&gt;&lt;P&gt;

I decided to use another technique. I would have sage call out to
another program to handle the cheminformatics. For each structure it
will output a line containing the identifier, the SMILES, and the
needed upper-triangle dictionary data structure. One line of output
will look like:

&lt;pre class="code"&gt;
('15', 'OC1C2(C(C3C(C4(C(CC3)CC(=O)CC4)C)CC2)CC1)C', {0: [1], 1: [2, 19],
 2: [3, 20, 17], 3: [4, 18], 4: [5, 9], 5: [6, 16], 6: [7, 15, 14],
 7: [8, 10], 8: [9], 10: [11], 11: [12, 13], 13: [14], 16: [17], 18: [19]})
&lt;/pre&gt;

All the sage code needs to do is read those lines, extract the data,
pass the graph into Sage for analysis, and print out those which are
non-planar.

&lt;/P&gt;&lt;P&gt;

Even this wasn't as easy as I thought. The PYTHONPATH environment
variable persists in spawned processes, which the python subprocess I
started uses the wrong PYTHONPATH. I ended up setting the environment
variables myself - including PYTHONHOME - before it would work
correctly:

&lt;pre class="code"&gt;
import sys
import os
env = os.environ.copy()
env["DYLD_LIBRARY_PATH"] = "/Users/dalke/ftps/openeye//wrappers/libs:/Users/dalke/envs/RDKit_2011_12-svn/lib"
env["PYTHONPATH"] = "/Users/dalke/envs/RDKit_2011_12-svn"
env["RDBASE"] = "/Users/dalke/envs/RDKit_2011_12-svn"
env["PYTHONHOME"] = "/System/Library/Frameworks/Python.framework/Versions/2.6"

import subprocess

# Import the "Graph" constructor
from sage.all import *

# Use the version of Python for which RDKit was built to use
p = subprocess.Popen(["/usr/bin/python2.6", "pubchem_connectivity.py"],
                     env = env,
                     stdout = subprocess.PIPE)
                 
for i, line in enumerate(p.stdout):
    id, smiles, d = eval(line)
    G = Graph(d)
    if not G.is_planar():
        print "======= Found one!!!"
        print id, smiles
    if i % 100 == 0:
        sys.stderr.write("%d ...\r" % (i,))
        sys.stderr.flush()
&lt;/pre&gt;

&lt;/P&gt;&lt;P&gt;

Now I just need the "pubchem_connectivity.py" program to generate the connectivity information.

&lt;/P&gt;
&lt;h3&gt;Call another program to generate the connectivity information&lt;/h3&gt;
&lt;P&gt;

I've been using RDKit these days, because the funding I got for the
MCS project said I needed to use RDKit. I usually use OEChem, which
(among other things) has much faster structure parsers than
RDKit. Since I'm going to read tens of millions of structures, I
wanted a way to reduce the number of structures to process.

&lt;/P&gt;&lt;P&gt;

Now, if a graph has no cycles then it's always planar. Even if it has
one, two, or three cycles, it's still impossible for it to be
non-planar. The number of cycles in a single component graph is E-V+1,
which is the number of edges (bonds), minus the number of vertices
(atoms), plus 1. So a simple test is to exclude molecules where the
number of bonds is less than three greater than the number of atoms.

&lt;/P&gt;&lt;P&gt;

Mind you, this is an exclusion test which rejects graphs that cannot
be planar. There are plenty of molecules with dozens or rings which
are topologically planar. Also, PubChem has plenty of records with
multiple components, which means I may miss a few cases. But mind you,
my goal is to find some non-planar structures in PubChem, not find all
of them.

&lt;/P&gt;&lt;P&gt;

I had previously converted my local copy of PubChem to a set of
compressed SMILES files. A few months ago I wrote a set of &lt;a
href="https://bitbucket.org/dalke/opensmiles-ragel"&gt;Ragel definitions
for SMILES&lt;/a&gt;, which includes a demonstration of how to use count the
number of atoms and bonds in a SMILES file.  I can use that unmodified
as a co-process to do high-speed counting for me.

&lt;/P&gt;&lt;P&gt;

&lt;pre class="code"&gt;
import glob
import subprocess
import gzip
import sys
from collections import defaultdict
from rdkit import Chem

# Start a co-process to count the number of atoms and bonds in a SMILES string
p = subprocess.Popen(["/Users/dalke/opensmiles-ragel/smiles_counts"],
                     stdin = subprocess.PIPE,
                     stdout = subprocess.PIPE)

filenames = glob.glob("/Users/dalke/databases/pubchem/*.smi.gz")
for i, filename in enumerate(filenames):
    msg = "Processing %d/%d\n" % (i+1, len(filenames))
    sys.stderr.write(msg)
    sys.stderr.flush()

    for line in gzip.open(filename):
        # Read a line from the data file
        smiles, id = line.split()

        # Send it to the co-process
        p.stdin.write(smiles + "\n")
        p.stdin.flush()

        # Get the counts (looks like "atoms: 34 bonds: 38")
        line = p.stdout.readline()
        _, atoms, _, bonds = line.split()
        num_atoms = int(atoms)
        num_bonds = int(bonds)

        # Skip records which cannot be non-planar
        # (Note: assumes single component structures!)
        if num_bonds &amp;lt; num_atoms + 3:
            continue

        # Extract the topology into upper-triangle dictionary form
        mol = Chem.MolFromSmiles(smiles)
        if mol is None:
            continue
        d = defaultdict(list)
        for bond in mol.GetBonds():
            b1 = bond.GetBeginAtomIdx()
            b2 = bond.GetEndAtomIdx()
            if b1 &amp;lt; b2:
                d[b1].append(b2)
            else:
                d[b2].append(b1)

        # print to stdout so the Sage program can read it
        data = (id, smiles, dict(d))
        print repr(data)
&lt;/pre&gt;

BTW, if you haven't been paying attention, I have one process which
tests if a SMILES string has enough bonds in it, another to convert
structure information into a simple topology, and a third program to
do the graph planarity test. "Bailing-wire and chewing gum!" to repeat
an old phrase.

&lt;/P&gt;
&lt;h2&gt;The structures!&lt;/h2&gt;
&lt;P&gt;

It took many hours for my computer to chug along (while I slept). I
ended up with 224 of the non-planar SMILES in the subset of 28.5
million PubChem structures I have on my machine. (In other words, bear
in mind that this is not a complete search!)

&lt;/P&gt;&lt;P&gt;

Here are a few structures to show you what they look like:

&lt;/P&gt;&lt;P&gt;
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=3084099"&gt;&lt;img src="http://pubchem.ncbi.nlm.nih.gov/image/imgsrv.fcgi?t=l&amp;cid=3084099"&gt;&lt;/a&gt;
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=4868274"&gt;&lt;img src="http://pubchem.ncbi.nlm.nih.gov/image/imgsrv.fcgi?t=l&amp;cid=4868274"&gt;&lt;/a&gt;
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=50919058"&gt;&lt;img src="http://pubchem.ncbi.nlm.nih.gov/image/imgsrv.fcgi?t=l&amp;cid=50919058"&gt;&lt;/a&gt;
&lt;/P&gt;

&lt;P&gt;

That first structure, &lt;a
href="http://en.wikipedia.org/wiki/Silicon_nitride"&gt;silicon
nitride&lt;/a&gt;, is a ceramic, and which cannot be expressed in
SMILES. (That is "[C]" is an equally poor SMILES representation of
graphite as it is for diamond.) The others look like progressively
more realistic chemical representations.

&lt;/P&gt;&lt;P&gt;

The non-planar structures were a bit too verbose to show as a SMILES
file. Instead, here's the full list of identifiers, which you can
easily use to get the structures yourself (or follow the hyperlink to
look at the non-planar depictions).

&lt;blockquote&gt;
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=390566"&gt;390566&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=498002"&gt;498002&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=636755"&gt;636755&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=3084099"&gt;3084099&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=4868274"&gt;4868274&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=5104674"&gt;5104674&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=6712449"&gt;6712449&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=10019039"&gt;10019039&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=10882690"&gt;10882690&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=10895017"&gt;10895017&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=10898366"&gt;10898366&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=10994362"&gt;10994362&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=11027681"&gt;11027681&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=11126005"&gt;11126005&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=11131973"&gt;11131973&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=11187418"&gt;11187418&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=11340053"&gt;11340053&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=11350583"&gt;11350583&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=11360285"&gt;11360285&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=11384163"&gt;11384163&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=11407796"&gt;11407796&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=11672382"&gt;11672382&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=14381365"&gt;14381365&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=14381430"&gt;14381430&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=14381432"&gt;14381432&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=14381435"&gt;14381435&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=16132679"&gt;16132679&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=16132681"&gt;16132681&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=16132995"&gt;16132995&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=16132999"&gt;16132999&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=16133150"&gt;16133150&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=16133259"&gt;16133259&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=16133262"&gt;16133262&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=16133397"&gt;16133397&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=16133412"&gt;16133412&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=16133413"&gt;16133413&lt;/a&gt;, 
ee&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=16133414"&gt;16133414&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=16133878"&gt;16133878&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=16145580"&gt;16145580&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=16146229"&gt;16146229&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=16146230"&gt;16146230&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=16148442"&gt;16148442&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=16148609"&gt;16148609&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=16148632"&gt;16148632&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=16148888"&gt;16148888&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=16148900"&gt;16148900&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=16149007"&gt;16149007&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=16149114"&gt;16149114&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=16149222"&gt;16149222&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=16149238"&gt;16149238&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=16149361"&gt;16149361&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=16149362"&gt;16149362&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=16149482"&gt;16149482&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=16149579"&gt;16149579&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=16149602"&gt;16149602&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=16149827"&gt;16149827&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=16149958"&gt;16149958&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=16150054"&gt;16150054&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=16150193"&gt;16150193&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=16150399"&gt;16150399&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=16150419"&gt;16150419&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=16150654"&gt;16150654&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=16150658"&gt;16150658&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=16150833"&gt;16150833&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=16150994"&gt;16150994&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=16151169"&gt;16151169&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=16151337"&gt;16151337&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=16151360"&gt;16151360&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=16151567"&gt;16151567&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=16151729"&gt;16151729&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=16151960"&gt;16151960&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=16152362"&gt;16152362&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=16152566"&gt;16152566&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=16152608"&gt;16152608&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=16152699"&gt;16152699&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=16152729"&gt;16152729&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=16153098"&gt;16153098&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=16153207"&gt;16153207&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=16153649"&gt;16153649&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=16154275"&gt;16154275&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=16154327"&gt;16154327&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=16154453"&gt;16154453&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=16154584"&gt;16154584&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=16154971"&gt;16154971&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=16155013"&gt;16155013&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=16155014"&gt;16155014&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=16155015"&gt;16155015&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=16155069"&gt;16155069&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=16155075"&gt;16155075&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=16155076"&gt;16155076&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=16155130"&gt;16155130&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=16155140"&gt;16155140&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=16155152"&gt;16155152&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=16155193"&gt;16155193&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=16155194"&gt;16155194&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=16155197"&gt;16155197&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=16155399"&gt;16155399&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=16155442"&gt;16155442&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=16155443"&gt;16155443&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=16155456"&gt;16155456&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=16155607"&gt;16155607&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=16155626"&gt;16155626&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=16155630"&gt;16155630&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=16155631"&gt;16155631&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=16155632"&gt;16155632&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=16155633"&gt;16155633&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=16155652"&gt;16155652&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=16155784"&gt;16155784&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=16155884"&gt;16155884&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=16155916"&gt;16155916&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=16155917"&gt;16155917&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=16155920"&gt;16155920&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=16155972"&gt;16155972&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=16156042"&gt;16156042&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=16156080"&gt;16156080&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=16156082"&gt;16156082&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=16156145"&gt;16156145&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=16156198"&gt;16156198&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=16156260"&gt;16156260&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=16156261"&gt;16156261&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=16156264"&gt;16156264&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=16156265"&gt;16156265&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=16156312"&gt;16156312&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=16156411"&gt;16156411&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=16156413"&gt;16156413&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=16156417"&gt;16156417&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=16156432"&gt;16156432&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=16156495"&gt;16156495&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=16156539"&gt;16156539&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=16156544"&gt;16156544&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=16156545"&gt;16156545&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=16156609"&gt;16156609&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=16156610"&gt;16156610&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=16156613"&gt;16156613&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=16157557"&gt;16157557&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=16214951"&gt;16214951&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=17749011"&gt;17749011&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=21597602"&gt;21597602&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=21597607"&gt;21597607&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=21597610"&gt;21597610&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=21597611"&gt;21597611&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=21770498"&gt;21770498&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=22294696"&gt;22294696&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=22835058"&gt;22835058&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=22835161"&gt;22835161&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=22835262"&gt;22835262&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=22835624"&gt;22835624&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=22835636"&gt;22835636&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=22835637"&gt;22835637&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=23327291"&gt;23327291&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=23584643"&gt;23584643&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=23726086"&gt;23726086&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=23727886"&gt;23727886&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=23955822"&gt;23955822&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=24764125"&gt;24764125&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=24770227"&gt;24770227&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=24770228"&gt;24770228&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=24770229"&gt;24770229&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=24770290"&gt;24770290&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=24770291"&gt;24770291&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=24871221"&gt;24871221&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=24940071"&gt;24940071&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=25200029"&gt;25200029&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=44239114"&gt;44239114&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=44303783"&gt;44303783&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=44303799"&gt;44303799&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=44303821"&gt;44303821&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=44382489"&gt;44382489&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=44397933"&gt;44397933&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=44397934"&gt;44397934&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=44566282"&gt;44566282&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=44566284"&gt;44566284&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=44575206"&gt;44575206&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=44575207"&gt;44575207&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=44575208"&gt;44575208&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=44592641"&gt;44592641&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=44592642"&gt;44592642&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=44592645"&gt;44592645&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=44592646"&gt;44592646&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=44592647"&gt;44592647&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=44592648"&gt;44592648&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=44592945"&gt;44592945&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=44593582"&gt;44593582&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=44606373"&gt;44606373&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=44606374"&gt;44606374&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=46882313"&gt;46882313&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=46882314"&gt;46882314&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=46882315"&gt;46882315&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=46882316"&gt;46882316&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=46882317"&gt;46882317&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=46882318"&gt;46882318&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=46882319"&gt;46882319&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=46882320"&gt;46882320&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=46882321"&gt;46882321&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=46882322"&gt;46882322&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=46882323"&gt;46882323&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=46882324"&gt;46882324&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=46882325"&gt;46882325&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=46882326"&gt;46882326&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=46882327"&gt;46882327&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=46882328"&gt;46882328&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=46891923"&gt;46891923&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=46895835"&gt;46895835&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=49799159"&gt;49799159&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=49799160"&gt;49799160&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=49873810"&gt;49873810&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=50897242"&gt;50897242&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=50900298"&gt;50900298&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=50900299"&gt;50900299&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=50900300"&gt;50900300&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=50919058"&gt;50919058&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=51004304"&gt;51004304&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=51026319"&gt;51026319&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=52945815"&gt;52945815&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=52952313"&gt;52952313&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=52952314"&gt;52952314&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=52952315"&gt;52952315&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=52952316"&gt;52952316&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=52952317"&gt;52952317&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=53468167"&gt;53468167&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=53468168"&gt;53468168&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=53468169"&gt;53468169&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=53468170"&gt;53468170&lt;/a&gt;, 
&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=53468171"&gt;53468171&lt;/a&gt;
&lt;/blockquote&gt;

&lt;/P&gt;&lt;P&gt;

Still, having some SMILES on-hand is nice so here are some of the
shorter SMILES strings. At the least, you can use these as test-cases
for an MCS search engine, and perhaps force a linear-time
planar-graph-only method to fail &lt;tt&gt;:)&lt;/tt&gt;

&lt;pre class="code"&gt;
O1C23C4C5(C16OCC7C4(CC(C2(OC(=O)C3(CC5C8C6CC=C9C8(C(=O)C=CC9)C)O)C)OC7=O)C)O &lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=390566"&gt;390566&lt;/a&gt;
O=C1N2C34N5CCC(C3CCC4(C1)C=CC5)c6c2cccc6 &lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=498002"&gt;498002&lt;/a&gt;
O=C1N2C34N5CCC(C3CCC4(C1)C=CC5)c6c2cc(cc6)OC &lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=636755"&gt;636755&lt;/a&gt;
[Si]123N4[Si]56N1[Si]4(N25)N36 &lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=3084099"&gt;3084099&lt;/a&gt;
C123C45C16C24C6C7C(C35)CC(C(C7)C)C &lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=4868274"&gt;4868274&lt;/a&gt;
O1C23C4C5(C16OCC7C4(CC(C2(OC(=O)C3(CC5C8C6CC=C9C8(C(=O)C=CC9)C)O)C)OC7=O)C)O &lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=5104674"&gt;5104674&lt;/a&gt;
O1C23C4C5(C16OCC7C4(CC(C2(OC(=O)C3(CC5C8C6CC=C9C8(C(=O)C=CC9)C)O)C)OC7=O)C)O &lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=6712449"&gt;6712449&lt;/a&gt;
O=C1N2C34N5CCC(C3CCC4(C1)C=CC5)c6c2cc(cc6)OC &lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=10019039"&gt;10019039&lt;/a&gt;
O=C1N2C34N5CCC(C3CCC4(C1)C=CC5)c6c2cc(c(c6)OC)OC &lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=10882690"&gt;10882690&lt;/a&gt;
O1C2C34C5C(CN(C3C(C5OC)c6c(cc(c(c6)OC)O)C4C1)CC)(C(C2)O)CO &lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=10895017"&gt;10895017&lt;/a&gt;
O1c2c3cccc2Cc4c5c(ccc4)Cc6c7c(ccc6)Cc8c(c(ccc8)C3)OCCOCCN(CCOCCO5)CCOCCOc9c(cccc9)OCCOCCN(CCOCC1)CCOCCO7 &lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=10898366"&gt;10898366&lt;/a&gt;
O1C2C34C5C(C(C2)OC(=O)C)(CN(C3C(C5OC)c6c(cc(c(c6)OC)O)C4C1)CC)COC &lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=10994362"&gt;10994362&lt;/a&gt;
C123C45c6c(cccc6)C17c8c(cccc8)C2(c9c(cccc9)C3(c1c4cccc1)c1c7cccc1)c1c5cccc1 &lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=11027681"&gt;11027681&lt;/a&gt;
OC1N2C34N5CCC(C3CCC4(C1)C=CC5)c6c2cc(c(c6)OC)OC &lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=11131973"&gt;11131973&lt;/a&gt;
O1c2c3c(ccc2)OB4Oc5c6c(ccc5)OB1Oc7c(c(ccc7)O4)N63 &lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=11187418"&gt;11187418&lt;/a&gt;
C123C45C6C7C8C19C1C(C4C4C%10C22C%11C(C5C5C%12C33C(C6C5)C8CC5C9C(C2C(C35)CC%12%11)CC1%10)C4)C7 &lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=11340053"&gt;11340053&lt;/a&gt;
O1C2C34C5C(CN(C3C(C5OC)c6c(cc(c(c6)O)O)C4C1)CC)(C(C2)O)COC &lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=11350583"&gt;11350583&lt;/a&gt;
C123C4C5C1C6C5(C4C26)C=CC#CC=CC78C9C1C7C2C1(C9C82)C=CC#CC=C3 &lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=11360285"&gt;11360285&lt;/a&gt;
C123C4C5C1C6C5(C4C26)C=CC#CC#CC=CC78C9C1C7C2C1(C9C82)C=CC#CC#CC=C3 &lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=11384163"&gt;11384163&lt;/a&gt;
O1C2C34C5C(CN(C3C(C5OC)c6c(cc(c(c6)OC)O)C4C1)CC)(C(C2)O)COC &lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=11407796"&gt;11407796&lt;/a&gt;
O(c1cc2c(cc1OC)C34C56C27c8c(cc(c(c8)OC)OC)C5(c9c3cc(c(c9)OC)OC)c1c(cc(c(c1)OC)OC)C6(c1c7cc(c(c1)OC)OC)c1c4cc(c(c1)OC)OC)C &lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=11672382"&gt;11672382&lt;/a&gt;
O(C12NC(=O)C3C4C56C1C7C8C59C15C6(C3C3C1C(C9C43)C#N)C2C7C5C8=O)C(=O)C &lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=14381365"&gt;14381365&lt;/a&gt;
O1C2C3C45C67C2C8C3C9C42C6(C8C9=O)C3C4C7C(C5C4C2C3C#N)C1=O &lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=14381430"&gt;14381430&lt;/a&gt;
O1C2C3C45C67C2C8C3C9C42C63C8C9OC(=O)C4C2C2C5C(C7C2C34)C1=O &lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=14381432"&gt;14381432&lt;/a&gt;
BrC12C34C56C7(C8C9C5C5C3C9C1C8C(=O)OC1C2C2C4C(C6C2C71)OC5=O)Br &lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=14381435"&gt;14381435&lt;/a&gt;
O1c2c3c4c5c6c7c3c(cc8c7c(cc6Oc9cc(ccc9)OCCOCCOc3cc1ccc3)C(=O)N(C8=O)C(CCCCCC)C)Oc1cc(ccc1)OCCOCCOc1cc(ccc1)Oc5cc1c4c(c2)C(=O)N(C1=O)C(CCCCCC)C &lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=16214951"&gt;16214951&lt;/a&gt;
BrC12C34C56C7(C8C9C5C5C3C9C1C8C(=O)OC1(C2C2C4C(C6C2C71)OC5=O)Br)Br &lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=21597602"&gt;21597602&lt;/a&gt;
BrC12C34C56C7C8C9C3C7C(=O)OC3C4C4C1C(=O)C(C5(C8C(C29)C(=O)OCC)Br)C4C63 &lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=21597607"&gt;21597607&lt;/a&gt;
BrC12C34C56C7C8C9C5C5C3C9C1C8C(=O)OC1C2C2C4C(C6C2C71)OC5=O &lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=21597610"&gt;21597610&lt;/a&gt;
O1C2C3C45C67C2C8C3C9C42C6(C8C9O)C3C4C7C(C5C4C2C3C(=O)O)C1 &lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=21597611"&gt;21597611&lt;/a&gt;
O1C23OC(OC4C25C6C7(C1=O)C(C3OC(OC5C(=C)C4CC6)(C)C)C(CCC7)(C)C)(C)C &lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=21770498"&gt;21770498&lt;/a&gt;
ClC1=CC2OC3C1C4OC5C2C4C3C5 &lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=22294696"&gt;22294696&lt;/a&gt;
O=C1N2C34N5CCC(C3CCC4(C1)C=CC5)c6c2cc(cc6)OC &lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=23327291"&gt;23327291&lt;/a&gt;
O1c2cc3c4cc2OCCOCCOCCOc5c(cc6c(c5)c(c7c(c6C)cc8c(c7)OCCOCCOCCOc9cc2c(cc9OCCOCCOCCO8)C4(c4c(cccc4)C32C)C)C)OCCOCCOCC1 &lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=23584643"&gt;23584643&lt;/a&gt;
O1C2(OCC34C5C2(CCC3OC(=O)C67C4C(OC51)CC(C6)C(=C)C7=O)C)C &lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=23727886"&gt;23727886&lt;/a&gt;
O=C1N2C34N5CCC(C3CCC4(C1)C=CC5)c6c2cc(c(c6)OC)OC &lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=23955822"&gt;23955822&lt;/a&gt;
O1C2C3C45C6C1OC(C6(CCC4OC(=O)C37CC(C2)C(=C)C7=O)C)OC5 &lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=24764125"&gt;24764125&lt;/a&gt;
O=C1N2C34N5CCC(C3CCC4(C1)C=CC5)C6=C2CC(=C(C6)OC)OC &lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=24871221"&gt;24871221&lt;/a&gt;
O1C2C34C56C(C(C2(C=C5)OC)C(O)(CCCC)C)CCN(C6Cc7c3c1c(cc7)O)CC4 &lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=44303783"&gt;44303783&lt;/a&gt;
O1C2C34C56C(C(C2(C=C5)OC)C(O)(CCCC)C)CN(C6Cc7c3c1c(cc7)O)CC4 &lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=44303799"&gt;44303799&lt;/a&gt;
O1C2C34C56C(C(C2(C=C5)OC)C(O)(CCCC)C)CCN(C6Cc7c3c1c(cc7)O)CC4 &lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=44303821"&gt;44303821&lt;/a&gt;
S(=O)(=O)(OCC12C3C4C(C1)CC(C3)CC4C2)N &lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=44382489"&gt;44382489&lt;/a&gt;
O=C1N2C34N5CCC(C3CCC4(C1)C=CC5)c6c2cc(c(c6)OC)OC &lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=44592945"&gt;44592945&lt;/a&gt;
O1C23C4C5(C16OCC7C4(CC(C2(OC(=O)C3(CC5C8C6CC=C9C8(C(=O)C=CC9)C)O)C)OC7=O)C)O &lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=49873810"&gt;49873810&lt;/a&gt;
O=C1N2C34N5CCC(C3CCC4(C1)C=CC5)c6c2cccc6 &lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=50897242"&gt;50897242&lt;/a&gt;
O=C1N2C34N5CCC(C3CCC4(C1)C=CC5)c6c2cc(cc6)OC &lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=50919058"&gt;50919058&lt;/a&gt;
O1C23C4C5(C16OCC7C4(CC(C2(OC(=O)C3(CC5C8C6CC=C9C8(C(=O)C=CC9)C)O)C)OC7=O)C)O &lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=51004304"&gt;51004304&lt;/a&gt;
O=C1N2C34N5CCC(C3CCC4(C1)C=CC5)c6c2cc(cc6)OC &lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=51026319"&gt;51026319&lt;/a&gt;
&lt;/pre&gt;

&lt;/P&gt;&lt;P&gt;

Feel free to &lt;a href="http://dalkescientific.blogspot.com/2012/05/topologically-non-planar-molecules.html"&gt;leave a comment&lt;/a&gt;.

&lt;/P&gt;

</description><guid isPermaLink="true">http://www.dalkescientific.com/writings/diary/archive/2012/05/18/nonplanar_compounds.html</guid><pubDate>Fri, 18 May 2012 12:00:00 GMT</pubDate></item><item><title>Finding the MCSes for the ChEBI ontology</title><link>http://www.dalkescientific.com/writings/diary/archive/2012/05/13/mcs_chebi.html</link><description>&lt;P&gt;
This is part 3 of a series on MCS:
&lt;ol&gt;
 &lt;li&gt;&lt;a href="http://dalkescientific.com/writings/diary/archive/2012/05/12/mcs_background.html"&gt;MCS background&lt;/a&gt;&lt;/li&gt;
 &lt;li&gt;&lt;a href="http://dalkescientific.com/writings/diary/archive/2012/05/13/fmcs.html"&gt;fmcs - find the MCS of a set of compounds&lt;/a&gt;&lt;/li&gt;
 &lt;li&gt;&lt;b&gt;Finding the MCSes for the ChEBI ontology&lt;/b&gt;&lt;/li&gt;
&lt;/ol&gt;


&lt;/P&gt;&lt;P&gt;

The industrious folks at EBI have been developing &lt;a
href="http://www.ebi.ac.uk/chebi/"&gt;ChEBI&lt;/a&gt;, which expands to
"Chemical Entities of Biological Interest." Quoting Wikipedia, "[&lt;a href="http://en.wikipedia.org/wiki/ChEBI"&gt;ChEBI&lt;/a&gt;] is a database and ontology of molecular entities focused on 'small' chemical compounds, that is part of the Open Biomedical Ontologies effort."

&lt;/P&gt;&lt;P&gt;

They define several distinct ontologies. One is a chemical structure
ontology. For example, the identifier CHEBI:33567 contains
catecholamine, and a few examples of catecholamines are hexoprenaline
(CHEBI:37950), arbutamine (CHEBI:50580), L-isoprenaline
(CHEBI:6257). In addition, catecholamine is a catechol (CHEBI:33566),
which in turn is a benzenediol (CHEBI:33570), and so on. A group can
have more than one parent; catecholamine is also a monoamine molecular
messenger (CHEBI:25375).

&lt;/P&gt;&lt;P&gt;


The end result is a hierarchial structure. The bottom of the hierarchy
are structures, and intermediate nodes are such that all children of
the node have some common property.

&lt;/P&gt;&lt;P&gt;

Some of these common properties map directly to a common
substructure. For example, &lt;a
href="http://www.ebi.ac.uk/chebi/searchId.do?chebiId=CHEBI%3a33853"&gt;CHEBI:33853&lt;/a&gt;
contains phenols, so every compound under that node has "one or more
hydroxy groups attached to a benzene or other arene ring."

&lt;/P&gt;&lt;P&gt;

However, not all of them do. As Chepelev, Hastings, Ennis, Steinbeck,
and Dumontier pointed out in "&lt;a
href="http://www.biomedcentral.com/1471-2105/13/3"&gt;Self-organizing
ontology of biochemically relevant small molecules&lt;/a&gt;", BMC
Bioinformatics 2012, 13:3, the term "'ester' includes compounds that
conform to C(=O)OC (i.e. carboxylic esters) and C(=S)OC patterns, among others."

&lt;/P&gt;&lt;P&gt;

Other cases can't even be represented as SMARTS. They give "bicyclic"
as one such example.

&lt;/P&gt;
&lt;h2&gt;Can I find the MCS of all structures in a node in the ontology?&lt;/h2&gt;
&lt;P&gt;

I was curious to see if I could use their data set as a test of &lt;a
href="https://bitbucket.org/dalke/fmcs/"&gt;fmcs&lt;/a&gt;. If their
intermediate nodes have a machine-readable way to tell if it's a
purely substructure-based node, and if I could get the size
information, then I could get all the structures underneath it, find
the MCS, and compare my answer to theirs.

&lt;/P&gt;&lt;P&gt;

Alas, they don't have that annotation information. It's something they
are working on, but I didn't get the impression that it's a high
priority. (I don't see why it should, either.)

&lt;/P&gt;&lt;P&gt;

Still, it's an interesting thought - what if I were to generate the
MCS for all nodes, and visualize the results somehow?

&lt;/P&gt;&lt;P&gt;

It took a bit longer than I thought, but I finally downloaded their
ontology (in OBO format), parsed it, extracted the hierarchy, figured
out the compounds in each node, tossed out the structures that RDKit
couldn't parse, and the nodes which didn't have at least two remaining
structures in them.

&lt;/P&gt;&lt;P&gt;

One that was done, I let my MCS algorithm at it. It took about 50
minutes to process. (Well, I had a 15 second timeout on the MCS. I've
found that 15 seconds is usually good enough.)

&lt;/P&gt;&lt;P&gt;

I also developed a visualizer for the result, using Karen Schomburg's
&lt;a href="http://smartsview.zbh.uni-hamburg.de/"&gt;SMARTSviewer&lt;/a&gt; and
Daylight's &lt;a
href="http://www.daylight.com/daycgi_tutorials/depictmatch.cgi"&gt;depictmatch.cgi&lt;/a&gt;

&lt;/P&gt;
&lt;h2&gt;Oooh! More pictures!&lt;/h2&gt;
&lt;P&gt;

Here's a snapshot of one of the successful cases, &lt;a
href="http://www.ebi.ac.uk/chebi/searchId.do?chebiId=CHEBI:16648"&gt;CHEBI:16648&lt;/a&gt;,
which is dialkyl phosphate:

&lt;/P&gt;&lt;P&gt;
&lt;img src="http://dalkescientific.com/writings/mcs_dialkyl_phosphate.png"&gt;&lt;br /&gt;
&lt;P&gt;

Most of the results aren't as clear-cut. For example, &lt;a
href="http://www.ebi.ac.uk/chebi/searchId.do?chebiId=CHEBI:16389"&gt;CHEBI:16389&lt;/a&gt;
contains the ubiquinones. I found the MCS:

&lt;/P&gt;&lt;P&gt;
&lt;img src="http://dalkescientific.com/writings/mcs_ubiquinones.png"&gt;&lt;br /&gt;
&lt;P&gt;

which is nearly right, but the Wikipedia page for &lt;a
href="http://en.wikipedia.org/wiki/Coenzyme_Q10"&gt;Coenzyme_Q10&lt;/a&gt;
("Coenzyme Q&lt;sub&gt;10&lt;/sub&gt;, also known as ubiquinone, ...") shows a
methyl attached to the top-most oxygen this SMARTS depiction. This is because 
&lt;a href="http://www.ebi.ac.uk/chebi/searchId.do?chebiId=CHEBI:18238"&gt;CHEBI:18238&lt;/a&gt;
is a structure in the set which does not have that methyl attached!

&lt;/P&gt;&lt;P&gt;

It this methyl important? I don't know. I'm not a chemist, and this
requires expertise I simply don't have.

&lt;/P&gt;
&lt;h2&gt;An oopsie in the oxolanes?&lt;/h2&gt;
&lt;P&gt;

What I do know is that there's a mistake in the oxolanes, &lt;a
href="http://www.ebi.ac.uk/chebi/searchId.do?chebiId=CHEBI:18238"&gt;CHEBI:26912&lt;/a&gt;.
Wikipedia calls this &lt;a
href="http://en.wikipedia.org/wiki/Tetrahydrofuran"&gt;tetrahydrofuran&lt;/a&gt;
and says it's an 5-membered ring with the formula (CH2)4O. I would
write it as the SMILES/SMARTS "O1CCCC1".

&lt;/P&gt;&lt;P&gt;

However, my search finds only "OCCCC"; it doesn't find the
cycle. There shouldn't be a problem with this one so I investigated
further, wondering if it was a bug. It ended up that acetylblasticidin S
(&lt;a href="http://www.ebi.ac.uk/chebi/searchId.do?chebiId=CHEBI:2413"&gt;CHEBI:2413&lt;/a&gt;)
is considered an oxolane. A quick look at the structure though
shows that it has no 5-membered ring.

&lt;/P&gt;&lt;P&gt;

I think that's an annotation error. BTW, I do not envy the job of
annotator. There's a lot of data to review, and people like me end up
pointing out the mistakes, not the huge amount of work to get all the
other parts right.

&lt;/P&gt;
&lt;h2&gt;Even more pictures... most of the ChEBI ontology!&lt;/h2&gt;
&lt;P&gt;

Do you want to see the output of my full analysis? Do you have a lot
of memory on your computer? If so, download &lt;a
href="http://dalkescientific.com/fmcs_chebi.html.bz2"&gt;fmcs_chebi.html.bz2&lt;/a&gt;. It's
only 7.5 MB but it bzip2 uncompresses to 166 MB. Open fmcs_chebi.html
in your browser, and have fun! (Note: I'll probably delete it after a
month or so.)

&lt;/P&gt;&lt;P&gt;

BTW: the images are computed on-demand using servers from the
University of Hamburg and from Daylight. I didn't want to show
everything at once since that would put a huge demand on those
servers. Instead, you'll need to press the "Toggle images" button in
order to see the SMARTS and the graphical depiction of the matches.

&lt;/P&gt;
&lt;h3&gt;Comments&lt;/h3&gt;
&lt;P&gt;
If you have comments or questions,
&lt;a href="http://dalkescientific.blogspot.com/2012/05/maximum-common-substructures-and-fmcs.html"&gt;leave them here&lt;/a&gt;.
&lt;/P&gt;

</description><guid isPermaLink="true">http://www.dalkescientific.com/writings/diary/archive/2012/05/13/mcs_chebi.html</guid><pubDate>Sun, 13 May 2012 12:00:00 GMT</pubDate></item></channel></rss>