
Simple lazy dependencies

Background

I do a lot of work in computational chemistry, and more specifically in chemical informatics. The scientists I work with develop ways (sometimes successfully) to understand how a small molecule will work in a living system. For example: is the molecule likely to go through the blood/brain barrier (BBB)? I'll call this a "predictive model" because in a bit I'll use "model" for a different concept.

A predictive model may be built on top of other models. A great example is a consensus model which merges the results of three other models. At some point the models use values from general chemistry and not biochemistry. These are things like molecular weight, graph theoretical measures, and computed logP. These terms are called "descriptors." In turn these are based on the chemical structure. Often the input structure needs cleanup and registration: removing salts, fixing the charge, canonicalizing tautomers, and other stages of input data normalization.

Normally these steps are done imperatively. "Read structure. Register. Compute molecular weight and clogp. Compute BBB predictive models #1, #2 and #3. Merge the results into a consensus model. Save result." Keeping track of the dependencies is hard, and the imperative approach breaks down quickly as the number of models and descriptors grows.

A simple lazy dependency system

I developed a dependency tracking system many years ago for one of my clients, CombiChem, back when they were part of DuPont Pharma. It wasn't all that complicated or sophisticated. Later I implemented a similar but much simpler system for another client, AstraZeneca. It's simpler because, unlike the CombiChem version, the AZ one assumes that once a value is computed it will never change.

Here is the heart of the AZ system. Note that this "Model" has little to do with "predictive model."

class Model(object):
    def __init__(self, data=None, resources=None):
        if data is None:
            data = {}
        if resources is None:
            resources = {}
        self.data = data            # initial values plus the computed cache
        self.resources = resources  # property name -> handler callable

    def __getitem__(self, key):
        # Use the cached value if the property was already computed.
        try:
            return self.data[key]
        except KeyError:
            pass

        # Otherwise find the handler and call it.  The handler must
        # store the requested property before it returns.
        resource = self.resources[key]
        resource(key, self)  # the handler may recursively query the model
        return self.data[key]

    def __setitem__(self, key, value):
        self.data[key] = value

The 'data' dictionary contains initial values and any other computed values, cached for later. The 'resources' dictionary contains handlers, which are callables. If the requested data item is not in the data dictionary then the associated resource is looked up and called. The resource must set the expected property, which is then returned to the caller. Setting an item saves it in the data dictionary.

The trick here is that the call can be recursive. The handler gets the name of the desired property and the model object. If it needs another property it can turn around and query the model for it.
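
To make the protocol concrete, here's a tiny example (the "double" resource is made up for this post, not from the AZ system):

def double(name, model):
    # A handler computes the value and stores it on the model.
    model[name] = model["x"] * 2   # reading "x" goes through __getitem__

m = Model(data={"x": 21}, resources={"x_doubled": double})
assert m["x_doubled"] == 42     # triggers double(), which reads "x"
assert "x_doubled" in m.data    # cached; double() won't be called again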

StringModel - an example

For example, here's a StringModel which takes a string and an option to force it to lower case for use by the analysis handlers.

def normalize(name, model):
    if model["force_lowercase"]:
        model["_text"] = model["text"].lower()
    else:
        model["_text"] = model["text"]


def num_letter_a(name, model):
    model["num_letter_a"] =  model["_text"].count("a")

def num_letter_i(name, model):
    model["num_letter_i"] = model["_text"].count("i")

string_resources = {
    "_text": normalize,
    "num_letter_a": num_letter_a,
    "num_letter_i": num_letter_i,
    }

class StringModel(Model):
    def __init__(self, text, force_lowercase=True):
        Model.__init__(self,
                       dict(text=text, force_lowercase=force_lowercase),
                       string_resources)
Note that the "num_letter_{a,i}" functions use the internal "_text" property rather than the intial "text" property. "_text" is normalized to lower-case by default, else it's the actual input string.

Making and using a StringModel is easy:

smodel = StringModel("I came, I saw, I kicked butt.")
assert smodel["num_letter_a"] == 2
assert smodel["num_letter_i"] == 4

smodel = StringModel("I came, I saw, I kicked butt.",
                     force_lowercase=False)
assert smodel["num_letter_a"] == 2
assert smodel["num_letter_i"] == 1

(Thanks to Marcin Feder for spotting and correcting mistakes in my original code for the StringModel and the asserts.)

A resource may compute multiple properties at once, as

LC_VOWELS = "aeiou"
LC_CONSONANTS = "bcdfghjklmnpqrstvwxyz"

def compute_lc_vowel_and_consonant_counts(name, model):
    text = model["_text"]  # the normalized string
    d = dict.fromkeys(text, 0)
    for c in text:
        d[c] += 1

    model["num_lc_vowels"] = sum([d.get(c, 0) for c in LC_VOWELS])
    model["num_lc_consonants"] = sum([d.get(c, 0) for c in LC_CONSONANTS])

which is added to the "string_resources" as

string_resources = {
    "_text": normalize,
    "num_letter_a": num_letter_a,
    "num_letter_i": num_letter_i,
    "num_lc_vowels": compute_lc_vowel_and_consonant_counts,
    "num_lc_consonants": compute_lc_vowel_and_consonant_counts,
    }
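
Asking for either property runs the shared handler once and caches both values. The counts in this check are mine, tallied by hand:

smodel = StringModel("I came, I saw, I kicked butt.")
assert smodel["num_lc_vowels"] == 9
assert "num_lc_consonants" in smodel.data   # computed by the same call
assert smodel["num_lc_consonants"] == 11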

The "name" parameter is the name of requested descriptor. It's most often used for similar properties which can share much of the same code through a function or class instance. For example, here's an adapter to convert value into roman numbers. The descriptor "X_roman" is the roman numeral form of the descriptor "X".

roman_numerals = dict(
    zip(range(16),
        "* I II III IV V VI VII VIII IX X XI XII XIII XIV XV".split()))

def roman(name, model):
    if not name.endswith("_roman"):
        raise TypeError("unexpected descriptor name %r" % (name,))
    # name[:-6] strips the "_roman" suffix to get the base descriptor
    model[name] = roman_numerals[model[name[:-6]]]

and it's registered as

string_resources = {
    "_text": normalize,
    "num_letter_a": num_letter_a,
    "num_letter_i": num_letter_i,
    "num_lc_vowels": compute_lc_vowel_and_consonant_counts,
    "num_lc_consonants": compute_lc_vowel_and_consonant_counts,
    "num_letter_a_roman": roman,
    "num_letter_i_roman": roman,
    }
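
A quick check of the adapter (again, my own example). Note the recursion: asking for "num_letter_a_roman" pulls in "num_letter_a", which in turn pulls in "_text":

smodel = StringModel("I came, I saw, I kicked butt.")
assert smodel["num_letter_a_roman"] == "II"
assert smodel["num_letter_i_roman"] == "IV"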

Step 3: Success

Developing the first version for AZ, including testing and adding features over time like trace logging and error handling, took perhaps two weeks altogether. That includes code review and training of the other people on the team, as well as porting existing compute functions to the new system. They really like how easy it is to add new descriptors, and they continue to contract my services in part to add new resources, usually only the hard ones which require combined knowledge of unix and chemistry.

The approach I used is sometimes called "declarative", because there is a set of rules which declare how to compute a result given input, rather than an ordered set of instructions. Famous declarative systems include Prolog and the CLIPS expert system. My first published paper was in the proceedings of a CLIPS conference, in 1992, so it's not like I came up with this on my own.

Python is not a declarative language. While it is possible to introspect a function to determine most of its dependencies on other variables, part of the elegance of my solution was to ignore that hard problem and compute dependencies only when needed. This is called lazy evaluation. The chain of dependencies is pull-oriented, meaning that you set up the network of resources and ask for a value. That value depends on others, which depend on others, which ... until it gets to the input data.

Compare that to most dataflow systems, like the commercial Pipeline Pilot for chemistry or EBI's open-source Taverna for bioinformatics. From what I've seen those are push-oriented. When all input data is available for a node, it computes its values and pushes the results to the next nodes in the network. Drop a compound at the top of the network and it will merrily compute all properties. If you only want to compute a few properties you need to make a new network, and it must be ordered correctly. In a pull system you can use the same network -- AZ's has several hundred compute rules and over 1,000 properties -- and only get what you asked for.

In theory I could keep track of the resources used by each resource. Then if a value changes I can remove every computed value which depends on it from the cache and recompute them the next time they're needed. This is pretty complicated and not something I needed for the project. It would be needed in a more interactive research environment where someone might ask "what's the change in the XYZ prediction if the pH was dropped by 0.1?" or "how does this conformational change affect the binding quality?"
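
Here's a minimal sketch of one way that dependency tracking could look. It is hypothetical (the TrackingModel name and its methods are mine, not part of the AZ system): record which properties each running resource reads, then invalidate transitively when a value changes.

class TrackingModel(Model):
    def __init__(self, data=None, resources=None):
        Model.__init__(self, data, resources)
        self.dependents = {}  # property name -> set of names computed from it
        self._active = []     # stack of property names being computed

    def __getitem__(self, key):
        # Record that the property currently being computed read 'key'.
        if self._active:
            self.dependents.setdefault(key, set()).add(self._active[-1])
        try:
            return self.data[key]
        except KeyError:
            pass
        resource = self.resources[key]
        self._active.append(key)
        try:
            resource(key, self)
        finally:
            self._active.pop()
        return self.data[key]

    def __setitem__(self, key, value):
        # A resource may set extra properties (like the vowel/consonant
        # pair); treat them as depending on the requested property so
        # they are invalidated together.
        if self._active and key != self._active[-1]:
            self.dependents.setdefault(self._active[-1], set()).add(key)
        self.data[key] = value

    def invalidate(self, key):
        # Remove everything computed, directly or indirectly, from 'key'.
        for name in self.dependents.pop(key, ()):
            self.data.pop(name, None)
            self.invalidate(name)

    def change(self, key, value):
        self.invalidate(key)
        self.data[key] = value

After tmodel.change("text", "some new text"), the next request for "num_letter_a" recomputes "_text" and the count instead of returning stale cached values.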

Really new ideas are rare in any field. What I did was nice and useful, and new to me and my clients, but others have worked on the same problem. One solution, with dependency tracking, is PyCells developed by Ryan Forsythe and based on Ken Tilton's Cells code for Lisp. I'll talk more about it next time.

