Dalke Scientific Software: More science. Less time.

statistics

While there is a standard Python numeric library (two, actually), there is no standard statistics library. The closest is Gary Strangman's stats.py package, which is also used as part of the SciPy distribution.

I'm going to write a program that reads an SD file, extracts a data field as a float, and prints the average and standard deviation. As my input I'll use one of the data files from PubChem, compounds_500001_510000.sdf.gz, and for now I'll use the computed molecular weight from the PUBCHEM_OPENEYE_MW field.

I need a way to get the fields from an SD file. I looked at the data file. Because I only need the first line after a data tag, and because all the records are in the correct format with no difficult ambiguities, I can write a new parser pretty quickly. This function takes a file handle as input and iterates through the records, returning a dictionary of tag values for each one. It uses the yield keyword, which turns the function into a generator.

def read_tags(infile):
    data = {}
    tag = None
    for line in infile:
        if tag is not None:
            # The previous line defined a tag
            # Read this line and reset, waiting until the next tag.
            data[tag] = line.rstrip()
            tag = None
            
        elif line.startswith("> <"):
            # This defines a new tag
            tag = line[3:line.index(">", 3)]
            
        elif line.startswith("$$$$"):
            # End of the record; yield the dictionary
            # of parsed values then reset for the next record
            yield data
            data = {}
    
if __name__ == "__main__":
    # Read directly from the compressed file
    import gzip
    filename = "/Users/dalke/databases/compounds_500001_510000.sdf.gz"
    infile = gzip.open(filename)
    
    for i, record in enumerate(read_tags(infile)):
        print record["PUBCHEM_OPENEYE_MW"]
        # stop after the record with index 10 (prints 11 records)
        if i == 10:
            break
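Before running it on the real data, the parser can be sanity-checked against a small SD-style fragment in memory. This is just a sketch: the two-record fragment and its values are made up, and the parser body is repeated so the snippet stands alone.

```python
from io import StringIO  # in-memory file handle, handy for testing

def read_tags(infile):
    # Same parser as above: remember a tag line, take the next line
    # as its value, and yield a dictionary at each "$$$$" separator.
    data = {}
    tag = None
    for line in infile:
        if tag is not None:
            data[tag] = line.rstrip()
            tag = None
        elif line.startswith("> <"):
            tag = line[3:line.index(">", 3)]
        elif line.startswith("$$$$"):
            yield data
            data = {}

sample = """\
> <PUBCHEM_OPENEYE_MW>
162.132

$$$$
> <PUBCHEM_OPENEYE_MW>
181.275

$$$$
"""
records = list(read_tags(StringIO(sample)))
```

The parser ignores everything outside the tag/value lines, so the missing mol blocks in the fragment don't matter.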
When I test the code I get the following, which is the expected output:
162.132
181.275
381.469
321.373
248.233
365.427
317.384
393.44
323.472
250.339
820.901

I want to report the average and standard deviation, so I'll modify the code to save the weights (converted to floats) into a list, then use the stats.py module to do the analysis.

import gzip
import stats

def read_tags(infile):
    data = {}
    tag = None
    for line in infile:
        if tag is not None:
            # The previous line defined a tag
            # Read this line and reset, waiting until the next tag.
            data[tag] = line.rstrip()
            tag = None
            
        elif line.startswith("> <"):
            # This defines a new tag
            tag = line[3:line.index(">", 3)]
            
        elif line.startswith("$$$$"):
            # End of the record; yield the dictionary
            # of parsed values then reset for the next record
            yield data
            data = {}

def main():
    filename = "/Users/dalke/databases/compounds_500001_510000.sdf.gz"
    # Read directly from the compressed file
    infile = gzip.open(filename)

    weights = []
    for i, record in enumerate(read_tags(infile)):
        weights.append(float(record["PUBCHEM_OPENEYE_MW"]))

    print len(weights), "records"
    print "average weight is", stats.lmean(weights)
    print "std. dev is", stats.lstdev(weights)
    
if __name__ == "__main__":
    main()
Run it and wait. And wait. It takes a while to read that many records. Ah, there it goes.
% time python read_tags.py
10000 records
average weight is 483.88321269
std. dev is 217.614702277
106.430u 11.100s 2:17.53 85.4%  0+0k 0+0io 0pf+0w
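As a cross-check on what stats.py reports, both quantities are only a few lines of plain Python. A sketch, with the caveat that I believe lstdev uses the n-1 sample form, so that's what this computes; the example weights are the first few values printed earlier.

```python
import math

def mean(values):
    return sum(values) / float(len(values))

def sample_stdev(values):
    # n-1 denominator: the sample standard deviation, which I
    # believe matches what stats.lstdev computes.
    m = mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))

weights = [162.132, 181.275, 381.469, 321.373, 248.233]
avg = mean(weights)
sd = sample_stdev(weights)
```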

In general this simple parser is insufficient because real SD files aren't always this clean. I'll rewrite the code to use OEChem's parser. It already has a file iterator that reads at the record level, so my adapter only needs to extract the tags from each record and put them into a dictionary. One nice thing about OEChem is its built-in support for compressed files.

Here's my test version. I wanted to see what it did before processing all 10,000 records.

from openeye.oechem import *

def read_tags(imolstream):
    for mol in imolstream:
        data = {}
        # If there are multiple fields with the same tag
        # then the last one is used.
        for pair in OEGetSDDataPairs(mol):
            data[pair.GetTag()] = pair.GetValue()
        yield data

def main():
    filename = "/Users/dalke/databases/compounds_500001_510000.sdf.gz"
    ifs = oemolistream(filename)

    for i, record in enumerate(read_tags(ifs.GetOEGraphMols())):
        print record["PUBCHEM_OPENEYE_MW"]
        if i == 10:
            break
    
if __name__ == "__main__":
    main()
Much easier to read, because I'm working closer to the data model than to the syntax. Here's the output:
% python read_tags.py
162.132
181.275
381.469
321.373
Warning: Stereochemistry corrected on atom number 2 of 500005
Warning: Stereochemistry corrected on atom number 4 of 500005
Warning: Stereochemistry corrected on atom number 7 of 500005
Warning: Stereochemistry corrected on atom number 8 of 500005
Warning: Stereochemistry corrected on atom number 16 of 500005
248.233
365.427
317.384
393.44
323.472
250.339
820.901
% 
Grr. OEChem's warnings go to stderr by default. I don't want to see them, so I'll change the output threshold to show only errors and higher severities.
if __name__ == "__main__":
    OEThrow.SetLevel(OEErrorLevel_Error)
    main()
As far as I know there's no simple way to get the warning messages redirected to Python. The best workaround I could figure out is to redirect them to a file and have the Python code check that file. OpenEye? You reading this? :)

(Heard back from someone at OpenEye. One solution is to use an oeosstream like this

>>> from openeye.oechem import *
>>> out = oeosstream()
>>> print str(out)

>>> OEThrow.SetOutputStream(out)
>>> mol = OEMol()
>>> OEParseSmiles(mol, "C1CC")
False
>>> print str(out)
Warning: Error parsing SMILES:
Warning: Unclosed ring.
Warning: C1CC
Warning:    ^


>>> 
You'll need to parse the error text yourself, similar to what I did in earlier essays. It's a bit tricky because the text was designed to be read by people, not to be parsed by programs. I wonder about the thread-safety of this approach.)
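Once the text is captured in the oeosstream, picking the individual messages back out is ordinary string work. Here's a rough sketch against the exact lines shown above; extract_warnings is my own name for illustration, not an OEChem call.

```python
def extract_warnings(text):
    # Keep the text after the "Warning: " prefix on each warning
    # line; leading spaces survive, so the "^" marker still lines up.
    prefix = "Warning: "
    return [line[len(prefix):] for line in text.splitlines()
            if line.startswith("Warning:")]

captured = """Warning: Error parsing SMILES:
Warning: Unclosed ring.
Warning: C1CC
Warning:    ^"""
messages = extract_warnings(captured)
```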

Why do I have the read_tags() adapter? It was there in the non-OE version mostly to convert a line reader into a record reader. The OEChem interface doesn't really need it since it has a perfectly acceptable way to get the tag data from the record. I could say that it's a buffering layer to make it easier to switch between different libraries, but that would be a retrospective justification, since I don't need that. (And if I did, I would make the new API act more like OEChem's, not the other way around.)

Dropping the adapter entirely gives this nice, simple code.

from openeye.oechem import *

def main():
    filename = "/Users/dalke/databases/compounds_500001_510000.sdf.gz"
    ifs = oemolistream(filename)

    for i, mol in enumerate(ifs.GetOEGraphMols()):
        weight = OEGetSDData(mol, "PUBCHEM_OPENEYE_MW")
        print weight
        if i == 10:
            break
    
if __name__ == "__main__":
    OEThrow.SetLevel(OEErrorLevel_Error)
    main()
And yes, the output is as expected.

Putting in the code needed to generate the statistics:

from openeye.oechem import *
import stats

def main():
    filename = "/Users/dalke/databases/compounds_500001_510000.sdf.gz"
    ifs = oemolistream(filename)

    weights = []
    for mol in ifs.GetOEGraphMols():
        weights.append(float(OEGetSDData(mol, "PUBCHEM_OPENEYE_MW")))

    print len(weights), "records"
    print "average weight is", stats.lmean(weights)
    print "std. dev is", stats.lstdev(weights)
    
if __name__ == "__main__":
    OEThrow.SetLevel(OEErrorLevel_Error)
    main()
and here are the results along with the timing:
% time python read_tags.py
10000 records
average weight is 483.88321269
std. dev is 217.614702277
31.250u 0.670s 0:36.36 87.7%    0+0k 0+0io 0pf+0w
% 
Nice! It's nearly four times faster than my simple Python parser (about 36 seconds vs. 2:18), and it's doing a lot more work.

I do have tricks up my sleeve that would make the Python code faster but it's just not worthwhile.
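For the record, one such trick: slurp the whole text and split on the record separator instead of looping line by line, which pushes more of the work into C-level string searches. A rough sketch, untimed here so no speed claim, with a made-up two-record fragment for the demo:

```python
from io import StringIO  # only used for the in-memory demo below

def read_tags_chunked(infile):
    # Read everything at once and split on the "$$$$" record
    # separator, then locate each "> <tag>" with string searches.
    # Looser than the line-based version: it matches "> <" anywhere,
    # and assumes well-formed records with trailing newlines.
    text = infile.read()
    for record in text.split("$$$$")[:-1]:
        data = {}
        pos = 0
        while True:
            start = record.find("> <", pos)
            if start == -1:
                break
            end = record.index(">", start + 3)
            tag = record[start + 3:end]
            # the value is the single line after the tag line
            value_start = record.index("\n", end) + 1
            value_end = record.index("\n", value_start)
            data[tag] = record[value_start:value_end].rstrip()
            pos = value_end
        yield data

sample = """\
> <MW>
162.132

$$$$
> <MW>
181.275

$$$$
"""
records = list(read_tags_chunked(StringIO(sample)))
```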


Andrew Dalke is an independent consultant focusing on software development for computational chemistry and biology. Need contract programming, help, or training? Contact me



Copyright © 2001-2013 Andrew Dalke Scientific AB