statistics
While there is a standard Python numeric library (two, actually), there is no standard statistics library. The closest is Gary Strangman's stats.py package, which is also distributed as part of SciPy.
I'm going to write a program that reads an SD file, extracts a data field as a float, and prints the average and standard deviation. As input I'll use one of the data files from PubChem, compounds_500001_510000.sdf.gz, and for now I'll use the computed molecular weight from the PUBCHEM_OPENEYE_MW field.
I need a way to get the fields from an SD file. I looked at the data file. Because I only need the first line after a data tag, and because all the records are going to be in the correct format, with no difficult ambiguities, I can write a new parser pretty quickly. This function takes a file handle as input and iterates through each of the records. For each record it returns a dictionary of the tag values. It uses the yield keyword, which turns the function into a generator.
def read_tags(infile):
    data = {}
    tag = None
    for line in infile:
        if tag is not None:
            # The previous line defined a tag.
            # Read this line then reset, waiting for the next tag.
            data[tag] = line.rstrip()
            tag = None
        elif line.startswith("> <"):
            # This defines a new tag
            tag = line[3:line.index(">", 3)]
        elif line.startswith("$$$$"):
            # End of the record; yield the dictionary
            # of parsed values then reset for the next record
            yield data
            data = {}

if __name__ == "__main__":
    # Read directly from the compressed file
    import gzip
    filename = "/Users/dalke/databases/compounds_500001_510000.sdf.gz"
    infile = gzip.open(filename)
    for i, record in enumerate(read_tags(infile)):
        print record["PUBCHEM_OPENEYE_MW"]
        # only show the first 11 records
        if i == 10:
            break

When I test the code I get the following, which is the expected output.
162.132
181.275
381.469
321.373
248.233
365.427
317.384
393.44
323.472
250.339
820.901
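To make the tag layout concrete, here's a minimal, self-contained sketch of the same parsing logic applied to an invented two-field record. Real PubChem records also contain a molfile block and many more data fields, so this is only an illustration.

```python
# A hypothetical, heavily trimmed SD record for illustration only.
record = """\
> <PUBCHEM_COMPOUND_CID>
500001

> <PUBCHEM_OPENEYE_MW>
162.132

$$$$
"""

data = {}
tag = None
for line in record.splitlines(True):
    if tag is not None:
        # The line after a tag line holds that tag's value
        data[tag] = line.rstrip()
        tag = None
    elif line.startswith("> <"):
        # Extract the text between "> <" and the closing ">"
        tag = line[3:line.index(">", 3)]

print(data["PUBCHEM_OPENEYE_MW"])
```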
I want to report the average and standard deviation, so I'll modify the code somewhat to save the weights (after converting each to a float) into a list, then use the stats.py module to do the analysis.
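For reference, stats.py's lmean is the arithmetic mean and lstdev is, as I understand it, the sample standard deviation (dividing by n-1). A minimal sketch of the two calculations in plain Python, using a few invented weights:

```python
import math

def mean(values):
    # Arithmetic mean
    return sum(values) / float(len(values))

def sample_stdev(values):
    # Sample standard deviation: sum of squared deviations
    # from the mean, divided by n-1, then square-rooted.
    m = mean(values)
    ss = sum((v - m) ** 2 for v in values)
    return math.sqrt(ss / (len(values) - 1))

weights = [162.132, 181.275, 381.469, 321.373]
print(mean(weights))
print(sample_stdev(weights))
```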
import gzip
import stats

def read_tags(infile):
    data = {}
    tag = None
    for line in infile:
        if tag is not None:
            # The previous line defined a tag.
            # Read this line then reset, waiting for the next tag.
            data[tag] = line.rstrip()
            tag = None
        elif line.startswith("> <"):
            # This defines a new tag
            tag = line[3:line.index(">", 3)]
        elif line.startswith("$$$$"):
            # End of the record; yield the dictionary
            # of parsed values then reset for the next record
            yield data
            data = {}

def main():
    filename = "/Users/dalke/databases/compounds_500001_510000.sdf.gz"
    # Read directly from the compressed file
    infile = gzip.open(filename)
    weights = []
    for record in read_tags(infile):
        weights.append(float(record["PUBCHEM_OPENEYE_MW"]))
    print len(weights), "records"
    print "average weight is", stats.lmean(weights)
    print "std. dev is", stats.lstdev(weights)

if __name__ == "__main__":
    main()

Run it and wait. And wait. It takes a while to read that many records. Ah, there it goes.
% time python read_tags.py
10000 records
average weight is 483.88321269
std. dev is 217.614702277
106.430u 11.100s 2:17.53 85.4%  0+0k 0+0io 0pf+0w
In general this simple parser is insufficient, because real SD files aren't always this clean. I'll rewrite the code to use OEChem's parser. It already has a file iterator that reads at the record level, so my adapter only needs to extract the tags from each record and put them into a dictionary. One nice thing about OEChem is its built-in support for compressed files.
Here's my test version. I wanted to see what it did before processing all 10,000 records.
from openeye.oechem import *

def read_tags(imolstream):
    for mol in imolstream:
        data = {}
        # If there are multiple fields with the same tag
        # then the last one is used.
        for pair in OEGetSDDataPairs(mol):
            data[pair.GetTag()] = pair.GetValue()
        yield data

def main():
    filename = "/Users/dalke/databases/compounds_500001_510000.sdf.gz"
    ifs = oemolistream(filename)
    for i, record in enumerate(read_tags(ifs.GetOEGraphMols())):
        print record["PUBCHEM_OPENEYE_MW"]
        if i == 10:
            break

if __name__ == "__main__":
    main()

Much easier to read, because I'm working closer to the data model than to the syntax. Here's the output:
% python read_tags.py
162.132
181.275
381.469
321.373
Warning: Stereochemistry corrected on atom number 2 of 500005
Warning: Stereochemistry corrected on atom number 4 of 500005
Warning: Stereochemistry corrected on atom number 7 of 500005
Warning: Stereochemistry corrected on atom number 8 of 500005
Warning: Stereochemistry corrected on atom number 16 of 500005
248.233
365.427
317.384
393.44
323.472
250.339
820.901
%

Grr, OEChem's warnings go to stderr by default. I don't want to see them, so I'll change the output threshold to show only errors and higher severities.
if __name__ == "__main__":
    OEThrow.SetLevel(OEErrorLevel_Error)
    main()

As far as I know there's no simple way to get the warning messages redirected to Python. The best workaround I could figure out is to redirect them to a file and have the Python code check that file. OpenEye? You reading this? :)
(Heard back from someone at OpenEye. One solution is to use an oeosstream like this
>>> from openeye.oechem import *
>>> out = oeosstream()
>>> print str(out)

>>> OEThrow.SetOutputStream(out)
>>> mol = OEMol()
>>> OEParseSmiles(mol, "C1CC")
False
>>> print str(out)
Warning: Error parsing SMILES:
Warning: Unclosed ring.
Warning: C1CC
Warning:    ^
>>>

You'll need to parse the error text yourself, similar to what I did in earlier essays. It's a bit tricky because the text was designed to be read by people, not parsed by programs. I wonder about the thread-safety of this approach.)
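Here's a hedged sketch of what parsing that captured text might look like, using the stereochemistry warnings from earlier as sample input. The regular expression is my own guess at the message format, not anything documented by OpenEye:

```python
import re

# Sample warning text, copied from the output shown earlier.
text = """\
Warning: Stereochemistry corrected on atom number 2 of 500005
Warning: Stereochemistry corrected on atom number 16 of 500005
"""

# This pattern matches only this one message format; other OEChem
# warnings would each need their own pattern.
pattern = re.compile(
    r"Warning: Stereochemistry corrected on atom number (\d+) of (\S+)")

for match in pattern.finditer(text):
    atom_number, title = match.groups()
    print(atom_number + " in record " + title)
```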
Why do I have the read_tags() adapter? In the non-OE version it was there mostly to convert a line reader into a record reader. The OEChem interface doesn't really need it, since it already has a perfectly acceptable way to get the tag data from a record. I could say it's a buffering layer that makes it easier to switch between different libraries. That would be a retrospective justification, since I don't need that. (And if I did, I would make the new API act more like OEChem's, not the other way around.)
Dropping the adapter entirely gives this nice, simple code.
from openeye.oechem import *

def main():
    filename = "/Users/dalke/databases/compounds_500001_510000.sdf.gz"
    ifs = oemolistream(filename)
    for i, mol in enumerate(ifs.GetOEGraphMols()):
        weight = OEGetSDData(mol, "PUBCHEM_OPENEYE_MW")
        print weight
        if i == 10:
            break

if __name__ == "__main__":
    OEThrow.SetLevel(OEErrorLevel_Error)
    main()

And yes, the output is as expected.
Putting in the code needed to generate the statistics
from openeye.oechem import *
import stats

def main():
    filename = "/Users/dalke/databases/compounds_500001_510000.sdf.gz"
    ifs = oemolistream(filename)
    weights = []
    for mol in ifs.GetOEGraphMols():
        weights.append(float(OEGetSDData(mol, "PUBCHEM_OPENEYE_MW")))
    print len(weights), "records"
    print "average weight is", stats.lmean(weights)
    print "std. dev is", stats.lstdev(weights)

if __name__ == "__main__":
    OEThrow.SetLevel(OEErrorLevel_Error)
    main()

and here's the result along with the timing:
% time python read_tags.py
10000 records
average weight is 483.88321269
std. dev is 217.614702277
31.250u 0.670s 0:36.36 87.7%  0+0k 0+0io 0pf+0w
%

Nice! It's more than three times faster than my simple Python parser, and it's doing a lot more work.
I do have tricks up my sleeve that would make the Python code faster, but it's just not worthwhile.
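For the curious, one candidate trick (a sketch of the idea, not production code) is to skip line-by-line parsing entirely when only one field is needed: split the text on the record terminator and use string searches to jump straight to the tag. This assumes the same clean, well-formed records as before; the sample data here is invented and shows only the data fields.

```python
def extract_field(text, tag):
    # Yield the value line for the given tag in each record.
    # Assumes clean records, with the tag line ending in ">".
    marker = "> <%s>" % tag
    for record in text.split("$$$$"):
        start = record.find(marker)
        if start == -1:
            continue
        # The value is the line immediately after the tag line
        value_start = record.index("\n", start) + 1
        value_end = record.index("\n", value_start)
        yield record[value_start:value_end]

# Invented two-record sample, data fields only.
sample = (
    "> <PUBCHEM_OPENEYE_MW>\n162.132\n\n$$$$\n"
    "> <PUBCHEM_OPENEYE_MW>\n181.275\n\n$$$$\n"
)
print(list(extract_field(sample, "PUBCHEM_OPENEYE_MW")))
```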
Andrew Dalke is an independent consultant focusing on software development for computational chemistry and biology. Need contract programming, help, or training? Contact me
Copyright © 2001-2020 Andrew Dalke Scientific AB