/home/writings/diary/archive/2014/09/26/format_api

chemfp's format API

This is part of a series of essays describing the reasons behind chemfp's new APIs. This one, about the new format API, is a lot shorter than the previous one on parse_molecule() and parse_id_and_molecule().

It's sometimes useful to know which cheminformatics formats are available, if only to display a help message or pulldown menu. The get_formats() toolkit function returns a list of available formats, as 'Format' instances.

>>> from chemfp import rdkit_toolkit as T
>>> T.get_formats()
[Format('rdkit/canstring'), Format('rdkit/inchikey'),
Format('rdkit/usmstring'), Format('rdkit/smistring'),
Format('rdkit/molfile'), Format('rdkit/usm'),
Format('rdkit/inchikeystring'), Format('rdkit/sdf'),
Format('rdkit/can'), Format('rdkit/smi'), Format('rdkit/inchi'),
Format('rdkit/rdbinmol'), Format('rdkit/inchistring')]
(The next version of chemfp will likely support RDKit's relatively new PDB reader.)

You can ask a format for its name, or see if it is an input format or output format by checking "is_input_format" and "is_output_format", respectively. If you just want the list of input formats or output formats, use get_input_formats() or get_output_formats().

Here's an example to show which output formats are not also input formats:

>>> [format.name for format in T.get_output_formats() if not format.is_input_format]
['inchikey', 'inchikeystring']

You may recall that some formats are record-based and others are, for lack of a better word, "string-based". The latter include "smistring", "inchistring", and "inchikeystring". These are not records in their own right, so they can't be read from or written to a file.

I really couldn't come up with a good predicate which described those formats. The closest was "is_a_record". I ended up with "supports_io". I'm not happy with the name. If true, the format can be used in file I/O.

The RDKit input formats which do not support I/O are the expected ones ... and rdbinmol.

>>> [format.name for format in T.get_input_formats() if not format.supports_io]
['canstring', 'usmstring', 'smistring', 'molfile', 'rdbinmol', 'inchistring']
(The "rdbinmol" is an experimental format. It's the byte string from calling an RDKit molecule's "ToBinary()" method, which is also the basis for its pickle support.)
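
As a rough sketch of the help-message use case, here is how the three predicates might be combined. It does not import chemfp: FakeFormat is a hypothetical stand-in that only models the attribute names used above, and the format table is illustrative rather than the real RDKit list.

```python
from collections import namedtuple

# Hypothetical stand-in for chemfp's Format objects. Only the attribute
# names come from the essay; the values below are illustrative.
FakeFormat = namedtuple(
    "FakeFormat", "name is_input_format is_output_format supports_io")

FORMATS = [
    FakeFormat("smi", True, True, True),
    FakeFormat("sdf", True, True, True),
    FakeFormat("smistring", True, True, False),
    FakeFormat("inchikey", False, True, True),
    FakeFormat("inchikeystring", False, True, False),
]

def file_input_choices(formats):
    """Return the sorted names of formats that can be read from a file."""
    return sorted(f.name for f in formats
                  if f.is_input_format and f.supports_io)

def help_text(formats):
    """Build a one-line help message listing the readable file formats."""
    return "Input file formats: " + ", ".join(file_input_choices(formats))

print(help_text(FORMATS))
```

With the real toolkit, the list would presumably come from get_input_formats() instead of a hand-written table.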

get_format() and compression

You can get a specific format by name using get_format(). This can also be used to specify a compressed format:

>>> T.get_format("sdf")
Format('rdkit/sdf')
>>> T.get_format("smi.gz")
Format('rdkit/smi.gz')
>>> format = T.get_format("smi.gz")
>>> format
Format('rdkit/smi.gz')
>>> format.name
'smi'
>>> format.compression
'gz'
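
To make the name/compression split concrete, here is a minimal sketch of how a string like "smi.gz" could be taken apart. The function name, the list of known compressions, and the use of an empty string for "no compression" are all assumptions for illustration, not chemfp's actual code.

```python
def split_format_name(format_name, known_compressions=("gz",)):
    """Split 'smi.gz' into ('smi', 'gz').

    Names without a recognized compression suffix are returned
    unchanged, paired with an empty string (an assumption of this sketch).
    """
    base, dot, suffix = format_name.rpartition(".")
    if dot and suffix in known_compressions:
        return base, suffix
    return format_name, ""

print(split_format_name("smi.gz"))
print(split_format_name("sdf"))
```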

Default reader and writer arguments

Toolkit- and format-specific arguments were a difficult challenge. I want chemfp to support multiple toolkits, because I know people work with fingerprints from multiple toolkits. Each of the toolkits has its own way to parse and generate records. I needed some way to have a common API but with a mechanism to control the underlying toolkit options.

The result is reader_args, which I discussed in the previous essay, and the writer_args complement for turning a molecule into a record.

A Format instance can be toolkit specific; the "rdkit/smi.gz" is an RDKit format. (The toolkit name is available from the aptly named attribute 'toolkit_name'.) Each Format has a way to get the default reader_args and writer_args for the format:

>>> format = T.get_format("smi.gz")
>>> format
Format('rdkit/smi.gz')
>>> format.get_default_reader_args()
{'delimiter': None, 'has_header': False, 'sanitize': True}
>>> format.get_default_writer_args()
{'isomericSmiles': True, 'delimiter': None, 'kekuleSmiles': False, 'allBondsExplicit': False, 'canonical': True}
This is especially useful if you are on the interactive prompt and have forgotten the option names.

Convert text settings into arguments

The -R command-line options for the chemfp tools rdkit2fps, ob2fps, and oe2fps let users set the reader_args. If your target molecules are in a space-delimited SMILES file then you can set the 'delimiter' option to 'space':

oe2fps -R delimiter=space targets.smi.gz -o targets.fps

or ask RDKit to disable sanitization using:

rdkit2fps -R sanitize=false targets.smi.gz -o targets.fps

The -R takes string keys and values. On the other hand, reader_args take a dictionary with string keys but possibly integers and booleans as values. You could write the converter yourself, but that gets old very quickly. Instead, I included it as the format's get_reader_args_from_text_settings() method. (The *2fps programs don't generate structure output, but if they did the equivalent command-line flag would be -W, and the equivalent format method is get_writer_args_from_text_settings().)

Yes, I agree that get_..._settings() is a very long name. I couldn't think of a better one. I decided that "text settings" are the reader_args and writer_args expressed as a dictionary with string names and string values.

I'll use that long named function to convert some text settings into proper reader_args:

>>> format.get_reader_args_from_text_settings({
...    "delimiter": "tab",
...    "sanitize": "false",
... })
{'delimiter': 'tab', 'sanitize': False}
You can see that the text "false" was converted into the Python False value.
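
For a sense of what "write the converter yourself" involves, here is a minimal sketch that goes from "name=value" command-line items to text settings and then to typed arguments. Everything in it is hypothetical: the helper names, the converter table, and the rule that only "true" and "false" are accepted as booleans are assumptions, not chemfp's implementation.

```python
def parse_bool(text):
    """Convert 'true' or 'false' (any case) into a Python bool."""
    lowered = text.strip().lower()
    if lowered == "true":
        return True
    if lowered == "false":
        return False
    raise ValueError("expected 'true' or 'false'")

# Hypothetical table of per-option converters for one format.
CONVERTERS = {
    "delimiter": str,
    "has_header": parse_bool,
    "sanitize": parse_bool,
}

def text_settings_from_items(items):
    """Turn ['delimiter=tab', 'sanitize=false'] into a dict of strings."""
    settings = {}
    for item in items:
        name, sep, value = item.partition("=")
        if not sep:
            raise ValueError("setting %r must look like name=value" % (item,))
        settings[name] = value
    return settings

def args_from_text_settings(text_settings, converters=CONVERTERS):
    """Convert string values into typed values, skipping unknown names."""
    args = {}
    for name, text in text_settings.items():
        if name not in converters:
            continue  # unknown names are ignored in this sketch
        try:
            args[name] = converters[name](text)
        except ValueError as err:
            raise ValueError("Unable to parse setting %s (%r): %s"
                             % (name, text, err))
    return args

settings = text_settings_from_items(["delimiter=tab", "sanitize=false"])
print(args_from_text_settings(settings))
```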

Namespaces

Names like "delimiter" and "sanitize" are 'unqualified' and apply for every toolkit and every format which accept them. This makes sense for "delimiter" because it's pointless to have OEChem parse a SMILES file using a different delimiter style than RDKit. It's acceptable for "sanitize" because only RDKit knows what it means, and the other toolkits will ignore unknown names. For many cases then you could simply do something like:

reader_args = {
  "delimiter": "tab",       # for SMILES files
  "strictParsing": False,   # for RDKit SDF
  "perceive_stereo": True,  # for Open Babel SDF
  "aromaticity": "daylight", # for all OEChem readers
}

At the moment the toolkits all use different option names for the same format, so there's no conflict there. But toolkits do use the same option name on different formats, and there can be a good reason why the value for SMILES output is different from the value for SDF record output.

The best example is OEChem, which uses a "flavor" flag to specify the input and output options for all formats. (For chemfp I decided to split OEChem's flavor into 'flavor' and 'aromaticity' reader and writer arguments. I leave that discussion for elsewhere.) I'll start by making an OEGraphMol.

from chemfp import openeye_toolkit

phenol = "c1ccccc1[16OH]"
oemol = openeye_toolkit.parse_molecule(phenol, "smistring")

Even though "smistring" output by default generates the canonical isomeric SMILES for the record, I can ask it to generate a different output flavor. For convenience, the flavor value can be an integer, which is treated as the flavor bitmask, or it can be a string of "|" or "," separated bitmask names. Usually the bitmask names are or'ed together, but a leading "-" means to unset the corresponding bits for that flag.

>>> openeye_toolkit.create_string(oemol, "smistring")
'c1ccc(cc1)[16OH]'
>>> openeye_toolkit.create_string(oemol, "smistring",
...      writer_args={"flavor": "Default"})
'c1ccc(cc1)[16OH]'
>>> openeye_toolkit.create_string(oemol, "smistring",
...      writer_args={"flavor": "Default,-Isotopes"})
'c1ccc(cc1)O'
>>> openeye_toolkit.create_string(oemol, "smistring",
...      writer_args={"flavor": "Canonical|Kekule|Isotopes"})
'C1=CC=C(C=C1)[16OH]'
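
The flavor string rules (names separated by "|" or ",", or'ed together, with a leading "-" clearing bits) can be sketched in a few lines. The bit values and the definition of "Default" below are made up for illustration; the real values come from OEChem, and this is not chemfp's parser.

```python
# Made-up bit values for illustration only; OEChem defines the real ones.
FLAVOR_BITS = {"Canonical": 1, "Kekule": 2, "Isotopes": 4}
FLAVOR_BITS["Default"] = FLAVOR_BITS["Canonical"] | FLAVOR_BITS["Isotopes"]

def parse_flavor(value, bits=FLAVOR_BITS):
    """Convert an integer or a flavor string into a bitmask."""
    if isinstance(value, int):
        return value  # integers are used directly as the bitmask
    flavor = 0
    for term in value.replace("|", ",").split(","):
        term = term.strip()
        if not term:
            continue
        unset = term.startswith("-")
        name = term[1:] if unset else term
        if name not in bits:
            raise ValueError("unknown flavor name %r" % (name,))
        if unset:
            flavor &= ~bits[name]  # a leading "-" clears those bits
        else:
            flavor |= bits[name]   # otherwise the bits are or'ed in
    return flavor

print(parse_flavor("Default,-Isotopes"))
print(parse_flavor("Canonical|Kekule|Isotopes"))
```

Terms are applied left to right in this sketch, so "Default,-Isotopes" and "-Isotopes,Default" give different results.
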
Here I'll ask for the SDF record output in V3000 format. (In the future I expect to have a special "sdf3" or "sdf3000" format, to make it easier to specify V3000 output across all toolkits.)
>>> print(openeye_toolkit.create_string(oemol, "sdf",
...        writer_args={"flavor": "Default|MV30"}))

  -OEChem-09261411132D

  0  0  0     0  0            999 V3000
M  V30 BEGIN CTAB
M  V30 COUNTS 7 7 0 0 0
M  V30 BEGIN ATOM
M  V30 1 C 0 0 0 0
M  V30 2 C 0 0 0 0
M  V30 3 C 0 0 0 0
M  V30 4 C 0 0 0 0
M  V30 5 C 0 0 0 0
M  V30 6 C 0 0 0 0
M  V30 7 O 0 0 0 0 MASS=16
M  V30 END ATOM
M  V30 BEGIN BOND
M  V30 1 2 1 6
M  V30 2 1 1 2
M  V30 3 2 2 3
M  V30 4 1 3 4
M  V30 5 2 4 5
M  V30 6 1 5 6
M  V30 7 1 6 7
M  V30 END BOND
M  V30 END CTAB
M  END
$$$$

What's the problem?

One problem comes when I want to configure chemfp so that if the output is SMILES then use one flavor, and if the output is SDF then use another flavor. You could construct a table of format-specific writer_args, like this:

writer_args_by_format = {
  "smi": {"flavor": "Canonical|Kekule|Isotopes", "aromaticity": "openeye"},
  "sdf": {"flavor": "Default|MV30", "aromaticity": "openeye"},
    ...
}

record = T.create_string(mol, format,
           writer_args = writer_args_by_format[format])
but not only is that tedious, it doesn't handle toolkit-specific options. Nor is there an easy way to turn the text settings into this data structure.

Qualified names

Instead, the reader_args and writer_args accept "qualified" names, which can be format-specific like "sdf.flavor", toolkit-specific like "openeye.*.aromaticity", or both, like "openeye.sdf.aromaticity".

A cleaner way to write the previous example is:

writer_args = {
  "smi.flavor": "Canonical|Kekule|Isotopes",
  "sdf.flavor": "Default|MV30",
  "aromaticity": "openeye",   # Use the openeye aromaticity model for all formats
    ...
}

record = T.create_string(mol, format, writer_args = writer_args)
or if you want to be toolkit-specific, use "openeye.smi.flavor", "openeye.sdf.flavor" and "openeye.*.aromaticity", etc.
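
If you already have a per-format table like the earlier writer_args_by_format, flattening it into format-qualified names is mechanical. This is a small hypothetical helper (qualify_by_format is not part of chemfp) showing the idea:

```python
def qualify_by_format(args_by_format):
    """Flatten {format: {option: value}} into {'format.option': value}."""
    qualified = {}
    for format_name, args in args_by_format.items():
        for option, value in args.items():
            qualified["%s.%s" % (format_name, option)] = value
    return qualified

writer_args_by_format = {
    "smi": {"flavor": "Canonical|Kekule|Isotopes", "aromaticity": "openeye"},
    "sdf": {"flavor": "Default|MV30", "aromaticity": "openeye"},
}

flat_writer_args = qualify_by_format(writer_args_by_format)
for name in sorted(flat_writer_args):
    print(name, "=", flat_writer_args[name])
```

The flattened form repeats "aromaticity" once per format; the hand-written version above is shorter because it can use a single unqualified name.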

Precedence

You probably noticed there are many ways to specify the same setting, as in the following:

reader_args = {
  "delimiter": "tab",
  "openeye.*.delimiter": "whitespace",
  "smi.delimiter": "space",
}
The chemfp precedence goes from most-qualified name to least-qualified, so for this case the search order is:
openeye.smi.delimiter
openeye.*.delimiter
smi.delimiter
delimiter
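
The search order above can be written as a short lookup. This is a sketch of the rule as described, using hypothetical helper names; it is not chemfp's internal code.

```python
def candidate_names(toolkit, format_name, option):
    """List the names to try, from most qualified to least qualified."""
    return [
        "%s.%s.%s" % (toolkit, format_name, option),  # openeye.smi.delimiter
        "%s.*.%s" % (toolkit, option),                # openeye.*.delimiter
        "%s.%s" % (format_name, option),              # smi.delimiter
        option,                                       # delimiter
    ]

def resolve(args, toolkit, format_name, option, default=None):
    """Return the value of the first matching name, else the default."""
    for name in candidate_names(toolkit, format_name, option):
        if name in args:
            return args[name]
    return default

reader_args = {
    "delimiter": "tab",
    "openeye.*.delimiter": "whitespace",
    "smi.delimiter": "space",
}

print(resolve(reader_args, "openeye", "smi", "delimiter"))
print(resolve(reader_args, "rdkit", "smi", "delimiter"))
print(resolve(reader_args, "rdkit", "sdf", "delimiter"))
```

Under this rule the same dictionary gives "whitespace" for OEChem SMILES, "space" for SMILES in any other toolkit, and "tab" everywhere else.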

How to convert qualified names into unqualified names

The Format object's get_unqualified_reader_args() converts a complicated reader_args dictionary which may contain qualified names into a simpler reader_args dictionary with only unqualified names and only the names appropriate for the format. It's used internally to simplify the search for the right name, and it's part of the public API so you can check whether your qualifiers are working correctly. I'll give an example of debugging in a moment.

Here's an example which shows that the previous 'reader_args' example, with several delimiter specifications, is resolved to using the 'whitespace' delimiter style.

>>> from chemfp import openeye_toolkit
>>> 
>>> reader_args = {
...   "delimiter": "tab",
...   "openeye.*.delimiter": "whitespace",
...   "smi.delimiter": "space",
... }
>>> 
>>> format = openeye_toolkit.get_format("smi")
>>> format.get_unqualified_reader_args(reader_args)
{'delimiter': 'whitespace', 'flavor': None, 'aromaticity': None}
You can see that it also fills in the default values for unspecified arguments. Note that this function does not validate values. It's only concerned with resolving the names.

The equivalent method for writer_args is get_unqualified_writer_args() - I try to be predictable in my APIs.

This function is useful for debugging because it helps you spot typos. Readers ignore unknown arguments, so if you type "opneye" instead of "openeye" then it just assumes that you were talking about some other toolkit.

If you can't figure out why your reader_args or writer_args aren't being accepted, pass them through the 'unqualified' method and see what it gives:

>>> format.get_unqualified_reader_args({"opneye.*.aromaticity": "daylight"})
{'delimiter': None, 'flavor': None, 'aromaticity': None}
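
Because unknown toolkit names are silently ignored, a small pre-check can also help catch this kind of typo. The helper below is hypothetical and not part of chemfp; it assumes fully qualified names have exactly three dot-separated parts and that the toolkit prefixes are "openeye", "rdkit", and "openbabel".

```python
# Assumed toolkit prefixes, taken from the names used in this essay.
KNOWN_TOOLKITS = ("openeye", "rdkit", "openbabel")

def suspicious_names(args, known_toolkits=KNOWN_TOOLKITS):
    """Return 'toolkit.format.option' names whose toolkit is unknown."""
    suspects = []
    for name in args:
        parts = name.split(".")
        if len(parts) == 3 and parts[0] not in known_toolkits:
            suspects.append(name)
    return sorted(suspects)

reader_args = {
    "opneye.*.aromaticity": "daylight",  # typo: should be "openeye"
    "openeye.sdf.flavor": "Default",
    "smi.delimiter": "tab",
}
print(suspicious_names(reader_args))
```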

Qualified names and text settings

The Format object also supports qualifiers in the reader and writer text_settings and applies the same search order to give the unqualified reader_args.

>>> format.get_reader_args_from_text_settings({
...    "sanitize": "true",
...    "rdkit.*.sanitize": "false",
... })
{'sanitize': False}

Errors in the text settings

The get_reader_args_from_text_settings() and get_writer_args_from_text_settings() methods will validate the values as much as they can, and raise a ValueError with a helpful message if that fails.

>>> from chemfp import openeye_toolkit
>>> sdf_format = openeye_toolkit.get_format("sdf")
>>> sdf_format.get_writer_args_from_text_settings({
...   "flavor": "bland",
... })
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "chemfp/base_toolkit.py", line 407, in get_writer_args_from_text_settings
    return self._get_args_from_text_settings(writer_settings, self._format_config.output)
  File "chemfp/base_toolkit.py", line 351, in _get_args_from_text_settings
    % (self.toolkit_name, name, value, err))
ValueError: Unable to parse openeye setting flavor ('bland'): OEChem sdf format does not support the 'bland' flavor option. Available flavors are: CurrentParity, MCHG, MDLParity, MISO, MRGP, MV30, NoParity

File format detection based on extension

All of the above assumes you know the file format. Sometimes you only know the filename, and want to determine (or "guess") the format based on its extension. The file "abc.smi" is a SMILES file, the file "xyz.sdf" is an SD file, and "xyz.sdf.gz" is a gzip-compressed SD file.

The toolkit function get_input_format_from_source() will try to determine the format for an input file, given the source filename:

>>> from chemfp import openbabel_toolkit as T
>>> T.get_input_format_from_source("example.smi")
Format('openbabel/smi')
>>> T.get_input_format_from_source("example.sdf.gz")
Format('openbabel/sdf.gz')
>>> format = T.get_input_format_from_source("example.sdf.gz")
>>> format.get_default_reader_args()
{'implementation': None, 'perceive_0d_stereo': False, 'perceive_stereo': False, 'options': None}
The equivalent for output files is get_output_format_from_destination().

The main difference between the two is that get_input_format_from_source() will raise an exception if the format is known but not supported as an input format, and get_output_format_from_destination() will raise an exception if the format is known but not supported as an output format.

>>> T.get_input_format_from_source("example.inchikey")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "chemfp/openbabel_toolkit.py", line 109, in get_input_format_from_source
    return _format_registry.get_input_format_from_source(source, format)
  File "chemfp/base_toolkit.py", line 606, in get_input_format_from_source
    format_config = self.get_input_format_config(register_name)
  File "chemfp/base_toolkit.py", line 530, in get_input_format_config
    % (self.external_name, register_name))
ValueError: Open Babel does not support 'inchikey' as an input format

The format detection functions actually take two arguments, where the second is the format name. If the format name is given, it is used instead of the filename extension.

>>> T.get_input_format_from_source("example.inchikey", "smi.gz")
Format('openbabel/smi.gz')
This is meant to simplify the logic that would otherwise lead to code like:
if format is not None:
    format = T.get_input_format(format)
else:
    format = T.get_input_format_from_source(source)

By the way, the source and destination can be None. This tells chemfp to read from stdin or write to stdout. Since stdin and stdout don't have a file extension, what format do they have? My cheminformatics roots started with Daylight, so I decided that the default format is "smi".
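
Putting the pieces together, the detection rules described here (an explicit format wins, then the extension with an optional ".gz", then "smi" when the source is None) might look like the sketch below. The function name is hypothetical, and falling back to the default when a filename has no extension is my assumption; the essay doesn't say what chemfp does in that case.

```python
import os

def guess_format_name(source, format_name=None, default="smi"):
    """Guess a format name such as 'sdf.gz' from a filename.

    An explicit format_name always wins. A source of None stands for
    stdin or stdout and gives the default format.
    """
    if format_name is not None:
        return format_name
    if source is None:
        return default
    base, ext = os.path.splitext(os.path.basename(source))
    compression = ""
    if ext.lower() == ".gz":
        compression = ".gz"
        base, ext = os.path.splitext(base)
    # Assumption: a filename without an extension uses the default format.
    name = ext[1:].lower() if ext else default
    return name + compression

print(guess_format_name("example.smi"))
print(guess_format_name("example.sdf.gz"))
print(guess_format_name("example.inchikey", "smi.gz"))
print(guess_format_name(None))
```

Unlike the real functions, this sketch returns a string and does not check whether the toolkit supports the format as input or output.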


Andrew Dalke is an independent consultant focusing on software development for computational chemistry and biology.



Copyright © 2001-2013 Andrew Dalke Scientific AB