Dalke Scientific Software: More science. Less time. Products
[ previous | newer ]     /home/writings/diary/archive/2020/09/23/extracting_two_sd_data_items

Extracting two SDF data items with chemfp's text toolkit

This is part of a series of essays about working with SD files at the record and simple text level. In yesterday's essay I showed several examples of using chemfp's text toolkit API to process records from an SD file. In some cases, reading the entire record is too much work so in this essay I'll show some examples of extracting just two pieces of information (a title and a single SDF data item value, or two data item values) from the records.

You can follow along yourself by installing chemfp (under the Base License Agreement) using:

python -m pip install chemfp -i https://chemfp.com/packages/

chemfp is a package for high-performance cheminformatics fingerprint similarity search. It includes a text toolkit API for working with SDF and SMILES file records.

Extract SMILES from ChEBI records

In yesterday's essay I noticed that most records in the ChEBI SDF distribution ChEBI_complete.sdf.gz contain a SMILES data item. (112,938 out of 113,902 to be precise.)

Let's extract those to make a SMILES files! (We could of course use a chemistry toolkit to parse the connection table into a molecule then generate a SMILES, but that's not the point of this essay.)

The obvious way, using what I covered yesterday, is to read the ids and records, where the id comes from the ChEBI ID, then for each record get the value of the SMILES data item. If it exists, then write the SMILES and id as a tab-seperated SMILES files. The following program does just that:

from chemfp import text_toolkit as T 

for id, record in T.read_sdf_ids_and_records("ChEBI_complete.sdf.gz", id_tag="ChEBI ID"):
    smiles = T.get_sdf_tag(record, "SMILES")
    if smiles is None:
        continue
    print(f"{smiles}\t{id}")

The first 10 lines of output are:

[H][C@@]1(Oc2cc(O)cc(O)c2C[C@H]1O)c1ccc(O)c(O)c1	CHEBI:90
CC1(C)[C@@H]2CC[C@@](C)(C2)C1=O	CHEBI:165
OCC(CO[*])OC([*])=O	CHEBI:598
[H][C@]12CC[C@]3(C)C(=O)[C@H](O)C[C@@]3([H])[C@]1([H])CCc1cc(O)ccc21	CHEBI:776
Clc1cccc(Cl)c1C#N	CHEBI:943
CCC(O)C(O)=O	CHEBI:1148
OC1=C(C\C=C(\CC\C=C(\CC\C=C(\CC\C=C(\CC\C=C(\CC\C=C(\CC\C=C(\CCC=C(C)C)/C)/C)/C)/C)/C)/C)/C)C=CC=C1O	CHEBI:1233
C1[C@@]2([C@]3(CC[C@]4([C@]([C@@]3(CC=C2C[C@H](C1)O)[H])(CC[C@@]4([C@](C)(CCCC(C)C)O)[H])[H])C)[H])C	CHEBI:1296
C12C(C3C(C(CC3)*)(C)CC1)CC[C@]4(C2(CCC(C4)=O)C)[H]	CHEBI:1624
C12C(C3C(C(CC3)*)(C)CC1)CC=C4C2(CC[C@@H](C4)O)C	CHEBI:1722

However, there is a faster way. The above program extracts a record just to extract one data item value. The text toolkit function read_sdf_ids_and_values() doesn't need that intermediate step - it returns an iterator for the id and one value from each record.

Both the id and value can be specified by tag name. The default (Python's None value) tells it to use title line, so if not specified each iterated 2-tuple contains the title line, twice. In the following, I'll specify the correct tag name for the two values I'm interested in:

from chemfp import text_toolkit as T 

for id, smiles in T.read_sdf_ids_and_values("ChEBI_complete.sdf.gz",
                                            id_tag="ChEBI ID", value_tag="SMILES"):
    if smiles is None:
        continue
    print(f"{smiles}\t{id}")

It gives the same output as the previous program, but according to my timings (with stdout redirected to /dev/null) the first one takes 2.2 seconds while the second takes 1.7 seconds. The 20% faster performance becomes even more noticable when processing larger data sets.

Extract CACTVS fingerprints from PubChem records

This function exists to improve the performance of chemfp's sdf2fps command-line tool, which extracts the identifier and fingerprint from each record of an SD file and saves them to FPS format. For example, here's the shortest record in PubChem's Compound_000000001_000500000.sdf.gz, trimmed a bit further for brevity:

139759
  -OEChem-06172007252D

  2  1  0     0  0  0  0  0  0999 V2000
    2.0000    0.0000    0.0000 Li  0  0  0  0  0  0  0  0  0  0  0  0
    3.0000    0.0000    0.0000 Li  0  0  0  0  0  0  0  0  0  0  0  0
  1  2  1  0  0  0  0
M  END
> <PUBCHEM_COMPOUND_CID>
139759

> <PUBCHEM_COMPOUND_CANONICALIZED>
1

> <PUBCHEM_CACTVS_COMPLEXITY>
0

> <PUBCHEM_CACTVS_HBOND_ACCEPTOR>
0

> <PUBCHEM_CACTVS_HBOND_DONOR>
0

> <PUBCHEM_CACTVS_ROTATABLE_BOND>
0

> < PUBCHEM_CACTVS_SUBSKEYS>
AAADcQwAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAIAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA==

> <PUBCHEM_COORDINATE_TYPE>
1
5
255

$$$$

Decoding a PubChem fingerprint

The CACTVS fingerprint is encoded in the PUBCHEM_CACTVS_SUBSKEYS data item. Chemfp's encodings module contains decoders for several common fingerprint encodings, including this one. In the following example, I added newlines to make the text more readable:

>>> from chemfp import encodings
>>> encodings.from_cactvs("AAADcQwAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA"
...     "AAAAAIAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA"
...     "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA==")
(881, b'0\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00
\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x01\x00
\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00
\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00
\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00
\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00
\x00\x00\x00\x00')

It returns a 2-element tuple. The first term is the number of bits, the second is the fingerprint as a byte string.

Hex-encode a byte string

Finally, chemfp's bitops module contains functions for working with byte-encoded and hex-encoded fingerprints, including converting a byte-string to hex:

>>> from chemfp import bitops
>>> bitops.hex_encode(b"A\x94\x00")
'419400'

However, that function only exists so programs can run on both Python 2 and 3. Python 3-only programs can use the byte string's hex() method:

>>> b"A\x94\x00".hex()
'419400'

Export to FPS format

Putting it all together, here's a simple Python 3 program to extract PubChem fingerprints and print the results in FPS format:

from chemfp import text_toolkit as T
from chemfp.encodings import from_cactvs

print("#FPS1")
for id, encoded_fp in T.read_sdf_ids_and_values("Compound_000000001_000500000.sdf.gz",
                                                    value_tag="PUBCHEM_CACTVS_SUBSKEYS"):
    assert encoded_fp is not None, id
    num_bits, fp = from_cactvs(encoded_fp)
    print(f"{fp.hex()}\t{id}")

That takes a bit about 7.7 seconds on my laptop. With 293 PubChem files, and assuming each takes the same amount of time (it doesn't), that's about 37 minutes to extract the fingerprint data.

You can see how the 20% savings might be nice!

Using a gzip co-process to decompress

As I pointed out in the essay Faster gzip reading in Python, chemfp by default uses a single-threaded/in-process gzip decoder. If I switch to using gzip as a co-process, to make better use of the available CPUs, then the wall-clock time goes down to about 6.6 seconds while the user+sytem time goes to 8.2 seconds, or an estimated 32 minutes to process all of PubChem.

Using chemfp's fingerprint writer

The previous program handled the FPS output formatting itself. An alternative is to use chemfp's own open_fingerprint_writer(). If the destination is Python's None value then the results are sent to stdout. The writer object's write_fingerprints() method takes an iterator of (id, fingerprint) pairs, so only a small adapter is needed to convert the CACTVS fingerprints to a byte string. The result is:

import chemfp
from chemfp import text_toolkit as T
from chemfp.encodings import from_cactvs

filename = "Compound_000000001_000500000.sdf.gz"

with T.read_sdf_ids_and_values(filename, value_tag="PUBCHEM_CACTVS_SUBSKEYS") as reader:
    with chemfp.open_fingerprint_writer(None) as writer:
        writer.write_fingerprints(
            (id, from_cactvs(encoded_fp)[1]) for (id, encoded_fp) in reader)

I think it's cleaner. The overall performance is essentially unchanged. And the output destination can be switched to use a filename, including formatting the output fingerprints in FPB file.


Andrew Dalke is an independent consultant focusing on software development for computational chemistry and biology. Need contract programming, help, or training? Contact me



Copyright © 2001-2020 Andrew Dalke Scientific AB