Changes in chemfp 3.4
In a previous essay I talked about the new licensing model in the recent chemfp 3.4 release. In short, no-cost academic licensing is now available, and a pre-compiled version of the package, with some restrictions on use, is available at no cost for Linux-based OSes.
The 3.4 release had the unofficial title "back in action". I took time off from development to (among other things) write a paper about the project and take parental leave for our second kid.
Improved chemistry toolkit support
The world doesn't stop for me. Open Babel 3.0 was released, and all three toolkits (including RDKit and OEChem/OEGraphSim) added new structure formats and new fingerprint types since the chemfp 3.3 release. Here are a few highlights:
- Support for SECFP fingerprints in the RDKit. Background: The Reymond group contributed their MHFP (MinHash fingerprint) to the RDKit. That implementation uses a SMILES-based circular substructure hashing scheme as input to an LSH (locality sensitive hashing) forest algorithm for approximate nearest-neighbor searching. Chemfp doesn't implement LSH search. However, as part of Probst and Reymond's work, they implemented SECFP (SMILES extended connectivity fingerprint), which uses the same circular substructure generation code as input to a more traditional fingerprint method. According to their paper,
  "SECFP6 performed significantly better than both ECFP4/6 (Additional file 1: Fig. S9). These results suggest that SECFP6 can be readily used as a drop-in replacement for ECFP4 with beneficial results."
- Support Open Babel 3.0's new ECFP-like circular fingerprints
- Support OEChem's "oez" and "csv" formats, along with the macromolecular formats CIF, mmCIF, PDB, and FASTA.
- Support for OEChem's recently added SMILES, SMARTS, and MDL query substructure screens. Chemfp can export them to FPS format.
- Support additional RDKit structure file options (like its support for cxsmiles), new structure formats (like Maestro and HELM files), and additional parameters for existing fingerprint types.
Performance improvements and ZStandard support
I added a number of performance improvements:
- MACCS search is 10-20% faster because of an improved rejection test
- FPS reading is about 20% faster by using a larger block size. (The previous block size was tuned for my laptop in 2010.)
- FPB generation is about 10% faster
- gzip reading is about 15% faster by using my own gzip reader instead of Python's standard library. Overall, fingerprint extraction with sdf2fps on the PubChem sdf.gz files is about 10% faster.
- Support for ZStandard-compressed FPS and FPB files (in addition to gzip-compressed ones). This may give better load performance for network-based storage by reducing the amount of required network I/O.
- rdkit2fps no longer tries to parse the SD tags, giving a ~5% speedup.
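To make the block-size point concrete, here is a minimal sketch of reading a gzip-compressed FPS file in large blocks using only Python's standard library. It is not chemfp's reader (which uses its own gzip code). The function names and the 1 MiB default are hypothetical, and it assumes the usual FPS layout: "#" header lines followed by lines containing a hex-encoded fingerprint, a tab, and an identifier.

```python
import gzip


def parse_fps_line(line):
    """Return (id, fingerprint_bytes) for one FPS data line, or None to skip it."""
    line = line.rstrip(b"\r\n")
    if not line or line.startswith(b"#"):
        return None  # blank line or header/metadata line
    # Assumed layout: hex fingerprint, tab, identifier (extra fields ignored).
    fields = line.split(b"\t")
    return fields[1].decode("utf8"), bytes.fromhex(fields[0].decode("ascii"))


def iter_fps_records(path, block_size=1 << 20):
    """Yield (id, fingerprint_bytes) pairs, reading block_size bytes at a time.

    block_size is a hypothetical tuning knob, not the value chemfp uses.
    """
    with gzip.open(path, "rb") as infile:
        leftover = b""
        while True:
            block = infile.read(block_size)
            if not block:
                break
            lines = (leftover + block).split(b"\n")
            # The last piece may be an incomplete line; keep it for the next block.
            leftover = lines.pop()
            for line in lines:
                record = parse_fps_line(line)
                if record is not None:
                    yield record
        # Handle a final line that has no trailing newline.
        record = parse_fps_line(leftover)
        if record is not None:
            yield record
```

Fewer, larger reads mean less per-call overhead in the decompression layer. The best size depends on the machine, so it is worth timing a few values rather than trusting the default here.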
Other tool improvements
There are a number of small tool improvements, like adding a --help-formats command-line option to give more detailed information about the supported format types and options for each of the toolkits. (Previously much of this information was available from --help but that led to information overload.)
One nice change is that simsearch now accepts a structure query, either directly on the command line or from a structure file, rather than requiring an FPS file. Simsearch will read the target file to get the fingerprint type, then use that to parse the query structures correctly. For example:
    % simsearch --query 'CN1C=NC2=C1C(=O)N(C(=O)N2C)C' chembl_24_1.fps.gz -k 4
    #Simsearch/1
    #num_bits=2048
    #type=Tanimoto k=4 threshold=0.0
    #software=chemfp/3.4
    #targets=chembl_24_1.fps.gz
    #target_source=chembl_24.fps.gz
    4 Query1 CHEMBL113 1.00000 CHEMBL1232048 0.70968 CHEMBL446784 0.67742 CHEMBL1738791 0.66667
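If you want to post-process that output in a script, a small parser is enough. The sketch below is based only on the layout visible in the example: "#" header lines, then one line per query containing the hit count, the query id, and alternating target id and score fields. The function name is hypothetical. It assumes fields are tab-separated in real output and falls back to whitespace splitting when no tab is present, which would break for identifiers that contain spaces.

```python
def parse_simsearch_output(text):
    """Parse simsearch text output into (headers, results).

    headers: dict mapping header name to its value string
    results: list of (query_id, [(target_id, score), ...])
    """
    headers = {}
    results = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("#"):
            key, sep, value = line[1:].partition("=")
            if sep:
                headers[key] = value
            else:
                # The first line ("#Simsearch/1") has no "=": store it as the format.
                headers["format"] = key
            continue
        # Assumption: tab-separated fields; fall back to any whitespace.
        fields = line.split("\t") if "\t" in line else line.split()
        num_hits = int(fields[0])
        query_id = fields[1]
        pairs = fields[2:]
        hits = [(pairs[i], float(pairs[i + 1])) for i in range(0, len(pairs), 2)]
        if len(hits) != num_hits:
            raise ValueError("expected %d hits for %r, found %d"
                             % (num_hits, query_id, len(hits)))
        results.append((query_id, hits))
    return headers, results
```

Calling it on the captured output gives the header fields as strings (for example `headers["num_bits"]`) and, for each query, the list of target ids with their scores in the order simsearch reported them.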
For the full list of changes see the What's New section of the documentation.
Andrew Dalke is an independent consultant focusing on software development for computational chemistry and biology. Need contract programming, help, or training? Contact me
Copyright © 2001-2020 Andrew Dalke Scientific AB