Dalke Scientific Software: More science. Less time. Products
[ previous | next ]     /home/writings/diary/archive/2018/06/04/mmpdb

mmpdb paper, poster, and walkthrough

Last year we released mmpdb program, a tool for identifying and using matched molecular pairs. It is based on the source code that Hussain and Rea contributed to the RDKit project. Some of the features we added are:

Two weeks ago, JCIM published our paper titled "mmpdb: An Open-Source Matched Molecular Pair Platform for Large Multiproperty Data Sets." The DOI is 10.1021/acs.jcim.8b00173 , and the full author list is Andrew Dalke, Jérôme Hert, and Christian Kramer. The paper received the ACS Editors' Choice, which means the article doesn't require a subscription to JCIM to read it. The preprint version of the paper is also available from ChemRXiv.

Last week I presented our poster for the 11th International Conference on Chemical Structures (ICCS), in Noordwijkerhout, The Netherlands.

In this essay I'll walk through an example of how to use mmpdb using the example data from the paper's supporting materials.

Step 1: install mmpdb

Mmpdb requires Python and RDKit. It will work with both Python 2.7 and Python 3.5+. While you can download it from the mmpdb project page, it's easier to install the package with pip, as:

pip install mmpdb
This is a pure Python installation which installs the command-line tool "mmpdb", and the Python library package "mmpdblib".

Step 2: Get the supporting data

I'll structure this as the commands to run, followed by some commentary. I'll also provide a link to the next section if you want to skip the commentary.

curl -O https://pubs.acs.org/doi/suppl/10.1021/acs.jcim.8b00173/suppl_file/ci8b00173_si_001.zip
unzip ci8b00173_si_001.zip
Skip to step 3.

Quoting from the paper:

We used all of the CYP3A4 (ChEMBL target ID ChEMBL340) and hERG (ChEMBL target ID ChEMBL340) data from ChEMBL23 to generate a reproducible timing benchmark. We merged all of the IC50 and Ki data for hERG and IC50, AC50, and Ki data for CYP3A4 with PCHEMBL values and removed undefined compounds and duplicates. The result was 14,377 compounds for CYP3A4 and 6192 compounds for hERG, with 302 compounds having a measured value for both hERG and CYP3A4, yielding a data set with 20,267 compounds overall. It should be noted that we employed this very coarse data cleaning and merging protocol for illustration purposes only. Additional care would need to be taken to assemble a hERG or CYP3A4 data set for actual MMPA and ensure compatibility of the assay, reproducibility of the data, etc.
The SMILES and property data files are available in the supplementary data, which is a zip file containing the following:
% unzip -l ci8b00173_si_001.zip
Archive:  ci8b00173_si_001.zip
  Length      Date    Time    Name
---------  ---------- -----   ----
      987  03-22-2018 17:30   test_calls.txt
  1393558  08-23-2017 15:50   ChEMBL_CYP3A4_hERG.smi
   426852  08-23-2017 15:44   ChEMBL_CYP3A4_hERG_props.txt
---------                     -------
  1821397                     3 files
Download and unzip that file using your tools of choice.

Step 3: Fragment the SMILES

mmpdb fragment --max-heavies 70 --max-rotatable-bonds 20 --has-header \
   ChEMBL_CYP3A4_hERG.smi--output ChEMBL_CYP3A4_hERG.fragments
Skip to step 4.

Each input molecule is fragmented into 0 or more fragment records. The algorithm uses a SMARTS pattern to identify which bonds may be fragmented. Use the --cut-smarts option to specify one of the alternative built-in patterns, or to specify your own SMARTS pattern.

It then generates all of the 1-cut, 2-cut, and 3-cut fragmentations. Use --num-cuts to limit the fragmentation to at most 2 or at most 1 cut.

The input structures come from a SMILES file. The --has-header option tells the parser to ignore the first line, because it is a header line. The built-in help, available with:

mmpdb help-smiles-format
gives more details about the --has-header and describes how the --delimiter option works.

I'll use the ChEMBL_CYP3A4_hERG.smi which was extracted from the zip file. It contains a header line followed by 20,267 SMILES records.

The above fragment command also uses options to limit the fragmentation to input records with at most 70 heavy atoms and at most 20 rotatable bonds. Isotopically labeled hydrogens are considered heavy atoms. You can specify an alternate SMARTS definition for "rotatable bond" if you wish.

The --output tells mmpdb to save the results to the named file rather than stdout.

All of the commands support the --help option, which gives more detailed information on the available command-line options and what they do. For more details about the fragment command, do:

mmpdb fragment --help

Progress information

The fragmentation took about 40 minutes on my 7 year old laptop. I don't like long-running programs which give no output because part of me worries that the program froze, so mmpdb displays a progress indicator, like:

Fragmented record 2845/20267 (14.0%)
This indicates that the 14% of the 20,267 input structures were processed.

You can disable progress information or warning messages with the "--quiet"/"-q" flag, as in:

mmpdb --quiet fragment ...
This is a global option which can apply to all of the subcommands, which is why it goes before the subcommand.

Multiprocessing and caching

The fragment command can fragment multiple input structures at the same time. By default it uses four processes. Since my laptop only has four cores, I kept the default. If you have a more powerful machine then you might want to increase the number of fragmentation jobs to run in parallel, using the "--num-jobs"/"-j" option.

If you have added or modified a few records and want to re-fragment then you can save a lot of time by having mmpdb re-use previous fragmentation information. Specify the old fragment file using the --cache option.

Step 4: Index the fragments

mmpdb index --symmetric ChEMBL_CYP3A4_hERG.fragments
Skip to step 5.

Each fragmentation contains a "constant" part and a "variable" part. In the MMP indexing step, the constants are matched up to generate a pair between one variable part and the other variable part. This pair can be written using a SMIRK-like notation, in the form "A>>B".

(Bear in mind that a SMIRKS describes a reaction, and this isn't a reaction. "A>>B" and "B>>A" are equivalent, and a matched molecular series may also be important.)

The following creates an index from the fragment file and saves the results into a MMPDB database file:

% mmpdb index --symmetric ChEMBL_CYP3A4_hERG.fragments
WARNING: No --output filename specified. Saving to 'ChEMBL_CYP3A4_hERG.mmpdb'.
WARNING: Neither ujson nor cjson installed. Falling back to Python's slower built-in json decoder.
The --symmetric flag adds both "A>>B" and "B>>A" into the database. This roughly doubles the database size. This flag is useful if you want a unique identifier which expressed both the pair and the direction of transformation. It doesn't really affect the downstream processing.

The first warning message is because I prefer that people specify the output file on the command-line, using the "--output" flag. If you don't specify the output filename then it infers it based on the input filename, and it prints the filename so you know where to look. We'll probably drop the "WARNING:" part in the future, because the current behavior seems to be what people want.

The second warning message is because mmpdb was developed under Python 2.7 and we found the performance was limited by the time it took to load the fragment file; each line of the file is a JSON record. The third-party "ujson" and "cjson" JSON parsers were significantly faster than the built-in "json" module. The warning is a strong suggestion to install one of them.

However, I just discovered now that Python 3.6 the ujson package only saves a 4 seconds out of the 130 seconds. A 3% increase is nice, but not enough to warrant the warning message. I've filed an issue to look into this further.

The output "mmpdb" file is a SQLite database, so if you are really curious about its contents you can easily access the file using tools other than mmpdb.

Step 5: Load properties

mmpdb loadprops --properties ChEMBL_CYP3A4_hERG_props.txt ChEMBL_CYP3A4_hERG.mmpdb
Skip to 'transform' analysis.

mmpdb reads the physical property/activity information from a tab-separated file. Only one value is supported per compound+property. For details on the format, use:

mmpdb help-property-format

The CHEMBL_CYP3A4_HERG_props.txt file contains CYP3A4 and hERG_pIC50 values. The first few lines are:

CMPD_CHEMBLID   CYP3A4  hERG_pIC50
CHEMBL3612928   *       4.52
CHEMBL2425617   7       *
CHEMBL3221133   *       6.8
CHEMBL3221131   *       5.9
CHEMBL3221134   *       6.32
CHEMBL1945199   *       4.97
CHEMBL1573507   4.4     *
CHEMBL284328    5       *
CHEMBL1531676   4.9     *
CHEMBL1546374   5.35    *
CHEMBL1379480   5.45    *
CHEMBL1499545   4.9     *
CHEMBL486696    4.5     *
CHEMBL1453970   5.45    *
CHEMBL1490799   5.5     *
CHEMBL121663    4.8     *
CHEMBL282468    4.5     *
CHEMBL1500528   5.5     *
CHEMBL354349    4.6     *
CHEMBL221753    4.6     6.97
   ...

The following shows the output from loading the properties into the existing mmpdb database.

% mmpdb loadprops --properties ChEMBL_CYP3A4_hERG_props.txt ChEMBL_CYP3A4_hERG.mmpdb
WARNING: APSW not installed. Falling back to Python's sqlite3 module.
Using dataset: MMPs from 'ChEMBL_CYP3A4_hERG.fragments'
Reading properties from 'ChEMBL_CYP3A4_hERG_props.txt'
WARNING: the identifier column in the properties file (column 1) has a header of 'CMPD_CHEMBLID'; should be 'id', 'ID', 'Name', or 'name'
Read 2 properties for 20267 compounds from 'ChEMBL_CYP3A4_hERG_props.txt'
5474 compounds from 'ChEMBL_CYP3A4_hERG_props.txt' are not in the dataset at 'ChEMBL_CYP3A4_hERG.mmpdb'
Imported 9691 'CYP3A4' records (9691 new, 0 updated).
Imported 5340 'hERG_pIC50' records (5340 new, 0 updated).
Generated 1246741 rule statistics (1263109 rule environments, 2 properties)
Number of rule statistics added: 1246741 updated: 0 deleted: 0
Loaded all properties and re-computed all rule statistics.
It took about 90 seconds to run.

Timings showed that APSW is faster than Python's built-in sqlite3 module. The "WARNING" is a suggestion that you install that package. Bear in mind that APSW is not available from the Python packaging index at PyPI. It can be installed via pip, using the complex command at the bottom of the APSW download page.

Analysis #1: Transform a compound

mmpdb transform \
  --smiles "C[C@@H](C(=O)OC(C)C)N[P@](=O)(OC[C@@H]1[C@H]([C@@]([C@@H](O1)n2ccc(=O)[nH]c2=O)(C)F)O)Oc3ccccc3" \
  --substructure "C[C@@H](C(=O)OC(C)C)N[P@](=O)(OC[C@@H]1[C@H]([C@@]([C@@H](O1)n2ccc(=O)[nH]c2=O)(C)F)O)Oc3ccccc3" \
  --property hERG_pIC50 --property CYP3A4 \
  ChEMBL_CYP3A4_hERG.mmpdb
Skip to 'predict' analysis.

mmpdb supports two analysis options. The 'transform' command applies the matched pair rules to generate transformations of an input structure, along with a prediction of the change in the requested properties.

The above example starts with sofosbuvir (specified via the --smiles option), and requires that the transform structure still have the sofosbuvir substructure. The total search on the command-line took 8 seconds. We found that we could speed it up by either loading the database into memory (with the --in-memory option; only available if APSW is installed) or by putting the data on a RAM disk.

The output is a tab-separated file with a large number of columns. This is designed to be imported into Excel or Spotfire for further analysis; there are too many columns to make sense of it here. Instead, I'll pull out one record as an example:

ID: 25
SMILES: C=CCC(C)OC(=O)[C@H](C)N[P@](=O)(OC[C@H]1O[C@@H](n2ccc(=O)[nH]c2=O)[C@](C)(F)[C@@H]1O)Oc1ccccc1
hERG_pIC50_from_smiles: [*:1]C
hERG_pIC50_to_smiles: [*:1]CC=C
hERG_pIC50_radius: 0
hERG_pIC50_fingerprint: 59SlQURkWt98BOD1VlKTGRkiqFDbG6JVkeTJ3ex3bOA
hERG_pIC50_rule_environment_id: 774189
hERG_pIC50_count: 2
hERG_pIC50_avg: 0.765
hERG_pIC50_std: 0.61518
hERG_pIC50_kurtosis:
hERG_pIC50_skewness:
hERG_pIC50_min: 0.33
hERG_pIC50_q1: 0.33
hERG_pIC50_median: 0.765
hERG_pIC50_q3: 1.2
hERG_pIC50_max: 1.2
hERG_pIC50_paired_t: 1.7586
hERG_pIC50_p_value: 0.32915
CYP3A4_from_smiles: [*:1]C
CYP3A4_to_smiles: [*:1]CC=C
CYP3A4_radius: 0
CYP3A4_fingerprint: 59SlQURkWt98BOD1VlKTGRkiqFDbG6JVkeTJ3ex3bOA
CYP3A4_rule_environment_id: 774189
CYP3A4_count: 8
CYP3A4_avg: 0.31875
CYP3A4_std: 0.50705
CYP3A4_kurtosis: -0.27899
CYP3A4_skewness: 0.66842
CYP3A4_min: -0.2
CYP3A4_q1: -0.075
CYP3A4_median: 0.2
CYP3A4_q3: 0.6
CYP3A4_max: 1.3
CYP3A4_paired_t: 1.7781
CYP3A4_p_value: 0.11863
You may think it's a bit of a duplication that both the hERG_pIC50 and CYP3A4 have identical "*_from_smiles" and "*_to_smiles" values. What we found during development was there may be several different pairs which result in the same transform. We have some heuristics to select what we think is the most relevant transform, based on the number of pairs found with property information. However, the amount of property information may vary for different properties, causing different transforms to be selected as the "most relevant."

Analysis #2: Predict a change between two compounds

mmpdb predict --reference "C[C@@H](C(=O)OC(C)C)N[P@](=O)(OC[C@@H]1[C@H]([C@@]([C@@H](O1)n2ccc(=O)[nH]c2=O)(C)F)O)Oc3ccccc3" --smiles "C[C@@H](C(=O)OC(C)C)N[P@](=O)(OC[C@@H]1[C@H]([C@@]([C@@H](O1)n2ccc(=O)[nH]c2=O)(C)F)O)Oc3ccc(F)cc3" --property hERG_pIC50 ChEMBL_CYP3A4_hERG.mmpdb
This second (and last) analysis example predicts the shift between two compounds. In this case it predicts the effect on the hERG_pIC50 from the addition of a florine to sofosbuvir, graphically depicted as:

The analysis takes about 3 seconds to generate the following:

% mmpdb --quiet  predict \
  --reference "C[C@@H](C(=O)OC(C)C)N[P@](=O)(OC[C@@H]1[C@H]([C@@]([C@@H](O1)n2ccc(=O)[nH]c2=O)(C)F)O)Oc3ccccc3" \
  --smiles "C[C@@H](C(=O)OC(C)C)N[P@](=O)(OC[C@@H]1[C@H]([C@@]([C@@H](O1)n2ccc(=O)[nH]c2=O)(C)F)O)Oc3ccc(F)cc3" \
  --property hERG_pIC50 \
  ChEMBL_CYP3A4_hERG.mmpdb
predicted delta: +0.220769 +/- 0.354767
(I used the --quiet option so I wouldn't get the warning message about APSW not being installed.)

See 'predict' details

Add the --save-details to have the predict command save prediction details to the files "pred_detail_rules.txt" and "pred_detail_pairs.txt". (Use --prefix to change the output filename prefix from 'pred_details' to something else.)

Again, these are tab-separated files meant more for Spotfire or Excel than a simple HTML page. As before, I'll just show one example record, first from pred_detail_rules.txt:

rule_environment_statistics_id: 1006880
rule_id: 110
rule_environment_id: 1016557
radius: 3
fingerprint: epWXDOOtiVLnFPhsdb89UN1noHJUTNbNiF1h/qpOhmQ
from_smiles: [*:1][H]
to_smiles: [*:1]F
count: 39
avg: 0.22077
std: 0.35477
kurtosis: 0.27687
skewness: -0.081873
min: -0.74
q1: 0
median: 0.2
q3: 0.4875
max: 1.04
paired_t: 3.8862
p_value: 0.00039518
and then from pred_detail_pairs.txt:
rule_environment_id: 698
from_smiles: [*:1][H]
to_smiles: [*:1]F
radius: 0
fingerprint: 59SlQURkWt98BOD1VlKTGRkiqFDbG6JVkeTJ3ex3bOA
lhs_public_id: CHEMBL488468
rhs_public_id: CHEMBL495303
lhs_smiles: CC(C)(C)CCN1CCC(CNC(=O)c2cc(Cl)cc(Cl)c2)CC1
rhs_smiles: CC(C)(C)C(F)CN1CCC(CNC(=O)c2cc(Cl)cc(Cl)c2)CC1
lhs_value: 5.71
rhs_value: 5.33
delta: -0.38


Andrew Dalke is an independent consultant focusing on software development for computational chemistry and biology. Need contract programming, help, or training? Contact me



Copyright © 2001-2013 Andrew Dalke Scientific AB