mmpdb paper, poster, and walkthrough
Last year we released mmpdb program, a tool for identifying and using matched molecular pairs. It is based on the source code that Hussain and Rea contributed to the RDKit project. Some of the features we added are:
- better support for symmetry, which results in fully canonical pair descriptions
- support for chirality, including matching chiral with prochiral structures
- can include the chemical environment when finding pairs
- generate property change statistics for each pair, environment, and property type
- parallelized fragmentation
- fragmentation can re-use fragmentations from a previous run
- performance speedups during indexing
- pair, environment, and property statistics are stored in a SQLite database
- analysis tools to propose possible transforms to an input structure, or to predict property shifts between two structures
Last week I presented our
poster for the 11th International Conference on Chemical
Structures (ICCS), in Noordwijkerhout, The Netherlands.
In this essay I'll walk through an example of how to use mmpdb using the example data from the paper's supporting materials.
Step 1: install mmpdb
pip install mmpdbThis is a pure Python installation which installs the command-line tool "mmpdb", and the Python library package "mmpdblib".
Step 2: Get the supporting data
I'll structure this as the commands to run, followed by some commentary. I'll also provide a link to the next section if you want to skip the commentary.
curl -O https://pubs.acs.org/doi/suppl/10.1021/acs.jcim.8b00173/suppl_file/ci8b00173_si_001.zip unzip ci8b00173_si_001.zipSkip to step 3.
Quoting from the paper:
We used all of the CYP3A4 (ChEMBL target ID ChEMBL340) and hERG (ChEMBL target ID ChEMBL340) data from ChEMBL23 to generate a reproducible timing benchmark. We merged all of the IC50 and Ki data for hERG and IC50, AC50, and Ki data for CYP3A4 with PCHEMBL values and removed undefined compounds and duplicates. The result was 14,377 compounds for CYP3A4 and 6192 compounds for hERG, with 302 compounds having a measured value for both hERG and CYP3A4, yielding a data set with 20,267 compounds overall. It should be noted that we employed this very coarse data cleaning and merging protocol for illustration purposes only. Additional care would need to be taken to assemble a hERG or CYP3A4 data set for actual MMPA and ensure compatibility of the assay, reproducibility of the data, etc.The SMILES and property data files are available in the supplementary data, which is a zip file containing the following:
% unzip -l ci8b00173_si_001.zip Archive: ci8b00173_si_001.zip Length Date Time Name --------- ---------- ----- ---- 987 03-22-2018 17:30 test_calls.txt 1393558 08-23-2017 15:50 ChEMBL_CYP3A4_hERG.smi 426852 08-23-2017 15:44 ChEMBL_CYP3A4_hERG_props.txt --------- ------- 1821397 3 filesDownload and unzip that file using your tools of choice.
mmpdb fragment --max-heavies 70 --max-rotatable-bonds 20 --has-header \ ChEMBL_CYP3A4_hERG.smi--output ChEMBL_CYP3A4_hERG.fragmentsSkip to step 4.
Each input molecule is fragmented into 0 or more fragment records. The algorithm uses a SMARTS pattern to identify which bonds may be fragmented. Use the --cut-smarts option to specify one of the alternative built-in patterns, or to specify your own SMARTS pattern.
It then generates all of the 1-cut, 2-cut, and 3-cut fragmentations. Use --num-cuts to limit the fragmentation to at most 2 or at most 1 cut.
The input structures come from a SMILES file. The --has-header option tells the parser to ignore the first line, because it is a header line. The built-in help, available with:
mmpdb help-smiles-formatgives more details about the --has-header and describes how the --delimiter option works.
I'll use the ChEMBL_CYP3A4_hERG.smi which was extracted from the zip file. It contains a header line followed by 20,267 SMILES records.
The above fragment command also uses options to limit the fragmentation to input records with at most 70 heavy atoms and at most 20 rotatable bonds. Isotopically labeled hydrogens are considered heavy atoms. You can specify an alternate SMARTS definition for "rotatable bond" if you wish.
The --output tells mmpdb to save the results to the named file rather than stdout.
All of the commands support the --help option, which gives more detailed information on the available command-line options and what they do. For more details about the fragment command, do:
mmpdb fragment --help
The fragmentation took about 40 minutes on my 7 year old laptop. I don't like long-running programs which give no output because part of me worries that the program froze, so mmpdb displays a progress indicator, like:
Fragmented record 2845/20267 (14.0%)This indicates that the 14% of the 20,267 input structures were processed.
You can disable progress information or warning messages with the "--quiet"/"-q" flag, as in:
mmpdb --quiet fragment ...This is a global option which can apply to all of the subcommands, which is why it goes before the subcommand.
Multiprocessing and caching
The fragment command can fragment multiple input structures at the same time. By default it uses four processes. Since my laptop only has four cores, I kept the default. If you have a more powerful machine then you might want to increase the number of fragmentation jobs to run in parallel, using the "--num-jobs"/"-j" option.
If you have added or modified a few records and want to re-fragment then you can save a lot of time by having mmpdb re-use previous fragmentation information. Specify the old fragment file using the --cache option.
mmpdb index --symmetric ChEMBL_CYP3A4_hERG.fragmentsSkip to step 5.
Each fragmentation contains a "constant" part and a "variable" part. In the MMP indexing step, the constants are matched up to generate a pair between one variable part and the other variable part. This pair can be written using a SMIRK-like notation, in the form "A>>B".
(Bear in mind that a SMIRKS describes a reaction, and this isn't a reaction. "A>>B" and "B>>A" are equivalent, and a matched molecular series may also be important.)
The following creates an index from the fragment file and saves the results into a MMPDB database file:
% mmpdb index --symmetric ChEMBL_CYP3A4_hERG.fragments WARNING: No --output filename specified. Saving to 'ChEMBL_CYP3A4_hERG.mmpdb'. WARNING: Neither ujson nor cjson installed. Falling back to Python's slower built-in json decoder.The --symmetric flag adds both "A>>B" and "B>>A" into the database. This roughly doubles the database size. This flag is useful if you want a unique identifier which expressed both the pair and the direction of transformation. It doesn't really affect the downstream processing.
The first warning message is because I prefer that people specify the output file on the command-line, using the "--output" flag. If you don't specify the output filename then it infers it based on the input filename, and it prints the filename so you know where to look. We'll probably drop the "WARNING:" part in the future, because the current behavior seems to be what people want.
The second warning message is because mmpdb was developed under Python 2.7 and we found the performance was limited by the time it took to load the fragment file; each line of the file is a JSON record. The third-party "ujson" and "cjson" JSON parsers were significantly faster than the built-in "json" module. The warning is a strong suggestion to install one of them.
However, I just discovered now that Python 3.6 the ujson package only saves a 4 seconds out of the 130 seconds. A 3% increase is nice, but not enough to warrant the warning message. I've filed an issue to look into this further.
The output "mmpdb" file is a SQLite database, so if you are really curious about its contents you can easily access the file using tools other than mmpdb.
mmpdb loadprops --properties ChEMBL_CYP3A4_hERG_props.txt ChEMBL_CYP3A4_hERG.mmpdbSkip to 'transform' analysis.
mmpdb reads the physical property/activity information from a tab-separated file. Only one value is supported per compound+property. For details on the format, use:
The CHEMBL_CYP3A4_HERG_props.txt file contains CYP3A4 and hERG_pIC50 values. The first few lines are:
CMPD_CHEMBLID CYP3A4 hERG_pIC50 CHEMBL3612928 * 4.52 CHEMBL2425617 7 * CHEMBL3221133 * 6.8 CHEMBL3221131 * 5.9 CHEMBL3221134 * 6.32 CHEMBL1945199 * 4.97 CHEMBL1573507 4.4 * CHEMBL284328 5 * CHEMBL1531676 4.9 * CHEMBL1546374 5.35 * CHEMBL1379480 5.45 * CHEMBL1499545 4.9 * CHEMBL486696 4.5 * CHEMBL1453970 5.45 * CHEMBL1490799 5.5 * CHEMBL121663 4.8 * CHEMBL282468 4.5 * CHEMBL1500528 5.5 * CHEMBL354349 4.6 * CHEMBL221753 4.6 6.97 ...
The following shows the output from loading the properties into the existing mmpdb database.
% mmpdb loadprops --properties ChEMBL_CYP3A4_hERG_props.txt ChEMBL_CYP3A4_hERG.mmpdb WARNING: APSW not installed. Falling back to Python's sqlite3 module. Using dataset: MMPs from 'ChEMBL_CYP3A4_hERG.fragments' Reading properties from 'ChEMBL_CYP3A4_hERG_props.txt' WARNING: the identifier column in the properties file (column 1) has a header of 'CMPD_CHEMBLID'; should be 'id', 'ID', 'Name', or 'name' Read 2 properties for 20267 compounds from 'ChEMBL_CYP3A4_hERG_props.txt' 5474 compounds from 'ChEMBL_CYP3A4_hERG_props.txt' are not in the dataset at 'ChEMBL_CYP3A4_hERG.mmpdb' Imported 9691 'CYP3A4' records (9691 new, 0 updated). Imported 5340 'hERG_pIC50' records (5340 new, 0 updated). Generated 1246741 rule statistics (1263109 rule environments, 2 properties) Number of rule statistics added: 1246741 updated: 0 deleted: 0 Loaded all properties and re-computed all rule statistics.It took about 90 seconds to run.
Timings showed that APSW is faster than Python's built-in sqlite3 module. The "WARNING" is a suggestion that you install that package. Bear in mind that APSW is not available from the Python packaging index at PyPI. It can be installed via pip, using the complex command at the bottom of the APSW download page.
mmpdb transform \ --smiles "C[C@@H](C(=O)OC(C)C)N[P@](=O)(OC[C@@H]1[C@H]([C@@]([C@@H](O1)n2ccc(=O)[nH]c2=O)(C)F)O)Oc3ccccc3" \ --substructure "C[C@@H](C(=O)OC(C)C)N[P@](=O)(OC[C@@H]1[C@H]([C@@]([C@@H](O1)n2ccc(=O)[nH]c2=O)(C)F)O)Oc3ccccc3" \ --property hERG_pIC50 --property CYP3A4 \ ChEMBL_CYP3A4_hERG.mmpdbSkip to 'predict' analysis.
mmpdb supports two analysis options. The 'transform' command applies the matched pair rules to generate transformations of an input structure, along with a prediction of the change in the requested properties.
The above example starts with sofosbuvir (specified via the --smiles option), and requires that the transform structure still have the sofosbuvir substructure. The total search on the command-line took 8 seconds. We found that we could speed it up by either loading the database into memory (with the --in-memory option; only available if APSW is installed) or by putting the data on a RAM disk.
The output is a tab-separated file with a large number of columns. This is designed to be imported into Excel or Spotfire for further analysis; there are too many columns to make sense of it here. Instead, I'll pull out one record as an example:
ID: 25 SMILES: C=CCC(C)OC(=O)[C@H](C)N[P@](=O)(OC[C@H]1O[C@@H](n2ccc(=O)[nH]c2=O)[C@](C)(F)[C@@H]1O)Oc1ccccc1 hERG_pIC50_from_smiles: [*:1]C hERG_pIC50_to_smiles: [*:1]CC=C hERG_pIC50_radius: 0 hERG_pIC50_fingerprint: 59SlQURkWt98BOD1VlKTGRkiqFDbG6JVkeTJ3ex3bOA hERG_pIC50_rule_environment_id: 774189 hERG_pIC50_count: 2 hERG_pIC50_avg: 0.765 hERG_pIC50_std: 0.61518 hERG_pIC50_kurtosis: hERG_pIC50_skewness: hERG_pIC50_min: 0.33 hERG_pIC50_q1: 0.33 hERG_pIC50_median: 0.765 hERG_pIC50_q3: 1.2 hERG_pIC50_max: 1.2 hERG_pIC50_paired_t: 1.7586 hERG_pIC50_p_value: 0.32915 CYP3A4_from_smiles: [*:1]C CYP3A4_to_smiles: [*:1]CC=C CYP3A4_radius: 0 CYP3A4_fingerprint: 59SlQURkWt98BOD1VlKTGRkiqFDbG6JVkeTJ3ex3bOA CYP3A4_rule_environment_id: 774189 CYP3A4_count: 8 CYP3A4_avg: 0.31875 CYP3A4_std: 0.50705 CYP3A4_kurtosis: -0.27899 CYP3A4_skewness: 0.66842 CYP3A4_min: -0.2 CYP3A4_q1: -0.075 CYP3A4_median: 0.2 CYP3A4_q3: 0.6 CYP3A4_max: 1.3 CYP3A4_paired_t: 1.7781 CYP3A4_p_value: 0.11863You may think it's a bit of a duplication that both the hERG_pIC50 and CYP3A4 have identical "*_from_smiles" and "*_to_smiles" values. What we found during development was there may be several different pairs which result in the same transform. We have some heuristics to select what we think is the most relevant transform, based on the number of pairs found with property information. However, the amount of property information may vary for different properties, causing different transforms to be selected as the "most relevant."
mmpdb predict --reference "C[C@@H](C(=O)OC(C)C)N[P@](=O)(OC[C@@H]1[C@H]([C@@]([C@@H](O1)n2ccc(=O)[nH]c2=O)(C)F)O)Oc3ccccc3" --smiles "C[C@@H](C(=O)OC(C)C)N[P@](=O)(OC[C@@H]1[C@H]([C@@]([C@@H](O1)n2ccc(=O)[nH]c2=O)(C)F)O)Oc3ccc(F)cc3" --property hERG_pIC50 ChEMBL_CYP3A4_hERG.mmpdbThis second (and last) analysis example predicts the shift between two compounds. In this case it predicts the effect on the hERG_pIC50 from the addition of a florine to sofosbuvir, graphically depicted as:
The analysis takes about 3 seconds to generate the following:
% mmpdb --quiet predict \ --reference "C[C@@H](C(=O)OC(C)C)N[P@](=O)(OC[C@@H]1[C@H]([C@@]([C@@H](O1)n2ccc(=O)[nH]c2=O)(C)F)O)Oc3ccccc3" \ --smiles "C[C@@H](C(=O)OC(C)C)N[P@](=O)(OC[C@@H]1[C@H]([C@@]([C@@H](O1)n2ccc(=O)[nH]c2=O)(C)F)O)Oc3ccc(F)cc3" \ --property hERG_pIC50 \ ChEMBL_CYP3A4_hERG.mmpdb predicted delta: +0.220769 +/- 0.354767(I used the --quiet option so I wouldn't get the warning message about APSW not being installed.)
See 'predict' details
Add the --save-details to have the predict command save prediction details to the files "pred_detail_rules.txt" and "pred_detail_pairs.txt". (Use --prefix to change the output filename prefix from 'pred_details' to something else.)
Again, these are tab-separated files meant more for Spotfire or Excel than a simple HTML page. As before, I'll just show one example record, first from pred_detail_rules.txt:
rule_environment_statistics_id: 1006880 rule_id: 110 rule_environment_id: 1016557 radius: 3 fingerprint: epWXDOOtiVLnFPhsdb89UN1noHJUTNbNiF1h/qpOhmQ from_smiles: [*:1][H] to_smiles: [*:1]F count: 39 avg: 0.22077 std: 0.35477 kurtosis: 0.27687 skewness: -0.081873 min: -0.74 q1: 0 median: 0.2 q3: 0.4875 max: 1.04 paired_t: 3.8862 p_value: 0.00039518and then from pred_detail_pairs.txt:
rule_environment_id: 698 from_smiles: [*:1][H] to_smiles: [*:1]F radius: 0 fingerprint: 59SlQURkWt98BOD1VlKTGRkiqFDbG6JVkeTJ3ex3bOA lhs_public_id: CHEMBL488468 rhs_public_id: CHEMBL495303 lhs_smiles: CC(C)(C)CCN1CCC(CNC(=O)c2cc(Cl)cc(Cl)c2)CC1 rhs_smiles: CC(C)(C)C(F)CN1CCC(CNC(=O)c2cc(Cl)cc(Cl)c2)CC1 lhs_value: 5.71 rhs_value: 5.33 delta: -0.38
Andrew Dalke is an independent consultant focusing on software development for computational chemistry and biology. Need contract programming, help, or training? Contact me
Copyright © 2001-2013 Andrew Dalke Scientific AB