Dalke Scientific Software: More science. Less time. Products
[ previous | newer ]     /home/writings/diary/archive/2017/03/24/chembl_bioactivity_data

ChEMBL bioactivity data

I almost only use ChEMBL structure files. I download the .sdf files and process them. ChEMBL also supplies bioactivity data, which I've never worked with. Iain Wallace suggested I look to it as a source of compound set data, and provided some example SQL queries. This blog post is primarily a set of notes for myself as I experiment with the queries and learn more about what is in the data file.

There is one bit of general advice. If you're going to use the SQLite dump from ChEMBL, make sure that you did "ANALYZE" at least on the tables of interest. This may take a few hours. I'm downloading ChEMBL-22.1 to see if it comes pre-analyzed. If it doesn't, I'll ask them to do so as part of their releases. (26 March 2017: I verified that 22.1 does not come pre-analyzed and send them an email. With a few hours, on a Sunday afternoon(!), I got an email thanking me for the suggestion. They only started shipping SQLite dumps with v21, and "will definitely include ANALYZE command when producing the dump" in the future.)

For those playing along from home (or the office, or whereever fine SQL database engines may be found), I downloaded the SQLite dump for ChEMBL 21, which is a lovely 2542883 KB (or 2.4) compressed, and 12 GB uncompressed. That link also includes dumps for MySQL, Oracle, and Postgres, as well as schema documentation.

Unpack it the usual way (it takes a while to unpack 12GB), cd into the directory, and open the database using sqlite console:

% tar xf chembl_21_sqlite.tar.gz
% cd chembl_21_sqlite
% sqlite3 chembl_21.db
SQLite version 3.8.5 2014-08-15 22:37:57
Enter ".help" for usage hints.
sqlite>

compound_structures

The 'compound_structures' table looks interesting. How many structures are there?

sqlite> select count(*) from compound_structures;
1583897
Wow. Just .. wow. That took a several minutes to execute. This is a problem I've had before with large databases. SQLite doesn't store the total table size, so the initial count(*) ends up doing a full table scan. This brings in every B-tree node from disk, which requires a lot of random seeks for my poor hard disk made of spinning rust. (Hmm, Crucible says I can get a replacement 500GB SSD for only EUR 168. Hmmm.)

The second time and onwards is just fine, thanks to the power of caching.

What does the structures look like? I'll decided to show only a few of the smallest structures to keep the results from overflowing the screen:

sqlite>
   ...>
select molregno, standard_inchi, standard_inchi_key, canonical_smiles
  from compound_structures where length(canonical_smiles) < 10 limit 4;
1813|InChI=1S/C4H11NO/c1-2-3-4-6-5/h2-5H2,1H3|WCVVIGQKJZLJDB-UHFFFAOYSA-N|CCCCON 3838|InChI=1S/C2H4INO/c3-1-2(4)5/h1H2,(H2,4,5)|PGLTVOMIXTUURA-UHFFFAOYSA-N|NC(=O)CI 4092|InChI=1S/C4H6N2/c5-4-6-2-1-3-6/h1-3H2|VEYKJLZUWWNWAL-UHFFFAOYSA-N|N#CN1CCC1 4730|InChI=1S/CH4N2O2/c2-1(4)3-5/h5H,(H3,2,3,4)|VSNHCAURESNICA-UHFFFAOYSA-N|NC(=O)NO

For fun, are there canonical SMILES which are listed multiple times? There are a few, so I decided to narrow it down to those with more than 2 instances. (None occur more than 3 times.)

sqlite> select canonical_smiles, count(*) from compound_structures group by canonical_smiles having count(*) > 2;
CC(C)Nc1cc(ccn1)c2[nH]c(nc2c3ccc(F)cc3)[S+](C)[O-]|3
CC(C)Nc1cc(ccn1)c2[nH]c(nc2c3ccc(F)cc3)[S+]([O-])C(C)C|3
CC(C)[C@@H](C)Nc1cc(ccn1)c2[nH]c(nc2c3ccc(F)cc3)[S+](C)[O-]|3
CC(C)[C@H](C)Nc1cc(ccn1)c2[nH]c(nc2c3ccc(F)cc3)[S+](C)[O-]|3
CC(C)[S+]([O-])c1nc(c2ccc(F)cc2)c([nH]1)c3ccnc(NC4CCCCC4)c3|3
  ...
Here are more details about the first output where the same SMILES is used multiple times:
sqlite>
   ...>
select molregno, standard_inchi from compound_structures
 where canonical_smiles = "CC(C)Nc1cc(ccn1)c2[nH]c(nc2c3ccc(F)cc3)[S+](C)[O-]";
1144470|InChI=1S/C18H19FN4OS/c1-11(2)21-15-10-13(8-9-20-15)17-16(22-18(23-17)25(3)24)12-4-6-14(19)7-5-12/h4-11H,1-3H3,(H,20,21)(H,22,23) 1144471|InChI=1S/C18H19FN4OS/c1-11(2)21-15-10-13(8-9-20-15)17-16(22-18(23-17)25(3)24)12-4-6-14(19)7-5-12/h4-11H,1-3H3,(H,20,21)(H,22,23)/t25-/m1/s1 1144472|InChI=1S/C18H19FN4OS/c1-11(2)21-15-10-13(8-9-20-15)17-16(22-18(23-17)25(3)24)12-4-6-14(19)7-5-12/h4-11H,1-3H3,(H,20,21)(H,22,23)/t25-/m0/s1
The differences are in the "/t(isotopic:stereo:sp3)", "/m(fixed_:stereo:sp3:inverted)", and "/s(fixed_H:stereo_type=abs)" layers. Got that?

I don't. I used the techniques of the next section to get the molfiles for each structure. The differences are in the bonds between atoms 23/24 (the sulfoxide, represented in charge-separated form) and atoms 23/25 (the methyl on the sulfur). The molfile for the first record has no asigned bond stereochemistry, the second has a down flag for the sulfoxide, and the third has a down flag for the methyl.

molfile column in compound_structures

There's a "molfile" entry. Does it really include the structure as a raw MDL molfile? Yes, yes it does:

sqlite> select molfile from compound_structures where molregno = 805;

          11280714442D 1   1.00000     0.00000     0

  8  8  0     0  0            999 V2000
    6.0750   -2.5667    0.0000 C   0  0  0  0  0  0           0  0  0
    5.3625   -2.9792    0.0000 N   0  0  3  0  0  0           0  0  0
    6.7917   -2.9792    0.0000 N   0  0  0  0  0  0           0  0  0
    5.3625   -3.8042    0.0000 C   0  0  0  0  0  0           0  0  0
    4.6542   -2.5667    0.0000 C   0  0  0  0  0  0           0  0  0
    6.0750   -1.7417    0.0000 C   0  0  0  0  0  0           0  0  0
    4.6542   -1.7417    0.0000 C   0  0  0  0  0  0           0  0  0
    5.3625   -1.3292    0.0000 C   0  0  0  0  0  0           0  0  0
  2  1  1  0     0  0
  3  1  2  0     0  0
  4  2  1  0     0  0
  5  2  1  0     0  0
  6  1  1  0     0  0
  7  8  1  0     0  0
  8  6  1  0     0  0
  7  5  1  0     0  0
M  END

Why did I choose molregno = 805? I looked for a structure with 8 atoms and 8 bond by searching for the substring "  8  8  0", which is in the counts line. (It's not a perfect solution, but rather a good-enough one.

sqlite> select molregno from compound_structures where molfile LIKE "%  8  8  0%" limit 1;
805
I bet with a bit of effort I could count the number of rings by using the molfile to get the bond counts and use the number of "."s in the canonical_smiles to get the number of fragments.

compound_properties and molecule_dictionary tables

The compound_properties table stores some molecular properties. I'll get the number of heavy atoms, the number of aromatic rings, and the full molecular weight for structure 805.

sqlite> select heavy_atoms, aromatic_rings, full_mwt from compound_properties where molregno = 805;
8|0|112.17
I've been using "805", which is an internal identifier. What's its public ChEMBL id?
sqlite> select chembl_id from molecule_dictionary where molregno = 805;
CHEMBL266980
What are some of the records with only 1 or 2 atoms?
sqlite>
   ...>
   ...>
select chembl_id, heavy_atoms from molecule_dictionary, compound_properties
 where molecule_dictionary.molregno = compound_properties.molregno
 and heavy_atoms < 3 limit 5;
CHEMBL1098659|1 CHEMBL115849|2 CHEMBL1160819|1 CHEMBL116336|2 CHEMBL116838|2

InChI and heavy atom count for large structures

I showed that some of the SMILES were used for two or three records. What about the InChI string? I started with:

sqlite>
   ...>
select molregno, standard_inchi, count(*) from compound_structures
  group by standard_inchi having count(*) > 1;
1378059||9
After 10 minutes with no other output, I gave up. Those 9 occurrences have a NULL value, that is:
sqlite> select count(*) from compound_structures where standard_inchi is NULL;
9
I was confused at first because there are SMILES string (I'll show only the first 40 characters), so there is structure information. The heavy atom count is also NULL:
sqlite>
   ...>
   ...>
select compound_structures.molregno, heavy_atoms, substr(canonical_smiles, 1, 40)
 from compound_structures, compound_properties
 where standard_inchi is NULL and compound_structures.molregno = compound_properties.molregno;
615447||CC1=CN([C@H]2C[C@H](OP(=O)(O)OC[C@H]3O[C 615448||CC1=CN([C@H]2C[C@H](OP(=O)(O)OC[C@H]3O[C 615449||CC1=CN([C@H]2C[C@H](OP(=O)(O)OC[C@H]3O[C 615450||CC1=CN([C@H]2C[C@H](OP(=O)(O)OC[C@H]3O[C 615451||CC1=CN([C@H]2C[C@H](OP(=O)(O)OC[C@H]3O[C 1053861||CN(C)P(=O)(OC[C@@H]1CN(C[C@@H](O1)n2cnc3 1053864||CN(C)P(=O)(OC[C@@H]1CN(C[C@@H](O1)n2cnc3 1053865||CN(C)P(=O)(OC[C@@H]1CN(C[C@@H](O1)N2C=CC 1378059||CC[C@H](C)[C@H](NC(=O)[C@H](CCCNC(=N)N)N
Then I realized it's because the schema specifies the "heavy_atoms" field as "NUMBER(3,0)". While SQLite ignores that limit, it looks like ChEMBL doesn't try to store a count above 999.

What I'll do instead is get the molecular formula, which shows that there are over 600 heavy atoms in those structures:

sqlite>
   ...>
   ...>
   ...>
   ...>
select chembl_id, full_molformula
  from compound_structures, compound_properties, molecule_dictionary
  where standard_inchi is NULL
  and compound_structures.molregno = compound_properties.molregno
  and compound_structures.molregno = molecule_dictionary.molregno;
CHEMBL1077162|C318H381N118O208P29 CHEMBL1077163|C319H383N118O209P29 CHEMBL1077164|C318H382N119O208P29 CHEMBL1077165|C325H387N118O209P29 CHEMBL1631334|C361H574N194O98P24S CHEMBL1631337|C367H606N172O113P24 CHEMBL1631338|C362H600N180O106P24 CHEMBL2105789|C380H614N112O113S9
Those are some large structures! The reason there are no InChIs for them is that InChI didn't support large molecules until version 1.05, which came out in early 2017. Before then, InChI only supported 1024 atoms. Which is normally fine as most compounds are small (hence "small molecule chemistry"). In fact, there aren't any records with more than 79 heavy atoms:
sqlite>
   ...>
select heavy_atoms, count(*) from compound_properties
  where heavy_atoms > 70 group by heavy_atoms;
71|364 72|207 73|46 74|29 75|3 76|7 78|2 79|2
How in the world do these large structures have 600+ atoms? Are they peptides? Mmm, no, not all. The first 8 contain a lot of phosphorouses. I'm guessing some sort of nucleic acid. The last might be a protein. Perhaps I can get a clue from the chemical name, which is in the compound_records table. Here's an example using the molregno 805 from earlier:
sqlite> select * from compound_records where molregno = 805;
1063|805|14385|14|1-Methyl-piperidin-(2Z)-ylideneamine|1|
Some of the names of the 600+ atom molecules are too long, so I'll limit the output to the first 50 characters of the name:
sqlite>
   ...>
   ...>
   ...>
   ...>
   ...>
select chembl_id, full_molformula, substr(compound_name, 1, 50)
  from compound_structures, molecule_dictionary, compound_properties, compound_records
 where standard_inchi is NULL
   and compound_structures.molregno = molecule_dictionary.molregno
   and compound_structures.molregno = compound_properties.molregno
   and compound_structures.molregno = compound_records.molregno;
CHEMBL1077161|C307H368N116O200P28|{[(2R,3S,4R,5R)-5-(4-amino-2-oxo-1,2-dihydropyrimi CHEMBL1077162|C318H381N118O208P29|{[(2R,3S,5R)-2-{[({[(2R,3S,4R,5R)-5-(4-amino-2-oxo CHEMBL1077163|C319H383N118O209P29|{[(2R,3S,5R)-2-{[({[(2R,3S,4R,5R)-5-(4-amino-2-oxo CHEMBL1077164|C318H382N119O208P29|{[(2R,3S,5R)-2-{[({[(2R,3S,4R,5R)-5-(4-amino-2-oxo CHEMBL1077165|C325H387N118O209P29|{[(2R,3S,5R)-2-{[({[(2R,3S,4R,5R)-5-(4-amino-2-oxo CHEMBL1631334|C361H574N194O98P24S|HRV-EnteroX CHEMBL1631337|C367H606N172O113P24|PV-5'term CHEMBL1631338|C362H600N180O106P24|PV-L4 CHEMBL2105789|C380H614N112O113S9|Mirostipen
That didn't help much, but I could at least do a web search for some of the names. For example, HRV-EnteroX is a PPMO (peptide-conjugated phosphorodiamidate morpholino oligomers), which is where those phosphorous atoms come from.

The names weren't really help, and the images at ChEMBL were too small to make sense of the structures, so I looked at them over at PubChem. HRV-EnteroX looks like a 12-mer peptide conjugated to about 25 morpholino oligomers. Mirostipen looks like a peptide. CHEMBL1077161 looks like a nucleic acid strand.

I don't think there's anything interesting to explore in this direction so I'll move on.

Assay data

I'll take a look at assay data, which I deal with a lot less often than I do structure data. How many assays are there?
sqlite> select count(*) from assays;
1212831
Okay, and how many of them are human assays? For that I need the NCBI taxonomy id. Iain's example code uses 9606, which the NCBI web site tells me is for Homo sapiens. I don't think there's a table in the SQLite data dump with all of the taxonomy ids. The organism_class table says only:
sqlite> select * from organism_class where tax_id = 9606;
7|9606|Eukaryotes|Mammalia|Primates
The assay table "assay_organism" column stores the "[n]ame of the organism for the assay system", with the caution "[m]ay differ from the target organism (e.g., for a human protein expressed in non-human cells, or pathogen-infected human cells)." I'll throw caution to the wind and check that field:
sqlite> select count(*) from assays where assay_organism = "Homo sapiens";
291143
sqlite>
   ...>
select assay_organism, count(*) from assays
 where assay_tax_id = 9606 group by assay_organism;
|17 Homo sapiens|291135 sqlite> select count(*) from assays where assay_tax_id = 9606 and assay_organism is NULL; 17
It looks like 9606 is indeed for humans.

Assay activities

What sort of assay activities are there?
sqlite> select distinct published_type from activities;

 ED50
 Transactivation
%
% Cell Death
 ...
AUC
AUC (0-24h)
AUC (0-4h)
AUC (0-infinity)
 ...
Change
Change HDL -C
Change MAP
Change TC
 ...
Okay, quite a few. There appear to be some typos as well:
sqlite>
   ...>
   ...>
select published_type, count(*) from activities where published_type in ("Activity", "A ctivity",
  "Ac tivity", "Act ivity", "Acti vity", "Activ ity", "Activi ty", "Activit y", "Activty")
  group by published_type;
A ctivity|1 Activ ity|2 Activit y|1 Activity|700337 Activty|1
After another 20 minutes of data exploration, I realized that there are two different types. The "published_type" is what the assayer published, while there's also a "standard_type", which looks to be a normalized value by ChEMBL:
sqlite>
   ...>
select published_type, standard_type from activities
  where published_type in ("A ctivity", "Activ ity", "Activit y", "Activty");
A ctivity|Activity Activ ity|Activity Activ ity|Activity Activit y|Activity Activty|Activity
There are a many ways to publish a report with IC50 data. I'll show only those that end with "IC50".
sqlite> select distinct published_type from activities where published_type like "%IC50";
-Log IC50
-Log IC50/IC50
-logIC50
Average IC50
CC50/IC50
CCIC50
CIC IC50
CIC50
Change in IC50
Cytotoxicity IC50
Decrease in IC50
EIC50
FIC50
Fold IC50
I/IC50
IC50
IC50/IC50
Increase in IC50
Log 1/IC50
Log IC50
MBIC50
MIC50
Mean IC50
RIC50
Ratio CC50/IC50
Ratio CIC95/IC50
Ratio ED50/MIC50
Ratio IC50
Ratio LC50/IC50
Ratio LD50/IC50
Ratio pIC50
Ratio plasma concentration/IC50
Relative ET-A IC50
Relative IC50
TBPS IC50
TC50/IC50
Time above IC50
fIC50
log1/IC50
logIC50
pIC50
pMIC50
rIC50
The "p" prefix, as in "pIC50", is shorthand for "-log", so "-Log IC50", "Log 1/IC50", and "pIC50" are almost certainly the same units. Let's see:
sqlite>
   ...>
select distinct published_type, standard_type from activities
  where published_type in ("-Log IC50", "Log 1/IC50", "pIC50");
-Log IC50|IC50 -Log IC50|pIC50 -Log IC50|-Log IC50 Log 1/IC50|IC50 Log 1/IC50|Log 1/IC50 pIC50|IC50 pIC50|pIC50 pIC50|Log IC50 pIC50|-Log IC50
Well color me confused. Oh! There's a "standard_flag", which "[s]hows whether the standardised columns have been curated/set (1) or just default to the published data (0)." Perhaps that will help enlighten me.
sqlite>
   ...>
select distinct published_type, standard_flag, standard_type from activities
 where published_type in ("-Log IC50", "Log 1/IC50", "pIC50");
-Log IC50|1|IC50 -Log IC50|1|pIC50 -Log IC50|0|-Log IC50 Log 1/IC50|1|IC50 Log 1/IC50|0|Log 1/IC50 pIC50|1|IC50 pIC50|1|pIC50 pIC50|0|Log IC50 pIC50|1|-Log IC50 pIC50|0|pIC50
Nope, I still don't understand what's going on. I'll assume it's all tied to the complexities of data curation. For now, I'll assume that the data set is nice and clean.

IC50 types

Let's look at the "IC50" values only. How do the "published_type" and "standard_type" columns compare to each other?

sqlite>
   ...>
select published_type, standard_type, count(*) from activities
  where published_type = "IC50" group by standard_type;
IC50|% Max Response|21 IC50|Change|2 IC50|Control|1 IC50|Electrophysiological activity|6 IC50|Fold Inc IC50|1 IC50|Fold change IC50|12 IC50|IC50|1526202 IC50|IC50 ratio|1 IC50|Inhibition|12 IC50|Log IC50|20 IC50|Ratio IC50|40 IC50|SI|4 IC50|T/C|1
sqlite>
   ...>
select published_type, standard_type, count(*) from activities
  where standard_type = "IC50" group by published_type;
-Log IC50|IC50|1736 -Log IC50(M)|IC50|28 -Log IC50(nM)|IC50|39 -logIC50|IC50|84 3.3|IC50|1 Absolute IC50 (CHOP)|IC50|940 Absolute IC50 (XBP1)|IC50|940 Average IC50|IC50|34 CIC50|IC50|6 I 50|IC50|202 I-50|IC50|25 I50|IC50|6059 IC50|IC50|1526202 IC50 |IC50|52 IC50 app|IC50|39 IC50 max|IC50|90 IC50 ratio|IC50|2 IC50(app)|IC50|457 IC50_Mean|IC50|12272 IC50_uM|IC50|20 ID50|IC50|3 Log 1/I50|IC50|280 Log 1/IC50|IC50|589 Log 1/IC50(nM)|IC50|88 Log IC50|IC50|7013 Log IC50(M)|IC50|3599 Log IC50(nM)|IC50|77 Log IC50(uM)|IC50|28 Mean IC50|IC50|1 NA|IC50|5 NT|IC50|20 log(1/IC50)|IC50|1016 pI50|IC50|2386 pIC50|IC50|43031 pIC50(mM)|IC50|71 pIC50(nM)|IC50|107 pIC50(uM)|IC50|419
Yeah, I'm going to throw my hands up here, declare "I'm a programmer, Jim, not a bioactivity specialist", and simply use the published_type of IC50.

IC50 activity values

How are the IC50 values measured? Here too I need to choose between "published_units" and "standard_units". A quick look at the two shows that the standard_units are less diverse.

sqlite>
   ...>
select standard_units, count(*) from activities where published_type = "IC50"
  group by standard_units;
|167556 %|148 % conc|70 10'-11uM|1 10'-4umol/L|1 M ml-1|15 equiv|64 fg ml-1|1 g/ha|40 kJ m-2|20 liposomes ml-1|5 mMequiv|38 mg kg-1|248 mg.min/m3|4 mg/kg/day|1 milliequivalent|22 min|9 ml|3 mmol/Kg|10 mol|6 molar ratio|198 nA|6 nM|1296169 nM g-1|1 nM kg-1|4 nM unit-1|7 nmol/Kg|1 nmol/mg|5 nmol/min|1 ppm|208 ppm g dm^-3|7 uL|7 uM hr|1 uM tube-1|9 uM well-1|52 uM-1|25 uM-1 s-1|1 ucm|6 ucm s-1|2 ug|168 ug cm-2|1 ug g-1|2 ug well-1|12 ug.mL-1|61139 ug/g|16 umol kg-1|3 umol.kg-1|8 umol/dm3|2
"Less diverse", but still diverse. By far the most common is "nM for "nanomolar", which is the only unit I expected. How many IC50s have an activities better than 1 micromolar, which is 1000 nM?

sqlite>
   ...>
select count(*) from activities where published_type = "IC50"
   and standard_value < 1000 and standard_units = "nM";
483041
That's fully 483041/1212831 = 40% of the assays in the data dump.

How many of the IC50s are in humans? For that I need a join with the assays table using the assay_id:

sqlite>
   ...>
   ...>
   ...>
   ...>
select count(*) from activities, assays
 where published_type = "IC50"
   and standard_value < 1000 and standard_units = "nM"
   and activities.assay_id = assays.assay_id
   and assay_tax_id = 9606;
240916
About 1/2 of them are in humans.

Assay target type from target_dictionary

What are the possible assay targets in humans? That information is stored in the target_dictionary:

sqlite> select distinct target_type from target_dictionary where tax_id = 9606;
SINGLE PROTEIN
ORGANISM
CELL-LINE
UNKNOWN
PROTEIN COMPLEX
SUBCELLULAR
TISSUE
NUCLEIC-ACID
PROTEIN FAMILY
PROTEIN-PROTEIN INTERACTION
PROTEIN COMPLEX GROUP
SELECTIVITY GROUP
CHIMERIC PROTEIN
MACROMOLECULE
SMALL MOLECULE
OLIGOSACCHARIDE

Remember earlier when I threw caution to the wind? How many of the assays are actually against human targets? I can join on the target id "tid" to compare the taxon id in the target vs. the taxon id in the assay:

sqlite>
   ...>
   ...>
select count(*) from assays, target_dictionary
 where assays.tid = target_dictionary.tid
   and target_dictionary.tax_id = 9606;
301810
sqlite>
   ...>
   ...>
select count(*) from assays, target_dictionary
 where assays.tid = target_dictionary.tid
   and assays.assay_tax_id = 9606;
291152

Compare assay organisms with target organism

What are some of the non-human assay organisms where the target is humans?

sqlite>
   ...>
   ...>
   ...>
   ...>
select distinct assay_organism from assays, target_dictionary
 where assays.tid = target_dictionary.tid
   and assays.assay_tax_id != 9606
   and target_dictionary.tax_id = 9606
 limit 10;
rice Saccharomyces cerevisiae Oryza sativa Rattus norvegicus Sus scrofa Cavia porcellus Oryctolagus cuniculus Canis lupus familiaris Proteus vulgaris Salmonella enterica subsp. enterica serovar Typhi

Compounds tested against a target name

I'm interested in the "SINGLE PROTEIN" target names in humans. The target name is a manually curated field.

sqlite> select distinct pref_name from target_dictionary where tax_id = 9606 limit 5;
Maltase-glucoamylase
Sulfonylurea receptor 2
Phosphodiesterase 5A
Voltage-gated T-type calcium channel alpha-1H subunit
Dihydrofolate reductase
What are structures used in "Dihydrofolate reductase" assays? This requires three table joins, one on 'tid' to go from target_dictionary to assays, another on 'assay_id' to get to the activity, and another on 'molregno' to go from assay to molecule_dictionary so I can get the compound's chembl_id. (To make it more interesting, three of the tables have a chembl_id column.)
sqlite>
   ...>
   ...>
   ...>
   ...>
   ...>
   ...>
select distinct molecule_dictionary.chembl_id
  from target_dictionary, assays, activities, molecule_dictionary
 where target_dictionary.pref_name = "Dihydrofolate reductase"
   and target_dictionary.tid = assays.tid
   and assays.assay_id = activities.assay_id
   and activities.molregno = molecule_dictionary.molregno
 limit 10;
CHEMBL1679 CHEMBL429694 CHEMBL106699 CHEMBL422095 CHEMBL1161155 CHEMBL350033 CHEMBL34259 CHEMBL56282 CHEMBL173175 CHEMBL173901
sqlite>
   ...>
   ...>
   ...>
   ...>
   ...>
select count(distinct molecule_dictionary.chembl_id)
   from target_dictionary, assays, activities, molecule_dictionary
  where target_dictionary.pref_name = "Dihydrofolate reductase"
    and target_dictionary.tid = assays.tid
    and assays.assay_id = activities.assay_id
    and activities.molregno = molecule_dictionary.molregno;
3466
There are 3466 of these, including non-human assays. I'll limit it to human ones only:
sqlite>
   ...>
   ...>
   ...>
   ...>
   ...>
   ...>
select count(distinct molecule_dictionary.chembl_id)
   from target_dictionary, assays, activities, molecule_dictionary
  where target_dictionary.pref_name = "Dihydrofolate reductase"
    and target_dictionary.tax_id = 9606
    and target_dictionary.tid = assays.tid
    and assays.assay_id = activities.assay_id
    and activities.molregno = molecule_dictionary.molregno;
1386
I'll further limit it to those with an IC50 of under 1 micromolar:
sqlite>
sqlite>
   ...>
   ...>
   ...>
   ...>
   ...>
   ...>
   ...>
   ...>
   ...>
.timer on
select count(distinct molecule_dictionary.chembl_id)
   from target_dictionary, assays, activities, molecule_dictionary
  where target_dictionary.pref_name = "Dihydrofolate reductase"
    and target_dictionary.tax_id = 9606
    and target_dictionary.tid = assays.tid
    and assays.assay_id = activities.assay_id
    and activities.published_type = "IC50"
    and activities.standard_units = "nM"
    and activities.standard_value < 1000
    and activities.molregno = molecule_dictionary.molregno;
255 Run Time: real 174.561 user 18.073715 sys 23.285346
I turned on the timer to show that the query took about 3 minutes! I repeated it to ensure that it wasn't a simple cache issue. Still about 3 minutes.

ANALYZE the tables

The earlier query, without the activity filter, took 5.7 seconds when the data wasn't cached, and 0.017 seconds when cached. It found 1386 matches. The new query takes almost 3 minutes more to filter those 1386 matches down to 255. That should not happen.

This is a strong indication that the query planner used the wrong plan. I've had this happen before. My solution then was to "ANALYZE" the tables, which "gathers statistics about tables and indices and stores the collected information in internal tables of the database where the query optimizer can access the information and use it to help make better query planning choices."

It can take a while, so I limited it to the tables of interest.

sqlite> analyze target_dictionary;
Run Time: real 0.212 user 0.024173 sys 0.016268
sqlite> analyze assays;
Run Time: real 248.184 user 5.890109 sys 4.793236
sqlite> analyze activities;
Run Time: real 6742.390 user 97.862790 sys 129.854073
sqlite> analyze molecule_dictionary;
Run Time: real 33.879 user 2.195662 sys 2.043848
Yes, it took almost 2 hours to analyze the activities table. But it was worth it from a pure performance view. I ran the above code twice, with this pattern:
% sudo purge  # clear the filesystem cache
% sqlite3 chembl_21.db  # start SQLite
SQLite version 3.8.5 2014-08-15 22:37:57
Enter ".help" for usage hints.
sqlite> .timer on
sqlite> .... previous query, with filter for IC50 < 1uM ...
255
Run Time: real 8.595 user 0.038847 sys 0.141945
sqlite> .... repeat query using a warm cache
255
Run Time: real 0.009 user 0.005255 sys 0.003653
Nice! Now I only need to do about 60 such queries to justify the overall analysis time.


Andrew Dalke is an independent consultant focusing on software development for computational chemistry and biology. Need contract programming, help, or training? Contact me



Copyright © 2001-2013 Andrew Dalke Scientific AB