Dalke Scientific Software: More science. Less time. Products

Where do you put all this chemical data? There's the structure information (SMILES, connection table), various trade names and other aliases, chemical properties, experimental results, supplier information and more. Some people use regular files (often termed flat files). That works okay if you only want to store the database but most times you have data because you want to search it for information. Flat files are searchable but slow for doing complex queries.

The solution of course is a database, but what kind? The traditional business solution is a relational database (RDBMS stands for relational database management system). These store tables where each entry in the table contains values appropriate for the given column (eg, "Andrew" for the "name" column, the number 35 for age, etc.) Because these databases come from the business world the supported data types were the ones most often used there; numbers, string, dates and other widely-used values.

Where does compound information go? Structures can be represented as a SMILES string so perhaps as a string field. Fingerprints can also be put into a column. But there wasn't any way to do chemically relevant searches in the standard databases. Look at the SQL defintion and there's no defined function for doing a substructure search.

People instead used what's termed a federated database, with one database to store the chemistry data, another to store the supplier data, and another for the stockroom, and a few more for good measure. Daylight is one company which sold (and still sells) a chemistry specific database. It's very fast for doing chemistry searches but doesn't have the power to handle complex data well.

The approach works but it's clumsy. There ends up being another layer which coordinates searches across the multiple databases and the strong likelihood that the data in one database will be out of synch with another.

I don't know the exact history well so be wary about the exact details of the following. In the last 1980s there were research programs to extend the RDBMS model to include new data types and searches over those types. An example of a search some people wanted was "what are the nearest 3 stores within an hour drive of this address?".

There are many different database models. The relational model is only one possible way to organize data. Another is hierarchial, or as an object database (imagine if your data structures could stay around so when you restart Python it has the data from the previous session). The research in the 1980s lead to the object-relational model, which is a hybrid solution that supports object and new search types within an existing relational database.

There's no way a single company can provide all of the domain specific functionality needed by the world. Instead the database providers made their DBMS extensible, so third-parties could add new functionality to the database server and extend the query language. An extension was called a datablade by Illustra and a data cartridge by Oracle. The commerical database vendors started providing database extension support in the early 1990s. The chemistry database companies started evaluating the move to one of the RDBMSes in the mid-1990s and there are now about 10 Oracle data cartridges for chemistry.

Making the support for database extensions isn't easy, but as it turns out chemisty isn't that difficult. One advantage comes because SMILES are strings, so there's a natural data type already in the database. The complications come in when doing the search. The searches need to index a given speed to make the fingerprint screens and do the other things needed for a fast search. The index has to be updated when the database changes. The database has a query optimizer which tries to figure out the fastest way to actually query the database. For example, in "find all compounds which are 90% similar to Chemical-X and are sold by Utonium Enterprises" should probably search for the "sold by" critrion first because that's a lot faster than doing a similarity search.

One of the consequences of developing a data cartridge for Oracle is that Oracle is used most often by big companies. Big companies often have big money, so the cartridge developers ask for and can get a lot for their work. Reducing the number of in-house databases and the support costs for the federation makes this worthwhile for the larger companies. Few small companies want to spend, say, US$200,000 for a database extension.

On the other hand, small companies rarely have large compound data sets so don't need all the work that's required to implement fast searches in an ORDBMS. TJ O'Donnell is developing a low-end system for that group. Called CHORD it's an extension to PostgreSQL using OpenEye's OEChem for the chemistry. I installed an evaluation copy of CHORD on one of the machines here along with a test data set so you can get a sense of how a cartridge works.

I'll use the PostgreSQL client for this - no Python today, though perhaps in a week or two I'll show how to use one of the Python libraries for talking to a RDBMS. These examples should work on any Linux machine here but I'll probably need to help you install the client.

I'll start the client and connect it to the gnova database on nbn245.

andrew@nbn231:~$ psql -h nbn245 gnova
Welcome to psql 7.4.8, the PostgreSQL interactive terminal.

Type:  \copyright for distribution terms
       \h for help with SQL commands
       \? for help on internal slash commands
       \g or terminate with semicolon to execute query
       \q to quit

SSL connection (cipher: DHE-RSA-AES256-SHA, bits: 256)

gnova=#
As with all chemical informatics programs the first step is to generate a canonical SMILES for a given structure. My test for that is to see that "CCO" and "OCC" are canonicalized to the same structure. Because of the way SQL works I do this through the SELECT statement even though I'm not selecting from any rows of the database.
gnova=# SELECT cansmiles('CCO');
 cansmiles 
-----------
 CCO
(1 row)

gnova=# SELECT cansmiles('OCC');
 cansmiles 
-----------
 CCO
(1 row)

gnova=# 
Make sure that your SQL statement ends with a ';'!

Here are three different functions for computing SMILES strings. The first gives the OpenEye canonical SMILES, the second shows the structure in Kekule form and the last includes all of the explicit hydrogens.

gnova=# Select cansmiles('c1ccccc1C(=O)NC');
    cansmiles    
-----------------
 CNC(=O)c1ccccc1
(1 row)

gnova=# Select keksmiles('c1ccccc1C(=O)NC');
     keksmiles      
--------------------
 C=1C=CC=CC1C(=O)NC
(1 row)

gnova=# Select impsmiles('c1ccccc1C(=O)NC');
               impsmiles               
---------------------------------------
 [cH]1[cH][cH][cH][cH]c1C(=O)[NH][CH3]
(1 row)

gnova=# 
I've loaded the database with the NCI data set, which is well known in this field because for years it was the only large, freely available data set containing screening results. Here I'll count the number of structures in the database.
gnova=# Select count(*) from nci.structure;
 count  
--------
 237597
(1 row)

gnova=# 
I want to double check that all the compounds in the database are in canonical form. Since that's boring I'll also see how many have a canonical form which is not the same as the Kekule form.
gnova=# Select count(smiles) from nci.structure where smiles != cansmiles(smiles);
 count 
-------
     0
(1 row)

gnova=# Select count(*) from nci;
ERROR:  relation "nci" does not exist
gnova=# Select count(smiles) from nci.structure where smiles != keksmiles(smiles);
 count  
--------
 183604
(1 row)

gnova=# 
Suppose I have the input structure "c1ccccc1C(=O)NC" and I want to get its CAS number from the database. I can search for records which match the canonical SMILES form of that structure and from the record get the CAS field.
gnova=# select smiles, cas from nci.structure where smiles = cansmiles('c1ccccc1C(=O)NC');
     smiles      |   cas    
-----------------+----------
 CNC(=O)c1ccccc1 | 613-93-4
(1 row)

gnova=# 
You can search the database for a given substructure, specified by a SMARTS and find compounds that match or get the number of matches.
gnova=# Select count(smiles) from nci.structure where matches(smiles,'c1ccccc1C(=O)NC');
 count 
-------
  5785
(1 row)

gnova=# Select count(smiles) from nci.structure where matches(smiles,'c1ccccc1C(=O)NC') AND matches(smiles, 'S');
 count 
-------
   893
(1 row)

gnova=# Select count(smiles) from nci.structure where count_matches(smiles, '[Cl,Br,F,I]') > 22;
 count 
-------
    35
(1 row)

gnova=# 
There's support for fingerprints. The gnova database includes fingerprints for the substructure search screens (using fp) and MACCS keys for doing structure similarity (using public166keys). In the following I'll fold the normal substructure search fingerprints to 64 bits instead of the default of 1024. That makes the fingerprint shorter so it takes less space and is also a worse screen.
gnova=# Select fp('c1ccccc1O', 64);
                                fp                                
------------------------------------------------------------------
 0010100001101111010111010111111101100101101000111001110100010110
(1 row)

gnova=# Select public166keys('c1ccccc1O');

                                                                             public166keys                                                                              
------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000100000000000000000000000001000100000000100001000011110
(1 row)

gnova=# 
The CHORD extension also includes the Tversky, Tanimoto, Euclid and Hamming functions.
gnova=# Select tanimoto(public166keys('c1ccccc1O'),
gnova(#                 public166keys('c1ccccc1C(=O)NC') );
 tanimoto 
----------
 0.210526
(1 row)

gnova=# Select euclid(public166keys('c1ccccc1O'),
gnova(#               public166keys('c1ccccc1C(=O)NC') );
 euclid  
---------
 0.95375
(1 row)

gnova=# 
Even though it looks like the field is a long string the actual data is stored in binary form so a 166 bit fingerprint actually takes only 20 bytes. What you see is called the external representation of the fingerprint, just like the SMILES string may be the external representation of a molecule.

Fingerprints take a bit of time to build so I don't want the database to make a fingerprint for each compound every time it does a search. It's a database so I'll compute the fingerprint once and store it with the compound record. I did that after I loaded the database by using the SQL command.

Update nci.structure set fkey=gnova.public166keys(smiles);
How did I know that? I got it from the installation program. :)

There's no need for you to run that command because the information is already in the database. I can search on that field like this:

gnova=# Select count(*) from nci.structure
gnova-#     where tanimoto(public166keys('c1ccccc1(=O)NC'), fkey) > 0.7;
 count 
-------
    47
(1 row)

gnova=# 
Or get the top 10 SMILES strings which have more than 70% similarity.
gnova=# Select smiles, tanimoto(public166keys('c1ccccc1(=O)NC'), fkey) from nci.structure
gnova-#      where tanimoto(public166keys('c1ccccc1(=O)NC'), fkey) > 0.7
gnova-#      order by tanimoto(public166keys('c1ccccc1(=O)NC'), fkey)
gnova-#      limit 10;
             smiles             | tanimoto 
--------------------------------+----------
 COc1ccccc1NC(=O)c2ccccc2       | 0.703704
 CC(=O)Nc1ccccc1OC              | 0.703704
 Cc1cc(ccc1NC(=O)C)OC           | 0.703704
 CC(=O)Nc1ccc(cc1)OC(C)(C)C     | 0.703704
 Cc1cc(ccc1NC(=O)c2ccccc2)OC    | 0.703704
 Cc1ccccc1NC(=O)c2ccc(cc2)OC    | 0.703704
 COc1ccc(cc1)C(=O)Nc2ccc(cc2)OC | 0.703704
 COc1ccc(cc1)NC(=O)C#N          | 0.703704
 CNC(=O)c1ccc(cc1)OC            | 0.708333
 CNC(=O)c1cccc(c1)O             | 0.708333
(10 rows)
gnova-# 
One thing you might wonder about is the duplicate calls to tanimoto(public166keys('c1ccccc1(=O)NC'), fkey). It looks like it computes the tanimoto three times. That may be the case but most databases have optimizations which identify repeated code so it's only computed once.

There are a few other things you can do in the database. For those, take a look at the documentation.



Copyright © 2001-2013 Andrew Dalke Scientific AB