Dalke Scientific Software: More science. Less time. Products

Using NCBI's web BLAST

There are many ways to BLAST. The program is available from NCBI and you can download it and run it on your own machine. You can use the web interface which searches data available on NCBI's machines. (If you haven't used the web interface before there's a very detailed tutorial that steps you through the process.) The NCBI data is updated frequently so many people prefer writing programs that use the NCBI server instead of maintaining everything locally. NCBI supports that with the QBlast service.

Biopython includes support for running BLAST locally (all versions of BLAST) and remotely (QBlast only allows BLASTP and BLASTN searches). Besides reading this lecture you should go through the relevant Biopython tutorial for BLAST. I'll present it in a different style and focus.

I'll start with the Biopython web interface to BLAST (through NCBI's QBlast interface). The function is available as Bio.Blast.NCBIWWW.qblast which means it's in the NCBIWWW module, which is a submodule of Blast, which is a submodule of Bio. A common practice in Python is to import the lowest module of a hierarchy into the file, like this

>>> from Bio.Blast import NCBIWWW
>>> NCBIWWW.qblast 
<function qblast at 0x4391f0>
>>> 
Once I imported the Blast submodule I can use it to get to the qblast function, as you see in the above text.

The qblast function takes many parameters. You can get a list of them with Python's built-in help() function:

>>> help(NCBIWWW.qblast)

Help on function qblast in module Bio.Blast.NCBIWWW:

qblast(program, database, sequence, ncbi_gi=None, descriptions=None, alignments=None, exp
ect=None, matrix=None, filter=None, format_type=None, hitlist_size=None, entrez_query='(n
one)')
    Do a BLAST search using the QBLAST server at NCBI.
    program        BLASTP, BLASTN, BLASTX, TBLASTN, or TBLASTX.
    database       Which database to search against.
    sequence       The sequence to search.
    ncbi_gi        TRUE/FALSE whether to give 'gi' identifier.  Def FALSE.
    descriptions   Number of descriptions to show.  Def 500.
    alignments     Number of alignments to show.  Def 500.
    expect         An expect value cutoff.  Def 10.0.
    matrix         Specify an alt. matrix (PAM30, PAM70, BLOSUM80, BLOSUM45).
    filter         "none" turns off filtering.  Default uses 'seg' or 'dust'.
    format_type    "HTML", "Text", "ASN.1", or "XML".  Def. "HTML".
    entrez_query   Entrez query to limit Blast search
    
    This function does no checking of the validity of the parameters
    and passes the values to the server as is.  More help is available at:
    http://www.ncbi.nlm.nih.gov/BLAST/blast_overview.html


This says that three fields are required: the program to use ("blastp" or "blastn"), the database to search against (see NCBI's selection guide for a list of databases available for the two programs), and the query sequence. The other parameters are optional.

By reading the qblast documentation (not for the faint of heart!) I found that the query sequence must be in FASTA format or by a GenBank accession or list of gene identifiers. I'll use a FASTA record, which I'll define interactively. (The triple quoted strings are used for strings that may be several lines long.)

>>> my_sequence = """>My Sequence (in FASTA format)
... MEMETIEEQSKCEMSITTLP
... 
... """
>>> print my_sequence 
>My Sequence (in FASTA format)
MEMETIEEQSKCEMSITTLP


>>> 
I'll run that sequence against NCBI's non-redundent protein database ("nr") at NCBI using blastp.
>>> qblast_output = NCBIWWW.qblast("blastp", "nr", my_sequence)
Bio/Blast/NCBIWWW.py:1070: UserWarning: qblast works only with blastn and blastp for now.
  warnings.warn("qblast works only with blastn and blastp for now.")
>>> qblast_output 
<cStringIO.StringI object at 0x22470>
>>> text = qblast_output.read()  # I'm using these two lines to show you
>>> qblast_output.seek(0)        # what's going on.  You don't need them.
>>> 
(That warning message is annoying. I just finished writing an email to the Biopython list asking why it's there.)

The qblast function returns a cStringIO.StringIO instance. That's a Python class which acts like a file but instead of getting the data from the file system it gets the data from a string. If you want to see the result you can read from this file (or "file-like") instance using it's read() method, as I did there. Here's what the start of the BLAST web output text looks like.

>>> print text[:2000]
HTTP/1.0 200 OK
Date: Thu, 18 Aug 2005 22:01:49 GMT
Content-Type: text/html
Server: Apache/1.3.27 (Unix) mod_fastcgi/2.4.0
Via: 1.1 ctb-cache1 (NetCache NetApp/5.5R6D36), 1.1 ctb-cache2 (NetCache NetApp/5.5R6D27)

<HTML>
<HEAD>
</HEAD>
<BODY BGCOLOR="#FFFFFF" LINK="#0000FF" VLINK="#660099" ALINK="#660099">
<IMG SRC="images/head_results.gif" ALT="Header of the page" BORDER="0"NAME="BlastHeaderGif" WIDTH="600" HEIGHT="45" ALIGN="middle">
<form action="Blast.cgi" enctype="application/x-www-form-urlencoded" method="POST"><input name="RID" type="hidden" value="1124403204-5082-8436507971.BLASTQ2"><input name="SEARCH_DB_STATUS" type="hidden" value="43"><input name="_PGR" type="hidden" value="0"><input name="_PGR" type="hidden" value="0"><input name="CMD" type="hidden" value="Get"></form>
<form action="Blast.cgi" enctype="application/x-www-form-urlencoded" method="POST" TARGET=""><input name="RID" type="hidden" value="1124403204-5082-8436507971.BLASTQ2"><input name="SEARCH_DB_STATUS" type="hidden" value="43"><input name="_PGR" type="hidden" value="0"><input name="CMD" type="hidden" value="Get"><p><!--
QBlastInfoBegin
  Status=READY
QBlastInfoEnd
--><p><TITLE> RID=1124403204-5082-8436507971.BLASTQ2, My Sequence (in FASTA format) </TITLE>
<b>BLASTP 2.2.11 [Jun-05-2005]</b>

<pre>
<b><a href="http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=PubMed&cmd;=Retrieve&list;_uids
=9254694&dopt;=Citation">Reference</a>:</b>
Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schäffer, 
Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997), 
"Gapped BLAST and PSI-BLAST: a new generation of protein database search
programs",  Nucleic Acids Res. 25:3389-3402.

RID: 1124403204-5082-8436507971.BLASTQ2

<b>Query=</b> My Sequence (in FASTA format)
         (20 letters)


<b>Database:</b> All non-redundant GenBank CDS
translations+PDB+SwissProt+PIR+PRF excluding environmental samples 
           2,786,403 sequences; 955,764,212 total letters

<p> <p>If you have any proble
>>> 
As you can see, the result is not the BLAST format you're used to seeing. It contains HTML markup to highlight certain fields, and to display one of the author's names correctly (Schäffer, instead of Schaffer).

After the read() call I used seek(0). The file object has a position in the file. The reads and writes occur at that position and cause the position information to advance (except reads past the end of the file). The seek(0) method tells the file to move to position 0, so the next read will occur at the beginning. I needed it because the read() left the file position at the end of the file.

Because the qblast output is in a different format than the normal Blast output you'll need to use a different parser for it. The parser is also located in the NCBIWWW module and like the NCBIStandalone example from yesterday you'll need to make a parser before you can parse it. Using the parser is identical to what I showed yesterday and the parsed result is in the same data structure.

>>> parser = NCBIWWW.BlastParser()
>>> result = parser.parse(qblast_output)
>>> result
<Bio.Blast.Record.Blast instance at 0x78afd0>
>>> len(result.alignments)
4
>>> result.alignments[0].title
'>gb|AAN12028.1| CG8114-PB, isoform B [Drosophila melanogaster]\nref|NP_523965.2
| CG8114-PB, isoform B [Drosophila melanogaster]'
>>> result.alignments[0].hsps
[<Bio.Blast.Record.HSP instance at 0x784148>]
>>> result.alignments[0].hsps[0].query
'MEMETIEEQSKCEMSITTLP'
>>> result.alignments[0].hsps[0].sbjct
'MEMETIEEQSKCEMSITTLP'
>>> 
The standalone and web interfaces to BLAST return the results with the same data structure. This is useful. It means you don't have to write code once for the each style of interface. Code can expect the common Blast data structure and not be aware of where the data came from.

Using local BLAST

I had hoped to give some example of using the command-line version of BLAST. As it turns out, I don't have BLAST on this laptop nor do I have a reasonable data set for testing it, and I'm not downloading all that data over the DSL to Dan's home. It isn't that hard and you've already learned how to parse the results so you should be able to pick it up from the the section titled 3.1.4 Running BLAST locally in the Biopython cookbook.

We'll figure this out during the lecture.

To install BLAST and set up some databases, it might be helpful to read the BLAST documentation.

Making FASTA output

I've talked a lot about parsing a FASTA record, but what about the other way? What about converting the FastaRecord into a FASTA formatted string? To start off, here's the class definition for the FastaRecord from a few days ago.

class FastaRecord(object):
    def __init__(self, title, sequence):
        self.title = title
        self.sequence = sequence
Making the title line is simple. It's ">" followed by the title followed by a newline ("\n").

More complicated is the sequence line. This can be very long, which means I'll need to break the sequence into parts and write each part on its own line. For now I'll use 10 characters per line, so the first sequence line contains the first 10 characters, the second contains the second 10, and so on. (In real life the number is around 70. I'm using 10 so the examples are easier to write and shorter.)

I can use the substring notation to get the different fields. Here is a test sequence and the four lines I want to get from it.

>>> sequence = "ACGTTTCGATGATCATAGGTGTCACATGATGACATAT"
>>> sequence[0:10]
'ACGTTTCGAT'
>>> sequence[10:20]
'GATCATAGGT'
>>> sequence[20:30]
'GTCACATGAT'
>>> sequence[30:40]
'GACATAT'
>>> 
What I need is some way to start at 0 and count by 10s. There's already a built-in Python function named range() that starts at 0 and count by 1s. As it happens, it supports optional start positions and step size. If the step size is 2 then it counts by 2. If the step size is 10 then it counts by 10. Here are a few examples.
          # Start from 0 (the default) and go up to (but not including) 22.
          # Use the default step size of 1.
>>> range(22)
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21]
          # Specify the start position of 0 and go up to (but not including) 22.
          # Use the default step size of 1.
>>> range(0, 22)
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21]
          # Specify the start position of 18 and go up to (but not including) 22.
          # Use the default step size of 1.
>>> range(18, 22)
[18, 19, 20, 21]
          # Start position of 0, end position of 22, step size of 10
>>> range(0, 22, 10)
[0, 10, 20]
          # Start position of 1, end position of 22, step size of 10
>>> range(1, 22, 10)
[1, 11, 21]
          # Start position of 2, end position of 22, step size of 10
>>> range(2, 22, 10)
[2, 12]
>>> 
I can use the range's step size to get a list of indicies for the start of each sublist. Because I want 10 characters per line, the end position is +10 from the start position. Here's what it looks like, with some bits at the front to show you what's going on.
>>> sequence
'ACGTTTCGATGATCATAGGTGTCACATGATGACATAT'
>>> len(sequence)
37
>>> range(0, len(sequence), 10)
[0, 10, 20, 30]
>>> for i in range(0, len(sequence), 10):
...   sequence[i:i+10]
... 
'ACGTTTCGAT'
'GATCATAGGT'
'GTCACATGAT'
'GACATAT'
>>> 
See how the last line only has 7 characters and not 10? In a Python substring slice (the code that looks like sequence[start:end]) if the end position is too large it is rounded down to the size of the string. That's why in the following
>>> sequence[30:40]
'GACATAT'
>>> 
the requested subsequence size is 10 characters long but the actual sequence has only 7 characters.

I have enough at this point to make a function that might prove useful for debugging. I'll print a FASTA record to the screen. Those sorts of functions usually have names starting with print_ so I'll name this print_fasta.

>>> class FastaRecord(object):
...     def __init__(self, title, sequence):
...         self.title = title
...         self.sequence = sequence
... 
>>> def print_fasta(record):
...     print ">" + record.title
...     for i in range(0, len(record.sequence), 10):
...         print record.sequence[i:i+10]
... 
>>> rec = FastaRecord("Title", sequence)
>>> print_fasta(rec)
>Title
ACGTTTCGAT
GATCATAGGT
GTCACATGAT
GACATAT
>>> 

Here's an example of how use, based on the fasta_reader.py from a few days ago.

>>> import fasta_reader
>>> rec = fasta_reader.read_fasta_record(open("br_sequences.fasta"))
>>> print_fasta(rec)
>gi|4376110|emb|CAA49762.1| bacterioopsin [synthetic construct]
EFPAPRPPTP
EAQITGRPEW
IWLALGTALM
GLGTLYFLVK
GMGVSDPDAK
KFYAITTLVP
AIAFTMYLSM
LLGYGLTMVP
FGGEQNPIYW
ARYADWLFTT
PLLLLDLALL
VDADQGTILA
LVGADGIMIG
TGLVGALTKV
YSYRFVWWAI
STAAMLYILY
VLFFGFTSKA
ESMRPEVAST
FKVLRNVTVV
LWSAYPVVWL
IGSEGAGIVP
LNIETLLFMV
LDVSAKVGFG
LILLRSRAIF
GEAEAPEPSA
GDGAAATSD
>>> 
It's very strange to see only 10 characters per line of a FASTA file! I'll change it to 70 characters per line.
>>> def print_fasta(record):
...   print ">" + record.title
...   for i in range(0, len(record.sequence), 70):
...     print record.sequence[i:i+70]
... 
>>> print_fasta(rec)
>gi|4376110|emb|CAA49762.1| bacterioopsin [synthetic construct]
EFPAPRPPTPEAQITGRPEWIWLALGTALMGLGTLYFLVKGMGVSDPDAKKFYAITTLVPAIAFTMYLSM
LLGYGLTMVPFGGEQNPIYWARYADWLFTTPLLLLDLALLVDADQGTILALVGADGIMIGTGLVGALTKV
YSYRFVWWAISTAAMLYILYVLFFGFTSKAESMRPEVASTFKVLRNVTVVLWSAYPVVWLIGSEGAGIVP
LNIETLLFMVLDVSAKVGFGLILLRSRAIFGEAEAPEPSAGDGAAATSD
>>> 
More often though I'll want the FASTA formatted output as a string. That's not too hard. Instead of printing the text for a line I'll append each one to a list then use "\n" to join the elements at the end.
>>> def to_fasta(record):
...     lines = []
...     lines.append(">" + record.title)
...     for i in range(0, len(record.sequence), 70):
...         lines.append(record.sequence[i:i+70])
...     return "\n".join(lines)
... 
>>> to_fasta(rec)
'>gi|4376110|emb|CAA49762.1| bacterioopsin [synthetic construct]\nEFPAPRPP
TPEAQITGRPEWIWLALGTALMGLGTLYFLVKGMGVSDPDAKKFYAITTLVPAIAFTMYLSM\nLLGYGLTMVPFGG
EQNPIYWARYADWLFTTPLLLLDLALLVDADQGTILALVGADGIMIGTGLVGALTKV\nYSYRFVWWAISTAAMLYI
LYVLFFGFTSKAESMRPEVASTFKVLRNVTVVLWSAYPVVWLIGSEGAGIVP\nLNIETLLFMVLDVSAKVGFGLIL
LRSRAIFGEAEAPEPSAGDGAAATSD'
>>> print to_fasta(rec)
>gi|4376110|emb|CAA49762.1| bacterioopsin [synthetic construct]
EFPAPRPPTPEAQITGRPEWIWLALGTALMGLGTLYFLVKGMGVSDPDAKKFYAITTLVPAIAFTMYLSM
LLGYGLTMVPFGGEQNPIYWARYADWLFTTPLLLLDLALLVDADQGTILALVGADGIMIGTGLVGALTKV
YSYRFVWWAISTAAMLYILYVLFFGFTSKAESMRPEVASTFKVLRNVTVVLWSAYPVVWLIGSEGAGIVP
LNIETLLFMVLDVSAKVGFGLILLRSRAIFGEAEAPEPSAGDGAAATSD
>>> 
Perhaps I should have added a "\n" to the end of the joined lines, so that the last line has a newline like every other line. That's hard to judge and my experience says it's best that the last character not be a newline.

Some classes have a very natural string forms. Integers are easily represented as a number, for example. To get the string form of an object, you can use the str function. But as you can see, the str() of a FastaRecord isn't that helpful.

>>> str(rec)
'<fasta_reader.FastaRecord object at 0x55a30>'
>>>

It's easy to override the default string representation using a special method named __str__. If the class has a __str__ method then str() will call that method to get the string representation, otherwise it uses the not-very-helpful default.

class FastaRecord(object):
    def __init__(self, title, sequence):
        self.title = title
        self.sequence = sequence
    
    def __str__(self):
        lines = []
        lines.append(">" + self.title)
        for i in range(0, len(self.sequence), 70):
            lines.append(self.sequence[i:i+70]) 
        return "\n".join(lines)

>>> FastaRecord("This is the title", "WERTJISDHGFIDGHEROIJAIRJGOI" +
...         "ASDFAWRSDVSDRTGQAEWCASEFQRERGEOGKLFKBJRJGIXJFXIFVEIRJ")
<__main__.FastaRecord object at 0x66290>
>>> record = _
>>> print record
>This is the title
WERTJISDHGFIDGHEROIJAIRJGOIASDFAWRSDVSDRTGQAEWCASEFQRERGEOGKLFKBJRJGIX
JFXIFVEIRJ
>>> 
(To understand this fully, the "print" command first calls str() on an object to get its string representation.)

This is useful for more than just debugging. The NCBIWWW.qblast function takes the query sequence as a parameter. I said that the sequence must be a sequence formatted as FASTA (or a set of NCBI identifiers). I lied a little bit. It's actually something whose string form is in FASTA format (et cetera). With the special __str__ method of the FastaRecord I can pass the sequence record directly into the qblast function without needing to do an extra conversion from a SeqRecord into a string.

>>> seq = FastaRecord("test", "MEMETIEEQSKCEMSITTLP")
>>> print seq
>test
MEMETIEEQSKCEMSITTLP
>>> NCBIWWW.qblast("blastp", "nr", seq)
<cStringIO.StringI object at 0x22530>
>>> output = _
>>> result = NCBIWWW.BlastParser().parse(output)
>>> len(result.alignments)
4
>>> 



Copyright © 2001-2013 Andrew Dalke Scientific AB