Dalke Scientific Software: More science. Less time. Products

BLAST is the most widely used bioinformatics tool, meaning that it uses biologically based methods to search large sequence databases. The original BLAST authors expected the output to be read by people and not by programs. Turns out that people use BLAST as one part of a larger algorithm and want to automate the BLAST step. That meant they needed to write parsers for the various BLAST output flavors (BLASTN, BLASTP, TBLASTX, WU-BLAST, and so on). Eventually these people joined together to put the BLAST parsers into a library for use by others and you'll find that Bioperl, Biopython, BioJava, etc. all have BLAST output parsers.

BLAST now has output support for two machine readable formats. The older one is ASN.1. This is an international standard used in many different fields, including telecommunications and cryptography. It didn't become popular. The format standard that ended up being the most popular is XML, and in 2001 NCBI BLAST added a flag for BLAST to generate XML output. [Hint for assignment 1: use "ASN.1" as another search term when looking for web pages about the BLAST output.]

Even with the XML output people are more comfortable reading the normal BLAST. Some will run a BLAST search several times and scan over the result by eye until the parameters are correct, then want to extract data from the file. I'll show how to extract various fields from the file because that knowledge transfers well to parsing other formats.

To start off with, take a look at this BLASTP output [blastp.txt]. It comes from one of the Bioperl test cases and comes from BLAST version 2.0.11 in 2000. Rather old but the format hasn't changed much in the last 8 years and it's something I had handy.

For reference, here is the first few lines of the BLASTP output

BLASTP 2.0.11 [Jan-20-2000]


Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer, 
Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997), 
"Gapped BLAST and PSI-BLAST: a new generation of protein database search
programs",  Nucleic Acids Res. 25:3389-3402.

Query= sp|P09429|HMG1_HUMAN HIGH MOBILITY GROUP PROTEIN HMG1 (HMG-1) -
Homo sapiens (Human).
         (214 letters)

Database: sprot.fas
           72,615 sequences; 26,335,942 total letters

There's a lot of data in that BLASTP file. The first thing I can get is the program version information on the first line. I'll first get the information interactively

>>> blastfile = open("blastp.txt")
>>> blastfile.readline()
'BLASTP 2.0.11 [Jan-20-2000]\n'
>>> 
I mentioned on Monday that it's best to put a sanity check in your code to figure out if you're reading the expected file format. Every BLASTP output starts with the text "BLASTP" on the first line. Modifying the above snippet of code a bit, here's a program which reports the version information of a BLAST file.
## blast1.py
# Print the version information about a BLASTP file

blast_file = open("blastp.txt")
line = blast_file.readline()

# Sanity check that this is a BLASTP output text file
if not line.startswith("BLASTP"):
    raise TypeError("Not a BLASTP file: %r" % version)

# Remove the trailing newline
version = line.rstrip()
print "version", repr(version)

When I run it the output is
% python blast1.py
version 'BLASTP 2.0.11 [Jan-20-2000]'
% 

The next field I want to get is the number of sequences in the database and the total number of letters. That's in the line which says "72,615 sequences; 26,335,942 total letters". To get that data I need to figure out which line it's on. The easiest way for this file is to recognize that it's on the 14th line of the file, but that's not a good solution. The title field from the query input is written after the "Query= ". If it's too long, BLAST will fold the title over multiple lines, so you don't know if the sequence and letter count information is on line 13, 14, 15, or later.

What you do know is that it's on the line after the line that starts with "Database: ". You can use it as a recognition site, as it were. Read lines up to the one which starts with "Database: " then read the next line to get the sequence counts. Here again I'll do some sanity checks to make sure this is the counts line. I'll check for the end of file (that's the check for "") in case the file was truncated and I'll check for the "Searching..." line, as a way to stop quickly if I missed the "Database: " line.

## blast2.py
# Print information about a BLASTP file

blast_file = open("blastp.txt")
line = blast_file.readline()

# Sanity check that this is a BLASTP output text file
if not line.startswith("BLASTP"):
    raise TypeError("Not a BLASTP file: %r" % version)

# Remove the trailing newline
version = line.rstrip()
print "version", repr(version)

# Skip to the "Database: " line
while 1:
    line = blast_file.readline()
    if line.startswith("Database: "):
        break
    # Sanity check that we haven't gone too far
    if not line or line.startswith("Searching..."):
        raise TypeError("Could not find the 'Database: ' line")

# The next line contains the sequence and total letter counts, like this
# "      72,615 sequences; 26,335,942 total letters"
counts_line = blast_file.readline().strip()
print "Counts line", repr(counts_line)
The output is
% python blast2.py
version 'BLASTP 2.0.11 [Jan-20-2000]'
Counts line '72,615 sequences; 26,335,942 total letters'
%

I need to get the "72,615" and "26,335,942" fields. The most common way is to use the string split() method to extract a list of words. The first word (element [0]) is the sequence count and the third word (element [2]) is the letter count.

>>> line = "72,615 sequences; 26,335,942 total letters"
>>> line.split()
['72,615', 'sequences;', '26,335,942', 'total', 'letters']
>>> words = line.split()
>>> words[0]
'72,615'
>>> words[2]
'26,335,942'
>>>
These numbers each contain comma to make the numbers easier to read. Computers don't have a problem reading numbers so I need to get rid of the commas before converting them into integers. I'll use the string replace() method to make new strings but with the "," replaced by the empty string "". I can then use int() to convert the string into an integer.
>>> words[0]
'72,615'
>>> words[0].replace(",", "")
'72615'
>>> int(words[0].replace(",", ""))
72615
>>> words[2]
'26,335,942'
>>> words[2].replace(",", "")
'26335942'
>>> int(words[2].replace(",", ""))
26335942
>>> 
Here's the new version of the code with support for reading the two numbers as Python integers.
## blast3.py
# Print information about a BLASTP file

blast_file = open("blastp.txt")
line = blast_file.readline()

# Sanity check that this is a BLASTP output text file
if not line.startswith("BLASTP"):
    raise TypeError("Not a BLASTP file: %r" % version)

# Remove the trailing newline
version = line.rstrip()
print "version", repr(version)

# Skip to the "Database: " line
while 1:
    line = blast_file.readline()
    if line.startswith("Database: "):
        break
    # Sanity check that we haven't gone too far
    if not line or line.startswith("Searching..."):
        raise TypeError("Could not find the 'Database: ' line")

# The next line contains the sequence and total letter counts, like this
# "      72,615 sequences; 26,335,942 total letters"
counts_line = blast_file.readline().strip()

# Convert the two number fields into Python integers
words = counts_line.split()
num_sequences = int(words[0].replace(",", ""))
num_letters = int(words[2].replace(",", ""))

print num_sequences, "sequences and", num_letters, "letters"
When I run it I get
version 'BLASTP 2.0.11 [Jan-20-2000]'
72615 sequences and 26335942 letters

One warning - this code is not fool proof. Recall how the title line of the query sequence is printed after the "Query= " line in the BLAST output? If the title line is too long BLAST will fold the title over several lines. It's possible (though unlikely) that the query title will have the text "Database: " folded just right so that the above parser code detects it instead of the correct line.

It's usually bad to allow user-defined data be able to break your code. User data must be escaped so it's always possible to distinguish it from control data. Here it's not likely to be a problem but hackers use flaws like this as one way to break into computers and other systems. (30 years ago "Cap'n Crunch" used one of these flaws to make free long-distance phone calls.)

For this lecture it's not something I'm going to handle. Just be aware that problems like this might occur and because programs are both fragile and trusting it's possible to have some nasty side effects.

For the regular expression fans in the class, another way to identify the counts line is with a regular expression.

>>> line = "         72,615 sequences; 26,335,942 total letters\n"
>>> import re
>>> count_pattern = re.compile(r"\s*([0-9,]+) sequences; ([0-9,]+) total letters")
>>> m = count_pattern.search(line)
>>> m.group(1)
'72,615'
>>> m.group(2)
'26,335,942'
>>> m.groups()
('72,615', '26,335,942')
>>> 

Next I want to read the description lines. That's the part which looks like this, with many lines elided

                                                                   Score     E
Sequences producing significant alignments:                        (bits)  Value

P07155;P27109;P27428 HIGH MOBILITY GROUP PROTEIN HMG1 (HMG-1) (...   326  2e-89
P09429 HIGH MOBILITY GROUP PROTEIN HMG1 (HMG-1).                     326  2e-89
P10103 HIGH MOBILITY GROUP PROTEIN HMG1 (HMG-1).                     326  2e-89
P12682 HIGH MOBILITY GROUP PROTEIN HMG1 (HMG-1).                     324  9e-89
P26583 HIGH MOBILITY GROUP PROTEIN HMG2 (HMG-2).                     284  1e-76
P17741 HIGH MOBILITY GROUP PROTEIN HMG2 (HMG-2).                     283  1e-76
P52925 HIGH MOBILITY GROUP PROTEIN HMG2 (HMG-2).                     282  3e-76
P30681 HIGH MOBILITY GROUP PROTEIN HMG2 (HMG-2).                     279  2e-75
P26584 HIGH MOBILITY GROUP PROTEIN HMG2 (HMG-2).                     279  3e-75
P07746 HIGH MOBILITY GROUP-T PROTEIN (HMG-T).                        266  2e-71
  [... skipping many lines ...]
P09208;Q24089;Q24023 INSULIN-LIKE RECEPTOR PRECURSOR (EC 2.7.1....    30  3.2
Q11102 HYPOTHETICAL 131.5 KD PROTEIN C02F12.7 IN CHROMOSOME X.        29  5.5
P19375 HISTONE H1, EARLY EMBRYONIC.                                   29  7.2

>P07155;P27109;P27428 HIGH MOBILITY GROUP PROTEIN HMG1 (HMG-1)
           (AMPHOTERIN) (HEPARIN-BINDING
           Length = 214

Again the technique is to find some special characteristic of the file format that can be used to recognize the start and end of the description. In this case the best is that the description starts two lines after the line starting with "Sequence producing significant alignments:" and ending with a blank line.

Here's a program which print the description lines and only the description lines, to show you how that part will work. As usual in my parsing code I add extra checks to make sure I'm parsing the part of the file I think I'm parsing. There's some balance and skill in figuring out good tests. Formats change. By not adding tests your code might break when a change occurs without you even knowing it. By adding too many tests (or too many wrong tests) your code might break when the format has trivial changes.

## blast4.py
# Print the description lines of a BLASTP file

blast_file = open("blastp.txt")
line = blast_file.readline()

# Sanity check that this is a BLASTP output text file
if not line.startswith("BLASTP"):
    raise TypeError("Not a BLASTP file: %r" % version)

# The start of the description lines looks like
# (though I deleted a few characters per line to make it fit nicely)
#
#                                                            Score     E
# Sequences producing significant alignments:                (bits)  Value
# 
# P07155;P27109;P27428 HIGH MOBILITY GROUP PROTEIN HMG1 (...   326  2e-89
# P09429 HIGH MOBILITY GROUP PROTEIN HMG1 (HMG-1).             326  2e-89
#
# The summaries start two lines after the line that starts with
# "Sequences producing significant alignments:"

# Skip lines to the "Sequences producing significant alignments:" line
while 1:
    line = blast_file.readline()
    if line.startswith("Sequences producing significant alignments:"):
        break
    if not line:
        # End of file - this should not happen
        raise TypeError("Could not find the description section")

# Read the blank line after the header for the description line
line = blast_file.readline()
if line.strip() != "":
    # Double check that it's a blank line
    raise TypeError("Expected a blank line after the description header")

# Start of the description lines
while 1:
    line = blast_file.readline()
    if not line:
        # End of file - this should not happen
        raise TypeError("Found end of file when reading summaries")
    line = line.rstrip()
    if line == "":
        # End of the summaries
        break

    # Print the description line
    print line

To be usable I'll want to extract three fields from each line of the description; the title (as text), the score (the number of bits, as an integer) and the expectation value (as a Python float). By looking at the description line I can tell that the title is in the first 67 columns of the line and the remaining two fields are the score and E-value. I need to extract the three fields. As before I'll use the interactive Python session to make sure my code works.

>>> line = "P09208;Q24089;Q24023 INSULIN-LIKE RECEPTOR PRECURSOR " +
...     "(EC 2.7.1....    30  3.2\n"
>>> title = line[:67]
>>> title
'P09208;Q24089;Q24023 INSULIN-LIKE RECEPTOR PRECURSOR (EC 2.7.1.... '
>>> line[67:]
'    30  3.2\n'
>>> line[67:].split()
['30', '3.2']
>>> fields = line[67:].split()
>>> int(fields[0])
30
>>> float(fields[1])
3.2000000000000002
>>> 
(Why does "3.2" get turned into "3.2000000000000002"? Read section B of the Python tutorial, Floating Point Arithmetic: Issues and Limitations.)

I'll modify blast4.py to extract the three fields and print the values. The output will look like

--> 'P07155;P27109;P27428 HIGH MOBILITY GROUP PROTEIN HMG1 (HMG-1) (...'
       326 bits; expectation value = 2e-89
--> 'P09429 HIGH MOBILITY GROUP PROTEIN HMG1 (HMG-1).'
       326 bits; expectation value = 2e-89
--> 'P10103 HIGH MOBILITY GROUP PROTEIN HMG1 (HMG-1).'
       326 bits; expectation value = 2e-89
with the title on one line and the bit count and expectation value on the next.
## blast5.py
# Print the description information in a BLASTP file

blast_file = open("blastp.txt")
line = blast_file.readline()

# Sanity check that this is a BLASTP output text file
if not line.startswith("BLASTP"):
    raise TypeError("Not a BLASTP file: %r" % version)

# The start of the description lines looks like
# (though I deleted a few characters per line to make it fit nicely)
#
#                                                            Score     E
# Sequences producing significant alignments:                (bits)  Value
# 
# P07155;P27109;P27428 HIGH MOBILITY GROUP PROTEIN HMG1 (...   326  2e-89
# P09429 HIGH MOBILITY GROUP PROTEIN HMG1 (HMG-1).             326  2e-89
#
# The summaries start two lines after the line that starts with
# "Sequences producing significant alignments:"

# Skip lines to the "Sequences producing significant alignments:" line
while 1:
    line = blast_file.readline()
    if line.startswith("Sequences producing significant alignments:"):
        break
    if not line:
        # End of file - this should not happen
        raise TypeError("Could not find the description section")

# Read the blank line after the header for the description line
line = blast_file.readline()
if line.strip() != "":
    # Double check that it's a blank line
    raise TypeError("Expected a blank line after the description header")

# Start of the description lines
while 1:
    line = blast_file.readline()
    if not line:
        # End of file - this should not happen
        raise TypeError("Found end of file when reading summaries")
    line = line.rstrip()
    if line == "":
        # End of the summaries
        break

    # I have a description line.  Break it up into parts
    title = line[:67].strip()
    fields = line[67:].split()
    bits = int(fields[0])
    expect = float(fields[1])
    print "-->", repr(title)
    print "      ", bits, "bits; expectation value =", expect

At this point we have the start of a BLAST parser. The next step would be to handle the alignments but that's the hardest part of the problem. Doing it correctly requires looking ahead in the file to figure out what's going to occur, and I haven't covered how to do that. It also requires handling many fiddly details of the format that no one in their right might should have to worry about.

Instead I'll convert the "blast5.py" script into a library. I want it to be a function which takes the blast input file (as a file object open for reading) and returns the version information, number of sequence in the database, number of letters in the database, and the three pieces of description information for each entry in the summary.

Here's where the object-oriented model is useful. I want the returned object to be an instance of a BlastResult class. The instance will have several properties: "version" contains the BLAST version information, "database_sequences" contains the number of sequence in the database and "database_letters" contains the number of letters in the database.

I also want the BlastResult to store the list of descriptions. Each decription is a BlastDescription instance, which stores the "title", "score" and "e" attributes.

Here are the corresponding Python class definitions. The structure is similar to the other class definitions in that these classes are used to store data fields that are related to each other.

class BlastResult(object):
    def __init__(self,
                 version,  # "BLASTP 2.0.11 [Jan-20-2000]"
                 database_sequences,  # number of sequences in the database
                 database_letters,    # number of letters in the database
                 descriptions):       # list of BlastDescriptions
        self.version = version
        self.database_sequences = database_sequences
        self.database_letters = database_letters
        self.descriptions = descriptions

class BlastDescription(object):
    def __init__(self,
                 title,  # Title of the hit
                 score,  # Number of bits (int)
                 e):     # E-value (float)
        self.title = title
        self.score = score
        self.e = e

I'll start with the blast3.py script and make a few changes. I'll add the two class definitions at the top of the file and make a new function named parse_blastp. This takes the BLAST input file object as its only parameter. I'll remove the print statements because I'm instead going to return the information through the BlastResult. For now I'll skip the description line and return only the version and count information, using None for the list of descriptions.

## blast6.py
# Parse information about a BLASTP file

class BlastResult(object):
    def __init__(self,
                 version,  # "BLASTP 2.0.11 [Jan-20-2000]"
                 database_sequences,  # number of sequences in the database
                 database_letters,    # number of letters in the database
                 descriptions):       # list of BlastDescriptions
        self.version = version
        self.database_sequences = database_sequences
        self.database_letters = database_letters
        self.descriptions = descriptions

class BlastDescription(object):
    def __init__(self,
                 title,  # Title of the hit
                 score,  # Number of bits (int)
                 e):     # E-value (float)
        self.title = title
        self.score = score
        self.e = e


def parse_blastp(blast_file):

    line = blast_file.readline()

    # Sanity check that this is a BLASTP output text file
    if not line.startswith("BLASTP"):
        raise TypeError("Not a BLASTP file: %r" % version)

    # Remove the trailing newline
    version = line.rstrip()

    # Skip to the "Database: " line
    while 1:
        line = blast_file.readline()
        if line.startswith("Database: "):
            break
        # Sanity check that we haven't gone too far
        if not line or line.startswith("Searching..."):
            raise TypeError("Could not find the 'Database: ' line")

    # The next line contains the sequence and total letter counts, like this
    # "      72,615 sequences; 26,335,942 total letters"
    counts_line = blast_file.readline().strip()

    # Convert the two number fields into Python integers
    words = counts_line.split()
    num_sequences = int(words[0].replace(",", ""))
    num_letters = int(words[2].replace(",", ""))

    return BlastResult(version, num_sequences, num_letters, None)
Compare this to the original blast3.py. You'll see it isn't all that different.

I need to test the code to make sure it works correctly. I'll use the same self-test code style I showed on Monday. This goes at the end of 'blast6.py'

# self-test code for blast6.py

def test():
    blast_file = open("blastp.txt")
    result = parse_blastp(blast_file)
    if result.version != "BLASTP 2.0.11 [Jan-20-2000]":
        raise AssertionError(result.version)
    if result.database_sequences != 72615:
        raise AssertionError(result.database_sequences)
    if result.database_letters != 26335942:
        raise AssertionError(result.database_letters)

if __name__ == "__main__":
    test()
    print "All tests passed."
When I run it the result isn't very exciting.
% python blast6.py
All tests passed.
%

Next I need to add in the code to read the description section. I'll use the code from blast5.py to identify the description lines and to extract the three fields from the line. The description section is after the database counts section so I'll put the description parsing code after the database counts parsing section.

I need to make a list of the descriptions so I'll start with an empty descriptions list. When I read a description line I'll extract the three fields and create a new BlastDescription object. I'll append that object to the list of descriptions. After I've read all of the descriptions I'll pass that list as the "descriptions" property of the BlastResult object (instead of passing None for that field).

## blast7.py
# Parse information about a BLASTP file

class BlastResult(object):
    def __init__(self,
                 version,  # "BLASTP 2.0.11 [Jan-20-2000]"
                 database_sequences,  # number of sequences in the database
                 database_letters,    # number of letters in the database
                 descriptions):       # list of BlastDescriptions
        self.version = version
        self.database_sequences = database_sequences
        self.database_letters = database_letters
        self.descriptions = descriptions

class BlastDescription(object):
    def __init__(self,
                 title,  # Title of the hit
                 score,  # Number of bits (int)
                 e):     # E-value (float)
        self.title = title
        self.score = score
        self.e = e


def parse_blastp(blast_file):

    line = blast_file.readline()

    # Sanity check that this is a BLASTP output text file
    if not line.startswith("BLASTP"):
        raise TypeError("Not a BLASTP file: %r" % version)

    # Remove the trailing newline
    version = line.rstrip()

    # Skip to the "Database: " line
    while 1:
        line = blast_file.readline()
        if line.startswith("Database: "):
            break
        # Sanity check that we haven't gone too far
        if not line or line.startswith("Searching..."):
            raise TypeError("Could not find the 'Database: ' line")

    # The next line contains the sequence and total letter counts, like this
    # "      72,615 sequences; 26,335,942 total letters"
    counts_line = blast_file.readline().strip()

    # Convert the two number fields into Python integers
    words = counts_line.split()
    num_sequences = int(words[0].replace(",", ""))
    num_letters = int(words[2].replace(",", ""))

    # The start of the description lines looks like
    # (though I deleted a few characters per line to make it fit nicely)
#
#                                                            Score     E
# Sequences producing significant alignments:                (bits)  Value
# 
# P07155;P27109;P27428 HIGH MOBILITY GROUP PROTEIN HMG1 (...   326  2e-89
# P09429 HIGH MOBILITY GROUP PROTEIN HMG1 (HMG-1).             326  2e-89
    #
    # The summaries start two lines after the line that starts with
    # "Sequences producing significant alignments:"

    # Skip lines to the "Sequences producing significant alignments:" line
    while 1:
        line = blast_file.readline()
        if line.startswith("Sequences producing significant alignments:"):
            break
        if not line:
            # End of file - this should not happen
            raise TypeError("Could not find the description section")

    # Read the blank line after the header for the description line
    line = blast_file.readline()
    if line.strip() != "":
        # Double check that it's a blank line
        raise TypeError("Expected a blank line after the description header")

    # Start of the description lines
    descriptions = []
    while 1:
        line = blast_file.readline()
        if not line:
            # End of file - this should not happen
            raise TypeError("Found end of file when reading summaries")
        line = line.rstrip()
        if line == "":
            # End of the summaries
            break

        # I have a description line.  Break it up into parts
        title = line[:67].strip()
        fields = line[67:].split()
        bits = int(fields[0])
        expect = float(fields[1])

        description = BlastDescription(title, bits, expect)
        descriptions.append(description)

    return BlastResult(version, num_sequences, num_letters, descriptions)

# self-test code

def test():
    blast_file = open("blastp.txt")
    result = parse_blastp(blast_file)
    if result.version != "BLASTP 2.0.11 [Jan-20-2000]":
        raise AssertionError(result.version)
    if result.database_sequences != 72615:
        raise AssertionError(result.database_sequences)
    if result.database_letters != 26335942:
        raise AssertionError(result.database_letters)
    if len(result.descriptions) != 125:
        raise AssertionError(len(result.descriptions))

    expected = (
        BlastDescription(
        "P07155;P27109;P27428 HIGH MOBILITY GROUP PROTEIN HMG1 (HMG-1) (...",
         326, 2e-89),
        BlastDescription(
        "P09429 HIGH MOBILITY GROUP PROTEIN HMG1 (HMG-1).",
        326, 2e-89),
        BlastDescription(
        "P10103 HIGH MOBILITY GROUP PROTEIN HMG1 (HMG-1).",
        326, 2e-89),
        BlastDescription(
        "P12682 HIGH MOBILITY GROUP PROTEIN HMG1 (HMG-1).",
        324, 9e-89),
        BlastDescription(
        "P26583 HIGH MOBILITY GROUP PROTEIN HMG2 (HMG-2).",
        284, 1e-76),
        )

    # Compare the expected value to the parsed value
    for i in range(len(expected)):
        expect = expected[i]
        description = result.descriptions[i]
        if expect.title != description.title:
            raise AssertionError((expect.title, description.title))
        if expect.score != description.score:
            raise AssertionError((expect.score, description.score))
        if expect.e != description.e:
            raise AssertionError((expect.e, description.e))

if __name__ == "__main__":
    test()
    print "All tests passed."
If you look at the self-test code you'll see I added some checks to make sure it parsed the descriptions. When making a test you must take care you write good tests. In this case I made sure that I had enough descriptions so the values in each description field would change because a common bug is to repeat the first record or field from the first record for all the other records.

Out of curiosity I tried this on the "ncbi_blastp_2.2.3" file in the BioJava distribution, again because it was available. I had to make a minor change because that BLAST omits the leading digit for the score when it's smaller than 1e-100 (so 1e-175 gets displayed as "e-175"). The correct solution is to use a special purpose string conversion function instead float() but for now I just edited the file manually. One I made the edits this BLAST parser worked on that BLASTP.

>>> import blast7 as blast_parser
>>> result = blast_parser.parse_blastp(open("ncbi_blastp_2.2.3.txt"))
>>> result.version 
'BLASTP 2.2.3 [Apr-24-2002]'
>>> len(result.descriptions )
392
>>> result.database_sequences 
884285
>>> result.database_letters 
278608352
>>> 

HOWEVER! There's almost no need for you to write your own BLAST parser. The Bio* projects have done that for you. Here's how to read the blastp.txt file using Biopython.

>>> from Bio.Blast import NCBIStandalone
>>> parser = NCBIStandalone.BlastParser()
>>> result = parser.parse(open("blastp.txt"))
>>> result
<Bio.Blast.Record.Blast instance at 0x1022f08>
>>> result.version
'2.0.11'
>>> result.application 
'BLASTP'
>>> len(result.descriptions)
125
>>> result.descriptions[0]
<Bio.Blast.Record.Description instance at 0x432eb8>
>>> result.descriptions[0].title
'P07155;P27109;P27428 HIGH MOBILITY GROUP PROTEIN HMG1 (HMG-1) (...'
>>> result.descriptions[0].score
326
>>> result.descriptions[0].e    
2.0000000000000001e-89
>>> 
When I developed my object model for this lesson I made sure to use names that matched with the Biopython nomenclature so it would be easier for you to understand this example. The only unusual step here for you (I hope) is the creation of the NCBIStandalone.BlastParser() object which does the actual parsing. It's a consequence of how Biopython's parsers are structued internally. I would have preferred that there also be a function interface to make it a bit simpler to understand.

The Biopython parser extracts all the information from a BLAST output file. To give you an idea of what it supports ...

>>> result = parser.parse(open("blastp.txt"))
>>> result.<<tab>>

result.__class__                    result.effective_query_length       result.num_extends
result.__doc__                      result.effective_search_space       result.num_good_extends
result.__init__                     result.effective_search_space_used  result.num_hits
result.__module__                   result.frameshift                   result.num_letters_in_database
result.alignments                   result.gap_penalties                result.num_seqs_better_e
result.application                  result.gap_trigger                  result.num_sequences
result.blast_cutoff                 result.gap_x_dropoff                result.num_sequences_in_database
result.database                     result.gap_x_dropoff_final          result.posted_date
result.database_length              result.gapped                       result.query
result.database_letters             result.hsps_gapped                  result.query_length
result.database_name                result.hsps_no_gap                  result.query_letters
result.database_sequences           result.hsps_prelim_gapped           result.reference
result.date                         result.hsps_prelim_gapped_attemped  result.sc_match
result.descriptions                 result.ka_params                    result.sc_mismatch
result.dropoff_1st_pass             result.ka_params_gap                result.threshold
result.effective_database_length    result.matrix                       result.version
result.effective_hsp_length         result.multiple_alignment           result.window_size

>>> result.matrix
'BLOSUM62'
>>> len(result.alignments)
125
>>> result.alignments[0]
<Bio.Blast.Record.Alignment instance at 0x10cbfa8>
>>> result.alignments[0].hsps
[<Bio.Blast.Record.HSP instance at 0x10c32d8>]
>>> result.alignments[0].hsps[0].query
'GKGDPKKPRGKMSSYAFFVQTCREEHKKKHPDASVNFSEFSKKCSERWKTMSAKEKGKFEDMAKADKARYEREMKTYI
PPKGETKKKFKDPNAPKRPPSAFFLFCSEYRPKIKGEHPGLSIGDVAKKLGEMWNNTAADDKQPXXXXXXXXXXXXXXD
IAAYRAKGKPD'
>>> result.alignments[0].hsps[0].query_start
1
>>> result.alignments[0].hsps[0].sbjct    
'GKGDPKKPRGKMSSYAFFVQTCREEHKKKHPDASVNFSEFSKKCSERWKTMSAKEKGKFEDMAKADKARYEREMKTYI
PPKGETKKKFKDPNAPKRPPSAFFLFCSEYRPKIKGEHPGLSIGDVAKKLGEMWNNTAADDKQPYEKKAAKLKEKYEKD
IAAYRAKGKPD'
>>> result.alignments[0].hsps[0].sbjct_start
1
>>> result.alignments[100].hsps[0].sbjct_start
63
>>> result.alignments[100].hsps[0].sbjct
'KRPMNAFIVWSRDQRRKVALENPQMQNSEISKWLGCKWKMLTEAEKRP'
>>> result.alignments[100].hsps[0].sbjct_start
63
>>> result.alignments[100].hsps[0].query 
'KRPPSAFFLFCSEYRPKIKGEHPGLSIGDVAKKLGEMWNNTAADDKQP'
>>> result.alignments[100].hsps[0].query_start
95
>>> 
Biopython's data model for a Blast.Record is this:
A BlastRecord has a large number of simple top-level attributes. Here is a list of the attribute names and sample values, as extracted from the blastp.txt file.
alignments = [list of Blast.Record.Alignment objects; see below]
application = 'BLASTP'
blast_cutoff = (62, 28.600000000000001)
database = 'sprot.fas'
database_length = 26335942
database_letters = 26335942
database_sequences = 72615
date = 'Jan-20-2000'
descriptions = [list of Blast.Record.Description objects; see below]
dropoff_1st_pass = (16, 7.2000000000000002)
effective_database_length = 22705192
effective_hsp_length = 50
effective_query_length = 164
effective_search_space = 3723651488
effective_search_space_used = 3723651488
frameshift = (None, None)
gap_penalties = [11.0, 1.0]
gap_trigger = (42, 21.899999999999999)
gap_x_dropoff = (38, 14.800000000000001)
gap_x_dropoff_final = (64, 24.899999999999999)
gapped = 1
hsps_gapped = 291
hsps_no_gap = 136
hsps_prelim_gapped = 14
hsps_prelim_gapped_attemped = None
ka_params = [0.313, 0.13100000000000001, 0.39700000000000002]
ka_params_gap = [0.27000000000000002, 0.047, 0.23000000000000001]
matrix = 'BLOSUM62'
multiple_alignment = None
num_extends = 370731
num_good_extends = 1257
num_hits = 9176623
num_letters_in_database = [26335942]
num_seqs_better_e = 150
num_sequences = 72615
posted_date = [('May 18, 1998 10:04 AM',)]
query = 'sp|P09429|HMG1_HUMAN HIGH MOBILITY GROUP PROTEIN HMG1 (HMG-1) -\nHomo
 sapiens (Human).'
query_length = 214
query_letters = 214
reference = 'Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer, \n
 Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997), \n"Gapped
 BLAST and PSI-BLAST: a new generation of protein database search\nprograms",  
 Nucleic Acids Res. 25:3389-3402.'
sc_match = None
sc_mismatch = None
threshold = 11
version = '2.0.11'
window_size = 40
A Blast.Record has a list of descriptions. Each Blast.Record.Description has the following four properties (with sample data)
description.e = 2.0000000000000001e-89
description.num_alignments = 1
description.score = 326
description.title = 'P07155;P27109;P27428 HIGH MOBILITY GROUP PROTEIN HMG1 (HMG-1) (...'
A Blast.Record has a list of alignments. Each Blast.Record.Alignment has a title, a length and a list of HSPs (high scoring ... somethings; two sequences have have multiple regions with high degress of similarity)
alignment.hsps = [list of Blast.Record.HSP objects; see below]
alignment.length = 144
alignment.title = '>P40619 HMG1/2-LIKE PROTEIN.'
And finally, each HSP has the following attributes
hsp.bits = 77.0
hsp.expect = 0.16
hsp.frame = ()
hsp.gaps = (5, 76)
hsp.identities = (22, 76)
hsp.match = '+G  K+P       AF V +  E HK    + S+  +E SK+   RWK+++  EK  F   A+  K  +  +   Y'
hsp.num_alignments = 1
hsp.positives = (35, 76)
hsp.query = 'KGDPKKPRGKMSSYAFFVQTCREEHKKKHPDASVNFSEFSKKCSERWKTMSAKEKGKFEDMAKADKARYEREMKTY'
hsp.query_start = 2
hsp.sbjct = 'EGHVKRPMN-----AFMVWSRGERHKLAQQNPSMQNTEISKQLGCRWKSLTEAEKRPFFQEAQRLKILHREKYPNY'
hsp.sbjct_start = 2
hsp.score = 34.399999999999999
hsp.strand = (None, None)



Copyright © 2001-2020 Andrew Dalke Scientific AB