BLAST is the most widely used bioinformatics tool, meaning that it uses biologically based methods to search large sequence databases. The original BLAST authors expected the output to be read by people and not by programs. Turns out that people use BLAST as one part of a larger algorithm and want to automate the BLAST step. That meant they needed to write parsers for the various BLAST output flavors (BLASTN, BLASTP, TBLASTX, WU-BLAST, and so on). Eventually these people joined together to put the BLAST parsers into a library for use by others and you'll find that Bioperl, Biopython, BioJava, etc. all have BLAST output parsers.
BLAST now has output support for two machine readable formats. The older one is ASN.1. This is an international standard used in many different fields, including telecommunications and cryptography. It didn't become popular. The format standard that ended up being the most popular is XML, and in 2001 NCBI BLAST added a flag for BLAST to generate XML output. [Hint for assignment 1: use "ASN.1" as another search term when looking for web pages about the BLAST output.]
Even with the XML output people are more comfortable reading the normal BLAST. Some will run a BLAST search several times and scan over the result by eye until the parameters are correct, then want to extract data from the file. I'll show how to extract various fields from the file because that knowledge transfers well to parsing other formats.
To start off with, take a look at this BLASTP output [blastp.txt]. It comes from one of the Bioperl test cases and comes from BLAST version 2.0.11 in 2000. Rather old but the format hasn't changed much in the last 8 years and it's something I had handy.
For reference, here is the first few lines of the BLASTP output
BLASTP 2.0.11 [Jan-20-2000] Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997), "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs", Nucleic Acids Res. 25:3389-3402. Query= sp|P09429|HMG1_HUMAN HIGH MOBILITY GROUP PROTEIN HMG1 (HMG-1) - Homo sapiens (Human). (214 letters) Database: sprot.fas 72,615 sequences; 26,335,942 total letters
There's a lot of data in that BLASTP file. The first thing I can get is the program version information on the first line. I'll first get the information interactively
>>> blastfile = open("blastp.txt") >>> blastfile.readline() 'BLASTP 2.0.11 [Jan-20-2000]\n' >>>I mentioned on Monday that it's best to put a sanity check in your code to figure out if you're reading the expected file format. Every BLASTP output starts with the text "BLASTP" on the first line. Modifying the above snippet of code a bit, here's a program which reports the version information of a BLAST file.
## blast1.py # Print the version information about a BLASTP file blast_file = open("blastp.txt") line = blast_file.readline() # Sanity check that this is a BLASTP output text file if not line.startswith("BLASTP"): raise TypeError("Not a BLASTP file: %r" % version) # Remove the trailing newline version = line.rstrip() print "version", repr(version)When I run it the output is
% python blast1.py version 'BLASTP 2.0.11 [Jan-20-2000]' %
The next field I want to get is the number of sequences in the database and the total number of letters. That's in the line which says "72,615 sequences; 26,335,942 total letters". To get that data I need to figure out which line it's on. The easiest way for this file is to recognize that it's on the 14th line of the file, but that's not a good solution. The title field from the query input is written after the "Query= ". If it's too long, BLAST will fold the title over multiple lines, so you don't know if the sequence and letter count information is on line 13, 14, 15, or later.
What you do know is that it's on the line after the line that starts with "Database: ". You can use it as a recognition site, as it were. Read lines up to the one which starts with "Database: " then read the next line to get the sequence counts. Here again I'll do some sanity checks to make sure this is the counts line. I'll check for the end of file (that's the check for "") in case the file was truncated and I'll check for the "Searching..." line, as a way to stop quickly if I missed the "Database: " line.
## blast2.py # Print information about a BLASTP file blast_file = open("blastp.txt") line = blast_file.readline() # Sanity check that this is a BLASTP output text file if not line.startswith("BLASTP"): raise TypeError("Not a BLASTP file: %r" % version) # Remove the trailing newline version = line.rstrip() print "version", repr(version) # Skip to the "Database: " line while 1: line = blast_file.readline() if line.startswith("Database: "): break # Sanity check that we haven't gone too far if not line or line.startswith("Searching..."): raise TypeError("Could not find the 'Database: ' line") # The next line contains the sequence and total letter counts, like this # " 72,615 sequences; 26,335,942 total letters" counts_line = blast_file.readline().strip() print "Counts line", repr(counts_line)The output is
% python blast2.py version 'BLASTP 2.0.11 [Jan-20-2000]' Counts line '72,615 sequences; 26,335,942 total letters' %
I need to get the "72,615" and "26,335,942" fields. The most common way is to use the string split() method to extract a list of words. The first word (element [0]) is the sequence count and the third word (element [2]) is the letter count.
>>> line = "72,615 sequences; 26,335,942 total letters" >>> line.split() ['72,615', 'sequences;', '26,335,942', 'total', 'letters'] >>> words = line.split() >>> words[0] '72,615' >>> words[2] '26,335,942' >>>These numbers each contain comma to make the numbers easier to read. Computers don't have a problem reading numbers so I need to get rid of the commas before converting them into integers. I'll use the string replace() method to make new strings but with the "," replaced by the empty string "". I can then use int() to convert the string into an integer.
>>> words[0] '72,615' >>> words[0].replace(",", "") '72615' >>> int(words[0].replace(",", "")) 72615 >>> words[2] '26,335,942' >>> words[2].replace(",", "") '26335942' >>> int(words[2].replace(",", "")) 26335942 >>>Here's the new version of the code with support for reading the two numbers as Python integers.
## blast3.py # Print information about a BLASTP file blast_file = open("blastp.txt") line = blast_file.readline() # Sanity check that this is a BLASTP output text file if not line.startswith("BLASTP"): raise TypeError("Not a BLASTP file: %r" % version) # Remove the trailing newline version = line.rstrip() print "version", repr(version) # Skip to the "Database: " line while 1: line = blast_file.readline() if line.startswith("Database: "): break # Sanity check that we haven't gone too far if not line or line.startswith("Searching..."): raise TypeError("Could not find the 'Database: ' line") # The next line contains the sequence and total letter counts, like this # " 72,615 sequences; 26,335,942 total letters" counts_line = blast_file.readline().strip() # Convert the two number fields into Python integers words = counts_line.split() num_sequences = int(words[0].replace(",", "")) num_letters = int(words[2].replace(",", "")) print num_sequences, "sequences and", num_letters, "letters"When I run it I get
version 'BLASTP 2.0.11 [Jan-20-2000]' 72615 sequences and 26335942 letters
One warning - this code is not fool proof. Recall how the title line of the query sequence is printed after the "Query= " line in the BLAST output? If the title line is too long BLAST will fold the title over several lines. It's possible (though unlikely) that the query title will have the text "Database: " folded just right so that the above parser code detects it instead of the correct line.
It's usually bad to allow user-defined data be able to break your code. User data must be escaped so it's always possible to distinguish it from control data. Here it's not likely to be a problem but hackers use flaws like this as one way to break into computers and other systems. (30 years ago "Cap'n Crunch" used one of these flaws to make free long-distance phone calls.)
For this lecture it's not something I'm going to handle. Just be aware that problems like this might occur and because programs are both fragile and trusting it's possible to have some nasty side effects.
For the regular expression fans in the class, another way to identify the counts line is with a regular expression.
>>> line = " 72,615 sequences; 26,335,942 total letters\n" >>> import re >>> count_pattern = re.compile(r"\s*([0-9,]+) sequences; ([0-9,]+) total letters") >>> m = count_pattern.search(line) >>> m.group(1) '72,615' >>> m.group(2) '26,335,942' >>> m.groups() ('72,615', '26,335,942') >>>
Next I want to read the description lines. That's the part which looks like this, with many lines elided
Score E Sequences producing significant alignments: (bits) Value P07155;P27109;P27428 HIGH MOBILITY GROUP PROTEIN HMG1 (HMG-1) (... 326 2e-89 P09429 HIGH MOBILITY GROUP PROTEIN HMG1 (HMG-1). 326 2e-89 P10103 HIGH MOBILITY GROUP PROTEIN HMG1 (HMG-1). 326 2e-89 P12682 HIGH MOBILITY GROUP PROTEIN HMG1 (HMG-1). 324 9e-89 P26583 HIGH MOBILITY GROUP PROTEIN HMG2 (HMG-2). 284 1e-76 P17741 HIGH MOBILITY GROUP PROTEIN HMG2 (HMG-2). 283 1e-76 P52925 HIGH MOBILITY GROUP PROTEIN HMG2 (HMG-2). 282 3e-76 P30681 HIGH MOBILITY GROUP PROTEIN HMG2 (HMG-2). 279 2e-75 P26584 HIGH MOBILITY GROUP PROTEIN HMG2 (HMG-2). 279 3e-75 P07746 HIGH MOBILITY GROUP-T PROTEIN (HMG-T). 266 2e-71 [... skipping many lines ...] P09208;Q24089;Q24023 INSULIN-LIKE RECEPTOR PRECURSOR (EC 2.7.1.... 30 3.2 Q11102 HYPOTHETICAL 131.5 KD PROTEIN C02F12.7 IN CHROMOSOME X. 29 5.5 P19375 HISTONE H1, EARLY EMBRYONIC. 29 7.2 >P07155;P27109;P27428 HIGH MOBILITY GROUP PROTEIN HMG1 (HMG-1) (AMPHOTERIN) (HEPARIN-BINDING Length = 214Again the technique is to find some special characteristic of the file format that can be used to recognize the start and end of the description. In this case the best is that the description starts two lines after the line starting with "Sequence producing significant alignments:" and ending with a blank line.
Here's a program which print the description lines and only the
description lines, to show you how that part will work. As usual in
my parsing code I add extra checks to make sure I'm parsing the part
of the file I think I'm parsing. There's some balance and skill in
figuring out good tests. Formats change. By not adding tests your
code might break when a change occurs without you even knowing it. By
adding too many tests (or too many wrong tests) your code might break
when the format has trivial changes.
## blast4.py
# Print the description lines of a BLASTP file
blast_file = open("blastp.txt")
line = blast_file.readline()
# Sanity check that this is a BLASTP output text file
if not line.startswith("BLASTP"):
raise TypeError("Not a BLASTP file: %r" % version)
# The start of the description lines looks like
# (though I deleted a few characters per line to make it fit nicely)
#
# Score E
# Sequences producing significant alignments: (bits) Value
#
# P07155;P27109;P27428 HIGH MOBILITY GROUP PROTEIN HMG1 (... 326 2e-89
# P09429 HIGH MOBILITY GROUP PROTEIN HMG1 (HMG-1). 326 2e-89
#
# The summaries start two lines after the line that starts with
# "Sequences producing significant alignments:"
# Skip lines to the "Sequences producing significant alignments:" line
while 1:
line = blast_file.readline()
if line.startswith("Sequences producing significant alignments:"):
break
if not line:
# End of file - this should not happen
raise TypeError("Could not find the description section")
# Read the blank line after the header for the description line
line = blast_file.readline()
if line.strip() != "":
# Double check that it's a blank line
raise TypeError("Expected a blank line after the description header")
# Start of the description lines
while 1:
line = blast_file.readline()
if not line:
# End of file - this should not happen
raise TypeError("Found end of file when reading summaries")
line = line.rstrip()
if line == "":
# End of the summaries
break
# Print the description line
print line
To be usable I'll want to extract three fields from each line of the description; the title (as text), the score (the number of bits, as an integer) and the expectation value (as a Python float). By looking at the description line I can tell that the title is in the first 67 columns of the line and the remaining two fields are the score and E-value. I need to extract the three fields. As before I'll use the interactive Python session to make sure my code works.
>>> line = "P09208;Q24089;Q24023 INSULIN-LIKE RECEPTOR PRECURSOR " + ... "(EC 2.7.1.... 30 3.2\n" >>> title = line[:67] >>> title 'P09208;Q24089;Q24023 INSULIN-LIKE RECEPTOR PRECURSOR (EC 2.7.1.... ' >>> line[67:] ' 30 3.2\n' >>> line[67:].split() ['30', '3.2'] >>> fields = line[67:].split() >>> int(fields[0]) 30 >>> float(fields[1]) 3.2000000000000002 >>>(Why does "3.2" get turned into "3.2000000000000002"? Read section B of the Python tutorial, Floating Point Arithmetic: Issues and Limitations.)
I'll modify blast4.py to extract the three fields and print the values. The output will look like
--> 'P07155;P27109;P27428 HIGH MOBILITY GROUP PROTEIN HMG1 (HMG-1) (...' 326 bits; expectation value = 2e-89 --> 'P09429 HIGH MOBILITY GROUP PROTEIN HMG1 (HMG-1).' 326 bits; expectation value = 2e-89 --> 'P10103 HIGH MOBILITY GROUP PROTEIN HMG1 (HMG-1).' 326 bits; expectation value = 2e-89with the title on one line and the bit count and expectation value on the next.
## blast5.py # Print the description information in a BLASTP file blast_file = open("blastp.txt") line = blast_file.readline() # Sanity check that this is a BLASTP output text file if not line.startswith("BLASTP"): raise TypeError("Not a BLASTP file: %r" % version) # The start of the description lines looks like # (though I deleted a few characters per line to make it fit nicely) # # Score E # Sequences producing significant alignments: (bits) Value # # P07155;P27109;P27428 HIGH MOBILITY GROUP PROTEIN HMG1 (... 326 2e-89 # P09429 HIGH MOBILITY GROUP PROTEIN HMG1 (HMG-1). 326 2e-89 # # The summaries start two lines after the line that starts with # "Sequences producing significant alignments:" # Skip lines to the "Sequences producing significant alignments:" line while 1: line = blast_file.readline() if line.startswith("Sequences producing significant alignments:"): break if not line: # End of file - this should not happen raise TypeError("Could not find the description section") # Read the blank line after the header for the description line line = blast_file.readline() if line.strip() != "": # Double check that it's a blank line raise TypeError("Expected a blank line after the description header") # Start of the description lines while 1: line = blast_file.readline() if not line: # End of file - this should not happen raise TypeError("Found end of file when reading summaries") line = line.rstrip() if line == "": # End of the summaries break # I have a description line. Break it up into parts title = line[:67].strip() fields = line[67:].split() bits = int(fields[0]) expect = float(fields[1]) print "-->", repr(title) print " ", bits, "bits; expectation value =", expect
At this point we have the start of a BLAST parser. The next step would be to handle the alignments but that's the hardest part of the problem. Doing it correctly requires looking ahead in the file to figure out what's going to occur, and I haven't covered how to do that. It also requires handling many fiddly details of the format that no one in their right might should have to worry about.
Instead I'll convert the "blast5.py" script into a library. I want it to be a function which takes the blast input file (as a file object open for reading) and returns the version information, number of sequence in the database, number of letters in the database, and the three pieces of description information for each entry in the summary.
Here's where the object-oriented model is useful. I want the returned object to be an instance of a BlastResult class. The instance will have several properties: "version" contains the BLAST version information, "database_sequences" contains the number of sequence in the database and "database_letters" contains the number of letters in the database.
I also want the BlastResult to store the list of descriptions. Each decription is a BlastDescription instance, which stores the "title", "score" and "e" attributes.
Here are the corresponding Python class definitions. The structure is similar to the other class definitions in that these classes are used to store data fields that are related to each other.
class BlastResult(object): def __init__(self, version, # "BLASTP 2.0.11 [Jan-20-2000]" database_sequences, # number of sequences in the database database_letters, # number of letters in the database descriptions): # list of BlastDescriptions self.version = version self.database_sequences = database_sequences self.database_letters = database_letters self.descriptions = descriptions class BlastDescription(object): def __init__(self, title, # Title of the hit score, # Number of bits (int) e): # E-value (float) self.title = title self.score = score self.e = e
I'll start with the blast3.py script and make a few changes. I'll add the two class definitions at the top of the file and make a new function named parse_blastp. This takes the BLAST input file object as its only parameter. I'll remove the print statements because I'm instead going to return the information through the BlastResult. For now I'll skip the description line and return only the version and count information, using None for the list of descriptions.
## blast6.py # Parse information about a BLASTP file class BlastResult(object): def __init__(self, version, # "BLASTP 2.0.11 [Jan-20-2000]" database_sequences, # number of sequences in the database database_letters, # number of letters in the database descriptions): # list of BlastDescriptions self.version = version self.database_sequences = database_sequences self.database_letters = database_letters self.descriptions = descriptions class BlastDescription(object): def __init__(self, title, # Title of the hit score, # Number of bits (int) e): # E-value (float) self.title = title self.score = score self.e = e def parse_blastp(blast_file): line = blast_file.readline() # Sanity check that this is a BLASTP output text file if not line.startswith("BLASTP"): raise TypeError("Not a BLASTP file: %r" % version) # Remove the trailing newline version = line.rstrip() # Skip to the "Database: " line while 1: line = blast_file.readline() if line.startswith("Database: "): break # Sanity check that we haven't gone too far if not line or line.startswith("Searching..."): raise TypeError("Could not find the 'Database: ' line") # The next line contains the sequence and total letter counts, like this # " 72,615 sequences; 26,335,942 total letters" counts_line = blast_file.readline().strip() # Convert the two number fields into Python integers words = counts_line.split() num_sequences = int(words[0].replace(",", "")) num_letters = int(words[2].replace(",", "")) return BlastResult(version, num_sequences, num_letters, None)Compare this to the original blast3.py. You'll see it isn't all that different.
I need to test the code to make sure it works correctly. I'll use the same self-test code style I showed on Monday. This goes at the end of 'blast6.py'
# self-test code for blast6.py def test(): blast_file = open("blastp.txt") result = parse_blastp(blast_file) if result.version != "BLASTP 2.0.11 [Jan-20-2000]": raise AssertionError(result.version) if result.database_sequences != 72615: raise AssertionError(result.database_sequences) if result.database_letters != 26335942: raise AssertionError(result.database_letters) if __name__ == "__main__": test() print "All tests passed."When I run it the result isn't very exciting.
% python blast6.py All tests passed. %
Next I need to add in the code to read the description section. I'll use the code from blast5.py to identify the description lines and to extract the three fields from the line. The description section is after the database counts section so I'll put the description parsing code after the database counts parsing section.
I need to make a list of the descriptions so I'll start with an empty
descriptions list. When I read a description line I'll
extract the three fields and create a new BlastDescription object.
I'll append that object to the list of descriptions. After I've read
all of the descriptions I'll pass that list as the "descriptions"
property of the BlastResult object (instead of passing None for that
field).
## blast7.py
# Parse information about a BLASTP file
class BlastResult(object):
def __init__(self,
version, # "BLASTP 2.0.11 [Jan-20-2000]"
database_sequences, # number of sequences in the database
database_letters, # number of letters in the database
descriptions): # list of BlastDescriptions
self.version = version
self.database_sequences = database_sequences
self.database_letters = database_letters
self.descriptions = descriptions
class BlastDescription(object):
def __init__(self,
title, # Title of the hit
score, # Number of bits (int)
e): # E-value (float)
self.title = title
self.score = score
self.e = e
def parse_blastp(blast_file):
line = blast_file.readline()
# Sanity check that this is a BLASTP output text file
if not line.startswith("BLASTP"):
raise TypeError("Not a BLASTP file: %r" % version)
# Remove the trailing newline
version = line.rstrip()
# Skip to the "Database: " line
while 1:
line = blast_file.readline()
if line.startswith("Database: "):
break
# Sanity check that we haven't gone too far
if not line or line.startswith("Searching..."):
raise TypeError("Could not find the 'Database: ' line")
# The next line contains the sequence and total letter counts, like this
# " 72,615 sequences; 26,335,942 total letters"
counts_line = blast_file.readline().strip()
# Convert the two number fields into Python integers
words = counts_line.split()
num_sequences = int(words[0].replace(",", ""))
num_letters = int(words[2].replace(",", ""))
# The start of the description lines looks like
# (though I deleted a few characters per line to make it fit nicely)
#
# Score E
# Sequences producing significant alignments: (bits) Value
#
# P07155;P27109;P27428 HIGH MOBILITY GROUP PROTEIN HMG1 (... 326 2e-89
# P09429 HIGH MOBILITY GROUP PROTEIN HMG1 (HMG-1). 326 2e-89
#
# The summaries start two lines after the line that starts with
# "Sequences producing significant alignments:"
# Skip lines to the "Sequences producing significant alignments:" line
while 1:
line = blast_file.readline()
if line.startswith("Sequences producing significant alignments:"):
break
if not line:
# End of file - this should not happen
raise TypeError("Could not find the description section")
# Read the blank line after the header for the description line
line = blast_file.readline()
if line.strip() != "":
# Double check that it's a blank line
raise TypeError("Expected a blank line after the description header")
# Start of the description lines
descriptions = []
while 1:
line = blast_file.readline()
if not line:
# End of file - this should not happen
raise TypeError("Found end of file when reading summaries")
line = line.rstrip()
if line == "":
# End of the summaries
break
# I have a description line. Break it up into parts
title = line[:67].strip()
fields = line[67:].split()
bits = int(fields[0])
expect = float(fields[1])
description = BlastDescription(title, bits, expect)
descriptions.append(description)
return BlastResult(version, num_sequences, num_letters, descriptions)
# self-test code
def test():
blast_file = open("blastp.txt")
result = parse_blastp(blast_file)
if result.version != "BLASTP 2.0.11 [Jan-20-2000]":
raise AssertionError(result.version)
if result.database_sequences != 72615:
raise AssertionError(result.database_sequences)
if result.database_letters != 26335942:
raise AssertionError(result.database_letters)
if len(result.descriptions) != 125:
raise AssertionError(len(result.descriptions))
expected = (
BlastDescription(
"P07155;P27109;P27428 HIGH MOBILITY GROUP PROTEIN HMG1 (HMG-1) (...",
326, 2e-89),
BlastDescription(
"P09429 HIGH MOBILITY GROUP PROTEIN HMG1 (HMG-1).",
326, 2e-89),
BlastDescription(
"P10103 HIGH MOBILITY GROUP PROTEIN HMG1 (HMG-1).",
326, 2e-89),
BlastDescription(
"P12682 HIGH MOBILITY GROUP PROTEIN HMG1 (HMG-1).",
324, 9e-89),
BlastDescription(
"P26583 HIGH MOBILITY GROUP PROTEIN HMG2 (HMG-2).",
284, 1e-76),
)
# Compare the expected value to the parsed value
for i in range(len(expected)):
expect = expected[i]
description = result.descriptions[i]
if expect.title != description.title:
raise AssertionError((expect.title, description.title))
if expect.score != description.score:
raise AssertionError((expect.score, description.score))
if expect.e != description.e:
raise AssertionError((expect.e, description.e))
if __name__ == "__main__":
test()
print "All tests passed."
If you look at the self-test code you'll see I added some checks to
make sure it parsed the descriptions. When making a test you must
take care you write good tests. In this case I made sure that I had
enough descriptions so the values in each description field would
change because a common bug is to repeat the first record or field
from the first record for all the other records.
Out of curiosity I tried this on the "ncbi_blastp_2.2.3" file in the BioJava distribution, again because it was available. I had to make a minor change because that BLAST omits the leading digit for the score when it's smaller than 1e-100 (so 1e-175 gets displayed as "e-175"). The correct solution is to use a special purpose string conversion function instead float() but for now I just edited the file manually. One I made the edits this BLAST parser worked on that BLASTP.
>>> import blast7 as blast_parser >>> result = blast_parser.parse_blastp(open("ncbi_blastp_2.2.3.txt")) >>> result.version 'BLASTP 2.2.3 [Apr-24-2002]' >>> len(result.descriptions ) 392 >>> result.database_sequences 884285 >>> result.database_letters 278608352 >>>
HOWEVER! There's almost no need for you to write your own BLAST parser. The Bio* projects have done that for you. Here's how to read the blastp.txt file using Biopython.
>>> from Bio.Blast import NCBIStandalone >>> parser = NCBIStandalone.BlastParser() >>> result = parser.parse(open("blastp.txt")) >>> result <Bio.Blast.Record.Blast instance at 0x1022f08> >>> result.version '2.0.11' >>> result.application 'BLASTP' >>> len(result.descriptions) 125 >>> result.descriptions[0] <Bio.Blast.Record.Description instance at 0x432eb8> >>> result.descriptions[0].title 'P07155;P27109;P27428 HIGH MOBILITY GROUP PROTEIN HMG1 (HMG-1) (...' >>> result.descriptions[0].score 326 >>> result.descriptions[0].e 2.0000000000000001e-89 >>>When I developed my object model for this lesson I made sure to use names that matched with the Biopython nomenclature so it would be easier for you to understand this example. The only unusual step here for you (I hope) is the creation of the NCBIStandalone.BlastParser() object which does the actual parsing. It's a consequence of how Biopython's parsers are structued internally. I would have preferred that there also be a function interface to make it a bit simpler to understand.
The Biopython parser extracts all the information from a BLAST output file. To give you an idea of what it supports ...
>>> result = parser.parse(open("blastp.txt")) >>> result.<<tab>> result.__class__ result.effective_query_length result.num_extends result.__doc__ result.effective_search_space result.num_good_extends result.__init__ result.effective_search_space_used result.num_hits result.__module__ result.frameshift result.num_letters_in_database result.alignments result.gap_penalties result.num_seqs_better_e result.application result.gap_trigger result.num_sequences result.blast_cutoff result.gap_x_dropoff result.num_sequences_in_database result.database result.gap_x_dropoff_final result.posted_date result.database_length result.gapped result.query result.database_letters result.hsps_gapped result.query_length result.database_name result.hsps_no_gap result.query_letters result.database_sequences result.hsps_prelim_gapped result.reference result.date result.hsps_prelim_gapped_attemped result.sc_match result.descriptions result.ka_params result.sc_mismatch result.dropoff_1st_pass result.ka_params_gap result.threshold result.effective_database_length result.matrix result.version result.effective_hsp_length result.multiple_alignment result.window_size >>> result.matrix 'BLOSUM62' >>> len(result.alignments) 125 >>> result.alignments[0] <Bio.Blast.Record.Alignment instance at 0x10cbfa8> >>> result.alignments[0].hsps [<Bio.Blast.Record.HSP instance at 0x10c32d8>] >>> result.alignments[0].hsps[0].query 'GKGDPKKPRGKMSSYAFFVQTCREEHKKKHPDASVNFSEFSKKCSERWKTMSAKEKGKFEDMAKADKARYEREMKTYI PPKGETKKKFKDPNAPKRPPSAFFLFCSEYRPKIKGEHPGLSIGDVAKKLGEMWNNTAADDKQPXXXXXXXXXXXXXXD IAAYRAKGKPD' >>> result.alignments[0].hsps[0].query_start 1 >>> result.alignments[0].hsps[0].sbjct 'GKGDPKKPRGKMSSYAFFVQTCREEHKKKHPDASVNFSEFSKKCSERWKTMSAKEKGKFEDMAKADKARYEREMKTYI PPKGETKKKFKDPNAPKRPPSAFFLFCSEYRPKIKGEHPGLSIGDVAKKLGEMWNNTAADDKQPYEKKAAKLKEKYEKD IAAYRAKGKPD' >>> result.alignments[0].hsps[0].sbjct_start 1 >>> result.alignments[100].hsps[0].sbjct_start 63 >>> result.alignments[100].hsps[0].sbjct 'KRPMNAFIVWSRDQRRKVALENPQMQNSEISKWLGCKWKMLTEAEKRP' >>> result.alignments[100].hsps[0].sbjct_start 63 >>> result.alignments[100].hsps[0].query 'KRPPSAFFLFCSEYRPKIKGEHPGLSIGDVAKKLGEMWNNTAADDKQP' >>> result.alignments[100].hsps[0].query_start 95 >>>Biopython's data model for a Blast.Record is this:
A BlastRecord has a large number of simple top-level attributes. Here is a list of the attribute names and sample values, as extracted from the blastp.txt file.
alignments = [list of Blast.Record.Alignment objects; see below] application = 'BLASTP' blast_cutoff = (62, 28.600000000000001) database = 'sprot.fas' database_length = 26335942 database_letters = 26335942 database_sequences = 72615 date = 'Jan-20-2000' descriptions = [list of Blast.Record.Description objects; see below] dropoff_1st_pass = (16, 7.2000000000000002) effective_database_length = 22705192 effective_hsp_length = 50 effective_query_length = 164 effective_search_space = 3723651488 effective_search_space_used = 3723651488 frameshift = (None, None) gap_penalties = [11.0, 1.0] gap_trigger = (42, 21.899999999999999) gap_x_dropoff = (38, 14.800000000000001) gap_x_dropoff_final = (64, 24.899999999999999) gapped = 1 hsps_gapped = 291 hsps_no_gap = 136 hsps_prelim_gapped = 14 hsps_prelim_gapped_attemped = None ka_params = [0.313, 0.13100000000000001, 0.39700000000000002] ka_params_gap = [0.27000000000000002, 0.047, 0.23000000000000001] matrix = 'BLOSUM62' multiple_alignment = None num_extends = 370731 num_good_extends = 1257 num_hits = 9176623 num_letters_in_database = [26335942] num_seqs_better_e = 150 num_sequences = 72615 posted_date = [('May 18, 1998 10:04 AM',)] query = 'sp|P09429|HMG1_HUMAN HIGH MOBILITY GROUP PROTEIN HMG1 (HMG-1) -\nHomo sapiens (Human).' query_length = 214 query_letters = 214 reference = 'Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer, \n Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997), \n"Gapped BLAST and PSI-BLAST: a new generation of protein database search\nprograms", Nucleic Acids Res. 25:3389-3402.' sc_match = None sc_mismatch = None threshold = 11 version = '2.0.11' window_size = 40A Blast.Record has a list of descriptions. Each Blast.Record.Description has the following four properties (with sample data)
description.e = 2.0000000000000001e-89 description.num_alignments = 1 description.score = 326 description.title = 'P07155;P27109;P27428 HIGH MOBILITY GROUP PROTEIN HMG1 (HMG-1) (...'A Blast.Record has a list of alignments. Each Blast.Record.Alignment has a title, a length and a list of HSPs (high scoring ... somethings; two sequences have have multiple regions with high degress of similarity)
alignment.hsps = [list of Blast.Record.HSP objects; see below] alignment.length = 144 alignment.title = '>P40619 HMG1/2-LIKE PROTEIN.'And finally, each HSP has the following attributes
hsp.bits = 77.0 hsp.expect = 0.16 hsp.frame = () hsp.gaps = (5, 76) hsp.identities = (22, 76) hsp.match = '+G K+P AF V + E HK + S+ +E SK+ RWK+++ EK F A+ K + + Y' hsp.num_alignments = 1 hsp.positives = (35, 76) hsp.query = 'KGDPKKPRGKMSSYAFFVQTCREEHKKKHPDASVNFSEFSKKCSERWKTMSAKEKGKFEDMAKADKARYEREMKTY' hsp.query_start = 2 hsp.sbjct = 'EGHVKRPMN-----AFMVWSRGERHKLAQQNPSMQNTEISKQLGCRWKSLTEAEKRPFFQEAQRLKILHREKYPNY' hsp.sbjct_start = 2 hsp.score = 34.399999999999999 hsp.strand = (None, None)
Copyright © 2001-2020 Andrew Dalke Scientific AB