Dalke Scientific Software: More science. Less time. Products

In my plotting lecture I showed how to plot the hydrophobicity profile for the bacteriorhodopsin record gi:66360541. I used hard-coded information; the actual sequence data must be in bacteriorhodopsin.fasta and I manually copied over the secondary structure elements.

I want my program to handle any GenBank record with seconary structure information. That is, I want it to read a GenPept file to get the sequence and SecStr fields then plot the hydrophobicity profile and highlight the helix and sheet regions.

I need a data file so I went to that GenBank like and changed the "Send to" to "File". Here's the record, which I'll save as "bR.gp".

LOCUS       1XJI_A                   247 aa            linear   BCT 23-SEP-2004
DEFINITION  Chain A, Bacteriorhodopsin Crystallized In Bicelles At Room
VERSION     1XJI_A  GI:66360541
DBSOURCE    pdb: molecule 1XJI, chain 65, release Sep 23, 2004;
            deposition: Sep 23, 2004;
            class: Membrane Protein;
            source: Mol_id: 1; Organism_scientific: Halobacterium Salinarium;
            Organism_common: Bacteria; Strain: L33;
            Exp. method: X-Ray Diffraction.
SOURCE      Halobacterium salinarum
  ORGANISM  Halobacterium salinarum
            Archaea; Euryarchaeota; Halobacteria; Halobacteriales;
            Halobacteriaceae; Halobacterium.
REFERENCE   1  (residues 1 to 247)
  AUTHORS   Faham,S., Boulting,G.L., Massey,E.A., Yohannan,S., Yang,D. and
  TITLE     Crystallization of bacteriorhodopsin from bicelle formulations at
            room temperature
  JOURNAL   Protein Sci. 14 (3), 836-840 (2005)
   PUBMED   15689517
REFERENCE   2  (residues 1 to 247)
  AUTHORS   Faham,S., Boulting,G.L., Massey,E.A., Yohannan,S., Yang,D. and
  TITLE     Direct Submission
  JOURNAL   Submitted (23-SEP-2004)
COMMENT     Revision History:
            APR 19 5 Initial Entry.
FEATURES             Location/Qualifiers
     source          1..247
                     /organism="Halobacterium salinarum"
     SecStr          9..32
                     /note="helix 1"
     SecStr          36..61
                     /note="helix 2"
     SecStr          64..71
                     /note="strand 1"
     SecStr          72..79
                     /note="strand 2"
     SecStr          82..100
                     /note="helix 3"
     SecStr          104..126
                     /note="helix 4"
     SecStr          130..159
                     /note="helix 5"
     SecStr          164..190
                     /note="helix 6"
     SecStr          200..224
                     /note="helix 7"
     Het             bond(215)
                     /heterogen="(RET, 301 )"
        1 aqitgrpewi wlalgtalmg lgtlyflvkg mgvsdpdakk fyaittlvpa iaftmylsml
       61 lgygltmvpf ggeqnpiywa ryadwlfttp lllldlallv dadqgtilal vgadgimigt
      121 glvgaltkvy syrfvwwais taamlyilyv lffgftskae smrpevastf kvlrnvtvvl
      181 wsaypvvwli gsegagivpl nietllfmvl dvsakvgfgl illrsraifg eaeapepsag
      241 dgaaats

Biopython include a GenBank parser which supports GenPept. The parser is in Bio.GenBank and uses the same style as the Biopython FASTA parser. You need to create the parser first then use the parser to parse the opened input file.

>>> from Bio import GenBank
>>> parser = GenBank.RecordParser()
>>> record = parser.parse(open("bR.gp"))
>>> record
<Bio.GenBank.Record.Record instance at 0x13332b0>

A GenBank record has many fields and Biopython parsers just about all of them. For interactive use you can get a list of object properties using the dir() function.
>>> dir(record)
'__init__', '__module__', '__str__', '_accession_line', '_base_count_line',
'_comment_line', '_contig_line', '_db_source_line', '_definition_line',
'_features_line', '_keywords_line', '_locus_line', '_nid_line',
'_organism_line', '_origin_line', '_pid_line', '_segment_line',
'_sequence_line', '_source_line', '_version_line', 'accession', 'base_counts',
'comment', 'contig', 'data_file_division', 'date', 'db_source', 'definition',
'features', 'gi', 'keywords', 'locus', 'nid', 'organism', 'origin', 'pid',
'primary', 'references', 'residue_type', 'segment', 'sequence', 'size',
'source', 'taxonomy', 'version']
The inspect module provides other ways to do introspection. (Introspection is the programming term for figuring out what's in a data structure using functions in the programming language instead of by looking at the source code.) Here I'll use it to show all the attributes and methods of a Bio.GenBank.Record.Record.
>>> import inspect
>>> for (name, value) in inspect.getmembers(record):
...   print repr(name), "==", repr(str(value)[:50])
'BASE_FORMAT' == '%-12s'
'GB_BASE_INDENT' == '12'
'GB_LINE_LENGTH' == '79'
'INTERNAL_FORMAT' == '  %-10s'
'__doc__' == 'Hold GenBank information in a format similar to th'
'__init__' == '<bound method Record.__init__ of <Bio.GenBank.Reco'
'__module__' == 'Bio.GenBank.Record'
'__str__' == '<bound method Record.__str__ of <Bio.GenBank.Recor'
'_accession_line' == '<bound method Record._accession_line of <Bio.GenBa'
'_base_count_line' == '<bound method Record._base_count_line of <Bio.GenB'
'_comment_line' == '<bound method Record._comment_line of <Bio.GenBank'
'_contig_line' == '<bound method Record._contig_line of <Bio.GenBank.'
'_db_source_line' == '<bound method Record._db_source_line of <Bio.GenBa'
'_definition_line' == '<bound method Record._definition_line of <Bio.GenB'
'_features_line' == '<bound method Record._features_line of <Bio.GenBan'
'_keywords_line' == '<bound method Record._keywords_line of <Bio.GenBan'
'_locus_line' == '<bound method Record._locus_line of <Bio.GenBank.R'
'_nid_line' == '<bound method Record._nid_line of <Bio.GenBank.Rec'
'_organism_line' == '<bound method Record._organism_line of <Bio.GenBan'
'_origin_line' == '<bound method Record._origin_line of <Bio.GenBank.'
'_pid_line' == '<bound method Record._pid_line of <Bio.GenBank.Rec'
'_segment_line' == '<bound method Record._segment_line of <Bio.GenBank'
'_sequence_line' == '<bound method Record._sequence_line of <Bio.GenBan'
'_source_line' == '<bound method Record._source_line of <Bio.GenBank.'
'_version_line' == '<bound method Record._version_line of <Bio.GenBank'
'accession' == "['1XJI_A']"
'base_counts' == ''
'comment' == 'Revision History:\nAPR 19 5 Initial Entry.'
'contig' == ''
'data_file_division' == 'BCT'
'date' == '23-SEP-2004'
'db_source' == 'pdb: molecule 1XJI, chain 65, release Sep 23, 2004'
'definition' == 'Chain A, Bacteriorhodopsin Crystallized In Bicelle'
'features' == '[<Bio.GenBank.Record.Feature instance at 0x13697d8'
'gi' == '66360541'
'keywords' == "['']"
'locus' == '1XJI_A'
'nid' == ''
'organism' == 'Halobacterium salinarum'
'origin' == ''
'pid' == ''
'primary' == '[]'
'references' == '[<Bio.GenBank.Record.Reference instance at 0x133d3'
'residue_type' == 'linear'
'segment' == ''
'size' == '247'
'source' == 'Halobacterium salinarum'
'taxonomy' == "['Archaea', 'Euryarchaeota', 'Halobacteria', 'Halo"
'version' == '1XJI_A'

Introspection lets you get an idea of the data structure without needing to track down the source code. The interactive shell supports introspection to let you know the list of available properties. If you're using the shell through the terminal and you have readline support (so that the up- and down-arrows let you scroll through the history) then you can enable tab completion using the rlcompleter module. Here's what it looks like for me; the <<tab>> shows where I pressed the tab key.

>>> record.<<tab>>
record.BASE_FEATURE_FORMAT         record._pid_line
record.BASE_FORMAT                 record._segment_line
record.GB_BASE_INDENT              record._sequence_line
record.GB_FEATURE_INDENT           record._source_line
record.GB_FEATURE_INTERNAL_INDENT  record._version_line
record.GB_INTERNAL_INDENT          record.accession
record.GB_LINE_LENGTH              record.base_counts
record.GB_OTHER_INTERNAL_INDENT    record.comment
record.GB_SEQUENCE_INDENT          record.contig
record.INTERNAL_FEATURE_FORMAT     record.data_file_division
record.INTERNAL_FORMAT             record.date
record.OTHER_INTERNAL_FORMAT       record.db_source
record.SEQUENCE_FORMAT             record.definition
record.__class__                   record.features
record.__doc__                     record.gi
record.__init__                    record.keywords
record.__module__                  record.locus
record.__str__                     record.nid
record._accession_line             record.organism
record._base_count_line            record.origin
record._comment_line               record.pid
record._contig_line                record.primary
record._db_source_line             record.references
record._definition_line            record.residue_type
record._features_line              record.segment
record._keywords_line              record.sequence
record._locus_line                 record.size
record._nid_line                   record.source
record._organism_line              record.taxonomy
record._origin_line                record.version
Here are some of the fields in the record
>>> record.sequence
>>> record.features
feature.__class__   feature.__init__    feature.__str__     feature.location
feature.__doc__     feature.__module__  feature.key         feature.qualifiers
>>> feature.key
>>> feature = record.features[1]
>>> feature.key
>>> feature.location 
>>> feature.qualifiers
[<Bio.GenBank.Record.Qualifier instance at 0x13414b8>, <Bio.GenBank.
Record.Qualifier instance at 0x13414e0>]
>>> feature.qualifiers[0]
<Bio.GenBank.Record.Qualifier instance at 0x13414b8>
>>> feature.qualifiers[0].key
>>> feature.qualifiers[0].value

In case you didn't catch it in the above, here's how to get the sequence information and the secondary structure element information.

## print_secstr
# print the secondary structure fields from a GenBank record
from Bio import GenBank

parser = GenBank.RecordParser()
record = parser.parse(open("bR.gp"))

print "sequence", repr(record.sequence)
for feature in record.features:
    if feature.key == "SecStr":
        locations = feature.location.split("..")
        start = int(locations[0])
        end = int(locations[1])
        sec_str_type = "unknown"
        for qualifier in feature.qualifiers:
            if qualifier.key == "/sec_str_type=":
                # get rid of the quotes around the qualifier
                sec_str_type = qualifier.value[1:-1]
        print sec_str_type, "starts at", start, "and ends at", end

(I don't like how the qualifier information contains the quotes. I think the quotes are part of the syntax of the file format and aren't part of the actual data. Oh well.)

Here's the output

helix starts at 9 and ends at 32
helix starts at 36 and ends at 61
sheet starts at 64 and ends at 71
sheet starts at 72 and ends at 79
helix starts at 82 and ends at 100
helix starts at 104 and ends at 126
helix starts at 130 and ends at 159
helix starts at 164 and ends at 190
helix starts at 200 and ends at 224
If you look at the data file (which you really should do) then you'll see that the Biopython reader converted the lower-case sequence letters into upper-case.

For the assignment you'll need to know one more Python feature. So far your programs have used a hard-coded filename for input data. I'll want you to take a name from the command-line. These values are available to Python progams using the sys.argv variable. That's has a list of all the parameter strings passed on the command line. The first argument (sys.argv[0]) is special. It's the actual name of the program being run.

Here are some examples to show you what I mean. I'll create a program called print_args.py which prints the list of command-line arguments.

## show_args.py
# Print the list of command-line arguments

import sys
print sys.argv

I'll run it a few different ways from the command-line. A couple of times I'll use quoted text to show that that's converted into a single string by the shell.
% python print_args.py
% python print_args.py one two three
['print_args.py', 'one', 'two', 'three']
 python print_args.py "Who are you?"
['print_args.py', 'Who are you?']
 python print_args.py say "Polly want a cracker?"
['print_args.py', 'say', 'Polly want a cracker?']

Here's a program which prints the sum of the floating point values passed to it.

# sum the list of values entered on the command-line

import sys

total = 0.0
for arg in sys.argv[1:]:
    value = float(arg)
    total = total + value

print "Sum:", total

along with some examples of it in action.
% python sum.py 
Sum: 0.0
% python sum.py 1
Sum: 1.0
% python sum.py 23 -8
Sum: 15.0
% python sum.py 9.5 1.2 -3.4 2E3
Sum: 2007.3

Finally, here's a program which prints the repr() of the first line of the filename passed to it.

## first_line.py

import sys

# The length of argv is always at least 1 because sys.argv[0] contains
# the name of the program.  If there's a command-line argument then it
# will have more than 1 value.  Use that to check that the filename
# was given.

if len(sys.argv) == 1:
    sys.exit("No filename given")

# The first element is the name of the program being read.
# The second element is the first command-line parameter
filename = sys.argv[1]

line = open(filename).readline()
print repr(line)

Some examples
% python first_line.py
No filename given
% python first_line.py sum.py 
'# sum the list of values entered on the command-line\n'
% python first_line.py br_sequences.fasta 
'>gi|4376110|emb|CAA49762.1| bacterioopsin [synthetic construct]\n'

Copyright © 2001-2013 Andrew Dalke Scientific AB