hydrophobicity plots with matplotlib

Hydrophobicity plots with matplotlib

Making plots (such as graphs, histograms, contour diagrams) is a surprisingly hard problem. The difficulty comes because plots are meant for people, and good plots require some understanding of how people interpret data and a sense of esthetics. I won't go into those details but for a starting point you might try Edward Tufte's books.

Making a plotting package is also hard. Many people have written plotting libraries for various languages but few have succeeded in making one which is general, flexible and pretty. The best one of these for Python is matplotlib. In this lecture I'll show how to use Python to make a protein hydrophobicity plot.

First though I want to make sure that everyone has matplotlib working. For those on MS Windows, the easiest way is to install the SciPy package. For Linux and Apple users we'll see if it's installed already or you can compile the source.

For the Mac people - I had problems getting the fonts to work correctly with the current release and I needed to use the Tk backend instead of Gtk. Because of the way the Mac's windowing system works, a program that uses a GUI must be run with pythonw instead of the normal python command. I'll help those who run into problems.

(Note to self: here's how I configured the backend.)

% cat ~/.matplotlib/matplotlibrc 
backend       : TkAgg
%

)

When you get it installed, here's a test program to make sure matplotlib is working correctly. When you run this program it should display a window on the screen with a simple sawtooth pattern.

## graph1
# Make a simple sawtooth pattern

from pylab import *

x_data = [0, 1, 2,  3, 4, 5, 6]
y_data = [0, 1, 0, -1, 0, 1, 0]

plot(x_data, y_data)

xlabel("This is the x label")
ylabel("This is the y label")
title("Here is the title")

grid(True)

show()

I think the code is easy enough to understand if you've done any sort of graphs before so I won't go into details here.

The sawtooth pattern isn't that interesting so I'll change the program to use some hydrophobicity values I computed elsewhere. In a bit I'll show you how to compute those values. The graph is meant for biologists so I'll number the residues starting with 1 instead of 0.

## graph2.py
# Plot a few hydrophobicity values

from pylab import *

y_data = [1.8, -3.5, 4.5, -0.7, -0.4, -4.5, -1.6, -3.5, -0.9, 4.5,
          -0.9, 3.8, 1.8, 3.8, -0.4, -0.7, 1.8, 3.8, 1.9, -0.4]

# Normally "range" starts with 0, which is what Python expects.
# This output is meant for biologists, who expect sequence
# coordinates to start with 1, so shift things by 1.
x_data = range(1, len(y_data)+1)

plot(x_data, y_data)

xlabel("residue number")
ylabel("hydrophobicity")
title("Kyte&Doolittle hydrophobicity")
show()

Here's the output

I don't like the gap on the left side of the plot. The matplot developers did a good job of guessing good values for the start and end of the axes coordinates but no algorithm can be perfect for every data set. I can change the x and y ranges for the axes with the axis() function. Here I'll change only the lower and upper limits of the x axis based on the starting and ending residue indicies.

## graph3.py
# Plot a few hydrophobicity values

from pylab import *

y_data = [-3.5, 2.8, -1.6, 1.8, -1.6, -4.5, -1.6, -1.6, -0.7, -1.6,
          -3.5, 1.8, -3.5, 4.5, -0.7, -0.4, -4.5, -1.6, -3.5, -0.9]

# Normally "range" starts with 0, which is what Python expects.
# This output is meant for biologists, who expect sequence
# coordinates to start with 1, so shift things by 1.
x_data = range(1, len(y_data)+1)

plot(x_data, y_data, linewidth=1.0)

# Start the numbering from 1 and not from 0.
# End at the last residue.
num_residues = len(y_data)
axis(xmin = 1, xmax = num_residues)

xlabel("residue number")
ylabel("hydrophobicity")
title("Kyte&Doolittle hydrophobicity")


show()

Here's the output without the gap on the left side.

For the previous example I used a hard-coded list of hydrophobity values. In real code I would rather have them computed from a sequence. For this talk I'll copy the Kyte&Doolittle values from Biopython's Bio.SeqUtils.ProtParamData.kd. That's a dictionary mapping each residue (as a single letter code) to a hydrophobicity value.

>>> kd = { 'A': 1.8,'R':-4.5,'N':-3.5,'D':-3.5,'C': 2.5,
...        'Q':-3.5,'E':-3.5,'G':-0.4,'H':-3.2,'I': 4.5,
...        'L': 3.8,'K':-3.9,'M': 1.9,'F': 2.8,'P':-1.6,
...        'S':-0.8,'T':-0.7,'W':-0.9,'Y':-1.3,'V': 4.2 }
>>> kd["A"]
1.8
>>> kd["M"]
1.8999999999999999
>>> sequence = "WERDNA"
>>> for residue in sequence:
...   print kd[residue]
... 
-0.9
-3.5
-4.5
-3.5
-3.5
1.8
>>>

To get the list of hydrophobicity values I'll start with an empty list named values and for each residue I'll look up the hydrophobicity value and append it to the list.

I need a sequence so I'll use the fasta_reader.py from last week to get the first (and only) record from bacteriorhodopsin.fasta.

## graph4.py
# Compute hydrophobity values for a sequence

import fasta_reader

# Kyte & Doolittle index of hydrophobicity
kd = { 'A': 1.8,'R':-4.5,'N':-3.5,'D':-3.5,'C': 2.5,
       'Q':-3.5,'E':-3.5,'G':-0.4,'H':-3.2,'I': 4.5,
       'L': 3.8,'K':-3.9,'M': 1.9,'F': 2.8,'P':-1.6,
       'S':-0.8,'T':-0.7,'W':-0.9,'Y':-1.3,'V': 4.2 }

infile = open("bacteriorhodopsin.fasta")
record = fasta_reader.read_fasta_record(infile)

# Make a list of hydrophobicity values, one value for each character
# in the string.
values = []
for residue in record.sequence:
    values.append(kd[residue])

# Print the results to see that it worked
print "K&D hydrophobicity values:"
for value in values:
    print str(value) + ",",

Here's what the output looks like (formatted somewhat because otherwise everything would be on one very long line)

K&D hydrophobicity values:
1.8, -3.5, 4.5, -0.7, -0.4, -4.5, -1.6, -3.5, -0.9, 4.5, -0.9, 3.8,
8, 3.8, -0.4, -0.7, 1.8, 3.8, 1.9, -0.4, 3.8, -0.4, -0.7, 3.8,
-1.3, 2.8, 3.8, 4.2, -3.9, -0.4, 1.9, -0.4, 4.2, -0.8, -3.5, -1.6,
.5, 1.8, -3.9, -3.9, 2.8, -1.3, 1.8, 4.5, -0.7, -0.7, 3.8, 4.2,
-1.6, 1.8, 4.5, 1.8, 2.8, -0.7, 1.9, -1.3, 3.8, -0.8, 1.9, 3.8, 3.8,
.4, -1.3, -0.4, 3.8, -0.7, 1.9, 4.2, -1.6, 2.8, -0.4, -0.4, -3.5,
-3.5, -3.5, -1.6, 4.5, -1.3, -0.9, 1.8, -4.5, -1.3, 1.8, -3.5, -0.9,
8, 2.8, -0.7, -0.7, -1.6, 3.8, 3.8, 3.8, 3.8, -3.5, 3.8, 1.8, 3.8,
3.8, 4.2, -3.5, 1.8, -3.5, -3.5, -0.4, -0.7, 4.5, 3.8, 1.8, 3.8,
2, -0.4, 1.8, -3.5, -0.4, 4.5, 1.9, 4.5, -0.4, -0.7, -0.4, 3.8,
4.2, -0.4, 1.8, 3.8, -0.7, -3.9, 4.2, -1.3, -0.8, -1.3, -4.5, 2.8,
2, -0.9, -0.9, 1.8, 4.5, -0.8, -0.7, 1.8, 1.8, 1.9, 3.8, -1.3,
4.5, 3.8, -1.3, 4.2, 3.8, 2.8, 2.8, -0.4, 2.8, -0.7, -0.8, -3.9,
8, -3.5, -0.8, 1.9, -4.5, -1.6, -3.5, 4.2, 1.8, -0.8, -0.7, 2.8,
-3.9, 4.2, 3.8, -4.5, -3.5, 4.2, -0.7, 4.2, 4.2, 3.8, -0.9, -0.8,
8, -1.3, -1.6, 4.2, 4.2, -0.9, 3.8, 4.5, -0.4, -0.8, -3.5, -0.4,
1.8, -0.4, 4.5, 4.2, -1.6, 3.8, -3.5, 4.5, -3.5, -0.7, 3.8, 3.8,
8, 1.9, 4.2, 3.8, -3.5, 4.2, -0.8, 1.8, -3.9, 4.2, -0.4, 2.8,
-0.4, 3.8, 4.5, 3.8, 3.8, -4.5, -0.8, -4.5, 1.8, 4.5, 2.8, -0.4,
-3.5, 1.8, -3.5, 1.8, -1.6, -3.5, -1.6, -0.8, 1.8, -0.4, -3.5, -0.4,
1.8, 1.8, 1.8, -0.7, -0.8,

I'll merge this code (graph4) with the previous one (graph3) to plot the hydrophobicities for that FASTA record. The first part computes the values and the second part plots the hydrophobicity graph.

## graph5.py
# Plot hydrophobity values for a sequence

import fasta_reader
from pylab import *

# Kyte & Doolittle index of hydrophobicity
kd = { 'A': 1.8,'R':-4.5,'N':-3.5,'D':-3.5,'C': 2.5,
       'Q':-3.5,'E':-3.5,'G':-0.4,'H':-3.2,'I': 4.5,
       'L': 3.8,'K':-3.9,'M': 1.9,'F': 2.8,'P':-1.6,
       'S':-0.8,'T':-0.7,'W':-0.9,'Y':-1.3,'V': 4.2 }

infile = open("bacteriorhodopsin.fasta")
record = fasta_reader.read_fasta_record(infile)
num_residues = len(record.sequence)

values = []
for residue in record.sequence:
    values.append(kd[residue])

## Plot the hydrophobicity values

# Normally "range" starts with 0, which is what Python expects.
# This output is meant for biologists, who expect sequence
# coordinates to start with 1, so shift things by 1.
x_data = range(1, num_residues+1)

plot(x_data, values, linewidth=1.0)

# Show exactly the length of the sequence
axis(xmin = 1, xmax = num_residues)

xlabel("residue number")
ylabel("hydrophobicity")

record_id = record.title.split()[0]
title("K&D hydrophobicity for " + record_id)

show()

Did you notice that bit of code at the end to extract the sequence identifer from the FastaRecord's title field? I used the split() method to break the title fields into words. The first word is the identifier "gi|66360541|pdb|1XJI|A", which I used to make the title for the graph.

By the way, if you want to save the graph to a file use the savefig function instead of show. Here's how I saved the above plot

#show()
savefig('graph5.png', dpi=80)

That hydrophobicity plot is hard to understand. It's very noisy. The reason is because the hydrophobicity value is a measure of the propensity of a given residue type for a hydrophobic region (like a membrane or the inside of a globular protein). It is not an exact predictor. A residue with high hydrophobic propensity could still be in water, and vice versa. Hydrophobicities add up so if there are a few hydrophobic residues in a row then it's more likely that they are in a hydrophobic region.

What I'll do is use the average values for a residue and its two neighbors; one to the left and one to the right). I need to be careful because the very first and last residues have only one neighbor. I could use the average of the two residues for this case but that's more complicated so I'll just ignore the two end residues.

## graph6.py
# Plot hydrophobity values for a sequence
# The value for a residue is the average of the residue and its two
# neighbors (window size = 3)

import fasta_reader
from pylab import *

# Kyte & Doolittle index of hydrophobicity
kd = { 'A': 1.8,'R':-4.5,'N':-3.5,'D':-3.5,'C': 2.5,
       'Q':-3.5,'E':-3.5,'G':-0.4,'H':-3.2,'I': 4.5,
       'L': 3.8,'K':-3.9,'M': 1.9,'F': 2.8,'P':-1.6,
       'S':-0.8,'T':-0.7,'W':-0.9,'Y':-1.3,'V': 4.2 }

infile = open("bacteriorhodopsin.fasta")
record = fasta_reader.read_fasta_record(infile)
num_residues = len(record.sequence)

values = []
for residue in record.sequence:
    values.append(kd[residue])

# I've computed the values for each residue.  Now compute
# the moving average.  Can't use the first and last residues
# because they don't have neighbors.
y_data = []
for i in range(1, num_residues-1):
    left = values[i-1]
    center = values[i]
    right = values[i+1]
    average_value = (left + center + right) / 3
    y_data.append(average_value)

# Exclude the first and last residues.
# The final list is in biologist coordinates (the first residue is
# index 1)
x_data = range(1, 1+len(y_data))

plot(x_data, y_data, linewidth=1.0)

# Show exactly the length of the sequence
axis(xmin = 1, xmax = num_residues)

xlabel("residue number")
ylabel("hydrophobicity (moving average over 3 values)")

record_id = record.title.split(" ", 1)[0]
title("K&D hydrophobicity for " + record_id)

show()

Not as hard to read, though still a bit jagged.

Mathematically speaking this technique is called data smoothing. It filters out (removes) the high frequency noise to emphasize lower frequency effects. I found a nice animation [source] which demonstrates the algorithm.

The algorithm says that for a given residue get its value and the values for its neighbors and compute the average. If you get the closest neighbor on each side then there are three residues in the average. In technical language this has a window size of 3. The animation uses a window size of 5 because 5 points are used for the smoothing.

After you compute the value for a point, move the window over by one residue and compute the new smoothed value. Repeat until done. The technical language for this is a sliding window.

A sliding window of width 3 does smooth the graph a bit. What about of width 5? Here's code which smooths the data by averaging the value for residue with its 4 nearest neighbors (the 2 on each side).

## graph7.py
# Plot hydrophobity values for a sequence
# The value for a residue is the average of the residue and its 4
# nearest neighbors (window size = 5)

import fasta_reader
from pylab import *

# Kyte & Doolittle index of hydrophobicity
kd = { 'A': 1.8,'R':-4.5,'N':-3.5,'D':-3.5,'C': 2.5,
       'Q':-3.5,'E':-3.5,'G':-0.4,'H':-3.2,'I': 4.5,
       'L': 3.8,'K':-3.9,'M': 1.9,'F': 2.8,'P':-1.6,
       'S':-0.8,'T':-0.7,'W':-0.9,'Y':-1.3,'V': 4.2 }

infile = open("bacteriorhodopsin.fasta")
record = fasta_reader.read_fasta_record(infile)
num_residues = len(record.sequence)

values = []
for residue in record.sequence:
    values.append(kd[residue])

# I've computed the values for each residue.  Now compute
# the moving average.  Can't use the first and last two residues
# because they don't have all four neighbors.
y_data = []
for i in range(2, num_residues-2):
    leftleft = values[i-2]
    left = values[i-1]
    center = values[i]
    right = values[i+1]
    rightright = values[i+2]
    average_value = (leftleft + left + center + right + rightright) / 5
    y_data.append(average_value)

# Exclude the first and last residues.
# The final list is in biologist coordinates (the first residue is
# index 1)
x_data = range(2, 2+len(y_data))

plot(x_data, y_data, linewidth=1.0)

# Show exactly the length of the sequence
axis(xmin = 1, xmax = num_residues)

xlabel("residue number")
ylabel("hydrophobicity (moving average over 5 values)")

record_id = record.title.split(" ", 1)[0]
title("K&D hydrophobicity for " + record_id)

show()

When you're doing data analysis you need to be careful that what you're doing makes physical sense and that you can understand results. The moving average gives the propensity for a short subsequence to be in solvent vs. a hydrophobic environment. Suppose though that I'm interesting in identifying helicies which span a membrane. From the physical sizes of a helix and a membrane I know a helix is about 20 bases long. I want to see what happens when I use a window size of 19, which means computing the average value of the given point and its 9 neighbors in each direction.

I don't want to make 19 different variables (like "left1" for the neighbor to the immediate left, "left9" for the neighbor 9 to the left, etc.) Instead I'll use a loop to get the values for the residues from position i-9 to i+9, where i is the residue position. I'll add these up and divide by 19 (the window size) to get the smoothed value for that point.

## graph8.py
# Plot hydrophobity values for a sequence
# Use a moving average of width 19.
# Show a cutoff for Kyte&Doolittle transmembrane helix prediction

import fasta_reader
from pylab import *

# Kyte & Doolittle index of hydrophobicity
kd = { 'A': 1.8,'R':-4.5,'N':-3.5,'D':-3.5,'C': 2.5,
       'Q':-3.5,'E':-3.5,'G':-0.4,'H':-3.2,'I': 4.5,
       'L': 3.8,'K':-3.9,'M': 1.9,'F': 2.8,'P':-1.6,
       'S':-0.8,'T':-0.7,'W':-0.9,'Y':-1.3,'V': 4.2 }

infile = open("bacteriorhodopsin.fasta")
record = fasta_reader.read_fasta_record(infile)
num_residues = len(record.sequence)

values = []
for residue in record.sequence:
    values.append(kd[residue])

# I've computed the values for each residue.  Now compute the moving
# average for a window size of 19.  Transmembrane helixes are often
# about 20 residues in length so peaks in this plot suggest
# transmembrane regions.

window_size = 19
half_window = (window_size-1)/2

y_data = []
for i in range(half_window, num_residues-half_window):
    average_value = 0.0
    for j in range(-half_window, half_window+1):
        average_value += values[i+j]
    y_data.append(average_value / window_size)

# Exclude the first and last residues.
# The final list is in biologist coordinates (the first residue is
# index 1)
x_data = range(half_window, half_window+len(y_data))

plot(x_data, y_data, linewidth=1.0)

# Draw a reasonable cutoff for membrane prediction
# Value of 1.6 taken from
#   http://arbl.cvmbs.colostate.edu/molkit/hydropathy/
axhline(y=1.6)

# Show exactly the length of the sequence
axis(xmin = 1, xmax = num_residues)

xlabel("residue number")
ylabel("hydrophobicity (moving average over %d values)" % window_size)

record_id = record.title.split(" ", 1)[0]
title("K&D hydrophobicity for " + record_id)

show()

If you look in the plotting section you'll see I added a call to axhline(y=1.6). This puts a horizontal line across the graph at a place where the smoothed hydrophobicity value suggests that the region is a transmembrane region. Here's the resulting image.

As it turns out there's 3D structure information for this record. (Okay, I deliberately chose one with a structure. ;) Here's what PDB 1XJI looks like using VMD, a structure visualization program I helped write many years ago.

Bacteriorhodopin has 7 transmembrane helicies and two small beta strands which sit on the membrane/water interface. This view shows bR from the "side", with the top and bottom sticking out of the membrane. I want to compare the prediction to the actual secondary structure values. I'll add boxes to the plot to show the helicies (in yellow) and the strands (in red).

The matplotlib axvspan function is meant exactly for this purpose. I'll make a vertical span for each secondary structure element, with the start and end points of the span being the start and end points of the element. I don't want the span to cover and block the graph so I'll tell the command to use a somewhat transparent color. In the computer world this is called the alpha value for the color. An opaque color has an alpha of 1.0 and a transparent color has an alpha of 0.0.

To get the secondary structure elements I used the feature fields from the GenBank entry for gi|66360541 which are

     SecStr          9..32
                     /sec_str_type="helix"
                     /note="helix 1"
     SecStr          36..61
                     /sec_str_type="helix"
                     /note="helix 2"
     SecStr          64..71
                     /sec_str_type="sheet"
                     /note="strand 1"
     SecStr          72..79
                     /sec_str_type="sheet"
                     /note="strand 2"
     SecStr          82..100
                     /sec_str_type="helix"
                     /note="helix 3"
     SecStr          104..126
                     /sec_str_type="helix"
                     /note="helix 4"
     SecStr          130..159
                     /sec_str_type="helix"
                     /note="helix 5"
     SecStr          164..190
                     /sec_str_type="helix"
                     /note="helix 6"
     SecStr          200..224
                     /sec_str_type="helix"
                     /note="helix 7"

I manually converted this description into the following

# Draw the known helix and ribbon ranges
axvspan(9, 32, facecolor="yellow", alpha=0.4) # helix
axvspan(36, 61, facecolor="yellow", alpha=0.4)
axvspan(64, 71, facecolor="red", alpha=0.4)  # sheet
axvspan(72, 79, facecolor="red", alpha=0.4)  # sheet
axvspan(82, 100, facecolor="yellow", alpha=0.4)
axvspan(104, 126, facecolor="yellow", alpha=0.4)
axvspan(130, 159, facecolor="yellow", alpha=0.4)
axvspan(164, 190, facecolor="yellow", alpha=0.4)
axvspan(200, 224, facecolor="yellow", alpha=0.4)

The resulting code and image is

## graph9.py
# Plot hydrophobity values for a sequence
# Use a moving average of width 19.
# Show a cutoff for Kyte&Doolittle transmembrane helix prediction
# The bR structure is known so highlight the secondary structure features; from
#  http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=protein&val=66360541

import fasta_reader
from pylab import *

# Kyte & Doolittle index of hydrophobicity
kd = { 'A': 1.8,'R':-4.5,'N':-3.5,'D':-3.5,'C': 2.5,
       'Q':-3.5,'E':-3.5,'G':-0.4,'H':-3.2,'I': 4.5,
       'L': 3.8,'K':-3.9,'M': 1.9,'F': 2.8,'P':-1.6,
       'S':-0.8,'T':-0.7,'W':-0.9,'Y':-1.3,'V': 4.2 }

infile = open("bacteriorhodopsin.fasta")
record = fasta_reader.read_fasta_record(infile)
num_residues = len(record.sequence)

values = []
for residue in record.sequence:
    values.append(kd[residue])

# I've computed the values for each residue.  Now compute the moving
# average for a window size of 19.  Transmembrane helixes are often
# about 20 residues in length so peaks in this plot suggest
# transmembrane regions.

window_size = 19
half_window = (window_size-1)/2

# Precompute the offsets list for better performance.
offsets = range(-half_window, half_window+1)

y_data = []
for i in range(half_window, num_residues-half_window):
    average_value = 0.0
    for offset in offsets:
        average_value += values[i+offset]
    y_data.append(average_value / window_size)

# Exclude the first and last residues.
# The final list is in biologist coordinates (the first residue is
# index 1)
x_data = range(half_window, half_window+len(y_data))

plot(x_data, y_data, linewidth=1.0)

# Draw a reasonable cutoff for membrane prediction
# Value of 1.6 taken from
#   http://arbl.cvmbs.colostate.edu/molkit/hydropathy/
axhline(y=1.6)

# Draw the known helix and ribbon ranges
axvspan(9, 32, facecolor="yellow", alpha=0.4) # helix
axvspan(36, 61, facecolor="yellow", alpha=0.4)
axvspan(64, 71, facecolor="red", alpha=0.4)  # strand
axvspan(72, 79, facecolor="red", alpha=0.4)  # strand
axvspan(82, 100, facecolor="yellow", alpha=0.4)
axvspan(104, 126, facecolor="yellow", alpha=0.4)
axvspan(130, 159, facecolor="yellow", alpha=0.4)
axvspan(164, 190, facecolor="yellow", alpha=0.4)
axvspan(200, 224, facecolor="yellow", alpha=0.4)

# Show exactly the length of the sequence
axis(xmin = 1, xmax = num_residues)

xlabel("residue number")
ylabel("hydrophobicity (moving average over %d values)" % window_size)

record_id = record.title.split(" ", 1)[0]
title("K&D hydrophobicity for " + record_id)

show()

Hmm. No false positives on the helicies but it didn't quite get helix 3 or 6. Perhaps the threshold is too high, but then it would suggest that the region between helicies 3 and 4 is a transmembrane region. See also this web page which provides relevant commentary.

In graph9.py I made one change to the computation code. It used to say

y_data = []
for i in range(half_window, num_residues-half_window):
    average_value = 0.0
    for j in range(-half_window, half_window+1):
        average_value += values[i+j]
    y_data.append(average_value / window_size)

The return value of range(half_window, num_residues-half_window) never changes for a given window size. Rather than have Python re-create that list for each residue I can compute it once outside of the sliding window loop and reuse it. This makes the code a bit faster, though I didn't actually time how much faster. The technical term for this is called loop-hoisting – you "hoist the code out of the loop".

The new code is

# Precompute the offsets list for better performance.
offsets = range(-half_window, half_window+1)

y_data = []
for i in range(half_window, num_residues-half_window):
    average_value = 0.0
    for offset in offsets:
        average_value += values[i+offset]
    y_data.append(average_value / window_size)

I used the average value of the window to compute the smoothed value. The problem with the average that the residues far from the center residue are just as important as central values. That doesn't make much physical sense. I expect them to have less impact. I want the center value weighted the most and the more distant residues to be weighted less and less.

I'll start with a linear weighting scheme. Suppose I have a window of size 5. I'll assign the center residue a weight of 3, its immediate neighbors a weight of 2 and the further residues a weight of 1. This is called a triangle filter function because it looks like this

It doesn't make sense if the weights don't add up to 1 because then the smoothed value for a flat line will be different than the actual value. In case the weights don't add up to 1 (as with this case), I'll normalize the smoothed value by the total weight. (The formal way to say this is "normalize by the total weight.")

By the way, the name for the filter that does the average of all values in the window is called a top-hat filter. Here's why

Given what you've learned so far the code will look something like this

>>> values = [1.8, -3.5, 4.5, -0.7, -0.4, -4.5, -1.6, -3.5, -0.9, 4.5,
...           -0.9, 3.8, 1.8, 3.8, -0.4, -0.7, 1.8, 3.8, 1.9, -0.4]
>>> weights = [1,2,3,2,1]
>>> offsets = [-2, -1, 0, 1, 2]  # or: range(-2, 3)
>>> smoothed_values = []
>>> for i in range(2, len(values)-2):
...     smoothed_value = 0.0
...     for j in range(5):
...         weight = weights[j]
...         offset = offsets[j]
...         smoothed_value = smoothed_value + weight * values[i+offset]
...     smoothed_values.append(smoothed_value)
... 
>>> smoothed_values
[6.5, -1.8999999999999995, -8.6999999999999993, -21.699999999999999, -22.100000000000001,
-15.5, -3.2000000000000006, 10.199999999999999, 14.800000000000001, 21.5, 19.300000000000001, 
17.299999999999997, 8.5999999999999996, 8.3000000000000007, 13.1, 17.699999999999999]
>>>

(Note: There is a slight bug. I need to normalize the smoothed values by the sum of the weights, that is, divide each value by 1+2+3+2+1=9.)

There's a cleaner way to say this. The complication comes because I want to use a for loop on two different lists at the same time. I want to get both the weights and the offsets. I did that here by making an loop over the indicies and getting the correct weight and offset for each j index.

I can instead use the zip function. It's named after a zipper because it combines terms from multiple lists like a zipper merges teeth from the two sides. (Though zip can merge more than two lists.) Here an example of it in use

>>> weights
[1, 2, 3, 2, 1]
>>> offsets
[-2, -1, 0, 1, 2]
>>> zip(weights, offsets)
[(1, -2), (2, -1), (3, 0), (2, 1), (1, 2)]
>>>

It's useful because I can use the result of the zip in a for loop, like this

>>> seq1 = "GATTACA"
>>> seq2 = "GALAGAA"
>>> zip(seq1, seq2)
[('G', 'G'), ('A', 'A'), ('T', 'L'), ('T', 'A'), ('A', 'G'), ('C', 'A'), ('A', 'A')]
>>> for c1, c2 in zip(seq1, seq2):
...   if c1 == c2:
...     print c1, "==", c2
...   else:
...     print c1, "!=", c2
... 
G == G
A == A
T != L
T != A
A != G
C != A
A == A
>>> zip(range(7), seq1, seq2) 
[(0, 'G', 'G'), (1, 'A', 'A'), (2, 'T', 'L'), (3, 'T', 'A'), (4, 'A', 'G'), (5, 'C', 'A'), (6, 'A', 'A')]
>>> for i, c1, c2 in zip(range(7), seq1, seq2):
...     print i,
...     if c1 == c2:
...         print c1, "==", c2
...     else:
...         print c1, "!=", c2
... 
0 G == G
1 A == A
2 T != L
3 T != A
4 A != G
5 C != A
6 A == A
>>>

This uses a feature of Python called tuple assignment, where multiple variables can be assigned to multiple values at the same time.

Going back to the code to compute the weighted values, here it is rewritten to use the zip function

>>> values = [1.8, -3.5, 4.5, -0.7, -0.4, -4.5, -1.6, -3.5, -0.9, 4.5,
...           -0.9, 3.8, 1.8, 3.8, -0.4, -0.7, 1.8, 3.8, 1.9, -0.4]
>>> weights = [1,2,3,2,1]
>>> offsets = range(-2, 3)
>>> smoothed_values = []
>>> for i in range(2, len(values)-2):
...     smoothed_value = 0.0
...     for (weight, offset) in zip(weights, offsets):
...         smoothed_value = smoothed_value + weight * values[i+offset]
...     smoothed_values.append(smoothed_value)
... 
>>> smoothed_values
[6.5, -1.8999999999999995, -8.6999999999999993, -21.699999999999999, -22.100000000000001,
-15.5, -3.2000000000000006, 10.199999999999999, 14.800000000000001, 21.5, 19.300000000000001, 
17.299999999999997, 8.5999999999999996, 8.3000000000000007, 13.1, 17.699999999999999]
>>>

I hope you see that it's a bit easier to read. If not, don't worry much about it. The original way works, it's just that this approach works better.

One last note: the value of zip(weights, offsets) is constant so I could hoist it out of the loop. I didn't for the following code.

Okay, I want to use a triangle filter for the window of size 19 when looking for transmembrane helicies. Here's the code, with a hard-coded list of weights.

## graph10.py
# Plot weight hydrophobity values for a sequence
# Use a triangle average of width 19
# The bR structure is known so highlight the secondary structure features; from
#  http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=protein&val=66360541

import fasta_reader
from pylab import *

# Kyte & Doolittle index of hydrophobicity
kd = { 'A': 1.8,'R':-4.5,'N':-3.5,'D':-3.5,'C': 2.5,
       'Q':-3.5,'E':-3.5,'G':-0.4,'H':-3.2,'I': 4.5,
       'L': 3.8,'K':-3.9,'M': 1.9,'F': 2.8,'P':-1.6,
       'S':-0.8,'T':-0.7,'W':-0.9,'Y':-1.3,'V': 4.2 }

infile = open("bacteriorhodopsin.fasta")
record = fasta_reader.read_fasta_record(infile)
num_residues = len(record.sequence)

values = []
for residue in record.sequence:
    values.append(kd[residue])

# I've computed the values for each residue.  Now compute the moving
# average for a window size of 19.  Transmembrane helixes are often
# about 20 residues in length so peaks in this plot suggest
# transmembrane regions.

window_size = 19
half_window = (window_size-1)/2

# Precompute the offsets list for better performance.
offsets = range(-half_window, half_window+1)
weights = [1,2,3,4,5,6,7,8,9,10,9,8,7,6,5,4,3,2,1]
total_weight = sum(weights)

y_data = []
for i in range(half_window, num_residues-half_window):
    weighted_value = 0.0
    for offset, weight in zip(offsets, weights):
        weighted_value += weight*values[i+offset]
    y_data.append(weighted_value / total_weight)

# Exclude the first and last residues.
# The final list is in biologist coordinates (the first residue is
# index 1)
x_data = range(half_window, half_window+len(y_data))

plot(x_data, y_data, linewidth=1.0)

# Draw a reasonable cutoff for membrane prediction
# Value of 1.6 taken from
#   http://arbl.cvmbs.colostate.edu/molkit/hydropathy/
axhline(y=1.6)

# Draw the known helix and ribbon ranges
axvspan(9, 32, facecolor="yellow", alpha=0.4) # helix
axvspan(36, 61, facecolor="yellow", alpha=0.4)
axvspan(64, 71, facecolor="red", alpha=0.4)  # strand
axvspan(72, 79, facecolor="red", alpha=0.4)  # strand
axvspan(82, 100, facecolor="yellow", alpha=0.4)
axvspan(104, 126, facecolor="yellow", alpha=0.4)
axvspan(130, 159, facecolor="yellow", alpha=0.4)
axvspan(164, 190, facecolor="yellow", alpha=0.4)
axvspan(200, 224, facecolor="yellow", alpha=0.4)

# Show exactly the length of the sequence
axis(xmin = 1, xmax = num_residues)

xlabel("residue number")
ylabel("hydrophobicity (triangle average over %d values)" % window_size)

record_id = record.title.split(" ", 1)[0]
title("K&D hydrophobicity for " + record_id)

show()

along with the graph

Compare this with the previous plot, which did the averaging. This one looks a lot smoother. To understand why you should learn about Fourier series. The short version is that sharp angles in the filter (like the drop from 1 to 0 at the end of the top-hat) causes lots of "ringing" (high-frequency noise) in the filtered output. The triangle filter has a sharp angle but in the first derivative so it doesn't have as much noise.

I want to compare the different smoothing functions (average vs. triangle) by viewing both results in the same plot. I'll need to use the smoothing code twice so I'll convert that code into a function. The new function will take the original list of hydrophobicity values and the weight values to use and return the list of smoothed values.

As with last week I'll do the conversion in two steps: rewrite the existing code with a function then extend the result to handle multiple plots. At this point you should understand the process of moving existing code into a function so I won't go into the details and just show the result.

## graph11.py
# Plot weight hydrophobity values for a sequence
# Use a triangle average of width 19
# The bR structure is known so highlight the secondary structure features; from
#  http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=protein&val=66360541

import fasta_reader
from pylab import *

def get_per_residue_values(sequence, table):
    values = []
    for residue in sequence:
        values.append(table[residue])
    return values

# Smooth a list of values using on a sliding filter.  The filter weights
# are in "weights" which must have an odd number of elements.  The middle
# element is the weight for the center of the window.
def smooth_values(values, weights):
    window_size = len(weights)
    if window_size%2 != 1:
        raise TypeError("smoothing requires an odd number of weights")
    
    half_window = (window_size-1)/2

    # Precompute the offset values for better performance.
    offsets = range(-half_window, half_window+1)
    offset_data = zip(offsets, weights)

    # normalize the weights in case the sum != 1
    total_weight = sum(weights)

    weighted_values = []
    for i in range(half_window, len(values)-half_window):
        weighted_value = 0.0
        for offset, weight in offset_data:
            weighted_value += weight*values[i+offset]
        weighted_values.append(weighted_value / total_weight)

    return weighted_values


def main():
    # Kyte & Doolittle index of hydrophobicity
    kd = { 'A': 1.8,'R':-4.5,'N':-3.5,'D':-3.5,'C': 2.5,
           'Q':-3.5,'E':-3.5,'G':-0.4,'H':-3.2,'I': 4.5,
           'L': 3.8,'K':-3.9,'M': 1.9,'F': 2.8,'P':-1.6,
           'S':-0.8,'T':-0.7,'W':-0.9,'Y':-1.3,'V': 4.2 }

    weights = [1,2,3,4,5,6,7,8,9,10,9,8,7,6,5,4,3,2,1]
    
    infile = open("bacteriorhodopsin.fasta")
    record = fasta_reader.read_fasta_record(infile)
    
    values = get_per_residue_values(record.sequence, kd)
    smoothed_values = smooth_values(values, weights)
    
    ## Make the plot
    
    half_window = (len(weights)-1)/2
    x_data = range(half_window, half_window+len(smoothed_values))
    
    plot(x_data, smoothed_values, linewidth=1.0)

    # Draw a reasonable cutoff for membrane prediction
    # Value of 1.6 taken from
    #   http://arbl.cvmbs.colostate.edu/molkit/hydropathy/
    axhline(y=1.6)
    
    # Draw the known helix and ribbon ranges
    axvspan(9, 32, facecolor="yellow", alpha=0.4) # helix
    axvspan(36, 61, facecolor="yellow", alpha=0.4)
    axvspan(64, 71, facecolor="red", alpha=0.4)  # strand
    axvspan(72, 79, facecolor="red", alpha=0.4)  # strand
    axvspan(82, 100, facecolor="yellow", alpha=0.4)
    axvspan(104, 126, facecolor="yellow", alpha=0.4)
    axvspan(130, 159, facecolor="yellow", alpha=0.4)
    axvspan(164, 190, facecolor="yellow", alpha=0.4)
    axvspan(200, 224, facecolor="yellow", alpha=0.4)
    
    # Show exactly the length of the sequence
    axis(xmin = 1, xmax = len(record.sequence))
    
    xlabel("residue number")
    ylabel("hydrophobicity")
    
    record_id = record.title.split(" ", 1)[0]
    title("Smoothed K&D hydrophobicity for " + record_id)
    
    show()


if __name__ == "__main__":
    main()

The output is unchanged so I won't show it.

I want to plot the average and the triangle plots on the same graph, to see how they compare. While working on this lecture I came across Savitzky-Golay weights. I found a description of how it works here and wrote a function to figure out the Savitzky-Golay quadratic smooth weights for a window of width 19. This should be smoother than the triangle function so I plotted it as well. Feel free to look at the code to compute the Savitzky-Golay weights. Here's how I used it to get the weights for the final plotting code.

>>> import savitzky_golay
>>> savitzky_golay.savitzky_golay_weights(19)
array([-0.06015038, -0.02255639,  0.01061477,  0.03936311,  0.06368863,  0.08359133,
             0.09907121,  0.11012826,  0.11676249,  0.11897391,  0.11676249,
             0.11012826,  0.09907121,  0.08359133,  0.06368863,  0.03936311,
             0.01061477, -0.02255639, -0.06015038])
>>>

I have three weighting functions so I'll have three smoothed values. It's very simple to make three overlapping plots in matplotlib — call the plot() function once for each plot. It's possible to pass the values all in one go except I wanted to give each one a different label so I can see the legend(). Here's the final code

## graph12.py
# Compare several different smoothing functions for the hydrophobicity of bR
# The bR structure is known so highlight the secondary structure features; from
#  http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=protein&val=66360541

import fasta_reader
from pylab import *

def get_per_residue_values(sequence, table):
    values = []
    for residue in sequence:
        values.append(table[residue])
    return values

# Smooth a list of values using on a sliding filter.  The filter weights
# are in "weights" which must have an odd number of elements.  The middle
# element is the weight for the center of the window.
def smooth_values(values, weights):
    window_size = len(weights)
    if window_size%2 != 1:
        raise TypeError("smoothing requires an odd number of weights")
    
    half_window = (window_size-1)/2

    # Precompute the offset values for better performance.
    offsets = range(-half_window, half_window+1)
    offset_data = zip(offsets, weights)

    # normalize the weights in case the sum != 1
    total_weight = sum(weights)

    weighted_values = []
    for i in range(half_window, len(values)-half_window):
        weighted_value = 0.0
        for offset, weight in offset_data:
            weighted_value += weight*values[i+offset]
        weighted_values.append(weighted_value / total_weight)

    return weighted_values


def main():
    # Kyte & Doolittle index of hydrophobicity
    kd = { 'A': 1.8,'R':-4.5,'N':-3.5,'D':-3.5,'C': 2.5,
           'Q':-3.5,'E':-3.5,'G':-0.4,'H':-3.2,'I': 4.5,
           'L': 3.8,'K':-3.9,'M': 1.9,'F': 2.8,'P':-1.6,
           'S':-0.8,'T':-0.7,'W':-0.9,'Y':-1.3,'V': 4.2 }

    average_weights = [1]*19
    triangle_weights = [1,2,3,4,5,6,7,8,9,10,9,8,7,6,5,4,3,2,1]

    # Savitzky-Golay smooth quadratic weights
    savitzky_golay_weights = [
        -60.15037594, -22.55639098, 10.61477222, 39.36311367,
        63.68863335, 83.59133127, 99.07120743, 110.12826183,
        116.76249447, 118.97390535, 116.76249447, 110.12826183,
        99.07120743, 83.59133127, 63.68863335, 39.36311367,
        10.61477222, -22.55639098, -60.15037594]
    
    infile = open("bacteriorhodopsin.fasta")
    record = fasta_reader.read_fasta_record(infile)
    
    values = get_per_residue_values(record.sequence, kd)

    average_values = smooth_values(values, average_weights)
    triangle_values = smooth_values(values, triangle_weights)
    savitzky_golay_values = smooth_values(values, savitzky_golay_weights)
    
    ## Make the plot
    
    half_window = (len(average_weights)-1)/2
    x_data = range(half_window, half_window+len(average_values))
    
    plot(x_data, average_values, linewidth=1.0, label="average")
    plot(x_data, triangle_values, linewidth=1.0, label="triangle")
    plot(x_data, savitzky_golay_values, linewidth=1.0, label="Savitzky-Golay")

    legend()

    # Draw a reasonable cutoff for membrane prediction
    # Value of 1.6 taken from
    #   http://arbl.cvmbs.colostate.edu/molkit/hydropathy/
    axhline(y=1.6)
    
    # Draw the known helix and ribbon ranges
    axvspan(9, 32, facecolor="yellow", alpha=0.4) # helix
    axvspan(36, 61, facecolor="yellow", alpha=0.4)
    axvspan(64, 71, facecolor="red", alpha=0.4)  # strand
    axvspan(72, 79, facecolor="red", alpha=0.4)  # strand
    axvspan(82, 100, facecolor="yellow", alpha=0.4)
    axvspan(104, 126, facecolor="yellow", alpha=0.4)
    axvspan(130, 159, facecolor="yellow", alpha=0.4)
    axvspan(164, 190, facecolor="yellow", alpha=0.4)
    axvspan(200, 224, facecolor="yellow", alpha=0.4)
    
    # Show exactly the length of the sequence
    axis(xmin = 1, xmax = len(record.sequence))
    
    xlabel("residue number")
    ylabel("Kyte&Doolittle hydrophobicity")
    
    record_id = record.title.split(" ", 1)[0]
    title("Three hydrophobicity smoothings of " + record_id)
    
    show()


if __name__ == "__main__":
    main()

and the result you've been waiting for ...

Of the three it looks like Savitzky-Golay is indeed the best weighting function. That's probably why programs like GlobPlot use it. Looking over the paper I see the program was written in Python. I wonder why they decided to use an external program for the data smoothing rather than use Numeric like I did. Hmm, and this quote is relevant

We opted to use a running sum function for three reasons. Firstly, it results in plots that are easy to interpret, whether by human or algorithm. Secondly, it is a simple approach to a very complex problem. Thirdly, there is no dependency on frame length as is the case for sliding window methods such as SEG or Prot-Scale.