Assignment #1

ASSIGNMENT INSTRUCTIONS

I don't like grades but I'm supposed to grade you. I'll base them in part on these assignments. You'll submit each assignment to me via email. The email will have an attachment which is the tar file (or zip file) of a directory containing the assignment. The first assignment will be in the directory assignment1 and sent as assignment1.tar (or .tar.gz, .tar.bz2 or .zip), the second in assignment2 and sent as assignment2.tar, and so on. The directory will have a file called README which contains your name, the assignment number and information about your answers.

You will email nbn@dalkescientific.com with the tar file as an attachment.

Each assignment will have multiple parts, so your README should look something like this (it doesn't have to be exactly like this, but I will be checking that your answers are understandable):

Andrew Dalke
Assignment #1

Answer 1

Documentation for the GenBank format is at http://...
Documentation for Prosite is at http://...

I found descriptions of the FASTA format at
  http://...
  http://...
  http://...

The XYZZY home page is at http://...  They don't have any format
specification because nothing happens here.

Answer 2

 There are 12345 records in the file.
 8 records have sequence length of 249.
 The shortest sequence is titled 'Human Chromosome 21'
 No other records have a sequence with that length.
 The longest sequence is titled 'Human Chromosome 1'
 The record 'Elephant Chromosome 3' has the same sequence
...

To submit your assignment, first make sure that all of the files are in a directory with the correct name for the assignment. Double check that all the files are there and that the directory has the README file.

To make a tar file from a directory named assignment1, go into the parent directory and from the command line do

  tar cf assignment1.tar assignment1

This creates the tar file assignment1.tar. To see the contents of a tar file, use the "t" option

  tar tf assignment1.tar

To extract the files from a tar file do

  tar xf assignment1.tar

I would like it if you compressed the tar file before sending it. Using "bzip2" you can do

  bzip2 assignment1.tar

This will replace the tar file with a compressed copy. The name of the new file is the old file name + ".bz2". In this case you will email me the file assignment1.tar.bz2.
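If you would rather build the archive from Python, here is a minimal sketch that uses the standard tarfile module to do the "tar cf" and "bzip2" steps in one go. The function names make_bz2_tarball and list_tarball are made up for this example; they are not part of any course file.

```python
import tarfile

def make_bz2_tarball(directory, tar_name):
    """Write `directory` into a bzip2-compressed tar file named `tar_name`."""
    # Mode "w:bz2" creates the archive and compresses it in a single step.
    with tarfile.open(tar_name, "w:bz2") as tar:
        tar.add(directory)

def list_tarball(tar_name):
    """Return the member names, much like "tar tf" prints them."""
    # Mode "r:*" lets tarfile detect the compression on its own.
    with tarfile.open(tar_name, "r:*") as tar:
        return tar.getnames()
```

Run from the parent directory, make_bz2_tarball("assignment1", "assignment1.tar.bz2") should produce the file to email, and list_tarball("assignment1.tar.bz2") lets you double check that the README is in there.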

ASSIGNMENT #1

Question 1

List the authoritative URLs for the format specifications of:

GenBank
Prosite
EMBL
Blast XML
KEGG

List URLs for at least three different descriptions of the FASTA format. Are any of them authoritative? Why or why not?

Pick another public database you use. Tell me the name of the database, the URL of its home page and the URL of its format definition.

Question 2

Copy fasta_reader.py into your assignment directory. This is the parse7.py file from earlier. You will use it as a library module for this assignment.

Copy the file br_sequences.fasta into your assignment directory. You will write a program to answer some questions about the sequences in the br_sequences.fasta file.

How many records are in the file?
How many records have a sequence of length 249?
What is the title for the record with the shortest sequence? Is there more than one record with that length?
What is the title for the record with the longest sequence? Is there more than one record with that length?
How long is the sequence for the GenBank identifier gi|114812?
Which records have 3D structures submitted to the PDB? (Give me the 4 character PDB identifier.)
Some records contain identical sequences. How many unique sequences are present?
Give the titles for two different records which have identical sequences (any two will do)

The program will be called br_info.py. When I run it from the command line (using python br_info.py) it will use the file named br_sequences.fasta in the local directory to answer the above questions. The program output should look something like this:

 There are 12345 records in the file.
 8 records have sequence length of 249.
 The shortest sequence is titled 'Human Chromosome 21'
 No other records have a sequence with that length.
 The longest sequence is titled 'Human Chromosome 1'
 The record 'Elephant Chromosome 3' has the same sequence
  ... and so on ...

Put a copy of the output into the README file for this question. I will be looking over your program, so don't hard-code the answers. I may decide to use a different FASTA file when running the tests.
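Here is a rough sketch of the kind of helper functions that can answer questions like these once the records are in a list. It is not the solution. The Record type is a stand-in, and the attribute names "title" and "sequence" are assumptions; check fasta_reader.py for the names it really uses.

```python
from collections import namedtuple

# Stand-in for the record type from fasta_reader.py. The attribute
# names "title" and "sequence" are assumptions for this sketch.
Record = namedtuple("Record", ["title", "sequence"])

def count_with_length(records, length):
    """Count the records whose sequence has exactly the given length."""
    return sum(1 for rec in records if len(rec.sequence) == length)

def shortest_records(records):
    """Return every record tied for the shortest sequence."""
    shortest = min(len(rec.sequence) for rec in records)
    return [rec for rec in records if len(rec.sequence) == shortest]

def count_unique_sequences(records):
    """Count the distinct sequences, ignoring titles."""
    return len(set(rec.sequence for rec in records))

def titles_by_sequence(records):
    """Map each sequence to the list of titles that share it."""
    groups = {}
    for rec in records:
        groups.setdefault(rec.sequence, []).append(rec.title)
    return groups
```

Because the functions only look at the records they are given, nothing is hard-coded and they work the same way if the FASTA file changes. Returning a list from shortest_records answers the "is there more than one?" question for free.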

Question 3 (Advanced)

Copy the file swissprot.dat into your assignment directory. This contains roughly 100 records from the SWISS-PROT 40 release. You will write a library file to read SWISS-PROT records and answer questions that depend on the record id, description and sequence.

Take a look at the file contents.

What is a good way to identify ...:

... the start of a record?
... the end of a record?
... the SWISS-PROT primary identifier?
... the record description?
... the lines containing sequence data?

Might the description field be split over several lines?
How did you find the answer to that question?

Show appropriate code for a SwissRecord class which stores the id, description and sequence fields.

Create a file named swiss_reader.py in your assignment directory. The file must include the SwissRecord class definition and two functions named read_swiss_record and read_swiss_records. These functions will act similarly to the corresponding functions from the fasta_reader.py file. That is, read_swiss_record will return the next SwissRecord from a file object, or return None if it's at the end of the file. read_swiss_records will return all of the records in the file as a list.

The read_swiss_record function must raise a TypeError exception if the first line of the record does not start with "ID".

Include test code for your swiss_reader.py library, either as a self-test (use the __name__ == "__main__" technique) or in a new file named test_swiss_reader.py. Tell me how to run the test program. Include a test case for the TypeError requirement in the previous paragraph.
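As a sketch of what such test code might look like, the checks below take the reader function as a parameter and feed it small in-memory files made with io.StringIO. The names check_bad_first_line, check_end_of_file, run_self_test and stub_reader are made up for this example. The stub_reader function parses nothing; it is only there so the checks have something to run against.

```python
import io

def check_bad_first_line(read_record):
    """Return True if read_record raises TypeError on a bad first line."""
    infile = io.StringIO("XX   this is not an ID line\n")
    try:
        read_record(infile)
    except TypeError:
        return True
    return False

def check_end_of_file(read_record):
    """Return True if read_record returns None for an empty file."""
    return read_record(io.StringIO("")) is None

def stub_reader(infile):
    # Hypothetical stand-in for read_swiss_record. It only looks at the
    # first line and does not build a SwissRecord.
    line = infile.readline()
    if not line:
        return None
    if not line.startswith("ID"):
        raise TypeError("expected a line starting with 'ID', got %r" % line)
    return line

def run_self_test(read_record):
    """Run every check against the given reader function."""
    assert check_bad_first_line(read_record), "bad first line was accepted"
    assert check_end_of_file(read_record), "empty file did not give None"
    return "All tests passed."
```

For the self-test variant, the last lines of swiss_reader.py would check that __name__ == "__main__" and then call run_self_test(read_swiss_record), so that running python swiss_reader.py runs the tests while importing the module does not.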

Using your SWISS-PROT parser, write a program named swiss_info.py which answers the following questions about the swissprot.dat in the local assignment directory:

How many records are in the file?
How many records have a sequence of length 260?
What are the first 20 residues of 143X_MAIZE?
What is the identifier for the record with the shortest sequence? Is there more than one record with that length?
What is the identifier for the record with the longest sequence? Is there more than one record with that length?
How many records contain the subsequence "ARRA"?
How many records contain the substring "KCIP-1" in the description?

As before, it must print the answers to the terminal, and I may decide to use a different data set for my testing.
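Several of these questions come down to "how many records pass some test?" or "find one record by its identifier". A minimal sketch of that pattern follows. DemoRecord is a stand-in for your own SwissRecord class, and the attribute names "id", "description" and "sequence" are assumptions based on the three fields named above.

```python
from collections import namedtuple

# Stand-in for a SwissRecord; the attribute names are assumptions.
DemoRecord = namedtuple("DemoRecord", ["id", "description", "sequence"])

def count_matching(records, predicate):
    """Count the records for which predicate(record) is true."""
    return sum(1 for rec in records if predicate(rec))

def first_residues(records, record_id, n=20):
    """Return the first n residues of the named record, or None if absent."""
    for rec in records:
        if rec.id == record_id:
            return rec.sequence[:n]
    return None
```

With these, counting the records that contain a subsequence would be count_matching(records, lambda rec: "ARRA" in rec.sequence), and the description question looks the same with rec.description in place of rec.sequence.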