Project 5 – Multiple Sequence Alignments – Sequence Conservation in 1-D and 3-D Spaces
Instructions
This project introduces multiple sequence alignment (MSA). We will use a program called ClustalW to perform the alignment. So far, we have seen alignments in our BLAST output, where small fragments of the input peptide were mapped against the intact protein sequences. In this exercise, you will compare the whole sequences of several proteins against each other.
For this project, you will first need to collect the sequences of your protein and four homologs (evolutionarily related versions of your protein from different organisms). You will also need the sequence of the protein from the PDB that you have been using to analyze your structure. To do this:
1. First collect the sequence of the protein from the X-ray or NMR structure by going to the PDB.
2. Call up the structure you have been using by entering the PDB record number.
3. Press the “sequence details” button on the left menu bar.
4. Press “Download all chains in FASTA format”.
5. Copy the sequence to a Microsoft Word file (including the line that starts with “>”). You can replace the information after the “>” with the PDB number if you wish.
6. Now go to the ExPASy web page and then to the Nice-Zyme page for your unknown enzyme.
7. At the bottom of that page is a list of organisms that have homologs of your protein. You will be able to identify some of the species (e.g. DROME = Drosophila melanogaster, the fruit fly). Others may be more obscure.
8. Click on the link for one of the sequences you want to collect. You will arrive at the Nice-Prot page for the enzyme. At the bottom of that page are the sequence and the link to the FASTA-formatted sequence.
9. Click on the FASTA link.
10. Copy the sequence (including the line that starts with “>”) into the same Microsoft Word file for later use. You may replace the information after the “>” with a short descriptive name if you wish (e.g. the name of the organism). Leave one or two RETURNs between sequences so that you can easily tell where one ends and the next begins.
11. Repeat this procedure until you have collected several sequences, one after the next, all in the same file. At this point, your file should look like this:
>HEADER INFORMATION <return>
SEQUENCE_SEQUENCE_SEQUENCE_ SEQUENCE_SEQUENCE_SEQUENCE_ SEQUENCE_SEQUENCE_SEQUENCE_ SEQUENCE_SEQUENCE_SEQUENCE_ SEQUENCE_SEQUENCE_SEQUENCE_ SEQUENCE_SEQUENCE_SEQUENCE_ SEQUENCE_SEQUENCE_SEQUENCE_SEQUENCE <return>
>HEADER INFORMATION <return>
SEQUENCE_SEQUENCE_SEQUENCE_ SEQUENCE_SEQUENCE_SEQUENCE_ SEQUENCE_SEQUENCE_SEQUENCE_ SEQUENCE_SEQUENCE_SEQUENCE_ SEQUENCE_SEQUENCE_SEQUENCE_ SEQUENCE_SEQUENCE_SEQUENCE_ SEQUENCE_SEQUENCE_SEQUENCE_<return>
…
12. Save this file so that you can come back to it and repeat the alignment easily if necessary.
13. Now go to the ClustalW web site at PBIL.
14. Paste the sequences from the MS Word file into the sequence window. Make sure that you keep the “>” character; it is an important part of the file format.
15. Press the “Submit” button. All of the default parameters in the program can be left as they are.
16. Look at the alignment statistics. If the sequences are between 5% and 20% identical, proceed to the next part of the assignment. If the identity is below 5%, remove one of the sequences (not the one from the PDB file) and repeat the alignment. If it is significantly more than 20%, add another sequence to your alignment, or replace a closely related species with one that is farther away in terms of evolutionary divergence (e.g. humans are close to rats, both being mammals, but are distant from E. coli). If you have added all of the ExPASy sequences and still have an alignment with greater than 20% identity, continue with the assignment anyway: you have a very highly conserved protein.
17. Clustal allows you to superimpose secondary structure predictions onto the MSA. To do this, click on the box(es) next to the secondary structure prediction algorithm you wish to try (such as PHD or DSC) and then press the “show” button. A new line will appear on the alignment with symbols such as “e”, “h” and “c”, which stand for sheet (extended strand), helix and coil, respectively. You may superimpose the prediction from a second algorithm by clicking its box and pressing "show" again. If an algorithm has a notation that says "do it alone", you must add it to your alignment independently of the other structure prediction algorithms. Compare the predicted regions of helix, sheet and coil to the actual secondary structures observed crystallographically (see project 3).
18. Compare a few of the prediction algorithms and explore the differences and similarities.
19. Set the output to show consensus secondary structure only and press the "show" button.
20. Print out the alignment results and hand them in with the problem set.
21. The next phase of this assignment will superimpose the sequence alignment onto the 3-D structure. To do this, we will use a program embedded in the Protein Explorer package.
22. Enter Protein Explorer as you did previously and call up your structure.
23. Stop the rotation and remove the water molecules and then click on “Molecule Information”.
24. In the new window that opens up, click on “Conserved Regions”. This link will take you to the ConSurf web site (http://bioinfo.tau.ac.il/ConSurf/).
25. Enter the PDB code that you have been using and the appropriate designation for the chain you wish to align. In a few cases, no one-letter code is assigned to the chain. If this is the case, type “none” in that field. Press submit when you are ready.
26. Click on the “Click here” button after a few seconds to look at the results. The process may take a while, so don’t wait until the last minute to run this program.
27. When the alignments are done, the results will appear in this window.
28. View the MSA output by clicking on the button for output in Clustal format. Print out this output for use in answering questions on the problem set. Close the window to return to the ConSurf results page.
29. Click on “View ConSurf Results with Protein Explorer”.
30. When PE opens, you will see a somewhat different screen than you are used to, with commands associated with the ConSurf program. The molecule will be in space-filling mode, color-coded by conservation, with the most conserved regions in maroon and the least conserved regions in blue.
31. Click on the “Bkg” button to make the background white for better black and white printing.
32. Print out a view of the protein that lets you see the active site.
33. Click on “spacefill none” and then click on the maroon box (most conserved). Print out a view of this overlay.
34. Click on “spacefill none” and then click on the blue box (least conserved). Print out a view of this overlay.
NOTE: You can reach all of the normal PE commands by clicking the explore link, which takes you to the normal quick views menu. A link there returns you to the ConSurf menu, so you should be able to go back and forth between the two sets of commands without a problem.
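If you would like to double-check your sequence file (step 11) and the identity rule (step 16) offline, the following minimal Python sketch may help. It is not part of the required workflow. The function names `read_fasta`, `percent_identity` and `advice` are hypothetical, and the sketch assumes that "identity" means the share of alignment columns in which every sequence carries the same residue; the PBIL server may compute its statistic differently. The sequences passed to `percent_identity` must already be aligned (equal length, with "-" for gaps), for example copied from your ClustalW output.

```python
def read_fasta(text):
    """Parse FASTA-formatted text into a dict of {header: sequence}."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # blank lines between records are allowed
        if line.startswith(">"):
            header = line[1:].strip()
            if header in records:
                raise ValueError(f"duplicate header: {header}")
            records[header] = ""
        elif header is None:
            raise ValueError("sequence data found before any '>' header line")
        else:
            # Join wrapped sequence lines and drop stray spaces.
            records[header] += line.replace(" ", "")
    return records


def percent_identity(aligned):
    """Percent of alignment columns in which all sequences share one residue.

    `aligned` is a list of equal-length strings with '-' marking gaps.
    Assumption: columns containing a gap never count as identical.
    """
    if not aligned:
        raise ValueError("no sequences given")
    length = len(aligned[0])
    if length == 0 or any(len(s) != length for s in aligned):
        raise ValueError("aligned sequences must be non-empty and equal in length")
    identical = 0
    for column in zip(*aligned):
        if "-" not in column and len(set(column)) == 1:
            identical += 1
    return 100.0 * identical / length


def advice(identity):
    """Apply the 5-20% rule of thumb from step 16."""
    if identity < 5:
        return "remove a sequence (not the PDB one) and realign"
    if identity > 20:
        return "add or swap in a more distant homolog and realign"
    return "proceed to the next part"


if __name__ == "__main__":
    # Toy, already-aligned sequences; not real data.
    toy = ">seqA\nMKTAY\n\n>seqB\nMKSAY\n"
    records = read_fasta(toy)
    identity = percent_identity(list(records.values()))
    print(identity, "->", advice(identity))
```

With real homologs the identity will usually be far lower than in this toy pair, which differs at only one of five positions.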
Thought questions to help you reflect on the exercise and analyze the output files:
a) What do the “*” and ":" indicate on the MSA?
b) How does the alignment generated by ConSurf differ from the one you generated manually using ClustalW? Don’t focus on the format of the output (which is different only because you selected a certain output style in ClustalW). Instead, look at the actual content: the parts that are similar versus different, and the sequences that have been used.
c) When you look at the primary sequence and the amino acid conservation, do you see any patterns or organization?
d) The secondary structure prediction probably came back with a higher percentage of random coil than actually exists in the protein. Is this prediction accurate and/or realistic? Why or why not? Please explain.
e) What is a conserved sequence motif? With the help of ExPASy, try to identify a specific conserved sequence motif in your alignment.
f) Is there a relationship between the sequence conservation and the overall 3-D structure? To answer this question, compare the printouts you made of the most conserved and the least conserved parts of the sequence.
g) The exercise above looked at the multiple sequence alignment of a protein. How would you use MSA on a nucleic acid (such as a tRNA) and what information might you obtain from that exercise? Will you always look simply for conservation or are there other factors that might come into play? To answer this question, think about the structure of tRNA. If you don’t know what tRNA looks like, work through this tRNA tutorial before answering the question.
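When working through questions such as (c), it can help to put a number on each alignment column before looking for patterns. The following is a minimal Python sketch under stated assumptions: the function name `column_conservation` is hypothetical, the score is simply the fraction of sequences that carry the most common residue in a column, and gaps are treated as mismatches. This is a much cruder measure than the one ConSurf reports, so use it only as a rough guide alongside your printouts.

```python
from collections import Counter


def column_conservation(aligned):
    """Return, per column, the fraction of sequences with the most common residue.

    `aligned` is a list of equal-length strings with '-' marking gaps.
    Assumption: gaps count toward the column size but never as a residue.
    """
    if not aligned or any(len(s) != len(aligned[0]) for s in aligned):
        raise ValueError("need at least one sequence, all of equal length")
    scores = []
    for column in zip(*aligned):
        residues = [c for c in column if c != "-"]
        if not residues:
            scores.append(0.0)  # all-gap column
            continue
        top_count = Counter(residues).most_common(1)[0][1]
        scores.append(top_count / len(column))
    return scores


if __name__ == "__main__":
    # Toy, already-aligned sequences; not real data.
    toy = ["MKTAY", "MKSAY", "MRSA-"]
    for position, score in enumerate(column_conservation(toy), start=1):
        print(position, round(score, 2))
```

Columns scoring 1.0 are identical in every sequence; lower scores mark more variable positions.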