In which we identify unknown human proteins.
Yesterday, I wrote about using the BLOSUM 62 matrix to calculate a score for matches between two proteins. Those scores give us a good start on understanding how blastp determines whether two sequences are matching by chance or because they're more likely to be related. But that's not all there is to calculating a blast score, and there's at least one other statistic to consider as well, the E value.
It all comes down to biochemistry
The BLOSUM 62 matrix is based on the substitutions that really do or do not happen in real protein sequences. I want to point out that the ability of a protein to tolerate those replacements is related to the chemical properties of the amino acids in the pair. Naturally, we have a word to describe this. If an amino acid is replaced by one with a similar chemistry, then we say it's a "conservative change" and we use a "+" to show this in some representations of alignments. Replacing valine with leucine, for example, would be a conservative change since both amino acids have small, hydrophobic side chains. If the chemistry is different, it's a non-conservative change.
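Here is a minimal Python sketch of how the "+" marks in an alignment could be produced. It assumes the usual blastp display convention (identical residues are repeated, substitutions with a positive score get a "+", everything else gets a space). The handful of scores below were copied by hand from BLOSUM 62 for illustration only, and the names BLOSUM62_SUBSET, pair_score, and midline are made up for this example.

```python
# A tiny, hand-copied subset of BLOSUM 62 scores (illustration only;
# check the values against the full matrix before relying on them).
BLOSUM62_SUBSET = {
    ("V", "L"): 1,   # valine / leucine: both hydrophobic
    ("V", "I"): 3,   # valine / isoleucine
    ("D", "E"): 2,   # aspartate / glutamate: both acidic
    ("K", "R"): 2,   # lysine / arginine: both basic
    ("V", "D"): -3,  # hydrophobic vs. acidic
    ("L", "D"): -4,  # hydrophobic vs. acidic
}


def pair_score(a, b):
    """Look up a pair in either order; return None if it is not in the subset."""
    if (a, b) in BLOSUM62_SUBSET:
        return BLOSUM62_SUBSET[(a, b)]
    return BLOSUM62_SUBSET.get((b, a))


def midline(query, subject):
    """Build the line shown between two aligned sequences of equal length."""
    marks = []
    for q, s in zip(query, subject):
        if q == s:
            marks.append(q)  # identical residue
        else:
            score = pair_score(q, s)
            # "+" marks a conservative change (positive score)
            marks.append("+" if score is not None and score > 0 else " ")
    return "".join(marks)


print(midline("VDK", "LDR"))  # V->L and K->R are conservative, D is identical
```

A real tool would load the complete matrix instead of a six-entry dictionary, but the rule for choosing "+" is the same idea.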
Now, onto the E value!
The E value is used as a way to normalize our results and to determine the number of sequences that would match as well as ours, if we were searching a database of random sequences. Of course our real databases are definitely not random, but people actually made databases of random sequences when they worked on the blast algorithms.
The key things that the E value corrects for are the length of our query sequence and the number and lengths of sequences in the database. The query sequence is the one that we're using for the search. The sequences we find are called the subject sequences. If the query sequence is short, it's likely to match more sequences in the database. Consequently, short sequences will have higher E values.
If the query sequence is long, then it might have a longer set of matching amino acids. This will give it a lower E value, since a longer match is less probable.
In terms of the database size, as a database gets larger, there is a greater chance that it will contain matching sequences. So, the E value goes up when you search larger sets of sequences and down when you look at smaller sequence sets.
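The trends above can be sketched with the commonly quoted approximation E = m x n x 2^(-S'), where m is the query length, n is the total length of the database, and S' is the bit score. This formula isn't spelled out in this post, and the sketch ignores the length corrections that the real program applies, so treat it as a rough illustration. The function name and the example numbers are made up.

```python
def expect_value(bit_score, query_len, db_len):
    """Approximate E value: search space (m * n) scaled by 2^(-bit score).

    Simplified sketch; real blastp adjusts the lengths before using them.
    """
    return query_len * db_len * 2.0 ** (-bit_score)


# Same hit, same query, two database sizes.
base = expect_value(bit_score=40, query_len=100, db_len=1_000_000)
bigger_db = expect_value(bit_score=40, query_len=100, db_len=2_000_000)

# Same database, but a longer match that earned 10 more bits.
better_hit = expect_value(bit_score=50, query_len=100, db_len=1_000_000)

print(f"baseline:            {base:.2e}")
print(f"database twice as big: {bigger_db:.2e}")
print(f"10 more bits of score: {better_hit:.2e}")
```

Doubling the database doubles the E value, while every extra bit of score cuts it in half. That is why a long query that produces a long, high-scoring match ends up with a much lower E value than a short query can reach.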
One confusing point is that very low E values are represented as exponential numbers. A number presented as 9e-166 is really: 9 x 10^-166
Okay, that part isn't too confusing. The confusing part is when the E values get to be very, very small. Eventually, they reach a point where there are so many digits in the exponent that it's no longer possible to print the E value on the web page. At this point, the E value gets rounded off to zero. Be aware, the E value is never really zero, it's just very, very close. If you have an E value close to zero, then your proteins are quite similar, maybe even the same protein from different species.
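To see how a tiny number can turn into a printed zero, here is a small Python sketch. It only shows how ordinary floating-point numbers behave; I'm not claiming this is the exact cutoff or mechanism that blastp uses. The formula E = m x n x 2^(-S') is the usual simplified approximation (m = query length, n = database length, S' = bit score), and the function name and numbers are made up for the example.

```python
import math

# "9e-166" is just scientific notation for 9 x 10^-166.
small = float("9e-166")
print(small > 0)  # still a positive number

# Past the limit of double precision, the value underflows to exactly 0.0.
too_small = float("1e-400")
print(too_small == 0.0)


def log10_expect(bit_score, query_len, db_len):
    """log10 of the approximate E value, computed without underflow."""
    return math.log10(query_len * db_len) - bit_score * math.log10(2)


# A very high-scoring hit: computed directly, the E value collapses to 0.0 ...
direct = 300 * 1_000_000_000 * 2.0 ** (-2000)
print(direct)

# ... but its logarithm shows the real value is tiny, not zero.
print(log10_expect(bit_score=2000, query_len=300, db_len=1_000_000_000))
```

The logarithm comes out to a large negative number, which says the E value is something like 10 raised to that power: far too small to store or print the usual way, but still not zero.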
What does this all mean? If an E value is low, say below 0.01, then the match is significant. If the E value is higher, the match might still be significant, especially if you have a short sequence. If that's the case, you have to evaluate the results in the context of the experiment.
Using blastp to identify unknown proteins
If you're in my ACC class, your assignment will be to take a closer look at some of these oncogenes, using an industrial-strength form of protein blast, called "blastp." If you're not in my class, you can follow along for fun.
1. Go to the NCBI home page.
2. Search with the query: human AND unknown
3. Click the link to the protein database.
4. How many unknown human proteins are listed?
5. Pick one of the sequences in the list, record the accession number (so you can find it later) and click the word "BLink" that appears on the right side of the page.
BLink is a link to all the results of blastp searches that were already done. Whenever a new protein sequence (or predicted protein sequence) enters GenBank, blastp is automatically run.
6. When you select BLink, you will get a list of all the results from blastp searches.
In the example below, I have multiple proteins with the identical blastp score. They might all be the same protein, or different forms of the same protein, or they might be the same protein in different species.
7. Pick one of the highest scoring sequences to work with. Go to the class Blackboard site and review the accession numbers from your classmates. If someone else has chosen your number, go back and find another protein to work with.
8. Click the link to the blastp score to see the alignment.
9. Look at the alignment and answer the following questions:
a. Are the two sequences the same length?
b. Do they align over the entire sequence or just in part of the sequence?
c. Do you think this blastp result is significant? Use the blastp score and the E value to justify this statement.
10. Click the link to the subject sequence that you found in the database. Read the description and the comments in the sequence record. You might look at the other matching sequences to get more details. Write a 1-2 paragraph description of the information you can find about this protein and its function.
When the E value is higher, what does it mean to evaluate the results in the context of the experiment? What does it mean in statistical terms?