Quick synopsis: A type of grass grows in Yellowstone National Park in hot (65 °C), unfriendly soil. How the plant manages this feat is a mystery. What we do know is that the grass can only tolerate high temperatures if it's been infected by a fungus, and the fungus has to be infected by an RNA virus. In the paper describing this discovery, the researchers provided the GenBank accession numbers for the viral sequences. I decided to see if I could find out more about the proteins and what they do. Read part I, part II, and part III.
And now, on with our story.
Down the rabbit hole we go.
We begin with a BLAST
I started the quest by using the accession numbers from the paper to get the GenBank records and the sequences. The authors of the paper had already found that one piece of viral RNA (RNA 1) codes for a protein that's likely to be a replicase (1). I confirmed this finding using blastp and found that the predicted proteins do contain a conserved replicase domain.
(Q: Why do I call this a "predicted" protein?
A: Because this is the conventional way of referring to a protein whose existence has yet to be supported by physical data.)
A replicase, by the way, is a perfectly reasonable thing for a viral RNA to encode. All viruses have to have a way to get their genomes copied if they're going to be able to go off and infect new cells. They would use a replicase to make new copies of the RNA genome.
That leaves the question of the other RNA (this virus has two pieces of RNA that were sequenced). I looked up the predicted protein sequences for the second RNA and used blastp to compare the predicted proteins to the sequences in the non-redundant database. I couldn't find anything for the smaller (168 aa) protein, but the larger protein matched lots of hypothetical proteins from fungi. Some of those results are described here.
|Accession||RNA||Protein GI number||Result|
|EF120984||RNA 1||ABM92658||possible replicase, contains a conserved domain for an RNA-dependent RNA polymerase|
| ||RNA 1||ABM92659.1||possible replicase? some matches to the catalytic region of an integrase|
| ||RNA 2||331 amino acids||matches lots of hypothetical fungal proteins with unknown functions|
| ||RNA 2||168 amino acids||no match to anything|
Lost in translation?
The next thing that I tried was blastx. BLASTX takes a DNA (or RNA) sequence, figures out all of the amino acid sequences that could be produced from all six ways of reading the sequence (we call this "translation") and then compares all the possible sequences to a database of protein sequences.
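To make "all six ways of reading the sequence" concrete, here is a minimal Python sketch of a six-frame translation. It is not how BLASTX is implemented; it only illustrates the first step (the translation), and it assumes the standard genetic code. The function names (`reverse_complement`, `translate`, `six_frame_translations`) and the nine-base test sequence are made up for illustration.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), written in the usual
# TCAG codon order: TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate complete codons only; unknown codons become 'X'."""
    codons = (seq[i:i + 3] for i in range(0, len(seq) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translations(seq):
    """Translate three frames on each strand (RNA input is accepted)."""
    seq = seq.upper().replace("U", "T")
    frames = {}
    for strand, template in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(template[offset:])
    return frames


# Hypothetical toy sequence, not taken from the virus.
for name, protein in six_frame_translations("ATGGCCTAA").items():
    print(name, protein)
```

Each of the six strings would then be compared to the protein database, which is why a blastx search can still find a match when the true reading frame is unknown.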
I tried blastx for two reasons.
First, many annotation errors are made because of DNA sequencing mistakes. If one or two bases are missing, the translation can be messed up. It would be like this sentence: "The fat cat sat on pat." If this sentence used the same reading frame and had a letter missing, it would read: "Thf atc ats ato npa." Imagine, now, if I went to the library and tried to find a book with the phrase "Thf atc ats." If I had the right sentence, I would probably find Dr. Seuss. If I used the messed-up sentence, I'd be out of luck.
The public databases have lots of these kinds of mistakes in translation. In a perfect world, I would be able to get the trace data from the DNA sequences, look at it myself with FinchTV, evaluate the quality, and possibly reassemble the sequences. In the real world, much of this data is not publicly available. NCBI, for example, only stores trace data for a small number of viruses, most of them influenza. But enough whining, let's move on.
A challenge with blastx is that different organisms use different versions of the genetic code, and it's not always possible to know which version is used by the organism that you're studying. NCBI offers a choice of 13 genetic codes, but I didn't have any luck trying to find which code would be used by my RNA virus or even the fungal host. After chewing on this for a while, I picked "yeast nuclear," reasoning that the virus infects a fungus and yeast is a fungus.
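The sentence analogy is easy to reproduce. This small Python sketch (the helper name `read_in_triplets` is made up for illustration) strips the spaces, drops one letter, and re-reads the sentence three letters at a time, the way a ribosome reads codons.

```python
def read_in_triplets(sentence):
    """Drop spaces and punctuation, then regroup the letters in threes."""
    letters = "".join(ch for ch in sentence if ch.isalpha())
    return " ".join(letters[i:i + 3] for i in range(0, len(letters), 3))


original = "The fat cat sat on pat."
# Simulate a sequencing error: the first "e" is missing.
with_deletion = original.replace("e", "", 1)

print(read_in_triplets(original))       # The fat cat sat onp at
print(read_in_triplets(with_deletion))  # Thf atc ats ato npa t
```

Every triplet after the deletion is shifted, which is what happens to every codon downstream of a missing base.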
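To show why the choice of genetic code matters, here is a hedged Python sketch that compares two of NCBI's translation tables. It assumes that the "yeast nuclear" option corresponds to NCBI translation table 12 (the alternative yeast nuclear code); that mapping is my assumption, not something stated above. The function names and the nine-base test sequence are hypothetical.

```python
from itertools import product

BASES = "TCAG"  # NCBI lists codons in TCAG order: TTT, TTC, TTA, TTG, ...

# Amino acid strings as published in the NCBI genetic code tables.
STANDARD = (  # table 1
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
ALT_YEAST_NUCLEAR = (  # table 12 (assumed to be the "yeast nuclear" option)
    "FFLLSSSSYY**CC*W"
    "LLLSPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)


def build_table(amino_acids):
    """Map each of the 64 codons to its amino acid."""
    return {
        "".join(codon): aa
        for codon, aa in zip(product(BASES, repeat=3), amino_acids)
    }


def translate(seq, table):
    """Translate frame 1 of a DNA or RNA sequence with the given table."""
    seq = seq.upper().replace("U", "T")
    codons = (seq[i:i + 3] for i in range(0, len(seq) - 2, 3))
    return "".join(table.get(codon, "X") for codon in codons)


def differing_codons(table_a, table_b):
    """Return {codon: (aa in table_a, aa in table_b)} where they disagree."""
    return {
        codon: (table_a[codon], table_b[codon])
        for codon in table_a
        if table_a[codon] != table_b[codon]
    }


standard = build_table(STANDARD)
yeast = build_table(ALT_YEAST_NUCLEAR)

print(differing_codons(standard, yeast))
print(translate("ATGCTGTAA", standard))
print(translate("ATGCTGTAA", yeast))
```

These two tables differ at a single codon (CTG, leucine versus serine), so picking one over the other changes a translated sequence only where that codon appears. Other tables differ more, for example in which codons are read as stops.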
Here are the results:
The top two matches (red bars) are to the predicted sequences that are deposited in GenBank. They serve as a positive control, since they should match themselves.
Scanning down the page, from top to bottom, I see that the next best matching sequences (naturally) are from hypothetical or putative proteins. They had good E values, too, and it is reassuring that they come from fungi (or possibly fungal viruses; I don't have enough data to know which).
Looking farther down, a couple of long sequences match both proteins. Both are from rice, and one is a transposon sequence. They look like a good match and seem to fit my idea about a possible frame shift. But nothing is known about these proteins, so I decided on another path.
Taking a random walk?
I stumbled on the next path by accident. I was planning to look at some of the "hypothetical" and "putative" fungal sequences and see if they matched anything interesting, when I found something new.
I had called up the GenBank record for the 331 amino acid protein from RNA 2 and clicked "BLink." BLink is short for "BLAST link." BLink takes me to a database of pre-computed blastp results for my Curvularia protein.
I like to use BLink since it has lots of filters for viewing which sequences belong to which kingdom, which part of the protein aligns, which sequences have structures, and so on. I decided that I would get a list of these sequences and use those as queries for more searching. So, I clicked the GI list button to get a set of sequences and instead got a surprise!
I never saw that Related Structures tab before!
What could it mean?
Join us next Friday, when we go through the looking glass and see what we can find there.
1. Márquez, L., et al. 2007. A Virus in a Fungus in a Plant: Three-Way Symbiosis Required for Thermal Tolerance. Science 315: 513-515.