More flu follies: comparing sequences and making trees, activity 4

What tells us that this new form of H1N1 is swine flu and not regular old human flu or avian flu?

If we had a lab, we might use antibodies, but when you're a digital biologist, you use a computer.

Activity 4. Picking influenza sequences and comparing them with phylogenetic trees

We can get the genome sequences, piece by piece, as I described earlier, but the NCBI has other useful tools, too.

The Influenza Virus Resource will let us pick sequences, align them, and make trees so we can quickly compare the sequences to each other.

This is how I got the sequences that I wrote about yesterday. I think the more people we have looking at sequences, the better off we are.

I'll show you how this works by getting the hemagglutinin (HA) protein sequences from the recent cases of H1N1 swine flu and comparing them to the HA protein sequences from other cases of H1N1 flu that happened last year.

1. Go to the NCBI Influenza Virus Resource (this will open a new window).

2. Start out by getting the sequences from the recent swine flu cases in California and Texas.

To do this, we will pick Influenza A as the virus species, human as the host, North America as the region, and HA as the segment. Protein sequences are selected by default and those are just fine.

Then, we set the date range from 2009, 03, 01, to 2009, 04, 29.

Last, we click the Add to Query Builder button to get the sequences.

I forgot to put this in the image, but I also used a filter to select for H1: I typed "H1" in the really long text box. Also, note that I was looking at the protein sequences. (We should look at nucleotides, too, but that's a later experiment.)
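If it helps to see the query as logic rather than as clicks, here's a minimal sketch in Python of what the step 2 settings amount to. Everything in it is my own illustration: the FluRecord fields, the matches function, and the four placeholder records are made up, and none of it is the NCBI's data model or API. I'm also assuming that typing "H1" in the text box behaves like a prefix match on the subtype.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class FluRecord:
    # Hypothetical record layout; the real database has more fields.
    name: str
    virus: str
    host: str
    region: str
    segment: str
    subtype: str
    collected: date


def matches(rec, *, virus, host, region, segment, subtype_prefix, start, end):
    """Return True if a record passes every filter chosen in step 2."""
    return (
        rec.virus == virus
        and rec.host == host
        and rec.region == region
        and rec.segment == segment
        and rec.subtype.startswith(subtype_prefix)  # assumed meaning of the "H1" filter
        and start <= rec.collected <= end
    )


# Made-up placeholder records, only here to exercise the filter.
records = [
    FluRecord("example-human-2009", "Influenza A", "Human", "North America", "HA", "H1N1", date(2009, 4, 1)),
    FluRecord("example-human-2008", "Influenza A", "Human", "North America", "HA", "H1N1", date(2008, 2, 10)),
    FluRecord("example-swine-2009", "Influenza A", "Swine", "North America", "HA", "H1N1", date(2009, 3, 15)),
    FluRecord("example-human-h3", "Influenza A", "Human", "North America", "HA", "H3N2", date(2009, 3, 20)),
]

# The same settings described in step 2.
query = dict(
    virus="Influenza A",
    host="Human",
    region="North America",
    segment="HA",
    subtype_prefix="H1",
    start=date(2009, 3, 1),
    end=date(2009, 4, 29),
)

hits = [r.name for r in records if matches(r, **query)]
print(hits)
```

Changing host to "Swine" or widening the date range in the query dictionary mirrors the variations in step 4.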


3. This query finds 7 sequences. If we click the Get Sequences button, we can see that these are the California and Texas isolates.


Now, we have to decide which groups we'd like to compare. I decided to compare these to other H1N1 flu sequences and to some sequences from pigs.

4. To get other flu sequences for comparison, I used the same query settings (steps 1-2) with some changes.

       a.  For one set of sequences, I changed the host to "Swine."

       b. For the other set of sequences, I changed the date range so that I could get older sequences.

       c.  Each time I changed the settings, I clicked the Add to Query Builder button.

Now, the Query Builder contains the H1 sequences from the seven US cases, 272 sequences from people who've been infected with H1N1 over the past year in North America, and 5 H1 sequences from pigs.


5. Then, I click the Get Sequences button.

This gives me a long list with far more sequences than I want to use. I click the check box at the top to deselect everything, then I use the check boxes to select the sequences I want to compare.

I sorted by year to make my 2009 cases easier to find. Then, it's time to decide which sequences to pick.

Hmmm, of course I picked the seven swine flu cases, then I picked some sequences that were isolated from actual swine, then some other human cases of H1N1 that happened in different parts of North America last year.

At this point, I could download the sequences and work on my own computer, or I could use some of the analysis tools at the NCBI. I decided to let the NCBI's computers do the work, so I clicked the Multiple Alignment button to see the amino acid similarities. Then I clicked the Build a tree button, followed by a lot of Next step buttons.

Here's my tree:


After making the tree, I decided to look at all the sequences in my set. Here's what I get from that analysis:


What do I conclude from this? Well, first, it looks reasonable to say that the people in Texas and California were probably infected with the same strain since those sequences cluster pretty closely together.

Second, it looks like the HA protein from the California and Texas strains is most similar to the HA protein from a strain that infected some pigs in Ohio a couple of years ago, and it is not as closely related to the 200-some strains of H1N1 that infected other people in 2008.

You guys can play amateur epidemiologist, too, and look at other strains or look at the New York strains.  I think the more eyes we have looking at these, the better off we are. 

We should look at the nucleotide sequences, too, and other tree methods would be good to try as well. And, of course, as if things weren't complicated enough, there are 8 different segments of the flu genome.
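If you'd rather download an alignment and try a tree method yourself, here's a minimal sketch of one of the simplest, UPGMA, in plain Python. I'm not claiming this is the method the NCBI tool uses. The sequences are made-up ten-letter strings, not real HA data, and the names p_distance and upgma are my own. The output is a Newick string that shows only the branching order, with no branch lengths.

```python
from itertools import combinations


def p_distance(a, b):
    """Fraction of aligned positions that differ, skipping positions with a gap."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no comparable positions")
    return sum(x != y for x, y in pairs) / len(pairs)


def upgma(aligned):
    """Cluster aligned sequences with UPGMA and return a Newick string (topology only)."""
    # Each cluster is keyed by its Newick label and remembers how many leaves it holds.
    sizes = {name: 1 for name in aligned}
    dist = {
        frozenset((a, b)): p_distance(aligned[a], aligned[b])
        for a, b in combinations(aligned, 2)
    }
    while len(sizes) > 1:
        # Merge the two closest clusters.
        a, b = sorted(min(dist, key=dist.get))
        size_a, size_b = sizes.pop(a), sizes.pop(b)
        merged = f"({a},{b})"
        # Distance from the merged cluster to every other cluster is the
        # size-weighted average of the two old distances.
        for other in sizes:
            dist[frozenset((merged, other))] = (
                dist[frozenset((a, other))] * size_a
                + dist[frozenset((b, other))] * size_b
            ) / (size_a + size_b)
        # Drop every distance that involves the two clusters we just merged.
        dist = {k: v for k, v in dist.items() if a not in k and b not in k}
        sizes[merged] = size_a + size_b
    (label,) = sizes
    return label + ";"


# Made-up toy alignment, not real HA sequences.
aligned = {
    "human_2009_a": "ACDEFGHIKL",
    "human_2009_b": "ACDEFGHIKM",
    "swine_toy": "ACDEYGHWKL",
    "human_2008_toy": "ACWYFQHIKR",
}

print(upgma(aligned))
```

UPGMA assumes that all lineages change at the same rate, which is a strong assumption for flu, so treat a tree like this as a first look and compare it with other methods.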

Have fun!


Hi, if you FASTA search the sequences for this current North American H1N1 outbreak, it can be seen that in fact a closer match for the HA gene is a Kansas strain, A/Swine/Indiana/P12439/00 (H1N2). But given the dates on all of the isolates, all that really can be said is that the HA gene seems to have come from pig viruses already circulating in the USA.

I think you meant a BLAST search. And Indiana.

But the real problem here is sampling bias. If we sample 200 swine a year from the USA, and 100 from Asia, and ZERO from Central America or South America, of course the "best hit" in a BLAST search will be a USA or Asian swine.

What you are looking for is a swine (or bird, or human) sequence that is 98% to 100% identical to the California 2009 human sequences. 92% to 95% identity indicates a rather distantly related strain.

Sampling bias is our biggest problem here. We have not been sampling hundreds of swine each year in Central and South America for the past 20 years. Nor have we been sampling Sandhill Cranes and other wild migratory birds.
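To make the identity thresholds above concrete, here is a minimal Python sketch that finds the best hit in a tiny database and checks it against the 98% cutoff mentioned in this comment. The sequences are made-up toy strings, the function names percent_identity and best_hit are hypothetical, and this is not how BLAST or FASTA score alignments. It only shows that a "best hit" is simply the closest sequence that happens to be in the database, which may still be a distant relative.

```python
def percent_identity(a, b):
    """Percent of aligned positions that match, skipping positions with a gap."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    same = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            continue
        compared += 1
        if x == y:
            same += 1
    if compared == 0:
        raise ValueError("no comparable positions")
    return 100.0 * same / compared


def best_hit(query, database):
    """Return (name, identity) for the database sequence most similar to the query."""
    scored = ((name, percent_identity(query, seq)) for name, seq in database.items())
    return max(scored, key=lambda hit: hit[1])


# Made-up toy sequences, not real isolates.
database = {
    "toy_swine_1": "ACDEYGHWKL",
    "toy_swine_2": "ACWYFQHIKR",
}
query = "ACDEFGHIKL"

name, identity = best_hit(query, database)
is_close = identity >= 98.0  # cutoff taken from the comment above
print(name, identity, is_close)
```

A best hit that falls well below the cutoff is exactly the situation described here: the closest sequence we have sampled, not necessarily a close relative.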

Hi Brian, no, I meant a FASTA search, which is more sensitive than a BLAST search. Try them both and you will see. And what you say is obvious: you need a sequence in order to get any kind of alignment, but the practical issue is that we only have the sequences which have been sequenced. And of course the % identity would ideally be as close to 100% as possible, but out of the thousands of sequences in the database, the A/Swine/Indiana/P12439/00 (H1N2) HA sequence gives the closest match, so I don't understand why Ohio has been labelled as the closest match (well, I do, as the sample only consisted of HA sequences from 2005 onwards).

I'm trying to develop this as an exercise in my Genetics course. I chose some sequences and aligned them, but for some reason the 'Build a Tree' button in the alignment window is grayed/dimmed out (and thus not available for use).

Wonder what I'm doing wrong?

Hi Russell,

I'll take a look this weekend and see if I can figure out what's going on. In the meantime, you might also want to take a look at a different activity that I wrote that uses Scenario Based Learning to investigate flu and build phylogenetic trees.

You can find this one here: