BLASTing through the flu: activity 5, how similar is similar?

No more delays! BLAST away!

Time to blast. Let's see what it means for sequences to be similar. 

First, we'll plan our experiment.  When I think about digital biology experiments, I organize the steps in the following way: 

A.  Defining the question

B.  Making the data sets

C.  Analyzing the data sets

D.  Interpreting the results

I'm going to intersperse my results with a few instructions so you can repeat the things that I've done below.  I've seen some people writing that sequence analysis should be left to the experts, but I disagree.  Okay, expert input is important for the final interpretation, but there's no compelling reason to keep anyone from evaluating public data for themselves.

A.  What question are we going to ask?

The question we're going to address today is:

How similar are the 2009 H1N1 sequences to each other?

Why do we care?  I think it would be a good frame of reference.

B.  Make the data set(s)

1. Go to the NCBI influenza resources database.

Every time I visit, they have more sequences! 

(But they still don't have the sequences from Mexico!  Why are those sequences missing?)

2.  I decided to look at the influenza nucleotide segments that code for the hemagglutinin protein (HA).  This is the H part of the strain name in H1N1.  (why nucleotides?)

To get the sequences, I searched for any Human Influenza A nucleotide sequences between 2009-03-01 and 2009-05-01.  (How to do the search.)

I limited the search to H1, specified nucleotide sequences, and required that the sequences be full-length (you need all the sequences to be the same length to do good comparisons, and some of the new sequences are only partial sequences).
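If you download sequences as a FASTA file instead of filtering on the web page, the full-length requirement can be checked by hand. Here is a minimal sketch, assuming a plain FASTA text; the record names, the toy sequences, and the `min_length` cutoff are all made up for illustration and are not real influenza data.

```python
def parse_fasta(text):
    """Return a dict mapping each FASTA header (without '>') to its sequence."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:]
            records[name] = []
        elif name is not None:
            records[name].append(line.upper())
    return {name: "".join(parts) for name, parts in records.items()}


def keep_full_length(records, min_length):
    """Keep only the sequences that are at least min_length nucleotides long."""
    return {name: seq for name, seq in records.items() if len(seq) >= min_length}


# Toy records with hypothetical names and sequences (not real data).
example = """>toy_full_1
ATGAAGGCAATACTAGTA
>toy_partial
ATGAAGG
>toy_full_2
ATGAAGGCAATACTAGTT
"""

records = parse_fasta(example)
full = keep_full_length(records, min_length=18)
print(sorted(full))
```

For a real HA segment you would set `min_length` to whatever length you decide counts as full length for your data set.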

This gave me 10 H1 sequences.

C.  Doing the analyses

I did two things to look at similarity.  First, since I wanted to look at full-length sequences, I downloaded the accession numbers and compared them with BLAST.  Second, I downloaded the sequences, and used JalView and ClustalW to make an image so you can better see the similarity.


1.  Go to the BLAST nucleotide page.

I selected the checkbox for aligning two or more sequences since I only want to compare these sequences to each other.  Then, I used one of the accession numbers as the query and all ten as the subject sequences.  

Here are the accession numbers so you can do this yourself: 

Query:  FJ971076

Full length human H1 subject sequences:  CY039527

I had all these in a file, so I uploaded the file.


2.  Then, I clicked BLAST.

I found that all the new sequences (in my data set) were between 99 and 100% identical to the one I used as a query.  That one, my positive control, was of course 100% identical to itself (surprise!).
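To make "percent identical" concrete, here is a rough sketch of the arithmetic for two sequences that are already aligned. It is a simplified definition (identical columns divided by all aligned columns, with gap columns counted as differences), not BLAST's own code, and the sequences are toy strings rather than real HA data.

```python
def percent_identity(a, b):
    """Percent of aligned columns where both sequences carry the same base.

    Assumes a and b are already aligned (same length, '-' for gaps).
    A column containing a gap is counted as a difference.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    if not a:
        raise ValueError("sequences must not be empty")
    matches = sum(1 for x, y in zip(a, b) if x == y and x != "-")
    return 100.0 * matches / len(a)


# Toy aligned sequences (hypothetical, for illustration only).
query = "ATGAAGGCAA"
other = "ATGAAGGCAT"

print(percent_identity(query, query))  # the positive control
print(percent_identity(query, other))
```

Comparing the query to itself is the same positive control as in the BLAST run: it has to come out at 100.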

Changing the formatting to a query-anchored alignment shows us a little bit about the similarity.  I'm only showing part of the alignment in the image below, but you can see from the few positions with differences that the sequences are pretty similar.
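The idea behind a query-anchored view is simple: print the query in full, and for every other sequence show a dot wherever it matches the query and the actual base wherever it differs. Here is a small sketch of that idea, assuming the sequences are already aligned to the same length; the names and sequences are invented, and the layout only imitates the BLAST display.

```python
def query_anchored(query, subjects):
    """Format subjects against a query, using '.' for identical positions.

    subjects is a list of (name, sequence) pairs, each already aligned
    to the query (same length).
    """
    rows = [("Query", query)]
    for name, seq in subjects:
        if len(seq) != len(query):
            raise ValueError(f"{name} is not the same length as the query")
        shown = "".join("." if s == q else s for q, s in zip(query, seq))
        rows.append((name, shown))
    width = max(len(name) for name, _ in rows)
    return "\n".join(f"{name.ljust(width)}  {shown}" for name, shown in rows)


# Toy data (hypothetical names and sequences).
query = "ATGAAGGCAA"
subjects = [("toy_1", "ATGAAGGCAA"), ("toy_2", "ATGAAAGCAT")]

view = query_anchored(query, subjects)
print(view)
```

A row of nothing but dots is a sequence identical to the query, which is why the few differences jump out.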


More comparisons

These HA sequences are all really similar.  That's why people are convinced that it's the same strain of virus, at least in Texas and California. 

But, how does this compare to other HA proteins?

There's a really nice, user-friendly program called "JalView" that I like to use when working with multiple sequence alignments.  JalView has some web connections that you can use to do multiple alignments with the ClustalW or Muscle algorithms.  It's easy to edit, add, or group sequences.  And, most of all, I love the coloring options.

In JalView, I added a human H3 nucleotide sequence for comparison and did a multiple alignment with ClustalW.  Three pairs of HA sequences were identical, so I put those into groups.  Then, I colored the sequences by identity to make the differences stand out.
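The grouping and coloring steps can be imitated outside JalView. The sketch below, which assumes the sequences are already aligned, collects identical sequences into groups and lists the alignment columns where at least one sequence differs (the columns that would stand out when coloring by identity). The names and sequences are hypothetical, and this is not how JalView itself is implemented.

```python
from collections import defaultdict


def group_identical(aligned):
    """Group sequence names whose aligned sequences are exactly identical."""
    groups = defaultdict(list)
    for name, seq in aligned.items():
        groups[seq].append(name)
    return sorted(sorted(names) for names in groups.values())


def variable_columns(aligned):
    """Return 0-based column indexes where the sequences are not all the same."""
    columns = zip(*aligned.values())
    return [i for i, column in enumerate(columns) if len(set(column)) > 1]


# Toy alignment (hypothetical names and sequences).
aligned = {
    "toy_a": "ATGAAG",
    "toy_b": "ATGAAG",
    "toy_c": "ATGCAG",
    "toy_d": "ATGCAG",
    "toy_e": "ATGAAT",
}

groups = group_identical(aligned)
differences = variable_columns(aligned)
print(groups)
print(differences)
```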


The top picture shows part of the sequence; the bottom image shows the nucleotide sequences from the entire HA segment.  Groups of identical sequences appear as dark bars and any differences appear as white lines.

D.  Interpreting the results
From these analyses, we can see that:

  1. The H1 nucleotide sequences from the April 2009 outbreak (in California and Texas, at least) are at least 99% identical.
  2. The H3 sequence is very, very different (about 50% different) from the H1 nucleotide sequences.

Next - what do we see when we blast the whole database?

The 50% difference between an H1 and an H3 is huge. H1 and H3 could have diverged from a common ancestral H gene anywhere between 5,000 and 50 million years ago. Once sequences have reached saturation with mutations, the "molecular clock" does not work (yes, it is 12 o'clock, but what year is it?).

The more interesting comparisons are between the various H1 genes. We know, for example, that another lineage of H1N1 human viruses first entered humans between 1916 and 1918 and has not jumped back into swine or birds since then; it has remained in humans. And we have a couple of samples from 1918 (dug out of graves in the tundra) and samples from 1960 or so to the present day (from frozen samples in hospital freezers). That is one time vs. distance plot we can build for another H1 gene.

We have many swine and bird H1 genes from several independent lineages, 1980 or so to today. So we can trace the rate of evolution of a few more H1 genes.

From all that data, we can then estimate how long ago the H1 gene of this "new" swine H1N1 virus probably diverged from the other swine H1 genes that are its closest relatives. The distance seems to be roughly 4% (96% nucleotide identity) or so. It is important to keep track of how much is silent, synonymous change and how many of the mutations are non-silent, changing the amino acids encoded in the H1 protein. Selection on proteins is huge; selection on silent sites is very minimal.

I have not checked any of this data yet. But given what I know about how flu rather rarely crosses from ducks/chickens/birds to swine and vice versa (not every year, maybe once a decade or so?) and our sampling of avian, human and swine H1 genes, I am guessing that this 4% to 5% distance between the various H1 genes represents at least 10 years, and possibly as much as 40 years, of evolution.

The rate of evolution of the surfaces of the H and N genes changes dramatically upon host switch, but this virus is thought to have gone from swine to human in the past month or two, not 4 years ago. And from what I see in the BLAST result, I doubt all this 4% change is on the virus surface.

My wild guess, from what I have seen so far, is that this is not clearly an Asian swine/American swine reassortment virus. I know the authors of the WIRED report are experts, and have spent more time looking at the data than I have. But I have seen other rapid reports proven wrong before...

Brian, my gut feeling is that your guess of 4-5% difference between H1s representing 10-40 years of divergence is way off. If you look at the H1s from the present outbreak you can already see several divergences -- I don't have time to do sequence-by-sequence comparisons, but there are ~10 points where the nucleotide sequences as a group are different (maybe 0.2% on a sequence-by-sequence comparison). We don't know where these cases came from and it's likely the virus that infected them came from different areas and has been diverging for a little more than a week, but probably not much more than a month or two. My own guess (and again, this is something that can be checked, but I have to take my son to his soccer game in a little while) is that a 4% divergence would be more like a couple of years, which I think would be more consistent with the experts' comments.

Hi all, I am just curious about how the years of divergence are being estimated from sequence differences. Is this something we can see from the phylogenetic tree? Or are you simply speaking from your experience?

Phylogenetic trees can be derived by using the number of nucleotides that differ between sequences. If we know the mutation rate for a given species and we know the number of nucleotide changes, we can use the number of mutations to estimate the number of years that have passed since the strains of virus diverged. This is called a molecular clock.
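As a back-of-the-envelope sketch of that arithmetic: if two lineages differ by a distance d (substitutions per site) and each lineage accumulates changes at a rate r (substitutions per site per year), the time since they diverged is roughly d / (2r), because both lineages have been changing. The sketch also applies the Jukes-Cantor correction, a standard simple model that accounts for repeated changes at the same site, which is the saturation problem mentioned in the comments above. The rates in the code are placeholders I picked for illustration, not measured influenza rates, so the output is not an estimate for this virus.

```python
import math


def jukes_cantor_distance(p):
    """Correct an observed fraction of differing sites for multiple hits.

    Uses the Jukes-Cantor model: d = -3/4 * ln(1 - 4p/3).
    The correction is undefined once p reaches 0.75 (full saturation).
    """
    if not 0 <= p < 0.75:
        raise ValueError("p must be in [0, 0.75) for the Jukes-Cantor correction")
    return -0.75 * math.log(1 - 4 * p / 3)


def years_since_divergence(distance, rate_per_site_per_year):
    """Molecular clock estimate: both lineages change, so t = d / (2 * r)."""
    if rate_per_site_per_year <= 0:
        raise ValueError("rate must be positive")
    return distance / (2 * rate_per_site_per_year)


# Placeholder rates (assumptions for illustration, NOT measured values).
PLACEHOLDER_RATES = (0.001, 0.002, 0.005)

observed = 0.04  # the roughly 4% difference discussed in the comments
corrected = jukes_cantor_distance(observed)
for rate in PLACEHOLDER_RATES:
    years = years_since_divergence(corrected, rate)
    print(f"rate={rate}: about {years:.1f} years")
```

The answer depends strongly on the rate you plug in, and the correction grows quickly as sequences approach saturation, which is why a 50% difference says very little about time.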

If I'm interpreting Jian et al (2008) correctly, it looks as if you can get 5% divergence in an influenza HA over a single flu season in an outbreak.

Jian, J., Chen, G., Lai, C., Hsu, L., Chen, P., Kuo, S., Wu, H., & Shih, S. (2008). Genetic and Epidemiological Analysis of Influenza Virus Epidemics in Taiwan during 2003 to 2006. Journal of Clinical Microbiology, 46(4), 1426-1434.

iayork: Yes, you are misinterpreting things. There is a huge difference between observing 5% distances between isolates from a single city in a single year, and observing 5% change in a single unbroken lineage of evolution. For example, in Hong Kong in 2005, hundreds of people entering and leaving the city would bring with them influenza A H1N1 viruses from the globally circulating strains. We find the same strains in New York and every other major city in the world.
A good review of the HA and NA segments of flu is found here:

And a good analysis of Influenza rate of evolution here:

I am a college science professor and want to use the recent interest in the H1N1 virus to teach basic biology and chemistry concepts. Do you know of any science education work that has been done in this area?

By Ben Hutchinson on 02 Jul 2009

Hi Ben,

I have one well-tested activity on flu that relates to basic biology and chemistry concepts.

The basic biology concept concerns the relationship between the nucleic acid coding sequence and the phenotype of the organism. A single base change in the nucleic acid sequence changes an amino acid, which changes the activity of a protein and the phenotype of the virus. This also touches on evolution because wild type viruses are sensitive to Tamiflu and the mutant is resistant, so there is a selection process for resistant strains.
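Here is a small sketch of how a single base change plays out at the codon level, using the standard genetic code. The codons in the example are chosen only for illustration; I'm not claiming they are the exact change used in the activity. The same function also separates silent (synonymous) changes from non-silent ones.

```python
from itertools import product

# Standard genetic code, with codons ordered by T, C, A, G at each position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_change(codon_before, codon_after):
    """Return (amino acid before, amino acid after, kind of change)."""
    aa_before = CODON_TABLE[codon_before.upper()]
    aa_after = CODON_TABLE[codon_after.upper()]
    kind = "synonymous" if aa_before == aa_after else "nonsynonymous"
    return aa_before, aa_after, kind


# Illustrative single-base changes (hypothetical examples).
print(classify_change("CAC", "TAC"))  # first base changes
print(classify_change("CAC", "CAT"))  # third base changes
```

In this example, changing the first base swaps histidine (H) for tyrosine (Y), while changing the third base leaves the amino acid as histidine.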

The chemistry concepts involve the interaction between the drug (Tamiflu) and the neuraminidase protein. The drug binds to the protein through a series of electrostatic interactions and hydrogen bonds. Using VAST to superimpose structures and Cn3D to view them, you can see in the image that one of the bonds between the drug and the protein is an interaction between a negatively charged carboxylic acid in the drug and a positively charged amino acid (lysine) in the protein. This bond is unable to form in the resistant mutant, which weakens the interaction between the drug and the protein and allows the drug to be displaced by the normal substrate.

You can find a description of the activity here.