Sequencing a Genome, part VII: Want to win $10 million?

By sporte on February 5, 2007.

How to win the X PRIZE in genomics
In October 2006, the X PRIZE Foundation announced that the second X PRIZE would focus on genomics. The first team to successfully sequence 100 human genomes in 10 days will win $10 million.

And I would venture to guess that the winning team would also win in the IP (intellectual property) game and the genetic testing market, since they will gain an unprecedented look at genetic variation.

But when is done really done?
The first trick is defining what it means to be done. My husband says that "a sequencing project is done when the people who are doing it say that it is done."

How very true.

The human genome project was completed when the National Human Genome Research Institute (NHGRI) announced that it was complete.

Does this mean that every base in the human genome was identified?

No.

It meant that the NHGRI decided that a significant fraction of the parts that they said they were going to sequence had been sequenced. It was good enough.

I'm not sure how the X PRIZE Foundation is defining "done," but anyone competing for this prize will need to know the definition in order to calculate the depth of read coverage that they will need. I covered this in a previous installment, but for a quick number, they will need to sequence the same region at least 7 times in order to be certain that they've sequenced 99.9% of the genome. To put this into perspective, that would mean that out of 3 billion bases, 3 x 10⁶ or 3,000,000 bases would NOT be "done."
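Here is a rough sketch of where a number like 7 comes from. It assumes the standard Poisson (Lander-Waterman) model, in which reads land at random and the chance that a given base is missed at coverage depth C is e^-C. That model and the function names are assumptions made for illustration; the earlier installment may have done the calculation differently.

```python
import math

def fraction_covered(depth):
    """Expected fraction of bases hit by at least one read (Poisson model)."""
    return 1.0 - math.exp(-depth)

def expected_missed_bases(genome_size, depth):
    """Expected number of bases that no read covers at this depth."""
    return genome_size * math.exp(-depth)

GENOME_SIZE = 3_000_000_000  # roughly 3 billion bases, as in the text

for depth in (1, 3, 5, 7):
    covered = fraction_covered(depth)
    missed = expected_missed_bases(GENOME_SIZE, depth)
    print(f"{depth}x coverage: {covered:.4%} covered, ~{missed:,.0f} bases missed")
```

Under this model, 7x coverage is the first whole-number depth at which the expected covered fraction passes 99.9%, and it still leaves a few million bases without a read.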

What do the X PRIZE contestants have to consider if they are to win?
Once they've defined what it means to be "done," the contestants have to consider the variables that affect the number of reads that need to be sequenced. Unless they have a really cheap technology or unlimited funds, they will need to know how to reduce the number of reads.

The formula that we derived (1) for calculating the number of reads (Rn) is this:

Rn = (T × C) / (rL × Pf)

Once a contestant knows what it means to be done, the value of the numerator is fixed. T (the size of the genome) is the same in every human (well, at least within the same sex), and C is the coverage depth.

In order to reduce the number of reads, you must either get longer reads (increase rL) or increase the fraction of high-quality reads (Pf = passing fraction). In previous installments, I wrote about some types of reads that wouldn't pass muster (non-random reads, chimeras, E. coli, vector).
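To make the trade-off concrete, here is a minimal sketch that treats the number of reads as Rn = (T × C) / (rL × Pf), which is how the surrounding text describes the formula (T and C in the numerator, rL and Pf reducing the count). The read lengths and passing fractions below are hypothetical values chosen only for comparison, not measurements from any instrument or library.

```python
import math

def reads_needed(genome_size, coverage, read_length, passing_fraction):
    """Rn = (T * C) / (rL * Pf), rounded up to a whole read."""
    if read_length <= 0 or not 0 < passing_fraction <= 1:
        raise ValueError("read_length must be positive and passing_fraction in (0, 1]")
    return math.ceil(genome_size * coverage / (read_length * passing_fraction))

T = 3_000_000_000  # genome size in bases
C = 7              # coverage depth

# Hypothetical scenarios: (label, read length rL, passing fraction Pf)
scenarios = [
    ("long reads", 750, 0.8),
    ("short reads", 300, 0.8),
    ("short reads, cleaner library", 300, 0.95),
]

for label, rL, Pf in scenarios:
    print(f"{label}: {reads_needed(T, C, rL, Pf):,} reads")
```

Shortening the reads from 750 to 300 bases multiplies the number of reads by 2.5 if nothing else changes, and raising the passing fraction recovers only part of that.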

Today, I want to show you what happens when reads are short.

When reads are short, much of the information that's generated from sequencing is useless. The data might confirm other data, but it doesn't help us put the larger sequence together. This can be seen in the image below.

Figure: Trying to assemble sequences from short reads

Restriction enzymes make short reads.
If the data that I've presented in earlier installments haven't yet convinced you that making a genomic library with restriction enzymes is a bad idea, I have more data to show that RE libraries produce clones with, gasp, SHORT READS!

We used the Finch® Suite to look at the sizes of clones from our two RE libraries. One had been made by digesting genomic DNA with AseI, and the other had been made with DraI. In the Finch Suite, we have an algorithm that identifies DNA sequences that match those from common vectors. We use the positions of the vector sequences at the 5' and 3' ends of a read to determine the length of an insert.
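The idea of measuring an insert from the vector positions can be sketched in a few lines. This is not the Finch Suite algorithm, which isn't described here; it is a simplified stand-in that uses exact string matching, and the flank sequences, the read, and the function name are all made up for illustration.

```python
def insert_length(read, vector_5prime, vector_3prime):
    """Return the number of bases between the two vector flanks, or None.

    None means one or both flanks were not found, so the read did not
    span the whole insert and its length cannot be measured this way.
    """
    start = read.find(vector_5prime)
    if start == -1:
        return None
    insert_start = start + len(vector_5prime)
    # Search for the 3' flank only downstream of the 5' flank.
    end = read.find(vector_3prime, insert_start)
    if end == -1:
        return None
    return end - insert_start

# Made-up sequences, not real vector sequence.
LEFT_FLANK = "GGATCCTCTAGA"
RIGHT_FLANK = "CTGCAGGCATGC"
INSERT = "ATTAATGCGCGTTTAAA"
read = "NN" + LEFT_FLANK + INSERT + RIGHT_FLANK + "ACGT"

print(insert_length(read, LEFT_FLANK, RIGHT_FLANK))
```

A read that shows vector on both ends contains the entire insert, so the insert is shorter than the read. A real implementation would need to tolerate sequencing errors in the flanks rather than rely on exact matches.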

It turned out that approximately half of the clones from the RE libraries contained fragments with vector sequences on both the 5' and 3' ends of the insert (48% for AseI and 50% for DraI). This might not have been a problem if long reads were obtained, but our data (graphed below) showed that none of the reads were longer than 750 bases.

Figure: Making genomic libraries with restriction enzymes makes lots of short reads

But what about 454? Aren't they one of the contestants? And don't they get really short reads?
454 is one of the contestants in the X PRIZE race. Their technology is described in a very nice Flash animation (Pyrosequencing from 454).

But their sequencing instruments only get "reads" that are about 300 bases long. How do they address this issue of read length?

I can think of a few things that they do that help them out. First, they use a nebulizer to break the DNA up at random positions. Second, their method of diluting the sample until they are sequencing single molecules enables them to obtain high-quality sequences. Third, since they don't need to clone DNA, they don't have to cope with reads that are all vector, or E. coli, or chimeras. All of those steps increase the fraction of passing reads (Pf) and help compensate for a shorter read length (rL).

I'm not sure, though, how their technology can get around the last challenge that we will discuss with DNA sequencing: repetitive DNA.

But then, I don't think the X PRIZE foundation has a religious view of technology. You can probably use multiple strategies, as long as you get the genome sequences done.

References:
1. Porter, S., Slagel, J., and T. Smith. 2004. Analysis of Genomic DNA Library Quality with the Finch®-Server. Geospiza, Inc. You can download the paper as a pdf document from here: http://www.geospiza.com/research/white-papers.htm
Look in the middle of the page.

2. http://www.454.com

Read the whole series:
Part I: Introduction
Part II: Sequencing strategies
Part III: Reads and chromats
Part IV: How many reads does it take?
Part V: Checking out the library
Part VI: Chimeras are not just funny-looking animals

So, this is part VII?

Yes, thanks! I fixed it.

I would like to win 10 million dollars so I can pay off bills, pay for my baby's stuff, get food in the house and clothes on our backs.
