Shotgun Sequencing a Eukaryotic Genome

By evolgen on February 7, 2007.

Shotgun sequencing refers to the process whereby a genome is sequenced and assembled with no prior information regarding the genomic location of any of the DNA we sequence. There are quite a few steps that you have to go through before you have an assembled genome sequence. We're going to cover isolating DNA, putting the DNA in bacteria, sequencing the DNA, and assembling those sequences into a complete genome.

Sandy has been running a series on sequencing genomes (parts 1, 2, 3, 4, 5, 6, 7). You should go check it out even if you read this post; while I'm going to deal with some of the basics of shotgun sequencing, she goes over some things that I will not. This post will cover how genome sequencing projects go from organisms to assembled genomes, but there are certain details that I will be leaving out that Sandy has explained quite well.

Isolating and Cloning DNA
Before we can begin sequencing DNA, we must isolate it from dead organisms (or parts of those organisms). There are multiple ways to isolate DNA, but they all involve breaking down the tissues and cells to isolate the nuclei (the membrane-bound intracellular structures containing the genomic DNA), then breaking down the nuclei to remove the DNA inside. Once we have the DNA, we can directly sequence it only if we have prior information regarding some of the sequence of the region we would like to sequence. If we want to sequence an entire genome, we will not have enough sequence information to directly sequence the genomic DNA.

Once we have isolated the DNA, we break it into fragments of different sizes (for reasons discussed below). Those fragments are then inserted into plasmids, small circular DNA molecules that bacteria carry and copy separately from their own chromosome, and the plasmids are taken up by bacteria. Because we know the sequences of those plasmids, we can easily sequence the fragments that are inserted into the plasmids (the fragment shown as a red block in the figure below). Each of those plasmids is known as a clone.

Sequencing
In order to sequence the DNA the old-fashioned way (there are some newfangled techniques we won't deal with here), we use primers to initiate the sequencing reaction. Those primers are designed to match the known sequence of the plasmid flanking the region containing our DNA of interest that was inserted into the plasmid (shown as green arrows in the figure above). A single sequencing reaction can only read a few hundred nucleotides (DNA letters), and the genomic fragments are on the order of thousands of nucleotides. That means we don't get the entire sequence of the fragment, but we do generate sequences of the ends of the fragments (squiggly lines in the figure). Furthermore, we keep a record of which clone each end sequence came from, so we know that each pair of end reads should be located in the same genomic region.
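To make the idea of paired end reads concrete, here is a minimal Python sketch. Everything in it is made up for illustration: the clone name, the toy insert sequence, the 6-nucleotide read length, and the helper names `end_reads` and `ReadPair`. It assumes error-free reads and reports the second read as the reverse complement, since it is read off the opposite strand.

```python
from typing import NamedTuple

# Translation table for complementing DNA letters.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the sequence of the opposite strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


class ReadPair(NamedTuple):
    clone_id: str     # which clone both reads came from
    forward: str      # read from one end of the insert
    reverse: str      # read from the other end, on the opposite strand
    insert_size: int  # length of the cloned fragment


def end_reads(clone_id: str, insert: str, read_len: int) -> ReadPair:
    """Sequence only the two ends of a cloned insert (hypothetical helper)."""
    n = min(read_len, len(insert))
    forward = insert[:n]
    reverse = reverse_complement(insert[-n:])
    return ReadPair(clone_id, forward, reverse, len(insert))


# Toy example: a 28-nucleotide "insert" and 6-nucleotide "reads".
pair = end_reads("clone_001", "ATGGCGTACGTTAGCCGATAAGCTTGCA", read_len=6)
print(pair)
```

The middle of the insert is never read, but because both reads carry the same clone ID and we know roughly how long the insert is, we know about how far apart the two reads sit in the genome.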

Assembling Shotgun Reads
This aspect of shotgun sequencing will receive the brunt of my focus. Hopefully I've set this up properly by describing end sequencing of clones, because that is the secret to shotgun sequencing. That's a hint to go back and read the previous paragraph and look at the previous figure if you skimmed it over. The sequencing strategy is important. Real important.

Once the DNA sequencing is completed, the sequences are assembled like a puzzle. Ideally, the fragments overlap each other so that the sequences that partially overlap each other will be joined together to form larger sequences. We also would like to have small, medium, and large fragments covering each region (see below). This process continues until all the overlapping sequences are assembled into a bunch of really long sequences known as contigs.
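The puzzle-piece step can be sketched in a few lines of Python. This is a toy greedy overlap merger, not what any real assembler does: it assumes error-free reads all taken from the same strand, ignores repeats, and uses made-up reads and a made-up minimum overlap of 4 nucleotides. The function names `overlap` and `greedy_assemble` are mine.

```python
def overlap(a: str, b: str, min_len: int) -> int:
    """Length of the longest suffix of a that is also a prefix of b.

    Returns 0 if the longest such overlap is shorter than min_len.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads: list[str], min_len: int = 4) -> list[str]:
    """Repeatedly merge the pair of sequences with the largest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_len, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i == j:
                    continue
                n = overlap(a, b, min_len)
                if n > best_len:
                    best_len, best_i, best_j = n, i, j
        if best_len == 0:
            return contigs  # nothing left that overlaps: these are the contigs
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)


# Toy reads drawn from two unconnected regions.
reads = ["ATGGCGTA", "GCGTACGT", "ACGTTAGC", "CCATTGA", "TTGAAGG"]
contigs = greedy_assemble(reads, min_len=4)
print(sorted(contigs))
```

The five reads collapse into two contigs because nothing links the two regions by overlap alone, which is exactly the situation the paired end reads are meant to rescue.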

But the contigs only cover portions of each chromosome, and the goal is to have a single sequence that covers an entire chromosome. For various reasons (including repetitive DNA and lack of sequence for all genomic regions) there are genomic sequences that fail to assemble into the contigs. The next best thing we can do is try to fit the contigs together into a single sequence known as a scaffold.

In the figure shown above, three contigs are combined into a single scaffold. The arrows indicate paired end reads of clones -- the red arrows are from one clone and the blue arrows are from a different clone. If multiple paired end reads are located at the ends of two contigs, we can join the contigs into a single scaffold. The red region of the scaffold is the sequence that came from the contigs, and the black region is the sequence inferred to be located between the contigs. Only we don't know what those black sequences are, so we fill them with unknown nucleotides and refer to that region as a gap. While it may seem problematic to introduce gaps into a genome assembly, we make up for the cost of the gaps because they allow us to assemble contigs into scaffolds.
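Here is a rough Python sketch of that joining step, under some loud assumptions: the contig names, sequences, and gap estimates are invented, contig orientation is ignored, each contig is assumed to have at most one supported neighbor to its right, and the gap size is simply the average of the per-clone estimates. Requiring at least two clones (`min_links=2`) before joining mirrors the "multiple paired end reads" condition above. The function name `build_scaffolds` is hypothetical.

```python
from collections import defaultdict


def build_scaffolds(
    contigs: dict[str, str],
    links: list[tuple[str, str, int]],
    min_links: int = 2,
) -> list[str]:
    """Join contigs into scaffolds, filling gaps with N (unknown nucleotides).

    Each link is (left_contig, right_contig, estimated_gap) from one clone
    whose two end reads landed in different contigs.
    """
    # Collect the gap estimates supporting each candidate join.
    support: dict[tuple[str, str], list[int]] = defaultdict(list)
    for left, right, gap in links:
        support[(left, right)].append(gap)

    # Keep only joins backed by enough independent clones.
    next_contig: dict[str, tuple[str, int]] = {}
    for (left, right), gaps in support.items():
        if len(gaps) >= min_links:
            next_contig[left] = (right, round(sum(gaps) / len(gaps)))

    # A scaffold starts at any contig that is not the right side of a join.
    joined_on_right = {right for right, _ in next_contig.values()}
    scaffolds = []
    for name in contigs:
        if name in joined_on_right:
            continue
        seq, seen = contigs[name], {name}
        while name in next_contig and next_contig[name][0] not in seen:
            name, gap = next_contig[name]
            seq += "N" * gap + contigs[name]
            seen.add(name)
        scaffolds.append(seq)
    return scaffolds


contigs = {"A": "ATGGCG", "B": "TTAGCC", "C": "GGATCC", "D": "CCCCAA"}
links = [("A", "B", 4), ("A", "B", 6), ("B", "C", 3), ("B", "C", 3), ("C", "D", 5)]
scaffolds = build_scaffolds(contigs, links)
print(scaffolds)
```

Contigs A, B, and C end up in one scaffold separated by runs of N, while D stays on its own because only a single clone connects it to C.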

The scaffolds are then assigned to chromosomes using a few different strategies. If we know something about the molecular genetics of the organism we're studying, we can identify genes that have been previously mapped to chromosomes within our scaffolds. If a scaffold contains a gene (or, even better, multiple genes) that is known to be located on a particular chromosome, the scaffold most likely came from that chromosome. And if we know the order of the genes on the chromosome, we can assign an orientation to the scaffold and order multiple scaffolds on a chromosome.
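As a sketch of how mapped genes can anchor scaffolds, consider the Python below. The gene names, chromosome names, and map positions are all hypothetical, and the rule used (take the chromosome that most of the scaffold's mapped genes agree on, then compare gene order on the scaffold with gene order on the map) is just one simple way to do it. The function name `place_scaffold` is mine.

```python
from collections import Counter
from typing import Optional


def place_scaffold(
    genes_in_order: list[str],
    gene_map: dict[str, tuple[str, float]],
) -> Optional[tuple[str, float, str]]:
    """Return (chromosome, leftmost map position, orientation) for a scaffold.

    genes_in_order lists the mapped genes found on the scaffold, in the order
    they appear along it. Orientation is "+" if that order matches the map,
    "-" if it is reversed, and "?" if it cannot be determined.
    """
    hits = [gene_map[g] for g in genes_in_order if g in gene_map]
    if not hits:
        return None  # no mapped genes: cannot place this scaffold

    # Majority vote on the chromosome.
    chrom, _ = Counter(c for c, _ in hits).most_common(1)[0]
    positions = [p for c, p in hits if c == chrom]

    if len(positions) < 2:
        orientation = "?"  # one gene places a scaffold but cannot orient it
    elif positions == sorted(positions):
        orientation = "+"
    elif positions == sorted(positions, reverse=True):
        orientation = "-"
    else:
        orientation = "?"
    return chrom, min(positions), orientation


# Hypothetical genetic map: gene -> (chromosome, map position).
gene_map = {
    "geneA": ("chr1", 10.0),
    "geneB": ("chr1", 25.0),
    "geneC": ("chr1", 40.0),
    "geneD": ("chr2", 5.0),
}
scaffold_genes = {
    "scaffold_1": ["geneC", "geneB"],
    "scaffold_2": ["geneA"],
    "scaffold_3": ["geneD"],
    "scaffold_4": ["geneZ"],
}
placements = {name: place_scaffold(g, gene_map) for name, g in scaffold_genes.items()}
print(placements)
```

Sorting the placed scaffolds by chromosome and leftmost map position then gives their order along each chromosome; a scaffold with no mapped genes, like `scaffold_4` here, has to be placed some other way.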

If we lack known genetic markers or the marker set is poor, we can take some of the clones that we sequenced and map them to chromosomes using in situ hybridization of clones to the actual chromosomes. This involves attaching a fluorescent tag to the cloned sequence and mixing the tagged clone with cells from the organism. The nuclei of the cells can be observed under a microscope and the chromosomes visualized. The clone will anneal to the chromosome from which it came, allowing you to map the clone to a particular chromosome. The contig and scaffold containing the sequences from that clone can then be assigned to that chromosome.

Once all of the sequences have been assembled into contigs, the contigs assembled into scaffolds, and the scaffolds assigned to chromosomes, we have a "draft" assembly of a genome -- not until we minimize the gaps to an acceptable standard can we call the assembly "finished". Most sequenced genomes do not make it beyond the draft stage, as the finishing process is expensive; a draft sequence is usually good enough for most genomes. The genome sequence then gets annotated, a process that involves finding genes and other sequences in the scaffolds. This is done using gene prediction algorithms, comparisons with other annotated genomes and known genes, and other computational techniques.
