DNA sequencing errors hit home

One of my colleagues has a two-part series on FinchTalk (starting today) that discusses uncertainty in measurement and what that uncertainty means for present-day and Next Generation DNA sequencing technologies.

I've been running into this uncertainty myself lately.

I have always known that DNA sequencing errors occur. This is why people build tools for measuring the error rate and why quality measurements are so useful for determining which data to use and which data to believe. But, some of the downstream consequences didn't really hit home for me until a recent project. This project involves having students clone and sequence uncharacterized genes from genomic DNA. My part of the project was to do some research and write the bioinformatics section of the student lab manual.
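For anyone who wants to see what those quality measurements mean in numbers: Phred-style quality values map directly onto error probabilities. A quick Python sketch (the quality values shown are just the standard benchmarks, not data from this project):

    # A Phred quality value Q corresponds to an error probability
    # of 10^(-Q/10): the chance that this base call is wrong.
    def error_probability(q):
        return 10 ** (-q / 10)

    for q in (10, 20, 30, 40):
        print(f"Q{q}: about 1 error in {round(1 / error_probability(q))} bases")
    # Q10: about 1 error in 10 bases
    # Q20: about 1 error in 100 bases
    # Q30: about 1 error in 1000 bases
    # Q40: about 1 error in 10000 bases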

One of the steps in this process involves using shorter DNA sequences to reconstruct a longer sequence of DNA that we call a contig. We call this process "DNA sequence assembly," and we have to do it because an individual sequencing read is too short to cover the entire region on its own.
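The core idea is easy to sketch in code: look for a spot where the end of one read matches the beginning of the next, and join the reads there. Here's a toy Python version that assumes perfect overlaps (real assemblers also handle mismatches, quality values, and reads from the opposite strand; the reads below are invented):

    def merge(read_a, read_b, min_overlap=5):
        # Try the longest possible overlap first, then shrink
        max_len = min(len(read_a), len(read_b))
        for size in range(max_len, min_overlap - 1, -1):
            if read_a.endswith(read_b[:size]):
                return read_a + read_b[size:]
        return None  # no usable overlap found

    # Three made-up "reads" that tile across one longer sequence
    reads = ["ACGTACGGATCCA", "GATCCATTGCAAT", "TGCAATCCGGA"]
    contig = reads[0]
    for read in reads[1:]:
        contig = merge(contig, read)
    print(contig)  # ACGTACGGATCCATTGCAATCCGGA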

This time, however, things are a bit different from my past experience, in part because we have far less data. For many reasons, the quality of the student-generated chromatograms tends to be low, with only 25-50% of the files containing usable data. This means that each student or lab group only has about three to four reads to assemble into their contig. In some cases, it also means that they might only get sequence from a single strand.

[Image: overlapping reads assembled into a contig]

Since I've been testing the project to find out how things will work for the students, I've been doing many of these assemblies with different small data sets and reviewing the results. It's been quite surprising to realize how frequently errors occur.

I'm finding the errors by two different methods. First, I can detect errors when I look at the assemblies. In the case below, I found a position where one read had a deletion relative to the other. When I reviewed the trace in FinchTV, I could see that the base-caller had missed an A. When I find errors like that, I edit the reads in FinchTV to fix the sequence of bases and save my changes back to the iFinch database.

[Image: chromatogram trace in FinchTV showing the missed base]

The other place where I detect errors is the step where we compare our proposed genomic sequence to a set of reference mRNAs. In this case, when I look at the blastn results, I can sometimes see alignments that look like this:

[Image: blastn alignment of the query against reference mRNAs]

In this case, you can see that all of the sequences below my query (shown at the top) have an extra T or C that my query is missing. Again, I go to FinchTV and review the trace to find out if there should be another base in my read that somehow got missed.
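Eyeballing each alignment works for a handful of reads, but this kind of check can also be automated. If you run blastn with tabular output, the gapopen column flags any hit where the query and subject disagree by an insertion or deletion. A rough sketch (the file names are placeholders, not this project's actual data):

    import subprocess

    # Run blastn with tabular output (-outfmt 6); column 6 (gapopen)
    # counts how many gaps were opened in each alignment.
    result = subprocess.run(
        ["blastn", "-query", "my_contig.fasta",
         "-subject", "reference_mrnas.fasta", "-outfmt", "6"],
        capture_output=True, text=True, check=True)

    for line in result.stdout.splitlines():
        fields = line.split("\t")
        subject, gapopen = fields[1], int(fields[5])
        qstart, qend = fields[6], fields[7]
        if gapopen > 0:
            print(f"{subject}: {gapopen} gap(s); check the trace "
                  f"near query positions {qstart}-{qend}")

Any hit this flags is a candidate position to go review in FinchTV.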

I know it's strange, but despite all the assemblies I've done, it's working with these small assemblies that has really impressed on me the need for lots of redundant data. Now I know what people mean when they say that they minimize errors by collecting more data. I think one of the benefits of this project is that students are going to learn why many of us are excited about Next Generation sequencing technology. The more data we collect, the more we can confirm our results.
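A little arithmetic shows why the redundancy helps so much. Suppose each read has a 1-in-100 chance of miscalling a given base (Q20). With three reads covering the same position, a simple majority vote only goes wrong when at least two reads err at once:

    # Per-base error probability for a single Q20 read
    p = 0.01

    # Probability that 2 or 3 of 3 independent reads err at the
    # same position (a pessimistic bound: it also counts cases
    # where the two wrong calls disagree with each other)
    p_consensus = 3 * p**2 * (1 - p) + p**3
    print(p_consensus)  # ~0.0003, roughly Q35 instead of Q20

Three overlapping reads turn a Q20 position into roughly a Q35 one, which is why deeper coverage makes the consensus so much more trustworthy.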

I'm certain, in the future, we won't be quite as uncertain.

Comments

From your description, it sounds like the students are sequencing plasmids. How are they isolating the DNA? DNA isolation has produced the biggest variability in sequencing results in my experience.

Back in the day when we were actually doing the sequencing ourselves, the senior graduate student could sequence his boiling mini-prep DNA and I couldn't get mine to work. It turned out that he carefully matched his isopropanol volumes to his supernatant volumes, and I was ... less than careful. Once I made the adjustment, I could sequence my boiling preps, too.

The students are cloning genomic DNA via nested PCR, then amplifying from their clones by PCR and doing Sanger sequencing.

I'm not worried about the quality of their data. We can identify poor quality chromatograms and discard the ones with too few Q20 bases.

I was just surprised to see how many errors there were in good quality sequences.

It does make sense: if a quality value of Q20 means that one base in 100 could be a mistake, then in a read of 800 bases where every base is Q20, there could easily be 8 mistakes. It's just astonishing to see this in practice and not just in theory.
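The same back-of-the-envelope calculation works for any read once you have its quality values: the expected number of miscalls is just the sum of each base's error probability.

    # Expected number of miscalls in a read, given per-base Phred
    # qualities (here, a made-up 800-base read that is all Q20)
    qualities = [20] * 800

    expected_errors = sum(10 ** (-q / 10) for q in qualities)
    print(expected_errors)  # 8.0 -- eight likely mistakes in one read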

I run a DNA sequencing core lab, and this article was of particular interest to me. I think error rate is definitely something to keep in mind but, as Ron points out, the method (and quality) of isolation is probably the biggest factor in sequence quality. There is also the brand of instrument, age of reagents, competence of the technician, etc. to think about.

You also want to watch which region of the read you're looking at. The beginnings and ends can contain a lot of miscalls. I often tell my customers that their sequence is like a stick of celery: you cut off the leafy part at the top and the part at the bottom that was stuck in the ground, and you now have yourself a nice piece of celery.

I don't see, with our controls, the error rate that this article suggests should be occurring. However, Sandra does well in encouraging researchers to check the quality of the chromatograms and verify the calls. Many people rely only on the text sequence.
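To make the celery analogy concrete in code: trimming is just clipping low-quality calls from both ends of the read. A minimal sketch (the sequence and quality values here are invented, and real trimmers use windowed algorithms rather than a single hard cutoff):

    def trim_read(seq, quals, min_q=20):
        # Clip bases below the quality cutoff from both ends
        start, end = 0, len(seq)
        while start < end and quals[start] < min_q:
            start += 1
        while end > start and quals[end - 1] < min_q:
            end -= 1
        return seq[start:end], quals[start:end]

    seq = "NNACGTACGGATCCATTGCANN"
    quals = [3, 5, 30, 34, 35, 38, 36, 37, 40, 39, 38, 36,
             35, 37, 34, 33, 31, 30, 28, 25, 6, 4]
    trimmed, _ = trim_read(seq, quals)
    print(trimmed)  # ACGTACGGATCCATTGCA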