Is phred dead? Let's see the data

By sporte on August 21, 2007.

If you've read the previous posts on this topic, here and here, you're probably aware by now that I have this weird (okay, maybe fanatical) obsession with data. Or at least, with knowing if my data are right so I can get on with life, do the analysis and figure out the results.

My results from last week suggested that re-processing chromatogram data (from the ABI 3730) with phred was probably a bad idea, but still, I only had one data point and I really wanted to know if anyone had done a more thorough study and compared larger numbers of chromatograms.

Naturally, someone had.


And of course it was ABI. And, the results aren't even new (except to me, I guess).

ABI and their collaborators at the Washington University and Baylor College of Medicine genome centers presented this work in a poster at the Advances in Genome Biology and Technology (AGBT) meeting in 2004 at Marco Island (1).

They looked at basecalling performance with data from 20,000 chromatograms and concluded that:

1. KB produced fewer errors.
2. KB was able to call more bases, which resulted in longer reads.

It certainly puts my quick conclusion from one chromatogram to shame. Oh, why oh why don't I ever read those user bulletins?

Never mind that. ABI kindly gave me permission to post some of their data (2):

These box and whisker plots show results for three sets of chromatograms: those basecalled with the KB basecaller (on top, in blue); chromatograms from ABI instruments (without KB) that were re-processed by phred (in the middle, in red); and chromatograms that were first processed with KB and then with phred (on the bottom, in green). That last method is the one I used the other day with my one chromatogram.

In each case, they compared the read sequences that were obtained with a reference sequence in order to determine the error rate.

(What is a read? A read is a DNA sequence that's been obtained from a chromatogram file. The chromatogram file has lots of extra information like the kind of matrix, the run time, the name of the base calling program, the peak heights, etc. A read sequence only contains the sequence of bases: ATAGAGCTCATCGATCATCTACGTA.... etc. )

We can evaluate reads in a few ways.

We can look at the number of high quality bases (Q20, Q30, Q40).
We can look at the length of the read after trimming off the bad stuff.
And, we can compare the read to a known sequence and count the number of differences.
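
Here is a minimal sketch of the first of those measures, counting the bases at or above Q20, Q30, and Q40. It relies on the standard phred definition, where a quality value Q corresponds to an estimated error probability of 10^(-Q/10). The function names and the quality values are made up for illustration; they don't come from the ABI poster.

```python
def error_probability(q):
    """Convert a phred-scaled quality value into an estimated error probability."""
    return 10 ** (-q / 10)


def count_high_quality(quals, thresholds=(20, 30, 40)):
    """Count the bases whose quality value meets or exceeds each threshold."""
    return {t: sum(1 for q in quals if q >= t) for t in thresholds}


# Hypothetical quality values for a ten-base read (illustration only).
quals = [8, 15, 22, 35, 41, 45, 38, 30, 19, 9]

counts = count_high_quality(quals)
print(counts)                 # bases at or above Q20, Q30, Q40
print(error_probability(20))  # Q20 means roughly a 1 in 100 chance the base is wrong
```

So a Q20 base has about a 1% chance of being wrong, a Q30 base about 0.1%, and a Q40 base about 0.01%.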

Part A in the figure shows the length of the read sequence after trimming the poor-quality bases (less than Q20) from both the 5' and 3' ends. In each case, it appears that the KB base caller gave longer reads. In this figure, it looks like the mean values were around 650, 775, and 950 bases for reads from short, medium, and long runs.
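
To make the trimming idea concrete, here is a rough sketch that strips bases below Q20 from both ends of a read and reports what's left. I'm assuming the simplest possible rule (stop at the first base from each end that reaches Q20); real trimming tools often use sliding windows or other criteria, and I don't know exactly which rule was used for the poster. The function name and the example data are hypothetical.

```python
def trim_low_quality_ends(bases, quals, min_q=20):
    """Strip bases below min_q from the 5' and 3' ends; the interior is left as-is."""
    start = 0
    end = len(bases)
    # Walk in from the 5' end until we reach a base at or above min_q.
    while start < end and quals[start] < min_q:
        start += 1
    # Walk in from the 3' end the same way.
    while end > start and quals[end - 1] < min_q:
        end -= 1
    return bases[start:end], quals[start:end]


# Hypothetical ten-base read with made-up quality values.
bases = "ATAGAGCTCA"
quals = [8, 15, 22, 35, 12, 45, 38, 30, 19, 9]

trimmed, trimmed_quals = trim_low_quality_ends(bases, quals)
print(trimmed, len(trimmed))  # the trimmed read and its length
```

Notice that the low-quality base in the middle of the read (Q12) stays put. Only the ends get trimmed, so the trimmed length is what gets reported as the read length.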

Part B shows the error rates. For the rapid runs (top), it looks like phred has a slightly lower mean error rate when it's used to re-process KB-called data. KB and re-processed KB data appear to be tied for the medium length runs and KB wins with the long runs.
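
For a sense of what an error rate means here, this sketch counts the positions where a read disagrees with a reference. It assumes the two sequences are already aligned and the same length, which is a big simplification: a real comparison would align the read to the reference first and count insertions and deletions too. The function name and the sequences are made up for illustration.

```python
def error_rate(read, reference):
    """Fraction of positions where the read differs from the reference.

    Assumes the sequences are already aligned and equal in length.
    """
    if len(read) != len(reference):
        raise ValueError("read and reference must be the same length")
    if not read:
        return 0.0
    mismatches = sum(1 for r, ref in zip(read, reference) if r != ref)
    return mismatches / len(read)


# Hypothetical 20-base reference and a read with one miscalled base.
reference = "ATAGAGCTCATCGATCATCT"
read = "ATAGACCTCATCGATCATCT"

rate = error_rate(read, reference)
print(rate)  # one mismatch out of 20 bases
```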

To quote ABI:

...since phred replaces (and ignores) the initial called sequence, re-processing KB-analyzed samples with phred will, on average, degrade the accuracy of the analysis in terms of actual sequence error. Analysis improvements provided by KB algorithm outlined above will be essentially lost.

There you have it, the end of this read and this sequence of posts at the same time. Time to move on to the next generation.

References:
1. Gehman, C. et al. 2004 "Longer Reads with the KB Basecaller" AGBT 2004.
2. Applied Biosystems User Bulletin, FAQ KB Basecaller v1.2.

Figure 1 looks like it would be a nightmare for a colorblind reader.

I agree.

Thanks for the update Sandra. But shouldn't there also be a comparison of phred analysis of raw data compared to KB?
regards,
TJK

Hi Sandra

It is good to see someone looking critically at DNA basecallers. KB is certainly a better basecaller than phred; however, there are better basecallers out there than KB. At risk of tooting my own horn, our company sells a couple of basecallers (LongTrace and PeakTrace) that are better basecallers than either phred or KB. We have free versions of the software on our website which you can try with your own traces - the links are below.

http://www.nucleics.com/peaktrace-sequencing/
http://www.nucleics.com/longtrace-sequencing/

Cheers

Daniel

Hi Thomas,

You wrote: "shouldn't there also be a comparison of phred analysis of raw data compared to KB?"

Unfortunately, this can't be done. Phred can only work with data that have been previously processed by a sequencing instrument. The closest you can get to doing the experiment that you described is having phred work with data that have been processed on ABI instruments with base callers other than KB.

Thanks Daniel,

I'll take a look.

Hi Daniel,

I am wondering what the difference is between LongTrace and PeakTrace. I have been using LongTrace for quite some time, and tried out the trial version of PeakTrace as well. For sure, both give improved results, yet I find that LongTrace gives better results than PeakTrace. What do you think about this?

Cheers

Nic

Hi Nic

The basic difference is that PeakTrace is a basecaller and trace processor combined, while LongTrace is just a trace pre-processor for the KB basecaller. PeakTrace is better than LongTrace in my opinion, but with some trace types KB does better. This is the reason why we still offer both versions.

Cheers

Daniel
