Why biology students should learn how to program

John Hawks recounts a recent conversation about bioinformatics:

I was talking with a scientist last week who is in charge of a
massive dataset. He told me he had heard complaints from many of his
biologist friends that today's students are trained to be computer
scientists, not biologists. Why, he asked, would we want to do that
when the amount of data we handle is so trivial?

Now, you have to understand, to this person a dataset of 1000 whole genomes is
trivial. He said, don't these students understand that in a few years
all the software they wrote to handle these data will be obsolete? They
certainly aren't solving interesting problems in computer science, and
in a short time, they won't be able to solve interesting problems in biology.

I'd agree that biological data-sets can't compete with those of particle physics in terms of sheer scale, although the speed with which they are accumulating is alarming. Where biological data-sets really become intimidating is in their diversity, in the complexity of the underlying processes, and in the levels of noise and bias. I suspect a lot of people used to dealing with extremely large data-sets would still balk at the complexity of computational biology once they dug a little deeper, particularly in a few years' time.

Anyway, this conversation leads John, via an interesting digression into Wolfram Alpha (read the post for details), to pose the following question:


Tomorrow's high-throughput plain-English bioinformatics tool will do
the work of ten thousand 2009 graduate students. If a freely-available
(or heck, even a paid) service can do the bioinformatics, what should
today's graduate students be learning?

I am intrigued by the potential of natural language search algorithms,
and certainly I anticipate a future in which the combination of
well-curated, mutually intelligible biological databases and powerful
search tools makes it much easier for non-informaticians to generate
and explore hypotheses, in the same way that sites like NCBI and Ensembl
have made it simple for bench scientists to access and manipulate
sequence data. There's no question that biologists with little or no
informatics background will be able to query increasingly complex
biological data-sets in increasingly complex ways over the next few years.

That said, such tools and databases, however powerful, will always lag substantially behind the science.
If you are a young biologist who wants to work right at the cutting edge - which
will require dealing directly with rapidly changing technologies,
generating biological data at an increasingly dizzying pace and in
constantly evolving formats - solid informatic skills, including at
least basic programming and sound statistical knowledge, will make you a far more productive scientist.

If you intend to be at the head of your field, you'll often be in a
place where the right tools for the job simply don't exist yet. You
need to be able to develop such tools yourself, or at least speak the
right language to communicate your needs to someone who can; and
speaking that language means having a good working knowledge of computation.

Of course programming languages will change and the scripts you write
as a grad student will be forgotten within a year or two - that's the
nature of science (how many molecular biologists still run Southern blots?). The important thing is learning how to think about large-scale biological data:
how to access, filter and manipulate it. Having basic programming
expertise will make you more effective as a scientist right now, and it
will also prepare you for a career in an increasingly data-driven field.
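
To make "access, filter and manipulate" a little more concrete, here is a minimal sketch of the sort of throwaway script I have in mind. It is written in Python purely for illustration, and everything in it is an assumption of mine rather than anything from John's post: the sequences are made up, and the function names (read_fasta, gc_content, filter_records) and the length and GC thresholds are hypothetical. It reads FASTA-formatted records, then keeps only those that are long enough and GC-rich enough.

```python
from io import StringIO


def read_fasta(handle):
    """Yield (name, sequence) pairs from a FASTA-formatted file handle."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header: emit the previous record, if there was one.
            if name is not None:
                yield name, "".join(chunks)
            header = line[1:].split()
            name = header[0] if header else ""
            chunks = []
        else:
            chunks.append(line.upper())
    if name is not None:
        yield name, "".join(chunks)


def gc_content(seq):
    """Return the fraction of G and C bases in seq (0.0 for an empty sequence)."""
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def filter_records(handle, min_length=10, min_gc=0.5):
    """Keep records meeting both thresholds; return (name, length, GC) tuples."""
    kept = []
    for name, seq in read_fasta(handle):
        gc = gc_content(seq)
        if len(seq) >= min_length and gc >= min_gc:
            kept.append((name, len(seq), round(gc, 2)))
    return kept


# Made-up example data, standing in for a real sequence file.
EXAMPLE = """>seq1 made-up example
GCGCGCATGCGC
>seq2
ATATATATATAT
>seq3
GCGC
"""

if __name__ == "__main__":
    for record in filter_records(StringIO(EXAMPLE)):
        print(record)
```

The specific language and thresholds matter far less than the habit: treat a data file as a stream of records, decide what you want to keep, and write that decision down as code you can re-run.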

Of course, informatic skills alone will get you nowhere unless your
ambition is to be the default IT support team for your lab partners. Regardless of
whether you are asking questions using John's hypothetical universal
query engine or an algorithm of your own invention, you need to be asking the right questions, which means developing an understanding of biology that is both deep and broad. If the quoted concern in John's post is valid - if young biologists are actually sacrificing scientific understanding for computational skills - then that is certainly something that needs to be corrected.

Still, let's be sure not to swing too far in the opposite direction. Unless and until Wolfram Alpha triggers the singularity, I'd argue that biology grad students will be extremely well-served by developing serious programming and statistical experience, at least if they want to be marching at the head of their field. Speaking as a biologist who entered informatics far too late (as a postdoc), I can think of few other skill areas where investing effort and time early in your career can provide such a dramatic return in terms of scientific productivity and career prospects.

Related: xkcd effectively says the same thing in cartoon style - and read the comments of that post for some useful tips.
 


I'm a first-year undergrad majoring in biology, so this article may be coming at a good time for me. I know nearly nothing about programming and I have yet to take my first statistics course. I'd appreciate it if anyone has suggestions about what specific ways someone at my point in education should heed this advice. Coursework, extracurricular exploration, work, etc.?

"Unless and until Wolfram Alpha triggers the singularity"

I hope that line was tongue-in-cheek...I can't quite tell from the tone of this post. As a computational biologist, I can attest to the fact that no glorified search engine is going to be able to mine genomes in a way at all comparable to the way we can mine them with algorithms that understand the underlying biology.

I agree with the overall point of your post, however -- that we need to be equipping this generation of grad students with basic computational understanding.

But as a graduate student who studies computational biology, I can't say I'm unhappy that most grad students aren't equipped with this basic understanding. Maybe it'll increase my chances of getting a job in a few years :-p

By Computational … (not verified) on 17 Mar 2009

Sorry for the second post, but to reply to Student above: take intro to programming and an algorithms and data structures course to familiarize yourself with basic computer science. Then take an algorithms course geared toward computational biologists if your college offers one...they're becoming more common.

Don't take a Perl programming course without taking the other classes I mentioned. Knowing the fundamentals of computer science will take you a lot further than knowing how to program in a specific language. Once you're fluent in computer science broadly, picking up a language like Perl takes a weekend. And without algorithms and data structures you can't be a good programmer in any language, even if you know the syntax.

By Computational … (not verified) on 17 Mar 2009

this, of course, is precisely the problem. i greatly admire the integrated sciences program (say... at p.u.) -- but, i dare say, it's an easier 'sell' to their students, than to those at other, less ... 'selective'... universities/colleges.

university curricula are already loaded with required courses (to justify departmental faculty rolls) and administrators are lowering both the duration of the semester (to reduce utility bills) AND total credits required for graduation (to shorten time to graduation -- the average, btw, is already over 10 (yes, TEN) semesters).

one last thing... my son teaches high school biology in an east-coast-state. with the exception of his AP courses, he is NOT allowed to teach about Golgi Bodies because they are not on the state 'performance' exam.... i can just see him suggesting to the dept.head and administration that he should add a section on programming....

one really last thing... i know that we're all happy.happy about genetics/bioinformatics/computing/etc -- but, programming and computation related to medical applications (i.e., tracking medical records and diagnoses) are probably even more relevant to biology majors (after all, that's why they major in biology to begin with...) and, probably where all the $$$ will be ... :)

I agree in part. 100% on the stats. It would be great to have any additional skills if you can get them, of course. Several years into my bioinformatics career I took an 18-month Saturday certificate course that included programming.

I hate programming.

However, I know what I can or cannot ask programmers to do. I can ask more targeted and complete questions when discussing projects with them. And I'm a wildly effective software tester because of it.

Yet I would argue that with appropriate software training many more biologists could accomplish complex tasks with existing interfaces and tools--without programming skills. But that's my job so I have to believe that ;) Seriously, though, I have seen it.

I think creating a class of super end-users would take us far down the road.

To the first "Student"
I generally agree with Comp Bio Grad student (though I've found surprisingly few applications for my Data Structures knowledge in my area).
I think the best approach would be to figure out which upper-level classes with a stats/programming element interest you. Take the prerequisites to those courses or speak to the professors of those courses and ask what they recommend you take in advance.

"how many molecular biologists still run Southern blots?"

Ahem. Not everyone has their own personal sequencer.

gillt,

OK, perhaps my perspective is just slightly warped. :-)

Interestingly, part of the trigger for that little tangent was a corridor conversation yesterday about working up a Southern approach to characterise a very complex structural variant. The notion of running a Southern at the Sanger Institute struck me as a kind of heresy, but it IS probably marginally easier than figuring out a way to type the variant via sequencing. It's a close call, though...

Ha. I'm acquainted with the rationale. I see it with my colleagues at NHGRI. For them it's an institutionalized matter of pride to sequence first, ask questions later. Great post, btw.

People with 733t computer skills wish they knew more biology. People with deep biology thinking wish they knew more about computers.

By anomalous (not verified) on 18 Mar 2009

I couldn't agree with poster #3 more. Taking a data structures class is probably the best bang for your buck. In my school this was one step up from the introductory level, but I'd wager a smart science student could skip the intro CS class and go straight to the DS class. You'll learn the programming language of choice along the way.

A *lot* of scientific computing (speaking from physics here) boils down to well-chosen data structures and then correct operations on those structures.

Rapid accumulation of data and literature in biology is certainly a phenomenon and often already a problem. Whilst the amount of data is dwarfed by what is generated in particle physics, for instance at CERN, the complexity and scope of biological data are enormous. The 'concept web approach' is meant to alleviate the problems of dealing with large amounts of data and information, and to that end, the Concept Web Alliance is being established. Have a look here: http://conceptweblog.wordpress.com/

I think this is a bit silly. A good scientist is distinguished by the ability to think outside the box, and to bring a diversity of skills to bear on a scientific problem. I've been in bioinformatics since 1993 (yes really!) and if you have good training then it doesn't matter if technology changes, you continue to learn and adapt. I have always considered myself a biologist first, and as I left the bench I traded pipettes and petri dishes for computers and software.

If you aren't trained in critical thinking I don't care how good the software is. How will you recognize that your answer makes biological sense, and how will you follow that up to further validate the work? Do you know enough of statistics and the software you are using to analyze your data to fully understand how the result you've gotten was achieved? Ground yourself in the fundamentals, and you can pick up or discard tools as you need them.

By Brian Moldover (not verified) on 19 Mar 2009

Alan Kay was a biology student. Tim Berners-Lee was a physicist. I'll take 10 non-computer scientists over 10,000 computer scientists any day. Computer scientists are great at making things efficient and reliable and theoretically sound, but a lot of the world-changing ideas come from other fields.

Ken,

There's no question that a scientist needs to be creative and broad-thinking to generate world-changing ideas. My point is essentially that, given equal ability, a scientist who is able to program and perform complex statistical analyses will be more productive than a scientist who isn't.

Since there's no reason why programming skills and scientific creativity need be mutually exclusive, why not aim for both?

Oh my god. Computers. I've never considered it. I hate lab-work, belly-to-bench stuff. Computers, hurray!!! I'll check my Universität's web-site right away for an introductory course. :-) Happy!

I think this also applies on a much more trivial level. If I'd known how to use R and how to extract the information I really need from messy files, I could have saved myself hours of tedium back when I still thought I wanted to work at a bench. Handling jumbled quantitative PCR output files and skewed expression data would have been much less painful. Nowadays, I meddle around with whole genomes.