What is a Gene?

One of the greatest developments of the post-genomic era has been the refinement of the concept of the 'gene'. The central dogma states that genes encode RNA transcripts, which are translated into the amino acid sequence that makes up a protein. But protein-coding genes make up a small fraction of many genomes, so what does the rest of the genome do? Some say it's junk. Others say that it's involved in regulating the transcription of the other regions. Still others say that it's transcribed, but not translated. (Note: most think it's some combination of the three.)

We're now discovering that lots of those non-protein-coding regions are actually transcribed into RNAs. Those RNAs may be transcriptional mistakes or they can be functional non-protein-coding RNA. A recent study in Drosophila melanogaster revealed that nearly 30% of the transcribed sequence has not yet been annotated as a functional transcript. Only 29% of those unannotated sequences can be explained as alternative exons of known genes. The other 71% are either unknown protein coding genes or a bunch of untranslated RNAs that need to be characterized (I'd put my money on the latter).
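As a rough back-of-the-envelope check, those percentages can be combined to see how much of the total transcribed sequence falls into each bucket. This is only a sketch: it treats "nearly 30%" as exactly 30%, which is an assumption made for illustration, and the variable names are my own.

```python
# Figures quoted above; "nearly 30%" is rounded to 0.30 for illustration only.
unannotated_fraction = 0.30   # share of transcribed sequence lacking annotation
alt_exon_share = 0.29         # share of the unannotated part explained as alternative exons
unexplained_share = 1 - alt_exon_share  # the remaining 71%

# Express each bucket as a share of ALL transcribed sequence.
alt_exon_of_total = unannotated_fraction * alt_exon_share
unexplained_of_total = unannotated_fraction * unexplained_share

print(f"alternative exons of known genes: {alt_exon_of_total:.1%}")
print(f"unknown genes or untranslated RNAs: {unexplained_of_total:.1%}")
```

Under that assumption, roughly a fifth of all transcribed sequence would be either unknown protein-coding genes or uncharacterized untranslated RNAs.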

In classical genetics, a gene is a region of the genome that, when mutated, produces an observable phenotype. What would you say is the post-genomic definition of a gene?

It's a bit like "what is a planet", isn't it? Interesting that astronomy has never had an adequate definition of a planet and biology lacks an adequate definition for a gene.

Problem is that you end up either with a broad definition that covers all cases but is uninformative (e.g. 'a unit of hereditary information') or something specific for which there's bound to be an exception. I'm sure we can devise something that works though. You'd have to consider the common processes - a template + transcription, for instance.

For years I have been teaching my students that a gene is a segment of DNA that codes for a single RNA molecule with a complementary sequence, regardless of whether that RNA molecule is translated or not. This definition takes into account the genes for the various rRNAs and tRNAs, which are not translated, and also other forms of non-translated RNA that have recently been discovered. By this definition, genes that code for mRNAs that are actually translated are distinguished as "structural genes," using terminology that was first developed to describe the Jacob-Monod model of the lactose operon. Using this same terminology, the gene that codes for the lactose repressor protein is a "regulatory gene," insofar as the repressor does not function in an "extrinsic" biochemical pathway, but rather participates in the regulation of other structural genes.

However, the distinction between "structural" and "regulatory" genes outlined above is insufficient to describe the various kinds of genetically significant DNA sequences now known. For example, it does not include regions of the DNA to which protein regulators bind, but which are not themselves transcribed. It also does not distinguish between sequences whose RNAs are translated into proteins (either enzymes or repressor/regulator proteins) and sequences that are transcribed into RNA but never translated (such as rRNA, tRNA, and the newer non-translated RNAs).

Given the foregoing, it appears to me that there are four (possibly five) functionally different kinds of DNA coding sequences:

(1) translatable sequences: those DNA sequences that are both transcribed into mRNA and later translated into proteins, regardless of function (these can be further subdivided into proteins that participate in non-DNA related biochemical pathways and those that directly regulate DNA, but those seem to me to be classifications of the proteins, not the DNA sequences that code for them);

(2) transcribable sequences: those DNA sequences that are transcribed into RNA (e.g. rRNA, tRNA), but are not later translated into proteins/polypeptide chains. Again, what the RNAs do after being transcribed is not a function of the DNA, but rather of the RNAs, and therefore should not really be used to classify DNA coding sequences;

(3) binding sequences: those DNA sequences that are not transcribed into RNA nor translated into protein, but which function as binding sites for regulatory molecules such as repressor proteins, homeotic gene products, etc. While such sequences do not code for the production of a transcribed or translated gene product, they still participate in the regulation of other genes by serving as regulatory binding sites; and

(4) non-binding sequences: those DNA sequences that are not transcribed into RNA, not translated into protein, and do not function as binding sites for regulatory molecules. Such sequences would include highly repetitive sequences, tandem repeats, "spacer DNA", pseudogenes, retroviral and transposon inserts (both "dead" and potentially "alive"), etc. This latter category could be further subdivided into "functional" non-coding/non-binding DNA sequences versus "non-functional/parasitic" non-coding/non-binding DNA sequences, depending on whether they arise as part of the functional architecture of the DNA (primarily of eukaryotes), or whether they arise as side-effects of the action of parasitic genetic elements, such as retroviruses or transposons.
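For anyone who finds a decision procedure easier to follow than prose, the four categories above can be sketched as a small classifier. This is a minimal illustration, not an annotation tool: the names `SequenceClass` and `classify` and the three yes/no inputs are hypothetical, and the order of the checks (transcription first, then regulator binding) is an assumption made so that each sequence lands in exactly one category.

```python
from enum import Enum


class SequenceClass(Enum):
    """The four functional categories listed above (names are illustrative)."""
    TRANSLATABLE = 1    # transcribed into mRNA and translated into protein
    TRANSCRIBABLE = 2   # transcribed into RNA but never translated
    BINDING = 3         # not transcribed; serves as a regulatory binding site
    NON_BINDING = 4     # not transcribed and not a regulatory binding site


def classify(transcribed: bool, translated: bool, binds_regulator: bool) -> SequenceClass:
    """Assign a DNA sequence to one category from three yes/no properties.

    Assumption: transcription takes priority, so a transcribed sequence that
    also binds regulators is classed by what happens to its transcript.
    """
    if translated and not transcribed:
        # Translation requires an RNA transcript, so this combination is invalid.
        raise ValueError("a sequence cannot be translated without being transcribed")
    if transcribed and translated:
        return SequenceClass.TRANSLATABLE
    if transcribed:
        return SequenceClass.TRANSCRIBABLE
    if binds_regulator:
        return SequenceClass.BINDING
    return SequenceClass.NON_BINDING


# A tRNA gene is transcribed but not translated.
print(classify(transcribed=True, translated=False, binds_regulator=False).name)
# An operator site is not transcribed but binds a repressor.
print(classify(transcribed=False, translated=False, binds_regulator=True).name)
```

The possible subdivision of category (4) into "functional" and "parasitic" sequences is left out, since it depends on a sequence's origin rather than on these three properties.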

There may be other categories of DNA sequences that have other functions, but right now I can't think of any. Therefore, this is how I intend to teach the concept of a "gene" to my students at Cornell from now on.

So much for the Beadle/Tatum "one gene, one enzyme" model, eh? And the classical Mendelian definition of "one gene, one phenotypic trait" is no longer viable either...

I am glad to see someone is worrying about the definition of the gene. I feel the gene is such a weak concept that it should not be treated as the basis of evolutionary theory. My approach, which I call bioepistemic evolution, is to regard data as fundamental to evolution and to define the gene in terms of data.

Thus, I have offered the following definition of the gene:

Genes are subsets of the data set defined by the nucleotide sequence of DNA. To qualify as a gene, the data subset must be so formatted that it can be interpreted by an organism into a distinct biochemical activity. An important implication of this definition is that, because biochemical activities are distinct and chemically separable from other such activities, genes may become manifest as distinct and distinguishable, biological phenotypes.

I would like to refine this definition of the gene to maximise its generality and would like to hear any critiques.

Sincerely

John Hewitt

Didn't we just hear that at least some non-gene sequences code for RNA that inhibits some gene expression? The idea is that this can fine-tune how the gene is used. It seems odd that the cell would create a protein, then use RNA to destroy it. But, the idea is that this is much more responsive than other techniques.

I'm not an expert. I thought genes were protein coding regions.

We have always had a working definition of 'planet'. Basically anything that moves in the sky is one. There were seven, and the days of the week are named after them: the Sun, Moon, Mars, Venus, Jupiter, Mercury, and Saturn.

When the nature of these objects became apparent, the Sun got kicked out for fusion. The Earth got added when it was discovered that it also orbits the Sun. The Moon got kicked out when it was discovered that things orbit other planets.

What we need now is a lower bound and upper bound for the size of planets. The lower bound is arbitrary. So, they picked up on this idea that planets are round. The upper bound is also arbitrary. So, they picked up on this fusion thing. Any bigger, and you're a star. The rest is politics.

Moons also need a lower limit on size. Some say that Jupiter has four moons, and other junk orbiting it. Others say that Saturn has billions of moons (ring particles).