RPM’s slamming of some silly coverage of the C-value enigma got me thinking about the problem of why we see the sorts of variation we do in the amount of non-coding DNA between species. People are right to heckle the questionable assumption that these differences in ncDNA have anything to do with the evolution of phenotypic complexity (though probably a small fraction do), but I think the variation might still have an interesting functional tale to tell. I’m probably not the first to think of this, but the idea is that variations in quantity of ncDNA are not functional for the organism in themselves but rather the waste product of a particular kind of functional change: gene duplication.
Recall that eukaryotic genomes are regularly bedevilled by selfish transposons. These are rogue genetic elements with a vested interest in creating duplication events, and the basic idea is that every once in a while one of them will succeed wildly at it and in the process end up dragging a whole gene along for the ride (maybe several times). Most of the time this will be bad, but occasionally it’ll be good, and sometimes it’ll be nearly-neutral and you’ll see functional divergence on the copied locus after the initial duplication event. In the cases where a duplicated gene confers a selective benefit, the newly formed transposable elements hitchhike along on the newly selected gene’s coattails.
The upshot of this is that we should expect cases of adaptive evolution via gene duplication to frequently be accompanied by increases in the amount of transpositional cruft in the genome of the species. This also would neatly account for much of the ncDNA variation between species, since gene duplication seems to play an important role in the emergence of species-specific traits. If this idea is correct, the amount of ncDNA should correlate more highly with how much adaptive gene duplication a lineage has undergone than with phenotypic complexity per se.
This theory should be pretty easy to test: Look at cases of adaptive gene duplication that have happened relatively recently (geologically speaking) and compare the LINEs and such around these loci with those close to the presumed “parent” locus. The further back in time you go the harder it will be to do this comparison due to drift wiping out the traces, but in the cases that are comparable they should have a very similar pattern of nonfunctional repeats. If I have this right. (EDIT: Duh. This isn’t a good test, since you’d probably see the same thing under any sort of duplication. Need to think of something else. Maybe compare lineages of recently duplicated genes: If gene B is a “recent” duplication of gene A, and gene Y is a “recent” duplication of gene X, but genes A and X diverged an extremely long time ago, then the two duplications were probably caused by different retrotransposons and so the LINEs around A and B should tend to be highly similar to each other but very different from those around X and Y, and vice-versa. You’d probably have to compare a bunch of different gene lineages to get a statistically significant result, though, and I don’t know how easy it would be to find enough good candidates.)
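To make the A/B versus X/Y comparison a bit more concrete, here is a rough Python sketch of the bookkeeping. Everything in it is an assumption for illustration: the flanking sequences are made-up toy strings, the function names (`kmers`, `jaccard`, `within_vs_between`) are hypothetical, and k-mer Jaccard overlap is only a crude stand-in for properly annotating and aligning the repeats. It just shows the shape of the comparison: similarity of flanking repeats within a duplication lineage versus between lineages.

```python
from itertools import combinations


def kmers(seq, k=4):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b, k=4):
    """Jaccard similarity (0 to 1) between the k-mer sets of two sequences."""
    ka, kb = kmers(a, k), kmers(b, k)
    if not ka and not kb:
        return 0.0
    return len(ka & kb) / len(ka | kb)


def mean(values):
    """Arithmetic mean, or NaN for an empty list."""
    return sum(values) / len(values) if values else float("nan")


def within_vs_between(flanks, lineage_of, k=4):
    """Mean flank similarity within duplication lineages vs between them."""
    within, between = [], []
    for g1, g2 in combinations(sorted(flanks), 2):
        score = jaccard(flanks[g1], flanks[g2], k)
        if lineage_of[g1] == lineage_of[g2]:
            within.append(score)
        else:
            between.append(score)
    return mean(within), mean(between)


# Toy data (invented, not real sequence): repeats flanking each gene.
flanks = {
    "A": "ACGTACGTTTGACCA",
    "B": "ACGTACGTTTGACCT",
    "X": "GGGCCCTTAAGGCCA",
    "Y": "GGGCCCTTAAGGCCT",
}
# B is assumed to be a recent copy of A, and Y a recent copy of X.
lineage_of = {"A": "AB", "B": "AB", "X": "XY", "Y": "XY"}

within_mean, between_mean = within_vs_between(flanks, lineage_of)
print(f"within lineages: {within_mean:.2f}, between lineages: {between_mean:.2f}")
```

With real data you would want many lineages and something like a permutation test on the lineage labels before reading anything into the gap between the two numbers.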
Has anyone actually looked at anything like this? Does this idea hang together? How else could we test it?
Update: Looks like another beautiful hypothesis slain by an ugly fact. I’ll just copy-paste what I said in the comments:
Having looked into it, this doesn’t work the way I thought it would. I knew that LINEs sometimes end up dragging some of the host’s genetic material along in their replications, but now I know that the way this happens is that sometimes the reverse-transcription machinery grabs onto host mRNA that’s floating around and splices it in. So what’s being inserted is automatically a pseudogene since the mRNA has already been processed (i.e. there’s no promoter attached to it). For this idea to work it would need to be an active gene. Rats.
Mind you, DNA transposons could still easily be a major source of gene duplication since they skip the RNA middleman. But since they’re only a tiny fraction of ncDNA that means it probably has nothing to do with the C-value enigma.