Gene Expression: Adaptive Landscapes: Miscellaneous Points

Sunday, October 12, 2008

Adaptive Landscapes: Miscellaneous Points posted by DavidB @ 10/12/2008 03:23:00 AM

My post here discussed Sewall Wright's concept of the adaptive landscape, and a post here discussed R. A. Fisher's views on the subject. Before I come to my planned note on Sewall Wright's Shifting Balance theory, there are some points about adaptive landscapes which didn't fit easily into the earlier posts...

Terminology

As mentioned in the post on Wright's 'landscapes', he used two different versions of a multi-dimensional model of fitness. In one interpretation the dimensions, except for that of fitness, represent the number of alleles of different types in an individual genotype. I will call this a genotype landscape. In the other interpretation, the dimensions, except for that of fitness, represent the proportion of alleles of different types in a population. I will call this a frequency landscape. Both interpretations can be called genetic landscapes.

While Wright's interpretations always have genetic dimensions, other authors have used concepts in which the dimensions of the landscape represent phenotypic or ecological variables. I will call these phenotype landscapes. Peaks in such a landscape represent optimal phenotypes or ecological niches.

In both genetic and phenotype landscapes one of the dimensions usually represents reproductive fitness, but some alternative measure of adaptation may be used. For example if the phenotype is the shape of a fish, the measure of adaptation might be some aspect of swimming efficiency.

Some authors draw a distinction between a fitness landscape and an adaptive landscape, but the distinction is not consistently used. For example, according to Gavrilets (p.30) these terms are used to designate what I have called genotype and frequency landscapes respectively, but McGhee (p.1) uses them to designate genetic and phenotype landscapes. Most authors seem to use the terms 'adaptive landscape', 'fitness landscape', 'genetic landscape', and 'selective landscape' interchangeably, though each of them may also have other meanings. (For example, 'genetic landscape' may be used to describe the geographical distribution of genes.) Anyone searching for relevant studies should try all of these variants. I will use adaptive landscape as a general term embracing all of them.

Literature

There are at least two recent books devoted to adaptive landscapes, by Gavrilets and McGhee (see refs.). Gavrilets deals mainly with genetic landscapes, McGhee with phenotype landscapes. The book by Gavrilets has an extensive bibliography, which provides a good way into the literature on genetic landscapes. The studies I have looked at deal mainly with genotype landscapes. There seems to be comparatively little work on frequency landscapes, perhaps because the subject is less amenable to study by computer simulation.

The number of peaks in genotype landscapes

There is an extensive literature on the number of peaks in genotype landscapes, mainly based on the work of Stuart Kauffman.

To begin with, consider a model devised by Kauffman and Levin (1987). Suppose a genome has N loci. For simplicity, assume the loci are haploid and that there are 2 possible alleles at each locus. There are therefore 2^N possible different genotypes. Now, suppose that each distinct genotype has a fitness which is independent of the fitness of any other genotype. We may then represent the fitnesses by numbers chosen at random (Kauffman and Levin use the range of rational numbers between 0.0 and 1.0). For simplicity we stipulate that no two genotypes have exactly the same fitness. If we choose one of the 2^N possible genotypes at random, there are N other genotypes which can be derived from that chosen genotype by varying an allele at a single locus. We call these the neighbours of the chosen genotype. The chosen genotype is a local optimum if it has higher fitness than all of its neighbours. But by the stated assumptions the fitnesses of the N + 1 genotypes concerned are random numbers, each of which must have a probability of 1/(N + 1) of being the largest in the set. There is therefore a probability of 1/(N + 1) that the chosen genotype is a local optimum. But the chosen genotype is randomly chosen from the 2^N possible genotypes, and any other genotype (by the given assumptions) would have an equal chance of being a local optimum within its own 'neighbourhood'. Since there are 2^N possible genotypes in total, the total expected number of local optima in the system is therefore (2^N)/(N + 1) [Note 1].
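
As a rough numerical check on this argument, the following is a minimal Python sketch, not taken from Kauffman and Levin. It assumes fitnesses drawn uniformly from 0.0 to 1.0 and encodes each haploid genotype as the bits of an integer; the function names are illustrative only.

```python
import random

def count_local_optima(n_loci, rng):
    """Count local optima in one random landscape over 2**n_loci genotypes."""
    size = 2 ** n_loci
    # Independent random fitness for every genotype (exact ties are vanishingly unlikely).
    fitness = [rng.random() for _ in range(size)]
    optima = 0
    for g in range(size):
        # The N neighbours of g differ from it at exactly one locus (one flipped bit).
        if all(fitness[g] > fitness[g ^ (1 << i)] for i in range(n_loci)):
            optima += 1
    return optima

def mean_local_optima(n_loci, trials=200, seed=0):
    """Average the count over several independently generated landscapes."""
    rng = random.Random(seed)
    return sum(count_local_optima(n_loci, rng) for _ in range(trials)) / trials

if __name__ == "__main__":
    for n in (4, 8, 10):
        # Simulated mean next to the predicted value (2^N)/(N + 1).
        print(n, mean_local_optima(n), 2 ** n / (n + 1))
```

With a few hundred trials the simulated mean should lie close to (2^N)/(N + 1), even though the events 'genotype g is a local optimum' are not independent of each other (see Note 1).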

It is obvious that this number increases rapidly with increasing N. It is equally obvious that the assumption of independent fitnesses for each possible genotype is biologically unrealistic. It implies that no single locus, or combination of fewer than N loci, has any predictable effect of its own on fitness. As an extreme alternative to this, suppose that each locus makes a contribution to fitness which is independent of all other loci. In this case one of the alleles at each locus must be unambiguously fitter than the other allele, regardless of the alleles at other loci. Suppose we designate the fitter of the two alleles by an even number, and the less fit allele by an odd number. It is clear that no genotype containing an 'odd' allele can be a local optimum, because the fitness of the genotype could always be increased by substituting an even allele for the odd one. The only local optimum in the system is therefore the single genotype containing exclusively 'even' alleles, no matter how many genotypes there are in the system. This result can be extended to systems with diploid loci and/or multiple alleles at each locus, provided that one of the alleles at each locus is unambiguously fitter than all other alleles. We could also allow the fitness contribution of a locus to be affected by the alleles at other loci, provided the effect is not so great as to reverse the rank order of fitness of the alleles at each locus. This would be the case, for example, if each allele has a primary effect on one trait which makes a large difference to fitness, and a secondary effect on other traits, provided the secondary effects do not exceed the fitness difference due to the primary effect.
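
The single-peak claim for fully independent loci can be checked by brute force. The sketch below is an illustration under stated assumptions: haploid loci, two alleles per locus, and genome fitness taken as the plain sum of randomly drawn per-locus contributions. The names are hypothetical.

```python
import itertools
import random

def additive_local_optima(n_loci, seed=0):
    """List every local optimum when each locus contributes independently to fitness."""
    rng = random.Random(seed)
    # contrib[i][a] is the assumed fitness contribution of allele a (0 or 1) at locus i.
    contrib = [(rng.random(), rng.random()) for _ in range(n_loci)]

    def fitness(genotype):
        return sum(contrib[i][a] for i, a in enumerate(genotype))

    optima = []
    for g in itertools.product((0, 1), repeat=n_loci):
        # Neighbours are the genotypes that differ from g at a single locus.
        neighbours = (g[:i] + (1 - g[i],) + g[i + 1:] for i in range(n_loci))
        if all(fitness(g) > fitness(h) for h in neighbours):
            optima.append(g)
    # Also return the genotype built from the fitter allele at every locus.
    best = tuple(0 if c0 > c1 else 1 for c0, c1 in contrib)
    return optima, best
```

For any seed the list of optima should contain exactly one genotype, the one carrying the fitter allele at every locus, however large 2^N becomes.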

Between the two extreme models, there could be a variety of systems in which the rank order of the contributions of loci to fitness is partly but not entirely independent of other loci. Kauffman has devised a framework known as the NK model. [Note 2] In the NK model there are N haploid loci, with 2 possible alleles at each locus, while the fitness contribution of each locus is affected by the alleles at K other loci as well as itself. The fitness contribution of each possible combination of alleles at each such group of K + 1 loci is a random number chosen from the interval 0.0 to 1.0. For any particular assignment of alleles to the K + 1 loci, this number determines the fitness contribution of the locus in question. The fitness of the genome as a whole, for any particular assignment of alleles to all N loci, is the average of the contributions for each locus.

The precise way in which the loci are connected to each other may vary. According to Kauffman (p.55) this usually makes little difference to the outcome. It may be useful to consider a simple special case which is not treated by Kauffman. Suppose we divide the N loci into N/(K + 1) discrete sets (assuming for simplicity that N/(K + 1) is a whole number). Let each of the K + 1 loci in each such set be 'connected' to the remaining K loci in the set. There are 2^(K + 1) possible combinations of alleles for each such set, and let each combination be assigned a fitness value randomly chosen from the interval 0.0 to 1.0. For any particular assignment of alleles to the loci, this number constitutes the fitness contribution of every locus in the set to the fitness of the genome. But each such set of K + 1 loci can be treated as a case of the Kauffman/Levin model, and has an expected number of [2^(K + 1)]/[K + 2] local optima. Since each such set of loci, by assumption, has no effect on the fitness contribution of any loci outside the set, it follows that any combination of local optima for all of the N/(K + 1) discrete sets will also be a local optimum for the entire genome, since any change at a single locus would reduce the overall fitness of the genome. Since there are ([2^(K + 1)]/[K + 2])^[N/(K + 1)] such combinations, this is the expected number of local optima for the entire genome. It may be easily checked that for the value K = N - 1, where each locus is connected to every other locus in the genome, this reduces to (2^N)/(N + 1), as in the first of the extreme models, while for K = 0, where no locus is connected to any other locus, it reduces to 1, as in the other extreme model. For values of K between 1 and N - 2, the number of local optima increases with increasing K and/or N.
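
The formula for this non-overlapping special case is easy to tabulate. The short Python sketch below simply evaluates ([2^(K + 1)]/[K + 2])^[N/(K + 1)]; the function name is illustrative, and it assumes, as in the text, that K + 1 divides N exactly.

```python
def expected_optima_block_model(n_loci, k):
    """Expected number of local optima when N loci form N/(K + 1) disjoint sets."""
    block = k + 1
    if n_loci % block != 0:
        raise ValueError("K + 1 must divide N exactly in this special case")
    # Each set of K + 1 loci is a small Kauffman/Levin landscape.
    per_block = 2 ** block / (block + 1)
    # The sets are independent, so the expected counts multiply.
    return per_block ** (n_loci // block)

if __name__ == "__main__":
    for k in (0, 1, 2, 3, 5, 11):
        print(k, expected_optima_block_model(12, k))
```

For N = 12 this gives 1 at K = 0, 16 at K = 2, and about 315.08 (that is, 4096/13) at K = 11, rising steadily in between.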

In my simple example the genome is divided into non-overlapping sets of loci. But more generally in the NK model there will be overlap. For example, the sets of connected loci may be arranged cyclically, like abcde, bcdef, cdefg ......zabcd. Or the connections could be chosen at random, in which case there is a non-zero probability that the same locus will enter into more than one set of connected loci. This makes the problem of determining the number of local optima much more complicated. A given set of alleles may be a local optimum with respect to one set of connected loci, but one or more of those alleles may be sub-optimal for another set to which it belongs. In this case, changing one of those alleles will reduce fitness at some loci but increase it at others. The effect on the overall number of local optima for the genome as a whole is not intuitively obvious, and does not seem amenable to calculation by a general formula. Kauffman and others have relied on computer simulations. The most important result is that the number of local optima increases rapidly with increasing N and/or K (Kauffman p.60). This is not surprising, but it may be taken as vindicating Sewall Wright's intuition that in genotype landscapes with a lot of epistatic relations, the number of selective peaks will be very large. In general one may say that for a realistic size of genome (i.e. with thousands of loci) the number of peaks will be very large unless the value of K (averaged over the genome) is close to zero.
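
To give a feel for the kind of simulation involved, here is a minimal sketch of an NK landscape with cyclically overlapping sets of loci, counted by exhaustive enumeration. It is an illustration under stated assumptions (each locus depends on itself and the next K loci, wrapping round the genome; contributions are uniform random numbers), not a reproduction of Kauffman's own program, and all names are hypothetical.

```python
import itertools
import random

def make_nk_landscape(n_loci, k, rng):
    """Return a fitness function for one random NK landscape with cyclic connections."""
    # Locus i is connected to the next k loci, wrapping round the genome.
    deps = [[(i + j) % n_loci for j in range(k + 1)] for i in range(n_loci)]
    # One table per locus: a random contribution for each of the 2**(k + 1)
    # allele combinations at the loci it depends on.
    tables = [
        {combo: rng.random() for combo in itertools.product((0, 1), repeat=k + 1)}
        for _ in range(n_loci)
    ]

    def fitness(genotype):
        total = sum(
            tables[i][tuple(genotype[j] for j in deps[i])] for i in range(n_loci)
        )
        return total / n_loci  # genome fitness is the average contribution

    return fitness

def count_local_optima(n_loci, k, rng):
    """Count genotypes that are fitter than all N of their one-locus neighbours."""
    fitness = make_nk_landscape(n_loci, k, rng)
    count = 0
    for g in itertools.product((0, 1), repeat=n_loci):
        w = fitness(g)
        if all(w > fitness(g[:i] + (1 - g[i],) + g[i + 1:]) for i in range(n_loci)):
            count += 1
    return count

def mean_local_optima(n_loci, k, trials=20, seed=0):
    rng = random.Random(seed)
    return sum(count_local_optima(n_loci, k, rng) for _ in range(trials)) / trials

if __name__ == "__main__":
    for k in (0, 2, 4, 7):
        print(k, mean_local_optima(8, k))
```

Exhaustive enumeration is only feasible for small N, since the number of genotypes doubles with each added locus. With K = 0 the count is always 1, and with K = N - 1 the mean should approach (2^N)/(N + 1), matching the two extreme models.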

Kauffman's NK model is in many ways simplistic, but it does seem quite robust as a basis for exploring the theory of genotype landscapes. Other researchers have developed it in various ways. I don't know (or understand) this work well enough to summarise it, but I recommend the book by Gavrilets, which applies the theory to the problem of speciation. He notably claims that if a sufficient proportion of alleles are allowed to be selectively neutral, then in genotype landscapes of high dimensionality there will usually be a 'network' of ridges connecting the peaks, and along which populations can evolve without crossing fitness 'valleys'.

The number of peaks in frequency landscapes

As noted earlier, there seems to be much less work on frequency landscapes. In my post on Fisher I mentioned that in private letters Fisher argued that as the number of dimensions rises, the proportion of 'level points' which are all-round maxima will fall, and will be about 1/2^N of the total, where N is the number of dimensions. Fisher may have assumed that (a) in each dimension of gene frequencies, only about half of the level points will be maxima, and (b) the location of the maxima in each dimension is usually independent of the other dimensions. [Note 3] With these assumptions, the probability that a level point will be simultaneously maximal in all dimensions will only be about (1/2)^N, or 1 in 2^N, as suggested by Fisher. It does not follow that the number of maxima would not rise. If the number of level points in a single dimension is n, the expected number of level points in N independent dimensions would be n^N, so the expected number of all-round maxima would be (n^N)/2^N. For any n much greater than 2, this will increase rapidly with increasing N; for example, if n = 4, the number of maxima for N = 2, 3, 4.... will be 4, 8, 16... which rapidly becomes enormous.
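
The counting argument can be made concrete with a small enumeration. This sketch assumes, as in the speculation above, that the dimensions are independent and that half of the level points in each dimension are maxima (rounded down if n is odd); the function name is illustrative.

```python
import itertools

def count_level_points(n_per_dim, n_dims):
    """Return (total level points, all-round maxima) for independent dimensions."""
    n_max = n_per_dim // 2
    # Label each level point in a single dimension as a maximum or not.
    kinds = ["max"] * n_max + ["other"] * (n_per_dim - n_max)
    total = 0
    all_round = 0
    # A level point of the whole landscape picks one level point per dimension.
    for point in itertools.product(kinds, repeat=n_dims):
        total += 1
        if all(kind == "max" for kind in point):
            all_round += 1
    return total, all_round

if __name__ == "__main__":
    for n_dims in (2, 3, 4):
        print(n_dims, count_level_points(4, n_dims))
    # prints: 2 (16, 4), then 3 (64, 8), then 4 (256, 16)
```

The proportion of all-round maxima falls as 1/2^N, as Fisher suggested, while their absolute number (n/2)^N rises.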

The validity of the two key assumptions - that about half of the level points in each dimension will be maxima, and that these will be independent of each other - is debatable. First, if we consider loci without epistasis, there are three cases. If one homozygote is superior to the other, while the heterozygote is either intermediate in fitness or equal to one of the homozygotes, then there will be one maximum and one minimum in the relevant dimension. If the heterozygote is superior to both homozygotes, there will be one maximum and two minima. If both homozygotes are superior to the heterozygote, there will be two maxima and one minimum. There are no cases in which there would be more than two minima or maxima. (If there are more than two alleles at the locus the possibilities are more complicated, but it is difficult to think of realistic scenarios in which there are more than two maxima or minima in each dimension.) For loci without epistasis the assumption that about half of the level points in each dimension will be maxima is therefore plausible as a rough average. But for loci with epistasis the key assumptions are doubtful. The assumption of independence for each dimension is no longer generally valid, as the fitness for all the interacting loci has to be considered simultaneously. For the important case of two interacting loci under selection for an intermediate phenotype (see the post on Wright) there will be two maxima, two minima, and only one saddle point. The key assumptions therefore do not hold even approximately in this case, and if it is at all common, the number of all-round maxima for the genome as a whole may be very large.
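
The three single-locus cases can be illustrated with the standard expression for mean fitness under random mating, w(p) = p^2 w11 + 2p(1 - p) w12 + (1 - p)^2 w22, where p is the frequency of one allele. The sketch below assumes two alleles, Hardy-Weinberg proportions and constant genotype fitnesses, and counts the end points p = 0 and p = 1 as level points, as the text does; the names are illustrative.

```python
def level_points(w11, w12, w22):
    """Classify the maxima and minima of mean fitness for p between 0 and 1."""
    if w11 == w12 == w22:
        return {"maxima": [], "minima": []}  # neutral locus: flat landscape
    # dw/dp = 2 * (p * curvature + w12 - w22), so curvature fixes the shape.
    curvature = w11 - 2 * w12 + w22
    if curvature != 0:
        p_star = (w22 - w12) / curvature  # where dw/dp = 0
        if 0 < p_star < 1:
            if curvature < 0:
                # Heterozygote superior: interior peak, both end points are minima.
                return {"maxima": [p_star], "minima": [0.0, 1.0]}
            # Heterozygote inferior: interior trough, both end points are maxima.
            return {"maxima": [0.0, 1.0], "minima": [p_star]}
    # No interior level point: mean fitness rises steadily towards the better homozygote.
    if w11 > w22:
        return {"maxima": [1.0], "minima": [0.0]}
    return {"maxima": [0.0], "minima": [1.0]}

if __name__ == "__main__":
    print(level_points(1.0, 0.9, 0.5))  # heterozygote intermediate
    print(level_points(0.8, 1.0, 0.6))  # heterozygote superior
    print(level_points(1.0, 0.5, 1.0))  # heterozygote inferior
```

In no case does a single two-allele dimension yield more than two maxima or two minima.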

It has indeed been claimed (Gavrilets p.37) that the number of maxima is bound to rise with the number of dimensions. But as already discussed in connection with Kauffman's systems, there is no necessity about this: it is quite easy to conceive of a system with only one all-round maximum.

The accessibility and stability of peaks

From an evolutionary point of view, what is important is not just the number of adaptive peaks, but whether they are accessible to the population - i.e. whether the population will evolve towards them - and whether, if the population reaches them, they will be stable under disturbances such as temporary changes in the environment or influxes of migrants. For both purposes, in a frequency landscape we need to consider the 'zone of attraction' of the peaks, i.e. the range of gene frequencies within which the population will move towards the peak under the influence of natural selection. I have not found much discussion of this issue in the literature (which, as I have said, deals mainly with genotype rather than frequency landscapes), but a few general points seem clear.

First, we expect that, other things being equal, higher peaks will have wider zones of attraction. In geometrical terms, if two solid figures have the same shape, the taller figure will have the larger base. In genetic terms, the higher the fitness of a genotype relative to the average fitness of the population, the wider will be the range of gene frequencies within which the genes making up that genotype will be positively selected.
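
A one-locus example makes the point concrete. When the heterozygote is less fit than both homozygotes there are two peaks, at fixation of either allele, separated by an unstable equilibrium. The sketch below uses the standard one-generation selection recursion under random mating and is an illustration only; the fitness values 1.0, 0.5 and 0.8 and the function names are assumptions chosen for the example.

```python
def next_frequency(p, w11, w12, w22):
    """Frequency of allele A1 after one generation of selection (random mating)."""
    q = 1.0 - p
    mean_w = p * p * w11 + 2 * p * q * w12 + q * q * w22
    # A1 changes in proportion to its marginal fitness relative to the mean.
    return p * (p * w11 + q * w12) / mean_w

def final_frequency(p, w11, w12, w22, generations=500):
    """Iterate selection to see which peak the population climbs."""
    for _ in range(generations):
        p = next_frequency(p, w11, w12, w22)
    return p

def unstable_equilibrium(w11, w12, w22):
    """Boundary between the two zones of attraction when w12 is below both homozygotes."""
    return (w22 - w12) / (w11 - 2 * w12 + w22)

if __name__ == "__main__":
    w11, w12, w22 = 1.0, 0.5, 0.8  # assumed genotype fitnesses
    p_star = unstable_equilibrium(w11, w12, w22)
    print("boundary:", p_star)
    print("start at 0.40:", final_frequency(0.40, w11, w12, w22))
    print("start at 0.35:", final_frequency(0.35, w11, w12, w22))
```

With these values the boundary lies at p = 0.375, so the higher peak (fitness 1.0) attracts populations from 62.5% of the frequency range and the lower peak (fitness 0.8) from only 37.5%.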

Second, peaks will have a wider zone of attraction if their component genes have an advantage in the heterozygote as well as the homozygote state. If the optimum genotypes contain recessive homozygotes, the genotypes will be rare, and therefore will not contribute much to the fitness of their component alleles, until the relevant alleles are already frequent in the population.

Third, even if a peak has very high fitness, it will not have a wide zone of attraction if the high fitness depends on the epistatic combination of a large number of alleles which do not otherwise have a fitness advantage. In such a case, the advantageous combinations will not appear with significant frequency in the population until all of the component genes already have a high frequency. The peak will be like a spike with a narrow base. Such a peak will be neither easily accessible nor stable, since even if the peak is reached, any fluctuation in the landscape is liable to push the population out of the zone of attraction.

Finally, whether or not a peak is easily accessible to a population depends on the population's current gene frequencies. Here it should be noted that in most of the plausible scenarios for multiple fitness peaks, such as Wright's favourite example of traits under stabilising selection, some of the alleles in the optimum genotypes will (at the peak) be fixed in the population, with alternative peaks at opposite sides or corners of the landscape. (This fact tends to be obscured by illustrative diagrams, including Wright's, which usually show peaks somewhere in the middle of the landscape.) If alleles are fixed, the population can only move to another peak if new alleles are introduced by mutation or migration. These new alleles will be opposed by selection unless the environment changes so that the peak itself shifts. In order to move to another peak without migration or a change in environment, a long period of genetic drift, opposed by selection, will be required unless the population is very small. This is one of the key issues of credibility with Wright's shifting balance theory in its original form.

Note 1: Kauffman and Levin, pp.20-21. There might be a suspicion of fallacy somewhere in this argument, as the probability that a genotype is a local optimum is not independent of the probability for other genotypes. It would certainly be fallacious to conclude that there is a probability [1/(N + 1)]^[2^N] that all of the genotypes are local optima, since this is impossible. However, Kauffman and Levin's formula for the number of local optima appears to be valid.

Note 2: Kauffman p.42. Kauffman's description of the model is very concise and not ideally clear, partly because of ambiguity in his use of the terms 'gene', 'allele' and 'locus'. But I think my interpretation is consistent with what Kauffman and others say about the NK model.

Note 3: since Fisher gave no reasons for his claim, this is just speculation. He may quite possibly have had other reasons, but didn't spell them out. In his statistical work Fisher was very familiar with applications of N-dimensional geometry, so he would have had a better understanding than most people of the properties of high-dimensional landscapes.

References

Sergey Gavrilets, Fitness Landscapes and the Origin of Species, 2004.
Stuart Kauffman, The Origins of Order, 1993.
Stuart Kauffman and Simon Levin, 'Towards a general theory of adaptive walks on rugged landscapes', J. Theoretical Biology, 1987, 128, 11-45.
George R. McGhee, The Geometry of Evolution: Adaptive Landscapes and Theoretical Morphospaces, 2007.

Labels: Burbridge, Population genetics