Thursday, July 28, 2005

Battle of the disciplines   posted by Razib @ 7/28/2005 08:24:00 PM

I'm conflicted. Discovering functional relationships: biochemistry versus genetics:

Biochemists and geneticists, represented by Doug and Bill in classic essays, have long debated the merits of their methods. We revisited this issue using genomic data from the budding yeast, Saccharomyces cerevisiae, and found that genetic interactions outperformed protein interactions in predicting functional relationships between genes. However, when combined, these interaction types yielded superior performance, convincing Doug and Bill to call a truce.

I've cut & pasted the text below; it works out in the end.


For more than ten years, Doug, a retired biochemist, and Bill, a retired geneticist, have lived on a hill overlooking a car factory, debating their strategies for reverse engineering a car. Doug advocated rolling up his sleeves, getting under the hood and determining how the parts fit together. Bill preferred tying the hands of a different car-factory worker each morning, then relaxing with a cup of coffee and later examining the cars that emerged from the factory.

One day, Doug and Bill strolled over the next hill. In the midst of debate, they encountered Sharyl, a graduate student in computational genomics. Having overheard their debate, she interjected, ‘I don't know much about cars, but I detect an analogy to biochemistry and genetics. I'm trying to discover functional relationships between genes and proteins in yeast and I wonder which of your strategies would work best.’

Differing approaches to determining gene function

To discover functional relationships, Doug would ask, ‘Which proteins physically interact with my favorite protein?’ By contrast, Bill would perturb the DNA sequence of a gene and observe the consequences in vivo, asking ‘What are the genetic interaction partners of my favorite gene?’ In other words, ‘Which genes produce surprising phenotypes if mutated in combination with my favorite gene?’ Sharyl described how the fields of biochemistry and genetics had ‘gone genomic,’ scaling up their classical approaches to discover functional relationships with ever-greater efficiency. Their resulting systematic studies offered a playing field on which to assess Doug and Bill's dilemma. Sharyl then wondered, ‘Which type of interaction – protein or genetic – is better at revealing functional relationships?’ She pulled out her laptop computer and set to work (Figure 1).

Protein versus genetic interactions in predicting functional relationships

Because ‘gene function’ is vaguely defined, Sharyl used the Gene Ontology (GO) vocabulary, which describes gene products in terms of biological process, cellular component and molecular function [1,2]. She defined three measures of functional relatedness for a pair of genes: (i) shared GO biological process (shared process); (ii) shared GO cellular component (shared component); and (iii) shared GO molecular function (shared function). For example, if two genes were assigned to the same GO biological process category, Sharyl considered the gene pair to have a ‘shared process’. To avoid associations between genes in broadly defined categories, she considered only specific GO categories – those to which 200 or fewer genes (out of ~6000 total yeast genes) were assigned, including genes assigned to more specific daughter categories. To represent the biochemists, she chose a high-confidence protein-interaction data set based on affinity purification followed by mass spectrometry (APMS) [3]. For the geneticists, she fielded a recent systematic genetic-interaction data set [4] (Tables 1 and 2 in the supplementary data online; Box 1).
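The ‘shared process’ definition above can be sketched in a few lines. This is only an illustration, not the authors' code: the gene names and category labels are made up, and the annotation table is assumed to be already propagated so that a gene assigned to a daughter category also carries its parent categories.

```python
from itertools import combinations

def specific_categories(annotations, max_genes=200):
    """Return the GO categories annotated to at most `max_genes` genes.

    `annotations` maps gene -> set of GO category labels. It is assumed to be
    propagated already, so genes in daughter categories count toward parents.
    """
    counts = {}
    for cats in annotations.values():
        for cat in cats:
            counts[cat] = counts.get(cat, 0) + 1
    return {cat for cat, n in counts.items() if n <= max_genes}

def shared_pairs(annotations, max_genes=200):
    """Return the gene pairs that share at least one specific category."""
    specific = specific_categories(annotations, max_genes)
    shared = set()
    for a, b in combinations(sorted(annotations), 2):
        if annotations[a] & annotations[b] & specific:
            shared.add((a, b))
    return shared

# Toy data (hypothetical genes and categories) with a tiny threshold of 2.
toy = {
    "geneA": {"broad", "repair"},
    "geneB": {"broad", "repair"},
    "geneC": {"broad", "transport"},
}
toy_shared = shared_pairs(toy, max_genes=2)
```

In the toy table every gene carries the ‘broad’ label, so that category is too general to count; only geneA and geneB end up as a shared pair, through ‘repair’.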

Box 1. Protein and genetic-interaction screens

Synthetic genetic array (SGA) analysis is a high-throughput method that assesses pairs of genes for genetic interaction [4,19]. A strain carrying a mutated query gene is crossed to an array of ~4700 strains, each mutated in a different non-essential yeast gene. The resulting double mutants are then assessed for fitness. Slow growth or lethality relative to each of the single-mutant strains is declared synthetic sickness or lethality. In the SGA data set used here, 159 query genes were crossed to the array, resulting in ~730 000 gene pairs tested for genetic interaction. Based on this data set, the genetic network is between two and 54 times more dense than the protein-interaction network.

Affinity purification followed by mass spectrometry (APMS) is used for high-throughput discovery of physical protein interactions. A ‘bait’ protein is precipitated in a complex with its interacting proteins. Members of this ‘pulled-down’ complex are then identified by mass spectrometry. The two large APMS studies in yeast are known as the tandem affinity purification (TAP) [3] and high-throughput mass spectrometric protein complex identification (HMS-PCI) [6] studies. In both studies, the data can be interpreted in two ways. The spoke interpretation defines an interaction between a bait protein and each protein it pulls down. The matrix interpretation, however, counts interactions between all pairs of proteins pulled down by a bait. In the TAP study, bait constructs were integrated into the yeast genome and expression was controlled by an endogenous promoter. In the HMS-PCI study, however, the bait construct was plasmid-borne and expression was controlled by a robust exogenous promoter. Thus, the TAP data set is more likely to be physiologically relevant, although the HMS-PCI study could detect interactions between gene products not normally expressed in the condition examined. The TAP and HMS-PCI data sets employed 1167 and 725 baits, respectively. A gene pair was considered assessed for protein interaction if at least one gene of the pair was a bait and the other was not filtered out as a ‘promiscuous prey’ [6].
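The difference between the spoke and matrix interpretations is easy to see in a small sketch. The protein names below are hypothetical, and treating the bait as a member of its own pulled-down complex in the matrix case is an assumption made here for illustration.

```python
from itertools import combinations

def spoke_pairs(purifications):
    """Spoke: one interaction between each bait and each protein it pulls down."""
    pairs = set()
    for bait, preys in purifications.items():
        for prey in preys:
            if prey != bait:
                pairs.add(frozenset((bait, prey)))
    return pairs

def matrix_pairs(purifications):
    """Matrix: interactions between all pairs of proteins in a purification.

    Assumption: the bait is counted as part of the complex it pulls down.
    """
    pairs = set()
    for bait, preys in purifications.items():
        members = sorted(set(preys) | {bait})
        for a, b in combinations(members, 2):
            pairs.add(frozenset((a, b)))
    return pairs

# Toy purification: one hypothetical bait that pulls down three preys.
toy_purifications = {"bait1": ["preyA", "preyB", "preyC"]}
spoke = spoke_pairs(toy_purifications)
matrix = matrix_pairs(toy_purifications)
```

For a bait with three preys, the spoke reading gives three pairs and the matrix reading gives six, because the three prey-prey pairs are added. The matrix set always contains the spoke set under the assumption above.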

Yeast-two-hybrid (Y2H) is a high-throughput method for assessing direct physical interaction between two proteins (although indirect ‘bridged’ interactions can also be detected). Here our Y2H data set consisted of the union of the interactions reported by Uetz et al. [18] and the ‘core’ version (corresponding to interactions detected at least three times) of the data set produced by Ito et al. [17].

To level the playing field, she considered only the 104 409 gene pairs (the ‘arena’) assessed by both approaches and for which both genes in each pair had a GO annotation. In this arena, the number of gene pairs sharing a specific GO process, component or function was 3841, 1803 and 1139, respectively. The arena contained 48 biochemical interactions and 729 genetic interactions, derived primarily from screens involving the 17 genes used both as baits in the protein-interaction screens and as query genes crossed to 4500 mutants in synthetic genetic array (SGA) analysis. Interestingly, there was no overlap between the protein and genetic interactions (Table 3, supplementary data online). A previous related study [5] did not consider whether gene pairs had been assessed for both types of interaction and used literature-derived interaction data, which are subject to inspection bias.
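A minimal sketch of how such an arena could be assembled follows. The gene names are placeholders and the helper names are hypothetical; the only logic taken from the text is that a pair must have been assessed by both kinds of screen and that both genes must carry a GO annotation.

```python
def assessed_pairs(screened, partners):
    """Pairs in which one gene was screened (bait or query gene) and the
    other was an eligible partner (unfiltered prey or array strain)."""
    return {frozenset((s, p)) for s in screened for p in partners if s != p}

def build_arena(protein_assessed, genetic_assessed, annotated_genes):
    """Keep pairs assessed by both approaches whose genes are both annotated."""
    both = protein_assessed & genetic_assessed
    return {pair for pair in both if pair <= annotated_genes}

# Toy screens with placeholder gene names.
baits, eligible_preys = {"g1", "g2"}, {"g1", "g2", "g3", "g4"}
queries, array_strains = {"g1"}, {"g2", "g3", "g4", "g5"}
annotated = {"g1", "g2", "g3", "g5"}

protein_assessed = assessed_pairs(baits, eligible_preys)
genetic_assessed = assessed_pairs(queries, array_strains)
arena = build_arena(protein_assessed, genetic_assessed, annotated)
```

Here only the pairs g1–g2 and g1–g3 survive: g1–g4 was assessed by both screens but g4 lacks an annotation, and g1–g5 was never assessed for protein interaction.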

With a few taps on her keyboard, Sharyl let the games begin. Two proteins exhibiting a protein interaction had a shared process, component or function 42% (P=2e-17), 31% (P=2e-15) and 29% (P=1e-16) of the time, respectively. Genetic interactions were uniformly less accurate indicators of shared function, with corresponding rates of 19% (P=2e-63), 15% (P=2e-66) and 8% (P=2e-28). However, genetic interactions detected gene pairs with shared function with much higher sensitivity (4–6%) than biochemical interactions (0.5–1.2%; Table 4 in the supplementary data online). When considering different physical-interaction data sets [3,6] (Box 1), genetic interactions were consistently more sensitive and sometimes more accurate (see Glossary; Table 4, supplementary data online). Thus, it was difficult to declare a clear winner.
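The two measures can be sketched as follows, reading ‘accuracy’ as the fraction of interacting pairs that share a category and ‘sensitivity’ as the fraction of sharing pairs that interact, which is how the passage uses the terms. The text does not say which test produced the P values; the hypergeometric tail below is one common choice for this kind of overlap and is an assumption, not the authors' stated method.

```python
from math import comb

def accuracy(interactions, shared):
    """Fraction of interacting pairs that share a category."""
    return len(interactions & shared) / len(interactions)

def sensitivity(interactions, shared):
    """Fraction of category-sharing pairs detected as interacting."""
    return len(interactions & shared) / len(shared)

def hypergeom_tail(n_arena, n_shared, n_interactions, n_overlap):
    """P(overlap >= n_overlap) when n_interactions pairs are drawn at random
    from an arena of n_arena pairs, n_shared of which share a category.
    (Assumed test; the source does not name the one it used.)"""
    upper = min(n_interactions, n_shared)
    favourable = sum(
        comb(n_shared, i) * comb(n_arena - n_shared, n_interactions - i)
        for i in range(n_overlap, upper + 1)
    )
    return favourable / comb(n_arena, n_interactions)

# Toy arena of 10 placeholder pairs; 5 share a category, 4 interact.
toy_shared = {"p1", "p2", "p5", "p6", "p7"}
toy_interactions = {"p1", "p2", "p3", "p4"}
toy_accuracy = accuracy(toy_interactions, toy_shared)
toy_sensitivity = sensitivity(toy_interactions, toy_shared)
toy_p = hypergeom_tail(10, 5, 4, 2)
```

As a rough check against the reported figures: 42% of 48 protein interactions is about 20 pairs, and 20 of the 3841 shared-process pairs is about 0.5%, which matches the low end of the quoted sensitivity range.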

Combining genetic and protein interactions with other data

Are genetic interactions combined with other types of evidence more informative than protein interactions combined with other evidence? Rather than considering each type of interaction in isolation, several groups have previously combined heterogeneous data, using machine learning approaches to predict some property of a gene pair or to predict gene function [7–12]. Therefore, Sharyl combined multiple types of evidence [11] – including co-localization [13], sequence homology [14], correlated mRNA expression [15,16] and chromosomal distance (Table 5, supplementary data online) – to predict shared function. She chose a previously described probabilistic decision-tree approach [12] and compared performance with and without the benefit of protein and/or genetic-interaction data. For each of shared process, component, and function and for each choice of input data, she performed cross-validation: she randomized all gene pairs in the arena into four groups, and successively scored each group using a model trained on the remaining three. She then compared the prediction score of each gene pair with its corresponding shared process, function or component status. A plot of true- versus false-positive rates revealed that genetic and protein interactions were comparable at low sensitivities; however, as sensitivity increased, genetic-interaction data enhanced performance more than protein-interaction data. This trend was observed for shared process (Figure 2), component (Figure 1a, supplementary data online) and function (Figure 1b, supplementary data online). Doug, the biochemist, began to despair.
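The four-group cross-validation and the true- versus false-positive plot can be sketched as below. This is a toy illustration only: a simple per-pattern frequency model stands in for the probabilistic decision tree of [12], and the evidence patterns and labels are synthetic.

```python
import random

def cross_validated_scores(pairs, features, labels, train, k=4, seed=0):
    """Score each pair with a model trained on the other k-1 groups."""
    order = list(pairs)
    random.Random(seed).shuffle(order)          # randomize pairs into groups
    folds = [order[i::k] for i in range(k)]
    scores = {}
    for i, held_out in enumerate(folds):
        training = [p for j, fold in enumerate(folds) if j != i for p in fold]
        model = train(training, features, labels)
        for p in held_out:
            scores[p] = model(features[p])
    return scores

def train_frequency_model(training, features, labels):
    """Stand-in learner (not a decision tree): the score for an evidence
    pattern is the fraction of training pairs with that pattern that are
    positive, falling back to the overall positive rate if unseen."""
    pos, tot = {}, {}
    for p in training:
        key = features[p]
        tot[key] = tot.get(key, 0) + 1
        pos[key] = pos.get(key, 0) + labels[p]
    prior = sum(labels[p] for p in training) / len(training)
    return lambda key: pos[key] / tot[key] if key in tot else prior

def roc_points(scores, labels):
    """(false-positive rate, true-positive rate) after each distinct score,
    from highest to lowest. Needs at least one positive and one negative."""
    n_pos = sum(labels[p] for p in scores)
    n_neg = len(scores) - n_pos
    ranked = sorted(scores, key=lambda p: scores[p], reverse=True)
    points, tp, fp, i = [(0.0, 0.0)], 0, 0, 0
    while i < len(ranked):
        threshold = scores[ranked[i]]
        while i < len(ranked) and scores[ranked[i]] == threshold:
            if labels[ranked[i]]:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points

# Synthetic arena: 40 pairs, evidence = (genetic interaction?, protein interaction?).
pairs = list(range(40))
features = {p: (p % 2 == 0, p % 5 == 0) for p in pairs}
labels = {p: 1 if p % 4 == 0 else 0 for p in pairs}

scores = cross_validated_scores(pairs, features, labels, train_frequency_model)
curve = roc_points(scores, labels)
```

Repeating the run with a feature set that omits the genetic or the protein flag, and comparing the resulting curves, would mirror the with-and-without comparison described above.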

Before Bill could begin to gloat, however, Sharyl showed that genetic- and protein-interaction data together gave markedly better results than either alone, suggesting that each offers distinctly different types of information. Although protein interactions can represent associations between genes in the same complex or physically connected pathway, genetic interactions can additionally reflect relationships between genes in physically non-interacting pathways. She repeated this analysis with another APMS protein-interaction data set [6] and then with the union of two yeast-two-hybrid (Y2H) data sets [17,18] (Tables 1 and 3, and Figures 2 and 3 in the supplementary data online), altering the arena appropriately. In each case, genetics beat biochemistry by a slim margin, but the combination of these complementary interaction types outperformed either alone. Sharyl's results convinced Doug and Bill to shake hands and head back over the hill … until new data or new technology call for a rematch.