Use of signals of positive and negative selection

Full text

Turn on search term navigation

Introduction

Genetic, epigenetic, transcriptomic, and proteomic changes driving carcinogenesis

In the last two decades, the rapid advance in genomics, epigenomics, transcriptomics, and proteomics permitted an insight into the molecular basis of carcinogenesis. These studies have confirmed that tumors evolve from normal tissues by acquiring a series of genetic, epigenetic, transcriptomic, and proteomic changes with concomitant alterations in the control of the proliferation, survival, and spread of affected cells.

The genes that play key roles in carcinogenesis are usually assigned to two major categories: proto-oncogenes that have the potential to promote carcinogenesis when activated or overexpressed and tumor suppressor genes (TSGs) that promote carcinogenesis when inactivated or repressed.

Several alternative mechanisms can modify the structure or expression of a gene in a way that promotes carcinogenesis. These include subtle genetic changes (single nucleotide substitutions, short indels), major genetic events (deletion, amplification, translocation and fusion of genes to other genetic elements), as well as epigenetic changes affecting the expression of cancer genes. These mechanisms are not mutually exclusive: there are many examples illustrating the point that multiple types of the above mechanisms may convert the wild-type form of a cancer gene to a driver gene.

Exomic studies of common solid tumors revealed that usually several cancer genes harbor subtle somatic mutations (point mutations, short deletions, and insertions) in their translated regions but malignancy-driving subtle mutations can also occur in all genetic elements outside the coding region, namely in enhancer, silencer, insulator, and promoter regions as well as in 5'- and 3'-untranslated regions. Intron or splice site mutations that alter the splicing pattern of cancer genes can also drive carcinogenesis (Diederichs et al., 2016). A recent study has presented a comprehensive analysis of driver point mutations in non-coding regions across 2658 cancer genomes (Rheinbay et al., 2020). A noteworthy example of how subtle mutations in regulatory regions may activate proto-oncogenes is the telomerase reverse transcriptase gene TERT that encodes the catalytic subunit of telomerase. Recurrent somatic mutations in melanoma and other cancers in the TERT promoter cause tumor-specific increase of TERT expression, resulting in the immortalization of the tumor cell (Heidenreich et al., 2014).

In addition to subtle mutations, tumors also accumulate major chromosomal changes (Li et al., 2020). Most solid tumors display widespread changes in chromosome number, as well as chromosomal deletions and translocations (Lengauer et al., 1998). Homozygous deletions of a few genes frequently drive carcinogenesis and the target gene involved in such deletions is always a TSG (Cheng et al., 2017). Somatic copy-number alterations, amplifications of cancer genes are also widespread in various types of cancers. Amplifications usually contain an oncogene (OG) whose protein product is abnormally active simply because the tumor cell contains 10–100 copies of the gene per cell, compared with the two copies present in normal cells (Beroukhim et al., 2010; Verhaak et al., 2019). Chromosomal translocations may also convert wild-type forms of TSGs into forms that drive carcinogenesis if the translocation inactivates the genes by truncation or by separating them from their promoter. Similarly, translocations may activate proto-oncogenes by changing their regulatory properties (Haller et al., 2019).

Epigenetic mechanisms such as DNA methylation and histone modifications may also alter the activity of cancer genes. It is now widely accepted that genetic and epigenetic changes go hand in hand in carcinogenesis: numerous genes involved in shaping the epigenome are mutated in common human cancers, and epigenetic changes affect many genes carrying driver mutations (Yang and Yu, 2013; Chen et al., 2017b; Di Domenico et al., 2017; Roussel and Stripay, 2018; Chatterjee et al., 2018). For example, promoter hypermethylation events may promote carcinogenesis if they lead to silencing of TSGs; the tumor-driving role of promoter methylation is obvious in the case of TSGs that are frequently inactivated by mutations in cancer (Pfeifer, 2018). Conversely, there is now ample evidence that promoter hypomethylation can promote carcinogenesis if it leads to increased expression of proto-oncogenes (Van Tongelen et al., 2017).

Non-coding RNAs (ncRNAs) also play key roles in carcinogenesis (Slack and Chinnaiyan, 2019). An explosion of studies has shown that – based on complementary base pairing – ncRNAs may function as OGs (by inhibiting the activity of TSGs), or as tumor suppressors (by inhibiting the activity of OGs or tumor essential genes [TEGs]).

Alterations in the splicing of primary transcripts of protein-coding genes also contribute to carcinogenesis. Recent studies on cancer genomes have revealed that recurrent somatic mutations of genes encoding RNA splicing factors (e.g. SF3B1, U2AF1, SRSF2, ZRSR2) lead to altered splice site preferences, resulting in cancer-specific mis-splicing of genes. In the case of proto-oncogenes, changes in the splicing pattern may generate active oncoproteins, whereas abnormal splicing of TSGs is likely to generate inactive forms of the tumor suppressor protein (Dvinge et al., 2016).

There is now convincing evidence that dysregulation of processes responsible for proteostasis also contributes to the development and progression of numerous cancer types (Mofers et al., 2017; Chen et al., 2017c; Voutsadakis, 2017). Recent studies on tumor tissues have revealed that genetic alterations and abnormal expression of various components of the protein homeostasis pathways (e.g. FBXW7, VHL) contribute to progression of human cancers by excessive degradation of tumor-suppressor molecules or through impaired disposal of oncogenic proteins (Ge et al., 2018; Bernassola et al., 2019).

Hallmarks of cancer and the function of genes involved in carcinogenesis

Hanahan and Weinberg have defined a set of hallmarks of cancer that allow the categorization of cancer genes with respect to their role in carcinogenesis (Hanahan and Weinberg, 2011). These hallmarks describe the biological capabilities usually acquired during the evolution of tumor cells: these include sustained proliferative signaling, evasion of growth suppressors, evasion of cell death, acquisition of replicative immortality, acquisition of capability to induce angiogenesis and activation of invasion and metastasis. Underlying all these hallmarks are defects in genome maintenance that help the acquisition of the above capabilities. Additional emerging hallmarks of potential generality have been suggested to include tumor promoting inflammation, evasion of immune destruction and reprogramming of energy metabolism in order to most effectively support neoplastic proliferation (Hanahan and Weinberg, 2011).

Figure 1 summarizes our current view of the cellular processes that play key roles in tumor evolution to emphasize their contribution to the various major hallmarks of cancer. Changes in the maintenance of the genome, epigenome, transcriptome, and proteome occupy a central position because they increase the chance that various constituents of other cellular pathways will experience alterations that favor the acquisition of capabilities that permit the proliferation, survival, and metastasis of tumor cells.

Figure 1.

Changes of key cellular processes contributing to carcinogenesis.

The central circle refers to processes involved in the maintenance of the integrity of the genome, epigenome, transcriptome, and proteome: defects in these processes increase the chance that genes and proteins of other cellular pathways (represented by segments of the outer circle) will suffer alterations that favor the acquisition of capabilities that permit the proliferation, survival, and metastasis of tumor cells.

Chronology of tumor evolution: initiation and progression

In the first phase of carcinogenesis, a cell may acquire a mutation that permits it to proliferate abnormally, and in the next phase, other mutations allow the expansion of cell number and this process of mutations (and associated epigenetic, transcriptomic and proteomic alterations) continues, thus generating a primary tumor that can eventually metastasize to distant organs. Recent studies on the chronology and genomic landscape of the events that drive carcinogenesis suggest that complex structural changes of the genome occur early, whereas point mutations occur in later disease phases (Maura et al., 2019; Voronina et al., 2020).

According to current estimates, the number of cancer driving mutations needed for the full development of cancer ranges from two-eight depending on cancer type (Vogelstein and Kinzler, 2015; Anandakrishnan et al., 2019). A recent integrative analysis of 2658 whole-cancer genomes and their matching normal tissues across 38 tumor types revealed that, on average, cancer genomes contain four to five driver mutations (Campbell et al., 2020).

Although the temporal order of the mutations affecting genes of key pathways differs among cancer types, it appears that a common feature is that mutations of genes that regulate apoptosis occur in the early phases of tumor progression, whereas mutations of genes involved in invasion pathways occur only in the last stages of carcinogenesis (Gerstung et al., 2011). It has been suggested that the reason why the loss of apoptotic control is a critical step for initiating cancer is that the larger the surviving cell population, the higher the number of cells at risk of acquiring additional mutations.

Analyses of the mutation landscapes and evolutionary trajectories of various tumor tissues have identified BRAF, KRAS, TP53, RB, or APC as the key genes whose mutation is most likely to initiate carcinogenesis, permitting the cell to divide abnormally (Vogelstein and Kinzler, 2015). In the case of ovarian cancers, TP53 mutation is believed to be the earliest tumorigenic driver event, with presence in nearly all cases of ovarian cancer (Bashashati et al., 2013). The prevalence of TP53 mutations and BRCA deficiency in these tumors leads to incompetent DNA repair promoting subsequent steps of carcinogenesis. Studies on the evolution of melanoma from precursor lesions have revealed that the vast majority of melanomas harbor TERT promoter mutations, indicating that these immortalizing mutations are selected at an unexpectedly early stage of neoplastic progression (Shain et al., 2015).

The life history and evolution of mutational processes and driver mutation sequences of 38 types of cancer has been analyzed recently by whole-genome sequencing analysis of 2658 cancers. This study has shown that early oncogenesis is characterized by mutations in a constrained set of driver genes and that the driver mutations that most commonly occur in a given cancer also tend to occur the earliest (Gerstung et al., 2020).

Cancer genes and passenger genes

The prominent role of KRAS and TP53 genes in initiating carcinogenesis has been evident from the observation that their mutation rate in tumors far exceeds those of other genes, suggesting that their mutations are subject to positive selection during tumor evolution.

Several types of approaches exploit this principle for the identification of genes that drive carcinogenesis: the rate of mutation of ‘driver genes’ must be significantly higher in the tumor tissue than those of ‘passenger genes’ (PGs) that have no role in the development of cancer but simply happen to mutate in the same tumor (Parmigiani et al., 2009; Meyerson et al., 2010).

Unfortunately, methods based on mutation frequency alone cannot reliably indicate which genes are cancer drivers because the background mutation rates differ significantly as a consequence of intrinsic characteristics of DNA sequence and chromatin structure (Michaelson et al., 2012). Intrinsic mutation hotspots are mutation hotspots that depend on the nucleotide sequence context, the mechanism of mutagenesis and the action of the repair and replication machineries (Rogozin and Pavlov, 2003). Genes enriched in intrinsic mutation hotspots may accumulate mutations at a significantly higher rate than other genes, creating the illusion of positive selection; based on recurrent mutations they may be mistakenly identified as cancer driver genes (Carter, 2019; Buisson et al., 2019).

In principle, we can avoid this danger if we compare the mutation pattern of the gene in the tumor tissue with that in the normal tissue the tumor has originated from. However, since the rate of mutation in such hotspots depends not only on the nucleotide sequence but also on the mechanism of mutagenesis and the integrity of DNA repair pathways (Buisson et al., 2019; Poulos et al., 2018) mutation hotspots that arise during carcinogenesis could still create the illusion of positive selection.

Chromatin organization also has a major influence on regional mutation rates in human cancer cells (Schuster-Böckler and Lehner, 2012; Gonzalez-Perez et al., 2019). Since large-scale chromatin features, such as replication time and accessibility influence the rate of mutations, this may hinder the distinction of cancer driver genes whose high mutation rate reflects positive selection and PGs whose high mutation rate is the result of the distinctive features of the chromatin region in which they reside. Moreover, since the cell-of-origin chromatin organization shapes the mutational landscape, rates of somatic mutagenesis of genes in cancer are highly cell-type-specific (Polak et al., 2015). Actually, since regional mutation density of ‘passenger’ mutations across the human chromosomes correlates with the cell type the tumor had originated from, this feature may be used to classify human tumors (Salvadores et al., 2019).

Through the comparison of the exome sequences of 3083 tumor-normal pairs Lawrence et al., 2013 have discovered an extraordinary variation in mutation frequency and spectrum within cancer types across the genome, which is strongly correlated with DNA replication timing and transcriptional activity. The authors have shown that by incorporating mutational heterogeneity into their analyses, they could eliminate many of the apparent artefactual findings, improving the identification of genes truly associated with cancer. In a more recent study Lawrence et al., 2014 compared the frequency of somatic point mutations in exome sequences from 4742 human cancers and their matched normal-tissue samples across 21 cancer types and identified 33 genes that were not previously known to be significantly mutated in cancer. They have concluded that 224 genes are significantly mutated in one or more tumor types.

However, since background mutational frequency estimates are not sensitive enough, the list of driver genes (defined as genes with increased somatic mutation rate) is likely to be incomplete, but may also contain false positives. To overcome these limitations of mutation rate-based approaches, several methods use additional features that may distinguish driver genes and PGs. A major group of such approaches incorporates observations about the impact of mutations on the structure and function of well-characterized proteins encoded by proto-oncogenes and TSGs. Several computational methods aim to identify driver missense mutations most likely to generate functional changes that causally contribute to tumorigenesis (Kaminker et al., 2007; Carter et al., 2009; Nussinov et al., 2019).

In a different type of approach Youn and Simon, 2011 identified cancer driver genes as those for which the non-silent mutation rate is significantly greater than a background mutation rate estimated from silent mutations, indicating that the non-silent mutations are subject to positive selection. The authors have identified 28 genes as driver genes, the majority of the significant matches (e.g. EGFR, CDKN2A, KRAS, STK11, TP53, NF1, RB1 PTEN, and NRAS), were well-characterized OGs or TSGs known from earlier studies.

In a more recent study, Zhou et al., 2017 have identified 365 genes for which the ratio of the nonsynonymous to synonymous substitution rate was significantly increased, suggesting that they are subject to the positive selection of driver mutations. However, an obvious limitation of such approaches is that they implicitly assume that synonymous substitutions are selectively neutral and therefore the ratio of the nonsynonymous to synonymous substitution rate properly monitors selection. This is not necessarily true: some synonymous mutations may have a significant impact on splicing, RNA stability, RNA folding and translation of the transcript of the affected gene and may thus actually act as driver mutations (Supek et al., 2014; Hurst and Batada, 2017; Sharma et al., 2019). Furthermore, some mutation hotspots may significantly increase the rate of synonymous mutations therefore a low ratio of nonsynonymous to synonymous substitution rate does not necessarily indicate the absence of positive selection or the action of purifying selection.

Vogelstein et al., 2013 have used a heuristic approach to identify cancer driver genes. Since the patterns of mutations in the first and best-characterized OGs and TSGs were found to be highly characteristic and nonrandom, the authors assumed that the same characteristics are generally valid and may be used to identify previously uncharacterized cancer genes. For example, since many known OGs were found to be recurrently mutated at the same amino acid positions, to classify a gene as an OG, it was required that >20% of the recorded mutations in the gene are at recurrent positions and are missense. Similarly, since in the case of known tumor suppressors the driver mutations most frequently truncate the tumor suppressor proteins, to be classified as a TSG, it was required that >20% of the recorded mutations in the gene are truncating (nonsense or frameshift) mutations. Along these lines, Vogelstein et al., 2013 have analyzed the patterns of the subtle mutations in the Catalogue of Somatic Mutations in Cancer (COSMIC) database to identify driver genes. As a proof of the reliability of this ‘20/20 rule’, the authors emphasized that all well-documented cancer genes passed these criteria (Vogelstein et al., 2013). Although this indicates that the approach detects known cancer genes, it does not guarantee that it detects all driver genes. Acknowledging that additional cancer driver genes might exist, the authors have introduced the term ‘Mut-driver gene’ for genes that contain a sufficient number or type of driver gene mutations to distinguish them from other genes, whereas for cancer genes that are expressed aberrantly in tumors but not frequently mutated they proposed the term ‘Epi-driver gene’.

Based on these analyses, the authors have concluded that out of the 20,000 human protein-coding genes, only 125 genes qualify as Mut-driver genes, of these, 71 are TSGs and 54 are OGs (Vogelstein et al., 2013). The authors have expressed their conviction that nearly all genes mutated at significant frequencies had already been identified and that the number of Mut-driver genes is nearing saturation. This conclusion may not be justified since the criteria used to identify OGs and tumor suppressors appear to be too stringent and somewhat arbitrary.

In search of additional driver genes, Tamborero et al., 2013 employed five complementary methods to find genes showing signals of positive selection and identified a list of 291 ‘high-confidence cancer driver genes’ acting on 3205 tumors from 12 different cancer types. Bailey et al., 2018 used multiple advanced algorithms to identify cancer driver genes and driver mutations. Based on their PanCancer and PanSoftware analysis spanning 9423 tumor exomes, comprising all 33 of The Cancer Genome Atlas projects and using 26 computational tools they have identified 299 driver genes showing signs of positive selection. Their sequence and structure-based analyses detected >3400,400 putative missense driver mutations and 60–85% of the predicted mutations were validated experimentally as likely drivers.

Zhao et al., 2019a have developed driverMAPS (Model-based Analysis of Positive Selection), a model-based approach for driver gene identification that captures elevated mutation rates in functionally important sites and spatial clustering of mutations. The authors have identified 255 known driver genes as well as 170 putatively novel driver genes.

Currently, COSMIC (the Catalogue Of Somatic Mutations In Cancer, https://cancer.sanger.ac.uk/cosmic) is the most detailed and comprehensive resource for exploring the effect of subtle somatic mutations of driver genes in human cancer (Tate et al., 2019) but COSMIC also covers all the genetic mechanisms by which somatic mutations promote cancer, including non-coding mutations, gene fusions, and copy-number variants. In parallel with COSMIC's variant coverage, the Cancer Gene Census (CGC, https://cancer.sanger.ac.uk/census) describes a curated catalogue of genes driving every form of human cancer (Sondka et al., 2018). CGC has recently introduced functional descriptions of how each gene drives disease, summarized into the cancer hallmarks. CGC describes in detail the effect of a total of 719 cancer-driving genes, encompassing Tier 1 genes (574 genes) and a list of Tier 2 genes (145 genes) from more recent cancer studies that show less detailed indications of a role in cancer.

In a different type of approach, Torrente et al., 2016 used comprehensive maps of human gene expression in normal and tumor tissues to identify cancer related genes. These analyses identified a list of genes with systematic expression change in cancer. The authors have noted that the list is significantly enriched with known cancer genes from large, public, peer-reviewed databases, whereas the remaining ones were proposed as new cancer gene candidates. A recent study has provided a comprehensive catalogue of cancer-associated transcriptomic alterations with the top-ranking genes carrying both RNA and DNA alterations. The authors have noted that this catalogue is enriched for cancer census genes (Calabrese et al., 2020).

Using transposon mutagenesis in mice, several laboratories have conducted forward genetic screens and identified thousands of candidate genetic drivers of cancer that are highly relevant to human cancer. The Candidate Cancer Gene Database (CCGD, http://ccgd-starrlab.oit.umn.edu/) is a manually curated database containing a unified description of all identified candidate driver genes (Abbott et al., 2015).

In summary, although a variety of approaches have been developed to identify ‘cancer genes’, there is significant disagreement as to the number of genes involved in carcinogenesis. Some of the studies argue that the number is in the 200–700 range, other approaches suggest that their number may be much higher. Since the ultimate goal of cancer genome projects is to discover therapeutic targets, it is important to identify all true cancer genes and distinguish them from PGs and candidates that do not play a significant role in the process of carcinogenesis.

We must point out, however, that the majority of genomics-based methods were biased as they defined the aim of cancer genomics as the identification of mutated driver genes (equating them with ‘cancer genes’) that are causally implicated in oncogenesis (Futreal et al., 2004). In all these studies, the underlying rationale for interpreting a mutated gene as causal in cancer development is that the mutations are likely to have been positively selected because they confer a growth advantage on the cell population from which the cancer has developed. An inevitable consequence of this focus on positive selection was that most studies neglected the possibility that negative selection may also play a significant role in tumor evolution.

Carcinogenesis as an evolutionary process

In principle, with respect to its effect on carcinogenesis, a somatic mutation may promote or may hinder carcinogenesis or may have no effect on carcinogenesis. In cancer genomics, the mutations that promote carcinogenesis (and are subject to positive selection during tumor evolution) are called ‘driver mutations’ to distinguish them from ‘passenger mutations’ that do not play a role in carcinogenesis (and are not subject to positive or negative selection during tumor evolution). Mutations that impair the growth, survival, and invasion of tumor cells have received much less attention, although they could also play a significant role in shaping the mutation pattern of genes during carcinogenesis. Hereafter, we will refer to this category of mutations as ‘cancer blocking mutations’ because they are deleterious from the perspective of tumor growth.

As discussed above, in cancer genomics, genes are usually assigned to just two categories with respect to their role in carcinogenesis: (1) ‘PGs’ (or bystander genes) that play no significant role in carcinogenesis and their mutations are passenger mutations; (2) ‘driver genes’ that drive carcinogesis when they acquire driver mutations.

The problem with this binary driver gene-PG categorization is that some genes with functions essential for the growth and survival of tumor cells (hereafter referred to as ‘tumor essential genes’) may not easily fit into either category. The coding sequences of driver genes (TSGs, proto-oncogenes), PGs, and TEGs are predicted to experience markedly different patterns of selection during tumor evolution.

The mutation patterns of selectively neutral, bona fide PGs are likely to reflect the lack of positive and negative selection, whereas in the case of TEGs purifying selection is predicted to dominate. In the case of TSGs, the mutation pattern is expected to reflect positive selection for inactivating driver mutations. Proto-oncogenes, however, are expected to show signs of both positive selection for activating mutations and negative selection for inactivating, ‘cancer blocking’ mutations as their activity is essential for their oncogenic role. In the coding regions of proto-oncogenes positive selection for driver mutations is expected to favor nonsynonymous substitutions over synonymous substitutions only at sites that are critical for the novel, oncogenic function. For these sites (and these sites only), the ratio of nonsynonymous to synonymous rates is expected to be significantly greater than one reflecting positive selection. If there are many such sites in a protein, or selection is extremely strong the overall nonsynonymous to synonymous ratio for the entire protein may also be significantly higher than one, otherwise the effect of positive selection on the synonymous to nonsynonymous ratio may be overridden by purifying selection at other sites (Patthy, 1999).

In harmony with some of these expectations, using just the ratio of the nonsynonymous to synonymous substitution rate as a measure of positive or negative selection, Zhou et al., 2017 have shown that in cancer genomes, the majority of genes had nonsynonymous to synonymous substitution rate values close to one, suggesting that they belong to the PG category. The authors have identified a total of 365 potential cancer driver genes that had nonsynonymous to synonymous substitution rate values significantly greater than one (reflecting the dominance of positive selection). Conversely, 923 genes had nonsynonymous to synonymous substitution rate values significantly less than one (reflecting the dominance of negative selection), leading the authors to suggest that these negatively selected genes may be important for the growth and survival of cancer cells.

Pyatnitskiy et al., 2015 have also used the dN/dS ratio (the ratio of nonsynonymous and synonymous substitution rates) as an indicator of selective pressure and have identified 91 protein-coding genes (’essential cancer proteins’) with amino acid sequences under negative selection.

Realizing that genes whose wild-type coding sequences are needed for tumor growth are also of key interest for cancer research, Weghorn and Sunyaev, 2017 have also focused on the role of negative selection in human cancers. The authors have used an approach based on the principle that both positive and negative selection can be inferred by comparing the observed mutation rates to the expectation under the sole action of the mutation process. As the authors have pointed out, identification, and analysis of true negatively selected,’ undermutated’ genes is particularly difficult since the sparsity of mutation data results in lower statistical power, making conclusions less reliable. Although the signal of negative selection was exceedingly weak, the authors have noted that the group of negatively selected candidate genes is enriched in cell-essential genes identified in a CRISPR screen (Wang et al., 2015a), consistent with the notion that one of the potential causes of negative selection is the maintenance of genes that are responsible for basal cellular functions. Based on pergene estimates of negative selection inferred from the pan-cancer analysis the authors have identified 147 genes with significant negative selection. The authors have noted that among the 13 genes showing the strongest signs of negative selection there are several genes (ATAT1, BCL2, CLIP1, GALNT6, CKAP5, and REV1) that are known to promote carcinogenesis.

In a similar work, Martincorena et al., 2017 have used the normalized ratio of non-synonymous to synonymous mutations, to quantify selection in coding sequences of cancer genomes. Using a nonsynonymous-to-synonymous substitution rate value >1 as a marker of cancer genes under positive selection, they have identified 179 cancer genes, with about 50% of the coding driver mutations being found to occur in novel cancer genes. The authors, however, have concluded that purifying selection is practically absent in tumors since nearly all (>99%) coding mutations are tolerated and escape negative selection. The authors have suggested that this remarkable absence of negative selection on coding point mutations in cancer indicates that the vast majority of genes are dispensable for any given somatic lineage, presumably reflecting the buffering effect of diploidy and the inherent resilience and redundancy built into most cellular pathways.

The key message of Martincorena et al., 2017 that negative selection has no role in cancer evolution had a major impact on cancer genomics research as reflected by several commentaries in major journals of the field that have propagated this conclusion (Bakhoum and Landau, 2017; Koch, 2017; Vitale and Galluzzi, 2018).

Some more recent studies, however, contradict this conclusion. Although Zapata et al., 2018 have also used the ratio of nonsynonymous-to-synonymous substitutions to identify genes that are under selection, they have detected significant negative selection in the case of 25 genes. López et al., 2020, focusing on dN/dS values for truncating mutations, have shown that purifying selection of essential genes is significant in early phases of tumor evolution (before whole genome duplications), whereas whole-genome doubling allows the accumulation of deleterious alterations. Tilk et al., 2020 have shown that appreciable negative selection (dN/dS ~ 0.4) is present in tumors with a low mutational burden, while the majority of tumors exhibit dN/dS ratios approaching 1, suggesting that tumors with higher mutational burden do not remove deleterious mutations.

Van den Eynden and Larsson, 2017, however, cautioned that it is crucial to take into account mutational signatures when applying the dN/dS metric to cancer somatic mutation data. For example, the authors have shown that the low dN/dS values observed in malignant melanoma may be due to the predominance of C to T mutations in this tumor and do not necessarily indicate gene essentiality. The authors have also shown that purifying selection is very limited and similar in all tumor types if the dN/dS metric uses mutational signature-derived substitution probabilities.

In view of the contradicting conclusions about the significance of negative selection in tumor evolution, in the present work we have reexamined this question using an approach that attempts to overcome some of the problems highlighted by earlier studies.

First, most studies used a single dN/dS metric to measure nonsynonymous to synonymous substitution rates as indicators of selective pressure and paid less attention to the fact that the strength of purifying selection is an order of magnitude greater for nonsense mutations than for missense mutations (Gorlov et al., 2006). Furthermore, the use of a single dN/dS value for a transcript may preclude the simultaneous detection of positive and negative selection of activating and inactivating mutations, both of which might operate for a given gene. To overcome these limitations, in the present study we have used a clustering-based approach that can detect different signals of selection manifested in rates of nonsense, missense versus silent substitutions in the coding regions of genes.

Second, an inherent problem with the detection of purifying selection in tumor tissues is that putative TEGs are likely to be undermutated relative to PGs and driver genes, resulting in low statistical power of their analyses based on dN/dS metrics. We have reduced this problem by combining subtle somatic mutations from different tumors types and limiting our work to transcripts that have at least 100 somatic mutations in tumors. (Note that the requirement of a minimum number of mutations does not place a theoretical limit on this approach; progress with genome-wide screens and collection of more data is overcoming this limitation.)

In harmony with earlier observations, our analyses have confirmed that the vast majority of human genes are PGs that do not show detectable signals of selection, whereas known TSGs are positively selected for inactivating (primarily nonsense and frame-shift) mutations. Known OGs, however, were found to display signals of both negative selection for inactivating (nonsense, frame-shift) mutations and positive selection for activating (missense) mutations. Improved detection of signals of selection has permitted the identification of a number of novel driver genes that are likely to play important roles in carcinogenesis as TSGs or as OGs.

Significantly, we have identified a cluster of human genes that show clear signs of negative selection during tumor evolution, suggesting that their functional integrity is essential for the growth and survival of tumor cells. The group of negatively selected genes includes genes known to play critical roles in the Warburg effect of cancer cells, others are known to mediate invasion and metastasis of tumor cells, indicating that negatively selected TEGs may prove a rich source for novel targets for tumor therapy.

Results

Distinguishing PGs and cancer genes

The rationale of the analyses described in the present work is that — due to their different roles in carcinogenesis — proto-oncogenes, TSGs, TEGs, and PGs are expected to experience different patterns of selection during tumor evolution and this is reflected in the relative rates of missense, nonsense, and silent mutations of their protein-coding regions. To monitor these differences, we have calculated for each transcript the fraction of somatic substitutions that could be assigned to the silent (fS), misssense (fM), and nonsense (fN) category and analyzed their relative rates. (For details, the reader should consult the Materials and methods section).

Our analyses have shown that in 3D scatter plots of the fS, fM, and fN values of transcripts the majority of genes are present in a central cluster characterized by fS, fM, and fN values close to those expected assuming no mutation bias and absence of selection, consistent with the view that they correspond to PGs (Figure 2). Known OGs, however, were found in a separate cluster characterized by higher fM values, reflecting positive selection for missense mutations, whereas the cluster of known TSGs has higher fN values, reflecting positive selection for truncating nonsense mutations (Figure 2B and C).

Figure 2.

Analyses of fS, fM, and fN parameters of human protein-coding genes of tumor tissues.

The figure shows the results of the analysis of 13,803 transcripts containing at least 100 subtle, confirmed somatic mutations from tumor tissues, including only mutations identified as not single-nucleotide polymorphisms (SNPs). Axes x, y, and z represent the fractions of somatic single-nucleotide substitutions that are assigned to the synonymous (fS), nonsynonymous (fM), and nonsense (fN) categories, respectively. In Panel A, each gray ball represents a human transcript; note that the majority of human genes are present in a dense cluster. Panel B highlights the positions of transcripts of the genes identified by Vogelstein et al., 2013 as oncogenes (OGs, large red balls) or tumor suppressor genes (TSGs, large blue balls). It is noteworthy that these driver genes separate significantly from the central cluster and from each other: OGs have a significantly larger fraction of nonsynonymous, whereas TSGs have significantly larger fraction of nonsense substitutions. Panel C shows data only for candidate cancer genes present in the CG_SO^2SD_SSI^2SD list (see Materials and methods). The positions of novel cancer gene transcripts validated in the present work are highlighted as large green balls.

Known cancer genes also separate from the majority of human genes in 3D scatter plots of rSM, rNM, rNS parameters, defined as the ratio of fS/fM, fN/fM, fN/fS, respectively (Figure 3). In these scatter plots, OGs separate from the central cluster in having lower rSM and rNM values, whereas TSGs have higher rNS and rNM values than those of the central cluster (Figure 3).

Figure 3.

Analyses of rSM, rNM, rNS parameters of human protein-coding genes of tumor tissues.

The figure shows the results of the analysis of 13,803 transcripts containing at least 100 subtle, confirmed somatic mutations from tumor tissues, including only mutations identified as not single-nucleotide polymorphisms (SNPs). Axes x, y, and z represent the rSM, rNM, rNS values defined as the ratio of fS/fM, fN/fM, fN/fS, respectively. Each ball represents a human transcript; the positions of transcripts of the genes identified by Vogelstein et al., 2013 as oncogenes (OGs, large red balls) or tumor suppressor genes (TSGs, large blue balls) are highlighted. Panels A1, A2 show the distribution of the 13,803 transcripts at different magnification. Note that the majority of human genes are present in a dense cluster but known OGs and TSGs separate significantly from the central cluster and from each other. The rNS and rNM values of TSGs are higher, whereas the rSM and rNM values of OGs are lower than those of passenger genes. Panels B1, B2 show data only for candidate cancer genes present in the CG_SO^2SD_SSI^2SD list (see Materials and methods). The positions of novel cancer gene transcripts validated in the present work are highlighted as large green balls.

The separation of known cancer genes from the majority of human genes is even more manifest in 3D scatter plots of parameters rSMN, rMSN, and rNSM defined as the ratio of fS/(fM+fN), fM/(fS+fN), and fN/(fS+fM), respectively (Figure 4). In these plots, the transcripts form a three-pronged cluster, with known OGs and TSGs being present on separate spikes of this cluster, the rMSN and rNSM spikes, respectively (Figure 4).

Figure 4.

Analyses of rSMN, rMSN, and rNSM parameters of human protein-coding genes of tumor tissues.

The figure shows the results of the analysis of transcripts containing at least 100 subtle, confirmed somatic mutations from tumor tissues, including only mutations identified as not single-nucleotide polymorphisms (SNPs). Axes x, y, and z represent the rSMN, rMSN, and rNSM defined as the ratio of fS/(fM+fN), fM/(fS+fN), and fN/(fS+fM). Each ball represents a human transcript; the positions of transcripts of the genes identified by Vogelstein et al., 2013 as oncogenes (OGs, large red balls) or tumor suppressor genes (TSGs, large blue balls) are highlighted. Panels A1, A2 show the distribution of the 13,803 transcripts at different magnification. Note that the majority of human genes are present in a dense cluster but known OGs and TSGs separate significantly from the central cluster and from each other. The rNSM values of TSGs are higher, their rMSN and rSMN are lower than those of passenger genes (PGs). OGs also separate from PGs in that their rMSN values are higher and their rSMN and rNSM values are lower than those of PGs. Panels B1, B2 show data only for candidate cancer genes present in the CG_SO^2SD_SSI^2SD list (see Materials and methods). The positions of novel cancer gene transcripts validated in the present work are highlighted as large green balls.

There is, however, a fourth cluster of genes that deviates from the clusters of PGs, OGs, and TSGs (Figures 2, 3 and 4). The high fS, rSM, and rSMN values of the transcripts in this group suggest that they are subject to purifying selection during tumor evolution, raising the possibility that this group may contain genes essential for the survival of tumors.

The analyses discussed above did not take into account the impact of differences in mutation probability on the fN, fM, and fS values of transcripts. To check the influence of this factor, we have calculated the expected fN*, fM*, and fS* values for all human transcripts using the probabilities of the six substitution classes (C>A, C>G, C>T, T>A, T>C, and T>G) observed across tumors (for deatails the reader should consult the Materials and methods section).

The various types of observed/expected ratios (rN*, rM*, rS*; rSM*, rNM*, rNS*; rSMN*, rMSN* and rNSM*) were calculated for each transcript and the data were analyzed in 3D scatter plots as described above for the observed values. As shown in Figures 5, 6 and 7, the distribution of transcripts in these 3D scatter plots are similar to those observed in the corresponding Figures 2, 3 and 4, indicating that the separation of the clusters of PGs, OGs, TSGs, and TEGs is relatively insensitive to transcript-specific differences in mutation probabilities.

Figure 5.

Analyses of rS*, rM*, and rN* parameters of human protein-coding genes of tumor tissues.

The figure shows the results of the analysis of transcripts containing at least 100 subtle, confirmed somatic mutations from tumor tissues. Axes x, y, and z represent rS*, rM*, and rN* values, respectively. In Panel A, each gray ball represents a human transcript; note that the majority of human genes are present in a dense cluster. Panel B highlights the positions of transcripts of the genes identified by Vogelstein et al., 2013 as oncogenes (OGs, large red balls) or tumor suppressor genes (TSGs, large blue balls). It is noteworthy that these driver genes separate significantly from the central cluster and from each other: OGs have a significantly larger fraction of nonsynonymous, whereas TSGs have significantly larger fraction of nonsense substitutions than expected. Panel C shows data only for candidate cancer genes present in the CG_SO^2SD_SSI^2SD list (see Materials and methods). The positions of novel cancer gene transcripts validated in the present work are highlighted as large green balls.

Figure 6.

Analyses of rSM*, rNM*, rNS* parameters of human protein-coding genes of tumor tissues.

The figure shows the results of the analysis of transcripts containing at least 100 subtle, confirmed somatic mutations from tumor tissues. Axes x, y, and z represent rSM*, rNM*, rNS* values, respectively. Each ball represents a human transcript; the positions of transcripts of the genes identified by Vogelstein et al., 2013 as oncogenes (OGs, large red balls) or tumor suppressor genes (TSGs, large blue balls) are highlighted. Panels A1 and A2 show the distribution of the transcripts at different magnification. Note that the majority of human genes are present in a dense cluster but known OGs and TSGs separate significantly from the central cluster and from each other. The rNS* and rNM* values of TSGs are higher, whereas the rSM* and rNM* values of OGs are lower than those of passenger genes. Panels B1, B2 show data only for candidate cancer genes present in the CG_SO^2SD_SSI^2SD list (see Materials and methods). The positions of novel cancer gene transcripts validated in the present work are highlighted as large green balls.

Figure 7.

Analyses of rSMN*, rMSN*, and rNSM* parameters of human protein-coding genes of tumor tissues.

The figure shows the results of the analysis of transcripts containing at least 100 subtle, confirmed somatic mutations from tumor tissues. Axes x, y, and z represent the rSMN*, rMSN*, and rNSM* values, respectively. Each ball represents a human transcript; the positions of transcripts of the genes identified by Vogelstein et al., 2013 as oncogenes (OGs, large red balls) or tumor suppressor genes (TSGs, large blue balls) are highlighted. Panels A1, A2 show the distribution of the transcripts at different magnification. Note that the majority of human genes are present in a dense cluster but known OGs and TSGs separate significantly from the central cluster and from each other. The rNSM* values of TSGs are higher, their rMSN* and rSMN* are lower than those of passenger genes (PGs). OGs also separate from PGs in that their rMSN* values are higher and their rSMN* and rNSM* values are lower than those of PGs. Panels B1, B2 show data only for candidate cancer genes present in the CG_SO^2SD_SSI^2SD list (see Materials and methods). The positions of novel cancer gene transcripts validated in the present work are highlighted as large green balls.

Analyses of candidate cancer gene sets

We assumed that the genes whose patterns of subtle mutations deviate significantly (by more than 2SD) from those of prototypical PGs are enriched in cancer genes that play important role in carcinogenesis. The patterns of subtle mutations of candidate cancer genes assign them to one of the three main clusters that show signs of positive and/or negative selection (see Figures 2–7). (A) Genes positively selected for inactivating (nonsense and frame-shift) mutations – putative TSGs; (B) genes positively selected for missense mutations and negatively selected for inactivating mutations – putative proto-oncogenes; (C) negatively selected genes – putative TEGs.

The assumption that the cancer genes assigned to these three clusters play significant roles in carcinogenesis has strong support in the case of the first two categories: the approach used in the present study correctly assigned the known, ‘gold standard’ TSGs and OGs (Supplementary file 1). In the case of the third category, however, no similar gold standard exists for TEGs.

To check the validity and predictive value of the assumption that the genes assigned to the three clusters play critical roles in carcinogenesis, we have selected a number of genes at random from each cluster for further in-depth analyses. We have used three criteria to select genes for detailed analyses from the combined list of candidate cancer genes that deviate from the central clusters of PGs by more than 2SD (see Materials and methods). (1) The candidate gene is among the genes showing the strongest signals of selection characteristic of the given group. (2) The candidate gene is novel in the sense that it is not listed among the 145’ gold standard’ OGs and TSGs of Vogelstein et al., 2013 or among the 719 cancer genes of CGC (Sondka et al., 2018). (3) There is substantial experimental information in the scientific literature on the given gene to permit the assessment of its role in carcinogenesis.

The genes discussed below include genes positively selected for truncating mutations (putative TSGs), genes positively selected for missense mutations and negatively selected for inactivating mutations (putative proto-oncogenes) and negatively selected genes (putative TEGs). In the main text, we summarize only the major conclusions of our analyses; for annotations of the individual genes, the reader should consult Appendix 1. We discuss examples of negatively selected genes in the main text in more detail since earlier studies that focussed on positive selection of driver mutations inevitably missed these genes. We also discuss some instructive examples of’ false’ hits, that is cases where the mutation parameters deviate significantly from those of PGs, but this deviation is not due to selection.

Novel cancer genes positively selected for nonsense mutations

We have selected genes positively selected for truncating mutations from the combined list of candidate transcripts, that is, transcripts whose parameters deviate from those of PGs by more than 2SD (for details see Materials and methods). We have used the additional restriction that genes with indel_rNSM <0.125 were excluded (Supplementary file 1), thereby removing OGs and TEGs. Out of the 624 genes that satisfy these criteria, we have subjected B3GALT1, BMPR2, BRD7, ING1, MGA, PRRT2, RASA1, RNF128, SLC16A1, SPRED1, TGIF1, TNRC6B, TTK, ZNF276, ZC3H13, ZFP36L2, and ZNF750 to further analysis.

Annotation of the majority of these genes (BMPR2, BRD7, ING1, MGA, PRRT2, RASA1, RNF128, SLC16A1, SPRED1, TGIF1, TNRC6B, ZC3H13, ZFP36L2, and ZNF750) has provided convincing evidence for their role in carcinogenesis as tumor suppressors. Interestingly, experimental evidence suggests that TTK, encoding dual specificity protein kinase TTK, is a proto-oncogene that may be converted to an OG by truncating mutations affecting the very C-terminal end of the protein, downstream of its kinase domain (for further details see Appendix 1). Our annotations suggest that B3GALT1, ZNF276 are false positives whose apparent mutation pattern deviates significantly from those of PGs, but this deviation is not due to selection.

Based on functional annotation of the TSGs identified and validated in the present work (see Appendix 1), we have assigned them to various cellular processes of cancer hallmarks in which they are involved (Table 1).

Table 1.

Assignment of novel positively or negatively selected cancer genes to key cellular processes of carcinogenesis.

Hallmarks of cancer	Gene symbol
Defects of genome, epigenome, transcriptome, or proteome maintenance	CDK8, FOXG1, IDH3B, MARCH7, MGA, NOVA1, *PNCK, RNF128, TGIF1, TNRC6B, TWIST1*, ZC3H13, ZFP36L1, ZFP36L2, ZNF750
Sustained proliferation	AURKA, BRD7, ING1, FOXG1, MAPK13, PNCK, PRRT2, RASA1, RIT1, SPRED1, TRIB2, TTK, YAP1, YES1, ZFP36L1, ZFP36L2, ZNF750
Evasion of growth suppressors
Reprogramming of metabolism	BRD7, G6PD, SLC16A1, SLC16A3, SLC2A1, SLC2A8, YAP1, YES1
Replicative immortality	*NOVA1*
Evasion of cell death	BRD7, ING1, MAPK13, PNCK, PRRT2, TP73, TRIB2, TTK, YAP1, YES1, ZNF750
Evasion of immune destruction
Tumor promoting inflammation	BMP2R, CCR2, CCR5, CX3CR1, MAPK13
Inducing angiogenesis	*CCR2*
Activation of invasion and metastasis	*CCR2, CCR5, CX3CR1, RASA1, TBXA2R*

For annotation of novel genes identified in the present study see Appendix 1. The names of negatively selected genes are marked by bold underline.

Comparison of the list of 624 genes present in this dataset (CG_SSI^2SD rNSM >0.125) with lists identified by others (Supplementary file 1) revealed that ~60–100 of our candidate TSG-like genes are also found in several gene lists identified by others through analyses of somatic mutations of tumor tissues. Many of the genes selected for annotation are present in at least one of the candidate gene lists identified by others; the genes of MGA, RASA1, TGIF1, ZFP36L2, and ZNF750 are present in multiple cancer gene lists (Supplementary file 1). It is noteworthy, however, that RNF128, SLC16A1, SPRED1, TNRC6B, and TTK are novel in that they are found only among the candidate cancer genes identified by forward genetic screens in mice (Abbott et al., 2015) or among the genes whose expression changes in cancer (Torrente et al., 2016).

We have also analyzed the genes present in dataset CG_SO*^2SD_rNSM >3, that is, candidate cancer genes for which the observed rNSM values are more than threefold higher than expected taking into account mutational signature-derived substitution probabilities of tumors (Supplementary file 2). We have found that 164 (100%) of the 164 genes present in this dataset are also present in the dataset CG_SSI^2SD rNSM >0.125. It is noteworthy that the majority of candidate TSGs selected for annotation (B3GALT1, BMPR2, BRD7, ING1, MGA, PRRT2, RASA1, SLC16A1, SPRED1, TGIF1, ZNF276, ZFP36L2, and ZNF750) are present among the genes shared by the two datasets that show the strongest signals of positive selection for nonsense substitutions.

Novel cancer genes positively selected for missense and negatively selected for nonsense mutations

We have selected genes positively selected for missense and negatively selected for inactivating mutations from the list of candidate transcripts using the restriction that genes with rMSN <3.00 (440) were excluded, thereby removing the majority of TSGs and TEGs (Supplementary file 1). Out of the 440 genes that satisfy these criteria, we have subjected AURKA, CDK8, IDH3B, MARCH7, RIT1, YAP1, and YES1 to further analysis.

Annotation of these genes has confirmed that they play important roles in carcinogenesis as OGs. Three of these genes encode kinases (Aurora kinase A, also known as breast tumor-amplified kinase; cyclin-dependent kinase 8; tyrosine-protein kinase Yes, also known as proto-oncogene c-Yes) but unlike many other oncogenic kinases, these OGs do not show significant clustering of missense mutations. In fact, only in the case of IDH3B and RIT1 did we observe clustering of missense mutations, indicating that recurrent mutation is not an obligatory property of proto-oncogenes.

Based on functional annotation of the novel OGs identified and validated in the present work (see Appendix 1), we have assigned them to various cellular processes of cancer hallmarks in which they are involved (Table 1).

Comparison of this list of 440 genes (CG_SO^2SD rMSN >3.00) with the lists of cancer genes identified by others (Supplementary file 1) revealed that ~60–100 of our candidate OG-like genes are present in cancer gene lists identified by others through analyses of somatic mutations of tumor tissues.

Out of the genes that we have selected for annotation only the RIT1 gene has been identified by others as an OG, based on the analysis of somatic mutations (Supplementary file 1). AURKA and IDH3B are not present in any of the lists of cancer genes, whereas CDK8, MARCH7, YAP1, and YES1 are found among the more than 9000 candidate cancer genes identified by forward genetic screens in mice (Abbott et al., 2015). Interestingly, TTK, identified as a gene positively selected for truncating mutations (see list CG_SSI^2SD rNSM >0.125), but annotated as an OG, is also present in the list of genes positively selected for missense mutations (CG_SO^2SD rMSN >3.00).

We have also analyzed the genes present in dataset CG_SO*^2SD_rMSN >1.50, that is, genes for which the observed rMSN values are more than 1.5-fold higher than expected taking into account mutational signature-derived substitution probabilities of tumors (Supplementary file 2). We have found that 119 (98.3%) of the 121 genes present in this dataset are also present in the dataset CG_SO^2SD rMSN >3.00. It should be noted that the majority of candidate OGs selected for annotation (AURKA, RIT1, YAP1, and YES1) are found among the genes shared by the two datasets, showing strong signals of positive selection for missense substitutions.

Negatively selected genes

We have selected putative TEGs from the list of candidate cancer genes using the restriction that we have excluded genes with rSMN <0.5 to eliminate OGs and TSGs (Supplementary file 3). Out of the 505 genes, we have subjected CX3CR1, FOXG1, FOXP2, G6PD, MAPK13, MLLT3, NOVA1, PNCK, RUNX2, SLC16A3, SLC2A1, SLC2A8, TBP, TBXA2R, TP73, and TRIB2 to further analysis.

Our analyses have confirmed that in the majority of cases (CX3CR1, FOXG1, G6PD, MAPK13, NOVA1, PNCK, SLC16A3, SLC2A1, SLC2A8, TBXA2R, TP73, TRIB2) the high synonymous-to-nonsynonymous and nonsense mutation rates could be interpreted as evidence for purifying selection during tumor evolution. There were, however, several examples (e.g. DSPP, FOXP2, MLLT3, RUNX2, TBP) where high synonymous-to-nonsynonymous and nonsense mutation rates were found to reflect increased rates of synonymous substitution (due to the presence of mutation hotspots), rather than decreased rates of nonsynonymous and nonsense substitutions that could be due to purifying selection (for details see Appendix 1).

Annotations of the genes CX3CR1, FOXG1, G6PD, MAPK13, NOVA1, PNCK, SLC16A3, SLC2A1, SLC2A8, TBXA2R, TP73, and TRIB2 have confirmed that all of them play important roles in carcinogenesis (see Appendix 1) permitting their assignment to various cellular processes of cancer hallmarks (Table 1.). As discussed below (and in Appendix 1), they fulfill pro-oncogenic functions by promoting cell proliferation (FOXG1, MAPK13, PNCK, TRIB2), evasion of cell death (MAPK13, PNCK, TP73), replicative immortality (NOVA1), reprogramming of energy metabolism of cancer cells (G6PD, SLC16A3, SLC2A1, SLC2A8), inducing tumor promoting inflammation (CX3CR1, MAPK13) and invasion and metastasis (CX3CR1, TBXA2R). In view of the pro-oncogenic role of these proteins, it is noteworthy, that G6PD, MAPK13, PNCK, SLC16A3, and SLC2A1 are among the candidate cancer genes identified by forward genetic screens in mice (Abbott et al., 2015).

Comparison of our list of 505 negatively selected genes (CG_SO^2SD_rSMN > 0.5) with those identified by others have revealed very little similarity (Supplementary file 3). Out of the 147 genes of Weghorn and Sunyaev, 2017, only one is present in the list of top-ranking negatively selected genes identified in the present study. Similarly, only four of the 25 genes of Zapata et al., 2018 and only five of the 91 genes of Pyatnitskiy et al., 2015 are found in our list of negatively selected genes (Supplementary file 3).

We observed a greater similarity when we compared our list of negatively selected genes with that of Zhou et al., 2017; 32 of the 112 genes identified by Zhou et al., 2017 are also present among the 505 negatively selected genes identified in the present work (Supplementary file 3). It is noteworthy that top-ranking genes present in both lists include the ACKR3, TBP, and MLLT3 genes. As discussed in Appendix 1, the apparent signals of negative selection (high synonymous-to-nonsynonymous rates) of genes like DSPP, FOXP2, MLLT3, RUNX2, and TBP may reflect the presence of mutation hotspots generating silent mutations and not purifying selection. Zhou et al., 2017 have also noted that "some cancer genes also show negative selection in cancer genomes, such as the OG MLLT3" and that "interestingly, MLLT3 has recurrent synonymous mutations at amino acid positions 166 to 168". Apparently, the authors did not realize that this observation of recurrent silent substitutions (in a poly-Ser region of the protein) questions the validity of the claim that the unusually low nonsynonymous to synonymous rate is due to negative selection (for more detail see Appendix 1).

In summary, the pro-oncogenic, negatively selected genes annotated and validated in the present work are missing from the earlier lists of negatively selected genes (Zhou et al., 2017; Pyatnitskiy et al., 2015; Weghorn and Sunyaev, 2017; Zapata et al., 2018). A possible explanation for the lack of similarity of top-ranking negatively selected genes identified in the present study with those identified by others is that we have limited our work to transcripts that have at least 100 somatic mutations. It is noteworthy that a large fraction of genes identified by others did not pass this requirement (see Materials and methods).

We have also analyzed the genes present in dataset CG_SO*^2SD rSMN >1.50, that is, candidate cancer genes for which the observed rSMN values are more than 1.5-fold higher than expected taking into account mutational signature-derived substitution probabilities of tumors (Supplementary file 4). We have found that 200 (86.5%) of the 231 genes present in this dataset are also present in dataset CG_SO^2SD rSMN >0.5. It should be noted that the majority of candidate TEGs selected for annotation (CX3CR1, FOXG1, FOXP2, MAPK13, MLLT3, NOVA1, RUNX2, SLC16A3, SLC2A8, TBP, TBXA2R, and TRIB2) are found among the 200 genes shared by the two datasets and that show the strongest signals of negative selection for missense and nonsense substitutions.

Negative selection, cell essentiality, and tumor essentiality of genes

As we have emphasized in the Introduction, the conclusions drawn from earlier studies searching for signs of negative selection are highly controversial. A highly publicized study has propagated the conclusion that negative selection has no role in tumor evolution (Martincorena et al., 2017; Bakhoum and Landau, 2017; Koch, 2017; Vitale and Galluzzi, 2018). Martincorena et al., 2017 have argued that the practical absence of purifying selection during tumor evolution is due to the buffering effect of diploidy and functional redundancy of most cellular pathways.

A recent study has examined the influence of functional redundancy on the essentiality of genes (De Kegel and Ryan, 2019). The authors have used CRISPR score profiles of 558 genetically heterogeneous tumor cell lines and converted continuous values of gene CRISPR scores to binary essential and nonessential calls. These analyses have shown that 1014 genes belong to a category of ‘broadly essential genes’, that is, these genes were found to be essential in at least 90% of the 558 cell lines. De Kegel and Ryan, 2019 have shown that, compared to singleton genes, paralogs are less frequently essential and that this is more evident when considering genes with multiple paralogs or with highly sequence-similar paralogs. In harmony with these conclusions, López et al., 2020 have found that purifying selection of essential genes is significant in early phases of tumor evolution but in later phases whole-genome doubling allows the accumulation of deleterious alterations.

Since the group of negatively selected genes identified by Weghorn and Sunyaev, 2017 were shown to be enriched in cell-essential genes (Wang et al., 2015a), the authors have proposed that the major cause of negative selection during tumor evolution is the maintenance of genes that are responsible for basal cellular functions. Nevertheless, Weghorn and Sunyaev, 2017 have pointed out that negative selection is also expected to act on neoantigens, expanding the possible scope of purifying selection beyond cell essentiality.

Although analyses of negatively selected genes have led Zapata et al., 2018 to conclude, "Processes that are most strongly conserved are those that play fundamental cellular roles such as protein synthesis, glucose metabolism, and molecular transport" they also emphasized the possible importance of less basic functions. Since the immune system is capable of discriminating cancer cells by recognizing mutated epitope sequences the authors have hypothesized that native epitope sequences would be protected from nonsynonymous mutations during tumor evolution. In harmony with this hypothesis, the authors have observed signals of selection in the immunopeptidome and proteins of the epitope presentation machinery, arguing for their importance in the evasion of immune surveillance by tumors.

Gene Ontology analysis of the negatively selected ‘essential cancer proteins’ identified by Pyatnitskiy et al., 2015 have revealed enrichment of essential proteins related to membrane and cell periphery, leading the authors to speculate that this could be a sign of immune system-driven negative selection of cancer neo-antigens.

In summary, there is some disagreement about the significance of purifying selection in tumor evolution and whether tumor essential functions can be equated with basic cellular functions.

In order to assess the contribution of cell-essentiality to purifying selection during tumor evolution, we have plotted various measures of negative selection of human genes as a function of their cell-essentiality scores determined by De Kegel and Ryan, 2019. These analyses have shown that there is a very weak, positive correlation (Pearson's r = 0.05345, p<0.05) between rSMN (a measure of purifying selection) and the cell-essentiality scores of transcripts (Figure 8, Supplementary file 5). Since, by definition, there is a negative correlation between the essentiality of genes and their cell-essentiality scores (De Kegel and Ryan, 2019), our data indicate that cell essentiality does not contribute significantly to purifying selection during tumor evolution.

Figure 8.

Cell-essentiality scores of human genes and negative selection during tumor evolution.

The figure shows the results of the analysis of transcripts containing at least 100 subtle, confirmed somatic, non-polymorphic mutations from tumor tissues. The abscissa indicates the cell-essentiality score of the genes, the ordinate shows the rSMN parameters of the transcripts. Each ball represents a human transcript. Transcripts showing strongest signals of negative selection (CG_SO^2SD rSMN >0.5) are represented by dark orange balls.

It is also noteworthy that the cell essentiality scores of negatively selected genes (CG_SO^2SD rSMN >0.5) are not significantly different from those of PGs (Figure 8, Supplementary file 5). Comparison of CRISPR scores (−0.07665 ± 0.17269) of the cluster of negatively selected genes of CG_SO^2SD rSMN >0.5 listed in Supplementary file 3 with CRISPR scores (−0.09506 ± 0.24168) of the cluster of PGs (PG_SO^r3_1SD) revealed that they are not significantly different (p>0.05). This indicates that basic cell-essentiality per se does not explain the purifying selection observed for this cluster of genes.

Comparison of the lists of negatively selected genes identified in the present work with the 1014 ‘broadly essential genes’ defined by De Kegel and Ryan, 2019 has revealed that there is practically no overlap between the two groups. Only six of the 1014 broadly essential genes are included in our list of negatively selected genes (Supplementary file 3). This observation also suggests that cell-essentiality defined by CRISPR scores determined experimentally on cell lines is not relevant for negative selection during tumor evolution in vivo.

Our analyses of cases of strong purifying selection suggest that it has more to do with a function specifically required by the tumor cell for its growth, survival, and metastasis than with general basic cellular functions (Table 1). It is noteworthy in this respect, that the genes showing the strongest signals of negative selection include several plasma membrane receptor proteins (e.g. ACKR3, CCR2, CCR5, CX3CR1, TBXA2R) that cancer cells utilize to promote migration, invasion, and metastasis (Appendix 1). Significantly, these proteins exert their biological functions (in cell migration, inflammation, angiogenesis etc.) primarily at the organism level, therefore their cell-essentiality scores may have little to do with their overall essentiality for tumor growth and metastasis. Inspection of the data of De Kegel and Ryan, 2019 shows that ACKR3, CX3CR1, TBXA2R were not assigned to the essential category in any of the 558 tumor cell lines tested.

Negatively selected, TEGs identified in the present study do include proteins involved in cell-level processes: they promote cell proliferation (FOXG1, MAPK13, PNCK, and TRIB2), evasion of cell death (MAPK13, PNCK, and TP73), replicative immortality (e.g. NOVA1), or they are crucial for the reprogramming of energy metabolism in cancer cells (e.g. GAPD, SLC16A3, SLC2A1, and SLC2A8). Nevertheless, their negative selection is unlikely to be a mere reflection of their basic cellular functions. Rather, it reflects the exceptional role of the corresponding cancer hallmarks (evasion of cell death, replicative immortality, reprogramming of metabolism) in carcinogenesis (Figure 1). In harmony with this conclusion NOVA1, SLC16A3, SLC2A8, and TP73 were assigned to the essential category by De Kegel and Ryan, 2019 in less than 10% of the 558 tumor cell lines tested. SLC2A1 (glucose transporter 1) is an exception in as much as it was found to be cell-essential in 41% of the cell lines. Significantly, several nutrient transporter genes (SLC16A3, SLC2A1, and SLC2A8) were found among the genes showing the strongest signs of purifying selection. It must be mentioned here that Zapata et al., 2018 have also noted that the glucose transporters SLC2A1 and SLC2A8 and the lactate transporter SLC16A3 show signs of purifying selection, although they did not list these genes among the 25 genes with significant negative selection.

The most likely explanation for the tumor essentiality of transporter protein genes SLC16A3, SLC2A1, and SLC2A8 is that tumor cells have an increased demand for nutrients and this demand is met by enhanced cellular entry of nutrients through upregulation of specific transporters (Ganapathy et al., 2009). The uncontrolled cell proliferation of tumor cells involves major adjustments of energy metabolism in order to support cell growth and division in the hypoxic microenvironments in which they reside. Otto Warburg was the first to observe an anomalous characteristic of cancer-cell energy metabolism: even in the presence of oxygen, cancer cells limit their energy metabolism largely to glycolysis, leading to a state that has been termed ‘aerobic glycolysis’ (Warburg, 1956a; Warburg, 1956b). Cancer cells are known to compensate for the lower efficiency of ATP production through glycolysis than oxidative phosphorylation by upregulating glucose transporters, such as facilitated glucose transporter member 1, GLUT1 (encoded by the SLC2A1 gene), thus increasing glucose import into the cytoplasm (Jones and Thompson, 2009; DeBerardinis et al., 2008; Hsu and Sabatini, 2008).

The markedly increased uptake of glucose has been documented in many human tumor types, by visualizing glucose uptake through positron emission tomography. The reliance of tumor cells on glycolysis is also supported by the hypoxia response system: under hypoxic conditions, not only glucose transporters but also multiple enzymes of the glycolytic pathway are upregulated (Jones and Thompson, 2009; DeBerardinis et al., 2008; Semenza, 2010a; Semenza, 2010b; Kroemer and Pouyssegur, 2008).

In our view, the central role of GLUT1 in cancer metabolism is reflected by the fact that the SLC2A1 gene encoding this glucose transporter is among the genes that show the strongest signals of purifying selection. The key importance of GLUT1 in cancer may be illustrated by the fact that high levels of GLUT1 expression correlates with a poor overall survival and is associated with increased malignant potential, invasiveness, and poor prognosis (Wang et al., 2017a; Deng et al., 2018; de Castro et al., 2019). The strict requirement for GLUT1 in the early stages of mammary tumorigenesis highlights the potential for glucose restriction as a breast cancer preventive strategy (Wellberg et al., 2016). The tumor essentiality of GLUT1 may also be illustrated by the fact that knockdown of GLUT1 inhibits cell glycolysis and proliferation and inhibits the growth of tumors (Xiao et al., 2018). In view of its essentiality for tumor growth, GLUT1 is a promising target for cancer therapy (Shibuya et al., 2015; Noguchi et al., 2016; Chen et al., 2017d).

Recent studies suggest that the YAP1-TEAD1-GLUT1 axis plays a major role in reprogramming of cancer energy metabolism by modulating glycolysis (Lin and Xu, 2017). These authors have shown that YAP1 and TEAD1 are involved in transcriptional control of the glucose transporter GLUT1, whereas knockdown of YAP1 inhibited glucose consumption, and lactate production of breast cancer cells, overexpression of GLUT1 restored glucose consumption and lactate production.

Besides GLUT1 another glucose transporter, GLUT8 (encoded by the SLC2A8 gene) also shows strong signals of negative selection, arguing for its importance in tumor survival. In harmony with this interpretation, there is evidence that GLUT8 is overexpressed in and is required for proliferation and viability of tumors (Goldman et al., 2006; McBrayer et al., 2012).

Due to abnormal conversion of pyruvic acid to lactic acid even under normoxia, the altered metabolism of glucose consuming tumors must rapidly efflux lactic acid to the microenvironment to maintain a robust glycolytic flux and to prevent poisoning themselves (Mathupala et al., 2007). Survival and maintenance of the glycolytic phenotype of tumor cells is ensured by monocarboxylate transporter 4 (MCT4, encoded by the SLC16A3 gene) that efficiently transports L-lactate out of the cell (Ganapathy et al., 2009). Significantly, MCT4, encoded by the SLC16A3 gene also shows strong signals of negative selection, in harmony with its importance in tumor survival. As high metabolic and proliferative rates in cancer cells lead to production of large amounts of lactate, extruding transporters are essential for the survival of cancer cells as illustrated by the fact that knockdown of MCT4 increased tumor-free survival and decreased in vitro proliferation rate of tumor cells (Andersen et al., 2018). Using a functional screen Baenke et al., 2015 have also demonstrated that monocarboxylate transporter four is an important regulator of breast cancer cell survival: MCT4 depletion reduced the ability of breast cancer cells to grow, suggesting that it might be a valuable therapeutic target. In harmony with the essentiality of MCT4 for tumor growth, several studies indicate that expression of the hypoxia-inducible monocarboxylate transporter MCT4 is increased in tumors and its expression correlates with clinical outcome, thus it may serve as a valuable prognostic factor (Witkiewicz et al., 2012; Doyen et al., 2014; Baek et al., 2014). Consistent with the key importance of MCT4 for the survival of tumor cells, its selective inhibition to block lactic acid efflux appears to be a promising therapeutic strategy against highly glycolytic malignant tumors (Choi et al., 2016; Todenhöfer et al., 2018; Choi et al., 2018; Zhao et al., 2019b).

Interestingly, the thromboxane A2 receptor gene (TBXA2R) as well as several chemokine receptor protein genes (CCR2, CCR5, CX3CR1) were also found among the genes showing strong signs of purifying selection (see Appendix 1). (Note that Pyatnitskiy et al., 2015 have also identified CCR5 as a negatively selected gene). The most likely explanation for their essentiality for tumor growth is that tumor cells rely on these receptors in various steps of invasion and metastasis (see Appendix 1). It is noteworthy in this respect that another member of the family of chemokine receptors, the atypical chemokine receptor 3, ACKR3 is also among the genes showing very high values of rSMN, suggesting negative selection of missense and nonsense mutations (Supplementary file 3). (Note that Zhou et al., 2017 have also identified ACKR3 as a negatively selected gene). Significantly, ACKR3 is a well-known OG, present in Tier 1 of the Cancer Gene Census. Several studies support the key role of ACKR3 in tumor invasion and metastasis (Li et al., 2014; Stacer et al., 2016; Zhao et al., 2017; Puddinu et al., 2017; Melo et al., 2018; Qian et al., 2018). Since knock-down or pharmacological inhibition of ACKR3 has been shown to reduce tumor invasion and metastasis, ACKR3 is a promising therapeutic target for the control of tumor dissemination (for further details see Appendix 1).

Negative selection of germline mutations in the human population versus negative selection of somatic mutations in cancer

The data discussed in the previous section indicate that the importance (‘essentiality’) of a given gene is a question of perspective. Cell-essential genes may be non-essential for tumor growth, whereas TEGs with tumor-specific functions do not necessarily have cell-essential functions. Similarly, we may assume that the importance of a gene might be quite different from the perspective of tumor cells and from the perspective of the entire organism. One could speculate that somatic mutations of genes with functions that have no relevance for tumor growth (PGs) experience neutral evolution during tumor growth, whereas germline mutations of the same genes may be subject to purifying selection at the level of organismal evolution, as is true for the majority of genes (Gorlov et al., 2006). One may also assume that genes with tumor essential, tumor-specific functions may be subject to purifying selection during both tumor evolution and organism evolution, but the strength of purifying selection of these genes is increased in tumors relative to those of genes that do not have tumor-specific functions.

To test these assumptions, we have determined the signals of selection of germline mutations (Supplementary file 6) and compared them with those determined for the same genes in the case of somatic mutations of cancer. Comparison of the patterns of germline and somatic mutations of human transcripts (Supplementary file 7) has revealed that the proportion of silent substitutions is significantly higher for germline mutations than for somatic mutations of tumors (fS^g: 0.33900 versus fS^s: 0.24604, p<0.05). Conversely, the proportions of nonsense and missense mutations are significantly lower for germline mutations than for somatic mutations of tumors (fN^g: 0.02329 versus fN^s: 0.04669, p<0.05; fM^g: 0.63771 versus fM^s: 0.70727, p<0.05). These observations are in harmony with the dominance of purifying selection in the human population (Gorlov et al., 2006).

As shown in Figure 9, the pattern of the distribution of transcripts in 3D scatter plots of fM, fN, and fS parameters for germline mutations are strikingly different from those observed in the case of fM, fN, and fS parameters of somatic mutations in cancer (compare Figure 9A and B). In addition to a general shift of germline mutations to lower fN and fM and higher fS values, in the case of germline mutations the fN, fM, and fS parameters of transcripts of TSGs, OGs, and TEGs do not separate from those of the central cluster of genes. Similarly, the distribution of transcripts in 3D scatter plots of rS**, rM**, and rN** parameters for germline mutations are different from those observed in the case of rS*, rM*, and rN* parameters of somatic mutations in cancer (compare Figure 9C and D): cancer genes do not separate from the central cluster of genes.

Figure 9.

Comparison of the patterns of germline mutations of genes with those of somatic mutations observed during tumor evolution.

Panel A: fS, fM, and fN scores of somatic mutations in cancer, Panel B: fS, fM, and fN scores of germline mutations. Panel C: rS*, rM*, and rN* scores of somatic mutations in cancer, Panel D: rS**, rM**, and rN** scores of germline mutations. Each ball represents a human transcript. The positions of transcripts of the genes identified by Vogelstein et al., 2013 as oncogenes (OGs, large red balls) or tumor suppressor genes (TSGs, large blue balls) are highlighted. Novel proto-oncogenes, tumor suppressors and tumor essential genes identified in the present work are highlighted in magenta, cyan, and green, respectively.

Comparison of the fS, rSM, and rSMN parameters of germline and somatic mutations of transcripts (Figure 10, Supplementary file 7) has shown that there is only weak correlation between the strength of purifying selection of genes during tumor evolution and organismal evolution. The Pearson's r values for the correlations of the fS, rSM, and rSMN parameters of germline and somatic mutations are 0.1127, 0.05757, and 0.02635, p<0.05, respectively.

Figure 10.

Comparison of fS, rSM, and rSMN scores of genes determined for somatic mutations in tumors with those determined for germline mutations.

The abscissas indicate the fS^g (panel A), rSM^g (panel B), and rSMN^g (panel C) scores of germline mutations of human genes and the ordinates shows the corresponding fS^s, rSM^s, and rSMN^s scores of somatic mutations of tumors for the same genes. Each ball represents a human gene. Transcripts showing the strongest signals of negative selection during tumor evolution (CG_SO^2SD rSMN >0.5) are represented by dark orange balls.

These comparisons have also revealed that – relative to other genes – the candidate TEGs identified in the present study (CG_SO2SD_rSMN >0.5) display significantly stronger signals of purifying selection during tumor evolution than during organismal evolution (Figure 10, Supplementary file 7). The fS, rSM, and rSMN parameters of somatic mutations of candidate TEGs are significantly higher than those of other genes (fS^s: 0.38322 versus 0.24045, p<0.05; rSM^s: 0.66013 versus 0.34375, p<0.05; rSMN^s: 0.62774 versus 0.32356, p<0.05). The fS, rSM, and rSMN parameters of the germline mutations of candidate TEGs, however, differ much less from the corresponding parameters of other genes (fS^g: 0.36487 versus 0.33831, p<0.05; rSM^g: 0.64054 versus 0.56394, p<0.05; rSMN^g: 0.61264 versus 0.56178, p<0.05). These observations indicate that the negative selection of candidate TEGs during tumor evolution is not a simple reflection of their essentiality at the organism level; it is more likely that they serve tumor-specific functions.

In order to assess the contribution of cell-essentiality to purifying selection during organismal evolution we have plotted rSMN^g, a measure of negative selection of germline mutations of human genes, as a function of their cell-essentiality scores determined by De Kegel and Ryan, 2019. These analyses have shown that there is a very weak negative correlation (Pearson's r = −0.03662, p<0.05) between the strength of purifying selection of transcripts (rSMN^g) and their cell-essentiality scores (Figure 11, Supplementary file 7). This observation also indicates that essentiality of cell-level functions measured by cell-essentiality scores contribute to, but do not explain the strength of purifying selection observed during organismal evolution.

Figure 11.

Cell-essentiality scores of human genes and negative selection on single-nucleotide polymorphisms (SNPs).

The figure shows the results of the analysis of transcripts containing at least 100 polymorphic mutations. The abscissa indicates the cell-essentiality score of the genes, the ordinate shows the rSMN^g parameters of the transcripts. Each ball represents a human transcript. Note that there is a weak negative correlation (Pearson's r = −0.03662, p<0.05) between the strength of purifying selection of transcripts (rSMN^g) and their cell-essentiality scores.

Discussion

One of the major goals of cancer research is to identify all ‘cancer genes’, that is genes that play a role in carcinogenesis. In the last two decades, several types of approaches have been developed to achieve this goal, but the implicit assumption of most of these studies was that a distinguishing feature of cancer genes is that they are positively selected for mutations that drive carcinogenesis. As a result of combined efforts, the PCAWG driver list identifies a total of 722 protein-coding genes as cancer driver genes and 22 non-coding driver mutations (Rheinbay et al., 2020; Campbell et al., 2020).

In a recent editorial, commenting on a suite of papers on the genetic causes of cancer, Nature has expressed the view that the core of the mission of cancer-genome sequencing projects—to provide a catalogue of driver mutations that could give rise to cancer—has been achieved (Editorial, 2020). It is noteworthy, however, that, although on average, cancer genomes were shown to contain four to five driver mutations, in around 5% of cases no drivers were identified in tumors (Campbell et al., 2020). As pointed out by the authors, this observation suggests that cancer driver discovery is not yet complete, possibly due to failure of the available bioinformatic algorithms. The authors have also suggested that tumors lacking driver mutations may be driven by mutations affecting cancer-associated genes that are not yet described for that tumor type, however, using driver discovery algorithms on tumors with no known drivers, no individual genes reached significance for point mutations (Campbell et al., 2020).

In our view, these observations actually suggest that a rather large fraction of cancer genes remains to be identified. Assuming that tumors, on average, must have driver mutations affecting at least four or five cancer genes and that known and unknown cancer genes play similar roles in carcinogenesis, the observation that a 0.05 fraction of tumors has no known drivers (i.e. they are driven by four to five unknown cancer drivers) indicates that about half of the drivers is still unknown. If we assume that ~50% of cancer genes is still unknown 3–6% (0.5⁵–0.5⁴, i.e. 0.03125–0.0625 fraction) of tumors is expected to lack any of the known driver genes, and to be driven by four or five unknown driver mutations. Since the list of known drivers used in the study of the ICGC/TCGA Pan-Cancer Analysis of Whole Genomes Consortium (Campbell et al., 2020) comprises 722 driver genes, these observations suggest that hundreds of cancer driver genes remain to be identified.

In the present work, we have used analyses that combined multiple types of signals of selection, permitting improved detection of positive and negative selection. Our analyses have identified a large number of novel positively selected cancer gene candidates, many of which could be shown to play significant roles in carcinogenesis as tumor suppressors and OGs. Significantly, our analyses have identified a major group of human genes that show signs of negative selection during tumor evolution, suggesting that the integrity of their function is essential for the growth and survival of tumor cells. Our analyses of representative members of negatively selected genes have confirmed that they play crucial pro-oncogenic roles in various cancer hallmarks (Table 1). It is important to emphasize that a survey of the group of OGs and pro-oncogenic TEGs reveals that they form a continuum in as much as there are numerous known OGs where negative selection also dominates (e.g. ACKR3, BCL2).

Although several groups have investigated the role of negative selection in tumor evolution earlier (Zhou et al., 2017; Pyatnitskiy et al., 2015; Weghorn and Sunyaev, 2017; Martincorena et al., 2017; Zapata et al., 2018; López et al., 2020; Tilk et al., 2020; Van den Eynden and Larsson, 2017), the study that received the greatest attention has reached the conclusion that negative selection has no role in tumor evolution (Martincorena et al., 2017; Bakhoum and Landau, 2017; Koch, 2017; Vitale and Galluzzi, 2018). The data presented here contradict the latter conclusion.

We believe that the approach reported here will promote the identification of numerous novel OGs, TSGs, and pro-oncogenic TEGs that may serve as therapeutic targets.

Materials and methods

Somatic mutation data

Cancer somatic mutation data were extracted from COSMIC v88, the Catalogue Of Somatic Mutations In Cancer (https://cancer.sanger.ac.uk/cosmic/download), which includes single nucleotide substitutions and small insertions/deletions affecting the coding sequence of human genes. The downloaded file (CosmicMutantExport.tsv, release v88) contained data for 29,415 transcripts (Supplementary file 8). For all subsequent analyses we have retained only transcripts containing mutations that were annotated under’ Mutation description’ as substitution or subtle insertion/deletion. This dataset contained data for 29,405 transcripts containing 6,449,721 mutations (substitution and short indels, SSI) and 29,399 transcripts containing 6,141,650 substitutions only (SO). Supplementary file 9 contains the metadata for these SO and SSI datasets.

Since we were interested in the selection forces that operate during tumor evolution, only confirmed somatic mutations were included in our analyses. In COSMIC such mutations are annotated under’ Mutation somatic status’ as Confirmed Somatic, that is confirmed to be somatic in the experiment by sequencing both the tumor and a matched normal tissue from the same patient. Supplementary file 10 indicates the contribution of major tumor types (‘Tumor Primary site’) to the somatic mutations of the dataset. As to’ Sample Type, Tumor origin’: we have excluded mutation data from cell-lines, organoid-cultures, xenografts since they do not properly represent human tumor evolution at the organism level. We have found that by excluding cell lines we have eliminated many artifacts of spurious recurrent mutations caused by repeated deposition of samples taken from the same cell-line at different time-points. To eliminate the influence of polymorphisms on the conclusions we retained only somatic mutations flagged 'n' for SNPs. (Supplementary file 8).

To increase the statistical power of our analyses, we have limited our work to transcripts that have at least 100 somatic mutations; Supplementary file 5 contains the metadata for transcripts containing at least 100 confirmed somatic, non polymorphic mutations identified in tumor tissues. Hereafter, unless otherwise indicated, our analyses refer to datasets containing transcripts with at least 100 somatic mutations. This limitation eliminated ~38% of the transcripts that contain very few mutations but reduced the number of total mutations only by 9% (Supplementary file 8).

It should be noted that requiring a higher minimum number of somatic mutations increases the statistical power of the analyses but may disfavor the identification of negatively selected genes that tend to be undermutated. To assess the influence of the cut-off value of the minimum number of mutations on the robustness of the conclusions about negatively or positively selected genes, we have compared the results of analyses in which the minimum number of somatic mutation per gene was set as 0, 50, 100, or 500 (Supplementary files 11–13, Figure 12).

Figure 12.

Analyses of fS, fM, and fN parameters of datasets N0, N50, N100, and N500 containing transcripts of human protein-coding genes with at least 0, 50, 100, or 500 somatic substitutions in tumors.

The figure shows the results of the analysis of 29,333, 21,307, 13,803, and 997 transcripts present in datasets N0 (panel A), N50 (panel B), N100 (panel C), and N500 (panel D), respectively. Axes x, y, and z represent the fractions of somatic single nucleotide substitutions that are assigned to the synonymous (fS), nonsynonymous (fM), and nonsense (fN) categories. Each gray ball represents a human transcript. The positions of transcripts of the genes identified by Vogelstein et al., 2013 as oncogenes (OGs, large red balls) or tumor suppressor genes (TSGs, large blue balls) are highlighted; novel proto-oncogenes, TSGs, and negatively selected tumor essential genes validated in the present work are represented by large magenta, cyan, and green balls, respectively. It is noteworthy that the requirement of at least 50 somatic mutations per transcript eliminates transcripts where the signal-to-noise ratio is too low to permit detection of signals of selection through the analysis of fS, fM, and fN parameters (compare panel A and panel B). It should also be noted that the requirement of at least 500 somatic mutations per transcript eliminates transcripts of negatively selected genes (compare panel C and panel D), consistent with the view that they tend to be undermutated.

The choice of the minimum number of somatic mutations was found to have a strong influence on the pattern of observed fN, fS, and fM scores (Supplementary files 11–13, Figure 12). In the case of dataset N0 (no requirement for a minimum number of mutations), a large number of transcripts with less than 50 substitutions had scores of zero for one or two of the fN, fS, and fM parameters due to the absence of somatic mutations in those categories (Figure 12A, Supplementary file 11). Increasing the minimum number of somatic mutations per transcript to 50, 100, and 500 resulted in loss of these transcripts and elimination of a diffusely scattered group of transcripts that do not cluster with PGs, known OGs, TSGs, and TEGs (compare Figure 12A and B,C and D, Supplementary file 11). These observations indicate that we cannot draw valid conclusions about the significance of selection from fN, fS, and fM scores in cases where the number of mutations of a given transcript is too low to permit meaningful analyses. Our analyses have also revealed that a large proportion (22%) of the transcripts unique to the N0^2SD dataset (containing fewer than 50 substitutions), correspond to short transcript fragments encoding less than 100 amino acids (Supplementary file 12). This finding suggests that the requirement for a minimum number of somatic mutations would not only increase statistical power, but also increases the biological relevance of the conclusions with the elimination of fragments that do not properly represent the full-length coding sequences.

Our analyses, however, have shown that the requirement of more than 500 somatic mutations per transcript (dataset N500) is too stringent. Although the majority of the 86 transcripts of the N500^2SD dataset correspond to known OGs (25) or TSGs (48), most OGs and TSGs are not represented in the N500^2SD dataset (Supplementary file 12). Furthermore, none of the negatively selected TEGs validated in the present study is present among the 86 transcripts of the N500^2SD dataset or among the 997 transcripts of the N500 dataset (Figure 12D, Supplementary file 12). This observation is consistent with the view that since negatively selected genes tend to have fewer mutations, they are less likely to pass the requirement for a high number of somatic mutations.

Our analyses suggest that the choice of 50 or 100 as the minimum number of somatic mutations per transcript represent acceptable trade-off between statistical power and loss of negatively or positively selected genes. As shown in Supplementary file 12, both the N50^2SD dataset (1846 transcripts) and the N100^2SD dataset (1060 transcripts) contained the majority of known OGs, known TSGs, and TEGs validated in the present work, but since the choice of 100 offers higher statistical power, we have used this dataset in our analyses.

Our choice of 100 as the minimum number of somatic mutation per transcript may also have some relevance for the lack of more extensive similarity of our list of negatively selected genes with those identified by others. As shown in Supplementary file 13, only 48%, 64%, 77%, and 89% of the negatively selected genes identified by Weghorn and Sunyaev, 2017, Zapata et al., 2018, Zhou et al., 2017, and Pyatnitskiy et al., 2015, respectively are present in dataset N100 containing 13,803 transcripts with at least 100 somatic mutations (Supplementary file 11). It thus appears that one of the reasons for the differences observed is that, with respect to minimum number of mutations, we have used rather stringent criteria to increase the robustness of our estimates. We wish to point out that, in order to obtain more reliable estimates of purifying selection, Pyatnitskiy et al., 2015 have also excluded genes from their analysis that carried a low number of mutations but they excluded only those with less than 11 mutations.

The COSMIC database of somatic mutations used in the present study contains data obtained by three main types of sequencing: whole-genome sequencing (WGS), whole-exome sequencing (WES) and targeted sequencing. As shown in Supplementary file 8, targeted screens provided substitution mutation data for only 13,120 transcripts of human genes, whereas genome wide screens covered 29,407 human transcripts as opposed to 29,415 transcripts covered by targeted plus genome wide screens. The contribution of targeted screens to somatic point mutations is even more restricted: only 508,124 (8.3%) of the 6,141,650 somatic point mutations of the entire COSMIC database were identified by targeted sequencing (Supplementary file 8). To check the impact of targeted sequencing on the dataset, in some analyzes we have used somatic mutation data only from genome-wide screens, excluding those obtained by targeted sequencing. We have found that omission of the data from targeted screens had no significant effect on the conclusions drawn from our analyses. Several factors may explain this observation. First, targeted screens usually focus on known cancer genes and they usually just reinforce the ‘known cancer gene’ status of the targeted genes. Second, since only a small fraction of the somatic mutations originates from targeted screens their impact is limited even in the case of the targeted genes. Finally, inclusion or omission of data from targeted screens has no impact on the number and pattern of mutations of non-targeted genes identified in genome wide screens.

Germline mutation data

Information on SNPs affecting the coding regions of human genes was downloaded from the dbSNP database (https://www.ncbi.nlm.nih.gov/snp/). For each SNP, we extracted nucleotide and amino acid variants from the original dbSNP file. In cases where two or three mutant variant was reported for a specific rsID, each variant was treated as an independent polymorphism. The retrieved SNPs were assigned to three functional categories: (i) Nonsense or Stop_gained mutations (N), which change an amino acid-encoding codon into a stop codon, (ii) Missense mutations (M), which change an amino acid into a mutant amino acid, and (iii) Synonymous or silent mutations (S), which do not change the amino acid. We have focused only on SNPs of genes that were also found to contain at least 100 confirmed somatic, non polymorphic mutations in the COSMIC database (Supplementary file 5). Supplementary file 6 shows the numbers and fractions of SNPs affecting the coding sequences of the various human genes, according to the functional categories of the point mutations.

Substitution metrics

The 61 sense codons can undergo 549 single base substitutions and, depending on the wild type and mutant codon, each substitution can be assigned to the silent, missense or nonsense mutation category. Out of the 549 single-base substitutions, 392 result in missense mutation, 134 lead to silent mutation, and 23 generate nonsense mutation, thus – assuming equal codon frequency, equal probability of the different types of substitutions and neutrality – the expected fractions of nonsense, missense and silent substituions are fN = 0.04189, fM = 0.71403, and fS = 0.24408, respectively.

Codons, however, differ significantly in the probability that their mutation would lead to nonsense (N), missense (M), or silent (S) mutation (Supplementary files 14, 15, 16) and since the 61 sense codons (amino acids) do not occur with the same frequency in the coding region of human genes this may have a significant influence on the expected fN, fM, and fS values. We have calculated the probability that a substitution would lead to nonsense, missense or silent mutation taking into account the codon frequency of the proteome of Homo sapiens (https://www.kazusa.or.jp/codon/cgi-bin/showcodon.cgi?species=9606). This calculation yielded values of fN = 0.0419, fM = 0.7299, fS = 0.2282 for the proteome, slightly different from the values of fN = 0.0419, fM = 0.7140, fS = 0.2441, assuming equal frequency of codons.

The amino acid composition and codon usage of some individual proteins (especially short fragments) may deviate significantly from average, therefore we have calculated the expected proportion of silent, missense, and nonsense mutations for all transcripts, assuming equal probability of different substitutions classes (Supplementary file 17). For these calculations, we have downloaded the coding sequences of 53,190 transcripts of human protein coding genes (All_COSMIC_Genes.fasta.gz) from the COSMIC database (https://cancer.sanger.ac.uk/cosmic) and their codon usage and amino acid composition were determined using the SMS server (https://www.bioinformatics.org/sms2/codon_usage.html, Stothard, 2000).

Different classes of substitutions, however, do not occur with equal probability, moreover the various normal and tumor tissues show characteristic differences in the spectrum of substitutions classes (Alexandrov et al., 2013; Alexandrov et al., 2020). Substitutions are assigned to six classes (C>A, C>G, C>T, T>A, T>C, and T>G) referred to by the pyrimidine of the mutated Watson–Crick base pair. It is of crucial importance to take differences in the probability of the six mutation classes into account since—due to the unique structure of the genetic code—the six types of substitutions differ markedly in the probability that they would lead to nonsense (N), missense (M), or silent (S) mutation of the coding region of protein-coding genes. As shown in Supplementary files 18–25, there are significant differences in the impact of different substitution classes on the expected proportion of missense, silent, and nonsense mutations of codons (assuming equal codon frequency). For example, the dominance of C>G increases the proportion of missense substitutions, whereas higher rates of C>T and T>C substitutions increase the proportion of silent substitutions. Since mutation bias favoring C>T substitutions is expected to decrease the ratio of missense to silent mutations, decreased dN/dS values may not be taken as evidence for negative selection in the case of tumors, such as malignant melanoma, where the vast majority of all somatic mutations is C>T substitution (Van den Eynden and Larsson, 2017).

To take into account differences in mutation bias, we have calculated the contribution of the C>A, C>G, C>T, T>A, T>C, and T>G mutations to the pattern of single base substitutions in tumors. We have downloaded the files containing ‘Mutational Signatures v3.1’ and ‘Attributions of the SBS Signatures to Mutations in Tumors’ from the COSMIC website (https://cancer.sanger.ac.uk/cosmic/signatures/SBS/index.tt). The contributions of C>A, C>G, C>T, T>A, T>C, and T>G mutations to the pattern of Single Base Substitutions in the tumors listed in the PCAWG_sigProfiler_SBS_signatures_in_samples file are summarized in Supplementary file 26. The C>T substitution accounts for the largest fraction of substitutions in most tumors (0.3726), followed by T>C (0.1842), C>A (0.1583), C>G (0.1162), T>G (0.0891), and T>A (0.0796). There are, however, differences in the relative contribution of the six mutation classes to different tumors. For example, the contribution of C>A mutation is higher than average for colon cancer and lung cancer, the role of C>G mutation is above average for bladder cancer and some breast cancers. The contribution of C>T mutation is very high in the case of skin-melanoma, whereas the T>A mutation contributes significantly to some kidney cancers. The T>C mutation plays a significant role in biliary and liver cancer, whereas the T>G mutation is more significant in colon cancer and esophageal cancer than in other tumors (see Supplementary file 26).

In order to correct for the influence of mutation bias on fN, fM, and fS values of transcripts in tumor tissues, we have calculated the expected fN*, fM*, and fS* values for all human transcripts using the average values of the six substitution types observed across tumors (Supplementary file 27). It is noteworthy that the average values of expected fN*, fM*, and fS* (fN*=0.04483, fM*=0.69114, and fS*=0.26402) are similar to those (fN = 0.04189, fM = 0.71403, and fS = 0.24408) assuming equal codon frequency and equal probability of the different types of substitutions.

In the case of germline cells, we have also calculated the expected fN**, fM**, and fS** values for all human transcripts using the mutation probabilities characteristic of these cells (Supplementary file 28). It has been shown earlier that the human germline mutation spectrum can be recapitulated by a combination of the cancer signatures SBS1 and SBS5 (Alexandrov et al., 2015; Rahbari et al., 2016; Heredia-Genestar et al., 2020). In the present work, we have combined the effect of mutation signatures SBS1 and SBS5 on the germline mutation spectrum of proteins according to the formula (0.1 × SBS1 + 0.9 × SBS5) recommended by Heredia-Genestar et al., 2020. It is noteworthy that the average values of expected fN**, fM**, and fS** (fN**=0.03791, fM**=0.68653, and fS**=0.27556) are similar to those expected for tumor tissues (fN*=0.04483, fM*=0.69114, and fS*=0.26402).

Per-gene detection of selection signals in tumor tissues

We have used two approaches to determine the observed fM, fS, and fN values of transcripts: one in which we have restricted our analyses to single nucleotide substitutions (hereafter referred to as SO for 'substitution only') and a version in which we have also taken into account subtle indels (hereafter referred to as SSI for 'substitutions and subtle indels').

In the first case, we have calculated for each transcript the fraction of somatic substitutions that could be assigned to the synonymous (fS), nonsynonymous (fM), and nonsense mutation (fN) category (Supplementary file 5 and 9). In the version that also included data for subtle indels, we have calculated the fraction of mutations corresponding to synonymous substitutions (indel_fS), but have merged nonsynonymous substitutions and short inframe indels in the category of mutations that lead to changes in the amino acid sequence (indel_fM). Nonsense substitutions and short frame-shift indels were included in the third category of mutations (indel_fN) as both types of mutation lead eventually to stop codons that truncate the protein (Supplementary file 5 and 9).

Analyses of datasets (Supplementary file 5) containing substitutions only have shown that in 3D scatter plots transcripts form a cluster (Figure 2A) characterized by values of 0.2436 ± 0.0619, 0.7090 ± 0.0556, and 0.0475 ± 0.0322 for fractions of silent, missense, and nonsense substitutions, respectively. The mean fS, fM, and fN values of the transcripts in this cluster are close to those expected if we assume that the structure of the genetic code has the most important role in determining the probability of somatic substitutions during tumor evolution of human genes (Supplementary file 29). Based on the structure of the genetic code, assuming equal usage of the codons and equal probability of different point mutations, in the absence of selection one would expect that a fraction of 0.24408 would be silent, 0.71403 of the single-base substitutions would be missense and 0.04189 would be nonsense mutations.

It is noteworthy, however, that the fS, fM, and fN values of the best known cancer genes (Vogelstein et al., 2013) deviate from those characteristic of the majority of human genes (Figure 2B). The genes in the central cluster, deviating from mean fM, fS, and fN values by ≤1 SD, are characterized by fraction values of 0.24548 ± 0.03079, 0.71084 ± 0.0274, and 0.04368 ± 0.01572 for synonymous, nonsynonymous and nonsense substitutions, respectively. Note that these values are very close to those expected from the structure of the genetic code in the absence of selection, assuming equal frequency of codons and equal probability of the different classes of mutations (Supplementary file 29). This central cluster of genes (Supplementary file 5) is hereafter referred to as PG_SO^f_1SD (for Passenger Gene_Substitution Only deviating from mean fM, fS, and fN values by ≤1 SD) because it is likely to be enriched in genes that play no major role in carcinogenesis.

In harmony with earlier observations, the values for OGs show a significant (p<0.05) shift of fM to higher values (0.8563 ± 0.08224) relative to those of PGs (0.71084 ± 0.0274), reflecting positive selection for missense mutations (Supplementary file 29). On the other hand, the fN values of TSGs are significantly (p<0.05) higher (0.1964 ± 0.11063) than those of PGs (0.04368 ± 0.01572), reflecting positive selection for truncating nonsense mutations (Supplementary file 29).

The genes (1060 transcripts) with values that deviate from mean values of fS, fM, and fN by more than 2SD, however, are likely to be subject to selection. In harmony with this expectation, this group contains transcripts of the majority of known driver genes (62 OG and 119 TSG driver gene transcripts). This gene set, defined by 2SD cut-off value, is hereafter referred to as CG_SO^f_2SD (for Cancer Gene_Substitution Only deviating from mean fM, fS, and fN values by more than 2SD) because it is likely to be enriched in cancer genes (Supplementary file 29). Out of the 1060 transcripts present in CG_SO^f_2SD, 737 transcripts are derived from genes that are not included in the OG, TSG, and CGC cancer gene lists (Supplementary files 5 and 29). Since the majority of these 737 transcripts have parameters that cluster them with known OGs or TSGs, we assume that they qualify as candidate OGs or TSGs. However, a group of genes deviates from both the central PG cluster and the clusters of OGs and TSGs (Figure 2C). The high fS and low fM and fN values of the genes in this cluster suggest that they experience purifying selection during tumor evolution, raising the possibility that they may correspond to TEGs important for the growth and survival of tumors.

Known cancer genes (OGs and TSGs) also separate from the majority of human genes in 3D scatter plots of parameters rSM, rNM, rNS defined as the ratio of fS/fM, fN/fM, fN/fS, respectively (Figure 3). The central cluster of genes that deviate from mean rSM, rNM and rNS values by ≤1 SD is hereafter referred to as PG_SO^r2_1SD (Passenger Gene_Substitution Only deviating from mean rSM, rNM, and rNS values by ≤1 SD) since it is likely to be enriched in PGs. Conversely, the group of transcripts that deviate from mean rSM, rNM, and rNS values by more than 2SD is referred to as CG_SO^r2_2SD (Cancer Gene_Substitution Only deviating from mean rSM, rNM, rNS values by more than 2SD) because it is likely to be enriched in cancer genes (Supplementary file 29). The CG_SOr2_2SD gene set (780 transcripts) contains the majority of driver gene transcripts (40 transcripts of OGs, 103 transcripts of TSGs genes), 79 transcripts of CGC genes and 558 transcripts derived from 468 genes that are not found in the OG, TSG, and CGC cancer gene lists (Supplementary file 29).

In these scatter plots OGs separate from the central cluster in having significantly (p<0.05) lower rSM (0.13971 ± 0.10621) and rNM (0.03936 ± 0.0313) values than those of the central cluster of PGs (rSM: 0.34523 ± 0.06137; rNM: 0.0607 ± 0.02595, Supplementary file 29), reflecting positive selection for missense mutations and negative selection of nonsense mutations. Interestingly, in these plots some OGs (e.g. BCL2) have unusually high values of rSM and low values of rNM (e.g. Figure 3A1, A2 and Supplementary file 5) suggesting that in the case of these OGs purifying selection may dominate over positive selection for amino acid changing mutations.

TSGs also separate from the central cluster: they have significantly (p<0.05) higher rNS (3.92588 ± 5.66261) and rNM (0.31524 ± 0.31575) values than those of PGs (rNS: 0.18403 ± 0.09138; rNM: 0.0607 ± 0.02595; Figure 3A1, A2. Supplementary file 29), reflecting the dominance of positive selection for inactivating mutations.

As mentioned above, the candidate cancer gene set defined by a cut-off value of 2SD also contains 558 transcripts derived from 468 genes that are not found in the OG, TSG, or CGC lists.

Since the majority of these 558 transcripts have parameters that cluster them with known OGs or TSGs, they can be regarded as candidate OGs or TSGs. There is, however, a group of genes that deviate from the clusters of PGs, OGs, and TSGs in that they have unusually high rSM values and low rNM and rNS values. Since these values may be indicative of purifying selection, we assumed that they might correspond to TEGs important for the growth and survival of tumors.

The separation of known cancer genes from the majority of human genes is even more obvious in 3D scatter plots of parameters rSMN, rMSN, and rNSM defined as the ratio of fS/(fM+fN), fM/(fS+fN), and fN/(fS+fM), respectively (Figure 4 A1, A2). In these plots, the gene transcripts are present in a three-pronged cluster, with OGs and TSG being present on separate spikes of this cluster (Figure 4).

We refer to the central cluster of genes, deviating from mean rSMN, rMSN, and rNSM values by ≤1 SD as PG_SO^r3_1SD (Passenger Gene_Substitution Only deviating from mean rSMN, rMSN, and rNSM values by ≤1 SD) as they are likely to be enriched in PGs. Similarly, we refer to the gene set defined by 2SD cut-off value (Supplementary files 5 and 29) as CG_SO^r3_2SD (Cancer Gene_Substitution Only deviating from mean rSMN, rMSN, and rNSM values by more than 2SD) as it is likely to be enriched in candidate cancer genes. This gene set has 751 transcripts, containing the majority of transcripts of known driver genes (35 OGs, 103 TSGs), 80 transcripts of CGC genes and 533 transcripts (derived from 448 genes) not found in the OG, TSG, and CGC cancer gene lists (Supplementary files 5 and 29).

The mean parameters of TSGs differ significantly (p<0.05) from those of PGs in as much as rNSM values of TSGs are higher (0.27937 ± 0.2783) but rSMN (0.10865 ± 0.06128) values are lower than those of PGs (rNSM: 0.04812 ± 0.02561; rSMN: 0.3259 ± 0.09265, Supplementary file 29), reflecting the dominance of positive selection for inactivating nonsense mutations.

In the case of OGs the rMSN values are significantly (p<0.05) higher (15.35971 ± 30.07472) and the rSMN values are significantly lower (0.13363 ± 0.10266) than those of PGs (rMSN: 2.58911 ± 0.68355; rSMN: 0.3259±0.09265 Supplementary file 29), reflecting positive selection for missense mutations. The rNSM values of OGs (0.03394 ± 0.02621) are also significantly (p<0.05) lower than those of PGs (0.04812 ± 0.02561), reflecting purifying selection avoiding nonsense mutations. Interestingly, some OGs have unusually high scores of rSMN (Figure 4 A1, A2, Supplementary file 5) suggesting that in these cases (e.g. BCL2) purifying selection dominates over positive selection for amino acid changing mutations.

As mentioned above, the candidate cancer gene set defined by a cut-off value of 2SD contains 533 transcripts (derived from 448 genes) not found in the OG, TSG, or CGC lists. Since the majority of these genes have parameters that assign them to the clusters containing OGs or TSGs, they can be regarded as candidate OGs or TSGs. There is, however, a group of genes that deviates from the clusters of PGs, OGs, and TSGs (Figure 4). Their high rSMN and low rMSN and rNSM values suggest that they experience purifying selection during tumor evolution, raising the possibility that this group may be enriched in genes essential for the survival of tumors as pro-oncogenes or TEGs.

The three types of analyses described for Substitutions Only (illustrated in Figures 2–4) were also carried out for datasets in which both substitutions and subtle indels (Substitutions and Subtle Indels, SSI) were used (for details of these analyzes see Appendix 2).

Comparison of the data obtained by SO and SSI analyses (Supplementary file 5) revealed that inclusion of indels has only minor influence on the separation of the clusters of PGs and CGs. The lists of PGs identified with 1SD cut-off values for SO analyes (PG_SO^f_1SD, PG_SO^r2_1SD, PG_SO^r3_1SD) and SSI analyses (PG_SSI^f_1SD, PG_SSI^r2_1SD, PG_SSI^r3_1SD) show more than 90% identity in the case of the relevant SO/SSI pairs (Supplementary file 30). Similarly, the lists of CGs identified with 2SD cut-off values for SO analyses (CG_SO^f_2SD, CG_SO^r2_2SD, CG_SO^r3_2SD) and SSI analyses (CG_SSI^f_2SD, CG_SSI^r2_2SD, CG_SSI^r3_2SD) show 78%, 87%, and 92% identity, respectively, for the relevant SO/SSI pairs (Supplementary file 30).

The parameters of the 1158 transcripts present in at least one of the various CG_SO^2SD lists and the 1333 transcripts present in at least one of the various CG_SSI^2SD lists (Supplementary file 31) were used to assign them to three distinct clusters. (1) Cluster of genes positively selected for missense mutations and negatively selected for nonsense mutations; (2) Cluster of genes positively selected for nonsense mutations; (3) Clusters of negatively selected genes (see Figure 2C, Figure 3 B1, B2 and Figure 4 B1, B2). To check the validity and predictive value of the assumption that the genes assigned to these clusters play significant roles in carcinogenesis, we have selected a number of genes for further analyses from the 1457 transcripts present in the combined list (CG_SO^2SD_SSI^2SD) of candidate cancer genes (Supplementary file 31). The results of these analyses are summarized in the Results section.

As outlined in the section on Substitution metrics, a limitation of the analyses discussed above is that they did not take into account the impact of differences in mutation probability on the fN, fM, and fS values of transcripts. In order to eliminate this source of error, we have calculated the expected fN*, fM*, and fS* values for all human transcripts using the probability of the six substitution types observed across tumors (Supplementary file 27). The various types of observed/expected ratios (rN*, rM*, rS*; rSM*, rNM*, rNS*; rSMN*, rMSN*, and rNSM*) of somatic mutations were calculated for all transcripts (Supplementary file 32) and the data were analyzed in 3D scatter plots as described above for the observed values.

As shown in Figures 5–7, the distribution of transcripts in these 3D scatter plots are similar to those observed in the corresponding Figures 2–4, in that known OGs, TSGs, and TEGs are separated from the central cluster of PGs as well as from each other (Supplementary file 32).

Per-gene detection of selection signals in the database of human single-nucleotide polymorphisms

As a reference, we have carried out similar analyses of the fN, fM, and fS parameters of germline mutations, through the analysis of the human database of human single-nucleotide polymorphisms (SNPs; Supplementary file 6). Supplementary file 33 contains the various types of observed/expected ratios (rN**, rM**, rS**; rSM**, rNM**, rNS**; rSMN**, rMSN**, and rNSM**) of germline mutations calculated for all transcripts. Data were analyzed in 3D scatter plots as described for somatic mutations. Details of these analyses are presented in the Results section.

Cancer gene list

As the gold standard of 'known' cancer genes we have used the lists of OG and TSGs identified by Vogelstein et al., 2013. As another list of known cancer genes we have also used the genes of the Cancer Gene Census (Sondka et al., 2018).

Statistical analyses

The statistical package of Origin 2018 was used for all data processing and statistical analysis. We report details of statistical tests in the Supplementary files of the respective sections. Statistical significance was set as a p value of < 0.05.

AuthorAffiliation

1 Institute of Enzymology, Research Centre for Natural Sciences Budapest Hungary
2 Department of Pathogenetics, National Institute of Oncology Budapest Hungary
Australian National University Australia
University of Michigan United States

Word count: 18178

Show less

© 2021, Bányai et al. This work is published under http://creativecommons.org/licenses/by/4.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.

Abstract

Translate

A major goal of cancer genomics is to identify all genes that play critical roles in carcinogenesis. Most approaches focused on genes positively selected for mutations that drive carcinogenesis and neglected the role of negative selection. Some studies have actually concluded that negative selection has no role in cancer evolution. We have re-examined the role of negative selection in tumor evolution through the analysis of the patterns of somatic mutations affecting the coding sequences of human genes. Our analyses have confirmed that tumor suppressor genes are positively selected for inactivating mutations, oncogenes, however, were found to display signals of both negative selection for inactivating mutations and positive selection for activating mutations. Significantly, we have identified numerous human genes that show signs of strong negative selection during tumor evolution, suggesting that their functional integrity is essential for the growth and survival of tumor cells.

Details

Title

Use of signals of positive and negative selection to distinguish cancer genes and passenger genes

Author

Bányai László; Trexler, Maria; Kerekes Krisztina; Csuka Orsolya; Patthy László

University/institution

U.S. National Institutes of Health/National Library of Medicine

Publication year

2021

Publication date

2021

Publisher

eLife Sciences Publications Ltd.

e-ISSN

2050084X

Source type

Scholarly Journal

Language of publication

English

DOI

https://doi.org/10.7554/eLife.59629

ProQuest document ID

2595180008

Use of signals of positive and negative selection to distinguish cancer genes and passenger genes

Jump to:

Full text

Abstract

Details

Suggested sources