Article

Recurrent somatic mutations reveal new insights into consequences of mutagenic processes in cancer

Stobbe, Miranda D.; Thun, Gian A.; Dieguez-Docampo, Andrea; Oliva, Meritxell; Whalley, Justin P.; Raineri, Emanuele; Gut, Ivo G.

Computer Science

PLOS COMPUTATIONAL BIOLOGY

2019

Volume 15

abstract

The sheer size of the human genome makes it improbable that identical somatic mutations at the exact same position are observed in multiple tumours solely by chance. The scarcity of cancer driver mutations also precludes positive selection as the sole explanation. Therefore, recurrent mutations may be highly informative of characteristics of mutational processes. To explore the potential, we use recurrence as a starting point to cluster >2,500 whole genomes of a pan-cancer cohort. We describe each genome with 13 recurrence-based and 29 general mutational features. Using principal component analysis we reduce the dimensionality and create independent features. We apply hierarchical clustering to the first 18 principal components followed by k-means clustering. We show that the resulting 16 clusters capture clinically relevant cancer phenotypes. High levels of recurrent substitutions separate the clusters that we link to UV-light exposure and deregulated activity of POLE from the one representing defective mismatch repair, which shows high levels of recurrent insertions/deletions. Recurrence of both mutation types characterizes cancer genomes with somatic hypermutation of immunoglobulin genes and the cluster of genomes exposed to gastric acid. Low levels of recurrence are observed for the cluster where tobacco-smoke exposure induces mutagenesis and the one linked to increased activity of cytidine deaminases. Notably, the majority of substitutions are recurrent in a single tumour type, while recurrent insertions/deletions point to shared processes between tumour types. Recurrence also reveals susceptible sequence motifs, including TT[C>A]TTT and AAC[T>G]T for the POLE and 'gastric-acid exposure' clusters, respectively. Moreover, we refine knowledge of mutagenesis, including increased C/G deletion levels in general for lung tumours and specifically in midsize homopolymer sequence contexts for microsatellite instable tumours. Our findings are an important step towards the development of a generic cancer diagnostic test for clinical practice based on whole-genome sequencing that could replace multiple diagnostics currently in use.

Author summary

Mutations found in the DNA of a tumour are expected to be largely unique to each tumour as there are three billion places in the DNA that can be mutated. However, despite these odds, in a cancer study with 2,583 participants covering 37 tumour types we observe in total over a million non-unique mutations. Based on this observation, we hypothesize that these mutations can be highly informative of the biological processes that caused them. Using characteristics of these non-unique mutations and general statistics like the total number of mutations, we classify the tumours into 16 groups. These groups not only delineate various mutational processes, but also characterize them in more detail. Moreover, we can link the groups to several clinically actionable phenotypes. Our work is a crucial step towards the development of a generic and personalized cancer diagnostic test that only uses the mutations found in the tumour.
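
To make the notion of a recurrent (non-unique) mutation concrete, the sketch below counts how many tumours carry the same mutation at the same position and derives one per-tumour statistic: the fraction of its mutations that also occur in at least one other tumour. This is an illustration only. The text does not define the 13 recurrence-based features, so the function names, the `(chromosome, position, ref, alt)` key, the `min_tumours` threshold, and the toy data are all assumptions rather than the authors' implementation.

```python
from collections import Counter
from typing import Dict, Set, Tuple

# Assumed representation: a mutation is identified by
# (chromosome, position, reference allele, alternate allele).
Mutation = Tuple[str, int, str, str]


def recurrence_counts(tumours: Dict[str, Set[Mutation]]) -> Counter:
    """Count, for every mutation, the number of tumours that carry it."""
    counts: Counter = Counter()
    for mutations in tumours.values():
        # Each tumour holds a set, so it contributes at most 1 per mutation.
        counts.update(mutations)
    return counts


def recurrent_fraction(
    tumours: Dict[str, Set[Mutation]], min_tumours: int = 2
) -> Dict[str, float]:
    """Per tumour, the fraction of its mutations seen in >= min_tumours tumours."""
    counts = recurrence_counts(tumours)
    fractions: Dict[str, float] = {}
    for tumour_id, mutations in tumours.items():
        if not mutations:
            fractions[tumour_id] = 0.0
            continue
        n_recurrent = sum(1 for m in mutations if counts[m] >= min_tumours)
        fractions[tumour_id] = n_recurrent / len(mutations)
    return fractions


# Hypothetical toy cohort (not real data): one substitution is shared
# by tumour_A and tumour_B; everything else is unique.
toy_cohort: Dict[str, Set[Mutation]] = {
    "tumour_A": {("chr1", 100, "C", "A"), ("chr2", 500, "T", "G")},
    "tumour_B": {
        ("chr1", 100, "C", "A"),
        ("chr3", 42, "G", "T"),
        ("chr5", 9, "A", "T"),
        ("chr7", 77, "G", "C"),
    },
    "tumour_C": {("chrX", 7, "A", "-")},
}

if __name__ == "__main__":
    for tumour_id, fraction in sorted(recurrent_fraction(toy_cohort).items()):
        print(tumour_id, fraction)
```

In a pipeline like the one summarized in the abstract, statistics of this kind would form part of the feature vector for each genome before dimensionality reduction and clustering. Here the fraction is computed over all mutation types together, whereas the abstract distinguishes recurrent substitutions from recurrent insertions/deletions.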