High-throughput (HT) techniques are increasing the volume of genomic information in molecular pathology, with growing cohort sizes offering molecular and phenotypic sub-classifications. This is driving a more holistic understanding of the phenotypic traits that underpin disease. Such data generation, its convergence with other data and the need to fast-track biomarker discovery and evaluation are causing a major bottleneck in interpretation. Diagnostic pathology departments globally store many millions of tissue samples across different cancer types and patient populations in the form of formalin-fixed paraffin-embedded (FFPE) blocks. This in itself provides an enormously rich cohort of samples which can be used in molecular discovery and translational medicine in solid tumours. For example, many studies utilise pathology archives to generate tissue microarrays (TMAs) for the HT-analysis of tissue biomarkers across multiple patient samples in a single assay. However, the handling, preparation and storage of tissue samples can be highly variable, potentially affecting the quality of downstream molecular evaluation that is sensitive to these pre-analytical variables. The goal of biobanks or bio-repositories is to overcome this variability by planned prospective collection, in which several samples per patient are collected concomitantly with fully controlled variables, specifically for the purposes of research. Collating clinical and analytical data on these samples using dedicated informatics-based methods provides unique opportunities to understand these complex clinico-genomic data sets. The impact of this HT-integrative approach on treatment has been the focus of many recent seminal papers and is the foundation of stratified medicine (Garnett et al. 2012; Misale et al. 2012; Diaz et al. 2012).
In an interpretative context, clinical and genomic resources are becoming increasingly prominent; The Cancer Genome Atlas (Cancer Genome Atlas Research Network, 2008), for example, offers large data collections on multiple cancer types. In their recent review, Kristensen and colleagues discuss the tools and complexity involved in analysing integrative genomic data to discover patient subgroups for prognostic and diagnostic insight (Kristensen et al., 2014). Recent integrative publications in cancer research are starting to depict a landscape of increased scrutiny, complexity and discovery (Ding et al. 2014; Schroeder et al. 2013; Wang et al. 2013). In parallel, HT-tissue biomarker data can be uploaded and analysed with TMANavigator (Lubbock et al. 2013), and cohort data can be analysed through an open platform for web-based biomedical queries (Pennington et al. 2014). However, there is still a dearth of methods that support the merging of disparate data sets spanning patients, their samples, and tissue and genomic biomarkers, and the interrogation of these complex data sets for biomarker discovery, translational research and patient stratification.
An additional issue with developing an integrated data analysis system is that clinical phenogenomic data tend to be dynamic. Most studies are longitudinal, requiring ongoing patient follow-up in which new data are constantly added to the system. Digital image evaluation is also an important component of tissue-based research, particularly for heterogeneous biomarkers requiring HT-evaluation, where data integration and concordance are of particular interest in biomarker discovery and add performance, robustness and reproducibility. In addition, analytical methods are constantly evolving, requiring analysis and re-analysis of the sample cohort. The statistical approaches to data-mining therefore also need to be updated constantly, and a modern, adaptable framework is needed to support them.
Materials and methods
We have adopted an integrative “omics” (‘integromics’) method that allows the streamlined merging of data from diverse sources (Searles, 2005; Lê Cao et al. 2009). This we have called PICan (Pathology Integromics in Cancer). PICan is a comprehensive clinical genomics data management approach for molecular tissue pathology using MySQL and
Overarching view of the PICan system. Resolving data to minimal information templates allows the creation of a knowledge base, which we implement as database tables that can hold molecular and digital pathology information. We can then, in turn, statistically analyse the data using stratified selections. These are streamlined and instantly expandable at both the cohort level and the data entry level, in order to place new biomarkers in a molecular pathology context.
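The idea of resolving templates into linked database tables and then making a stratified selection can be illustrated with a minimal sketch. It uses Python's built-in sqlite3 module as a stand-in for MySQL, and the table names, column names, identifiers and values are all hypothetical; none of them is taken from the PICan schema.

```python
import sqlite3

def build_demo_db():
    """Create an in-memory database with three hypothetical linked tables."""
    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE patient (patient_id TEXT PRIMARY KEY,
                              age INTEGER, er_status TEXT);
        CREATE TABLE tma_core (core_id TEXT PRIMARY KEY,
                               patient_id TEXT REFERENCES patient(patient_id),
                               tma_map TEXT);
        CREATE TABLE ihc_score (core_id TEXT REFERENCES tma_core(core_id),
                                marker TEXT, h_score REAL);
    """)
    # Invented example rows: one core per patient, one p53 score per core.
    con.executemany("INSERT INTO patient VALUES (?, ?, ?)",
                    [("P001", 54, "positive"),
                     ("P002", 61, "negative"),
                     ("P003", 47, "positive")])
    con.executemany("INSERT INTO tma_core VALUES (?, ?, ?)",
                    [("C1", "P001", "map1"),
                     ("C2", "P002", "map1"),
                     ("C3", "P003", "map2")])
    con.executemany("INSERT INTO ihc_score VALUES (?, ?, ?)",
                    [("C1", "p53", 210.0),
                     ("C2", "p53", 15.0),
                     ("C3", "p53", 120.0)])
    return con

def stratified_selection(con, marker, er_status):
    """Return (patient_id, core_id, h_score) for one marker in one clinical stratum."""
    query = """
        SELECT p.patient_id, c.core_id, s.h_score
        FROM patient AS p
        JOIN tma_core AS c ON c.patient_id = p.patient_id
        JOIN ihc_score AS s ON s.core_id = c.core_id
        WHERE s.marker = ? AND p.er_status = ?
        ORDER BY s.h_score DESC
    """
    return con.execute(query, (marker, er_status)).fetchall()

con = build_demo_db()
rows = stratified_selection(con, "p53", "positive")
```

Because every core carries the patient's unique identifier, a new biomarker only needs new rows in the score table to become available to the same stratified query.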
Digitally scanned IHC TMA information can be seamlessly matched to the patient clinical and pathological information. A) Users can select cores from the map interface or scoring interface to retrieve the core at higher resolution together with supporting information; progress is tracked on both maps. B) As cores are virtually de-arrayed, we can also rank them by their associated scoring metrics across multiple maps, an important resource for collaboration, consistency and training.
Because PICan offers a browser-based interface, we segregate the data collections into separate navigational browser pages for data interpretation and digital image assessment (Figure 3B) (Supp Methods 2). PICan differs from statistical software packages in its capacity to integrate different types of patient, sample and biomarker analytic data, concatenating them into a single searchable interface where molecular and phenotypic data can be combined to generate new signatures of disease. The system not only runs routine statistical tests but, with the added HT-data, also supports supervised and unsupervised analytical techniques that factor in other known molecular information. This approach allows research flexibility in prognostic evaluation and in the selection of a ‘best scoring metric criteria’ in IHC using receiver operating characteristic curves, Kaplan-Meier curves or Cox proportional hazards models (Supp Methods 5). The integration of digital pathology technology within PICan allows us to review tissue morphology and protein expression profiles within the system. Patient stratification resolves differences in the frequencies of protein biomarkers and transcriptomic signatures, integrating additional mutational knowledge.
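To make the prognostic evaluation concrete, the following is a minimal sketch of a Kaplan-Meier estimate computed separately for patients above and below an IHC score threshold. It is written in plain Python for illustration and is not the implementation used in PICan; the function names, follow-up times, event flags, scores and threshold are all invented.

```python
from collections import Counter

def kaplan_meier(times, events):
    """Return [(time, survival_probability)] at each distinct event time.

    times  : follow-up time for each patient
    events : 1 if the event was observed at that time, 0 if censored
    """
    deaths = Counter(t for t, e in zip(times, events) if e)
    exits = Counter(times)          # everyone leaving the risk set at time t
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(exits):
        d = deaths.get(t, 0)
        if d:
            # Product-limit step: multiply by the fraction surviving time t.
            survival *= 1.0 - d / at_risk
            curve.append((t, survival))
        at_risk -= exits[t]         # remove events and censored cases
    return curve

def km_by_threshold(times, events, scores, threshold):
    """Split patients at a score threshold and estimate one curve per group."""
    groups = {"high": ([], []), "low": ([], [])}
    for t, e, s in zip(times, events, scores):
        key = "high" if s >= threshold else "low"
        groups[key][0].append(t)
        groups[key][1].append(e)
    return {key: kaplan_meier(ts, es) for key, (ts, es) in groups.items()}

# Invented example: six patients, follow-up in months, hypothetical H scores.
times = [5, 8, 12, 12, 20, 24]
events = [1, 0, 1, 1, 0, 1]
scores = [200, 10, 180, 30, 20, 150]
curves = km_by_threshold(times, events, scores, threshold=100)
```

Repeating the split over a range of candidate thresholds, and comparing the resulting groups with a log-rank test or a Cox model, is one way a scoring criterion could be explored; that comparison is not shown here.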
A) Resolved patient information, including clinical and pathological data matched to genomic and transcriptomic information, with interactive graphical functionality (inset). B) Navigational pages for: data-exchange/upload (Top); HT-analysis for differential expression and supervised analysis (Middle); and single biomarker statistical analyses (Bottom).
As an example of the power and utility of the PICan method, we used this approach as the primary data integration and analysis platform for a large breast cancer tissue biomarker study (Boyle et al. 2014). Taking one of the best known genes and proteins in cancer biology, p53, we address two main test hypotheses. Firstly, we test whether a PICan-driven analysis of image-related information in TMAs achieves the same biological and clinical relevance in breast cancer once the IHC results are digitally scanned, de-arrayed and resolved so that they can be analysed against other baseline biomarkers (Figure 3A). From here, the result becomes a marker that can be used in other browser pages (i.e. stratifying HT-data) or downloaded for the researcher to use elsewhere (Figure 3B). Secondly, we explore whether known HT-genomic signatures of p53 biology hold their statistical relevance when analysed, in the PICan context, on independent sets of HT-generated results in breast cancer. The latter is one of the most important uses of this method, which we demonstrate here by examining the p53 correlations at the three levels of the central dogma of molecular biology (DNA mutation, RNA gene amplification and protein expression) and the clinical relevance of such holistic analysis for individual cases and for the whole cohort.
Because TMA maps can be resolved against the patient information, a number of scoring metrics (H score, percentage, quick score, Allred score, intensity scores or user-defined scores) are easily integrated into the platform; in this instance we obtained 288 p53 IHC cores from the 293 patients. The cores, digitally de-arrayed and matched to the individual, can be visualised, with the p53 IHC data integrated by direct data-exchange and linked to the individual by unique identifiers. The system can then seamlessly integrate the associated scoring metrics against the clinical cohort and across TMA maps, as can be achieved with leading digital pathology software (Leica
A) We use the patient integromics analysis to reveal candidates with specific mutations of interest. B) In line with the seminal paper by Miller et al. (2005), we examined the 18/20 identifiable genes from their signature in our microarrays that survived a median filter, using the probe with the highest variance for each gene (where more than one was present), to examine how the integrated information clustered in our data. Top column bar, p53 mutation: Green – Mt; Blue – Wt. Second column bar, IHC: Red – aberrant extreme (as defined by Boyle et al.); Black – non-extreme (intermediate patterns in IHC). C) The clinical significance of these data is demonstrated by a significant prognostic IHC threshold for aberrant extreme p53, which can be visualised by the thresholded de-arrayed cores. D) Pvclust bootstrap resampling (n = 1000) to estimate the uncertainty in the clustering evaluated in B.
This new ‘integromic’ framework allows us to test HT-signatures and confirm their biological value by correlating the resulting taxonomy with the mutation and aberrant protein status for the same biomarker. To do so, we queried which patients in our breast cancer cohort had mutational aberrations in TP53 (Figure 4A), based on the TP53 signature hypothesis and mutational analysis by Miller et al. (2005). This query summarises patients individually by their genomic and clinical/pathological/treatment profiles and calls upon mutational information such as Sanger sequencing or NGS variants. Employing this additional information, we can create a new analysis framework by A) stratifying and clustering the expression data to derive differential expression signatures and B) submitting gene lists for unsupervised signature validation (Figure 4B and D). Our analysis, using the signature stated above, was able to differentiate p53 mutant from wild-type cases; this differentiation also correlated with the aberrant protein status and with patient survival.
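The probe selection step used for the signature genes (keeping the probe with the highest variance when a gene has more than one) can be sketched as follows. This is an illustrative sketch only, not the code used in the study: the function name, probe identifiers, gene symbols and expression values are hypothetical, and the median filter applied beforehand is not reproduced because its exact definition is not given here.

```python
from statistics import variance

def top_variance_probe_per_gene(expression, probe_to_gene):
    """Pick, for each gene, the probe whose expression varies most across samples.

    expression    : dict mapping probe id -> list of values (one per sample)
    probe_to_gene : dict mapping probe id -> gene symbol
    Returns a dict mapping gene symbol -> selected probe id.
    """
    best = {}
    for probe, values in expression.items():
        gene = probe_to_gene.get(probe)
        if gene is None:
            continue  # probe has no gene annotation, so it is skipped
        var = variance(values)  # sample variance; needs at least two samples
        if gene not in best or var > best[gene][1]:
            best[gene] = (probe, var)
    return {gene: probe for gene, (probe, _) in best.items()}

# Invented example: two probes map to GENE1, one probe maps to GENE2.
expression = {
    "probe_a": [1.0, 1.1, 0.9],
    "probe_b": [0.5, 2.0, 3.5],
    "probe_c": [2.0, 2.0, 2.1],
}
probe_to_gene = {"probe_a": "GENE1", "probe_b": "GENE1", "probe_c": "GENE2"}
selected = top_variance_probe_per_gene(expression, probe_to_gene)
```

The resulting one-probe-per-gene matrix is the kind of input that could then be passed to hierarchical clustering and bootstrap resampling.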
The PICan method has a wide range of emerging applications, improving the resolution of the complex, multivariate, multiplex analysis of modern tissue-based research and its role in the delivery of personalised medicine. Over time, the expansion of PICan will demand the incorporation of new technologies and their associated data types as these techniques become available. In our setting, this has led to the need to expand our translational bioinformatics team. In addition, safeguarding and expanding high-quality curated clinical and pathological data has required dedicated resources. High-quality data expansion and well-defined ontologies are going to be in high demand for data integromics in the future, where lack of control across different analytical platforms may decrease analytical sensitivity. In our setting, the fact that the clinical/pathological data are collected and curated by in-house pathologists and clinicians allows us to control statistical selection and safeguard data analysis. However, the system is designed for exploratory mining of the multivariate data, where findings will hopefully enhance discovery but where validation and more sophisticated statistical interrogation will be required. The approach is nevertheless flexible across all cancer types and analytical tools and will allow researchers to statistically analyse complex datasets within or across different diseases. This novel approach would be beneficial to other molecular pathology laboratories and clinical trials facilities as a method to support data integration, improve tissue-based analysis and fast-track biomarker discovery and validation (Salto-Tellez et al., 2014).
Author contributions
MST and PH led the project, designed the concept and guided its development from a digital pathology and molecular pathology context. DMA developed the system and undertook the validation. JB, YW and DK constructed table entities and statistical methods to interpret the data. DB, GI, MM and PD designed the templates, in this instance for breast, the focus of this manuscript. PB and RH helped in TMA de-arraying and digital image interpretation. RK, PM and PH provided insight and interpretation on HT-data. MC provided the data on mutational analysis. JJ governed the ethical framework and integrity of the system surrounding clinical and pathological collections.
Conflict of interest
Prof. D. Paul Harkin (president and managing director) and Prof. Richard D. Kennedy (VP and clinical director) are currently at Almac Diagnostics, manufacturers of the Breast DSA used, in part, in this methodology. Prof. Manuel Salto-Tellez was a consultant for Almac during part of this project and is currently a member of the Advisory Panel of PathXL. Prof. Peter Hamilton is a founder board member at PathXL.
Acknowledgements
The authors wish to thank the NI Biobank for help in constructing the necessary framework and ethics involved in data collection. We would also like to thank the Northern Ireland Molecular Pathology Laboratory for tissue microarray construction, and Anne Carson for digital image scanning and hosting on dedicated servers, along with the excellent team at PathXL. The clinicians deserve special mention for devising minimal information templates for their particular focus groups and dedicated teams. The research leading to these results has received funding from the People Programme (Marie Curie Actions) of the European Union's Seventh Framework Programme FP7/2007-2013/ under REA grant agreement no [285910].
Supplementary data related to this article can be found at
© 2015. This work is published under https://creativecommons.org/licenses/by/4.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.
Abstract
Modern cancer research on prognostic and predictive biomarkers demands the integration of established and emerging high-throughput technologies. However, these data are meaningless unless carefully integrated with patient clinical outcome and epidemiological information. Integrated datasets hold the key to discovering new biomarkers and therapeutic targets in cancer. We have developed a novel approach and set of methods for integrating and interrogating phenomic, genomic and clinical data sets to facilitate cancer biomarker discovery and patient stratification. Applied to a known paradigm, the biological and clinical relevance of TP53, PICan was able to recapitulate the known biomarker status and prognostic significance at the DNA, RNA and protein levels.
1 Centre for Cancer Research and Cell Biology (CCRCB), Queen's University Belfast, Belfast, United Kingdom