Abstract
In recent years, more and more multi-omics data have become available, that is, data featuring measurements of several types of omics data for each patient. While using multi-omics data as covariates in outcome prediction is promising, it is also challenging because of the complex structure of such data. Random forest is a prediction method known for its ability to capture complex dependency patterns between the outcome and the covariates. Against this background, we developed five candidate random forest variants tailored to multi-omics covariate data. These variants modify the split point selection of random forest to incorporate the block structure of multi-omics data and can be applied to any outcome type for which a random forest variant exists, such as categorical, continuous, and survival outcomes. Using 20 multi-omics data sets with survival outcomes, we compared the prediction performance of the five variants, using random survival forest as a reference method. We also considered the common special case in which clinical covariates and measurements of a single omics data type are available. We identified one variant, termed "block forest", that performed significantly better than standard random survival forest (adjusted p-value: 0.027). The two best-performing variants have in common that the block choice is randomized in the split point selection procedure. When clinical covariates and a single omics data type were available, the improvements of the variants over random survival forest were larger than in the multi-omics case; in this setting, four of the five variants performed significantly better than random survival forest. The degree of improvement over random survival forest varied strongly across data sets. In conclusion, the new prediction method block forest for multi-omics data can significantly improve on the prediction performance of random forest. Block forest is particularly effective in the special case of using clinical covariates in combination with measurements of a single omics data type.
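The abstract notes that the two best-performing variants randomize the block choice during split point selection. The following Python sketch illustrates that general idea only; it is not the authors' implementation. The function name `sample_split_candidates`, the equal block probabilities, the within-block candidate count of sqrt(p), and the block layout are all assumptions made for illustration.

```python
import random
from math import ceil, sqrt


def sample_split_candidates(blocks, rng):
    """Pick one block at random, then sample candidate covariates from it.

    blocks: dict mapping a block name to a list of covariate indices.
    Returns (block_name, sorted list of candidate covariate indices).
    """
    # Assumption: every block is chosen with equal probability.
    block_name = rng.choice(sorted(blocks))
    members = blocks[block_name]
    # Assumption: the common random forest default of about sqrt(p)
    # candidates, applied within the chosen block only.
    n_candidates = max(1, ceil(sqrt(len(members))))
    candidates = sorted(rng.sample(members, n_candidates))
    return block_name, candidates


# Hypothetical layout: 5 clinical covariates followed by 100 omics covariates.
blocks = {
    "clinical": list(range(0, 5)),
    "omics": list(range(5, 105)),
}

rng = random.Random(0)
draws = [sample_split_candidates(blocks, rng) for _ in range(1000)]
clinical_share = sum(name == "clinical" for name, _ in draws) / len(draws)
```

Under these assumptions, each block is selected for roughly half of the candidate splits regardless of its size, so a small clinical block is considered about as often as a much larger omics block.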
Document type: | Paper |
---|---|
Keywords: | Multi-omics data, Prediction, Random forest, Machine learning, Statistics, Survival analysis, Cancer |
Faculty: | Mathematics, Computer Science and Statistics > Statistics > Technical Reports |
Subjects: | 500 Natural sciences and mathematics > 500 Natural sciences |
URN: | urn:nbn:de:bvb:19-epub-59631-2 |
Language: | English |
Document ID: | 59631 |
Date published on Open Access LMU: | 21 Dec 2018, 06:59 |
Last modified: | 04 Nov 2020, 13:38 |