Abstract
To date, most medical tests derived by applying classification methods to high-dimensional molecular data are hardly used in clinical practice. This is partly because the prediction error observed when applying them to external data is usually much higher than the internal error estimated through within-study validation procedures. We suggest using addon normalization and addon batch effect removal techniques in this context to reduce systematic differences between external data and the original dataset, with the aim of improving prediction performance. We evaluate the impact of addon normalization and seven batch effect removal methods on cross-study prediction performance for several common classifiers using a large collection of microarray gene expression datasets, showing that some of these techniques reduce prediction error. All investigated addon methods are implemented in our R package "bapred".
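To illustrate the general idea behind an addon procedure, the following is a minimal sketch of addon quantile normalization, written in Python with NumPy rather than with the "bapred" package. It assumes that a reference distribution is learned once from the original (training) data and that external samples are then mapped onto it without changing the training data. The function names `fit_reference_quantiles` and `addon_quantile_normalize` and the simulated data are hypothetical and are not taken from the paper or from "bapred".

```python
import numpy as np

def fit_reference_quantiles(train):
    """Learn the reference distribution from training data (samples x genes).

    Each sample is sorted, and the sorted values are averaged across
    samples, giving one reference value per rank.
    """
    return np.sort(train, axis=1).mean(axis=0)

def addon_quantile_normalize(samples, reference):
    """Map external samples onto a previously learned reference distribution.

    Each value is replaced by the reference value of the same rank within
    its sample. The training data are not touched (the "addon" property).
    """
    samples = np.atleast_2d(samples)
    # Double argsort yields the rank (0-based) of each value within its sample.
    ranks = np.argsort(np.argsort(samples, axis=1), axis=1)
    return reference[ranks]

# Simulated data (assumption): the external data are shifted and rescaled
# relative to the training data, mimicking a systematic batch difference.
rng = np.random.default_rng(0)
train = rng.normal(loc=8.0, scale=1.0, size=(20, 100))
external = rng.normal(loc=9.5, scale=2.0, size=(5, 100))

reference = fit_reference_quantiles(train)
external_adj = addon_quantile_normalize(external, reference)
```

After this step every external sample has the same empirical distribution as the training reference, while the ordering of genes within each sample is preserved. A classifier fitted on the normalized training data could then be applied to `external_adj`.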
Document type: | Paper |
---|---|
Keywords: | Classification, Machine learning, Prediction, Data analysis, Microarray data analysis |
Faculty: | Mathematics, Computer Science and Statistics > Statistics > Technical Reports |
Subjects: | 500 Natural sciences and mathematics > 500 Natural sciences |
URN: | urn:nbn:de:bvb:19-epub-28298-0 |
Language: | English |
Document ID: | 28298 |
Date of publication on Open Access LMU: | 08 Jun 2016, 15:37 |
Last modified: | 04 Nov 2020, 13:07 |