Abstract
In high-throughput molecular data analysis, the observations in a dataset often form distinct groups, for example because they were measured at different times, under different conditions, or even in different labs. These groups are generally called batches. Systematic differences between batches that are not attributable to the biological signal of interest are called batch effects. If ignored in analyses of the combined data, batch effects can distort the results. In this paper we present FAbatch, a general, model-based method for correcting such batch effects in analyses involving a binary target variable. It combines two commonly used approaches: location-and-scale adjustment and data cleaning by adjustment for distortions due to latent factors. We compare FAbatch extensively to the most commonly applied competitors on the basis of several performance metrics. FAbatch can also be used in prediction modelling to eliminate batch effects from new test data. This important application is illustrated using real data. We implemented FAbatch and various other functionalities in the R package bapred, available online from CRAN. FAbatch is seen to be competitive in many cases and above average in others. In our analyses, the only cases where it failed to adequately preserve the biological signal were when there were extremely outlying batches and when the batch effects were very weak compared to the biological signal. As seen in this paper, batch effect structures found in real datasets are diverse. Current batch effect adjustment methods are often either too simplistic or make restrictive assumptions that can be violated in real datasets. Owing to the generality of its underlying model and its ability to perform well, FAbatch represents a reliable tool for batch effect adjustment in most situations found in practice.
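To make the term "location-and-scale adjustment" concrete, the following is a minimal sketch of its simplest form: centering and scaling one variable separately within each batch. It is an illustration only, not FAbatch and not the bapred API. The function name `standardize_within_batches` and the toy data are assumptions introduced here, and the sketch does not include the latent-factor adjustment or the handling of the binary target variable described in the abstract.

```python
from statistics import mean, pstdev


def standardize_within_batches(values, batches):
    """Center and scale one variable within each batch (hypothetical helper).

    values  -- list of measurements for a single variable
    batches -- list of batch labels, same length as values
    Returns a new list in which every batch has mean 0 and, if the batch
    is not constant, population standard deviation 1.
    """
    adjusted = [0.0] * len(values)
    for label in set(batches):
        # Indices of the observations that belong to this batch
        idx = [i for i, b in enumerate(batches) if b == label]
        batch_values = [values[i] for i in idx]
        location = mean(batch_values)
        scale = pstdev(batch_values)
        for i in idx:
            # A constant batch carries no scale information; map it to 0
            adjusted[i] = (values[i] - location) / scale if scale > 0 else 0.0
    return adjusted


# Toy data (assumed): batch "b" is shifted by +10 relative to batch "a"
values = [1.0, 2.0, 3.0, 11.0, 12.0, 13.0]
batches = ["a", "a", "a", "b", "b", "b"]
adjusted = standardize_within_batches(values, batches)
```

After the adjustment, the two batches in the toy data coincide, because they differed only by a shift. Note that a purely within-batch adjustment like this would also remove real biological differences if the class proportions differed between batches, which is one reason methods that model the target variable explicitly are of interest.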
Document type: | Paper |
---|---|
Publication form: | Submitted Version |
Keywords: | Batch effects, High-dimensional data, Data preparation, Prediction, Latent factors |
Faculty: | Mathematics, Computer Science and Statistics > Statistics > Technical Reports |
Subject areas: | 500 Natural sciences and mathematics > 500 Natural sciences |
URN: | urn:nbn:de:bvb:19-epub-25331-3 |
Language: | English |
Document ID: | 25331 |
Date published on Open Access LMU: | 23 Sep 2015, 02:51 |
Last modified: | 04 Nov 2020, 13:06 |