Abstract
Missing values are common in modern medical research on complex diseases. Such data often contain nominal or categorical variables, for example single nucleotide polymorphisms (SNPs) in genetic studies. If missing values are not handled properly, the downstream statistical analysis of the incomplete data may be biased. While various imputation methods are available for metrically scaled variables, methods for categorical data are scarce. Nearest neighbor imputation has been shown to work well for high-dimensional metrically scaled variables. In this paper, we propose a weighted nearest neighbors approach to impute missing values in categorical variables in high-dimensional data sets. The proposed method explicitly uses information on the association among attributes. Its performance is compared with that of available imputation methods under different simulation settings. Several real data sets, including heart, DNA, and lymphatic cancer data, are also used to support the simulation results. The results show that weighting the attributes yields smaller imputation errors than existing approaches such as random forest and MICE. © 2022 Elsevier Inc. All rights reserved.
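
To make the idea of association-weighted nearest neighbor imputation concrete, the following is a minimal sketch of the general principle only, not the authors' algorithm. The abstract does not specify the association measure, the distance, or the weighting scheme, so several choices here are assumptions: Cramér's V serves as the association weight, a weighted Hamming distance compares rows, the donor vote is an unweighted majority, and missing entries are coded as `-1`. The names `cramers_v`, `impute_weighted_knn`, and `MISSING` are hypothetical.

```python
import numpy as np

MISSING = -1  # assumed integer code for a missing category


def cramers_v(a, b):
    """Cramér's V between two integer-coded columns, using rows observed in both."""
    ok = (a != MISSING) & (b != MISSING)
    a, b = a[ok], b[ok]
    if a.size == 0:
        return 0.0
    _, ai = np.unique(a, return_inverse=True)
    _, bi = np.unique(b, return_inverse=True)
    r, c = ai.max() + 1, bi.max() + 1
    if min(r, c) < 2:
        return 0.0  # a constant column carries no association information
    table = np.zeros((r, c))
    np.add.at(table, (ai, bi), 1)  # contingency table of joint counts
    expected = table.sum(1, keepdims=True) * table.sum(0, keepdims=True) / table.sum()
    chi2 = ((table - expected) ** 2 / expected).sum()
    return float(np.sqrt(chi2 / (table.sum() * (min(r, c) - 1))))


def impute_weighted_knn(X, k=3):
    """Impute MISSING entries column by column from the k nearest donor rows.

    Distances are Hamming mismatches weighted by each attribute's association
    (assumed here: Cramér's V) with the column being imputed.
    """
    X = np.asarray(X)
    out = X.copy()
    n, p = X.shape
    for j in range(p):
        rows = np.where(X[:, j] == MISSING)[0]
        donors = np.where(X[:, j] != MISSING)[0]
        if rows.size == 0 or donors.size == 0:
            continue
        # Attribute weights: association of every other column with column j.
        w = np.array([0.0 if m == j else cramers_v(X[:, j], X[:, m]) for m in range(p)])
        for i in rows:
            both = (X[donors] != MISSING) & (X[i] != MISSING)  # jointly observed cells
            mismatch = (X[donors] != X[i]) & both
            denom = (both * w).sum(axis=1)
            safe = np.where(denom > 0, denom, 1.0)
            # Donors sharing no informative attribute get the maximal distance 1.
            dist = np.where(denom > 0, (mismatch * w).sum(axis=1) / safe, 1.0)
            nearest = donors[np.argsort(dist, kind="stable")[:k]]
            vals, counts = np.unique(X[nearest, j], return_counts=True)
            out[i, j] = vals[np.argmax(counts)]  # majority vote among neighbors
    return out


# Toy usage: column 1 is strongly associated with column 0, column 2 is weakly so.
X = np.array([[0, 0, 1], [0, 0, 0], [0, 0, 1],
              [1, 1, 0], [1, 1, 1], [1, 1, 0],
              [MISSING, 1, 1]])
X_imputed = impute_weighted_knn(X, k=3)
```

In this toy example the strongly associated column dominates the distance, so the donors chosen for the last row are those that agree with it on column 1. A full treatment would also need to choose `k` and the weighting function, which this sketch leaves fixed.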
Document type: | Journal article |
---|---|
Faculty: | Mathematics, Computer Science and Statistics > Statistics |
Subject areas: | 500 Natural sciences and mathematics > 510 Mathematics |
ISSN: | 0020-0255 |
Language: | English |
Document ID: | 110982 |
Date of publication on Open Access LMU: | 02 Apr 2024, 07:22 |
Last modified: | 02 Apr 2024, 07:22 |