
**Mohajer, Mojgan; Englmeier, Karl-Hans and Schmid, Volker J. ORCID: https://orcid.org/0000-0003-2195-8130 (1. December 2010): A comparison of Gap statistic definitions with and without logarithm function. Department of Statistics: Technical Reports, No.96 [PDF, 766kB]**

## Abstract

The Gap statistic is a standard method for determining the number of clusters in a set of data. The Gap statistic standardizes the graph of $\log(W_{k})$, where $W_{k}$ is the within-cluster dispersion, by comparing it to its expectation under an appropriate null reference distribution of the data. We suggest using $W_{k}$ instead of $\log(W_{k})$ and comparing it to the expectation of $W_{k}$ under a null reference distribution. In fact, whenever a number fulfills the original Gap statistic inequality, this number also fulfills the inequality of a Gap statistic using $W_{k}$, but not *vice versa*. The two definitions of the Gap function are evaluated on several simulated data sets and on a real data set of DCE-MR images.
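
The two definitions contrasted in the abstract can be illustrated with a minimal Python sketch. Several details here are assumptions not taken from this record: the function names are hypothetical, a basic k-means stands in for the clustering step (the keywords suggest the report works with average linkage), the null reference is uniform over the bounding box of the data, and the standard error and selection rule $\mathrm{Gap}(k) \geq \mathrm{Gap}(k+1) - s_{k+1}$ follow the usual formulation of Tibshirani et al., applied in the same form to the variant without logarithm.

```python
import math
import random


def within_dispersion(points, labels, k):
    """W_k: sum over clusters of squared distances to the cluster mean."""
    total = 0.0
    for r in range(k):
        members = [p for p, lab in zip(points, labels) if lab == r]
        if members:
            center = [sum(c) / len(members) for c in zip(*members)]
            total += sum(math.dist(p, center) ** 2 for p in members)
    return total


def kmeans_dispersion(points, k, rng, restarts=5, iters=25):
    """Smallest W_k found by a few restarts of a basic Lloyd k-means (assumed clusterer)."""
    best = math.inf
    for _ in range(restarts):
        centers = rng.sample(points, k)
        for _ in range(iters):
            labels = [min(range(k), key=lambda r: math.dist(p, centers[r]))
                      for p in points]
            for r in range(k):
                members = [p for p, lab in zip(points, labels) if lab == r]
                if members:
                    centers[r] = tuple(sum(c) / len(members) for c in zip(*members))
        best = min(best, within_dispersion(points, labels, k))
    return best


def gap_curves(points, k_max, n_ref=10, seed=0):
    """Return {k: (gap_log, s_log, gap_raw, s_raw)} for k = 1..k_max."""
    rng = random.Random(seed)
    bounds = [(min(c), max(c)) for c in zip(*points)]
    scale = math.sqrt(1 + 1 / n_ref)
    out = {}
    for k in range(1, k_max + 1):
        w = kmeans_dispersion(points, k, rng)
        ref = []
        for _ in range(n_ref):
            # Assumed null reference: uniform over the bounding box of the data.
            sample = [tuple(rng.uniform(lo, hi) for lo, hi in bounds) for _p in points]
            ref.append(kmeans_dispersion(sample, k, rng))
        logs = [math.log(v) for v in ref]
        mean_log, mean_raw = sum(logs) / n_ref, sum(ref) / n_ref
        sd_log = math.sqrt(sum((v - mean_log) ** 2 for v in logs) / n_ref)
        sd_raw = math.sqrt(sum((v - mean_raw) ** 2 for v in ref) / n_ref)
        # Original definition uses log(W_k); the suggested variant uses W_k directly.
        out[k] = (mean_log - math.log(w), sd_log * scale, mean_raw - w, sd_raw * scale)
    return out


def choose_k(gaps, errs):
    """Smallest k with gap(k) >= gap(k+1) - s(k+1); falls back to the largest k."""
    ks = sorted(gaps)
    for k, nxt in zip(ks, ks[1:]):
        if gaps[k] >= gaps[nxt] - errs[nxt]:
            return k
    return ks[-1]


if __name__ == "__main__":
    data_rng = random.Random(1)
    data = [(data_rng.gauss(cx, 0.5), data_rng.gauss(cy, 0.5))
            for cx, cy in [(0, 0), (10, 10)] for _ in range(30)]
    curves = gap_curves(data, k_max=4)
    k_log = choose_k({k: v[0] for k, v in curves.items()},
                     {k: v[1] for k, v in curves.items()})
    k_raw = choose_k({k: v[2] for k, v in curves.items()},
                     {k: v[3] for k, v in curves.items()})
    print("k (with log):", k_log, "| k (without log):", k_raw)
```

Running both selection rules on the same dispersion values makes it easy to see where the two definitions agree and where they differ; the toy data above is only meant to exercise the code, not to reproduce the report's experiments.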

Item Type: | Paper |
---|---|
Form of publication: | Submitted Version |
Keywords: | average linkage, Gap statistic, log function, number of clusters, within cluster dispersion |
Faculties: | Mathematics, Computer Science and Statistics > Statistics > Technical Reports; Mathematics, Computer Science and Statistics > Statistics > Chairs/Working Groups > Bioimaging |
Subjects: | 000 Computer science, information and general works > 000 Computer science, knowledge, and systems; 500 Science > 510 Mathematics; 600 Technology > 610 Medicine and health |
JEL Classification: | C38 |
URN: | urn:nbn:de:bvb:19-epub-11920-3 |
Language: | English |
Item ID: | 11920 |
Date Deposited: | 03. Dec 2010, 10:42 |
Last Modified: | 04. Nov 2020, 12:52 |
