Mohajer, Mojgan and Englmeier, Karl-Hans and Schmid, Volker J. (1. December 2010): A comparison of Gap statistic definitions with and without logarithm function. Department of Statistics: Technical Reports, No. 96

Abstract
The Gap statistic is a standard method for determining the number of clusters in a data set. It standardizes the graph of $\log(W_{k})$, where $W_{k}$ is the within-cluster dispersion for $k$ clusters, by comparing it to its expectation under an appropriate null reference distribution of the data. We suggest using $W_{k}$ instead of $\log(W_{k})$ and comparing it to the expectation of $W_{k}$ under a null reference distribution. In fact, whenever a number of clusters fulfills the original Gap statistic inequality, it also fulfills the inequality of the Gap statistic based on $W_{k}$, but not \textit{vice versa}. The two definitions of the Gap function are evaluated on several simulated data sets and on real DCE-MR image data.
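The two definitions compared in the abstract can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes the within-cluster dispersions have already been computed by some clustering routine, both for the observed data and for B reference data sets drawn from the null distribution. The log form and the selection inequality follow Tibshirani, Walther and Hastie (2001). Applying the same standard-error correction to the variant without the logarithm is an assumption made here for illustration, and the function names `gap_curves` and `first_k` as well as all numbers are hypothetical.

```python
import math
from statistics import mean, pstdev


def gap_curves(w_obs, w_ref):
    """Return (gap_log, s_log, gap_raw, s_raw), one entry per k.

    w_obs[k-1] is W_k for the observed data.
    w_ref[k-1] is the list of W_k values over B reference data sets.
    """
    gap_log, s_log, gap_raw, s_raw = [], [], [], []
    for w, refs in zip(w_obs, w_ref):
        b = len(refs)
        log_refs = [math.log(r) for r in refs]
        # Original definition: E*[log W_k] - log W_k
        gap_log.append(mean(log_refs) - math.log(w))
        # Variant without the logarithm: E*[W_k] - W_k
        gap_raw.append(mean(refs) - w)
        # Simulation-error correction sd_k * sqrt(1 + 1/B);
        # reusing it for the raw variant is an assumption.
        inflate = math.sqrt(1 + 1 / b)
        s_log.append(pstdev(log_refs) * inflate)
        s_raw.append(pstdev(refs) * inflate)
    return gap_log, s_log, gap_raw, s_raw


def first_k(gaps, s):
    """Smallest k with Gap(k) >= Gap(k+1) - s_{k+1}, or None."""
    for i in range(len(gaps) - 1):
        if gaps[i] >= gaps[i + 1] - s[i + 1]:
            return i + 1  # list index i corresponds to k = i + 1
    return None


# Hypothetical dispersions for k = 1..4 and B = 3 reference sets each.
w_obs = [120.0, 45.0, 40.0, 37.0]
w_ref = [
    [118.0, 121.0, 123.0],
    [90.0, 94.0, 92.0],
    [75.0, 78.0, 76.0],
    [66.0, 64.0, 68.0],
]
gap_log, s_log, gap_raw, s_raw = gap_curves(w_obs, w_ref)
print("k (log definition):", first_k(gap_log, s_log))
print("k (without log):   ", first_k(gap_raw, s_raw))
```

Keeping the dispersion computation outside these functions means the same comparison can be run with any clustering method, for example the average linkage named in the keywords.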
Item Type:  Paper (Technical Report) 

Status:  Submitted Version 
Keywords:  average linkage, Gap statistic, log function, number of clusters, within-cluster dispersion
Collections:  Mathematics, Computer Science and Statistics > Statistics > Technical Reports 
Subjects:  500 Science > 510 Mathematics 
JEL Classification:  C38 
URN:  urn:nbn:de:bvb:19epub119203 
Language:  English 
ID Code:  11920 
Deposited On:  03. Dec 2010 10:42 
Last Modified:  11. Feb 2015 19:49 