Оценка эффективности алгоритмов в задаче кластеризации биологических объектов

Vitalii Umanets; Bogdan Voinyk; Volodymyr Pavlov; Ievgen Nastenko

doi:10.20535/ibb.2018.2.2.133466

Authors

Vitalii Umanets Igor Sikorsky Kyiv Polytechnic Institute, Ukraine https://orcid.org/0000-0002-1682-6347
Bogdan Voinyk Igor Sikorsky Kyiv Polytechnic Institute, Ukraine https://orcid.org/0000-0002-6379-2725
Volodymyr Pavlov Igor Sikorsky Kyiv Polytechnic Institute, Ukraine https://orcid.org/0000-0002-3293-5308
Ievgen Nastenko Igor Sikorsky Kyiv Polytechnic Institute, Ukraine https://orcid.org/0000-0002-1076-9337

DOI:

https://doi.org/10.20535/ibb.2018.2.2.133466

Keywords:

Clustering, k-means, Biological object, Membership function, Estimation of a number of clusters, Fuzzy clustering

Abstract

Background. The task of determining the functional relationship between biophysical parameters is an integral part of the actual problem of finding the optimal impact on a biological object and is currently not completely resolved. One of the important tasks in this area is the partitioning of the original feature space into such areas (clusters) that relate to different functional relationships linking biophysical parameters and have, in general, an arbitrary shape. Such clusters in the future is logical to call functional. To obtain and analyze the functional clusters, there are a number of algorithms, each of which has its advantages and disadvantages. At the same time, the solution of a certain practical problem requires an evaluation of the efficiency of the algorithms in terms of the cluster separation adequacy.

Objective. In this paper, for a general example of the biological objects clustering problem (Fischer’s Iris Data Set), the efficiency of a typical clustering tools series is evaluated. The application of k-means classical algorithm, the Ward algorithm and developed in this work the fuzzy version of clustering for the k-means algorithm with a limited mass of the working area for the clusters’ formation was considered.

Methods. The algorithm includes a procedure for a priori estimation of the clusters quantity. The estimation is carried out according to the frequency histogram. To determine the optimal number of the histogram columns, the application of the Scott formula is justified. The algorithm allows forming clusters of arbitrary configuration with obtaining the value of the object's membership measure for each of the clusters. The comparative testing of the above algorithms was carried out on Fisher’s Iris Data Set.

Results. The best value of F₁-score is obtained for the algorithm proposed in this paper: F₁ = 0.92, the value F₁ = 0.90 is obtained for the Ward method and the value F₁ = 0.88 – for the classical k-means algorithm.

Conclusions. The obtained test results on the analysis problem of arbitrary-shaped clusters made it possible to give preference to the version of fuzzy k-means with a limited mass of the working area for the clusters’ formation. The calculating of the membership measure value allows us to obtain additional information on the structure of cluster formations, as well as to correct the result of clustering of k-means with a limited mass, which is especially important since the formation of clusters occurs in a single pass. Comparing the computational resources required for computing algorithms with relatively close test results also makes it possible to give preference to the developed algorithm. Compared with the Ward algorithm, it requires fewer computing resources since no additional memory is needed to store the distance matrix and no time is required to recalculate it.

References

Xiao Y, Yu J. Partitive clustering (K-means family). Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery. 2012;2(3):209-25. DOI: 10.1002/widm.1049

Nastenko Y. The use of cluster analysis for partitioning mixtures of multidimensional functional characteristics of complex biomedical systems. J Automat Inform Sci. 1996;28(5-6):77-83. DOI: 10.1615/jautomatinfscien.v28.i5-6.100

Boldak AA, Suharev DL. Determining number of clusters in statistical data. Visnyk NTUU KPI. Informatika, Upravlinnia ta Obchislyuvalna Tehnika. 2011;54(2):118-22.

Shannon C. A mathematical theory of communication. Bell System Tech J. 1948;27(4):379-423, 623-656. DOI: 10.1002/j.1538-7305.1948.tb00917.x

Kullback S, Leibler R. On information and sufficiency. Annals Math Stat. 1951;22(1):79-86. DOI: 10.1214/aoms/1177729694

Scott D. On optimal and data-based histograms. Biometrika. 1979;66(3):605.

Bezdek JC. Pattern recognition with fuzzy objective function algoritms. New York: Plenum Press; 1981. DOI: 10.1007/978-1-4757-0450-1

Fisher R. UCI Machine learning repository: Iris data set [Internet]. Archive.ics.uci.edu. [cited 17 February 2018]. Available from: http://archive.ics.uci.edu/ml/datasets/Iris

Asch V. Macro- and micro-averaged evaluation measures [Internet]. Clips.uantwerpen.be. 2012 [cited 13 April 2018]. Available from: https://www.clips.uantwerpen.be/~vincent/pdf/microaverage.pdf

Estimation of Algorithms Efficiency in the Task of Biological Objects Clustering

Authors

DOI:

Keywords:

Abstract

References

Downloads

Published

How to Cite

Issue

Section

License

Current Issue