Estimation of Algorithms Efficiency in the Task of Biological Objects Clustering
DOI:
https://doi.org/10.20535/ibb.2018.2.2.133466Keywords:
Clustering, k-means, Biological object, Membership function, Estimation of a number of clusters, Fuzzy clusteringAbstract
Background. The task of determining the functional relationship between biophysical parameters is an integral part of the actual problem of finding the optimal impact on a biological object and is currently not completely resolved. One of the important tasks in this area is the partitioning of the original feature space into such areas (clusters) that relate to different functional relationships linking biophysical parameters and have, in general, an arbitrary shape. Such clusters in the future is logical to call functional. To obtain and analyze the functional clusters, there are a number of algorithms, each of which has its advantages and disadvantages. At the same time, the solution of a certain practical problem requires an evaluation of the efficiency of the algorithms in terms of the cluster separation adequacy.
Objective. In this paper, for a general example of the biological objects clustering problem (Fischer’s Iris Data Set), the efficiency of a typical clustering tools series is evaluated. The application of k-means classical algorithm, the Ward algorithm and developed in this work the fuzzy version of clustering for the k-means algorithm with a limited mass of the working area for the clusters’ formation was considered.
Methods. The algorithm includes a procedure for a priori estimation of the clusters quantity. The estimation is carried out according to the frequency histogram. To determine the optimal number of the histogram columns, the application of the Scott formula is justified. The algorithm allows forming clusters of arbitrary configuration with obtaining the value of the object's membership measure for each of the clusters. The comparative testing of the above algorithms was carried out on Fisher’s Iris Data Set.
Results. The best value of F1-score is obtained for the algorithm proposed in this paper: F1 = 0.92, the value F1 = 0.90 is obtained for the Ward method and the value F1 = 0.88 – for the classical k-means algorithm.
Conclusions. The obtained test results on the analysis problem of arbitrary-shaped clusters made it possible to give preference to the version of fuzzy k-means with a limited mass of the working area for the clusters’ formation. The calculating of the membership measure value allows us to obtain additional information on the structure of cluster formations, as well as to correct the result of clustering of k-means with a limited mass, which is especially important since the formation of clusters occurs in a single pass. Comparing the computational resources required for computing algorithms with relatively close test results also makes it possible to give preference to the developed algorithm. Compared with the Ward algorithm, it requires fewer computing resources since no additional memory is needed to store the distance matrix and no time is required to recalculate it.References
Xiao Y, Yu J. Partitive clustering (K-means family). Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery. 2012;2(3):209-25. DOI: 10.1002/widm.1049
Nastenko Y. The use of cluster analysis for partitioning mixtures of multidimensional functional characteristics of complex biomedical systems. J Automat Inform Sci. 1996;28(5-6):77-83. DOI: 10.1615/jautomatinfscien.v28.i5-6.100
Boldak AA, Suharev DL. Determining number of clusters in statistical data. Visnyk NTUU KPI. Informatika, Upravlinnia ta Obchislyuvalna Tehnika. 2011;54(2):118-22.
Shannon C. A mathematical theory of communication. Bell System Tech J. 1948;27(4):379-423, 623-656. DOI: 10.1002/j.1538-7305.1948.tb00917.x
Kullback S, Leibler R. On information and sufficiency. Annals Math Stat. 1951;22(1):79-86. DOI: 10.1214/aoms/1177729694
Scott D. On optimal and data-based histograms. Biometrika. 1979;66(3):605.
Bezdek JC. Pattern recognition with fuzzy objective function algoritms. New York: Plenum Press; 1981. DOI: 10.1007/978-1-4757-0450-1
Fisher R. UCI Machine learning repository: Iris data set [Internet]. Archive.ics.uci.edu. [cited 17 February 2018]. Available from: http://archive.ics.uci.edu/ml/datasets/Iris
Asch V. Macro- and micro-averaged evaluation measures [Internet]. Clips.uantwerpen.be. 2012 [cited 13 April 2018]. Available from: https://www.clips.uantwerpen.be/~vincent/pdf/microaverage.pdf
Downloads
Published
How to Cite
Issue
Section
License
Copyright (c) 2018 The Authors
This work is licensed under a Creative Commons Attribution 4.0 International License.
The ownership of copyright remains with the Authors.
Authors may use their own material in other publications provided that the Journal is acknowledged as the original place of publication and National Technical University of Ukraine “Igor Sikorsky Kyiv Polytechnic Institute” as the Publisher.
Authors are reminded that it is their responsibility to comply with copyright laws. It is essential to ensure that no part of the text or illustrations have appeared or are due to appear in other publications, without prior permission from the copyright holder.
IBB articles are published under Creative Commons licence:- Authors retain copyright and grant the journal right of first publication with the work simultaneously licensed under CC BY 4.0 that allows others to share the work with an acknowledgement of the work's authorship and initial publication in this journal.
- Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the journal's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgement of its initial publication in this journal.
- Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) prior to and during the submission process, as it can lead to productive exchanges, as well as earlier and greater citation of published work.