EFFICIENCY ASSESSMENT OF EUCLIDEAN AND MAHALANOBIS DISTANCES FOR SOLVING A MAJOR TEXT CLASSIFICATION PROBLEM
https://doi.org/10.21822/2073-6185-2017-44-1-86-93
Abstract
Objectives. The aim is to compare the efficiency of the Euclidean and Mahalanobis metrics in determining the category of potential text recipients. The task is relevant because a means of identifying the recipients of electronic documents is needed. This task has been complicated by the introduction of age restrictions on the content of Internet webpages and text resources. Moreover, the issue has received little coverage in the works of Russian researchers.
Method. The relative efficiencies of the Euclidean and Mahalanobis distances were compared as part of the implementation of an intelligent system for automatic text classification based on the age category of the recipients.
Results. The main approaches to establishing proximity measures for objects represented as sets of classification characteristics are discussed, and the choice of the Euclidean and Mahalanobis metrics for numerical comparison of classification results is justified. The sample texts and the characteristics of the category designations used in the computational experiment are described. The experiment was carried out using texts included in the National Corpus of the Russian language.
Conclusion. The computational experiment allows the most effective method for determining the age category of potential text recipients to be selected. The results showed that both the Euclidean and Mahalanobis metrics can be used to solve text classification problems. They also confirmed that the Mahalanobis metric is preferable for estimating distances between objects represented by correlated features.
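The difference between the two metrics on correlated features can be illustrated with a minimal sketch. It is not the classification system described in the abstract: the class centre, the covariance matrix and the two test points are hypothetical values chosen for illustration, and the function names are assumptions. The Euclidean distance is sqrt((x - m)^T (x - m)), while the Mahalanobis distance is sqrt((x - m)^T S^-1 (x - m)), where m is the class centre and S is the class covariance matrix.

```python
import numpy as np


def euclidean_distance(x, center):
    """Straight-line distance; ignores how the features co-vary."""
    diff = np.asarray(x, dtype=float) - np.asarray(center, dtype=float)
    return float(np.sqrt(diff @ diff))


def mahalanobis_distance(x, center, cov):
    """Distance scaled by the class covariance matrix `cov`."""
    diff = np.asarray(x, dtype=float) - np.asarray(center, dtype=float)
    # Solve cov @ z = diff rather than forming the inverse explicitly.
    z = np.linalg.solve(np.asarray(cov, dtype=float), diff)
    return float(np.sqrt(diff @ z))


# Hypothetical class described by two strongly correlated features.
center = np.array([0.0, 0.0])
cov = np.array([[1.0, 0.9],
                [0.9, 1.0]])

along = np.array([1.0, 1.0])    # follows the correlation direction
across = np.array([1.0, -1.0])  # runs against the correlation direction

d_e_along = euclidean_distance(along, center)
d_e_across = euclidean_distance(across, center)
d_m_along = mahalanobis_distance(along, center, cov)
d_m_across = mahalanobis_distance(across, center, cov)

print(f"Euclidean:   along={d_e_along:.3f}, across={d_e_across:.3f}")
print(f"Mahalanobis: along={d_m_along:.3f}, across={d_m_across:.3f}")
```

In this toy setting both points are equally far from the centre in the Euclidean sense, whereas the Mahalanobis distance treats the point that runs against the correlation as much farther away. This is the kind of behaviour that can make the Mahalanobis metric preferable when classification features are correlated.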
About the Author
Anna V. Glazkova
Russian Federation
Assistant, department of software
15a Perekopskaya Str., Tyumen 625003
For citations:
Glazkova A.V. EFFICIENCY ASSESSMENT OF EUCLIDEAN AND MAKHALANOBIS DISTANCES FOR SOLVING A MAJOR TEXT CLASSIFICATION PROBLEM. Herald of Dagestan State Technical University. Technical Sciences. 2017;44(1):86-93. (In Russ.) https://doi.org/10.21822/2073-6185-2017-44-1-86-93