Mixed Data Type Analysis: A Systematic Literature Review

Hasta Pratama; Fetty Fitriyanti Lubis; Jaka Sembiring

doi:10.36080/idealis.v7i2.3168

Hasta Pratama, Institut Teknologi Bandung
Fetty Fitriyanti Lubis, Doctoral Program in Electrical Engineering and Informatics, STEI ITB
Jaka Sembiring, Doctoral Program in Electrical Engineering and Informatics, STEI ITB

Keywords: data analysis, mixed data analysis, data mining, systematic literature survey, PICOC

Abstract

This research aims to determine the direction of research in the analysis of mixed data types. Data is increasingly diverse, especially in terms of data types: a dataset may be not only numerical or categorical but both (mixed). In data mining, the analysis of mixed data poses significant challenges because numerical and categorical data exhibit different properties. This study uses the PICOC framework (Population, Intervention, Comparison, Outcome, Context) to collect and review relevant literature. The primary findings of the literature survey are that the majority of research on mixed data is published in reputable Q1 journals, indicating sustained interest in the topic. Clustering models are the most frequently used models in mixed data analysis. However, accuracy metrics remain the predominant evaluation benchmark, so results are often compared against ideally clustered data. Mixed data is typically handled through normalization, specifically by normalizing the scale so that the two types of data can be combined. The review concludes that approaches for unlabeled mixed data still need to be developed, covering both the models and the metrics required to assess the outcomes. It also emphasizes the importance of comprehensive model development, ranging from feature selection to evaluation. The analysis of mixed data types therefore remains a field with ample opportunities for exploration and innovation, particularly in dynamic model development and in handling structured and large-scale data.
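As a rough illustration of the normalization idea mentioned in the abstract, the sketch below rescales numerical features to [0, 1] and combines them with a simple mismatch score for categorical features into one distance. This is not a method taken from the reviewed papers: the function names `min_max_scale` and `mixed_distance`, the toy records, and the equal weighting of features are all assumptions made for illustration.

```python
def min_max_scale(column):
    """Rescale a list of numbers to [0, 1]; a constant column maps to 0.0."""
    lo, hi = min(column), max(column)
    span = hi - lo
    if span == 0:
        return [0.0 for _ in column]
    return [(v - lo) / span for v in column]


def mixed_distance(a, b, numeric_idx):
    """Average per-feature distance between two records.

    Numerical features (already scaled to [0, 1]) contribute |x - y|;
    categorical features contribute 0 on a match and 1 on a mismatch.
    Equal feature weights are an assumption of this sketch.
    """
    total = 0.0
    for i, (x, y) in enumerate(zip(a, b)):
        if i in numeric_idx:
            total += abs(x - y)
        else:
            total += 0.0 if x == y else 1.0
    return total / len(a)


# Hypothetical toy records: (age, income, city)
records = [
    (25, 30000.0, "Bandung"),
    (35, 50000.0, "Jakarta"),
    (45, 70000.0, "Bandung"),
]
numeric_idx = {0, 1}

# Scale each numerical column, leave categorical columns untouched.
columns = list(zip(*records))
scaled_cols = [
    min_max_scale(list(col)) if i in numeric_idx else list(col)
    for i, col in enumerate(columns)
]
scaled = list(zip(*scaled_cols))

d01 = mixed_distance(scaled[0], scaled[1], numeric_idx)
d02 = mixed_distance(scaled[0], scaled[2], numeric_idx)
print(d01, d02)
```

Without the scaling step, the income column would dominate the distance simply because of its magnitude; after scaling, each feature contributes a value between 0 and 1, which is what allows the two data types to be combined in a single measure.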
